---
name: senior-data-scientist
description:
license: MIT + Commons Clause
metadata:
  version: 1.0.0
  author: borghei
  category: engineering
  domain: data-science
  updated: 2026-03-31
  tags: [data-science, ml, statistics, experimentation, python, mlops]
---
Senior Data Scientist
Expert data science for statistical modeling, experimentation, ML deployment, and data-driven decision making.
Keywords
data-science, machine-learning, statistics, a-b-testing, causal-inference, feature-engineering, mlops, experiment-design, model-deployment, python, scikit-learn, pytorch, tensorflow, spark, airflow
Quick Start
# Design an experiment with power analysis
python scripts/experiment_designer.py --input data/ --output results/
# Run feature engineering pipeline
python scripts/feature_engineering_pipeline.py --input data/ --output features/
# Evaluate model performance
python scripts/model_evaluation_suite.py --input models/ --output evaluation/ --config config.yaml
# Statistical analysis
python scripts/statistical_analyzer.py --data input.csv --test ttest --output report.json
Tools
| Script | Purpose |
|---|---|
| scripts/experiment_designer.py | A/B test design, power analysis, sample size calculation |
| scripts/feature_engineering_pipeline.py | Automated feature generation, correlation analysis, feature selection |
| scripts/statistical_analyzer.py | Hypothesis testing, causal inference, regression analysis |
| scripts/model_evaluation_suite.py | Model comparison, cross-validation, deployment readiness checks |
Tech Stack
| Category | Tools |
|---|---|
| Languages | Python, SQL, R, Scala |
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| Data Processing | Spark, Airflow, dbt, Kafka, Databricks |
| Deployment | Docker, Kubernetes, AWS SageMaker, GCP Vertex AI |
| Experiment Tracking | MLflow, Weights & Biases |
| Databases | PostgreSQL, BigQuery, Snowflake, Pinecone |
Workflow 1: Design and Analyze an A/B Test
- Define hypothesis -- State the null and alternative hypotheses. Identify the primary metric (e.g., conversion rate, revenue per user).
- Calculate sample size --
  python scripts/experiment_designer.py --input data/ --output results/
  - Specify minimum detectable effect (MDE), significance level (alpha=0.05), and power (0.80). See the power-analysis sketch after this workflow.
  - Example: with a 5% baseline conversion rate and a 10% relative MDE, you need ~31,000 users per variant.
- Randomize assignment -- Use hash-based assignment on user ID for deterministic, reproducible splits (see the assignment and SRM sketch after this workflow).
- Run experiment -- Monitor for sample ratio mismatch (SRM) daily. Flag if observed ratio deviates >1% from expected.
- Analyze results:
  ```python
  # Two-proportion z-test for conversion rates.
  # Note: proportions_ztest lives in statsmodels, not scipy.
  from statsmodels.stats.proportion import proportions_ztest

  control_conv = control_successes / control_total
  treatment_conv = treatment_successes / treatment_total

  z_stat, p_value = proportions_ztest(
      [treatment_successes, control_successes],
      [treatment_total, control_total],
      alternative='two-sided'
  )
  # Reject H0 if p_value < 0.05
  ```
- Validate -- Check for novelty effects, Simpson's paradox across segments, and pre-experiment balance on covariates.
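The sample-size step above can be reproduced with a short power calculation. This is a minimal sketch assuming statsmodels is installed; the 5% baseline and 10% relative MDE are the illustrative numbers from the example, not defaults of experiment_designer.py.

```python
# Minimal sketch: per-variant sample size for a two-proportion test (statsmodels).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                      # baseline conversion rate (illustrative)
treatment = baseline * 1.10          # 10% relative MDE (illustrative)

# Cohen's h effect size for comparing two proportions.
effect_size = proportion_effectsize(treatment, baseline)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    alternative="two-sided",
)
print(f"Users needed per variant: {n_per_variant:,.0f}")  # ~31,000 for these inputs
```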
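For the randomization and SRM-monitoring steps, here is a hedged sketch of deterministic hash-based assignment plus a chi-square SRM check; the salt string, 50/50 split, and 0.001 alert threshold are assumptions for illustration.

```python
# Minimal sketch: hash-based variant assignment and a sample ratio mismatch (SRM) check.
import hashlib
from scipy import stats

def assign_variant(user_id: str, salt: str = "exp-2026-checkout") -> str:
    """Deterministically bucket a user; the same user always lands in the same variant."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 100 < 50 else "control"  # assumed 50/50 split

def srm_p_value(control_n: int, treatment_n: int, expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value against the expected traffic split."""
    total = control_n + treatment_n
    expected = [total * (1 - expected_treatment_share), total * expected_treatment_share]
    _, p_value = stats.chisquare([control_n, treatment_n], f_exp=expected)
    return p_value  # a very small p-value (e.g. < 0.001) suggests SRM and a broken assignment
```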
Workflow 2: Build a Feature Engineering Pipeline
- Profile raw data --
  python scripts/feature_engineering_pipeline.py --input data/ --output features/
  - Identify null rates, cardinality, distributions, and data types.
- Generate candidate features:
- Temporal: day-of-week, hour, recency, frequency, monetary (RFM)
- Aggregation: rolling means/sums over 7d/30d/90d windows
- Interaction: ratio features, polynomial combinations
- Text: TF-IDF, embedding vectors
- Select features -- Remove features with >95% null rate, near-zero variance, or >0.95 pairwise correlation. Use recursive feature elimination or SHAP importance (see the feature sketch after this workflow).
- Validate -- Confirm no target leakage (no features derived from post-outcome data). Check train/test distribution alignment.
- Register -- Store features in feature store with versioning and lineage metadata.
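A minimal pandas sketch of the generation and selection steps above: a trailing 30-day RFM-style aggregate and correlation-based pruning. The column names user_id, event_ts, and amount are illustrative assumptions, not a required schema.

```python
# Minimal sketch: trailing-window RFM aggregates and pruning of highly correlated features.
import numpy as np
import pandas as pd

def build_user_features(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """RFM-style aggregates per user over a trailing 30-day window."""
    window = events[events["event_ts"].between(as_of - pd.Timedelta(days=30), as_of)]
    return window.groupby("user_id").agg(
        frequency_30d=("event_ts", "count"),
        monetary_30d=("amount", "sum"),
        recency_days=("event_ts", lambda ts: (as_of - ts.max()).days),
    )

def drop_correlated(features: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute Pearson correlation exceeds the threshold."""
    corr = features.corr(numeric_only=True).abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)
```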
Workflow 3: Train and Evaluate a Model
- Split data -- Stratified train/validation/test split (70/15/15). For time series, use temporal split (no future leakage).
- Train baseline -- Start with a simple model (logistic regression, gradient boosted trees) to establish a benchmark.
- Tune hyperparameters -- Use Optuna or cross-validated grid search. Log all runs to MLflow (see the tuning sketch after this workflow).
- Evaluate on held-out test set:
  ```python
  from sklearn.metrics import classification_report, roc_auc_score

  y_pred = model.predict(X_test)
  y_prob = model.predict_proba(X_test)[:, 1]

  print(classification_report(y_test, y_pred))
  print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
  ```
- Validate -- Check calibration (predicted probabilities match observed rates; see the calibration sketch after this workflow). Evaluate fairness metrics across protected groups. Confirm no overfitting (train vs. test gap <5%).
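A hedged sketch of the tuning step, assuming Optuna and MLflow are available and a gradient-boosted baseline; the search space, trial count, and the X_train/y_train names are illustrative, not part of the scripts in this skill.

```python
# Minimal sketch: Optuna search with each trial logged to MLflow.
import mlflow
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial: optuna.Trial) -> float:
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000, step=100),
    }
    model = GradientBoostingClassifier(**params)
    cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        mlflow.log_metric("cv_auc", cv_auc)
    return cv_auc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Best params:", study.best_params)
```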
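And a small sketch of the calibration check referenced in the Validate step, reusing y_test and y_prob from the evaluation code above; the quantile binning and ten-bin choice are assumptions.

```python
# Minimal sketch: compare predicted probabilities to observed rates in ten quantile bins.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10, strategy="quantile")
max_gap = np.max(np.abs(prob_true - prob_pred))

print(f"Brier score: {brier_score_loss(y_test, y_prob):.4f}")   # success criterion: < 0.15
print(f"Max calibration gap across bins: {max_gap:.3f}")        # success criterion: < 0.05
```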
Workflow 4: Deploy a Model to Production
- Containerize -- Package model with inference dependencies in Docker:
  docker build -t model-service:v1 .
- Set up serving -- Deploy behind a REST API with health check, input validation, and structured error responses.
- Configure monitoring (see the drift-check sketch after this workflow):
- Input drift: compare incoming feature distributions to training baseline (KS test, PSI)
- Output drift: monitor prediction distribution shifts
- Performance: track latency P50/P95/P99 targets (<50ms / <100ms / <200ms)
- Enable canary deployment -- Route 5% traffic to new model, compare metrics against baseline for 24-48 hours.
- Validate --
  python scripts/model_evaluation_suite.py --input models/ --output evaluation/ --config config.yaml
  This confirms serving latency targets, an error rate <0.1%, and that model outputs match the offline evaluation.
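A hedged sketch of the input-drift checks in the monitoring step: a two-sample KS test plus a simple population stability index (PSI) per continuous numeric feature. The PSI implementation and the 0.2 alert threshold are common conventions, not features of model_evaluation_suite.py.

```python
# Minimal sketch: per-feature drift checks against the training baseline.
import numpy as np
from scipy import stats

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population stability index using quantile bins derived from the training baseline."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch values outside the training range
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)     # avoid log(0) for empty bins
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

def drift_report(baseline: np.ndarray, current: np.ndarray) -> dict:
    """KS test and PSI; PSI > 0.2 or a very small KS p-value usually warrants investigation."""
    ks_stat, ks_p = stats.ks_2samp(baseline, current)
    return {"ks_stat": ks_stat, "ks_p_value": ks_p, "psi": psi(baseline, current)}
```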
Workflow 5: Perform Causal Inference
- Assess assignment mechanism -- Determine if treatment was randomized (use experiment analysis) or observational (use causal methods below).
- Select method based on data structure:
- Propensity Score Matching: when treatment is binary, many covariates available
- Difference-in-Differences: when pre/post data are available for treatment and control groups (see the DiD sketch after this workflow)
- Regression Discontinuity: when treatment assigned by threshold on running variable
- Instrumental Variables: when unobserved confounding present but valid instrument exists
- Check assumptions -- Parallel trends (DiD), overlap/positivity (PSM), continuity (RDD).
- Estimate treatment effect and compute confidence intervals.
- Validate -- Run placebo tests (apply method to pre-treatment period, expect null effect). Sensitivity analysis for unobserved confounding.
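For the difference-in-differences case, a minimal sketch using the statsmodels formula API; the panel DataFrame and its column names (outcome, treated, post, unit_id) are illustrative assumptions.

```python
# Minimal sketch: two-period difference-in-differences as an interaction term in OLS.
import statsmodels.formula.api as smf

# panel: one row per unit-period with columns outcome, treated (0/1), post (0/1), unit_id.
did_model = smf.ols("outcome ~ treated * post", data=panel)
result = did_model.fit(cov_type="cluster", cov_kwds={"groups": panel["unit_id"]})

# The treated:post coefficient is the DiD estimate of the average treatment effect on the treated.
print(result.params["treated:post"])
print(result.conf_int().loc["treated:post"])
```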
Performance Targets
| Metric | Target |
|---|---|
| P50 latency | < 50ms |
| P95 latency | < 100ms |
| P99 latency | < 200ms |
| Throughput | > 1,000 req/s |
| Availability | 99.9% |
| Error rate | < 0.1% |
Common Commands
# Development
python -m pytest tests/ -v --cov
python -m black src/
python -m pylint src/
# Training
python scripts/train.py --config prod.yaml
python scripts/evaluate.py --model best.pth
# Deployment
docker build -t service:v1 .
kubectl apply -f k8s/
helm upgrade service ./charts/
# Monitoring
kubectl logs -f deployment/service
python scripts/health_check.py
Reference Documentation
| Document | Path |
|---|---|
| Statistical Methods | references/statistical_methods_advanced.md |
| Experiment Design Frameworks | references/experiment_design_frameworks.md |
| Feature Engineering Patterns | references/feature_engineering_patterns.md |
| Automation Scripts | scripts/ directory |
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Sample size calculation returns unreasonably large numbers | Minimum detectable effect (MDE) is set too small relative to baseline variance | Increase MDE to a practically meaningful threshold or accept longer experiment duration |
| Feature pipeline reports high null rates across all generated features | Source data contains upstream ingestion gaps or schema drift | Validate raw data completeness before running the pipeline; check ETL logs for failed loads |
| Model AUC drops significantly on validation vs. training set | Overfitting due to high-cardinality features or insufficient regularization | Apply stronger regularization, reduce feature set, or increase training data volume |
| Experiment shows significant results but large confidence intervals | Insufficient sample size or high metric variance | Extend experiment runtime, increase traffic allocation, or switch to a variance-reduction technique (CUPED) |
| Deployed model latency exceeds P95 targets | Model complexity too high for serving infrastructure or missing batching | Quantize the model, reduce input feature count, or enable request batching on the serving layer |
| Feature importance scores are unstable across cross-validation folds | Correlated features cause importance to shift between redundant predictors | Remove highly correlated features (>0.95) before training or use permutation importance with repeated runs |
| Causal inference estimates show implausible treatment effects | Violation of parallel trends assumption (DiD) or poor covariate overlap (PSM) | Run diagnostic tests (placebo checks, overlap histograms) and consider alternative identification strategies |
Success Criteria
- Model discrimination: AUC-ROC above 0.85 on held-out test set for classification tasks
- Calibration quality: Brier score below 0.15; predicted probabilities within 5% of observed rates across decile bins
- Feature coverage: Feature importance analysis accounts for at least 90% of cumulative model importance
- Experiment power: All A/B tests designed with statistical power of 0.80 or higher at the specified MDE
- Deployment readiness: Model serving latency under 100ms at P95; error rate below 0.1%
- Reproducibility: All experiments and training runs logged with full parameter tracking; results reproducible from logged artifacts
- Data quality: Input feature pipelines maintain less than 2% null rate on critical features after imputation
Scope & Limitations
This skill covers:
- End-to-end experiment design including power analysis, randomization, and post-hoc analysis
- Feature engineering pipelines with profiling, generation, selection, and validation
- Model training evaluation including cross-validation, calibration, and fairness checks
- Production model deployment with monitoring, drift detection, and canary rollouts
This skill does NOT cover:
- Data engineering infrastructure (ETL orchestration, pipeline scheduling, data lake management) -- see senior-data-engineer
- Deep learning model architecture design and training at scale (distributed GPU training, custom layers) -- see senior-ml-engineer
- Prompt engineering, RAG systems, and LLM fine-tuning workflows -- see senior-prompt-engineer
- Computer vision pipelines (object detection, segmentation, video processing) -- see senior-computer-vision
Integration Points
| Skill | Integration | Data Flow |
|---|---|---|
| senior-data-engineer | Feature pipeline ingests data from ETL outputs; shares data quality validation patterns | Raw data stores --> feature engineering pipeline --> feature store |
| senior-ml-engineer | Trained models handed off for MLOps deployment; shares model registry and serving configs | Evaluated model artifacts --> deployment pipeline --> production serving |
| senior-prompt-engineer | Embedding features from LLMs feed into ML pipelines; experiment frameworks apply to prompt A/B tests | LLM embeddings --> feature vectors; experiment designs --> prompt evaluation |
| senior-architect | Model serving architecture reviewed for scalability; data platform design aligned with training infrastructure | Architecture specs --> deployment topology --> monitoring dashboards |
| senior-backend | Model inference endpoints integrated into backend services; API contracts defined for prediction requests | REST/gRPC model API --> backend service layer --> client applications |
| senior-devops | CI/CD pipelines extended for model retraining triggers; containerized model images deployed via infrastructure-as-code | Docker images --> Kubernetes manifests --> production clusters |
Tool Reference
experiment_designer.py
Purpose: A/B test design, statistical power analysis, and sample size calculation. Validates experiment configuration and produces structured results with timestamps.
Usage:
python scripts/experiment_designer.py --input data/ --output results/
Flags/Parameters:
| Flag | Short | Required | Description |
|---|---|---|---|
| --input | -i | Yes | Input path (directory or file containing experiment data) |
| --output | -o | Yes | Output path (directory for results) |
| --config | -c | No | Path to configuration file (YAML or JSON) |
| --verbose | -v | No | Enable verbose (DEBUG-level) logging output |
Example:
python scripts/experiment_designer.py -i data/experiment_config/ -o results/power_analysis/ -c config.yaml -v
Output Format: JSON to stdout with the following structure:
{
"status": "completed",
"start_time": "2026-03-21T10:00:00.000000",
"processed_items": 0,
"end_time": "2026-03-21T10:00:01.000000"
}
feature_engineering_pipeline.py
Purpose: Automated feature generation, correlation analysis, and feature selection. Profiles raw data, generates candidate features, and validates for target leakage.
Usage:
python scripts/feature_engineering_pipeline.py --input data/ --output features/
Flags/Parameters:
| Flag | Short | Required | Description |
|---|---|---|---|
| --input | -i | Yes | Input path (directory or file containing raw data) |
| --output | -o | Yes | Output path (directory for generated features) |
| --config | -c | No | Path to configuration file (YAML or JSON) |
| --verbose | -v | No | Enable verbose (DEBUG-level) logging output |
Example:
python scripts/feature_engineering_pipeline.py -i data/raw/ -o features/v2/ -v
Output Format: JSON to stdout with the following structure:
{
"status": "completed",
"start_time": "2026-03-21T10:00:00.000000",
"processed_items": 0,
"end_time": "2026-03-21T10:00:01.000000"
}
model_evaluation_suite.py
Purpose: Model comparison, cross-validation, and deployment readiness checks. Validates serving latency, error rates, and confirms model outputs match offline evaluation.
Usage:
python scripts/model_evaluation_suite.py --input models/ --output evaluation/
Flags/Parameters:
| Flag | Short | Required | Description |
|---|---|---|---|
| --input | -i | Yes | Input path (directory or file containing model artifacts) |
| --output | -o | Yes | Output path (directory for evaluation results) |
| --config | -c | No | Path to configuration file (YAML or JSON) |
| --verbose | -v | No | Enable verbose (DEBUG-level) logging output |
Example:
python scripts/model_evaluation_suite.py -i models/xgb_v3/ -o evaluation/report/ -c prod_config.yaml
Output Format: JSON to stdout with the following structure:
{
"status": "completed",
"start_time": "2026-03-21T10:00:00.000000",
"processed_items": 0,
"end_time": "2026-03-21T10:00:01.000000"
}
Note: The Tools table references scripts/statistical_analyzer.py, but this script does not yet exist in the repository. The statistical analysis workflows described in this SKILL.md can be performed with inline Python (scipy, statsmodels), as shown in Workflow 1 and Workflow 5.