# MLOps Production Checklist
A practical checklist for taking ML models from experiment to production. Based on lessons from real deployments.
## Model Development
- Experiment tracking configured (MLflow, W&B, or Neptune)
- Reproducible training: seeds fixed, environment pinned (`requirements.txt` or Docker)
- Offline evaluation on a held-out test set before any deployment
- Baseline established (rule-based or previous model version)
- Model card documented (training data, intended use, limitations)
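The seed-fixing item above can be sketched as follows (a minimal example; the exact calls depend on your framework, and the PyTorch lines are shown only as comments):

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Fix the common sources of randomness for a reproducible run."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # If using PyTorch, also:
    # torch.manual_seed(seed)
    # torch.backends.cudnn.deterministic = True

set_seed(42)
```

Call it once at the top of every training entry point, and record the seed alongside the run in your experiment tracker.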
## Data Pipeline
- Data validation at ingestion (schema checks, null rates, distribution drift)
- Feature store or consistent feature engineering between train and serve
- Training/serving skew identified and mitigated
- Data versioning (DVC, Delta Lake, or equivalent)
```python
# Example: Great Expectations data validation
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("data.csv")
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
results = validator.validate()
```
## Serving Infrastructure
- Model serialized in a portable format (ONNX, TorchScript, SavedModel)
- API contract defined and versioned (`/v1/predict`)
- Health check endpoints (`/health`, `/ready`)
- Graceful shutdown handling
- Request/response schema validated (Pydantic)
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# `model` is assumed to be loaded once at application startup, not per request.

class PredictRequest(BaseModel):
    features: list[float]

class PredictResponse(BaseModel):
    prediction: float
    confidence: float

@app.post("/v1/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    score = model.predict([req.features])[0]
    # Hard-coded confidence is a placeholder; use a calibrated score in production.
    return PredictResponse(prediction=score, confidence=0.95)
```
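The graceful-shutdown checklist item can be sketched framework-agnostically with stdlib signal handling (serving frameworks like uvicorn wire this up for you, but the pattern is the same): on SIGTERM, stop accepting new work and let in-flight requests drain.

```python
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    """Mark the process as draining; the serving loop checks this flag."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def accept_request() -> bool:
    # Reject new work while draining so the load balancer retries elsewhere.
    return not shutting_down
```

Pair this with a `/ready` endpoint that starts returning failures during the drain, so orchestrators stop routing traffic before the process exits.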
## Monitoring

### What to Monitor
| Signal | Tool | Alert Threshold |
|---|---|---|
| Prediction latency (p99) | Prometheus + Grafana | > 500ms |
| Error rate | Prometheus + Grafana | > 1% |
| Input feature drift | Evidently, NannyML | PSI > 0.2 |
| Output distribution shift | Evidently | KL divergence > threshold |
| Business metric (downstream) | Custom | Defined per use case |
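The PSI threshold in the table can be computed in a few lines of NumPy (a common binned formulation; the bin count and the 0.2 cut-off are conventions, not laws, and tools like Evidently handle this for you):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the bin fractions to avoid log(0) and division by zero.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
drifted = rng.normal(0.8, 1, 10_000)  # shifted mean simulates feature drift
```

Run it per feature against a frozen training-time reference sample; a value above 0.2 is the usual trigger for investigation.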
### Logging
Log enough to reconstruct any prediction:
- Timestamp, request ID
- Input features (anonymized if needed)
- Model version
- Prediction and confidence
- Latency
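A structured log line covering the fields above can be emitted with the stdlib alone (a sketch; the field names are illustrative, and in practice you would route the JSON line through your logging stack rather than `print`):

```python
import json
import time
import uuid

def log_prediction(features, model_version, prediction, confidence, latency_ms):
    """Emit one JSON line per prediction so any result can be reconstructed."""
    record = {
        "ts": time.time(),
        "request_id": str(uuid.uuid4()),
        "features": features,  # anonymize or hash PII fields before logging
        "model_version": model_version,
        "prediction": prediction,
        "confidence": confidence,
        "latency_ms": latency_ms,
    }
    print(json.dumps(record))
    return record

log_prediction([0.1, 3.2], "v1.4.0", 0.87, 0.93, 42.0)
```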
## Deployment Strategy

### Shadow Mode
Run new model in parallel, log predictions, compare offline — zero user impact.
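The offline comparison can start very simply (a sketch assuming both models' predictions were logged per request, as in the logging section):

```python
import statistics

# Hypothetical paired prediction logs: (production_model, shadow_model) per request.
pairs = [(0.80, 0.78), (0.35, 0.41), (0.92, 0.90), (0.10, 0.15)]

diffs = [abs(a - b) for a, b in pairs]
mean_gap = statistics.mean(diffs)
max_gap = max(diffs)

# Promote the shadow model only if it tracks production closely here
# AND beats it on offline metrics once ground-truth labels arrive.
print(f"mean |diff| = {mean_gap:.3f}, max |diff| = {max_gap:.3f}")
```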
### Canary Release
Route 5–10% of traffic to new model, monitor metrics, gradually increase.
### A/B Testing
Split traffic intentionally, measure business metrics with statistical significance before full rollout.
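For a conversion-style metric, significance can be checked with a two-proportion z-test (stdlib-only sketch with made-up rollout numbers; for a real experiment, do a power analysis up front and use a proper stats library):

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Return the two-sided p-value for H0: the two conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical numbers: control arm vs. new-model arm.
p = two_proportion_z(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
```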
## CI/CD for ML
```yaml
# Example GitHub Actions stage
- name: Model evaluation gate
  run: |
    python evaluate.py --model artifacts/model.pkl --threshold 0.85
    # Fails pipeline if AUC < 0.85
```
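The `evaluate.py` referenced above could look like this (a hypothetical sketch: the `evaluate` body is a placeholder to replace with real scoring on your held-out set; the key point is the non-zero exit code that fails the CI job):

```python
"""Hypothetical evaluate.py: exit non-zero when the model misses the gate."""
import argparse
import sys

def evaluate(model_path: str) -> float:
    # Placeholder: load the model and score the held-out evaluation set,
    # e.g. sklearn.metrics.roc_auc_score(y_true, model.predict_proba(X)[:, 1]).
    return 0.87

def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--threshold", type=float, default=0.85)
    args = parser.parse_args(argv)
    auc = evaluate(args.model)
    print(f"AUC = {auc:.3f} (threshold {args.threshold})")
    # A non-zero exit fails the CI job, blocking promotion.
    return 0 if auc >= args.threshold else 1

if __name__ == "__main__":
    sys.exit(main())
```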
Key gates before promotion:
- Unit tests on feature engineering code
- Model performance above threshold on evaluation set
- Data validation passing on latest data slice
- Integration test against staging endpoint
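The feature-engineering unit-test gate can be as small as this (pytest-style sketch; `compute_features` is a hypothetical single function shared by the training and serving paths, which is also what prevents train/serve skew):

```python
def compute_features(raw: dict) -> list[float]:
    """Hypothetical shared feature function used by both train and serve paths."""
    age = float(raw.get("age", 0.0))
    spend = float(raw.get("spend", 0.0))
    return [age / 100.0, spend, float(spend > 0)]

def test_compute_features_handles_missing_fields():
    assert compute_features({}) == [0.0, 0.0, 0.0]

def test_compute_features_scales_age():
    assert compute_features({"age": 50, "spend": 2.5}) == [0.5, 2.5, 1.0]
```

Edge cases worth covering: missing fields, nulls, out-of-range values, and any categorical level unseen at training time.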
## Common Production Incidents
| Incident | Root Cause | Prevention |
|---|---|---|
| Silent model degradation | No output monitoring | Track prediction distribution daily |
| Training/serving skew | Different preprocessing code paths | Single feature computation function used in both |
| OOM in serving | Batch size too large | Load test before deployment |
| Slow cold start | Model loaded per request | Load model at startup, not per request |