# MLOps Production Checklist
A practical checklist for taking ML models from experiment to production. Based on lessons from real deployments.
## Model Development
- Experiment tracking configured (MLflow, W&B, or Neptune)
- Reproducible training: seeds fixed, environment pinned (`requirements.txt` or Docker)
- Offline evaluation on a held-out test set before any deployment
- Baseline established (rule-based or previous model version)
- Model card documented (training data, intended use, limitations)
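The seed-fixing item above can be sketched as follows (a minimal example; the exact calls depend on your framework, and the PyTorch lines are shown only as comments):

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Fix the common sources of randomness for a reproducible run."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # If using PyTorch, also:
    # torch.manual_seed(seed)
    # torch.backends.cudnn.deterministic = True

set_seed(42)
```

Call it once at the top of every training entry point, and record the seed alongside the run in your experiment tracker.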
## Data Pipeline
- Data validation at ingestion (schema checks, null rates, distribution drift)
- Feature store or consistent feature engineering between train and serve
- Training/serving skew identified and mitigated
- Data versioning (DVC, Delta Lake, or equivalent)
```python
# Example: Great Expectations data validation
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("data.csv")
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
results = validator.validate()
```
## Serving Infrastructure
- Model serialized in a portable format (ONNX, TorchScript, SavedModel)
- API contract defined and versioned (`/v1/predict`)
- Health check endpoints (`/health`, `/ready`)
- Graceful shutdown handling
- Request/response schema validated (Pydantic)
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# `model` is assumed to be loaded once at application startup, not per request.

class PredictRequest(BaseModel):
    features: list[float]

class PredictResponse(BaseModel):
    prediction: float
    confidence: float

@app.post("/v1/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    score = model.predict([req.features])[0]
    # Hard-coded confidence is a placeholder; use a calibrated score in production.
    return PredictResponse(prediction=score, confidence=0.95)
```
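The graceful-shutdown checklist item can be sketched framework-agnostically with stdlib signal handling (serving frameworks like uvicorn wire this up for you, but the pattern is the same): on SIGTERM, stop accepting new work and let in-flight requests drain.

```python
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    """Mark the process as draining; the serving loop checks this flag."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def accept_request() -> bool:
    # Reject new work while draining so the load balancer retries elsewhere.
    return not shutting_down
```

Pair this with a `/ready` endpoint that starts returning failures during the drain, so orchestrators stop routing traffic before the process exits.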
## Monitoring

### What to Monitor
| Signal | Tool | Alert Threshold |
|---|---|---|
| Prediction latency (p99) | Prometheus + Grafana | > 500ms |
| Error rate | Prometheus + Grafana | > 1% |
| Input feature drift | Evidently, NannyML | PSI > 0.2 |
| Output distribution shift | Evidently | KL divergence > threshold |
| Business metric (downstream) | Custom | Defined per use case |
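The PSI threshold in the table can be computed in a few lines of NumPy (a common binned formulation; the bin count and the 0.2 cut-off are conventions, not laws, and tools like Evidently handle this for you):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the bin fractions to avoid log(0) and division by zero.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
drifted = rng.normal(0.8, 1, 10_000)  # shifted mean simulates feature drift
```

Run it per feature against a frozen training-time reference sample; a value above 0.2 is the usual trigger for investigation.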
### Logging
Log enough to reconstruct any prediction:
- Timestamp, request ID
- Input features (anonymized if needed)
- Model version
- Prediction and confidence
- Latency
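A structured log line covering the fields above can be emitted with the stdlib alone (a sketch; the field names are illustrative, and in practice you would route the JSON line through your logging stack rather than `print`):

```python
import json
import time
import uuid

def log_prediction(features, model_version, prediction, confidence, latency_ms):
    """Emit one JSON line per prediction so any result can be reconstructed."""
    record = {
        "ts": time.time(),
        "request_id": str(uuid.uuid4()),
        "features": features,  # anonymize or hash PII fields before logging
        "model_version": model_version,
        "prediction": prediction,
        "confidence": confidence,
        "latency_ms": latency_ms,
    }
    print(json.dumps(record))
    return record

log_prediction([0.1, 3.2], "v1.4.0", 0.87, 0.93, 42.0)
```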
## Deployment Strategy

### Shadow Mode
Run new model in parallel, log predictions, compare offline — zero user impact.
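The offline comparison can start very simply (a sketch assuming both models' predictions were logged per request, as in the logging section):

```python
import statistics

# Hypothetical paired prediction logs: (production_model, shadow_model) per request.
pairs = [(0.80, 0.78), (0.35, 0.41), (0.92, 0.90), (0.10, 0.15)]

diffs = [abs(a - b) for a, b in pairs]
mean_gap = statistics.mean(diffs)
max_gap = max(diffs)

# Promote the shadow model only if it tracks production closely here
# AND beats it on offline metrics once ground-truth labels arrive.
print(f"mean |diff| = {mean_gap:.3f}, max |diff| = {max_gap:.3f}")
```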
### Canary Release
Route 5–10% of traffic to new model, monitor metrics, gradually increase.
### A/B Testing
Split traffic intentionally, measure business metrics with statistical significance before full rollout.
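For a conversion-style metric, significance can be checked with a two-proportion z-test (stdlib-only sketch with made-up rollout numbers; for a real experiment, do a power analysis up front and use a proper stats library):

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Return the two-sided p-value for H0: the two conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical numbers: control arm vs. new-model arm.
p = two_proportion_z(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
```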
## CI/CD for ML
```yaml
# Example GitHub Actions stage
- name: Model evaluation gate
  run: |
    python evaluate.py --model artifacts/model.pkl --threshold 0.85
    # Fails pipeline if AUC < 0.85
```
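The `evaluate.py` referenced above could look like this (a hypothetical sketch: the `evaluate` body is a placeholder to replace with real scoring on your held-out set; the key point is the non-zero exit code that fails the CI job):

```python
"""Hypothetical evaluate.py: exit non-zero when the model misses the gate."""
import argparse
import sys

def evaluate(model_path: str) -> float:
    # Placeholder: load the model and score the held-out evaluation set,
    # e.g. sklearn.metrics.roc_auc_score(y_true, model.predict_proba(X)[:, 1]).
    return 0.87

def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--threshold", type=float, default=0.85)
    args = parser.parse_args(argv)
    auc = evaluate(args.model)
    print(f"AUC = {auc:.3f} (threshold {args.threshold})")
    # A non-zero exit fails the CI job, blocking promotion.
    return 0 if auc >= args.threshold else 1

if __name__ == "__main__":
    sys.exit(main())
```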
Key gates before promotion:
- Unit tests on feature engineering code
- Model performance above threshold on evaluation set
- Data validation passing on latest data slice
- Integration test against staging endpoint
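The feature-engineering unit-test gate can be as small as this (pytest-style sketch; `compute_features` is a hypothetical single function shared by the training and serving paths, which is also what prevents train/serve skew):

```python
def compute_features(raw: dict) -> list[float]:
    """Hypothetical shared feature function used by both train and serve paths."""
    age = float(raw.get("age", 0.0))
    spend = float(raw.get("spend", 0.0))
    return [age / 100.0, spend, float(spend > 0)]

def test_compute_features_handles_missing_fields():
    assert compute_features({}) == [0.0, 0.0, 0.0]

def test_compute_features_scales_age():
    assert compute_features({"age": 50, "spend": 2.5}) == [0.5, 2.5, 1.0]
```

Edge cases worth covering: missing fields, nulls, out-of-range values, and any categorical level unseen at training time.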
## Common Production Incidents
| Incident | Root Cause | Prevention |
|---|---|---|
| Silent model degradation | No output monitoring | Track prediction distribution daily |
| Training/serving skew | Different preprocessing code paths | Single feature computation function used in both |
| OOM in serving | Batch size too large | Load test before deployment |
| Slow cold start | Model loaded per request | Load model at startup, not per request |