MLOps Production Checklist

A practical checklist for taking ML models from experiment to production. Based on lessons from real deployments.

Model Development

  • Experiment tracking configured (MLflow, W&B, or Neptune)
  • Reproducible training: seeds fixed, environment pinned (requirements.txt or Docker)
  • Offline evaluation on held-out test set before any deployment
  • Baseline established (rule-based or previous model version)
  • Model card documented (training data, intended use, limitations)
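The "seeds fixed" item above can be sketched as a small helper. This is a minimal sketch: the function name is illustrative, and the PyTorch lines are left as comments since the framework varies by project.

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Fix the seeds that commonly affect training runs."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # If using PyTorch, also:
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)

set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)
assert (a == b).all()  # identical draws after re-seeding
```

Seeding alone does not guarantee reproducibility across library versions, which is why the checklist pairs it with environment pinning.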

Data Pipeline

  • Data validation at ingestion (schema checks, null rates, distribution drift)
  • Feature store or consistent feature engineering between train and serve
  • Training/serving skew identified and mitigated
  • Data versioning (DVC, Delta Lake, or equivalent)
# Example: Great Expectations data validation
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("data.csv")
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
results = validator.validate()
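The "consistent feature engineering between train and serve" item usually comes down to one function imported by both pipelines. A minimal sketch, where `compute_features` and the record keys are illustrative, not from a real codebase:

```python
def compute_features(record: dict) -> list[float]:
    """Single source of truth for feature engineering.

    Imported by both the training pipeline and the serving endpoint,
    so the same clipping/scaling rules apply in both places.
    """
    age = float(record.get("age", 0.0))
    spend = float(record.get("total_spend", 0.0))
    return [min(age, 120.0) / 120.0, spend / (1.0 + spend)]

# Training: features = [compute_features(r) for r in training_records]
# Serving:  features = compute_features(request_payload)
```

Keeping this in one module is the prevention listed later under "Training/serving skew".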

Serving Infrastructure

  • Model serialized in a portable format (ONNX, TorchScript, SavedModel)
  • API contract defined and versioned (/v1/predict)
  • Health check endpoint (/health, /ready)
  • Graceful shutdown handling
  • Request/response schema validated (Pydantic)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = load_model()  # placeholder: load the model once at startup, not per request

class PredictRequest(BaseModel):
    features: list[float]

class PredictResponse(BaseModel):
    prediction: float
    confidence: float

@app.post("/v1/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    score = model.predict([req.features])[0]
    return PredictResponse(prediction=score, confidence=0.95)

Monitoring

What to Monitor

Signal                       | Tool                 | Alert Threshold
Prediction latency (p99)     | Prometheus + Grafana | > 500 ms
Error rate                   | Prometheus + Grafana | > 1%
Input feature drift          | Evidently, NannyML   | PSI > 0.2
Output distribution shift    | Evidently            | KL divergence > threshold
Business metric (downstream) | Custom               | Defined per use case
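The PSI > 0.2 threshold above can be made concrete. A minimal sketch of computing the Population Stability Index between a reference sample and a live sample; the bin count and epsilon are illustrative choices:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray,
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two 1-D samples."""
    # Bin edges come from reference quantiles, so each bin holds
    # roughly equal reference mass and no bin starts empty.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    cur_frac = np.histogram(current, bins=edges)[0] / len(current) + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 10_000)
same = rng.normal(0, 1, 10_000)     # psi(ref, same) stays near 0
shifted = rng.normal(1, 1, 10_000)  # psi(ref, shifted) exceeds 0.2
```

A rule of thumb: PSI < 0.1 is stable, 0.1-0.2 warrants investigation, > 0.2 is an alert.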

Logging

Log enough to reconstruct any prediction:

  • Timestamp, request ID
  • Input features (anonymized if needed)
  • Model version
  • Prediction and confidence
  • Latency
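A minimal sketch of a structured prediction log carrying those fields, one JSON line per prediction; the function and field names are illustrative:

```python
import json
import time
import uuid

def log_prediction(features, model_version, prediction, confidence, latency_ms):
    """Emit one JSON line per prediction so any result can be reconstructed."""
    entry = {
        "timestamp": time.time(),
        "request_id": str(uuid.uuid4()),
        "features": features,          # anonymize upstream if needed
        "model_version": model_version,
        "prediction": prediction,
        "confidence": confidence,
        "latency_ms": latency_ms,
    }
    print(json.dumps(entry))  # in production, write to a log sink instead of stdout
    return entry

log_prediction([0.1, 0.9], "2024-06-v3", 0.73, 0.95, 12.4)
```

JSON lines keep the log machine-parseable, so drift and debugging queries can run directly over it.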

Deployment Strategy

Shadow Mode

Run new model in parallel, log predictions, compare offline — zero user impact.

Canary Release

Route 5–10% of traffic to new model, monitor metrics, gradually increase.

A/B Testing

Split traffic intentionally, measure business metrics with statistical significance before full rollout.
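For a conversion-style business metric, the significance check can be sketched as a two-proportion z-test using only the standard library; the traffic counts below are illustrative:

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for conversion rates of variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
# Roll out only if p is below the chosen alpha (e.g. 0.05).
```

In practice, also fix the sample size in advance rather than peeking, or the p-value loses its meaning.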

CI/CD for ML

# Example GitHub Actions stage
- name: Model evaluation gate
  run: |
    python evaluate.py --model artifacts/model.pkl --threshold 0.85
    # Fails the pipeline if AUC < 0.85

Key gates before promotion:

  1. Unit tests on feature engineering code
  2. Model performance above threshold on evaluation set
  3. Data validation passing on latest data slice
  4. Integration test against staging endpoint
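Gate 2 works because the evaluation script exits non-zero when the metric misses the bar, which is what fails the CI job. A minimal sketch of that exit-code logic; computing the AUC itself from the model and eval set is omitted:

```python
def evaluation_gate(auc: float, threshold: float = 0.85) -> int:
    """Return a process exit code: 0 promotes the model, 1 fails the CI job."""
    if auc < threshold:
        print(f"FAIL: AUC {auc:.3f} below threshold {threshold:.2f}")
        return 1
    print(f"PASS: AUC {auc:.3f} meets threshold {threshold:.2f}")
    return 0

# In evaluate.py, compute auc on the evaluation set, then:
# import sys; sys.exit(evaluation_gate(auc, threshold))
```

The same pattern applies to the other gates: each one is a script whose exit code blocks or allows promotion.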

Common Production Incidents

Incident                 | Root Cause                         | Prevention
Silent model degradation | No output monitoring               | Track prediction distribution daily
Training/serving skew    | Different preprocessing code paths | Single feature computation function used in both
OOM in serving           | Batch size too large               | Load test before deployment
Slow cold start          | Model loaded per request           | Load model at startup, not per request