Module 8: ML Deployment & Production

100 Days of ML Module 8 — Model serialisation, Flask/FastAPI REST APIs, Streamlit apps, Docker, cloud deployment (AWS/GCP/Heroku), model monitoring, CI/CD for ML, and capstone project.

⏱ 65 Min Read • 100 • Updated: May 2026

A model that lives only in a Jupyter notebook has zero business value. This final module teaches you to package, serve, containerise, deploy, and monitor ML models in production — the skills that separate data scientists from ML engineers. By Day 100 you'll have deployed a real model to the cloud.

Model Serialisation — pickle, joblib, ONNX

Why this matters

Model Serialisation: How you save a trained model determines whether it can be reloaded safely, which runtimes can execute it, and how easily a deployment can be reproduced or rolled back.

Why Serialisation?

Training a model every time you need a prediction is slow and wasteful. Serialisation saves the trained model (including all learned parameters and the preprocessing pipeline) to disk. At serving time, you load the serialised model and call predict() instantly.

pickle — Python's Native Serialisation

Code Example

import pickle
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# ── Train a pipeline ──────────────────────────────────────────
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  GradientBoostingClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)

# ── Method 1: pickle ──────────────────────────────────────────
with open('model.pkl', 'wb') as f:
    pickle.dump(pipeline, f, protocol=pickle.HIGHEST_PROTOCOL)

# Load
with open('model.pkl', 'rb') as f:
    loaded_pkl = pickle.load(f)
print(f"pickle accuracy: {loaded_pkl.score(X_test, y_test):.4f}")

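# Limitations:
# - Environment-specific: a pickle stores references to classes, so loading can fail
#   or misbehave under a different scikit-learn, numpy, or Python version
# - Security risk: never unpickle untrusted data (arbitrary code execution)
# - Can be slower and more memory-hungry than joblib for very large numpy arrays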

# ── Method 2: joblib — PREFERRED for sklearn models ───────────
# joblib builds on pickle but stores large numpy arrays more efficiently
joblib.dump(pipeline, 'model.joblib', compress=3)  # compress: 0-9, 3 = good balance
loaded_jl = joblib.load('model.joblib')
print(f"joblib accuracy: {loaded_jl.score(X_test, y_test):.4f}")

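# joblib features:
# - Efficient for models holding large numpy arrays (e.g., Random Forests)
# - compress param reduces file size (at the cost of slower save/load)
# - Uncompressed files can be memory-mapped on load: joblib.load(path, mmap_mode='r')
# - Same security caveat as pickle: only load files you trust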

# ── Method 3: ONNX — cross-platform, cross-language ──────────
# pip install skl2onnx onnxruntime
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as rt

# Convert sklearn pipeline to ONNX
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(pipeline, initial_types=initial_type, target_opset=15)

with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())

# Run inference with ONNX Runtime (works in C++, Java, C#, JavaScript too!)
sess = rt.InferenceSession('model.onnx', providers=['CPUExecutionProvider'])
input_name  = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

X_test_float = X_test[:5].astype(np.float32)
onnx_preds = sess.run([output_name], {input_name: X_test_float})[0]
print(f"ONNX predictions: {onnx_preds}")

# ONNX advantages:
# - Language-agnostic: deploy Python model in a Go/Java/C++ service
# - Hardware-optimised: ONNX Runtime uses CPU/GPU optimisations
# - Used in production by Microsoft, NVIDIA, Intel

# ── Versioning convention ─────────────────────────────────────
import datetime
import os
os.makedirs('models', exist_ok=True)  # joblib.dump does not create missing directories
version = datetime.datetime.now().strftime('%Y%m%d_%H%M')
joblib.dump(pipeline, f'models/churn_v{version}.joblib')  # e.g. models/churn_v20260526_1430.joblib

Format	Speed	Cross-Language	Use Case
pickle	OK	Python only	Quick prototyping; small models
joblib	Fast	Python only	Production sklearn/numpy models
ONNX	Very fast (optimised)	Any language	Enterprise cross-platform deployment
MLflow format	Varies	REST API via mlflow models serve	MLflow ecosystem
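
Because unpickling untrusted bytes can execute arbitrary code, one common precaution is to record a checksum when the artefact is written and refuse to load a file whose checksum has changed. The sketch below uses only the standard library. The helper names (`save_with_manifest`, `load_verified`) and the `.manifest.json` sidecar file are hypothetical, and a plain dict stands in for the trained pipeline. A checksum only helps if the manifest itself is kept somewhere an attacker cannot rewrite.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_path_for(path: Path) -> Path:
    """Hypothetical convention: model.pkl -> model.pkl.manifest.json."""
    return path.with_name(path.name + ".manifest.json")


def save_with_manifest(obj, path: Path) -> None:
    """Pickle obj and store its checksum in a sidecar manifest."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    manifest_path_for(path).write_text(json.dumps({"sha256": sha256_of(path)}))


def load_verified(path: Path):
    """Unpickle only if the file still matches the recorded checksum."""
    expected = json.loads(manifest_path_for(path).read_text())["sha256"]
    if sha256_of(path) != expected:
        raise ValueError(f"Checksum mismatch for {path.name}; refusing to unpickle")
    with open(path, "rb") as f:
        return pickle.load(f)


# Demo: a dict stands in for the trained pipeline (assumption for illustration)
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    save_with_manifest({"weights": [0.1, 0.2]}, model_path)
    restored = load_verified(model_path)

    # Simulate tampering by appending bytes to the artefact
    with open(model_path, "ab") as f:
        f.write(b"tampered")
    try:
        load_verified(model_path)
        tamper_detected = False
    except ValueError:
        tamper_detected = True

print(restored, tamper_detected)
```

The same pattern should carry over to joblib files, since the check runs on the raw bytes before any deserialisation happens.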

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain model serialisation and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 92 — Flask REST API

Flask REST API for ML

Why this matters

Wrapping the model in a REST API lets other services request predictions over HTTP without importing your training code or sharing your Python environment.

Project Structure

Code Example

churn-api/
├── app.py              ← Flask application
├── model.joblib        ← Serialised pipeline
├── requirements.txt    ← Flask, joblib, scikit-learn, gunicorn
└── Dockerfile          ← Container definition (Day 95)

app.py — Complete Flask ML API

"""
app.py — Flask REST API for churn prediction model
Usage: python app.py  (development)
       gunicorn -w 4 -b 0.0.0.0:5000 app:app  (production)
"""
from flask import Flask, request, jsonify
import joblib
import numpy as np
import pandas as pd
import logging
import time
from functools import wraps

# ── Initialise app ────────────────────────────────────────────
app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# ── Load model at startup (not per-request!) ──────────────────
try:
    pipeline = joblib.load('model.joblib')
    logger.info("Model loaded successfully")
except Exception as e:
    logger.error(f"Failed to load model: {e}")
    pipeline = None

# ── Input validation ──────────────────────────────────────────
REQUIRED_FIELDS = ['tenure_months', 'monthly_charges', 'total_charges',
                   'contract_type', 'internet_service', 'tech_support',
                   'senior_citizen', 'num_complaints']

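def validate_input(data):
    """Validate request JSON. Returns (cleaned_data, error_message)."""
    if not isinstance(data, dict):
        return None, "Request body must be a JSON object"
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        return None, f"Missing fields: {missing}"

    # Keep only the expected fields so stray keys never reach the model
    cleaned = {f: data[f] for f in REQUIRED_FIELDS}

    # Type and range validation for the numeric fields
    try:
        cleaned['tenure_months'] = int(data['tenure_months'])
        if not (0 < cleaned['tenure_months'] <= 120):
            return None, "tenure_months must be between 1 and 120"
        cleaned['monthly_charges'] = float(data['monthly_charges'])
        if not (0 < cleaned['monthly_charges'] <= 500):
            return None, "monthly_charges must be greater than 0 and at most 500"
    except (ValueError, TypeError) as e:
        return None, f"Invalid data type: {e}"

    return cleaned, None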

# ── Timing decorator ──────────────────────────────────────────
def timed(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # monotonic clock, better suited to measuring intervals
        result = f(*args, **kwargs)
        elapsed = (time.perf_counter() - start) * 1000
        logger.info(f"{f.__name__} took {elapsed:.1f}ms")
        return result
    return wrapper

# ── Health check endpoint ─────────────────────────────────────
@app.route('/health', methods=['GET'])
def health():
    """Kubernetes/Docker health check."""
    ok = pipeline is not None  # explicit None check instead of relying on object truthiness
    return jsonify({'status': 'healthy' if ok else 'unhealthy', 'model': 'churn-v1'}), (200 if ok else 503)

# ── Single prediction endpoint ────────────────────────────────
@app.route('/predict', methods=['POST'])
@timed
def predict():
    """Predict churn probability for a single customer.
    
    Request JSON:
    {
        "tenure_months": 12,
        "monthly_charges": 65.0,
        "total_charges": 780.0,
        "contract_type": "Month-to-month",
        "internet_service": "Fiber optic",
        "tech_support": "No",
        "senior_citizen": 0,
        "num_complaints": 1
    }
    
    Response:
    {
        "churn_probability": 0.73,
        "prediction": 1,
        "risk_level": "High"
    }
    """
    if pipeline is None:
        return jsonify({'error': 'Model not loaded'}), 503

    # Parse JSON
    if not request.is_json:
        return jsonify({'error': 'Content-Type must be application/json'}), 400

    data = request.get_json()
    if not data:
        return jsonify({'error': 'Empty request body'}), 400

    # Validate
    cleaned, error = validate_input(data)
    if error:
        return jsonify({'error': error}), 422

    # Predict
    try:
        df = pd.DataFrame([cleaned])
        prob = pipeline.predict_proba(df)[0, 1]
        prediction = int(prob >= 0.35)  # Custom threshold from Day 87
        risk = 'High' if prob > 0.6 else ('Medium' if prob > 0.35 else 'Low')

        return jsonify({
            'churn_probability': round(float(prob), 4),
            'prediction':        prediction,
            'risk_level':        risk,
            'recommended_action': {
                'High':   'Immediate retention call + discount offer',
                'Medium': 'Send personalised retention email',
                'Low':    'No action needed'
            }[risk]
        }), 200

    except Exception as e:
        logger.error(f"Prediction error: {e}", exc_info=True)
        # Keep exception details in the server logs; do not leak internals to clients
        return jsonify({'error': 'Prediction failed'}), 500

# ── Batch prediction endpoint ─────────────────────────────────
@app.route('/predict/batch', methods=['POST'])
@timed
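def predict_batch():
    """Predict for multiple customers at once."""
    if pipeline is None:
        return jsonify({'error': 'Model not loaded'}), 503

    data = request.get_json(silent=True)
    if not isinstance(data, list) or not data:
        return jsonify({'error': 'Request body must be a non-empty JSON array'}), 400
    if len(data) > 1000:
        return jsonify({'error': 'Maximum 1000 records per batch'}), 400

    # Validate every record with the same rules as /predict
    rows = []
    for i, record in enumerate(data):
        if not isinstance(record, dict):
            return jsonify({'error': f'Record {i} must be a JSON object'}), 422
        cleaned, error = validate_input(record)
        if error:
            return jsonify({'error': f'Record {i}: {error}'}), 422
        rows.append(cleaned)

    try:
        df = pd.DataFrame(rows)
        probs = pipeline.predict_proba(df)[:, 1]
        results = [
            {'index': i, 'churn_probability': round(float(p), 4),
             'prediction': int(p >= 0.35)}
            for i, p in enumerate(probs)
        ]
        return jsonify({'predictions': results, 'count': len(results)}), 200
    except Exception as e:
        logger.error(f"Batch prediction error: {e}", exc_info=True)
        return jsonify({'error': 'Batch prediction failed'}), 500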

# ── Model info endpoint ───────────────────────────────────────
@app.route('/model/info', methods=['GET'])
def model_info():
    if pipeline is None:
        return jsonify({'error': 'Model not loaded'}), 503
    return jsonify({
        'model_type': type(pipeline.named_steps.get('model', pipeline)).__name__,
        'features': REQUIRED_FIELDS,
        'threshold': 0.35,
        'version': 'v1.0.0'
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=False)

requirements.txt

flask==3.0.0
gunicorn==21.2.0
joblib==1.3.2
scikit-learn==1.4.0
numpy==1.26.0
pandas==2.1.0

Code Example

# ── Test the API with curl ────────────────────────────────────
# Health check
curl http://localhost:5000/health

# Single prediction
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"tenure_months": 3, "monthly_charges": 85.0, "total_charges": 255.0,
       "contract_type": "Month-to-month", "internet_service": "Fiber optic",
       "tech_support": "No", "senior_citizen": 0, "num_complaints": 2}'

# Example response (values are illustrative and depend on your trained model):
# {"churn_probability": 0.7831, "prediction": 1, "risk_level": "High",
#  "recommended_action": "Immediate retention call + discount offer"}
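
The `validate_input` function above range-checks only the two numeric fields, so a typo such as "Fibre optic" would reach the model unchecked. A minimal sketch of a categorical check follows. The allowed values are copied from the `Literal` types in the FastAPI `CustomerFeatures` model later in this module; whether your trained pipeline saw exactly these categories is an assumption to verify. `ALLOWED_VALUES` and `validate_categoricals` are hypothetical names.

```python
from typing import Dict, List

# Allowed categories per field (assumed to match what the pipeline was trained on)
ALLOWED_VALUES = {
    "contract_type": {"Month-to-month", "One year", "Two year"},
    "internet_service": {"DSL", "Fiber optic", "No"},
    "tech_support": {"Yes", "No"},
}


def validate_categoricals(data: Dict) -> List[str]:
    """Return a list of error messages; an empty list means every value is allowed."""
    errors = []
    for field, allowed in ALLOWED_VALUES.items():
        value = data.get(field)
        # isinstance guard: JSON may contain lists or dicts, which are unhashable
        if not isinstance(value, str) or value not in allowed:
            errors.append(f"{field} must be one of {sorted(allowed)}, got {value!r}")
    return errors


good = {
    "contract_type": "Month-to-month",
    "internet_service": "Fiber optic",
    "tech_support": "No",
}
bad = dict(good, contract_type="Weekly", tech_support=["No"])

print(validate_categoricals(good))
print(validate_categoricals(bad))
```

Returning a list of messages rather than stopping at the first problem lets the API report every invalid field in a single 422 response.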

Recap

You can explain how a Flask REST API serves an ML model and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 93 — FastAPI

FastAPI for ML — Modern, Fast, Auto-Documented

Why this matters

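FastAPI is a modern alternative to Flask for ML APIs. It offers automatic OpenAPI/Swagger documentation, request validation via Pydantic, and native async support. Benchmarks often show higher throughput than Flask for I/O-bound workloads, but the gain depends on the server setup, and model inference itself is usually CPU-bound.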

main.py — Complete FastAPI ML Service

"""
main.py — FastAPI ML service with Pydantic validation and auto-documentation
Run: uvicorn main:app --host 0.0.0.0 --port 8000 --reload  (development; drop --reload in production)
Docs: http://localhost:8000/docs  (Swagger UI — generated automatically!)
"""
from fastapi import FastAPI, HTTPException, status
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field, validator
from typing import List, Literal, Optional
import joblib
import pandas as pd
import numpy as np
import logging
from contextlib import asynccontextmanager

logger = logging.getLogger(__name__)

# ── Lifespan (load model at startup) ─────────────────────────
ml_models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: load model
    try:
        ml_models['churn'] = joblib.load('model.joblib')
        logger.info("Model loaded")
    except Exception as e:
        logger.error(f"Model load failed: {e}")
    yield
    # Shutdown: cleanup
    ml_models.clear()

app = FastAPI(
    title="Churn Prediction API",
    description="ML-powered customer churn prediction service",
    version="1.0.0",
    lifespan=lifespan
)

# ── CORS middleware ───────────────────────────────────────────
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],   # Restrict in production
    allow_methods=["*"],
    allow_headers=["*"],
)

# ── Pydantic Models — automatic validation + documentation ────
class CustomerFeatures(BaseModel):
    """Input features for a single customer."""
    tenure_months:    int    = Field(..., ge=1, le=120, description="Months as customer (1-120)")
    monthly_charges:  float  = Field(..., ge=0, le=500, description="Monthly bill in USD")
    total_charges:    float  = Field(..., ge=0, description="Total amount billed")
    contract_type:    Literal['Month-to-month', 'One year', 'Two year']
    internet_service: Literal['DSL', 'Fiber optic', 'No']
    tech_support:     Literal['Yes', 'No']
    senior_citizen:   Literal[0, 1] = Field(..., description="1 if senior citizen")
    num_complaints:   int    = Field(..., ge=0, le=50, description="Number of support complaints")

    # Note: @validator is the Pydantic v1 API. It still runs under Pydantic v2 but is
    # deprecated there in favour of @field_validator.
    @validator('total_charges')
    def total_must_be_consistent(cls, v, values):
        """Sanity check: total_charges should be at least half of one monthly bill."""
        if 'monthly_charges' in values and v < values['monthly_charges'] * 0.5:
            raise ValueError('total_charges seems too low for the given monthly_charges')
        return v

    # Pydantic v2 style config (replaces the deprecated inner `class Config`)
    model_config = {
        "json_schema_extra": {
            "example": {
                "tenure_months": 3, "monthly_charges": 85.0, "total_charges": 255.0,
                "contract_type": "Month-to-month", "internet_service": "Fiber optic",
                "tech_support": "No", "senior_citizen": 0, "num_complaints": 2
            }
        }
    }

class BatchRequest(BaseModel):
    customers: List[CustomerFeatures] = Field(..., max_length=1000)  # Pydantic v2 name for max_items

class PredictionResponse(BaseModel):
    churn_probability: float
    prediction:        int
    risk_level:        str
    recommended_action: str

class BatchPredictionResponse(BaseModel):
    predictions: List[PredictionResponse]
    count: int

# ── Prediction logic ──────────────────────────────────────────
def get_prediction(features: CustomerFeatures) -> PredictionResponse:
    if 'churn' not in ml_models:
        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                            detail="Model not loaded")
    df = pd.DataFrame([features.model_dump()])  # .dict() is deprecated in Pydantic v2
    prob = float(ml_models['churn'].predict_proba(df)[0, 1])
    pred = int(prob >= 0.35)
    risk = 'High' if prob > 0.6 else ('Medium' if prob > 0.35 else 'Low')
    actions = {
        'High':   'Immediate retention call + discount offer',
        'Medium': 'Send personalised retention email',
        'Low':    'No action needed'
    }
    return PredictionResponse(
        churn_probability=round(prob, 4),
        prediction=pred,
        risk_level=risk,
        recommended_action=actions[risk]
    )

# ── Endpoints ─────────────────────────────────────────────────
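@app.get("/health", tags=["Monitoring"])
async def health_check():
    """Kubernetes readiness probe: returns 503 until the model is loaded."""
    if 'churn' not in ml_models:
        # A non-2xx status lets probes (and `curl -f`) detect the degraded state
        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                            detail={"status": "degraded", "model_loaded": False})
    return {"status": "healthy", "model_loaded": True}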

@app.post("/predict", response_model=PredictionResponse, tags=["Prediction"])
# Plain `def`: FastAPI runs it in a threadpool, so CPU-bound inference does not block the event loop
def predict(customer: CustomerFeatures):
    """Predict churn probability for a single customer.
    
    Returns probability (0-1), binary prediction, and recommended retention action.
    """
    return get_prediction(customer)

@app.post("/predict/batch", response_model=BatchPredictionResponse, tags=["Prediction"])
# Plain `def` for the same reason as /predict: inference is CPU-bound
def predict_batch(batch: BatchRequest):
    """Predict churn for up to 1000 customers in a single request."""
    predictions = [get_prediction(c) for c in batch.customers]
    return BatchPredictionResponse(predictions=predictions, count=len(predictions))

@app.get("/model/info", tags=["Model"])
async def model_info():
    """Return metadata about the deployed model."""
    return {
        "model_version": "1.0.0",
        "threshold": 0.35,
        "features": list(CustomerFeatures.model_fields.keys()),  # Pydantic v2 name for __fields__
        "documentation": "/docs"
    }

# ── requirements.txt for FastAPI ──────────────────────────────
# fastapi==0.109.0
# uvicorn[standard]==0.27.0
# pydantic==2.6.0
# joblib==1.3.2
# scikit-learn==1.4.0
# numpy==1.26.0
# pandas==2.1.0

💡

Flask vs FastAPI — When to Choose

FastAPI: New projects on a recent Python 3 that want auto-generated docs, async I/O, and type-checked request models. A common default for new Python ML services.
Flask: Legacy systems, simpler needs, or when team already knows Flask well.
Both work fine for ML serving. FastAPI's Pydantic validation catches bad inputs automatically — a huge production win.
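
The batch endpoint accepts at most 1000 customers per request, so a client holding a larger list has to split it first. The sketch below shows one way to build request bodies in the `{"customers": [...]}` shape that `BatchRequest` expects. `chunked` and `build_batch_payloads` are hypothetical helper names, the records are synthetic, and actually sending the payloads (for example with an HTTP client library) is left out.

```python
import json
from typing import Dict, Iterator, List

MAX_BATCH = 1000  # mirrors the limit declared on BatchRequest.customers


def chunked(records: List[Dict], size: int = MAX_BATCH) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield records[start:start + size]


def build_batch_payloads(records: List[Dict]) -> List[str]:
    """Return one JSON request body per chunk, ready to POST to /predict/batch."""
    return [json.dumps({"customers": chunk}) for chunk in chunked(records)]


# Synthetic example: 2,500 identical customers
customer = {
    "tenure_months": 3, "monthly_charges": 85.0, "total_charges": 255.0,
    "contract_type": "Month-to-month", "internet_service": "Fiber optic",
    "tech_support": "No", "senior_citizen": 0, "num_complaints": 2,
}
records = [dict(customer) for _ in range(2500)]

payloads = build_batch_payloads(records)
sizes = [len(json.loads(p)["customers"]) for p in payloads]
print(sizes)  # [1000, 1000, 500]
```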

Recap

You can explain how FastAPI serves an ML model and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 94 — Streamlit Apps

Streamlit Apps — ML Demos in 20 Lines

Why this matters

Streamlit turns Python scripts into interactive web apps with no HTML/CSS/JS required. Perfect for ML demos, internal tools, and portfolio pieces.

streamlit_app.py — Full Interactive ML Demo

"""
streamlit_app.py — Interactive churn prediction demo
Run: streamlit run streamlit_app.py
"""
import streamlit as st
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go

# ── Page config ───────────────────────────────────────────────
st.set_page_config(
    page_title="Churn Predictor",
    page_icon="📊",
    layout="wide",
    initial_sidebar_state="expanded"
)

# ── Load model (cached so it doesn't reload on every interaction) ─
@st.cache_resource
def load_model():
    return joblib.load('model.joblib')

model = load_model()

# ── Sidebar — feature inputs ──────────────────────────────────
st.sidebar.title("🔧 Customer Features")
st.sidebar.markdown("Adjust parameters to predict churn probability")

tenure = st.sidebar.slider(
    "Tenure (months)", min_value=1, max_value=72, value=12,
    help="How long has this customer been with us?"
)
monthly_charges = st.sidebar.slider(
    "Monthly Charges ($)", min_value=20.0, max_value=120.0, value=65.0, step=1.0
)
contract_type = st.sidebar.selectbox(
    "Contract Type",
    options=["Month-to-month", "One year", "Two year"],
    help="Month-to-month customers typically churn more often than those on longer contracts"
)
internet_service = st.sidebar.selectbox(
    "Internet Service",
    options=["DSL", "Fiber optic", "No"]
)
tech_support = st.sidebar.radio(
    "Tech Support", options=["Yes", "No"], horizontal=True
)
senior_citizen = st.sidebar.checkbox("Senior Citizen")
num_complaints = st.sidebar.number_input(
    "Number of Complaints", min_value=0, max_value=20, value=0
)

# ── Main content ──────────────────────────────────────────────
st.title("📊 Customer Churn Predictor")
st.markdown("Predict whether a customer will churn in the next 30 days based on their profile.")

# Prepare input
total_charges = tenure * monthly_charges
input_data = pd.DataFrame([{
    'tenure_months':    tenure,
    'monthly_charges':  monthly_charges,
    'total_charges':    total_charges,
    'contract_type':    contract_type,
    'internet_service': internet_service,
    'tech_support':     tech_support,
    'senior_citizen':   int(senior_citizen),
    'num_complaints':   num_complaints
}])

# Predict
prob = model.predict_proba(input_data)[0, 1]
risk = 'High' if prob > 0.6 else ('Medium' if prob > 0.35 else 'Low')
color_map = {'High': '#e74c3c', 'Medium': '#f39c12', 'Low': '#2ecc71'}

# ── Layout ────────────────────────────────────────────────────
col1, col2, col3 = st.columns(3)

with col1:
    # 0.2 is a hard-coded placeholder for the average churn rate; use your own baseline
    st.metric("Churn Probability", f"{prob:.1%}",
              delta=f"{prob - 0.2:.1%} vs avg",
              delta_color="inverse")

with col2:
    st.metric("Risk Level", risk)

with col3:
    # tenure × monthly charges: an estimate of billing to date, not a true lifetime value
    st.metric("Total Billed (est.)", f"${total_charges:,.0f}")

# ── Gauge chart ───────────────────────────────────────────────
fig = go.Figure(go.Indicator(
    mode="gauge+number",
    value=prob * 100,
    domain={'x': [0, 1], 'y': [0, 1]},
    title={'text': "Churn Probability (%)", 'font': {'size': 20}},
    gauge={
        'axis': {'range': [0, 100], 'tickwidth': 1},
        'bar':  {'color': color_map[risk]},
        'steps': [
            {'range': [0, 35],  'color': 'rgba(46,204,113,.15)'},
            {'range': [35, 60], 'color': 'rgba(243,156,18,.15)'},
            {'range': [60, 100],'color': 'rgba(231,76,60,.15)'}
        ],
        'threshold': {
            'line': {'color': "white", 'width': 3},
            'thickness': 0.75, 'value': 35
        }
    }
))
fig.update_layout(height=250, margin=dict(t=30, b=10))
st.plotly_chart(fig, use_container_width=True)

# ── Recommendations ───────────────────────────────────────────
st.subheader("📋 Recommended Actions")
if risk == 'High':
    st.error("🚨 High churn risk! Immediate intervention recommended.")
    st.markdown("""
    - 📞 **Immediate retention call** — assign to high-priority queue
    - 🎁 **Offer contract upgrade** — 20% discount for switching to annual plan
    - 🛠️ **Free tech support upgrade** for 3 months
    """)
elif risk == 'Medium':
    st.warning("⚠️ Moderate churn risk. Proactive outreach recommended.")
    st.markdown("""
    - 📧 Send personalised retention email with loyalty rewards
    - 💬 Trigger in-app survey to understand pain points
    """)
else:
    st.success("✅ Low churn risk. Customer appears satisfied.")
    st.markdown("- Consider upselling premium features")

# ── Feature analysis ──────────────────────────────────────────
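with st.expander("🔍 Feature Impact Analysis"):
    # NOTE: these impacts are hard-coded, illustrative heuristics. They are not
    # derived from the loaded model and should not be read as real attributions.
    st.markdown("Illustrative view of how each feature might push churn risk up or down:")
    feature_impacts = {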
        'Contract Type': 0.28 if contract_type == 'Month-to-month' else -0.15,
        'Tenure': -0.02 * tenure,
        'Monthly Charges': 0.003 * monthly_charges,
        'Complaints': 0.12 * num_complaints,
        'Tech Support': -0.08 if tech_support == 'Yes' else 0.05
    }
    impact_df = pd.DataFrame.from_dict(
        feature_impacts, orient='index', columns=['Impact']
    ).sort_values('Impact', ascending=True)

    fig2, ax = plt.subplots(figsize=(8, 3))
    colors = ['#e74c3c' if v > 0 else '#2ecc71' for v in impact_df['Impact']]
    impact_df['Impact'].plot.barh(ax=ax, color=colors)
    ax.set_xlabel("Impact on Churn Probability"); ax.set_title("Feature Impacts")
    ax.axvline(x=0, color='white', linewidth=0.5)
    fig2.patch.set_facecolor('none'); ax.set_facecolor('none')
    st.pyplot(fig2)
    plt.close(fig2)  # free the figure; Streamlit reruns the script on every interaction

# ── Batch prediction ──────────────────────────────────────────
st.divider()
st.subheader("📁 Batch Prediction")
uploaded = st.file_uploader("Upload a CSV of customers", type=['csv'])
if uploaded:
    batch_df = pd.read_csv(uploaded)
    st.dataframe(batch_df.head())
    if st.button("Run Batch Prediction"):
        with st.spinner("Predicting..."):
            probs = model.predict_proba(batch_df)[:, 1]
            batch_df['churn_probability'] = probs.round(4)
            batch_df['risk_level'] = pd.cut(
                probs, bins=[0, 0.35, 0.6, 1.0],
                labels=['Low', 'Medium', 'High'],
                include_lowest=True  # otherwise a probability of exactly 0 becomes NaN
            )
        st.success(f"Done! {len(batch_df)} predictions made.")
        st.dataframe(batch_df)
        csv = batch_df.to_csv(index=False).encode('utf-8')
        st.download_button("⬇️ Download Results", csv, "predictions.csv", "text/csv")
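
The batch section passes the uploaded CSV straight to `predict_proba`, which fails with an opaque error if a column is missing. A minimal sketch of a header check that could run first is shown below, using only the standard library. `REQUIRED_COLUMNS` reuses the feature names from the Flask API earlier in this module, and `missing_columns` is a hypothetical helper name.

```python
import csv
import io
from typing import List, Sequence

# Feature names the pipeline expects (taken from REQUIRED_FIELDS in the Flask API)
REQUIRED_COLUMNS = [
    "tenure_months", "monthly_charges", "total_charges", "contract_type",
    "internet_service", "tech_support", "senior_citizen", "num_complaints",
]


def missing_columns(csv_text: str, required: Sequence[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty file -> empty header
    present = {name.strip() for name in header}
    return [col for col in required if col not in present]


complete_csv = (
    "tenure_months,monthly_charges,total_charges,contract_type,"
    "internet_service,tech_support,senior_citizen,num_complaints\n"
    "3,85.0,255.0,Month-to-month,Fiber optic,No,0,2\n"
)
incomplete_csv = "tenure_months,monthly_charges\n3,85.0\n"

print(missing_columns(complete_csv))
print(missing_columns(incomplete_csv))
```

In the app, the text could come from the uploaded file's bytes decoded as UTF-8, and a non-empty result could be shown with `st.error` before any prediction is attempted.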

Recap

You can explain Streamlit apps and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 95 — Docker for ML

Docker — Containerising ML Models

Why this matters

⚠️

Failure mode — Image bloat

Installing build-essential and Jupyter in production images increases the attack surface and cold-start time. Use multi-stage builds: compile dependencies in a builder stage, then copy only the installed packages and the serialised model (model.joblib in this module) into a slim final image.

Why Docker for ML?

Docker packages your app + all its dependencies (Python version, libraries, OS libraries) into a single portable container. "Works on my machine" becomes "works everywhere" — local laptop, AWS, GCP, Azure, Kubernetes.

Dockerfile — Production-grade ML API container

# ── Use official Python slim image (smaller than full Python) ─
FROM python:3.11-slim

# ── Set environment variables ─────────────────────────────────
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

# ── Set working directory ─────────────────────────────────────
WORKDIR /app

# ── Install system dependencies ───────────────────────────────
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

# ── Install Python dependencies ───────────────────────────────
# Copy requirements first (Docker layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ── Copy application code ─────────────────────────────────────
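# Dockerfile comments must start their own line; a trailing "# ..." would be
# parsed as extra COPY arguments and break the build.
# FastAPI app
COPY main.py .
# Serialised model
COPY model.joblib .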

# ── Create non-root user (security best practice) ─────────────
RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
USER appuser

# ── Expose port ───────────────────────────────────────────────
EXPOSE 8000

# ── Health check ──────────────────────────────────────────────
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# ── Start command ─────────────────────────────────────────────
CMD ["uvicorn", "main:app", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--workers", "2", \
     "--log-level", "info"]

Code Example

# ── Build and run Docker container ───────────────────────────

# Build the image
docker build -t churn-api:v1.0 .

# Run container locally
docker run -d \
  --name churn-api \
  -p 8000:8000 \
  -e LOG_LEVEL=info \
  churn-api:v1.0

# Check it's running
docker ps
docker logs churn-api

# Test the endpoint
curl http://localhost:8000/health
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"tenure_months": 3, "monthly_charges": 85.0, ...}'

# Stop and remove
docker stop churn-api && docker rm churn-api

# Push to Docker Hub (or AWS ECR)
docker tag churn-api:v1.0 yourusername/churn-api:v1.0
docker push yourusername/churn-api:v1.0

# ── Docker Compose — for multi-service setup ──────────────────
# docker-compose.yml (the top-level "version" key is obsolete in Compose v2 and is omitted)
services:
  api:
    build: .
    ports: ["8000:8000"]
    environment:
      - LOG_LEVEL=info
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  # Optional: Redis for caching predictions
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

# ── .dockerignore — exclude unnecessary files ─────────────────
# __pycache__/
# *.pyc
# .git/
# notebooks/
# tests/
# *.md
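
The Dockerfile above installs curl only so that HEALTHCHECK can call /health. Since Python is already in the image, the same probe can be sketched with the standard library, which would let you drop the curl package. This is an assumption-laden sketch, not the module's own setup: `healthcheck.py`, `is_healthy`, and `main` are hypothetical names, and the default URL assumes the service listens on port 8000 as in this module.

```python
import http.client
import urllib.error
import urllib.request
from typing import List

DEFAULT_URL = "http://localhost:8000/health"  # assumed to match EXPOSE 8000 above


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # HTTPError (4xx/5xx), refused connections, timeouts, and malformed URLs
        return False


def main(argv: List[str]) -> int:
    """Return a process exit code: 0 when healthy, 1 otherwise."""
    url = argv[1] if len(argv) > 1 else DEFAULT_URL
    return 0 if is_healthy(url) else 1


# Nothing is expected to listen on port 1, so this normally reports "unhealthy"
print(main(["healthcheck.py", "http://127.0.0.1:1/health"]))
```

To use it as a container health check, the script would end by exiting with the value of `main(sys.argv)`, and the HEALTHCHECK instruction would run `python healthcheck.py` instead of curl.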

Recap

You can explain Docker and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 96 — Cloud Deployment

Cloud Deployment — Heroku, AWS EC2, Google Cloud Run

Why this matters

Cloud Deployment: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Option 1: Heroku — Easiest (but no free tier since 2022)

Code Example

# ── Procfile ──────────────────────────────────────────────────
web: gunicorn -w 2 -k uvicorn.workers.UvicornWorker main:app

# ── Deploy to Heroku ──────────────────────────────────────────
heroku login
heroku create churn-api-genaiwallah
heroku config:set LOG_LEVEL=info

git push heroku main

# The app will be live at https://churn-api-genaiwallah.herokuapp.com
heroku logs --tail    # Stream logs

Option 2: AWS EC2 — Full Control

Code Example

# ── 1. Launch EC2 instance (t2.micro = free tier) ─────────────
# Console: EC2 → Launch Instance → Ubuntu 22.04 → t2.micro
# Configure Security Group: Allow port 80, 443, 22 (SSH)

# ── 2. SSH into your instance ─────────────────────────────────
ssh -i "my-key.pem" ubuntu@your-ec2-ip

# ── 3. Install dependencies ───────────────────────────────────
sudo apt-get update
sudo apt-get install -y python3-pip nginx

# ── 4. Clone and set up your app ─────────────────────────────
git clone https://github.com/youruser/churn-api.git
cd churn-api
pip3 install -r requirements.txt

# ── 5. Run FastAPI with gunicorn managing Uvicorn (ASGI) workers ─
gunicorn -w 2 -k uvicorn.workers.UvicornWorker main:app \
  --bind 0.0.0.0:8000 --daemon

# ── 6. Configure Nginx as reverse proxy ───────────────────────
# /etc/nginx/sites-available/churn-api
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

sudo ln -s /etc/nginx/sites-available/churn-api /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl restart nginx

# ── 7. SSL with Let's Encrypt (HTTPS) ────────────────────────
sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d your-domain.com

Option 3: Google Cloud Run — Serverless Containers (Recommended)

Code Example

# ── Deploy Docker container to Cloud Run ─────────────────────
# Install gcloud CLI first: https://cloud.google.com/sdk/docs/install

# Authenticate
gcloud auth login
gcloud config set project your-project-id

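# Create a Docker repository in Artifact Registry (one-time setup).
# Container Registry (gcr.io) is deprecated; Artifact Registry is its replacement.
# "churn-repo" is a placeholder repository name.
gcloud artifacts repositories create churn-repo \
  --repository-format=docker \
  --location=us-central1

# Build with Cloud Build and push to Artifact Registry
gcloud builds submit \
  --tag us-central1-docker.pkg.dev/your-project-id/churn-repo/churn-api:v1

# Deploy to Cloud Run (serverless: pay only for requests).
# --port 8000 assumes the container listens on 8000 as in the Docker section;
# without it Cloud Run routes traffic to port 8080.
gcloud run deploy churn-api \
  --image us-central1-docker.pkg.dev/your-project-id/churn-repo/churn-api:v1 \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --port 8000 \
  --memory 512Mi \
  --cpu 1 \
  --max-instances 10 \
  --min-instances 0        # Scale to 0 when no traffic (cost-effective)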

# Your API will be live at:
# https://churn-api-xxxx-uc.a.run.app
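
Cloud Run tells the container which port to listen on through the `PORT` environment variable, and Heroku does the same. A small helper lets one entry point work locally and on both platforms. This is a sketch only: `resolve_port` is a hypothetical helper name, and the fallback of 8000 simply mirrors the port used earlier in this module.

```python
import os


def resolve_port(default: int = 8000) -> int:
    """Return the port from the PORT environment variable, or `default` if unset."""
    raw = os.environ.get("PORT", "").strip()
    if not raw:
        return default
    port = int(raw)  # raises ValueError if PORT is not a number
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    return port
```

With Uvicorn installed, the app could then be started with `uvicorn.run("main:app", host="0.0.0.0", port=resolve_port())`.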

Platform	Ease	Cost	Scalability	Best For
Render.com	⭐⭐⭐⭐⭐	Free tier	Limited	Portfolio, demos
Heroku	⭐⭐⭐⭐	$7+/month	Medium	Small apps, MVPs
Google Cloud Run	⭐⭐⭐	Pay-per-use	Auto-scales to millions	Production APIs
AWS EC2	⭐⭐	$10+/month	Manual scaling	Full control, legacy
AWS SageMaker	⭐⭐	Expensive	Enterprise	Large ML teams

Recap

You can explain cloud deployment and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 97 — Model Monitoring

Model Monitoring — Data Drift & Concept Drift

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Why Models Degrade in Production

A model trained in January may perform poorly by June because the world changes. Two types of drift:

Data Drift (Covariate Shift): The input feature distribution $P(X)$ changes while the relationship between features and label stays the same. E.g., users from new demographics start using the app, or a product line is discontinued.
Concept Drift: The relationship $P(y|X)$ changes. E.g., a competitor launches and the drivers of churn shift, so the same feature values now predict different outcomes.
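
The difference between the two drift types is easier to see on a toy example. The sketch below uses a single synthetic feature and a frozen rule-based "model"; all numbers are made up for illustration. Under data drift only the inputs move, while under concept drift the labelling rule itself moves.

```python
import numpy as np

rng = np.random.default_rng(0)


def model(x: np.ndarray) -> np.ndarray:
    """Frozen 'trained' model: predicts 1 when the feature is positive."""
    return (x > 0).astype(int)


# Training-time world: X ~ N(0, 1), true rule y = 1 if x > 0
x_train = rng.normal(0.0, 1.0, 10_000)
y_train = (x_train > 0).astype(int)

# Data drift: P(X) shifts to N(1, 1), the true rule is unchanged
x_data = rng.normal(1.0, 1.0, 10_000)
y_data = (x_data > 0).astype(int)

# Concept drift: P(X) is unchanged, the true rule becomes y = 1 if x > 0.5
x_concept = rng.normal(0.0, 1.0, 10_000)
y_concept = (x_concept > 0.5).astype(int)

acc_train = (model(x_train) == y_train).mean()
acc_data_drift = (model(x_data) == y_data).mean()
acc_concept_drift = (model(x_concept) == y_concept).mean()

print(f"train: {acc_train:.3f}  data drift: {acc_data_drift:.3f}  "
      f"concept drift: {acc_concept_drift:.3f}")
```

Here data drift leaves accuracy untouched only because the toy model matches the true rule everywhere. A real model is usually accurate only where it saw training data, so shifted inputs can hurt it too. Concept drift lowers accuracy even though the inputs look identical, which is why input-only drift tests cannot catch it.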

Code Example

"""
model_monitoring.py — Detect data drift with statistical tests
"""
import numpy as np
import pandas as pd
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# ── Method 1: KS Test (Kolmogorov-Smirnov) for numeric features ─
def detect_drift_ks(reference_data: pd.Series, production_data: pd.Series,
                     alpha: float = 0.05) -> dict:
    """
    KS test: H0 = same distribution.
    If p-value < alpha → reject H0 → drift detected.
    """
    stat, p_value = stats.ks_2samp(reference_data.dropna(), production_data.dropna())
    return {
        'feature': reference_data.name,
        'ks_statistic': round(stat, 4),
        'p_value': round(p_value, 4),
        'drift_detected': p_value < alpha,
        'severity': 'High' if stat > 0.2 else ('Medium' if stat > 0.1 else 'Low')
    }

# ── Method 2: Population Stability Index (PSI) ─────────────────
def compute_psi(reference: np.ndarray, production: np.ndarray,
                n_bins: int = 10) -> float:
    """
    PSI < 0.1:  No significant change
    PSI 0.1-0.25: Moderate change — investigate
    PSI > 0.25: Major change — retrain model!
    """
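    breakpoints = np.percentile(reference, np.linspace(0, 100, n_bins + 1))
    breakpoints = np.unique(breakpoints)
    # Open the outer bins so production values outside the reference range
    # are still counted instead of being silently dropped by np.histogram
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf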

    ref_counts = np.histogram(reference, bins=breakpoints)[0]
    prod_counts = np.histogram(production, bins=breakpoints)[0]

    # Add small epsilon to avoid log(0)
    ref_pct  = (ref_counts / len(reference)).clip(1e-10)
    prod_pct = (prod_counts / len(production)).clip(1e-10)

    psi = np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct))
    return float(round(psi, 4))

# ── Method 3: Chi-squared test for categorical features ─────────
def detect_drift_categorical(reference: pd.Series, production: pd.Series) -> dict:
    """Chi-squared test for categorical drift."""
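    # Sorted list (not a set) gives a deterministic category order; NaNs are excluded
    all_categories = sorted(set(reference.dropna().unique()) | set(production.dropna().unique()))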

    ref_counts  = reference.value_counts().reindex(all_categories, fill_value=0)
    prod_counts = production.value_counts().reindex(all_categories, fill_value=0)

    stat, p_value = stats.chisquare(
        f_obs=prod_counts.values,
        f_exp=ref_counts.values * len(production) / len(reference)
    )
    return {'feature': reference.name, 'chi2': round(stat, 4),
            'p_value': round(p_value, 4), 'drift_detected': p_value < 0.05}

# ── Simulate training and production data ─────────────────────
np.random.seed(42)
n = 1000
reference_df = pd.DataFrame({
    'tenure_months':   np.random.exponential(24, n).clip(1, 72),
    'monthly_charges': np.random.normal(65, 20, n).clip(20, 120),
    'contract_type':   np.random.choice(['Month-to-month', 'One year', 'Two year'], n, p=[0.55, 0.25, 0.20])
})

# Simulate drift: new users have shorter tenure and higher charges
production_df = pd.DataFrame({
    'tenure_months':   np.random.exponential(12, n).clip(1, 72),    # Shorter tenure!
    'monthly_charges': np.random.normal(80, 25, n).clip(20, 120),    # Higher charges!
    'contract_type':   np.random.choice(['Month-to-month', 'One year', 'Two year'], n, p=[0.70, 0.20, 0.10])  # More monthly!
})

# ── Run drift detection ───────────────────────────────────────
print("=== DATA DRIFT REPORT ===\n")
for col in ['tenure_months', 'monthly_charges']:
    result = detect_drift_ks(reference_df[col], production_df[col])
    psi = compute_psi(reference_df[col].values, production_df[col].values)
    print(f"Feature: {col}")
    print(f"  KS Statistic: {result['ks_statistic']}, p-value: {result['p_value']}, "
          f"Drift: {result['drift_detected']}, Severity: {result['severity']}")
    print(f"  PSI: {psi} → {'⚠️ RETRAIN' if psi > 0.25 else ('⚡ Monitor' if psi > 0.1 else '✅ OK')}\n")

result_cat = detect_drift_categorical(reference_df['contract_type'], production_df['contract_type'])
print(f"Feature: contract_type (categorical)")
print(f"  Chi² p-value: {result_cat['p_value']}, Drift: {result_cat['drift_detected']}")

# ── Evidently AI — professional monitoring library ────────────
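# pip install "evidently<0.7"
# Note: these imports follow Evidently's legacy Report API. Releases from 0.7
# onward reorganised the package, so pin the version or adapt the imports.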
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

report = Report(metrics=[DataDriftPreset(), DataQualityPreset()])
report.run(reference_data=reference_df, current_data=production_df)
report.save_html('drift_report.html')
print("\nDrift report saved to drift_report.html")
# Open in browser for beautiful interactive drift visualisations
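
To make the PSI thresholds concrete, here is the formula from `compute_psi`, $\mathrm{PSI} = \sum_i (p_i - r_i)\,\ln(p_i / r_i)$, worked by hand on three bins. The bin shares are invented for illustration: $r_i$ is the reference share of bin $i$ and $p_i$ the production share.

```python
import math

ref_pct = [0.5, 0.3, 0.2]   # reference share per bin (illustrative)
prod_pct = [0.4, 0.3, 0.3]  # production share per bin (illustrative)

# One term per bin: (p - r) * ln(p / r). Every term is >= 0.
terms = [(p - r) * math.log(p / r) for r, p in zip(ref_pct, prod_pct)]
psi = sum(terms)

for i, t in enumerate(terms):
    print(f"bin {i}: {t:.4f}")
print(f"PSI = {psi:.4f}")
```

The middle bin contributes nothing because its share did not move. The total comes to about 0.063, which sits under the 0.1 "no significant change" threshold even though 10% of the population moved between bins.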

📌

Monitoring Strategy

Daily: Log prediction volume, latency p99, error rate
Weekly: Check PSI on top 10 features; compare prediction distribution
Monthly: Evaluate model on ground truth labels (if available); compare to baseline
Trigger retraining when: PSI > 0.25 on key features, or held-out accuracy drops > 5% from launch performance
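
The retraining trigger above can be encoded as a small function so that the weekly and monthly checks produce an explicit decision. This is a sketch under two assumptions: `should_retrain` is a hypothetical helper, and the "5% drop" is read as an absolute drop of 0.05 in accuracy (a relative drop is an equally valid reading).

```python
from typing import Mapping


def should_retrain(psi_by_feature: Mapping[str, float],
                   launch_accuracy: float,
                   current_accuracy: float,
                   max_psi: float = 0.25,
                   max_accuracy_drop: float = 0.05) -> tuple[bool, list[str]]:
    """Return (retrain?, reasons) using the PSI and accuracy-drop rules."""
    reasons = []
    for feature, psi in psi_by_feature.items():
        if psi > max_psi:
            reasons.append(f"PSI {psi:.3f} > {max_psi} on {feature}")
    drop = launch_accuracy - current_accuracy
    if drop > max_accuracy_drop:
        reasons.append(f"accuracy dropped by {drop:.3f} since launch")
    return bool(reasons), reasons
```

Returning the reasons alongside the flag keeps the alert actionable: the person on call sees which feature or metric tripped the rule.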

Recap

You can explain model monitoring and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 98 — CI/CD Pipelines

CI/CD for ML — GitHub Actions

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

CI/CD (Continuous Integration / Continuous Deployment) automates testing and deployment. For ML, this means: push code → tests run automatically → if all pass → deploy to production.

.github/workflows/ml-pipeline.yml — Complete CI/CD workflow

name: ML Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * 1'   # Every Monday at 02:00 UTC (weekly model check)

env:
  PYTHON_VERSION: '3.11'
  MODEL_PATH: models/churn_pipeline.joblib

jobs:
  # ── Job 1: Code Quality ─────────────────────────────────────
  lint-and-format:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip

      - name: Install linting tools
        run: pip install ruff black isort

      - name: Check formatting (black)
        run: black --check src/ tests/

      - name: Check imports (isort)
        run: isort --check-only src/ tests/

      - name: Lint (ruff)
        run: ruff check src/ tests/

  # ── Job 2: Unit and Integration Tests ──────────────────────
  test:
    runs-on: ubuntu-latest
    needs: lint-and-format
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt -r requirements-dev.txt

      - name: Run tests
        run: |
          pytest tests/ \
            --cov=src \
            --cov-report=xml \
            --cov-report=term-missing \
            -v \
            --tb=short

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: coverage.xml

  # ── Job 3: Model Performance Tests ─────────────────────────
  model-validation:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run model validation
        run: python scripts/validate_model.py
        env:
          MIN_ROC_AUC: '0.85'          # Model must achieve at least this
          MIN_PRECISION: '0.75'
          TEST_DATA_PATH: data/test.csv

      - name: Check model drift
        run: python scripts/check_drift.py
        env:
          REFERENCE_DATA: data/reference.csv
          PRODUCTION_DATA: data/production_sample.csv
          MAX_PSI: '0.25'

  # ── Job 4: Build Docker Image ───────────────────────────────
  build-docker:
    runs-on: ubuntu-latest
    needs: model-validation
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ${{ secrets.DOCKERHUB_USERNAME }}/churn-api:latest
            ${{ secrets.DOCKERHUB_USERNAME }}/churn-api:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # ── Job 5: Deploy to Cloud Run ──────────────────────────────
  deploy:
    runs-on: ubuntu-latest
    needs: build-docker
    if: github.ref == 'refs/heads/main'
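    environment: production    # Pauses for approval if the environment has required reviewers
    steps:
      # deploy-cloudrun no longer accepts a `credentials` input;
      # authenticate first with the dedicated auth action.
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Deploy to Google Cloud Run
        uses: google-github-actions/deploy-cloudrun@v2
        with:
          service: churn-api
          region: us-central1
          image: ${{ secrets.DOCKERHUB_USERNAME }}/churn-api:${{ github.sha }}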

      - name: Notify Slack
        uses: slackapi/slack-github-action@v1.26.0
        with:
          payload: |
            {"text": "✅ Churn API v${{ github.sha }} deployed to production"}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

scripts/validate_model.py — Model performance gate

"""validate_model.py — Fail CI if model performance drops below threshold."""
import joblib
import pandas as pd
import sys
import os
from sklearn.metrics import roc_auc_score, precision_score

MIN_ROC_AUC  = float(os.getenv('MIN_ROC_AUC', '0.85'))
MIN_PRECISION = float(os.getenv('MIN_PRECISION', '0.75'))

model = joblib.load(os.getenv('MODEL_PATH', 'models/churn_pipeline.joblib'))
test_df = pd.read_csv(os.getenv('TEST_DATA_PATH', 'data/test.csv'))
X_test = test_df.drop('churn', axis=1)
y_test = test_df['churn']

y_scores = model.predict_proba(X_test)[:, 1]
y_pred   = (y_scores >= 0.35).astype(int)
auc      = roc_auc_score(y_test, y_scores)
prec     = precision_score(y_test, y_pred)

print(f"ROC-AUC:   {auc:.4f} (min: {MIN_ROC_AUC})")
print(f"Precision: {prec:.4f} (min: {MIN_PRECISION})")

if auc < MIN_ROC_AUC:
    print(f"❌ FAIL: ROC-AUC {auc:.4f} < {MIN_ROC_AUC}")
    sys.exit(1)
if prec < MIN_PRECISION:
    print(f"❌ FAIL: Precision {prec:.4f} < {MIN_PRECISION}")
    sys.exit(1)

print("✅ All model performance checks passed!")
sys.exit(0)
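
The workflow also calls `scripts/check_drift.py`, which is not shown in this module. One possible shape for it is sketched below, reusing the PSI idea from the monitoring section. Everything here is an assumption: the function names, the choice to check only numeric columns shared by both files, and the treatment of constant features.

```python
import os

import numpy as np
import pandas as pd


def psi(reference: np.ndarray, production: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index with bins taken from reference percentiles."""
    edges = np.unique(np.percentile(reference, np.linspace(0, 100, n_bins + 1)))
    if edges.size < 2:
        return 0.0  # assumption: a constant reference feature counts as "no drift"
    edges[0], edges[-1] = -np.inf, np.inf  # keep values outside the reference range
    ref_pct = (np.histogram(reference, bins=edges)[0] / len(reference)).clip(1e-10)
    prod_pct = (np.histogram(production, bins=edges)[0] / len(production)).clip(1e-10)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))


def drifted_features(reference_df: pd.DataFrame, production_df: pd.DataFrame,
                     max_psi: float = 0.25) -> dict:
    """Return {feature: PSI} for shared numeric columns whose PSI exceeds max_psi."""
    numeric = reference_df.select_dtypes(include="number").columns
    shared = [c for c in numeric if c in production_df.columns]
    scores = {
        c: psi(reference_df[c].dropna().to_numpy(), production_df[c].dropna().to_numpy())
        for c in shared
    }
    return {c: s for c, s in scores.items() if s > max_psi}


def run_from_env() -> int:
    """Read the paths and threshold set by the workflow; return a process exit code."""
    reference_df = pd.read_csv(os.environ["REFERENCE_DATA"])
    production_df = pd.read_csv(os.environ["PRODUCTION_DATA"])
    max_psi = float(os.getenv("MAX_PSI", "0.25"))
    failing = drifted_features(reference_df, production_df, max_psi)
    for feature, score in failing.items():
        print(f"FAIL: PSI {score:.4f} > {max_psi} on {feature}")
    return 1 if failing else 0
```

In the actual script the last line would be `sys.exit(run_from_env())`, so that a non-zero exit code fails the `model-validation` job in the same way `validate_model.py` does.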

Recap

You can explain CI/CD for ML and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 99 — Capstone Project

Capstone Project Outline

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Goal: Build and deploy a complete, production-quality ML system that combines everything from Days 1–98. This is the project you put on your resume.

Suggested Capstone: Loan Default Prediction System

Problem Framing

Predict whether a loan applicant will default within 12 months. Business metric: Reduce defaults by 20% while maintaining approval rate above 70%. ML metric: PR-AUC (imbalanced).
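
The approval-rate constraint translates directly into a score threshold: approve the applicants with the lowest predicted default probability until at least 70% are approved. A minimal sketch follows, with synthetic scores standing in for real model output and `threshold_for_approval_rate` as a hypothetical helper name.

```python
import math

import numpy as np


def threshold_for_approval_rate(default_scores, min_approval_rate: float = 0.70) -> float:
    """Largest default-probability cutoff needed to approve the required share."""
    scores = np.sort(np.asarray(default_scores, dtype=float))
    k = math.ceil(min_approval_rate * scores.size)  # number of applicants to approve
    k = min(max(k, 1), scores.size)
    return float(scores[k - 1])


rng = np.random.default_rng(0)
scores = rng.beta(2, 8, size=1_000)   # synthetic predicted default probabilities
threshold = threshold_for_approval_rate(scores)
approved = scores <= threshold

print(f"threshold: {threshold:.3f}, approval rate: {approved.mean():.1%}")
```

Once the threshold is fixed by the approval constraint, the default rate among approved applicants becomes the number to compare against the "reduce defaults by 20%" target.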

Data

Use the LendingClub dataset from Kaggle (2M+ loans). Key features: loan amount, grade, income, DTI ratio, employment length, credit history, purpose.

Pipeline

Comprehensive EDA → feature engineering (DTI bins, credit age, issue month) → sklearn Pipeline with ColumnTransformer → XGBoost with SMOTE → Optuna tuning.
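
The core of that pipeline can be prototyped with scikit-learn alone before XGBoost, SMOTE, and Optuna are added. The sketch below is not the capstone solution. It uses synthetic data with hypothetical column names, and swaps in `LogisticRegression(class_weight="balanced")` for XGBoost with SMOTE so that it runs without extra dependencies. `average_precision_score` is used as the usual estimate of PR-AUC.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the loan data (column names are hypothetical)
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "loan_amount": rng.uniform(1_000, 40_000, n),
    "dti": rng.uniform(0, 40, n),
    "grade": rng.choice(list("ABCD"), n),
})
# Synthetic target: default probability rises with the DTI ratio
p_default = 1 / (1 + np.exp(-(df["dti"] - 30) / 5))
df["default"] = (rng.random(n) < p_default).astype(int)

# Scale numeric columns, one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["loan_amount", "dti"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["grade"]),
])
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="default"), df["default"],
    test_size=0.2, stratify=df["default"], random_state=42,
)
clf.fit(X_train, y_train)
pr_auc = average_precision_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"PR-AUC (average precision): {pr_auc:.3f}  baseline: {y_test.mean():.3f}")
```

Keeping preprocessing inside the `Pipeline` means the exact same transformations are applied at serving time, and the positive-class rate printed as the baseline is the PR-AUC a random model would score.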

Experiment Tracking

Track all 50+ experiments with MLflow. Register the best model. Document the reasoning behind each modelling decision.

API + Frontend

FastAPI backend with Pydantic validation. Streamlit frontend with interactive loan assessment tool. Docker containerised.

Deployment + Monitoring

Deploy to Google Cloud Run. Weekly PSI monitoring script. GitHub Actions CI/CD with model performance gates. Evidently drift report.

Documentation

Comprehensive README with architecture diagram, results table, and key learnings. Technical blog post on Medium/Towards Data Science.

💡

Other Strong Capstone Ideas

Real-time fraud detection system with streaming data (Kafka + FastAPI)
Product recommendation engine using collaborative filtering (implicit library)
Medical diagnosis assistant (chest X-ray classification with explainability/SHAP)
E-commerce price optimisation with demand elasticity modelling
Multi-class document classifier (PDF/news categorisation with TF-IDF + XGBoost)

Recap

You can outline the capstone project and explain what each stage contributes.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 100 — Final Review 🎓

What's Next After 100 Days?

Why this matters

What's Next After 100 Days?: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

🎉 Congratulations!

You've completed 100 Days of Machine Learning. You've gone from "What is ML?" to deploying production ML systems with monitoring and CI/CD. You are now a competent ML practitioner. But this is only the beginning.

Your Three Paths Forward

🧠 Path 1: Deep Learning & Neural Networks

Master neural networks for images, text, and time series. The most in-demand skill in 2025–2026.

Neural network fundamentals — backpropagation, activation functions
PyTorch or TensorFlow/Keras — framework mastery
CNNs for computer vision (ResNet, EfficientNet, ViT)
RNNs, LSTMs, Transformers for sequence data
Transfer learning and fine-tuning pre-trained models
Resources: fast.ai, Deep Learning Specialisation (Coursera), PyTorch docs

🤖 Path 2: NLP & Generative AI (LangChain)

The hottest area in 2024–2026. LLMs, RAG, agents, and production GenAI systems.

Transformers architecture deep dive (BERT, GPT, T5)
HuggingFace — loading, fine-tuning, deploying NLP models
LangChain — chains, RAG, agents, memory (our LangChain tutorial!)
OpenAI API — function calling, embeddings, fine-tuning
Vector databases — FAISS, Chroma, Pinecone, Weaviate
Building production GenAI applications

⚙️ Path 3: MLOps Engineering

Specialise in the infrastructure and engineering side of ML. Extremely well-paid.

Kubeflow, MLflow, DVC — full MLOps stack
Kubernetes for ML workloads
Feature stores — Feast, Tecton, Hopsworks
Data engineering — Spark, dbt, Airflow
Cloud ML platforms — AWS SageMaker, GCP Vertex AI, Azure ML
Model serving at scale — Triton Inference Server, TorchServe

Recommended Resources to Continue

Resource	Type	Best For
fast.ai Practical Deep Learning	Free online course	Deep learning with PyTorch (top-down approach)
Hands-On Machine Learning (Aurélien Géron)	Book	Deep reference for all topics covered in this course
Full Stack Deep Learning	Free course + lectures	MLOps, deployment, production systems
HuggingFace NLP Course	Free online	Transformers, BERT, GPT fine-tuning
Designing Machine Learning Systems (Chip Huyen)	Book	Production ML systems architecture
Made With ML	Free online	Applied ML with MLOps focus
Kaggle Competitions	Competitions	Real-world problem practice, community notebooks
GenAIWallah LangChain Tutorial	Free tutorial	LangChain, RAG, Agents, Production GenAI

Final Checklist — Are You Job-Ready?

★ 3+ ML projects on GitHub (EDA, models, evaluation, README)
★ At least 1 deployed model accessible via public URL
★ Can explain bias-variance tradeoff, gradient descent, and cross-validation in a 5-min interview
★ Kaggle profile with at least 3 competition submissions
★ At least 1 technical blog post or notebook published
★ LinkedIn updated with ML skills and project links
★ Can walk through a complete ML project from problem → model → deployment in an interview

🚀

Continue Your Journey with LangChain & GenAI

The next frontier is Generative AI. Our LangChain & GenAI tutorial picks up exactly where this course ends — covering LLMs, prompt engineering, RAG systems, agents, and building production GenAI applications.

Recap

You can describe the three paths forward after 100 days and which one fits your goals.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.
