Module 8: ML Deployment & Production
100 Days of ML Module 8 — Model serialisation, Flask/FastAPI REST APIs, Streamlit apps, Docker, cloud deployment (AWS/GCP/Heroku), model monitoring, CI/CD for ML, and capstone project.
A model that lives only in a Jupyter notebook has zero business value. This final module teaches you to package, serve, containerise, deploy, and monitor ML models in production — the skills that separate data scientists from ML engineers. By Day 100 you'll have deployed a real model to the cloud.
Model Serialisation — pickle, joblib, ONNX
Why this matters
Model Serialisation: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Why Serialisation?
Training a model every time you need a prediction is slow and wasteful. Serialisation saves the trained model (including all learned parameters and the preprocessing pipeline) to disk. At serving time, you load the serialised model and call predict() instantly.
pickle — Python's Native Serialisation
import pickle
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import onnx
import warnings
warnings.filterwarnings('ignore')
# ── Train a pipeline ──────────────────────────────────────────
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
# ── Method 1: pickle ──────────────────────────────────────────
with open('model.pkl', 'wb') as f:
    pickle.dump(pipeline, f, protocol=pickle.HIGHEST_PROTOCOL)
# Load
with open('model.pkl', 'rb') as f:
    loaded_pkl = pickle.load(f)
print(f"pickle accuracy: {loaded_pkl.score(X_test, y_test):.4f}")
# Limitations:
# - Version-sensitive: a pickle may fail to load (or behave differently) under a
#   different Python or scikit-learn version, so pin both in requirements.txt
# - Security risk: never unpickle untrusted data (arbitrary code execution)
# - Can be slow and memory-hungry for large numpy arrays, especially with old protocols
# ── Method 2: joblib — PREFERRED for sklearn models ───────────
# joblib is optimised for objects that hold large numpy arrays
joblib.dump(pipeline, 'model.joblib', compress=3) # compress: 0-9, 3 = good balance
loaded_jl = joblib.load('model.joblib')
print(f"joblib accuracy: {loaded_jl.score(X_test, y_test):.4f}")
# joblib notes:
# - Efficient for models with large numpy arrays (e.g., Random Forests)
# - compress param reduces file size
# - Uncompressed files can be memory-mapped on load via mmap_mode='r'
# - Built on pickle, so the same security warning applies
# ── Method 3: ONNX — cross-platform, cross-language ──────────
# pip install skl2onnx onnxruntime
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as rt
# Convert sklearn pipeline to ONNX
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(pipeline, initial_types=initial_type, target_opset=15)
with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())
# Run inference with ONNX Runtime (bindings also exist for C++, Java, C#, JavaScript)
sess = rt.InferenceSession('model.onnx', providers=['CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name
X_test_float = X_test[:5].astype(np.float32)
onnx_preds = sess.run([output_name], {input_name: X_test_float})[0]
print(f"ONNX predictions: {onnx_preds}")
# ONNX advantages:
# - Language-agnostic: deploy Python model in a Go/Java/C++ service
# - Hardware-optimised: ONNX Runtime uses CPU/GPU optimisations
# - Used in production by Microsoft, NVIDIA, Intel
# ── Versioning convention ─────────────────────────────────────
import datetime
import os
os.makedirs('models', exist_ok=True) # joblib.dump does not create missing directories
version = datetime.datetime.now().strftime('%Y%m%d_%H%M')
joblib.dump(pipeline, f'models/churn_v{version}.joblib') # e.g. models/churn_v20260526_1430.joblib
| Format | Speed | Cross-Language | Use Case |
|---|---|---|---|
| pickle | OK | Python only | Quick prototyping; small models |
| joblib | Fast | Python only | Production sklearn/numpy models |
| ONNX | Very fast (optimised) | Any language | Enterprise cross-platform deployment |
| MLflow format | Varies | REST API via mlflow models serve | MLflow ecosystem |
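The warning about never unpickling untrusted data applies to both pickle and joblib files. One common mitigation is to record a checksum when the model is saved and to refuse to load a file whose checksum has changed. The sketch below illustrates the idea with the standard library only; `sha256_of` and `load_verified` are hypothetical helper names, and a plain dict stands in for the trained pipeline. A checksum only detects that a file differs from the one you recorded; it does not make a file from an unknown source safe.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified(path: Path, expected_sha256: str):
    """Unpickle `path` only if its digest matches the recorded one."""
    if sha256_of(path) != expected_sha256:
        raise ValueError(f"Checksum mismatch for {path.name}: refusing to load")
    with path.open("rb") as f:
        return pickle.load(f)


# Demo with a stand-in object (a dict) instead of a trained pipeline.
workdir = Path(tempfile.mkdtemp())
model_path = workdir / "model.pkl"
with model_path.open("wb") as f:
    pickle.dump({"weights": [0.1, 0.2, 0.3]}, f)

recorded = sha256_of(model_path)      # store this alongside the artefact
restored = load_verified(model_path, recorded)
print(restored)

# Simulate tampering: appending bytes changes the digest.
with model_path.open("ab") as f:
    f.write(b"tampered")
try:
    load_verified(model_path, recorded)
    tamper_detected = False
except ValueError:
    tamper_detected = True
print("tamper detected:", tamper_detected)
```

In a real project the recorded digest would typically live in a model registry or a deployment manifest rather than next to the file it protects.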
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain model serialisation and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 92 — FastAPI
Flask REST API for ML
Project Structure
churn-api/
├── app.py ← Flask application
├── model.joblib ← Serialised pipeline
├── requirements.txt ← Flask, joblib, scikit-learn, gunicorn
└── Dockerfile ← Container definition (Day 95)
"""
app.py — Flask REST API for churn prediction model
Usage: python app.py (development)
       gunicorn -w 4 -b 0.0.0.0:5000 app:app (production)
"""
from flask import Flask, request, jsonify
import joblib
import numpy as np
import pandas as pd
import logging
import time
from functools import wraps
# ── Initialise app ────────────────────────────────────────────
app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# ── Load model at startup (not per-request!) ──────────────────
try:
    pipeline = joblib.load('model.joblib')
    logger.info("Model loaded successfully")
except Exception as e:
    logger.error(f"Failed to load model: {e}")
    pipeline = None
# ── Input validation ──────────────────────────────────────────
REQUIRED_FIELDS = ['tenure_months', 'monthly_charges', 'total_charges',
                   'contract_type', 'internet_service', 'tech_support',
                   'senior_citizen', 'num_complaints']
def validate_input(data):
    """Validate request JSON. Returns (cleaned_data, error_message)."""
    if not isinstance(data, dict):
        return None, "Request body must be a JSON object"
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        return None, f"Missing fields: {missing}"
    # Keep only the expected fields so extra keys never reach the model
    cleaned = {f: data[f] for f in REQUIRED_FIELDS}
    # Type and range validation (only two fields are checked here for brevity)
    try:
        cleaned['tenure_months'] = int(data['tenure_months'])
        if not (0 < cleaned['tenure_months'] <= 120):
            return None, "tenure_months must be between 1 and 120"
        cleaned['monthly_charges'] = float(data['monthly_charges'])
        if not (0 < cleaned['monthly_charges'] <= 500):
            return None, "monthly_charges must be greater than 0 and at most 500"
    except (ValueError, TypeError) as e:
        return None, f"Invalid data type: {str(e)}"
    return cleaned, None
# ── Timing decorator ──────────────────────────────────────────
def timed(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.perf_counter() # monotonic clock, suited to measuring durations
        result = f(*args, **kwargs)
        elapsed = (time.perf_counter() - start) * 1000
        logger.info(f"{f.__name__} took {elapsed:.1f}ms")
        return result
    return wrapper
# ── Health check endpoint ─────────────────────────────────────
@app.route('/health', methods=['GET'])
def health():
    """Kubernetes/Docker health check."""
    healthy = pipeline is not None
    status = 'healthy' if healthy else 'unhealthy'
    return jsonify({'status': status, 'model': 'churn-v1'}), (200 if healthy else 503)
# ── Single prediction endpoint ────────────────────────────────
@app.route('/predict', methods=['POST'])
@timed
def predict():
    """Predict churn probability for a single customer.

    Request JSON:
    {
        "tenure_months": 12,
        "monthly_charges": 65.0,
        "total_charges": 780.0,
        "contract_type": "Month-to-month",
        "internet_service": "Fiber optic",
        "tech_support": "No",
        "senior_citizen": 0,
        "num_complaints": 1
    }
    Response:
    {
        "churn_probability": 0.73,
        "prediction": 1,
        "risk_level": "High",
        "recommended_action": "Immediate retention call + discount offer"
    }
    """
    if pipeline is None:
        return jsonify({'error': 'Model not loaded'}), 503
    # Parse JSON
    if not request.is_json:
        return jsonify({'error': 'Content-Type must be application/json'}), 400
    data = request.get_json()
    if not data:
        return jsonify({'error': 'Empty request body'}), 400
    # Validate
    cleaned, error = validate_input(data)
    if error:
        return jsonify({'error': error}), 422
    # Predict
    try:
        df = pd.DataFrame([cleaned])
        prob = pipeline.predict_proba(df)[0, 1]
        prediction = int(prob >= 0.35) # Custom threshold from Day 87
        risk = 'High' if prob > 0.6 else ('Medium' if prob > 0.35 else 'Low')
        return jsonify({
            'churn_probability': round(float(prob), 4),
            'prediction': prediction,
            'risk_level': risk,
            'recommended_action': {
                'High': 'Immediate retention call + discount offer',
                'Medium': 'Send personalised retention email',
                'Low': 'No action needed'
            }[risk]
        }), 200
    except Exception as e:
        # Log the full traceback, but do not leak internals to the client
        logger.error(f"Prediction error: {e}", exc_info=True)
        return jsonify({'error': 'Prediction failed'}), 500
# ── Batch prediction endpoint ─────────────────────────────────
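@app.route('/predict/batch', methods=['POST'])
@timed
def predict_batch():
    """Predict for multiple customers at once."""
    if pipeline is None:
        return jsonify({'error': 'Model not loaded'}), 503
    data = request.get_json()
    if not isinstance(data, list) or not data:
        return jsonify({'error': 'Request body must be a non-empty JSON array'}), 400
    if len(data) > 1000:
        return jsonify({'error': 'Maximum 1000 records per batch'}), 400
    # Validate every record with the same rules as /predict
    cleaned_rows = []
    for i, row in enumerate(data):
        cleaned, error = validate_input(row)
        if error:
            return jsonify({'error': f'Record {i}: {error}'}), 422
        cleaned_rows.append(cleaned)
    try:
        df = pd.DataFrame(cleaned_rows)
        probs = pipeline.predict_proba(df)[:, 1]
        results = [
            {'index': i, 'churn_probability': round(float(p), 4),
             'prediction': int(p >= 0.35)}
            for i, p in enumerate(probs)
        ]
        return jsonify({'predictions': results, 'count': len(results)}), 200
    except Exception as e:
        logger.error(f"Batch prediction error: {e}", exc_info=True)
        return jsonify({'error': 'Batch prediction failed'}), 500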
# ── Model info endpoint ───────────────────────────────────────
@app.route('/model/info', methods=['GET'])
def model_info():
    if pipeline is None:
        return jsonify({'error': 'Model not loaded'}), 503
    return jsonify({
        'model_type': type(pipeline.named_steps.get('model', pipeline)).__name__,
        'features': REQUIRED_FIELDS,
        'threshold': 0.35,
        'version': 'v1.0.0'
    })
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=False)
# ── requirements.txt ──────────────────────────────────────────
flask==3.0.0
gunicorn==21.2.0
joblib==1.3.2
scikit-learn==1.4.0
numpy==1.26.0
pandas==2.1.0
# ── Test the API with curl ────────────────────────────────────
# Health check
curl http://localhost:5000/health
# Single prediction
curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{"tenure_months": 3, "monthly_charges": 85.0, "total_charges": 255.0,
"contract_type": "Month-to-month", "internet_service": "Fiber optic",
"tech_support": "No", "senior_citizen": 0, "num_complaints": 2}'
# Expected response:
# {"churn_probability": 0.7831, "prediction": 1, "risk_level": "High",
# "recommended_action": "Immediate retention call + discount offer"}
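The curl call to /predict can also be issued from Python, which is handy for smoke tests in CI. The sketch below uses only the standard library and builds the request without sending it, so it runs even when the API is not up. `build_predict_request` is a hypothetical helper name, and the base URL assumes the Flask development server on port 5000 as in the curl example.

```python
import json
import urllib.request


def build_predict_request(base_url: str, features: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


customer = {
    "tenure_months": 3, "monthly_charges": 85.0, "total_charges": 255.0,
    "contract_type": "Month-to-month", "internet_service": "Fiber optic",
    "tech_support": "No", "senior_citizen": 0, "num_complaints": 2,
}
req = build_predict_request("http://localhost:5000", customer)
print(req.get_method(), req.full_url)

# With the Flask app running locally, the request could be sent like this:
# with urllib.request.urlopen(req, timeout=5) as resp:
#     print(json.loads(resp.read()))
```

Separating request construction from sending keeps the payload logic testable without a live server.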
Recap
- You can explain a Flask REST API for ML and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 93 — Streamlit Apps
FastAPI for ML — Modern, Fast, Auto-Documented
Why this matters
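FastAPI is a popular modern alternative to Flask for ML APIs. It offers automatic OpenAPI/Swagger documentation, request validation via Pydantic, and native async support, and it is often quoted as ~2–3× faster than Flask for I/O-bound workloads. For CPU-bound model inference the gap is usually much smaller, because the model call itself dominates the request time.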
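Async support helps only if blocking work is kept off the event loop. A scikit-learn `predict_proba` call is blocking, so inside an `async def` handler it would stall every other request until it returns. The standard-library sketch below shows the general pattern of pushing a blocking call into a worker thread with `asyncio.to_thread` (Python 3.9+). `blocking_predict` is a hypothetical stand-in for a model call, not part of any library.

```python
import asyncio
import time


def blocking_predict(x: float) -> float:
    """Stand-in for a blocking model call (hypothetical)."""
    time.sleep(0.05)              # pretend inference takes 50 ms
    return round(x * 2, 2)


async def handle_request(x: float) -> float:
    # Offload the blocking call so the event loop can serve other requests.
    return await asyncio.to_thread(blocking_predict, x)


async def main() -> list:
    # Three "requests" handled concurrently; gather preserves input order.
    return await asyncio.gather(*(handle_request(x) for x in [0.1, 0.2, 0.3]))


results = asyncio.run(main())
print(results)
```

FastAPI applies a similar idea automatically to endpoints declared with plain `def`, which it runs in a thread pool. Threads keep the server responsive; they do not by themselves make CPU-heavy inference faster.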
"""
main.py — FastAPI ML service with Pydantic validation and auto-documentation
Run: uvicorn main:app --host 0.0.0.0 --port 8000 --reload
Docs: http://localhost:8000/docs (Swagger UI — generated automatically!)
"""
from fastapi import FastAPI, HTTPException, status
from fastapi.middleware.cors import CORSMiddleware
# Pydantic v2 API (matches the pydantic==2.6.0 pin in requirements.txt)
from pydantic import BaseModel, ConfigDict, Field, ValidationInfo, field_validator
from typing import List, Literal, Optional
import joblib
import pandas as pd
import numpy as np
import logging
from contextlib import asynccontextmanager
logger = logging.getLogger(__name__)
# ── Lifespan (load model at startup) ─────────────────────────
ml_models = {}
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: load model
    try:
        ml_models['churn'] = joblib.load('model.joblib')
        logger.info("Model loaded")
    except Exception as e:
        logger.error(f"Model load failed: {e}")
    yield
    # Shutdown: cleanup
    ml_models.clear()
app = FastAPI(
    title="Churn Prediction API",
    description="ML-powered customer churn prediction service",
    version="1.0.0",
    lifespan=lifespan
)
# ── CORS middleware ───────────────────────────────────────────
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"], # Restrict in production
    allow_methods=["*"],
    allow_headers=["*"],
)
# ── Pydantic Models — automatic validation + documentation ────
class CustomerFeatures(BaseModel):
    """Input features for a single customer."""
    tenure_months: int = Field(..., ge=1, le=120, description="Months as customer (1-120)")
    monthly_charges: float = Field(..., ge=0, le=500, description="Monthly bill in USD")
    total_charges: float = Field(..., ge=0, description="Total amount billed")
    contract_type: Literal['Month-to-month', 'One year', 'Two year']
    internet_service: Literal['DSL', 'Fiber optic', 'No']
    tech_support: Literal['Yes', 'No']
    senior_citizen: Literal[0, 1] = Field(..., description="1 if senior citizen")
    num_complaints: int = Field(..., ge=0, le=50, description="Number of support complaints")
    @field_validator('total_charges')
    @classmethod
    def total_must_be_consistent(cls, v: float, info: ValidationInfo) -> float:
        """Sanity check: total_charges should be at least half of one monthly bill."""
        # info.data holds the fields validated so far (declared above this one)
        monthly = info.data.get('monthly_charges')
        if monthly is not None and v < monthly * 0.5:
            raise ValueError('total_charges seems too low for the given monthly_charges')
        return v
    model_config = ConfigDict(json_schema_extra={
        "examples": [{
            "tenure_months": 3, "monthly_charges": 85.0, "total_charges": 255.0,
            "contract_type": "Month-to-month", "internet_service": "Fiber optic",
            "tech_support": "No", "senior_citizen": 0, "num_complaints": 2
        }]
    })
class BatchRequest(BaseModel):
    customers: List[CustomerFeatures] = Field(..., max_length=1000)
class PredictionResponse(BaseModel):
    churn_probability: float
    prediction: int
    risk_level: str
    recommended_action: str
class BatchPredictionResponse(BaseModel):
    predictions: List[PredictionResponse]
    count: int
# ── Prediction logic ──────────────────────────────────────────
def get_prediction(features: CustomerFeatures) -> PredictionResponse:
    if 'churn' not in ml_models:
        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                            detail="Model not loaded")
    df = pd.DataFrame([features.model_dump()]) # model_dump() replaces .dict() in Pydantic v2
    prob = float(ml_models['churn'].predict_proba(df)[0, 1])
    pred = int(prob >= 0.35)
    risk = 'High' if prob > 0.6 else ('Medium' if prob > 0.35 else 'Low')
    actions = {
        'High': 'Immediate retention call + discount offer',
        'Medium': 'Send personalised retention email',
        'Low': 'No action needed'
    }
    return PredictionResponse(
        churn_probability=round(prob, 4),
        prediction=pred,
        risk_level=risk,
        recommended_action=actions[risk]
    )
# ── Endpoints ─────────────────────────────────────────────────
@app.get("/health", tags=["Monitoring"])
async def health_check():
    """Kubernetes readiness probe: answers 503 until the model is loaded."""
    if 'churn' not in ml_models:
        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                            detail="Model not loaded")
    return {"status": "healthy", "model_loaded": True}
# Prediction endpoints use plain `def`: scikit-learn inference is blocking,
# and FastAPI runs non-async endpoints in a worker thread pool
@app.post("/predict", response_model=PredictionResponse, tags=["Prediction"])
def predict(customer: CustomerFeatures):
    """Predict churn probability for a single customer.

    Returns probability (0-1), binary prediction, and recommended retention action.
    """
    return get_prediction(customer)
@app.post("/predict/batch", response_model=BatchPredictionResponse, tags=["Prediction"])
def predict_batch(batch: BatchRequest):
    """Predict churn for up to 1000 customers in a single request."""
    predictions = [get_prediction(c) for c in batch.customers]
    return BatchPredictionResponse(predictions=predictions, count=len(predictions))
@app.get("/model/info", tags=["Model"])
async def model_info():
    """Return metadata about the deployed model."""
    return {
        "model_version": "1.0.0",
        "threshold": 0.35,
        "features": list(CustomerFeatures.model_fields.keys()),
        "documentation": "/docs"
    }
# ── requirements.txt for FastAPI ──────────────────────────────
# fastapi==0.109.0
# uvicorn[standard]==0.27.0
# pydantic==2.6.0
# joblib==1.3.2
# scikit-learn==1.4.0
# numpy==1.26.0
# pandas==2.1.0
# gunicorn==21.2.0 (only needed if you launch with gunicorn + UvicornWorker)
Flask vs FastAPI — When to Choose
- FastAPI: New projects, modern Python (3.9+), need auto-docs, async I/O, type safety. Widely adopted for new Python ML services.
- Flask: Legacy systems, simpler needs, or when team already knows Flask well.
- Both work fine for ML serving. FastAPI's Pydantic validation catches bad inputs automatically — a huge production win.
Recap
- You can explain FastAPI for ML and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 94 — Docker Basics
Streamlit Apps — ML Demos in 20 Lines
Why this matters
Streamlit turns Python scripts into interactive web apps with no HTML/CSS/JS required. Perfect for ML demos, internal tools, and portfolio pieces.
"""
streamlit_app.py — Interactive churn prediction demo
Run: streamlit run streamlit_app.py
"""
import streamlit as st
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
# ── Page config ───────────────────────────────────────────────
st.set_page_config(
    page_title="Churn Predictor",
    page_icon="📊",
    layout="wide",
    initial_sidebar_state="expanded"
)
# ── Load model (cached so it doesn't reload on every interaction) ─
@st.cache_resource
def load_model():
    return joblib.load('model.joblib')
model = load_model()
# ── Sidebar — feature inputs ──────────────────────────────────
st.sidebar.title("🔧 Customer Features")
st.sidebar.markdown("Adjust parameters to predict churn probability")
tenure = st.sidebar.slider(
    "Tenure (months)", min_value=1, max_value=72, value=12,
    help="How long has this customer been with us?"
)
monthly_charges = st.sidebar.slider(
    "Monthly Charges ($)", min_value=20.0, max_value=120.0, value=65.0, step=1.0
)
contract_type = st.sidebar.selectbox(
    "Contract Type",
    options=["Month-to-month", "One year", "Two year"],
    help="Month-to-month customers typically churn more often than customers on annual contracts"
)
internet_service = st.sidebar.selectbox(
    "Internet Service",
    options=["DSL", "Fiber optic", "No"]
)
tech_support = st.sidebar.radio(
    "Tech Support", options=["Yes", "No"], horizontal=True
)
senior_citizen = st.sidebar.checkbox("Senior Citizen")
num_complaints = st.sidebar.number_input(
    "Number of Complaints", min_value=0, max_value=20, value=0
)
# ── Main content ──────────────────────────────────────────────
st.title("📊 Customer Churn Predictor")
st.markdown("Predict whether a customer will churn in the next 30 days based on their profile.")
# Prepare input
total_charges = tenure * monthly_charges
input_data = pd.DataFrame([{
    'tenure_months': tenure,
    'monthly_charges': monthly_charges,
    'total_charges': total_charges,
    'contract_type': contract_type,
    'internet_service': internet_service,
    'tech_support': tech_support,
    'senior_citizen': int(senior_citizen),
    'num_complaints': num_complaints
}])
# Predict
prob = model.predict_proba(input_data)[0, 1]
risk = 'High' if prob > 0.6 else ('Medium' if prob > 0.35 else 'Low')
color_map = {'High': '#e74c3c', 'Medium': '#f39c12', 'Low': '#2ecc71'}
# ── Layout ────────────────────────────────────────────────────
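BASELINE_CHURN_RATE = 0.2 # placeholder average: replace with the rate observed in your data
col1, col2, col3 = st.columns(3)
with col1:
    st.metric("Churn Probability", f"{prob:.1%}",
              delta=f"{prob - BASELINE_CHURN_RATE:.1%} vs avg",
              delta_color="inverse")
with col2:
    st.metric("Risk Level", risk)
with col3:
    # tenure × monthly charges is revenue billed so far, not a lifetime-value forecast
    st.metric("Total Billed to Date", f"${total_charges:,.0f}")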
# ── Gauge chart ───────────────────────────────────────────────
fig = go.Figure(go.Indicator(
    mode="gauge+number",
    value=prob * 100,
    domain={'x': [0, 1], 'y': [0, 1]},
    title={'text': "Churn Probability (%)", 'font': {'size': 20}},
    gauge={
        'axis': {'range': [0, 100], 'tickwidth': 1},
        'bar': {'color': color_map[risk]},
        'steps': [
            {'range': [0, 35], 'color': 'rgba(46,204,113,.15)'},
            {'range': [35, 60], 'color': 'rgba(243,156,18,.15)'},
            {'range': [60, 100], 'color': 'rgba(231,76,60,.15)'}
        ],
        'threshold': {
            'line': {'color': "white", 'width': 3},
            'thickness': 0.75, 'value': 35
        }
    }
))
fig.update_layout(height=250, margin=dict(t=30, b=10))
st.plotly_chart(fig, use_container_width=True)
# ── Recommendations ───────────────────────────────────────────
st.subheader("📋 Recommended Actions")
if risk == 'High':
    st.error("🚨 High churn risk! Immediate intervention recommended.")
    st.markdown("""
- 📞 **Immediate retention call** — assign to high-priority queue
- 🎁 **Offer contract upgrade** — 20% discount for switching to annual plan
- 🛠️ **Free tech support upgrade** for 3 months
""")
elif risk == 'Medium':
    st.warning("⚠️ Moderate churn risk. Proactive outreach recommended.")
    st.markdown("""
- 📧 Send personalised retention email with loyalty rewards
- 💬 Trigger in-app survey to understand pain points
""")
else:
    st.success("✅ Low churn risk. Customer appears satisfied.")
    st.markdown("- Consider upselling premium features")
# ── Feature analysis ──────────────────────────────────────────
with st.expander("🔍 Feature Impact Analysis"):
    st.markdown("Illustrative feature effects (hand-set weights, not computed from the model):")
    # NOTE: these weights are hard-coded for the demo. For model-derived
    # attributions, use SHAP values or permutation importance on the real pipeline.
    feature_impacts = {
        'Contract Type': 0.28 if contract_type == 'Month-to-month' else -0.15,
        'Tenure': -0.02 * tenure,
        'Monthly Charges': 0.003 * monthly_charges,
        'Complaints': 0.12 * num_complaints,
        'Tech Support': -0.08 if tech_support == 'Yes' else 0.05
    }
    impact_df = pd.DataFrame.from_dict(
        feature_impacts, orient='index', columns=['Impact']
    ).sort_values('Impact', ascending=True)
    fig2, ax = plt.subplots(figsize=(8, 3))
    colors = ['#e74c3c' if v > 0 else '#2ecc71' for v in impact_df['Impact']]
    impact_df['Impact'].plot.barh(ax=ax, color=colors)
    ax.set_xlabel("Impact on Churn Probability")
    ax.set_title("Feature Impacts")
    ax.axvline(x=0, color='white', linewidth=0.5)
    fig2.patch.set_facecolor('none')
    ax.set_facecolor('none')
    st.pyplot(fig2)
# ── Batch prediction ──────────────────────────────────────────
st.divider()
st.subheader("📁 Batch Prediction")
uploaded = st.file_uploader("Upload a CSV of customers", type=['csv'])
if uploaded:
    batch_df = pd.read_csv(uploaded)
    st.dataframe(batch_df.head())
    if st.button("Run Batch Prediction"):
        with st.spinner("Predicting..."):
            probs = model.predict_proba(batch_df)[:, 1]
            batch_df['churn_probability'] = probs.round(4)
            batch_df['risk_level'] = pd.cut(
                probs, bins=[0, 0.35, 0.6, 1.0],
                labels=['Low', 'Medium', 'High'],
                include_lowest=True # otherwise a probability of exactly 0 becomes NaN
            )
        st.success(f"Done! {len(batch_df)} predictions made.")
        st.dataframe(batch_df)
        csv = batch_df.to_csv(index=False).encode('utf-8')
        st.download_button("⬇️ Download Results", csv, "predictions.csv", "text/csv")
Recap
- You can explain Streamlit apps and when they apply.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 95 — Docker for ML
Docker — Containerising ML Models
Failure mode — Image bloat
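Installing build-essential and Jupyter in production images increases attack surface, image size, and cold-start time. Use multi-stage builds: compile dependencies in a builder stage, then copy only the installed packages, the application code, and the serialised model (for example model.joblib) into a slim runtime image.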
Why Docker for ML?
Docker packages your app + all its dependencies (Python version, libraries, OS libraries) into a single portable container. "Works on my machine" becomes "works everywhere" — local laptop, AWS, GCP, Azure, Kubernetes.
# ── Use official Python slim image (smaller than full Python) ─
FROM python:3.11-slim
# ── Set environment variables ─────────────────────────────────
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1
# ── Set working directory ─────────────────────────────────────
WORKDIR /app
# ── Install system dependencies ───────────────────────────────
RUN apt-get update && apt-get install -y --no-install-recommends \
        curl \
    && rm -rf /var/lib/apt/lists/*
# ── Install Python dependencies ───────────────────────────────
# Copy requirements first (Docker layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# ── Copy application code ─────────────────────────────────────
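# Dockerfile comments must sit on their own line; a trailing "# ..." would be
# parsed as extra COPY arguments
# FastAPI app
COPY main.py .
# Serialised model
COPY model.joblib .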
# ── Create non-root user (security best practice) ─────────────
RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
USER appuser
# ── Expose port ───────────────────────────────────────────────
EXPOSE 8000
# ── Health check ──────────────────────────────────────────────
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
# ── Start command ─────────────────────────────────────────────
CMD ["uvicorn", "main:app", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--workers", "2", \
     "--log-level", "info"]
# ── Build and run Docker container ───────────────────────────
# Build the image
docker build -t churn-api:v1.0 .
# Run container locally
docker run -d \
    --name churn-api \
    -p 8000:8000 \
    -e LOG_LEVEL=info \
    churn-api:v1.0
# Check it's running
docker ps
docker logs churn-api
# Test the endpoint
curl http://localhost:8000/health
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"tenure_months": 3, "monthly_charges": 85.0, ...}'
# Stop and remove
docker stop churn-api && docker rm churn-api
# Push to Docker Hub (or AWS ECR)
docker tag churn-api:v1.0 yourusername/churn-api:v1.0
docker push yourusername/churn-api:v1.0
# ── Docker Compose — for multi-service setup ──────────────────
# docker-compose.yml
version: '3.8'
services:
  api:
    build: .
    ports: ["8000:8000"]
    environment:
      - LOG_LEVEL=info
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  # Optional: Redis for caching predictions
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
# ── .dockerignore — exclude unnecessary files ─────────────────
# __pycache__/
# *.pyc
# .git/
# notebooks/
# tests/
# *.md
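The Dockerfile installs curl only so that HEALTHCHECK can call /health. Because the image already ships Python, the same probe can be written with the standard library, which is one way to keep the image smaller. The sketch below is an illustration under stated assumptions: `is_healthy` is a hypothetical helper, and the tiny `_FakeApi` server exists only so the probe can be exercised without the real service.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 503, refused connections, and timeouts.
        return False


# Stand-in server: /health answers 200, every other path answers 503.
class _FakeApi(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200 if self.path == "/health" else 503)
        self.end_headers()
        self.wfile.write(b"{}")

    def log_message(self, *args):   # silence per-request logging
        pass


server = HTTPServer(("127.0.0.1", 0), _FakeApi)   # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

healthy = is_healthy(f"{base}/health")
unhealthy = is_healthy(f"{base}/broken")
server.shutdown()
server.server_close()
print(healthy, unhealthy)   # prints: True False
```

Inside a container, a script built around `is_healthy` would exit with a non-zero status when the probe fails, so that Docker marks the container as unhealthy.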
Recap
- You can explain Docker containerisation and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Cloud Deployment — Heroku, AWS EC2, Google Cloud Run
Option 1: Heroku — Easiest (but no free tier since 2022)
# ── Procfile ──────────────────────────────────────────────────
web: gunicorn -w 2 -k uvicorn.workers.UvicornWorker main:app
# ── Deploy to Heroku ──────────────────────────────────────────
heroku login
heroku create churn-api-genaiwallah
heroku config:set LOG_LEVEL=info
git push heroku main
# The app will be live at https://churn-api-genaiwallah.herokuapp.com
heroku logs --tail # Stream logs
Option 2: AWS EC2 — Full Control
# ── 1. Launch EC2 instance (t2.micro = free tier) ─────────────
# Console: EC2 → Launch Instance → Ubuntu 22.04 → t2.micro
# Configure Security Group: Allow port 80, 443, 22 (SSH)
# ── 2. SSH into your instance ─────────────────────────────────
ssh -i "my-key.pem" ubuntu@your-ec2-ip
# ── 3. Install dependencies ───────────────────────────────────
sudo apt-get update
sudo apt-get install -y python3-pip nginx
# ── 4. Clone and set up your app ─────────────────────────────
git clone https://github.com/youruser/churn-api.git
cd churn-api
pip3 install -r requirements.txt
# ── 5. Run FastAPI with gunicorn (process manager) + uvicorn ASGI workers ─────
# Bind to localhost only: Nginx (step 6) is the public entry point
gunicorn -w 2 -k uvicorn.workers.UvicornWorker main:app \
    --bind 127.0.0.1:8000 --daemon
# ── 6. Configure Nginx as reverse proxy ───────────────────────
# /etc/nginx/sites-available/churn-api
server {
    listen 80;
    server_name your-domain.com;
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
sudo ln -s /etc/nginx/sites-available/churn-api /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl restart nginx
# ── 7. SSL with Let's Encrypt (HTTPS) ────────────────────────
sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d your-domain.com
Option 3: Google Cloud Run — Serverless Containers (Recommended)
# ── Deploy Docker container to Cloud Run ─────────────────────
# Install gcloud CLI first: https://cloud.google.com/sdk/docs/install
# Authenticate
gcloud auth login
gcloud config set project your-project-id
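# Build and push the image
# Note: Container Registry (gcr.io) has been deprecated in favour of Artifact Registry;
# on newer projects use a REGION-docker.pkg.dev/PROJECT/REPO/IMAGE path instead
gcloud builds submit --tag gcr.io/your-project-id/churn-api:v1
# Deploy to Cloud Run (serverless — pay only for requests)
# --port 8000 matches the port uvicorn listens on in the Dockerfile (Cloud Run defaults to 8080)
gcloud run deploy churn-api \
    --image gcr.io/your-project-id/churn-api:v1 \
    --platform managed \
    --region us-central1 \
    --port 8000 \
    --allow-unauthenticated \
    --memory 512Mi \
    --cpu 1 \
    --max-instances 10 \
    --min-instances 0 # Scale to 0 when no traffic (cost-effective)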
# Your API will be live at:
# https://churn-api-xxxx-uc.a.run.app

| Platform | Ease | Cost | Scalability | Best For |
|---|---|---|---|---|
| Render.com | ⭐⭐⭐⭐⭐ | Free tier | Limited | Portfolio, demos |
| Heroku | ⭐⭐⭐⭐ | $7+/month | Medium | Small apps, MVPs |
| Google Cloud Run | ⭐⭐⭐ | Pay-per-use | Auto-scales to millions | Production APIs |
| AWS EC2 | ⭐⭐ | $10+/month | Manual scaling | Full control, legacy |
| AWS SageMaker | ⭐⭐ | Expensive | Enterprise | Large ML teams |
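One practical detail behind the Cloud Run option: the platform tells the container which port to listen on through the `PORT` environment variable (8080 by default), so the server should not hard-code port 8000. The following is a minimal sketch, assuming the FastAPI app lives in `main.py` as `app` (as in the gunicorn command earlier). The helper names `resolve_port` and `serve` are hypothetical.

```python
import os

DEFAULT_PORT = 8080  # Cloud Run's default when PORT is not overridden


def resolve_port(env=None) -> int:
    """Return the port to bind to, read from the PORT environment variable."""
    env = os.environ if env is None else env
    port = int(env.get("PORT", DEFAULT_PORT))
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    return port


def serve() -> None:
    """Start the API on the resolved port (requires uvicorn to be installed)."""
    import uvicorn  # imported lazily so resolve_port works without it

    # "main:app" is an assumption: a module main.py exposing a FastAPI instance named app.
    uvicorn.run("main:app", host="0.0.0.0", port=resolve_port())
```

Calling `serve()` from the container's entrypoint lets the same image run locally and on Cloud Run without changes.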
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain cloud deployment and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Model Monitoring — Data Drift & Concept Drift
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Why Models Degrade in Production
A model trained in January may perform poorly by June because the world changes. Two types of drift:
- Data Drift (Covariate Shift): The input feature distribution $P(X)$ changes. E.g., the app starts attracting users from new demographics, or a product line is discontinued.
- Concept Drift: The relationship $P(y|X)$ changes. E.g., what causes churn changes because a competitor launches. The same features now predict different outcomes.
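The difference between the two drift types can be made concrete with a small simulation. This is a sketch under stated assumptions: a single synthetic feature, a logistic relationship between feature and label, and a hypothetical helper `make_data`. Data drift moves the mean of $X$ while keeping $P(y|X)$ fixed; concept drift keeps $P(X)$ fixed but flips the sign of the relationship.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def make_data(n, x_mean, weight):
    """Draw X ~ N(x_mean, 1) and labels with P(y=1|x) = sigmoid(weight * x)."""
    X = rng.normal(x_mean, 1.0, size=(n, 1))
    p = 1.0 / (1.0 + np.exp(-weight * X[:, 0]))
    y = (rng.random(n) < p).astype(int)
    return X, y


# Reference period: the model is trained here
X_ref, y_ref = make_data(5000, x_mean=0.0, weight=3.0)
model = LogisticRegression().fit(X_ref, y_ref)

# Data drift: P(X) shifts, P(y|X) is unchanged
X_dd, y_dd = make_data(5000, x_mean=1.5, weight=3.0)

# Concept drift: P(X) is unchanged, P(y|X) is reversed
X_cd, y_cd = make_data(5000, x_mean=0.0, weight=-3.0)

acc_ref = model.score(X_ref, y_ref)
acc_dd = model.score(X_dd, y_dd)
acc_cd = model.score(X_cd, y_cd)

print(f"Reference accuracy:     {acc_ref:.3f}")
print(f"Under data drift:       {acc_dd:.3f} (mean of X moved to {X_dd.mean():.2f})")
print(f"Under concept drift:    {acc_cd:.3f} (mean of X still {X_cd.mean():.2f})")
```

In this toy setup, input-only tests such as KS or PSI would flag the first scenario but not the second, which is why monitoring on ground-truth labels matters as well.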
"""
model_monitoring.py — Detect data drift with statistical tests
"""
import warnings

import numpy as np
import pandas as pd
from scipy import stats

warnings.filterwarnings('ignore')


# ── Method 1: KS Test (Kolmogorov-Smirnov) for numeric features ─
def detect_drift_ks(reference_data: pd.Series, production_data: pd.Series,
                    alpha: float = 0.05) -> dict:
    """
    KS test: H0 = same distribution.
    If p-value < alpha → reject H0 → drift detected.
    """
    stat, p_value = stats.ks_2samp(reference_data.dropna(), production_data.dropna())
    return {
        'feature': reference_data.name,
        'ks_statistic': round(float(stat), 4),
        'p_value': round(float(p_value), 4),
        'drift_detected': bool(p_value < alpha),
        'severity': 'High' if stat > 0.2 else ('Medium' if stat > 0.1 else 'Low')
    }


# ── Method 2: Population Stability Index (PSI) ─────────────────
def compute_psi(reference: np.ndarray, production: np.ndarray,
                n_bins: int = 10) -> float:
    """
    PSI < 0.1: No significant change
    PSI 0.1-0.25: Moderate change — investigate
    PSI > 0.25: Major change — retrain model!
    """
    # Bin edges come from the reference percentiles
    breakpoints = np.percentile(reference, np.linspace(0, 100, n_bins + 1))
    breakpoints = np.unique(breakpoints)
    # Open the outer bins so production values outside the reference range are still counted
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
    ref_counts = np.histogram(reference, bins=breakpoints)[0]
    prod_counts = np.histogram(production, bins=breakpoints)[0]
    # Floor the proportions at a small epsilon to avoid log(0)
    ref_pct = (ref_counts / len(reference)).clip(1e-10)
    prod_pct = (prod_counts / len(production)).clip(1e-10)
    psi = np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct))
    return float(round(psi, 4))


# ── Method 3: Chi-squared test for categorical features ─────────
def detect_drift_categorical(reference: pd.Series, production: pd.Series) -> dict:
    """Chi-squared test for categorical drift."""
    # Sorted list (not a set) gives a deterministic, valid index for reindex()
    all_categories = sorted(set(reference.unique()) | set(production.unique()))
    ref_counts = reference.value_counts().reindex(all_categories, fill_value=0)
    prod_counts = production.value_counts().reindex(all_categories, fill_value=0)
    # Expected counts are the reference counts rescaled to the production sample size.
    # Note: a category absent from the reference gives an expected count of 0,
    # which the chi-squared test cannot handle.
    stat, p_value = stats.chisquare(
        f_obs=prod_counts.values,
        f_exp=ref_counts.values * len(production) / len(reference)
    )
    return {'feature': reference.name, 'chi2': round(float(stat), 4),
            'p_value': round(float(p_value), 4), 'drift_detected': bool(p_value < 0.05)}


# ── Simulate training and production data ─────────────────────
np.random.seed(42)
n = 1000
reference_df = pd.DataFrame({
    'tenure_months': np.random.exponential(24, n).clip(1, 72),
    'monthly_charges': np.random.normal(65, 20, n).clip(20, 120),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n, p=[0.55, 0.25, 0.20])
})
# Simulate drift: new users have shorter tenure and higher charges
production_df = pd.DataFrame({
    'tenure_months': np.random.exponential(12, n).clip(1, 72),  # Shorter tenure!
    'monthly_charges': np.random.normal(80, 25, n).clip(20, 120),  # Higher charges!
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n, p=[0.70, 0.20, 0.10])  # More monthly!
})

# ── Run drift detection ───────────────────────────────────────
print("=== DATA DRIFT REPORT ===\n")
for col in ['tenure_months', 'monthly_charges']:
    result = detect_drift_ks(reference_df[col], production_df[col])
    psi = compute_psi(reference_df[col].values, production_df[col].values)
    print(f"Feature: {col}")
    print(f"  KS Statistic: {result['ks_statistic']}, p-value: {result['p_value']}, "
          f"Drift: {result['drift_detected']}, Severity: {result['severity']}")
    print(f"  PSI: {psi} → {'⚠️ RETRAIN' if psi > 0.25 else ('⚡ Monitor' if psi > 0.1 else '✅ OK')}\n")
result_cat = detect_drift_categorical(reference_df['contract_type'], production_df['contract_type'])
print("Feature: contract_type (categorical)")
print(f"  Chi² p-value: {result_cat['p_value']}, Drift: {result_cat['drift_detected']}")

# ── Evidently AI — professional monitoring library ────────────
# pip install evidently
# These imports follow Evidently's older Report API; newer releases have
# reorganised the package, so check the version you have installed.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

report = Report(metrics=[DataDriftPreset(), DataQualityPreset()])
report.run(reference_data=reference_df, current_data=production_df)
report.save_html('drift_report.html')
print("\nDrift report saved to drift_report.html")
# Open in browser for beautiful interactive drift visualisations
Monitoring Strategy
- Daily: Log prediction volume, latency p99, error rate
- Weekly: Check PSI on top 10 features; compare prediction distribution
- Monthly: Evaluate model on ground truth labels (if available); compare to baseline
- Trigger retraining when: PSI > 0.25 on key features, or held-out accuracy drops > 5% from launch performance
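The retraining trigger in the last bullet can be written down as a small decision function. This is a sketch only: the function name `should_retrain` is hypothetical, and "drops > 5%" is read here as an absolute drop of 0.05 (five percentage points), which is an assumption rather than something the strategy above specifies.

```python
from typing import List, Mapping, Optional, Tuple

PSI_THRESHOLD = 0.25       # "major change" level from the PSI guidance
MAX_ACCURACY_DROP = 0.05   # assumption: absolute drop, not relative


def should_retrain(psi_by_feature: Mapping[str, float],
                   launch_accuracy: float,
                   current_accuracy: Optional[float] = None) -> Tuple[bool, List[str]]:
    """Return (decision, reasons) for the retraining rule.

    current_accuracy may be None when ground-truth labels are not yet available.
    """
    reasons: List[str] = []

    # Rule 1: PSI above threshold on any key feature
    for feature, psi in psi_by_feature.items():
        if psi > PSI_THRESHOLD:
            reasons.append(f"PSI {psi:.2f} > {PSI_THRESHOLD} on {feature}")

    # Rule 2: held-out accuracy fell too far below launch performance
    if current_accuracy is not None:
        drop = launch_accuracy - current_accuracy
        if drop > MAX_ACCURACY_DROP:
            reasons.append(f"accuracy dropped by {drop:.3f} since launch")

    return bool(reasons), reasons


decision, reasons = should_retrain(
    {"tenure_months": 0.31, "monthly_charges": 0.08},
    launch_accuracy=0.90,
    current_accuracy=0.89,
)
print(decision, reasons)
```

Returning the reasons alongside the decision makes the alert easier to act on than a bare boolean.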
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain model monitoring and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 98 — CI/CD Pipelines
CI/CD for ML — GitHub Actions
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
CI/CD (Continuous Integration / Continuous Deployment) automates testing and deployment. For ML, this means: push code → tests run automatically → if all pass → deploy to production.
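The "tests run automatically" step is only as useful as the tests themselves. As an illustration, a file under `tests/` might contain checks like the ones below. This is a sketch, not the course's actual test suite: the toy pipeline trained on the breast cancer dataset stands in for the real churn pipeline, and `build_pipeline` and the test names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline() -> Pipeline:
    """Stand-in for the real training code: scaler followed by a classifier."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])


def test_predict_proba_shape_and_range():
    """Probabilities must have one row per input, two columns, and sum to 1."""
    X, y = load_breast_cancer(return_X_y=True)
    pipe = build_pipeline().fit(X, y)
    proba = pipe.predict_proba(X[:5])
    assert proba.shape == (5, 2)
    assert np.all((proba >= 0) & (proba <= 1))
    assert np.allclose(proba.sum(axis=1), 1.0)


def test_training_is_reproducible():
    """Two fits on the same data should give identical predictions."""
    X, y = load_breast_cancer(return_X_y=True)
    first = build_pipeline().fit(X, y).predict(X[:20])
    second = build_pipeline().fit(X, y).predict(X[:20])
    assert np.array_equal(first, second)
```

pytest collects any function whose name starts with `test_`, so a failing assertion here would stop the workflow before the deployment jobs run.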
name: ML Pipeline CI/CD
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * 1' # Run every Monday at 2am UTC — weekly model check
env:
  PYTHON_VERSION: '3.11'
  MODEL_PATH: models/churn_pipeline.joblib
jobs:
  # ── Job 1: Code Quality ─────────────────────────────────────
  lint-and-format:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip
      - name: Install linting tools
        run: pip install ruff black isort
      - name: Check formatting (black)
        run: black --check src/ tests/
      - name: Check imports (isort)
        run: isort --check-only src/ tests/
      - name: Lint (ruff)
        run: ruff check src/ tests/
  # ── Job 2: Unit and Integration Tests ──────────────────────
  test:
    runs-on: ubuntu-latest
    needs: lint-and-format
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip
      - name: Install dependencies
        run: pip install -r requirements.txt -r requirements-dev.txt
      - name: Run tests
        run: |
          pytest tests/ \
            --cov=src \
            --cov-report=xml \
            --cov-report=term-missing \
            -v \
            --tb=short
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: coverage.xml
  # ── Job 3: Model Performance Tests ─────────────────────────
  model-validation:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run model validation
        run: python scripts/validate_model.py
        env:
          MIN_ROC_AUC: '0.85' # Model must achieve at least this
          MIN_PRECISION: '0.75'
          TEST_DATA_PATH: data/test.csv
      - name: Check model drift
        run: python scripts/check_drift.py
        env:
          REFERENCE_DATA: data/reference.csv
          PRODUCTION_DATA: data/production_sample.csv
          MAX_PSI: '0.25'
  # ── Job 4: Build Docker Image ───────────────────────────────
  build-docker:
    runs-on: ubuntu-latest
    needs: model-validation
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ${{ secrets.DOCKERHUB_USERNAME }}/churn-api:latest
            ${{ secrets.DOCKERHUB_USERNAME }}/churn-api:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  # ── Job 5: Deploy to Cloud Run ──────────────────────────────
  deploy:
    runs-on: ubuntu-latest
    needs: build-docker
    if: github.ref == 'refs/heads/main'
    environment: production # Manual approval applies if required reviewers are configured for this environment
    steps:
      # deploy-cloudrun no longer accepts a `credentials` input; authenticate in a separate step
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - name: Deploy to Google Cloud Run
        uses: google-github-actions/deploy-cloudrun@v2
        with:
          service: churn-api
          region: us-central1
          image: ${{ secrets.DOCKERHUB_USERNAME }}/churn-api:${{ github.sha }}
      - name: Notify Slack
        uses: slackapi/slack-github-action@v1.26.0
        with:
          payload: |
            {"text": "✅ Churn API v${{ github.sha }} deployed to production"}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
"""validate_model.py — Fail CI if model performance drops below threshold."""
import joblib
import pandas as pd
import sys
import os
from sklearn.metrics import roc_auc_score, precision_score

MIN_ROC_AUC = float(os.getenv('MIN_ROC_AUC', '0.85'))
MIN_PRECISION = float(os.getenv('MIN_PRECISION', '0.75'))
model = joblib.load(os.getenv('MODEL_PATH', 'models/churn_pipeline.joblib'))
test_df = pd.read_csv(os.getenv('TEST_DATA_PATH', 'data/test.csv'))
X_test = test_df.drop('churn', axis=1)
y_test = test_df['churn']
y_scores = model.predict_proba(X_test)[:, 1]
y_pred = (y_scores >= 0.35).astype(int)
auc = roc_auc_score(y_test, y_scores)
prec = precision_score(y_test, y_pred)
print(f"ROC-AUC: {auc:.4f} (min: {MIN_ROC_AUC})")
print(f"Precision: {prec:.4f} (min: {MIN_PRECISION})")
if auc < MIN_ROC_AUC:
    print(f"❌ FAIL: ROC-AUC {auc:.4f} < {MIN_ROC_AUC}")
    sys.exit(1)
if prec < MIN_PRECISION:
    print(f"❌ FAIL: Precision {prec:.4f} < {MIN_PRECISION}")
    sys.exit(1)
print("✅ All model performance checks passed!")
sys.exit(0)
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain CI/CD for ML and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Capstone Project Outline
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Suggested Capstone: Loan Default Prediction System
Problem Framing
Predict whether a loan applicant will default within 12 months. Business metric: Reduce defaults by 20% while maintaining approval rate above 70%. ML metric: PR-AUC (imbalanced).
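To see why PR-AUC is suggested for an imbalanced target, it helps to compare it with ROC-AUC on the same predictions. The sketch below uses a synthetic dataset with roughly 5% positives as a stand-in for loan defaults (an assumption, not the LendingClub data) and uses scikit-learn's `average_precision_score` as the PR-AUC summary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: about 5% positives
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

pr_auc = average_precision_score(y_test, scores)
roc_auc = roc_auc_score(y_test, scores)
baseline = y_test.mean()  # a random ranker scores about the positive rate on PR-AUC

print(f"ROC-AUC: {roc_auc:.3f} (random baseline: 0.5)")
print(f"PR-AUC:  {pr_auc:.3f} (random baseline: {baseline:.3f})")
```

Because the PR-AUC baseline is the positive rate rather than 0.5, it tends to expose weak performance on the minority class that ROC-AUC can hide.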
Data
Use the LendingClub dataset from Kaggle (2M+ loans). Key features: loan amount, grade, income, DTI ratio, employment length, credit history, purpose.
Pipeline
Comprehensive EDA → feature engineering (DTI bins, credit age, issue month) → sklearn Pipeline with ColumnTransformer → XGBoost with SMOTE → Optuna tuning.
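The preprocessing part of that pipeline could be sketched as follows. Several things here are assumptions made for illustration: the column names and the synthetic data are placeholders rather than the real LendingClub schema, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, and the SMOTE and Optuna steps are left out to keep the example dependency-free.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for loan data (hypothetical columns)
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "loan_amnt": rng.uniform(1_000, 35_000, n),
    "annual_inc": rng.lognormal(11, 0.5, n),
    "dti": rng.uniform(0, 40, n),
    "grade": rng.choice(list("ABCDE"), n),
    "purpose": rng.choice(["debt_consolidation", "credit_card", "car"], n),
})
# Synthetic target: higher DTI and worse grade raise the default probability
grade_risk = df["grade"].map({"A": 0.0, "B": 0.5, "C": 1.0, "D": 1.5, "E": 2.0})
logit = -3 + 0.05 * df["dti"] + grade_risk
df["default"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

numeric_cols = ["loan_amnt", "annual_inc", "dti"]
categorical_cols = ["grade", "purpose"]

preprocess = ColumnTransformer([
    # Numeric: fill gaps with the median, then standardise
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical: one-hot encode, ignoring categories unseen during training
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", GradientBoostingClassifier(random_state=42)),
])

X, y = df.drop(columns="default"), df["default"]
pipeline.fit(X, y)

# Grade "F" never appeared in training; handle_unknown="ignore" keeps this from failing
new_applicant = pd.DataFrame([{
    "loan_amnt": 12_000, "annual_inc": 55_000, "dti": 18.5,
    "grade": "F", "purpose": "car",
}])
proba = pipeline.predict_proba(new_applicant)[0, 1]
print(f"Estimated default probability: {proba:.3f}")
```

Keeping preprocessing inside the pipeline means the serialised artifact carries its own transformations, so the API receives raw columns and nothing is re-implemented at serving time.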
Experiment Tracking
Track all 50+ experiments with MLflow. Register the best model. Document the reasoning behind each modelling decision.
API + Frontend
FastAPI backend with Pydantic validation. Streamlit frontend with interactive loan assessment tool. Docker containerised.
Deployment + Monitoring
Deploy to Google Cloud Run. Weekly PSI monitoring script. GitHub Actions CI/CD with model performance gates. Evidently drift report.
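The weekly PSI script and the CI drift gate can share one piece of logic. The sketch below is one possible shape for it, not the course's `scripts/check_drift.py`: the function names `psi`, `drifted_features` and `main` are hypothetical, only numeric columns are checked, and the environment variable names are taken from the workflow's drift step.

```python
import os

import numpy as np
import pandas as pd


def psi(reference: np.ndarray, production: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index using bins built from reference percentiles."""
    edges = np.unique(np.percentile(reference, np.linspace(0, 100, n_bins + 1)))
    if len(edges) < 2:
        return 0.0  # constant reference column: no bins to compare
    # Open the outer bins so out-of-range production values are counted
    edges[0], edges[-1] = -np.inf, np.inf
    ref_pct = np.clip(np.histogram(reference, bins=edges)[0] / len(reference), 1e-10, None)
    prod_pct = np.clip(np.histogram(production, bins=edges)[0] / len(production), 1e-10, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))


def drifted_features(reference_df: pd.DataFrame, production_df: pd.DataFrame,
                     max_psi: float = 0.25) -> dict:
    """Return {feature: psi} for numeric columns whose PSI exceeds max_psi."""
    numeric_cols = reference_df.select_dtypes(include="number").columns
    scores = {
        col: psi(reference_df[col].to_numpy(), production_df[col].to_numpy())
        for col in numeric_cols
    }
    return {col: round(value, 4) for col, value in scores.items() if value > max_psi}


def main() -> int:
    """Read paths and threshold from the environment; return a process exit code."""
    reference_df = pd.read_csv(os.getenv("REFERENCE_DATA", "data/reference.csv"))
    production_df = pd.read_csv(os.getenv("PRODUCTION_DATA", "data/production_sample.csv"))
    flagged = drifted_features(reference_df, production_df,
                               max_psi=float(os.getenv("MAX_PSI", "0.25")))
    for feature, value in flagged.items():
        print(f"DRIFT: {feature} PSI={value}")
    return 1 if flagged else 0
```

A script entrypoint would pass the return value of `main()` to `sys.exit`, so a non-zero code fails the CI job in the same way the model validation script does.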
Documentation
Comprehensive README with architecture diagram, results table, and key learnings. Technical blog post on Medium/Towards Data Science.
Other Strong Capstone Ideas
- Real-time fraud detection system with streaming data (Kafka + FastAPI)
- Product recommendation engine using collaborative filtering (implicit library)
- Medical diagnosis assistant (chest X-ray classification with explainability/SHAP)
- E-commerce price optimisation with demand elasticity modelling
- Multi-class document classifier (PDF/news categorisation with TF-IDF + XGBoost)
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain capstone project outline and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 100 — Final Review 🎓
What's Next After 100 Days?
Why this matters
What's Next After 100 Days?: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
🎉 Congratulations!
You've completed 100 Days of Machine Learning. You've gone from "What is ML?" to deploying production ML systems with monitoring and CI/CD. You are now a competent ML practitioner. But this is only the beginning.
Your Three Paths Forward
🧠 Path 1: Deep Learning & Neural Networks
Master neural networks for images, text, and time series. The most in-demand skill in 2025–2026.
- Neural network fundamentals — backpropagation, activation functions
- PyTorch or TensorFlow/Keras — framework mastery
- CNNs for computer vision (ResNet, EfficientNet, ViT)
- RNNs, LSTMs, Transformers for sequence data
- Transfer learning and fine-tuning pre-trained models
- Resources: fast.ai, Deep Learning Specialisation (Coursera), PyTorch docs
🤖 Path 2: NLP & Generative AI (LangChain)
The hottest area in 2024–2026. LLMs, RAG, agents, and production GenAI systems.
- Transformers architecture deep dive (BERT, GPT, T5)
- HuggingFace — loading, fine-tuning, deploying NLP models
- LangChain — chains, RAG, agents, memory (our LangChain tutorial!)
- OpenAI API — function calling, embeddings, fine-tuning
- Vector databases — FAISS, Chroma, Pinecone, Weaviate
- Building production GenAI applications
⚙️ Path 3: MLOps Engineering
Specialise in the infrastructure and engineering side of ML. Extremely well-paid.
- Kubeflow, MLflow, DVC — full MLOps stack
- Kubernetes for ML workloads
- Feature stores — Feast, Tecton, Hopsworks
- Data engineering — Spark, dbt, Airflow
- Cloud ML platforms — AWS SageMaker, GCP Vertex AI, Azure ML
- Model serving at scale — Triton Inference Server, TorchServe
Recommended Resources to Continue
| Resource | Type | Best For |
|---|---|---|
| fast.ai Practical Deep Learning | Free online course | Deep learning with PyTorch (top-down approach) |
| Hands-On Machine Learning (Aurélien Géron) | Book | Deep reference for all topics covered in this course |
| Full Stack Deep Learning | Free course + lectures | MLOps, deployment, production systems |
| HuggingFace NLP Course | Free online | Transformers, BERT, GPT fine-tuning |
| Designing Machine Learning Systems (Chip Huyen) | Book | Production ML systems architecture |
| Made With ML | Free online | Applied ML with MLOps focus |
| Kaggle Competitions | Competitions | Real-world problem practice, community notebooks |
| GenAIWallah LangChain Tutorial | Free tutorial | LangChain, RAG, Agents, Production GenAI |
Final Checklist — Are You Job-Ready?
- ★ 3+ ML projects on GitHub (EDA, models, evaluation, README)
- ★ At least 1 deployed model accessible via public URL
- ★ Can explain bias-variance tradeoff, gradient descent, and cross-validation in a 5-min interview
- ★ Kaggle profile with at least 3 competition submissions
- ★ At least 1 technical blog post or notebook published
- ★ LinkedIn updated with ML skills and project links
- ★ Can walk through a complete ML project from problem → model → deployment in an interview
Continue Your Journey with LangChain & GenAI
The next frontier is Generative AI. Our LangChain & GenAI tutorial picks up exactly where this course ends — covering LLMs, prompt engineering, RAG systems, agents, and building production GenAI applications.
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can describe the three paths forward after the 100 days and which one fits your goals.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
