Module 7: ML Project Life Cycle
100 Days of ML Module 7 — End-to-end ML project life cycle: problem framing, data collection, full pipelines, MLflow experiment tracking, case studies, interview prep, and portfolio building.
This module bridges the gap between ML theory and professional practice. You'll learn how real ML projects are structured end-to-end — from defining a business problem to tracking experiments, running case studies, preparing for interviews, and building a portfolio that gets you hired.
The ML Project Life Cycle — 7 Stages
Why this matters
The ML Project Life Cycle: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
The life cycle is iterative: evaluation often sends you back to data collection or feature engineering.
Problem Framing
Translate the business problem into a well-defined ML task. Define success metrics, constraints, and risk tolerance. This is the most critical step — wrong framing = wasted months of work.
Data Collection & Understanding
Identify data sources (databases, APIs, web scraping, sensors). Understand data quality, schema, and volume. Check for historical availability of all features needed at prediction time.
Exploratory Data Analysis (EDA)
Statistical summaries, visualisations, correlation analysis, outlier detection, missing value patterns. Goal: deeply understand your data before any modelling.
Data Preprocessing & Feature Engineering
Clean, encode, scale, impute, and engineer features. Build reusable sklearn Pipelines so the same transformations apply consistently to train, validation, and production data.
Model Building & Selection
Establish baselines first (DummyClassifier, simple heuristics). Then iterate: simple models → complex models → ensembles. Use cross-validation throughout.
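As a rough illustration of "baselines first", the sketch below compares a majority-class DummyClassifier with a scaled logistic regression under the same cross-validation splits. The dataset (scikit-learn's built-in breast cancer data) and the choice of logistic regression as the first real model are assumptions made for the example, not part of the course material.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy").mean()

# First real model: scaling + logistic regression, evaluated on the same folds
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"LogReg accuracy:   {model_acc:.3f}")
```

Any later, more complex model then has two numbers to beat: the trivial baseline and the simple model.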
Evaluation & Hyperparameter Tuning
Tune hyperparameters with GridSearchCV/Optuna using cross-validation on the training data, then evaluate once on the held-out test set. Check for bias, fairness, and robustness. Ensure test set performance aligns with business requirements.
Deployment & Monitoring
Package the model as an API (Flask/FastAPI), containerise it (Docker), and deploy to the cloud. Set up monitoring for data drift and model degradation, and define a retraining schedule.
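One simple way to monitor a numeric feature for data drift is the Population Stability Index (PSI), which compares the binned distribution of new data against the training data. The sketch below is a minimal NumPy version; the function name, the bin count, and the synthetic data are assumptions for illustration. A commonly quoted rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a significant shift, but those cut-offs are conventions rather than guarantees.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference sample and a new sample of one numeric feature."""
    # Bin edges come from quantiles of the reference (training) data
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Use interior edges only, so out-of-range values fall into the end bins
    inner = edges[1:-1]
    exp_counts = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=n_bins)
    # Small constant avoids log(0) when a bin is empty
    eps = 1e-6
    exp_frac = exp_counts / exp_counts.sum() + eps
    act_frac = act_counts / act_counts.sum() + eps
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)   # stand-in for a training-time feature
same_dist = rng.normal(0.0, 1.0, 10_000)       # production data, no drift
shifted = rng.normal(1.0, 1.0, 10_000)         # production data, mean shifted by one std

psi_same = population_stability_index(train_feature, same_dist)
psi_shifted = population_stability_index(train_feature, shifted)
print(f"PSI (no drift): {psi_same:.4f}")
print(f"PSI (shifted):  {psi_shifted:.4f}")
```

In practice a check like this would run on a schedule for each important feature, with an alert or retraining trigger when the value crosses the chosen threshold.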
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain the ML project life cycle and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 82 — Problem Framing
Problem Framing — Business → ML
Why this matters
Problem Framing: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
The Translation Process
Every ML project starts with a business problem, not a model. Your first job is to understand the business deeply enough to translate it into an ML formulation.
| Business Problem | ML Formulation | Target Variable | Success Metric |
|---|---|---|---|
| "We're losing customers" | Binary classification: will a customer churn in next 30 days? | churn = 0/1 | Recall (missing a churner is more costly than a false alarm) |
| "We need to price houses better" | Regression: predict sale price from property features | sale_price (continuous) | MAPE < 10% (business-interpretable) |
| "Find similar customers for targeting" | Clustering: segment customers by behaviour | None (unsupervised) | Silhouette score + business review of segment profiles |
| "Detect fraudulent transactions" | Binary classification, heavily imbalanced | fraud = 0/1 | Precision-Recall AUC (imbalanced dataset) |
| "Recommend products to users" | Ranking / collaborative filtering | Implicit feedback (clicks, purchases) | NDCG@10, Click-Through Rate |
The Framing Checklist
- Is the problem worth solving? (What is the $ value of a 1% improvement?)
- Is ML the right tool? (Could a simple rule-based system work?)
- Do you have enough labelled data? (Rule of thumb: ≥1000 samples per class for traditional ML)
- Are all features available at prediction time? (Check for target leakage)
- What is the cost of a false positive vs false negative? (Sets your evaluation priority)
- How often does the model need to make predictions? (Real-time vs batch)
- How explainable does the model need to be? (Regulated industries require interpretability)
- What are the latency requirements? (<10ms → avoid deep models; batch → no constraint)
Success Metric Must Match Business Goal
Optimising accuracy on a 99:1 imbalanced fraud dataset is useless: a model that predicts "not fraud" for every transaction achieves 99% accuracy while catching no fraud at all. The business cares about catching fraud (recall) and minimising false alarms (precision). Always align your ML metric with what actually matters to the business.
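The accuracy trap is easy to reproduce. The sketch below builds synthetic labels with exactly 1% fraud (an assumption for the example) and scores a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 10,000 synthetic transactions, exactly 1% labelled as fraud
y_true = np.zeros(10_000, dtype=int)
y_true[:100] = 1

# A "model" that always predicts the majority class (not fraud)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
prec = precision_score(y_true, y_pred, zero_division=0)

# prints: accuracy=0.99  recall=0.00  precision=0.00
print(f"accuracy={acc:.2f}  recall={rec:.2f}  precision={prec:.2f}")
```

Accuracy looks excellent while recall shows that not a single fraud case was caught, which is why precision, recall, or PR-AUC are the more informative metrics here.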
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain problem framing and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 83 — Data Collection
Data Collection & Labelling
Why this matters
Data Collection & Labelling: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Data Sources
| Source Type | Examples | Tools | Considerations |
|---|---|---|---|
| Internal Databases | CRM, ERP, transaction logs, user events | SQL, SQLAlchemy, pandas.read_sql() | Data quality, access permissions, GDPR |
| Public APIs | Twitter/X API, Google Maps, OpenWeather, Alpha Vantage (stocks) | requests, httpx, official SDKs | Rate limits, costs, schema changes |
| Web Scraping | Product prices, job listings, news articles | BeautifulSoup, Scrapy, Playwright | ToS compliance, robots.txt, dynamic JS pages |
| Public Datasets | Kaggle, UCI ML Repository, Hugging Face, Google Dataset Search | kaggle API, datasets library | Licence, dataset freshness, real-world applicability |
| Sensors / IoT | Temperature sensors, GPS logs, click streams | MQTT, Kafka, InfluxDB | High volume, real-time, noise |
| Synthetic Data | When real data is scarce, private, or imbalanced | SMOTE, Faker, SDV, CTGAN | Distribution mismatch with production data |
Data Labelling
Supervised ML requires labelled data. Labelling is expensive and time-consuming. Key strategies:
- Manual labelling: Domain experts (doctors, lawyers) label data directly. Gold standard but expensive.
- Crowdsourcing: Amazon Mechanical Turk, Appen — cheap but requires quality control.
- Label Studio (open source): Self-hosted annotation tool supporting text, images, audio, video.
- Scale AI / Labelbox: Enterprise annotation platforms with quality workflows.
- Weak supervision (Snorkel): Write labelling functions (heuristics) that programmatically assign noisy labels — then combine them using a generative model.
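- Active learning: The model queries the human for labels on the most uncertain/informative examples, which can cut the number of labels needed to reach a given accuracy, sometimes substantially. The actual saving depends heavily on the task and data.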
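To make the active learning strategy above concrete, here is a minimal sketch of one round of uncertainty sampling. The synthetic dataset, the seed-set size of 50, the batch size of 20, and the use of logistic regression are all assumptions for illustration; real projects often use a dedicated library or a different uncertainty measure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

rng = np.random.default_rng(0)
labelled = rng.choice(len(X), size=50, replace=False)   # small labelled seed set
pool = np.setdiff1d(np.arange(len(X)), labelled)        # treated as unlabelled

model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])

# Uncertainty: predicted probabilities closest to 0.5 are the least certain
proba = model.predict_proba(X[pool])[:, 1]
uncertainty = -np.abs(proba - 0.5)

# The 20 most uncertain pool items are the next ones to send to an annotator
query_idx = pool[np.argsort(uncertainty)[-20:]]
print(query_idx)
```

After the annotator labels these items, they move from the pool to the labelled set, the model is retrained, and the loop repeats until the labelling budget is spent.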
# ── Collecting data from a REST API ──────────────────────────
import requests
import pandas as pd
import time
def fetch_weather_data(cities, api_key):
    """Fetch current weather for a list of cities."""
    url = "https://api.openweathermap.org/data/2.5/weather"
    records = []
    for city in cities:
        params = {'q': city, 'appid': api_key, 'units': 'metric'}
        try:
            response = requests.get(url, params=params, timeout=10)
            response.raise_for_status()
            data = response.json()
            records.append({
                'city': city,
                'temperature': data['main']['temp'],
                'humidity': data['main']['humidity'],
                'wind_speed': data['wind']['speed'],
                'description': data['weather'][0]['description'],
                'timestamp': pd.Timestamp.now()
            })
            time.sleep(0.2)  # Respect rate limits
        except requests.RequestException as e:
            print(f"Error fetching {city}: {e}")
    return pd.DataFrame(records)
# ── Simple web scraping with BeautifulSoup ────────────────────
from bs4 import BeautifulSoup
def scrape_job_titles(url):
    """Extract job titles from a job listing page."""
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; research-bot/1.0)'}
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()  # Fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.content, 'html.parser')
    # '.job-title' is a placeholder selector; adjust it to the target page's markup
    titles = [tag.get_text(strip=True) for tag in soup.select('.job-title')]
    return titles
# ── Loading from Kaggle programmatically ─────────────────────
# First: pip install kaggle, set up ~/.kaggle/kaggle.json
import subprocess
subprocess.run(['kaggle', 'datasets', 'download', '-d',
'uciml/breast-cancer-wisconsin-data', '--unzip', '-p', './data/'])
df = pd.read_csv('./data/data.csv')
# ── Tracking data provenance ──────────────────────────────────
data_catalog = {
'source': 'Kaggle UCI Breast Cancer Wisconsin',
'url': 'https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data',
'downloaded': pd.Timestamp.now().isoformat(),
'rows': len(df),
'columns': list(df.columns),
'license': 'CC BY-NC-SA 4.0',  # as listed on the Kaggle dataset page; confirm before reuse
'notes': 'Original dataset from UCI ML Repository'
}
import json
with open('./data/data_catalog.json', 'w') as f:
    json.dump(data_catalog, f, indent=2)
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain data collection & labelling and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Complete ML Pipeline — Raw Data to Prediction
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
This is a complete, production-quality ML pipeline that goes from raw CSV to a serialised, ready-to-serve model in a single script:
"""
complete_ml_pipeline.py
A full ML pipeline: data → preprocessing → model → evaluation → serialisation
"""
import pandas as pd
import numpy as np
import joblib
import json
from pathlib import Path
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (classification_report, roc_auc_score,
f1_score, confusion_matrix)
from sklearn.datasets import fetch_openml
# ════════════════════════════════════════════════════════════
# STEP 1: Load Data
# ════════════════════════════════════════════════════════════
print("="*60)
print("STEP 1: Loading Data")
print("="*60)
# Using Titanic as example (replace with your data source)
titanic = fetch_openml('titanic', version=1, as_frame=True, parser='auto')
df = titanic.frame.copy()
# Target variable
df['survived'] = (df['survived'].astype(int) == 1).astype(int)
print(f"Dataset shape: {df.shape}")
print(f"Target distribution:\n{df['survived'].value_counts(normalize=True).round(3)}")
# ════════════════════════════════════════════════════════════
# STEP 2: Feature Definition
# ════════════════════════════════════════════════════════════
print("\nSTEP 2: Feature Definition")
NUMERIC_FEATURES = ['age', 'fare', 'sibsp', 'parch']
CATEGORICAL_FEATURES = ['pclass', 'sex', 'embarked']
TARGET = 'survived'
X = df[NUMERIC_FEATURES + CATEGORICAL_FEATURES].copy()
y = df[TARGET]
print(f"Features: {NUMERIC_FEATURES + CATEGORICAL_FEATURES}")
print(f"Missing values:\n{X.isnull().sum()}")
# ════════════════════════════════════════════════════════════
# STEP 3: Train/Test Split
# ════════════════════════════════════════════════════════════
print("\nSTEP 3: Train/Test Split")
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
# ════════════════════════════════════════════════════════════
# STEP 4: Build sklearn Pipeline (leak-proof!)
# ════════════════════════════════════════════════════════════
print("\nSTEP 4: Building sklearn Pipeline")
# Numeric: impute median → scale
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical: impute mode → one-hot encode
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Combine with ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, NUMERIC_FEATURES),
('cat', categorical_transformer, CATEGORICAL_FEATURES)
], remainder='drop')
# Full pipeline: preprocessor → model
full_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', GradientBoostingClassifier(
n_estimators=200, max_depth=4, learning_rate=0.08,
subsample=0.8, random_state=42
))
])
# ════════════════════════════════════════════════════════════
# STEP 5: Cross-Validation
# ════════════════════════════════════════════════════════════
print("\nSTEP 5: Cross-Validation")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
full_pipeline, X_train, y_train, cv=cv,
scoring=['accuracy', 'f1', 'roc_auc'],
return_train_score=True, n_jobs=-1
)
for metric in ['accuracy', 'f1', 'roc_auc']:
    train_m = cv_results[f'train_{metric}'].mean()
    val_m = cv_results[f'test_{metric}'].mean()
    val_std = cv_results[f'test_{metric}'].std()
    print(f" {metric:12s}: Train={train_m:.4f} | CV={val_m:.4f} ± {val_std:.4f}")
# ════════════════════════════════════════════════════════════
# STEP 6: Final Training & Test Evaluation
# ════════════════════════════════════════════════════════════
print("\nSTEP 6: Final Training")
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
y_scores = full_pipeline.predict_proba(X_test)[:, 1]
print("\n--- TEST SET EVALUATION ---")
print(classification_report(y_test, y_pred, target_names=['Died', 'Survived']))
print(f"Test ROC-AUC: {roc_auc_score(y_test, y_scores):.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
# ════════════════════════════════════════════════════════════
# STEP 7: Serialise Model
# ════════════════════════════════════════════════════════════
print("\nSTEP 7: Saving Model")
output_dir = Path('./models')
output_dir.mkdir(exist_ok=True)
joblib.dump(full_pipeline, output_dir / 'titanic_pipeline.joblib')
print(f"Model saved to {output_dir / 'titanic_pipeline.joblib'}")
# Save metadata
metadata = {
'model_type': 'GradientBoostingClassifier',
'features': NUMERIC_FEATURES + CATEGORICAL_FEATURES,
'target': TARGET,
'cv_roc_auc': float(cv_results['test_roc_auc'].mean()),
'test_roc_auc': float(roc_auc_score(y_test, y_scores)),
'train_size': len(X_train),
'test_size': len(X_test),
}
with open(output_dir / 'metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
# ════════════════════════════════════════════════════════════
# STEP 8: Verify loaded model works
# ════════════════════════════════════════════════════════════
loaded_pipeline = joblib.load(output_dir / 'titanic_pipeline.joblib')
test_record = pd.DataFrame([{
'age': 30, 'fare': 50.0, 'sibsp': 1, 'parch': 0,
'pclass': 2, 'sex': 'female', 'embarked': 'S'  # pclass is numeric in the training frame, so pass a number, not '2'
}])
prob = loaded_pipeline.predict_proba(test_record)[0, 1]
print(f"\nSample prediction: survival probability = {prob:.2%}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain the complete ML pipeline and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 85 — MLflow Tracking
Experiment Tracking with MLflow
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Why Experiment Tracking?
In a real project you run dozens of experiments: different models, different features, different hyperparameters. Without tracking, you lose results, can't reproduce them, and don't know which version of the model is in production. MLflow solves all of this.
# pip install mlflow
import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# ── Setup MLflow ──────────────────────────────────────────────
# Start tracking server: mlflow ui (then open http://localhost:5000)
mlflow.set_tracking_uri("http://localhost:5000") # Or use local ./mlruns folder
mlflow.set_experiment("breast-cancer-classification")
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# ── Run 1: Gradient Boosting ──────────────────────────────────
with mlflow.start_run(run_name="GradientBoosting-v1"):
    # Log parameters
    params = {
        'model': 'GradientBoostingClassifier',
        'n_estimators': 200,
        'max_depth': 4,
        'learning_rate': 0.05,
        'subsample': 0.8,
        'cv_folds': 5
    }
    mlflow.log_params(params)
    # Train model
    model = GradientBoostingClassifier(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        learning_rate=params['learning_rate'],
        subsample=params['subsample'],
        random_state=42
    )
    model.fit(X_train, y_train)
    # Evaluate (cross_val_score clones the estimator, so the fitted model is untouched)
    y_pred = model.predict(X_test)
    y_scores = model.predict_proba(X_test)[:, 1]
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)
    # Log metrics
    metrics = {
        'test_accuracy': accuracy_score(y_test, y_pred),
        'test_f1': f1_score(y_test, y_pred),
        'test_roc_auc': roc_auc_score(y_test, y_scores),
        'cv_roc_auc_mean': cv_scores.mean(),
        'cv_roc_auc_std': cv_scores.std()
    }
    mlflow.log_metrics(metrics)
    print(f"GB: Test ROC-AUC: {metrics['test_roc_auc']:.4f}, CV: {metrics['cv_roc_auc_mean']:.4f}")
    # Log model
    mlflow.sklearn.log_model(model, "model", registered_model_name="BreastCancerClassifier")
    # Log feature importances as artifact
    importances = pd.DataFrame({
        'feature': load_breast_cancer().feature_names,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    fig, ax = plt.subplots(figsize=(10, 6))
    importances.head(15).plot.barh(x='feature', y='importance', ax=ax)
    ax.set_title('Feature Importances: GradientBoosting')
    plt.tight_layout()
    fig.savefig('/tmp/feature_importance.png')
    mlflow.log_artifact('/tmp/feature_importance.png')  # Stores in run artifacts
    # Log text content directly as an artifact
    mlflow.log_text(str(importances.to_dict()), "feature_importances.txt")
# ── Run 2: Random Forest for comparison ───────────────────────
with mlflow.start_run(run_name="RandomForest-v1"):
    params = {'model': 'RandomForestClassifier', 'n_estimators': 200, 'max_depth': 10}
    mlflow.log_params(params)
    rf = RandomForestClassifier(**{k: v for k, v in params.items() if k != 'model'}, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    y_scores = rf.predict_proba(X_test)[:, 1]
    cv_scores = cross_val_score(rf, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)
    mlflow.log_metrics({
        'test_roc_auc': roc_auc_score(y_test, y_scores),
        'cv_roc_auc_mean': cv_scores.mean()
    })
    mlflow.sklearn.log_model(rf, "model", registered_model_name="BreastCancerClassifier")
    print(f"RF: Test ROC-AUC: {roc_auc_score(y_test, y_scores):.4f}, CV: {cv_scores.mean():.4f}")
# ── Model Registry — promote best model ───────────────────────
# After comparing runs in the MLflow UI, promote the best to Staging/Production:
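# client = mlflow.tracking.MlflowClient()
# client.transition_model_version_stage(
#     name="BreastCancerClassifier", version=1, stage="Production"
# )
# Note: model stages are deprecated in recent MLflow releases (2.9+) in favour of aliases,
# e.g. client.set_registered_model_alias("BreastCancerClassifier", "champion", 1)
MLflow Alternatives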
If MLflow feels heavyweight: Weights & Biases (wandb) — excellent UI, free for small teams; Neptune.ai — strong for collaborative teams; Comet ML. For minimal overhead in notebooks: just use a results_df DataFrame with pandas and log to CSV after each experiment.
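The minimal CSV approach mentioned above might look like the sketch below. The helper name `log_experiment`, the column names, the two candidate models, and the temporary log path are assumptions for illustration. The helper also assumes every call writes the same columns, since appending rows with different keys would misalign the file.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def log_experiment(path, run_name, params, cv_scores):
    """Append one row per experiment; writes the header only when the file is new."""
    row = pd.DataFrame([{
        "run_name": run_name,
        "params": str(params),
        "cv_roc_auc_mean": cv_scores.mean(),
        "cv_roc_auc_std": cv_scores.std(),
    }])
    row.to_csv(path, mode="a", header=not Path(path).exists(), index=False)

X, y = load_breast_cancer(return_X_y=True)
# A temp directory keeps the sketch free of side effects; use a project path in practice
log_path = Path(tempfile.mkdtemp()) / "experiments.csv"

candidates = {
    "logreg": (make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)), {"C": 1.0}),
    "rf": (RandomForestClassifier(n_estimators=100, random_state=42), {"n_estimators": 100}),
}
for name, (model, params) in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    log_experiment(log_path, name, params, scores)

# Reload the log as a DataFrame and rank the runs
results_df = pd.read_csv(log_path).sort_values("cv_roc_auc_mean", ascending=False)
print(results_df)
```

This covers comparison of runs but not artifact storage, model versioning, or a UI, which is where a dedicated tracker earns its overhead.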
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain experiment tracking with MLflow and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 86 — Case Study 1
Case Study 1 — House Price Prediction
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
A complete regression case study using the Ames Housing dataset, a Kaggle classic with 79 explanatory features and 1,460 houses in the training set.
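Because sale prices are right-skewed, this case study models the log of the target. One way to keep that transform inside the estimator, so predictions come back in dollars automatically, is scikit-learn's TransformedTargetRegressor. The sketch below uses synthetic data and a Ridge regressor as stand-ins (both assumptions) rather than the Ames data and XGBoost.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
# Synthetic right-skewed "price": exponential of a linear signal plus noise
y = np.exp(12 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=500))

model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions, so outputs are on the original scale
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
# Scored on the original scale because predict() already inverts the transform
mape = -cross_val_score(model, X, y, cv=cv,
                        scoring="neg_mean_absolute_percentage_error").mean()
print(f"CV MAPE: {mape:.1%}")
```

Wrapping the transform this way avoids the common mistake of comparing log-space errors with dollar-space errors across experiments.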
"""
house_price_prediction.py — Full EDA + Preprocessing + XGBoost + Evaluation
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor
import joblib
# ── Load data (from Kaggle House Prices competition) ──────────
# df = pd.read_csv('train.csv')
# Simulating key steps with descriptions
# ── EDA Highlights ────────────────────────────────────────────
print("=== EDA PHASE ===")
# 1. Target variable distribution
# SalePrice is right-skewed → log-transform for regression
# df['SalePrice'].hist(bins=50) → right tail
# np.log1p(df['SalePrice']).hist(bins=50) → approximately normal
# 2. Most important numeric correlations
# corr = df.select_dtypes('number').corr()['SalePrice'].abs().sort_values(ascending=False)
# Top: OverallQual (0.79), GrLivArea (0.71), GarageCars (0.64), GarageArea (0.62)
# 3. Key findings
findings = {
"Missing data": "PoolQC, Fence, MiscFeature > 80% missing — drop; Garage/Basement ~5% — impute",
"Target transform": "Log-transform SalePrice (right-skewed)",
"Top predictors": "OverallQual, GrLivArea, TotalBsmtSF, 1stFlrSF",
"Outliers": "Remove houses with GrLivArea > 4000 AND SalePrice < 200k (data entry errors)",
"Feature engineering": "TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF; HouseAge = YrSold - YearBuilt"
}
for k, v in findings.items():
    print(f" {k}: {v}")
# ── Feature Engineering ───────────────────────────────────────
def engineer_features(df):
    df = df.copy()
    # Create new features
    df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
    df['TotalBathrooms'] = (df['FullBath'] + df['HalfBath'] * 0.5 +
                            df['BsmtFullBath'] + df['BsmtHalfBath'] * 0.5)
    df['HouseAge'] = df['YrSold'] - df['YearBuilt']
    df['RemodAge'] = df['YrSold'] - df['YearRemodAdd']
    # Houses without a garage have no GarageYrBlt; fall back to the build year
    df['GarageAge'] = df['YrSold'] - df['GarageYrBlt'].fillna(df['YearBuilt'])
    df['IsNew'] = (df['YearBuilt'] == df['YrSold']).astype(int)
    df['HasPool'] = (df['PoolArea'] > 0).astype(int)
    df['HasGarage'] = (df['GarageArea'] > 0).astype(int)
    df['HasBasement'] = (df['TotalBsmtSF'] > 0).astype(int)
    return df
# ── Training pipeline ─────────────────────────────────────────
NUMERIC_COLS = ['OverallQual', 'GrLivArea', 'TotalSF', 'TotalBathrooms',
'HouseAge', 'GarageCars', 'HasPool', 'HasGarage']
ORDINAL_COLS = ['ExterQual', 'KitchenQual', 'BsmtQual', 'GarageQual',
'HeatingQC', 'FireplaceQu']
ORDINAL_CATEGORIES = [['Po', 'Fa', 'TA', 'Gd', 'Ex']] * len(ORDINAL_COLS)
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
ordinal_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OrdinalEncoder(categories=ORDINAL_CATEGORIES,
handle_unknown='use_encoded_value', unknown_value=-1))
])
preprocessor = ColumnTransformer([
('num', numeric_transformer, NUMERIC_COLS),
('ord', ordinal_transformer, ORDINAL_COLS)
])
xgb_pipeline = Pipeline([
('preprocessor', preprocessor),
('model', XGBRegressor(
n_estimators=500,
max_depth=4,
learning_rate=0.05,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1,
reg_lambda=1.0,
random_state=42,
n_jobs=-1,
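eval_metric='rmse'
# early_stopping_rounds is left out: it requires an eval_set at fit time,
# which a plain Pipeline.fit / cross_val_score call does not supply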
))
])
# ── Cross-Validation (log-transformed target) ─────────────────
# y_log = np.log1p(y_train)
# cv = KFold(n_splits=5, shuffle=True, random_state=42)
# cv_scores = cross_val_score(xgb_pipeline, X_train, y_log, cv=cv, scoring='neg_mean_squared_error')
# rmse_cv = np.sqrt(-cv_scores).mean()  # RMSE in log space; apply np.expm1 to predictions to get $
# ── Evaluation function ───────────────────────────────────────
def evaluate_regression(y_true, y_pred, y_pred_log=None):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    print(f" MAE: ${mae:,.0f}")
    print(f" RMSE: ${rmse:,.0f}")
    print(f" R²: {r2:.4f}")
    print(f" MAPE: {mape:.2f}%")
    if y_pred_log is not None:
        rmsle = np.sqrt(mean_squared_error(np.log1p(y_true), y_pred_log))
        print(f" RMSLE (log-space): {rmsle:.4f}")  # Kaggle metric for this competition
print("\n=== EXAMPLE EVALUATION RESULTS ===")
print("Gradient Boosted Trees on Ames Housing:")
print(" MAE: $14,500")
print(" RMSE: $22,800")
print(" R²: 0.9102")
print(" MAPE: 8.1%")
print(" Kaggle Leaderboard: Top 15% with this simple pipeline")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain case study 1 and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 87 — Case Study 2
Case Study 2 — Customer Churn Prediction
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Churn prediction is heavily imbalanced (typically 5–20% churn rate). This case study covers SMOTE oversampling, threshold tuning, and business-aligned evaluation.
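Threshold tuning can be tied directly to business cost. The sketch below is a simplified, standalone illustration: the synthetic data, the logistic regression model, and the two cost figures (`COST_FN`, `COST_FP`) are hypothetical assumptions. It picks the threshold that minimises total cost on a validation split rather than the test set, so the test set stays untouched for the final check.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 15% positives standing in for churners
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.85, 0.15],
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_val)[:, 1]

COST_FN = 500  # hypothetical: revenue lost when a churner is missed
COST_FP = 50   # hypothetical: cost of a retention offer sent unnecessarily

def total_cost(threshold):
    """Total business cost on the validation split at a given decision threshold."""
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred, labels=[0, 1]).ravel()
    return int(fn * COST_FN + fp * COST_FP)

thresholds = np.round(np.arange(0.05, 0.95, 0.05), 2)
costs = np.array([total_cost(t) for t in thresholds])
best = float(thresholds[costs.argmin()])

print(f"Lowest-cost threshold: {best:.2f} (cost {int(costs.min()):,})")
print(f"Cost at default 0.50:  {total_cost(0.5):,}")
```

When a missed churner costs far more than a wasted offer, the lowest-cost threshold usually sits well below the default 0.5, trading precision for recall.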
"""
churn_prediction.py — Imbalanced classification with SMOTE and threshold tuning
"""
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (classification_report, roc_auc_score,
average_precision_score, f1_score, confusion_matrix)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline # imblearn's Pipeline
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# pip install imbalanced-learn
# ── Simulate churn dataset ────────────────────────────────────
np.random.seed(42)
n = 5000
df = pd.DataFrame({
'tenure_months': np.random.exponential(24, n).clip(1, 72).astype(int),
'monthly_charges': np.random.normal(65, 25, n).clip(20, 120).round(2),
'total_charges': None, # will compute
'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n,
p=[0.55, 0.25, 0.20]),
'internet_service':np.random.choice(['DSL', 'Fiber optic', 'No'], n, p=[0.35, 0.45, 0.20]),
'tech_support': np.random.choice(['Yes', 'No'], n, p=[0.4, 0.6]),
'senior_citizen': np.random.choice([0, 1], n, p=[0.84, 0.16]),
'num_complaints': np.random.poisson(0.3, n),
})
df['total_charges'] = (df['tenure_months'] * df['monthly_charges']).round(2)
# Generate churn (higher churn for month-to-month, fiber, complaints)
churn_prob = 0.05
churn_prob += 0.12 * (df['contract_type'] == 'Month-to-month')
churn_prob += 0.08 * (df['internet_service'] == 'Fiber optic')
churn_prob += 0.05 * df['num_complaints']
churn_prob -= 0.02 * (df['tenure_months'] / 12)
churn_prob -= 0.03 * (df['tech_support'] == 'Yes')
churn_prob = churn_prob.clip(0.02, 0.90)
df['churn'] = (np.random.random(n) < churn_prob).astype(int)
print(f"Churn rate: {df['churn'].mean():.1%}")  # typically lands in the 10-20% range for these coefficients
# ── Features ─────────────────────────────────────────────────
NUMERIC = ['tenure_months', 'monthly_charges', 'total_charges', 'num_complaints']
CATEGORICAL = ['contract_type', 'internet_service', 'tech_support']
BINARY = ['senior_citizen']
FEATURES = NUMERIC + CATEGORICAL + BINARY
X = df[FEATURES]
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# ── Preprocessing ─────────────────────────────────────────────
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', num_transformer, NUMERIC + BINARY),
    ('cat', cat_transformer, CATEGORICAL)
])
# ── Pipeline WITH SMOTE (handles imbalance) ───────────────────
# IMPORTANT: SMOTE must be applied ONLY to training data, inside CV!
# Use imblearn's Pipeline (not sklearn's) to integrate SMOTE correctly.
pipeline_smote = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(sampling_strategy=0.5, random_state=42)),  # Upsample minority to 50% of majority
    ('classifier', GradientBoostingClassifier(
        n_estimators=200, max_depth=4, learning_rate=0.05,
        subsample=0.8, random_state=42
    ))
])
# ── Cross-validation ──────────────────────────────────────────
from sklearn.model_selection import cross_validate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    pipeline_smote, X_train, y_train, cv=cv,
    scoring=['f1', 'roc_auc', 'average_precision'],
    n_jobs=-1
)
print("\n=== Cross-Validation Results ===")
for metric in ['f1', 'roc_auc', 'average_precision']:
    mean = cv_results[f'test_{metric}'].mean()
    std = cv_results[f'test_{metric}'].std()
    print(f"  {metric:22s}: {mean:.4f} ± {std:.4f}")
# ── Fit and evaluate ──────────────────────────────────────────
pipeline_smote.fit(X_train, y_train)
y_scores = pipeline_smote.predict_proba(X_test)[:, 1]
# ── Threshold Tuning ─────────────────────────────────────────
from sklearn.metrics import precision_score, recall_score  # not in the import block above
print("\n=== Threshold Analysis ===")
thresholds = np.arange(0.2, 0.7, 0.05)
results = []
for t in thresholds:
    y_pred_t = (y_scores >= t).astype(int)
    results.append({
        'threshold': round(float(t), 2),
        # Precision and recall need their own scorers; f1_score would repeat the F1 value
        'precision': round(precision_score(y_test, y_pred_t, zero_division=0), 3),
        'recall': round(recall_score(y_test, y_pred_t, zero_division=0), 3),
        'f1': round(f1_score(y_test, y_pred_t, zero_division=0), 3),
        'churners_caught': int(y_pred_t[y_test == 1].sum()),
        'false_alarms': int(y_pred_t[y_test == 0].sum()),
    })
threshold_df = pd.DataFrame(results)
print(threshold_df.to_string(index=False))
# ── Business Impact Analysis ──────────────────────────────────
# Assume: Retention offer costs $50 | Losing a churner costs $500/year
COST_RETENTION = 50
REVENUE_SAVED = 500
best_t = 0.35  # Example threshold; in practice pick it on validation data, not the test set
y_pred_best = (y_scores >= best_t).astype(int)
cm = confusion_matrix(y_test, y_pred_best)
tn, fp, fn, tp = cm.ravel()
# Simplification: assumes every correctly flagged churner is retained,
# and counts the offer cost only for false positives.
cost_false_positives = fp * COST_RETENTION  # Wasted offers
revenue_saved = tp * REVENUE_SAVED  # Churners retained
net_value = revenue_saved - cost_false_positives
print(f"\n=== Business Impact (threshold={best_t}) ===")
print(f" Churners caught: {tp}")
print(f" False alarms: {fp}")
print(f" Revenue saved: ${revenue_saved:,}")
print(f" Cost of false alarms: ${cost_false_positives:,}")
print(f" Net value: ${net_value:,}")
SMOTE Rules
- Apply SMOTE only to training data — never to test/validation data
- Use imblearn's Pipeline to prevent leakage: samplers run only during fit, so inside cross-validation SMOTE resamples the training folds and never the held-out fold
- SMOTE generates synthetic minority samples by interpolating between existing minority examples
- Alternative: class_weight='balanced' parameter in sklearn classifiers — simpler and often sufficient
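To make the interpolation rule concrete, here is a minimal pure-Python sketch of how one synthetic point could be generated. It is a simplification, not imblearn's implementation: it uses only the single nearest neighbour (real SMOTE picks randomly among k nearest minority neighbours), and the `minority` points and helper names are hypothetical.

```python
import random


def nearest_neighbour(x, candidates):
    """Return the closest other point by squared Euclidean distance."""
    others = [c for c in candidates if c is not x]
    return min(others, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, c)))


def smote_like_sample(x, neighbour, rng):
    """Return a synthetic point on the line segment between x and neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]


rng = random.Random(42)
minority = [[1.0, 2.0], [1.5, 2.5], [3.0, 1.0]]  # hypothetical minority-class points
x = minority[0]
nb = nearest_neighbour(x, minority)
synthetic = smote_like_sample(x, nb, rng)
print("neighbour:", nb)
print("synthetic:", synthetic)
```

Because every synthetic point lies between two real minority points, SMOTE cannot add information that is absent from the training data, which is one reason the simpler class_weight option is often sufficient.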
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain the churn case study (SMOTE inside the pipeline, threshold tuning, business impact analysis) and when that approach applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 88 — Interview Prep
ML Interview Questions & Answers
Why this matters
ML Interview Questions & Answers: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
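L1 (Lasso): Adds $\lambda\sum|\theta_i|$ penalty. Can shrink weights to exactly zero → performs implicit feature selection.
L2 (Ridge): Adds $\lambda\sum\theta_i^2$ penalty. Shrinks all weights toward zero but rarely to exactly zero → keeps all features. Better when all features are expected to contribute.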
ElasticNet: Combines both: $\lambda_1\sum|\theta_i| + \lambda_2\sum\theta_i^2$. Best of both worlds for high-dimensional data.
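The difference between the two penalties is easiest to see in one dimension. The sketch below is an illustration under a simplifying assumption (a single weight with unregularised estimate `z`, and a squared loss of 0.5·(θ − z)²); the function names are hypothetical. With the absolute-value penalty the minimiser is soft-thresholding, which returns exactly zero for small `z`; with the squared penalty the minimiser only rescales `z`.

```python
def lasso_1d(z, lam):
    """Minimiser of 0.5*(theta - z)**2 + lam*abs(theta): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_1d(z, lam):
    """Minimiser of 0.5*(theta - z)**2 + lam*theta**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)


lam = 0.5
for z in [2.0, 0.3, -0.1]:
    print(f"z={z:+.1f}  L1 -> {lasso_1d(z, lam):+.3f}  L2 -> {ridge_1d(z, lam):+.3f}")
```

Small inputs such as 0.3 and -0.1 are set to exactly zero by the L1 rule, while the L2 rule leaves them small but non-zero.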
Prioritise Precision when the cost of a False Positive is high: email spam filter (legitimate email in spam = user loses important mail), legal content moderation (wrongly removing content = censorship). F-beta score: set beta > 1 for recall-focus, beta < 1 for precision-focus.
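As a quick check on the F-beta rule, the following sketch computes the score directly from precision and recall using the standard formula $(1+\beta^2)\,PR / (\beta^2 P + R)$. The precision and recall values are made up for illustration.

```python
def f_beta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical model: catches every positive, but half of its alerts are wrong
p, r = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(p, r, beta):.3f}")
```

For this high-recall, low-precision model the score rises from about 0.556 (beta = 0.5) to 0.667 (beta = 1) to 0.833 (beta = 2), showing that larger beta rewards recall.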
2. Class weights: class_weight='balanced' in sklearn automatically upweights the minority class in the loss.
3. Threshold tuning: Adjust decision threshold from 0.5 to a value that optimises your target metric (precision, recall, F1).
4. Resampling: SMOTE oversampling (creates synthetic minority samples) or random undersampling of majority class.
5. Anomaly detection framing: If very rare positive class (<1%), train an anomaly detector on the majority class (Isolation Forest, One-Class SVM).
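To show what "balanced" class weights mean numerically, here is a small sketch that reproduces the documented heuristic n_samples / (n_classes × class count) in plain Python. The label vector is hypothetical, and the helper name is not part of sklearn.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


y = [0] * 90 + [1] * 10  # hypothetical 90/10 imbalance
weights = balanced_class_weights(y)
print(weights)

# Total weight contributed by each class after reweighting
totals = {cls: weights[cls] * y.count(cls) for cls in weights}
print(totals)
```

After reweighting, both classes contribute the same total weight to the loss, so an error on a minority sample here costs nine times as much as an error on a majority sample.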
Recap
- You can answer the core ML interview questions in this section and explain the reasoning behind each answer.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Building an ML Portfolio
Why this matters
Building an ML Portfolio: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
GitHub Repository Structure
ml-projects/
├── 01-house-prices/
│   ├── README.md              ← Problem, dataset, results, key learnings
│   ├── notebooks/
│   │   ├── 01_eda.ipynb
│   │   ├── 02_feature_engineering.ipynb
│   │   └── 03_modelling.ipynb
│   ├── src/
│   │   ├── features.py        ← Reusable feature engineering
│   │   ├── train.py           ← Training script
│   │   └── predict.py         ← Prediction script
│   ├── models/
│   │   └── xgb_pipeline.joblib
│   └── requirements.txt
├── 02-churn-prediction/
├── 03-customer-segmentation/
└── README.md                  ← Portfolio overview with links and screenshots
What Makes a Strong Portfolio Project
- Clear problem statement — what business question does this solve?
- Non-trivial EDA with insights, not just code dumps
- Evidence of good ML practices: pipelines, cross-validation, no data leakage
- Baseline model + multiple iterations with improvement narrative
- Proper evaluation (not just accuracy — explain why you chose your metric)
- Key learnings — what would you do differently? What surprised you?
- Interactive demo (Streamlit app, Gradio, or deployed FastAPI endpoint)
- Clean, readable code with docstrings
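The per-project layout from the repository structure earlier in this section can be scaffolded with a few lines of standard-library Python. This is only a convenience sketch: the `scaffold` helper is hypothetical, it writes into a temporary directory so it is safe to run, and it creates empty placeholder files rather than real notebooks or models.

```python
import tempfile
from pathlib import Path

DIRS = ["notebooks", "src", "models"]
FILES = [
    "README.md",
    "requirements.txt",
    "src/features.py",
    "src/train.py",
    "src/predict.py",
]


def scaffold(project_dir):
    """Create the folder skeleton and empty placeholder files for one project."""
    project_dir = Path(project_dir)
    for d in DIRS:
        (project_dir / d).mkdir(parents=True, exist_ok=True)
    for f in FILES:
        (project_dir / f).touch(exist_ok=True)
    return project_dir


root = scaffold(Path(tempfile.mkdtemp()) / "ml-projects" / "01-house-prices")
for path in sorted(root.rglob("*")):
    print(path.relative_to(root))
```

Starting every project from the same skeleton keeps the portfolio consistent and makes it easy for a reviewer to find the README, the training script, and the saved model.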
Top Kaggle Competitions for Portfolio
| Competition | Type | Why It's Good |
|---|---|---|
| Titanic — Machine Learning from Disaster | Binary classification | Classic beginner project; well-documented; easy to run |
| House Prices — Advanced Regression | Regression | Feature engineering heavy; lots of creativity room |
| Spaceship Titanic | Binary classification | Fun theme, tabular, good for feature engineering |
| Store Sales — Time Series Forecasting | Time series | Real-world business problem; teaches temporal CV |
| Playground Series (monthly) | Various | Kaggle-generated synthetic data; fresh each month |
| Any tabular competition (top 100 leaderboard) | Various | High-quality notebooks from top performers to learn from |
Writing Technical Blog Posts
A well-written blog post demonstrates communication skills — essential for ML roles. Structure:
- Hook: Start with the problem and why it matters ($X million saved, 30% improvement)
- Data exploration: 3–5 key visualisations with insights (not just code)
- Methodology: Your approach and the reasoning behind it
- Results: Metrics, comparison table, business impact
- Lessons learned: What didn't work, what surprised you
- Code link: Always link to the GitHub repo
Best platforms: Medium (Towards Data Science publication), Substack, or your own GitHub Pages site.
Recap
- You can explain what makes a strong ML portfolio project and how to structure the repository.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 90 — Mid-Review
Module 1–7 Review — Key Concept Checklist
Why this matters
Module 1–7 Review: This checklist consolidates the concepts behind model quality, debugging, and interviews; make sure you can tick every item before moving to the next day.
Module 1–2: Foundations & EDA
- Supervised vs Unsupervised vs Reinforcement Learning — can explain with examples
- Batch vs Online Learning — know when to use each
- NumPy: vectorisation, broadcasting, matrix operations
- Pandas: groupby, merge, pivot, apply, missing value handling
- EDA: univariate/bivariate/multivariate analysis, detecting outliers, understanding distributions
Module 3: Preprocessing
- Imputation strategies (mean vs median vs KNN vs MICE)
- Scaling (StandardScaler vs MinMaxScaler vs RobustScaler — when to use each)
- Categorical encoding (OHE vs Label vs Target vs Ordinal)
- Building sklearn Pipelines + ColumnTransformer — no data leakage
- Class imbalance: SMOTE, class_weight, threshold tuning
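A quick way to revise the scaler comparison is to apply all three formulas to one column containing an outlier. The sketch below uses only the standard library and mirrors the usual definitions (mean and population standard deviation, min and max, median and interquartile range); the function names and the data are illustrative, not sklearn's API.

```python
import statistics


def standard_scale(xs):
    """(x - mean) / population standard deviation."""
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def minmax_scale(xs):
    """(x - min) / (max - min), mapped to [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    """(x - median) / interquartile range."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (q3 - q1) for x in xs]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical column with one outlier
for name, fn in [("standard", standard_scale),
                 ("minmax", minmax_scale),
                 ("robust", robust_scale)]:
    print(f"{name:9s}", [round(v, 3) for v in fn(values)])
```

The outlier squeezes the four typical values into a narrow band under min-max and standard scaling, while the median/IQR version keeps them evenly spread, which is why robust scaling is preferred when outliers are present.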
Module 4: Supervised Learning
- Linear/Logistic Regression — cost functions, gradient descent, regularisation
- Decision Trees — Gini impurity, information gain, overfitting via depth
- SVM — max-margin classifier, kernel trick, C and gamma parameters
- Random Forest — bagging, feature randomness, OOB error
- Gradient Boosting (XGBoost) — sequential trees, regularisation, early stopping
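For the decision-tree item, it helps to compute the split criteria by hand at least once. This is a small sketch of Gini impurity, entropy, and information gain for a single binary split; the labels are a made-up example, and the helper names are hypothetical.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = [1] * 5 + [0] * 5
left, right = [1, 1, 1, 1, 0], [1, 0, 0, 0, 0]  # hypothetical candidate split
print(f"Gini parent: {gini(parent):.3f}, Gini left: {gini(left):.3f}")
print(f"Information gain: {information_gain(parent, left, right):.3f}")
```

A perfectly mixed parent has Gini 0.5 and entropy 1 bit; the split above lowers impurity in both children, so the tree would consider it an improvement.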
Module 5: Unsupervised Learning
- K-Means — WCSS, K-Means++, Elbow method, Silhouette score
- DBSCAN — epsilon, min_samples, core/border/noise points
- PCA — explained variance ratio, scree plot, n_components selection
- t-SNE — perplexity, NEVER use as ML features
- Isolation Forest — anomaly score, contamination parameter
Module 6: Evaluation & Tuning
- Stratified K-Fold CV — why stratify, reading CV score vs train score gap
- Precision/Recall/F1 — formulas, tradeoffs, when to use which
- ROC-AUC vs PR-AUC — when imbalanced data makes ROC misleading
- Regression metrics: MAE vs RMSE vs R², and when large residuals should be penalised more heavily (RMSE)
- Bias-variance tradeoff — learning curve diagnosis
- Hyperparameter tuning — Grid → Random → Optuna progression
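The MAE versus RMSE distinction in this checklist can be verified with a tiny example. The sketch below compares two hypothetical sets of predictions with the same total absolute error: one spreads the error evenly, the other concentrates it in a single large miss.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [10.0, 12.0, 14.0, 16.0]
pred_even = [12.0, 14.0, 16.0, 18.0]     # every prediction off by 2
pred_outlier = [10.0, 12.0, 14.0, 24.0]  # one prediction off by 8

for name, pred in [("even", pred_even), ("outlier", pred_outlier)]:
    print(f"{name:8s} MAE={mae(y_true, pred):.2f}  RMSE={rmse(y_true, pred):.2f}")
```

Both prediction sets have an MAE of 2, but the RMSE doubles from 2 to 4 for the set with one large miss, so RMSE is the better choice when large errors are disproportionately costly.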
Module 7: ML Life Cycle
- Problem framing — business metric ≠ ML metric (but must align)
- Full end-to-end pipeline in a single reproducible script
- MLflow experiment tracking — log params, metrics, artifacts, models
- Case study skills: EDA → features → model → evaluation → insights
- Portfolio: 3+ strong projects on GitHub + 1 deployed demo
You've Completed 90% of the Journey!
Days 1–90 cover everything you need to be a competent ML practitioner. Module 8 (Days 91–100) will teach you to take models from notebooks into production, the skill that separates senior engineers from Kaggle participants.
Recap
- You can explain the key concepts from Modules 1–7 and when each applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
