Module 7: ML Project Life Cycle

100 Days of ML Module 7 — End-to-end ML project life cycle: problem framing, data collection, full pipelines, MLflow experiment tracking, case studies, interview prep, and portfolio building.

⏱ 60 Min Read · Updated: May 2026

This module bridges the gap between ML theory and professional practice. You'll learn how real ML projects are structured end-to-end — from defining a business problem to tracking experiments, running case studies, preparing for interviews, and building a portfolio that gets you hired.

The ML Project Life Cycle — 7 Stages

Why this matters

The ML Project Life Cycle: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Industry Reality: A commonly quoted rule of thumb is that ML practitioners spend roughly 70% of their time on data (collection, cleaning, feature engineering) and only about 30% on actual modelling. The best models sit on top of great data pipelines, not the other way around.
1. Problem Framing
2. Data Collection
3. EDA
4. Preprocessing
5. Modelling
6. Evaluation
7. Deployment

Lifecycle is iterative — evaluation often sends you back to data collection or feature engineering

1. Problem Framing

Translate the business problem into a well-defined ML task. Define success metrics, constraints, and risk tolerance. This is the most critical step: a wrongly framed problem can waste months of work.

2. Data Collection & Understanding

Identify data sources (databases, APIs, web scraping, sensors). Understand data quality, schema, and volume. Check that every feature the model needs will actually be available at prediction time, and that its history is available for training.

3. Exploratory Data Analysis (EDA)

Compute statistical summaries and visualisations, analyse correlations, detect outliers, and look for patterns in missing values. Goal: understand your data deeply before any modelling.

4. Data Preprocessing & Feature Engineering

Clean, encode, scale, impute, and engineer features. Build reusable sklearn Pipelines so the same transformations apply consistently to train, validation, and production data.

5. Model Building & Selection

Establish baselines first (DummyClassifier, simple heuristics). Then iterate: simple models → complex models → ensembles. Use cross-validation throughout.
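A baseline comparison can be sketched in a few lines. The dataset (`load_breast_cancer`) and the logistic-regression candidate are illustrative assumptions chosen so the snippet runs on its own; they are not prescribed by this module.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy dataset, used only so the example is self-contained
X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    # Always predicts the majority class: the score any real model must beat
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    # A simple first model; scaling lives inside the pipeline so CV stays leak-free
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

baseline_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    baseline_scores[name] = scores.mean()
    print(f"{name:26s} accuracy = {scores.mean():.3f} ± {scores.std():.3f}")
```

If a more complex model cannot clearly beat the dummy baseline, the problem is more likely in the data or the framing than in the choice of algorithm.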

6. Evaluation & Hyperparameter Tuning

Tune hyperparameters with GridSearchCV or Optuna using cross-validation on the training data, then evaluate the final model once on the held-out test set. Check for bias, fairness, and robustness. Ensure test-set performance meets the business requirements.

7. Deployment & Monitoring

Package the model as an API (Flask/FastAPI), containerise it (Docker), and deploy it to the cloud. Set up monitoring for data drift and model degradation, and define a retraining schedule.
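One simple way to watch a single numeric feature for data drift is the Population Stability Index (PSI). The sketch below is an illustration under stated assumptions: the function name, the 10-bin default, and the synthetic data are hypothetical choices, and the often-quoted alert levels (around 0.1 and 0.25) are conventions rather than guarantees.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10):
    """PSI between a reference (training) sample and a current (production) sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges from reference quantiles: each bin holds ~1/n_bins of the reference data
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference, side="right"), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current, side="right"), minlength=n_bins)
    # Clip fractions to avoid log(0) when a bin is empty
    ref_frac = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_frac = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Synthetic stand-ins for a training feature and two production snapshots
rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)
prod_same = rng.normal(0.0, 1.0, 5000)      # same distribution as training
prod_shifted = rng.normal(1.0, 1.0, 5000)   # mean shifted by one standard deviation

psi_same = population_stability_index(train_feature, prod_same)
psi_shifted = population_stability_index(train_feature, prod_shifted)
print(f"PSI (no drift): {psi_same:.4f}")
print(f"PSI (shifted):  {psi_shifted:.4f}")
```

In practice a check like this would run on a schedule for each important feature and for the model's output scores, and a sustained rise would trigger investigation or retraining.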

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain the ML project life cycle and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 82 — Problem Framing

Problem Framing — Business → ML

Why this matters

Problem Framing: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

The Translation Process

Every ML project starts with a business problem, not a model. Your first job is to understand the business deeply enough to translate it into an ML formulation.

| Business Problem | ML Formulation | Target Variable | Success Metric |
| --- | --- | --- | --- |
| "We're losing customers" | Binary classification: will a customer churn in next 30 days? | churn = 0/1 | Recall (missing a churner is more costly than a false alarm) |
| "We need to price houses better" | Regression: predict sale price from property features | sale_price (continuous) | MAPE < 10% (business-interpretable) |
| "Find similar customers for targeting" | Clustering: segment customers by behaviour | None (unsupervised) | Silhouette score + business review of segment profiles |
| "Detect fraudulent transactions" | Binary classification, heavily imbalanced | fraud = 0/1 | Precision-Recall AUC (imbalanced dataset) |
| "Recommend products to users" | Ranking / collaborative filtering | Implicit feedback (clicks, purchases) | NDCG@10, Click-Through Rate |

The Framing Checklist

  • Is the problem worth solving? (What is the $ value of a 1% improvement?)
  • Is ML the right tool? (Could a simple rule-based system work?)
  • Do you have enough labelled data? (Rule of thumb: ≥1000 samples per class for traditional ML)
  • Are all features available at prediction time? (Check for target leakage)
  • What is the cost of a false positive vs false negative? (Sets your evaluation priority)
  • How often does the model need to make predictions? (Real-time vs batch)
  • How explainable does the model need to be? (Regulated industries require interpretability)
  • What are the latency requirements? (<10ms → avoid deep models; batch → no constraint)
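The false-positive versus false-negative question in the checklist can be made concrete by attaching a cost to each error type and choosing the decision threshold that minimises total cost. This is a minimal sketch: the labels, scores, cost values, and the helper name `best_threshold` are all made up for illustration.

```python
import numpy as np

def best_threshold(y_true, y_score, cost_fp, cost_fn, thresholds):
    """Return (threshold, total_cost) minimising cost_fp * FP + cost_fn * FN."""
    costs = []
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false alarms
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # missed positives
        costs.append(fp * cost_fp + fn * cost_fn)
    i = int(np.argmin(costs))  # first minimum wins on ties
    return float(thresholds[i]), costs[i]

# Toy model outputs (hypothetical): true labels and predicted probabilities
y_true = np.array([0, 0, 1, 0, 0, 1, 0, 1, 1, 1])
y_score = np.array([0.05, 0.12, 0.22, 0.33, 0.41, 0.47, 0.58, 0.66, 0.74, 0.91])
thresholds = np.arange(1, 10) / 10  # 0.1, 0.2, ..., 0.9

best_equal, cost_equal = best_threshold(y_true, y_score, cost_fp=1, cost_fn=1, thresholds=thresholds)
best_fn_heavy, cost_fn_heavy = best_threshold(y_true, y_score, cost_fp=1, cost_fn=10, thresholds=thresholds)

print(f"Equal costs:         threshold={best_equal}, cost={cost_equal}")
print(f"Misses cost 10x more: threshold={best_fn_heavy}, cost={cost_fn_heavy}")
```

With these toy numbers the cost-minimising threshold is 0.6 when both errors cost the same and drops to 0.2 when a missed positive costs ten times more, which is why the cost ratio should be agreed with the business before choosing a metric.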
⚠️ Success Metric Must Match Business Goal

Optimising accuracy on a 99:1 imbalanced fraud dataset is useless: a model that always predicts "not fraud" achieves 99% accuracy while catching no fraud at all. The business cares about catching fraud (recall) and minimising false alarms (precision). Always align your ML metric with what actually matters to the business.
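The accuracy trap can be reproduced in a few lines. The labels below are synthetic (10 fraud cases in 1,000 transactions), and the "model" is a `DummyClassifier` that always predicts the negative class; both are illustrative assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic labels: 1% fraud (1), 99% legitimate (0)
y = np.array([1] * 10 + [0] * 990)
X = np.zeros((len(y), 1))  # features are irrelevant for a constant predictor

# A "model" that always answers "not fraud"
always_legit = DummyClassifier(strategy="constant", constant=0).fit(X, y)
y_pred = always_legit.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred, zero_division=0)
prec = precision_score(y, y_pred, zero_division=0)  # no positive predictions, reported as 0

print(f"Accuracy:  {acc:.2%}")
print(f"Recall:    {rec:.2%}")
print(f"Precision: {prec:.2%}")
```

Accuracy comes out at 99% while recall is 0%: the headline number looks excellent and the model is worthless for the stated business goal.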

Recap

  • You can explain problem framing and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 83 — Data Collection

Data Collection & Labelling

Why this matters

Data Collection & Labelling: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Data Sources

| Source Type | Examples | Tools | Considerations |
| --- | --- | --- | --- |
| Internal Databases | CRM, ERP, transaction logs, user events | SQL, SQLAlchemy, pandas.read_sql() | Data quality, access permissions, GDPR |
| Public APIs | Twitter/X API, Google Maps, OpenWeather, Alpha Vantage (stocks) | requests, httpx, official SDKs | Rate limits, costs, schema changes |
| Web Scraping | Product prices, job listings, news articles | BeautifulSoup, Scrapy, Playwright | ToS compliance, robots.txt, dynamic JS pages |
| Public Datasets | Kaggle, UCI ML Repository, Hugging Face, Google Dataset Search | kaggle API, datasets library | Licence, dataset freshness, real-world applicability |
| Sensors / IoT | Temperature sensors, GPS logs, click streams | MQTT, Kafka, InfluxDB | High volume, real-time, noise |
| Synthetic Data | When real data is scarce, private, or imbalanced | SMOTE, Faker, SDV, CTGAN | Distribution mismatch with production data |

Data Labelling

Supervised ML requires labelled data. Labelling is expensive and time-consuming. Key strategies:

  • Manual labelling: Domain experts (doctors, lawyers) label data directly. Gold standard but expensive.
  • Crowdsourcing: Amazon Mechanical Turk, Appen — cheap but requires quality control.
  • Label Studio (open source): Self-hosted annotation tool supporting text, images, audio, video.
  • Scale AI / Labelbox: Enterprise annotation platforms with quality workflows.
  • Weak supervision (Snorkel): Write labelling functions (heuristics) that programmatically assign noisy labels — then combine them using a generative model.
  • Active learning: The model queries the human for labels on the most uncertain/informative examples, which can substantially reduce the number of labels needed to reach a given accuracy.
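The active-learning idea from the list above can be sketched as one round of uncertainty sampling. Everything here is an illustrative assumption: the synthetic dataset, the logistic-regression model, the seed set of 20 labels, and the batch size of 5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic pool; in a real project most of these labels would be unknown
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Seed set: pretend only the first 10 examples of each class are labelled
labelled = np.concatenate([np.flatnonzero(y == c)[:10] for c in (0, 1)])
unlabelled = np.setdiff1d(np.arange(len(X)), labelled)

# Train on the small labelled set
model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])

# Uncertainty = how close the predicted probability is to 0.5
proba = model.predict_proba(X[unlabelled])[:, 1]
distance_from_boundary = np.abs(proba - 0.5)

# Query the 5 most uncertain examples for human labelling
query = unlabelled[np.argsort(distance_from_boundary)[:5]]
print("Indices to send to annotators:", query.tolist())
```

A full loop would add the newly labelled examples to the labelled set, retrain, and repeat until the labelling budget is spent or validation performance plateaus.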
Code Example
# ── Collecting data from a REST API ──────────────────────────
import requests
import pandas as pd
import time

def fetch_weather_data(cities, api_key):
    """Fetch current weather for a list of cities."""
    records = []
    for city in cities:
        url = "https://api.openweathermap.org/data/2.5/weather"  # plain string: no placeholders to format
        params = {'q': city, 'appid': api_key, 'units': 'metric'}
        try:
            response = requests.get(url, params=params, timeout=10)
            response.raise_for_status()
            data = response.json()
            records.append({
                'city':        city,
                'temperature': data['main']['temp'],
                'humidity':    data['main']['humidity'],
                'wind_speed':  data['wind']['speed'],
                'description': data['weather'][0]['description'],
                'timestamp':   pd.Timestamp.now()
            })
            time.sleep(0.2)  # Respect rate limits
        except requests.RequestException as e:
            print(f"Error fetching {city}: {e}")
    return pd.DataFrame(records)

# ── Simple web scraping with BeautifulSoup ────────────────────
from bs4 import BeautifulSoup

def scrape_job_titles(url):
    """Extract job titles from a job listing page."""
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; research-bot/1.0)'}
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()  # Fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.content, 'html.parser')
    titles = [tag.get_text(strip=True) for tag in soup.select('.job-title')]
    return titles

# ── Loading from Kaggle programmatically ─────────────────────
# First: pip install kaggle, set up ~/.kaggle/kaggle.json
import subprocess
subprocess.run(['kaggle', 'datasets', 'download', '-d',
                'uciml/breast-cancer-wisconsin-data', '--unzip', '-p', './data/'],
               check=True)  # check=True raises if the download fails
df = pd.read_csv('./data/data.csv')

# ── Tracking data provenance ──────────────────────────────────
data_catalog = {
    'source': 'Kaggle UCI Breast Cancer Wisconsin',
    'url': 'https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data',
    'downloaded': pd.Timestamp.now().isoformat(),
    'rows': len(df),
    'columns': list(df.columns),
    'license': 'CC BY 4.0',
    'notes': 'Original dataset from UCI ML Repository'
}
import json
with open('./data/data_catalog.json', 'w') as f:
    json.dump(data_catalog, f, indent=2)

Recap

  • You can explain data collection & labelling and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 84 — End-to-End Project

Complete ML Pipeline — Raw Data to Prediction

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

This is a complete end-to-end ML pipeline that goes from raw data to a serialised, ready-to-serve model in a single script:

Code Example
"""
complete_ml_pipeline.py
A full ML pipeline: data → preprocessing → model → evaluation → serialisation
"""
import pandas as pd
import numpy as np
import joblib
import json
from pathlib import Path

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (classification_report, roc_auc_score,
                             f1_score, confusion_matrix)
from sklearn.datasets import fetch_openml

# ════════════════════════════════════════════════════════════
# STEP 1: Load Data
# ════════════════════════════════════════════════════════════
print("="*60)
print("STEP 1: Loading Data")
print("="*60)

# Using Titanic as example (replace with your data source)
titanic = fetch_openml('titanic', version=1, as_frame=True, parser='auto')
df = titanic.frame.copy()

# Target variable
df['survived'] = (df['survived'].astype(int) == 1).astype(int)

print(f"Dataset shape: {df.shape}")
print(f"Target distribution:\n{df['survived'].value_counts(normalize=True).round(3)}")

# ════════════════════════════════════════════════════════════
# STEP 2: Feature Definition
# ════════════════════════════════════════════════════════════
print("\nSTEP 2: Feature Definition")

NUMERIC_FEATURES    = ['age', 'fare', 'sibsp', 'parch']
CATEGORICAL_FEATURES = ['pclass', 'sex', 'embarked']
TARGET              = 'survived'

X = df[NUMERIC_FEATURES + CATEGORICAL_FEATURES].copy()
y = df[TARGET]

print(f"Features: {NUMERIC_FEATURES + CATEGORICAL_FEATURES}")
print(f"Missing values:\n{X.isnull().sum()}")

# ════════════════════════════════════════════════════════════
# STEP 3: Train/Test Split
# ════════════════════════════════════════════════════════════
print("\nSTEP 3: Train/Test Split")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")

# ════════════════════════════════════════════════════════════
# STEP 4: Build sklearn Pipeline (leak-proof!)
# ════════════════════════════════════════════════════════════
print("\nSTEP 4: Building sklearn Pipeline")

# Numeric: impute median → scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical: impute mode → one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, NUMERIC_FEATURES),
    ('cat', categorical_transformer, CATEGORICAL_FEATURES)
], remainder='drop')

# Full pipeline: preprocessor → model
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(
        n_estimators=200, max_depth=4, learning_rate=0.08,
        subsample=0.8, random_state=42
    ))
])

# ════════════════════════════════════════════════════════════
# STEP 5: Cross-Validation
# ════════════════════════════════════════════════════════════
print("\nSTEP 5: Cross-Validation")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    full_pipeline, X_train, y_train, cv=cv,
    scoring=['accuracy', 'f1', 'roc_auc'],
    return_train_score=True, n_jobs=-1
)

for metric in ['accuracy', 'f1', 'roc_auc']:
    train_m = cv_results[f'train_{metric}'].mean()
    val_m   = cv_results[f'test_{metric}'].mean()
    val_std = cv_results[f'test_{metric}'].std()
    print(f"  {metric:12s}: Train={train_m:.4f} | CV={val_m:.4f} ± {val_std:.4f}")

# ════════════════════════════════════════════════════════════
# STEP 6: Final Training & Test Evaluation
# ════════════════════════════════════════════════════════════
print("\nSTEP 6: Final Training")
full_pipeline.fit(X_train, y_train)

y_pred   = full_pipeline.predict(X_test)
y_scores = full_pipeline.predict_proba(X_test)[:, 1]

print("\n--- TEST SET EVALUATION ---")
print(classification_report(y_test, y_pred, target_names=['Died', 'Survived']))
print(f"Test ROC-AUC: {roc_auc_score(y_test, y_scores):.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")

# ════════════════════════════════════════════════════════════
# STEP 7: Serialise Model
# ════════════════════════════════════════════════════════════
print("\nSTEP 7: Saving Model")

output_dir = Path('./models')
output_dir.mkdir(exist_ok=True)

joblib.dump(full_pipeline, output_dir / 'titanic_pipeline.joblib')
print(f"Model saved to {output_dir / 'titanic_pipeline.joblib'}")

# Save metadata
metadata = {
    'model_type': 'GradientBoostingClassifier',
    'features': NUMERIC_FEATURES + CATEGORICAL_FEATURES,
    'target': TARGET,
    'cv_roc_auc': float(cv_results['test_roc_auc'].mean()),
    'test_roc_auc': float(roc_auc_score(y_test, y_scores)),
    'train_size': len(X_train),
    'test_size': len(X_test),
}
with open(output_dir / 'metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

# ════════════════════════════════════════════════════════════
# STEP 8: Verify loaded model works
# ════════════════════════════════════════════════════════════
loaded_pipeline = joblib.load(output_dir / 'titanic_pipeline.joblib')
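test_record = pd.DataFrame([{
    'age': 30, 'fare': 50.0, 'sibsp': 1, 'parch': 0,
    # pclass is loaded as a number by fetch_openml; a string '2' would not match any
    # learned category and would be silently encoded as all zeros (handle_unknown='ignore')
    'pclass': 2,
    'sex': 'female', 'embarked': 'S'
}])
prob = loaded_pipeline.predict_proba(test_record)[0, 1]
print(f"\nSample prediction: survival probability = {prob:.2%}")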
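Why the script calls its Pipeline "leak-proof" is easiest to see with a deliberately leaky counter-example. The sketch below uses pure-noise features and labels that carry no signal, so any honest estimate should sit near 50% accuracy; the data sizes, `SelectKBest` with `k=20`, and the logistic-regression model are illustrative assumptions, not part of the script above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))  # pure noise features
y = np.array([0, 1] * 50)         # labels unrelated to X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: feature selection sees every label, including those of the validation folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=cv).mean()

# Leak-free: selection is a pipeline step, so it is re-fitted on each training fold only
honest_pipeline = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_acc = cross_val_score(honest_pipeline, X, y, cv=cv).mean()

print(f"Selection outside CV (leaky): {leaky_acc:.2f}")
print(f"Selection inside Pipeline:    {honest_acc:.2f}")
```

The same reasoning applies to imputers, scalers, and encoders: any step that learns from data belongs inside the Pipeline so that cross-validation re-fits it on the training folds alone.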

Recap

  • You can explain the complete ML pipeline and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 85 — MLflow Tracking

Experiment Tracking with MLflow

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Why Experiment Tracking?

In a real project you run dozens of experiments: different models, different features, different hyperparameters. Without tracking, you lose results, can't reproduce them, and don't know which version of the model is in production. MLflow addresses all three problems.

MLflow Components: Tracking (log params/metrics/artifacts), Projects (reproducible packaging of code), Models (a standard model packaging format with serving support), Model Registry (versioning and staging workflows).
Code Example
# pip install mlflow
import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# ── Setup MLflow ──────────────────────────────────────────────
# Start tracking server: mlflow ui  (then open http://localhost:5000)
mlflow.set_tracking_uri("http://localhost:5000")  # Or use local ./mlruns folder
mlflow.set_experiment("breast-cancer-classification")

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# ── Run 1: Gradient Boosting ──────────────────────────────────
with mlflow.start_run(run_name="GradientBoosting-v1"):

    # Log parameters
    params = {
        'model': 'GradientBoostingClassifier',
        'n_estimators': 200,
        'max_depth': 4,
        'learning_rate': 0.05,
        'subsample': 0.8,
        'cv_folds': 5
    }
    mlflow.log_params(params)

    # Train model
    model = GradientBoostingClassifier(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        learning_rate=params['learning_rate'],
        subsample=params['subsample'],
        random_state=42
    )
    model.fit(X_train, y_train)

    # Evaluate
    y_pred   = model.predict(X_test)
    y_scores = model.predict_proba(X_test)[:, 1]

    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)

    # Log metrics
    metrics = {
        'test_accuracy': accuracy_score(y_test, y_pred),
        'test_f1':       f1_score(y_test, y_pred),
        'test_roc_auc':  roc_auc_score(y_test, y_scores),
        'cv_roc_auc_mean': cv_scores.mean(),
        'cv_roc_auc_std':  cv_scores.std()
    }
    mlflow.log_metrics(metrics)
    print(f"GB — Test ROC-AUC: {metrics['test_roc_auc']:.4f}, CV: {metrics['cv_roc_auc_mean']:.4f}")

    # Log model
    mlflow.sklearn.log_model(model, "model", registered_model_name="BreastCancerClassifier")

    # Log feature importances as artifact
    importances = pd.DataFrame({
        'feature': load_breast_cancer().feature_names,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)

    fig, ax = plt.subplots(figsize=(10, 6))
    importances.head(15).plot.barh(x='feature', y='importance', ax=ax)
    ax.set_title('Feature Importances — GradientBoosting')
    plt.tight_layout()
    mlflow.log_figure(fig, "feature_importance.png")  # Stores the figure in run artifacts, no temp file needed
    plt.close(fig)

    # Log the full importance table as a text artifact
    mlflow.log_text(importances.to_csv(index=False), "feature_importances.csv")

# ── Run 2: Random Forest for comparison ───────────────────────
with mlflow.start_run(run_name="RandomForest-v1"):
    params = {'model': 'RandomForestClassifier', 'n_estimators': 200, 'max_depth': 10}
    mlflow.log_params(params)

    rf = RandomForestClassifier(**{k:v for k,v in params.items() if k != 'model'}, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    y_scores = rf.predict_proba(X_test)[:, 1]
    cv_scores = cross_val_score(rf, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)

    mlflow.log_metrics({
        'test_roc_auc': roc_auc_score(y_test, y_scores),
        'cv_roc_auc_mean': cv_scores.mean()
    })
    mlflow.sklearn.log_model(rf, "model", registered_model_name="BreastCancerClassifier")
    print(f"RF  — Test ROC-AUC: {roc_auc_score(y_test, y_scores):.4f}, CV: {cv_scores.mean():.4f}")

# ── Model Registry — promote best model ───────────────────────
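# After comparing runs in the MLflow UI, promote the best version.
# Stage transitions are deprecated since MLflow 2.9; on newer versions prefer aliases
# (the alias name "champion" is just an example):
# client = mlflow.tracking.MlflowClient()
# client.set_registered_model_alias(
#     name="BreastCancerClassifier", alias="champion", version=1
# )
# On older versions, the stage-based call still applies:
# client.transition_model_version_stage(
#     name="BreastCancerClassifier", version=1, stage="Production"
# )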
💡 MLflow Alternatives

If MLflow feels heavyweight: Weights & Biases (wandb) — excellent UI, free for small teams; Neptune.ai — strong for collaborative teams; Comet ML. For minimal overhead in notebooks: just use a results_df DataFrame with pandas and log to CSV after each experiment.
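The minimal "results DataFrame plus CSV" approach might look like the sketch below. The helper name `log_experiment`, the column layout, and the metric values are hypothetical placeholders, and a temporary directory stands in for a real project folder.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

def log_experiment(path, params, metrics):
    """Append one experiment (params + metrics) to a CSV log and return the full log."""
    path = Path(path)
    row = {"timestamp": datetime.now(timezone.utc).isoformat(), **params, **metrics}
    results_df = pd.DataFrame([row])
    if path.exists():
        # Columns missing from older rows are filled with NaN
        results_df = pd.concat([pd.read_csv(path), results_df], ignore_index=True)
    results_df.to_csv(path, index=False)
    return results_df

# Placeholder numbers, not real results
log_path = Path(tempfile.mkdtemp()) / "experiments.csv"
log_experiment(log_path, {"model": "rf", "n_estimators": 200}, {"cv_roc_auc": 0.98})
results_df = log_experiment(log_path, {"model": "gb", "learning_rate": 0.05}, {"cv_roc_auc": 0.97})

print(results_df.sort_values("cv_roc_auc", ascending=False))
```

This keeps a comparable history of runs with almost no setup, but it does not store models or artifacts, so it suits solo notebook work better than team projects.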

Recap

  • You can explain experiment tracking with MLflow and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 86 — Case Study 1

Case Study 1 — House Price Prediction

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

A complete regression case study using the Ames Housing dataset, a Kaggle classic with 79 explanatory features and 1,460 houses in the training set.

Code Example
"""
house_price_prediction.py — Full EDA + Preprocessing + XGBoost + Evaluation
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor
import joblib

# ── Load data (from Kaggle House Prices competition) ──────────
# df = pd.read_csv('train.csv')
# Simulating key steps with descriptions

# ── EDA Highlights ────────────────────────────────────────────
print("=== EDA PHASE ===")

# 1. Target variable distribution
# SalePrice is right-skewed → log-transform for regression
# df['SalePrice'].hist(bins=50)  → right tail
# np.log1p(df['SalePrice']).hist(bins=50)  → approximately normal

# 2. Most important numeric correlations
# corr = df.select_dtypes('number').corr()['SalePrice'].abs().sort_values(ascending=False)
# Top: OverallQual (0.79), GrLivArea (0.71), GarageCars (0.64), GarageArea (0.62)

# 3. Key findings
findings = {
    "Missing data":         "PoolQC, Fence, MiscFeature > 80% missing — drop; Garage/Basement ~5% — impute",
    "Target transform":     "Log-transform SalePrice (right-skewed)",
    "Top predictors":       "OverallQual, GrLivArea, TotalBsmtSF, 1stFlrSF",
    "Outliers":             "Remove houses with GrLivArea > 4000 AND SalePrice < 200k (data entry errors)",
    "Feature engineering":  "TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF; HouseAge = YrSold - YearBuilt"
}
for k, v in findings.items():
    print(f"  {k}: {v}")

# ── Feature Engineering ───────────────────────────────────────
def engineer_features(df):
    df = df.copy()
    # Create new features
    df['TotalSF']     = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
    df['TotalBathrooms'] = (df['FullBath'] + df['HalfBath'] * 0.5 +
                            df['BsmtFullBath'] + df['BsmtHalfBath'] * 0.5)
    df['HouseAge']    = df['YrSold'] - df['YearBuilt']
    df['RemodAge']    = df['YrSold'] - df['YearRemodAdd']
    df['GarageAge']   = df['YrSold'] - df['GarageYrBlt'].fillna(df['YearBuilt'])
    df['IsNew']       = (df['YearBuilt'] == df['YrSold']).astype(int)
    df['HasPool']     = (df['PoolArea'] > 0).astype(int)
    df['HasGarage']   = (df['GarageArea'] > 0).astype(int)
    df['HasBasement'] = (df['TotalBsmtSF'] > 0).astype(int)
    return df

# ── Training pipeline ─────────────────────────────────────────
NUMERIC_COLS = ['OverallQual', 'GrLivArea', 'TotalSF', 'TotalBathrooms',
                'HouseAge', 'GarageCars', 'HasPool', 'HasGarage']
ORDINAL_COLS = ['ExterQual', 'KitchenQual', 'BsmtQual', 'GarageQual',
                'HeatingQC', 'FireplaceQu']
ORDINAL_CATEGORIES = [['Po', 'Fa', 'TA', 'Gd', 'Ex']] * len(ORDINAL_COLS)

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
ordinal_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=ORDINAL_CATEGORIES,
                               handle_unknown='use_encoded_value', unknown_value=-1))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, NUMERIC_COLS),
    ('ord', ordinal_transformer, ORDINAL_COLS)
])

xgb_pipeline = Pipeline([
    ('preprocessor', preprocessor),
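    ('model', XGBRegressor(
        n_estimators=500,
        max_depth=4,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0.1,
        reg_lambda=1.0,
        random_state=42,
        n_jobs=-1,
        eval_metric='rmse'
        # No early_stopping_rounds here: early stopping needs an eval_set at fit time,
        # which a plain Pipeline.fit / cross_val_score call does not provide.
    ))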
])

# ── Cross-Validation (log-transformed target) ─────────────────
# y_train_log = np.log1p(y_train)
# cv = KFold(n_splits=5, shuffle=True, random_state=42)
# cv_scores = cross_val_score(xgb_pipeline, X_train, y_train_log, cv=cv, scoring='neg_mean_squared_error')
# rmse_cv = np.sqrt(-cv_scores).mean()  # RMSE in log space (i.e. RMSLE), roughly a relative error; apply np.expm1 to predictions to get $

# ── Evaluation function ───────────────────────────────────────
def evaluate_regression(y_true, y_pred, y_pred_log=None):
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2   = r2_score(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

    print(f"  MAE:  ${mae:,.0f}")
    print(f"  RMSE: ${rmse:,.0f}")
    print(f"  R²:   {r2:.4f}")
    print(f"  MAPE: {mape:.2f}%")
    if y_pred_log is not None:
        rmsle = np.sqrt(mean_squared_error(np.log1p(y_true), y_pred_log))
        print(f"  RMSLE (log-space): {rmsle:.4f}")  # Kaggle metric for this competition

print("\n=== EXAMPLE EVALUATION RESULTS (illustrative figures, not computed by this script) ===")
print("Gradient Boosted Trees on Ames Housing:")
print("  MAE:  $14,500")
print("  RMSE: $22,800")
print("  R²:   0.9102")
print("  MAPE: 8.1%")
print("  Kaggle Leaderboard: Top 15% with this simple pipeline")
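The log-transform of the target described in the EDA findings can be wired into the model itself with scikit-learn's `TransformedTargetRegressor`, which applies `np.log1p` before fitting and `np.expm1` on predictions. This is a sketch on synthetic, right-skewed "prices" with a `Ridge` regressor; the data and model are stand-ins, not the Ames dataset or the XGBoost pipeline above.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic right-skewed target: log-linear in the features plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.exp(12 + 0.4 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=1000))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The regressor is trained on log1p(y); predictions come back in the original units
model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mape = np.mean(np.abs((y_test - pred) / y_test)) * 100
print(f"MAPE on original scale: {mape:.2f}%")
```

Because the inverse transform is handled by the wrapper, evaluation metrics such as MAE and MAPE can be computed directly in currency units without manual exponentiation.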

Recap

  • You can explain the house price case study and when its approach applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 87 — Case Study 2

Case Study 2 — Customer Churn Prediction

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Churn datasets are usually imbalanced (typically a 5–20% churn rate). This case study covers SMOTE oversampling, threshold tuning, and business-aligned evaluation.

Code Example
"""
churn_prediction.py — Imbalanced classification with SMOTE and threshold tuning
"""
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (classification_report, roc_auc_score,
                             average_precision_score, f1_score, confusion_matrix)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline  # imblearn's Pipeline
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# pip install imbalanced-learn

# ── Simulate churn dataset ────────────────────────────────────
np.random.seed(42)
n = 5000

df = pd.DataFrame({
    'tenure_months':   np.random.exponential(24, n).clip(1, 72).astype(int),
    'monthly_charges': np.random.normal(65, 25, n).clip(20, 120).round(2),
    'total_charges':   None,  # will compute
    'contract_type':   np.random.choice(['Month-to-month', 'One year', 'Two year'], n,
                                        p=[0.55, 0.25, 0.20]),
    'internet_service':np.random.choice(['DSL', 'Fiber optic', 'No'], n, p=[0.35, 0.45, 0.20]),
    'tech_support':    np.random.choice(['Yes', 'No'], n, p=[0.4, 0.6]),
    'senior_citizen':  np.random.choice([0, 1], n, p=[0.84, 0.16]),
    'num_complaints':  np.random.poisson(0.3, n),
})
df['total_charges'] = (df['tenure_months'] * df['monthly_charges']).round(2)

# Generate churn (higher churn for month-to-month, fiber, complaints)
churn_prob = 0.05
churn_prob += 0.12 * (df['contract_type'] == 'Month-to-month')
churn_prob += 0.08 * (df['internet_service'] == 'Fiber optic')
churn_prob += 0.05 * df['num_complaints']
churn_prob -= 0.02 * (df['tenure_months'] / 12)
churn_prob -= 0.03 * (df['tech_support'] == 'Yes')
churn_prob = churn_prob.clip(0.02, 0.90)
df['churn'] = (np.random.random(n) < churn_prob).astype(int)

print(f"Churn rate: {df['churn'].mean():.1%}")  # roughly 10–15% with these coefficients

# ── Features ─────────────────────────────────────────────────
NUMERIC = ['tenure_months', 'monthly_charges', 'total_charges', 'num_complaints']
CATEGORICAL = ['contract_type', 'internet_service', 'tech_support']
BINARY = ['senior_citizen']
FEATURES = NUMERIC + CATEGORICAL + BINARY

X = df[FEATURES]
y = df['churn']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# ── Preprocessing ─────────────────────────────────────────────
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', num_transformer, NUMERIC + BINARY),
    ('cat', cat_transformer, CATEGORICAL)
])

# ── Pipeline WITH SMOTE (handles imbalance) ───────────────────
# IMPORTANT: SMOTE must be applied ONLY to training data, inside CV!
# Use imblearn's Pipeline (not sklearn's) to integrate SMOTE correctly.

pipeline_smote = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(sampling_strategy=0.5, random_state=42)),  # Upsample minority to 50% of majority
    ('classifier', GradientBoostingClassifier(
        n_estimators=200, max_depth=4, learning_rate=0.05,
        subsample=0.8, random_state=42
    ))
])

# ── Cross-validation ──────────────────────────────────────────
from sklearn.model_selection import cross_validate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    pipeline_smote, X_train, y_train, cv=cv,
    scoring=['f1', 'roc_auc', 'average_precision'],
    n_jobs=-1
)
print("\n=== Cross-Validation Results ===")
for metric in ['f1', 'roc_auc', 'average_precision']:
    mean = cv_results[f'test_{metric}'].mean()
    std  = cv_results[f'test_{metric}'].std()
    print(f"  {metric:22s}: {mean:.4f} ± {std:.4f}")

# ── Fit and evaluate ──────────────────────────────────────────
pipeline_smote.fit(X_train, y_train)
y_scores = pipeline_smote.predict_proba(X_test)[:, 1]

# ── Threshold Tuning ─────────────────────────────────────────
from sklearn.metrics import precision_score, recall_score  # not imported at the top of the script
print("\n=== Threshold Analysis ===")
thresholds = np.arange(0.2, 0.7, 0.05)
results = []
for t in thresholds:
    y_pred_t = (y_scores >= t).astype(int)
    results.append({
        'threshold': round(t, 2),
        # Precision and recall need their own scorers; f1_score is only used for F1
        'precision': round(precision_score(y_test, y_pred_t, zero_division=0), 3),
        'recall':    round(recall_score(y_test, y_pred_t, zero_division=0), 3),
        'f1':        round(f1_score(y_test, y_pred_t, zero_division=0), 3),
        'churners_caught': int(y_pred_t[(y_test == 1).to_numpy()].sum()),
        'false_alarms':    int(y_pred_t[(y_test == 0).to_numpy()].sum()),
    })
threshold_df = pd.DataFrame(results)
print(threshold_df.to_string(index=False))

# ── Business Impact Analysis ──────────────────────────────────
# Assume: Retention offer costs $50 | Losing a churner costs $500/year
COST_RETENTION = 50
REVENUE_SAVED = 500

best_t = 0.35  # Example optimal threshold
y_pred_best = (y_scores >= best_t).astype(int)
cm = confusion_matrix(y_test, y_pred_best)
tn, fp, fn, tp = cm.ravel()

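# Simplifying assumption: every churner who receives an offer stays.
offer_cost_churners  = tp * COST_RETENTION   # Offers sent to real churners also cost money
cost_false_positives = fp * COST_RETENTION   # Wasted offers sent to non-churners
revenue_saved        = tp * REVENUE_SAVED    # Churners retained
net_value = revenue_saved - offer_cost_churners - cost_false_positives
print(f"\n=== Business Impact (threshold={best_t}) ===")
print(f"  Churners caught:      {tp}")
print(f"  False alarms:         {fp}")
print(f"  Revenue saved:        ${revenue_saved:,}")
print(f"  Cost of offers (TP):  ${offer_cost_churners:,}")
print(f"  Cost of false alarms: ${cost_false_positives:,}")
print(f"  Net value:            ${net_value:,}")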
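The hard-coded `best_t = 0.35` is only an example value. A minimal sketch of choosing the threshold by net value instead is shown here, assuming every flagged customer receives the $50 offer and every flagged churner is retained. `net_value_at` and `pick_threshold` are hypothetical helper names, and the labels and scores are synthetic stand-ins for `y_test` and `y_scores`.

```python
import numpy as np

COST_RETENTION = 50    # cost of one retention offer (from the example above)
REVENUE_SAVED = 500    # yearly revenue kept when a churner stays

def net_value_at(y_true, y_scores, threshold):
    """Net value of contacting everyone scored at or above `threshold`."""
    flagged = y_scores >= threshold
    tp = int(np.sum(flagged & (y_true == 1)))   # churners we contact
    fp = int(np.sum(flagged & (y_true == 0)))   # loyal customers we contact
    # Assumption: every contacted churner is retained; every offer costs money.
    return tp * REVENUE_SAVED - (tp + fp) * COST_RETENTION

def pick_threshold(y_true, y_scores, candidates):
    """Return (best_threshold, best_value) over the candidate thresholds."""
    values = [net_value_at(y_true, y_scores, t) for t in candidates]
    best = int(np.argmax(values))
    return float(candidates[best]), values[best]

# Synthetic labels and scores (about 15% churners, churners score higher on average)
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.15).astype(int)
y_scores = np.clip(0.25 * y_true + rng.normal(0.3, 0.15, 1000), 0, 1)

candidates = np.arange(0.05, 0.95, 0.05)
best_t, best_value = pick_threshold(y_true, y_scores, candidates)
print(f"best threshold: {best_t:.2f}, net value: ${best_value:,}")
```

In a real project the threshold would be picked on a validation fold and only then reported on the test set, otherwise the test estimate is optimistic.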
📌
SMOTE Rules
  • Apply SMOTE only to training data — never to test/validation data
  • Use imblearn's Pipeline to prevent leakage: it runs SMOTE only during fit, so inside CV only the training folds are resampled and the validation fold is scored untouched
  • SMOTE generates synthetic minority samples by interpolating between existing minority examples
  • Alternative: class_weight='balanced' parameter in sklearn classifiers — simpler and often sufficient
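The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not imblearn's implementation; `smote_like_samples` is a hypothetical helper and the minority points are random.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Simplified SMOTE-style oversampling: move from a minority point
    towards one of its k nearest minority neighbours. Illustration only."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours

    base = rng.integers(0, len(X), size=n_new)                  # random base points
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per base point
    gap = rng.random((n_new, 1))                                # position along the segment
    return X[base] + gap * (X[chosen] - X[base])

X_min = np.random.default_rng(42).normal(size=(20, 2))   # 20 made-up minority rows
X_new = smote_like_samples(X_min, n_new=10)
print(X_new.shape)  # (10, 2)
```

Each synthetic row lies on the line segment between two real minority rows, which is why SMOTE should only ever see training data.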

Recap

  • You can explain case study 2 and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 88 — Interview Prep

ML Interview Questions & Answers

Why this matters

ML Interview Questions & Answers: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

What is overfitting, and how do you prevent it?
Overfitting occurs when a model learns the training data's noise instead of the underlying pattern — it performs well on training data but poorly on unseen data. Prevention strategies: (1) Regularisation (L1/L2 penalty, dropout for deep learning), (2) Cross-validation instead of a single hold-out, (3) More training data or data augmentation, (4) Simpler model (reduce max_depth, fewer layers), (5) Early stopping, (6) Ensemble methods (bagging reduces variance), (7) Feature selection (remove irrelevant features). Monitor the gap between training and validation scores — a large gap signals overfitting.
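The train/validation gap mentioned above can be made visible with a small experiment. This is a sketch on synthetic data (a noisy sine curve, chosen for illustration) that fits polynomials of increasing degree with `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)   # true pattern plus noise
    return x, y

x_train, y_train = make_data(30)
x_val, y_val = make_data(200)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    train_mse, val_mse = results[degree]
    print(f"degree {degree:2d}: train MSE {train_mse:.3f} | val MSE {val_mse:.3f} "
          f"| gap {val_mse - train_mse:+.3f}")
```

Training error can only go down as the degree grows, so it is the validation column and the gap that indicate when extra flexibility has stopped helping.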
Explain gradient descent. What is the learning rate?
Gradient descent is an iterative optimisation algorithm that minimises a loss function $L(\theta)$ by updating parameters in the opposite direction of the gradient: $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$. The learning rate $\eta$ controls step size: too large = oscillates or diverges; too small = converges too slowly. Variants: Batch GD (all data), Stochastic GD (one sample), Mini-batch GD (small batches). Adaptive methods (Adam, RMSProp) adjust the learning rate per parameter automatically.
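A minimal sketch of the update rule on a toy loss $L(\theta) = \theta^2$ (chosen here for illustration, so the gradient is $2\theta$ and the minimum is at 0) shows the three learning-rate regimes:

```python
def gradient_descent(grad, theta0, lr, steps):
    """Plain gradient descent: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Toy loss L(theta) = theta^2, so the gradient is 2 * theta
grad = lambda theta: 2.0 * theta

for lr in (0.001, 0.1, 1.1):
    theta = gradient_descent(grad, theta0=5.0, lr=lr, steps=50)
    print(f"lr={lr}: theta after 50 steps = {theta:.4g}")
```

With `lr=0.001` the parameter has barely moved from 5, with `lr=0.1` it is close to 0, and with `lr=1.1` each step overshoots and the magnitude grows.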
What is the difference between L1 and L2 regularisation?
L1 (Lasso): Adds $\lambda\sum|\theta_i|$ penalty. Produces sparse models — drives some weights exactly to zero → automatic feature selection. Useful when you suspect many features are irrelevant.
L2 (Ridge): Adds $\lambda\sum\theta_i^2$ penalty. Shrinks all weights toward zero but rarely to exactly zero → keeps all features. Better when all features are expected to contribute.
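ElasticNet: Combines both: $\lambda_1\sum|\theta_i| + \lambda_2\sum\theta_i^2$. Useful for high-dimensional data with groups of correlated features, where pure L1 tends to keep one feature from a group and drop the rest somewhat arbitrarily.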
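Why L1 produces exact zeros while L2 only shrinks can be seen in the simplest possible setting: one weight at a time with a squared-error term $\tfrac{1}{2}(w - w_{\text{ols}})^2$, which is what the problem reduces to when features are orthonormal (an assumption made here for illustration). Both penalties then have closed-form minimisers; the weights below are made up.

```python
import numpy as np

def l1_solution(w_ols, lam):
    """Minimiser of 0.5*(w - w_ols)^2 + lam*|w| per weight (soft-thresholding)."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

def l2_solution(w_ols, lam):
    """Minimiser of 0.5*(w - w_ols)^2 + lam*w^2 per weight (uniform shrinkage)."""
    return w_ols / (1.0 + 2.0 * lam)

w_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])   # unregularised weights (made up)
lam = 0.5
w_l1 = l1_solution(w_ols, lam)
w_l2 = l2_solution(w_ols, lam)
print("L1:", w_l1)   # weights smaller than lam in magnitude become exactly 0
print("L2:", w_l2)   # every weight is halved, none reaches 0
```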
How does a Random Forest work? Why is it better than a single decision tree?
A Random Forest is a bagging ensemble of decision trees. For each tree: (1) Bootstrap sample (sample n data points with replacement), (2) At each split, consider only a random subset of $\sqrt{p}$ features (feature randomness). Final prediction = majority vote (classification) or mean (regression). Why better: Individual deep trees have high variance (overfit). By averaging many trees trained on different subsets, variance is dramatically reduced while bias stays low. The feature randomness decorrelates trees — if one feature dominates, not all trees use it.
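The two sources of randomness described above can be sketched directly with NumPy. The sample and feature counts are arbitrary, and this shows only the sampling step, not tree fitting.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10_000, 16

# (1) Bootstrap: draw n row indices with replacement
boot_idx = rng.integers(0, n_samples, size=n_samples)
in_bag_fraction = len(np.unique(boot_idx)) / n_samples
print(f"unique rows in the bootstrap sample: {in_bag_fraction:.1%}")
print(f"out-of-bag rows:                     {1 - in_bag_fraction:.1%}")

# (2) Feature randomness: each split only looks at sqrt(p) random features
k = int(np.sqrt(n_features))
split_features = rng.choice(n_features, size=k, replace=False)
print(f"features considered at this split: {sorted(split_features.tolist())}")
```

For large n a bootstrap sample contains roughly $1 - 1/e \approx 63\%$ of the distinct rows; the remaining rows are the out-of-bag samples used for OOB error.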
What is cross-validation and why use it instead of a single train/test split?
K-Fold CV splits data into k folds, trains on k-1 folds and evaluates on the held-out fold, rotating k times. The final score is the mean across folds. Advantages over single split: (1) Every sample is used for both training and validation → more reliable estimate, (2) The standard deviation of CV scores tells you how consistent (stable) the model is, (3) Reduces the "lucky/unlucky split" variance. Disadvantage: k× slower. Use StratifiedKFold for classification to preserve class proportions in every fold.
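The fold rotation can be written out by hand to see what a k-fold splitter does. `k_fold_indices` is a hypothetical helper for illustration; in practice sklearn's `KFold` or `StratifiedKFold` would be used.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train on {len(train_idx)} rows, "
          f"validate on rows {sorted(val_idx.tolist())}")
```

Every row lands in the validation fold exactly once and in the training folds k-1 times.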
What is data leakage? Give a concrete example.
Data leakage means information from the future or the test set "leaks" into model training, giving falsely optimistic results. Example 1 (preprocessing leakage): Fitting StandardScaler on all 10,000 samples before splitting. The test set's statistics (mean, std) then influence the scaler, so the test score no longer measures performance on truly unseen data. Always fit on train, transform on test. Example 2 (feature leakage): In a loan default prediction, including "loan_write_off_amount" as a feature. It is 0 for non-defaulters but non-zero for defaulters, so it directly encodes the target. Example 3 (temporal leakage): Using November's data to predict September churn.
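Preprocessing leakage (Example 1) is easy to see with plain NumPy. The sketch below uses synthetic data in which the test split has deliberately shifted, and computes the scaling statistics both ways; sklearn's `StandardScaler` follows the same fit-on-train, transform-on-test pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50, scale=5, size=800)
test = rng.normal(loc=70, scale=5, size=200)    # test distribution has shifted

# Leaky: statistics computed on train + test together
all_data = np.concatenate([train, test])
leaky_mean, leaky_std = all_data.mean(), all_data.std()

# Correct: statistics come from the training split only
train_mean, train_std = train.mean(), train.std()

print(f"leaky scaler : mean={leaky_mean:.1f}, std={leaky_std:.1f}")
print(f"train-only   : mean={train_mean:.1f}, std={train_std:.1f}")

# The same train-only statistics are reused to transform the test split
test_scaled = (test - train_mean) / train_std
print(f"scaled test mean: {test_scaled.mean():.2f}")
```

The leaky statistics already "know" that the test data sits higher, which is exactly the information a deployed model would not have.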
Explain the bias-variance tradeoff with examples.
Bias: Error from overly simplistic assumptions. A linear model fitting quadratic data has high bias — it systematically underfits regardless of training size. Variance: Error from sensitivity to training data. A depth-30 decision tree memorises every training noise point — it changes wildly on different training samples. Tradeoff: Increasing model complexity reduces bias but increases variance; regularisation increases bias but reduces variance. The goal is the sweet spot with minimum total error = Bias² + Variance + Noise. Learning curves diagnose the issue visually.
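For squared error, the decomposition named above can be written out as follows. The notation is standard rather than taken from this module: $f$ is the true function, $\hat{f}$ is the model fitted on a random training set, and $\sigma^2$ is the variance of the irreducible noise in $y = f(x) + \varepsilon$.

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}$$

The expectation is taken over training sets and noise, which is why variance can only be estimated by refitting on different samples (for example via cross-validation or bootstrapping).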
When would you choose precision over recall (and vice versa)?
Prioritise Recall when the cost of a False Negative is high: medical screening (missing cancer = death), fraud detection (missing fraud = financial loss). You'd rather have false alarms than miss the real case. Lower the threshold.
Prioritise Precision when the cost of a False Positive is high: email spam filter (legitimate email in spam = user loses important mail), legal content moderation (wrongly removing content = censorship). F-beta score: set beta > 1 for recall-focus, beta < 1 for precision-focus.
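The effect of beta can be checked with a few lines of arithmetic. The counts below are made up and chosen so that recall (0.80) is higher than precision (about 0.67).

```python
def f_beta(tp, fp, fn, beta):
    """F-beta from confusion-matrix counts; beta > 1 favours recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up counts: precision = 80/120, recall = 80/100
tp, fp, fn = 80, 40, 20
for beta in (0.5, 1, 2):
    print(f"F{beta}: {f_beta(tp, fp, fn, beta):.3f}")
```

Because recall is the stronger of the two here, the score rises as beta increases (0.690, 0.727, 0.769); with precision higher than recall the ordering would flip.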
What is gradient boosting, and how does it differ from bagging?
Bagging (e.g., Random Forest): Train multiple trees in parallel on bootstrapped data; average predictions. Reduces variance. Trees are independent. Boosting (e.g., XGBoost): Train trees sequentially; each new tree fits the residuals (errors) of the previous ensemble. Primarily reduces bias, because many weak learners are combined into a strong one; shrinkage and subsampling keep the added variance in check. Uses gradient descent in function space: each tree approximates the negative gradient of the loss with respect to the current predictions. XGBoost adds L1/L2 regularisation, shrinkage (learning_rate), and column subsampling on top.
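The "fit the residuals, take a small step" loop can be sketched from scratch for squared error. This is a toy illustration with one-split stumps on synthetic 1-D data, not how XGBoost is implemented; `fit_stump` is a hypothetical helper.

```python
import numpy as np

def fit_stump(x, residuals):
    """Best single-split regression stump on 1-D x (minimises squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:             # candidate split points
        left = x <= threshold
        left_mean, right_mean = residuals[left].mean(), residuals[~left].mean()
        sse = np.sum((residuals - np.where(left, left_mean, right_mean)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda q: np.where(q <= threshold, left_mean, right_mean)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, 200))
y = np.sin(x) + rng.normal(0, 0.2, 200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())              # start from a constant model
history = []
for _ in range(100):
    residuals = y - prediction                      # negative gradient of 0.5*(y - F)^2
    stump = fit_stump(x, residuals)
    prediction += learning_rate * stump(x)          # shrunken step in function space
    history.append(float(np.mean((y - prediction) ** 2)))

print(f"train MSE after 1 round:    {history[0]:.4f}")
print(f"train MSE after 100 rounds: {history[-1]:.4f}")
```

Training error keeps falling with more rounds, which is why boosting needs a validation set or early stopping to decide when to stop.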
How do you handle a severely imbalanced dataset (e.g., 99:1)?
1. Change the evaluation metric: Use Precision-Recall AUC, F1, or F-beta instead of accuracy.
2. Class weights: class_weight='balanced' in sklearn — automatically upweights the minority class in the loss.
3. Threshold tuning: Adjust decision threshold from 0.5 to a value that optimises your target metric (precision, recall, F1).
4. Resampling: SMOTE oversampling (creates synthetic minority samples) or random undersampling of majority class.
5. Anomaly detection framing: If very rare positive class (<1%), train an anomaly detector on the majority class (Isolation Forest, One-Class SVM).
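Point 1 is easiest to appreciate with the degenerate case. The sketch below builds a made-up 99:1 dataset and scores a "model" that always predicts the majority class.

```python
# 99:1 dataset: 990 negatives, 10 positives
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000            # a "model" that always predicts the majority class

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy: {accuracy:.1%}")   # 99.0%
print(f"recall:   {recall:.1%}")     # 0.0%
```

Accuracy looks excellent while not a single positive case is found, which is why recall, F1, or PR-AUC are the metrics to report here.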

Recap

  • You can answer the ML interview questions above and explain the reasoning behind each answer.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 89 — Portfolio Building

Building an ML Portfolio

Why this matters

Building an ML Portfolio: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

GitHub Repository Structure

Code Example
ml-projects/
├── 01-house-prices/
│   ├── README.md              ← Problem, dataset, results, key learnings
│   ├── notebooks/
│   │   ├── 01_eda.ipynb
│   │   ├── 02_feature_engineering.ipynb
│   │   └── 03_modelling.ipynb
│   ├── src/
│   │   ├── features.py        ← Reusable feature engineering
│   │   ├── train.py           ← Training script
│   │   └── predict.py         ← Prediction script
│   ├── models/
│   │   └── xgb_pipeline.joblib
│   └── requirements.txt
├── 02-churn-prediction/
├── 03-customer-segmentation/
└── README.md                  ← Portfolio overview with links and screenshots
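As a rough idea of what `src/train.py` and `src/predict.py` might contain, here is a minimal sketch rolled into one file. Everything in it is an assumption for illustration: the synthetic data, the `Ridge` model standing in for the XGBoost pipeline named in the tree, and the temporary directory used in place of `models/`.

```python
"""Hypothetical src/train.py and src/predict.py, combined for brevity."""
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def train(model_path: Path) -> None:
    """Fit preprocessing + model as one pipeline and persist it as one artifact."""
    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
    pipeline = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])
    pipeline.fit(X, y)
    joblib.dump(pipeline, model_path)

def predict(model_path: Path, rows):
    """Load the saved pipeline and score new rows."""
    pipeline = joblib.load(model_path)
    return pipeline.predict(rows)

with tempfile.TemporaryDirectory() as tmp:       # stand-in for the models/ folder
    path = Path(tmp) / "pipeline.joblib"
    train(path)
    X_new, _ = make_regression(n_samples=3, n_features=5, random_state=0)
    preds = predict(path, X_new)
    print(preds.shape)  # (3,)
```

Saving the whole pipeline rather than just the model means the prediction script applies exactly the same preprocessing as training did.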

What Makes a Strong Portfolio Project

  • Clear problem statement — what business question does this solve?
  • Non-trivial EDA with insights, not just code dumps
  • Evidence of good ML practices: pipelines, cross-validation, no data leakage
  • Baseline model + multiple iterations with improvement narrative
  • Proper evaluation (not just accuracy — explain why you chose your metric)
  • Key learnings — what would you do differently? What surprised you?
  • Interactive demo (Streamlit app, Gradio, or deployed FastAPI endpoint)
  • Clean, readable code with docstrings

Top Kaggle Competitions for Portfolio

Competition | Type | Why It's Good
Titanic - Machine Learning from Disaster | Binary classification | Classic beginner project; well-documented; easy to run
House Prices - Advanced Regression | Regression | Feature engineering heavy; lots of creativity room
Spaceship Titanic | Binary classification | Fun theme, tabular, good for feature engineering
Store Sales - Time Series Forecasting | Time series | Real-world business problem; teaches temporal CV
Playground Series (monthly) | Various | Kaggle-generated synthetic data; fresh each month
Any tabular competition (top 100 leaderboard) | Various | High-quality notebooks from top performers to learn from

Writing Technical Blog Posts

A well-written blog post demonstrates communication skills — essential for ML roles. Structure:

  1. Hook: Start with the problem and why it matters ($X million saved, 30% improvement)
  2. Data exploration: 3–5 key visualisations with insights (not just code)
  3. Methodology: Your approach and the reasoning behind it
  4. Results: Metrics, comparison table, business impact
  5. Lessons learned: What didn't work, what surprised you
  6. Code link: Always link to the GitHub repo

Best platforms: Medium (Towards Data Science publication), Substack, or your own GitHub Pages site.

Recap

  • You can explain how to build an ML portfolio and what makes a project strong.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 90 — Mid-Review

Module 1–7 Review — Key Concept Checklist

Why this matters

Module 1–7 Review: This checklist connects directly to model quality, debugging, and interviews, so work through it before moving to the next day.

Module 1–2: Foundations & EDA

  • Supervised vs Unsupervised vs Reinforcement Learning — can explain with examples
  • Batch vs Online Learning — know when to use each
  • NumPy: vectorisation, broadcasting, matrix operations
  • Pandas: groupby, merge, pivot, apply, missing value handling
  • EDA: univariate/bivariate/multivariate analysis, detecting outliers, understanding distributions

Module 3: Preprocessing

  • Imputation strategies (mean vs median vs KNN vs MICE)
  • Scaling (StandardScaler vs MinMaxScaler vs RobustScaler — when to use each)
  • Categorical encoding (OHE vs Label vs Target vs Ordinal)
  • Building sklearn Pipelines + ColumnTransformer — no data leakage
  • Class imbalance: SMOTE, class_weight, threshold tuning

Module 4: Supervised Learning

  • Linear/Logistic Regression — cost functions, gradient descent, regularisation
  • Decision Trees — Gini impurity, information gain, overfitting via depth
  • SVM — max-margin classifier, kernel trick, C and gamma parameters
  • Random Forest — bagging, feature randomness, OOB error
  • Gradient Boosting (XGBoost) — sequential trees, regularisation, early stopping

Module 5: Unsupervised Learning

  • K-Means — WCSS, K-Means++, Elbow method, Silhouette score
  • DBSCAN — epsilon, min_samples, core/border/noise points
  • PCA — explained variance ratio, scree plot, n_components selection
  • t-SNE — perplexity, NEVER use as ML features
  • Isolation Forest — anomaly score, contamination parameter

Module 6: Evaluation & Tuning

  • Stratified K-Fold CV — why stratify, reading CV score vs train score gap
  • Precision/Recall/F1 — formulas, tradeoffs, when to use which
  • ROC-AUC vs PR-AUC — when imbalanced data makes ROC misleading
  • Regression metrics (MAE vs RMSE vs R²): prefer RMSE when large residuals should be penalised more heavily
  • Bias-variance tradeoff — learning curve diagnosis
  • Hyperparameter tuning — Grid → Random → Optuna progression

Module 7: ML Life Cycle

  • Problem framing — business metric ≠ ML metric (but must align)
  • Full end-to-end pipeline in a single reproducible script
  • MLflow experiment tracking — log params, metrics, artifacts, models
  • Case study skills: EDA → features → model → evaluation → insights
  • Portfolio: 3+ strong projects on GitHub + 1 deployed demo
🎯
You've Completed 90% of the Journey!

The first 90 days cover the core skills you need to be a competent ML practitioner. Module 8 (through Day 100) will teach you to take models from notebooks into production, the skill that separates senior engineers from Kaggle participants.

Recap

  • You can explain the key concepts from Modules 1–7 and when each applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Continue to the next day in this module.

[Figure: End-to-End Machine Learning Product Lifecycle. Stages: 1. Business Goal, 2. Data Prep & EDA, 3. Model Training, 4. Deployment, Monitor]