
Module 6: Model Evaluation & Hyperparameter Tuning

100 Days of ML Module 6 — Cross-validation, confusion matrix, precision/recall/F1, ROC-AUC, regression metrics, bias-variance tradeoff, GridSearchCV, Optuna, and advanced ensembles.

⏱ 55 Min Read 80 Updated: May 2026

Building a model is only half the battle. Knowing whether it's actually good — and systematically improving it — is the other half. This module teaches you to evaluate models rigorously (avoiding data leakage), interpret every major metric, understand the bias-variance tradeoff, and tune hyperparameters efficiently with GridSearchCV, RandomizedSearchCV, and Optuna.

Train/Test Split & Data Leakage

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

The Fundamental Split

Before training any model, you must hold out a portion of your data as a test set — data the model never sees during training. This gives an unbiased estimate of generalisation performance.

Code Example
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# ── Basic train/test split ────────────────────────────────────
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing (80/20 is standard)
    random_state=42,    # Reproducibility seed
    stratify=y          # CRITICAL for classification: maintains class proportions
)
print(f"Train size: {X_train.shape[0]}, Test size: {X_test.shape[0]}")
print(f"Train class balance: {y_train.value_counts(normalize=True).round(3).to_dict()}")
print(f"Test class balance:  {y_test.value_counts(normalize=True).round(3).to_dict()}")
# With stratify=y, both have the same class proportions

# ── Three-way split: Train / Validation / Test ─────────────────
# NEVER tune hyperparameters using the test set!
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp)
# 0.176 of 0.85 ≈ 0.15, giving roughly 70/15/15 split

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

The stratify Parameter

When your dataset has class imbalance (e.g., 95% non-fraud, 5% fraud), a purely random split can leave one split with noticeably fewer fraud samples than the other, and on small datasets possibly none at all. stratify=y ensures both splits keep the same class ratio as the full dataset.
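To see the effect of stratify=y in isolation, here is a minimal sketch on synthetic labels. The 950/50 class counts and the placeholder feature are assumptions made up for illustration, not data from this module.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative labels: 950 non-fraud (0) and 50 fraud (1)
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not meaningful

# Plain random split: the fraud share in the test set depends on the seed
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: both splits keep the 5% fraud share
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(f"Plain test fraud rate:       {y_test_plain.mean():.3f}")
print(f"Stratified train fraud rate: {y_train_strat.mean():.3f}")
print(f"Stratified test fraud rate:  {y_test_strat.mean():.3f}")
```

With stratification, the 200-sample test set receives 10 of the 50 fraud cases and the training set the other 40, so both sit at 5%.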

⚠️
Data Leakage — The #1 Evaluation Mistake

Data leakage occurs when information from the test set "leaks" into the training process. Common causes:

  • Scaling before splitting: Fitting StandardScaler on the full dataset lets test set statistics influence the scaler. Always fit() on train, transform() on test.
  • Imputation before splitting: Same issue — fit imputers only on training data.
  • Target leakage: Including features that are derived from or correlated with the target in a way that wouldn't exist at prediction time (e.g., using "amount_refunded" to predict "will_be_refunded").
  • Temporal leakage: In time-series, using future data to predict the past.
Code Example
# ── WRONG way (data leakage!) ─────────────────────────────────
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # WRONG: uses test set stats!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

# ── CORRECT way ───────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit ONLY on train
X_test_scaled  = scaler.transform(X_test)         # transform only
# Use Pipelines to automate this correctly (prevents leakage automatically)
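The Pipeline approach mentioned in the last comment can be sketched as follows. This is an illustrative example, not the module's own code: LogisticRegression and the breast cancer dataset are assumptions chosen to keep the demo short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is re-fitted on the training part of every fold,
# so the held-out fold never influences the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"Leak-free CV accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```

Because the scaler lives inside the estimator, any tool that calls fit() on training data only (cross_val_score, GridSearchCV) handles preprocessing correctly without extra bookkeeping.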

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain train/test split & data leakage and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 72 — Cross-Validation

K-Fold Cross-Validation

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Why Cross-Validation?

A single train/test split can be "lucky" or "unlucky" depending on which data ends up in each set. K-Fold Cross-Validation evaluates a model on $k$ different train/test splits of the data, giving a more reliable and less variance-prone estimate of performance.

5-Fold Cross-Validation Structure

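Fold 1: TEST | TRAIN
Fold 2: TRAIN | TEST | TRAIN
Fold 3: TRAIN | TEST | TRAIN
(Folds 4 and 5 continue the pattern: each fold holds out a different fifth of the data as TEST.)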

Final score = mean of 5 fold scores; std = reliability of the estimate
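To make the fold mechanics explicit, the loop below does by hand roughly what a cross-validation helper does internally. It is a sketch under assumptions: the scaled LogisticRegression model and the breast cancer dataset are stand-ins picked for speed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Fresh model per fold so nothing carries over between folds
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X[train_idx], y[train_idx])
    score = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    fold_scores.append(score)
    print(f"Fold {fold}: test size={len(test_idx)}, accuracy={score:.4f}")

fold_scores = np.array(fold_scores)
print(f"Mean = {fold_scores.mean():.4f}, std = {fold_scores.std():.4f}")
```

Every sample lands in the test portion exactly once, which is why the mean over folds uses all of the data for evaluation.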

Code Example
from sklearn.model_selection import (cross_val_score, cross_validate,
                                     KFold, StratifiedKFold, LeaveOneOut)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# ── Simple cross_val_score ────────────────────────────────────
cv_scores = cross_val_score(
    model, X, y,
    cv=5,                 # Number of folds (5 or 10 are standard)
    scoring='roc_auc',    # Metric to evaluate
    n_jobs=-1             # Use all CPU cores
)
print(f"CV ROC-AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
print(f"Individual fold scores: {cv_scores.round(4)}")

# ── StratifiedKFold — ALWAYS use for classification ───────────
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
print(f"Stratified F1: {stratified_scores.mean():.4f} ± {stratified_scores.std():.4f}")

# ── cross_validate — multiple metrics at once ─────────────────
cv_results = cross_validate(
    model, X, y,
    cv=skf,
    scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'],
    return_train_score=True,  # Also compute training scores
    n_jobs=-1
)
for metric in ['accuracy', 'f1', 'roc_auc']:
    train_mean = cv_results[f'train_{metric}'].mean()
    test_mean  = cv_results[f'test_{metric}'].mean()
    print(f"{metric:12s}: Train={train_mean:.4f}, CV={test_mean:.4f}")
# Large gap between train and CV → overfitting!

# ── Leave-One-Out (LOO) — for very small datasets ─────────────
loo = LeaveOneOut()
loo_scores = cross_val_score(model, X[:50], y[:50], cv=loo)  # Only use 50 samples for speed
print(f"LOO Accuracy (50 samples): {loo_scores.mean():.4f}")
💡
K-Fold Best Practices
  • Use k=5 or k=10 as the standard. k=5 is faster; k=10 gives slightly less biased estimates.
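  • Always use StratifiedKFold for classification. Plain KFold can leave a fold with few or no rare-class samples. (Passing an integer cv to cross_val_score with a classifier already stratifies, but without shuffling.)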
  • For time-series data, use TimeSeriesSplit — never shuffle temporal data.
  • The standard deviation of CV scores tells you how consistent the model is. High std = unstable model.
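For the time-series case, a small sketch shows how TimeSeriesSplit keeps every training index earlier than every test index. The 12 placeholder observations are an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations (row index = time step); values are placeholders
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows with each fold while the test window moves forward in time, so the model is never evaluated on data older than what it was trained on.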

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain k-fold cross-validation and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 73 — Confusion Matrix

Confusion Matrix

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Worked example — Confusion matrix counts

TP=80, FP=20, FN=10, TN=90 → Accuracy = 170/200 = 85%, Precision = 80/100 = 80%, Recall = 80/90 ≈ 89%. In fraud detection, FN (missed fraud) often costs more than FP — optimize recall, not accuracy.
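The arithmetic in the worked example can be checked in a few lines of plain Python. The F1 line is an addition that uses the standard formula F1 = 2TP / (2TP + FP + FN).

```python
tp, fp, fn, tn = 80, 20, 10, 90  # counts from the worked example

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```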

The Four Outcomes

For a binary classification problem (Positive = disease, Negative = healthy):

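|                 | Predicted Positive          | Predicted Negative           |
|-----------------|-----------------------------|------------------------------|
| Actual Positive | TP: True Positive           | FN: False Negative (Type II) |
| Actual Negative | FP: False Positive (Type I) | TN: True Negative            |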
  • TP (True Positive): Correctly predicted as positive (sick → detected as sick)
  • TN (True Negative): Correctly predicted as negative (healthy → detected as healthy)
  • FP (False Positive — Type I Error): Predicted positive, actually negative (healthy → detected as sick)
  • FN (False Negative — Type II Error): Predicted negative, actually positive (sick → detected as healthy) — usually more dangerous!
Code Example
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             classification_report)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# ── Confusion Matrix ──────────────────────────────────────────
cm = confusion_matrix(y_test, y_pred)
print("Raw confusion matrix:")
print(cm)
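# Layout for labels [0, 1]:
# [[TN, FP],
#  [FN, TP]]
# Note: in load_breast_cancer, 0 = malignant and 1 = benign, so sklearn
# treats "benign" (label 1) as the positive class in this example.

# ── Visualise with seaborn ────────────────────────────────────
plt.figure(figsize=(6, 5))
sns.heatmap(
    cm, annot=True, fmt='d', cmap='Blues',
    xticklabels=['Predicted malignant (0)', 'Predicted benign (1)'],
    yticklabels=['Actual malignant (0)', 'Actual benign (1)']
)
plt.title('Confusion Matrix: Cancer Detection')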
plt.tight_layout()
plt.show()

# ── sklearn's built-in display ────────────────────────────────
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=load_breast_cancer().target_names
)
disp.plot(cmap='Blues', colorbar=False)
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

# ── Full classification report ────────────────────────────────
print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))
# Example output (exact values may vary with library version):
#               precision    recall  f1-score   support
#    malignant       0.97      0.95      0.96        42
#       benign       0.97      0.99      0.98        72
#     accuracy                           0.97       114

Accuracy Paradox

Problem with Accuracy: If 97% of transactions are legitimate and 3% are fraud, a model that always predicts "legitimate" achieves 97% accuracy yet catches no fraud at all. For imbalanced datasets, report precision, recall, F1, and ROC-AUC or PR-AUC rather than relying on accuracy alone.
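The paradox is easy to reproduce. The sketch below uses made-up labels (970 legitimate, 30 fraud, matching the 97/3 ratio in the text) and an "always legitimate" baseline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative labels: 970 legitimate (0) and 30 fraud (1)
y_true = np.array([0] * 970 + [1] * 30)
y_pred = np.zeros_like(y_true)  # baseline that always predicts "legitimate"

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
prec = precision_score(y_true, y_pred, zero_division=0)  # no positives predicted

print(f"Accuracy:  {acc:.2f}")
print(f"Recall:    {rec:.2f}")
print(f"Precision: {prec:.2f}")
```

Accuracy is 0.97 while recall on the fraud class is 0.00, which is exactly the failure that accuracy hides.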

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain the confusion matrix and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 74 — Precision & Recall

Precision, Recall, F1-Score & Averaging Strategies

Precision vs. Recall Overlap Mapping

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Figure: overlapping sets of Actual Positives and Predicted Positives; the overlap is the True Positives.

The Core Formulas

$$\text{Precision} = \frac{TP}{TP + FP} \quad \text{(of all positives predicted, how many were correct?)}$$
$$\text{Recall (Sensitivity)} = \frac{TP}{TP + FN} \quad \text{(of all actual positives, how many did we catch?)}$$
$$\text{F1} = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \quad \text{(harmonic mean of Precision and Recall)}$$

Precision-Recall Tradeoff

By adjusting the classification decision threshold (default 0.5), you can trade precision for recall:

  • Higher threshold (e.g., 0.8): More conservative — fewer false positives, more false negatives → Higher Precision, Lower Recall
  • Lower threshold (e.g., 0.3): More aggressive — catch more positives, but more false alarms → Lower Precision, Higher Recall
When to Prioritise Recall: Medical diagnosis (missing cancer is worse than a false alarm), fraud detection (missing fraud is costly).
When to Prioritise Precision: Email spam filter (legitimate emails in spam = bad user experience), recommendation systems.
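The tradeoff can be traced by hand on a tiny example. The ten labels and scores below are invented for illustration; only the thresholds 0.3, 0.5 and 0.8 come from the text.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hand-made labels and predicted probabilities, purely illustrative
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.65, 0.70, 0.85, 0.90])

results = {}
for threshold in (0.3, 0.5, 0.8):
    y_pred = (y_score >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    results[threshold] = (p, r)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, precision rises from 0.625 to 0.8 to 1.0 as the threshold increases, while recall falls from 1.0 to 0.8 to 0.4.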

Averaging for Multi-Class

| Average  | Calculation | When to Use |
|----------|-------------|-------------|
| Macro    | Simple mean across all classes (equal weight to each class) | When all classes are equally important; sensitive to minority-class performance |
| Weighted | Mean weighted by class support (number of samples) | Imbalanced datasets where larger classes should count more; reported as "weighted avg" by sklearn's classification_report |
| Micro    | Aggregate TP/FP/FN across all classes, then compute | Equal weight to each sample; for single-label multi-class problems, micro-F1 equals accuracy |
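A small three-class example makes the difference between the averages visible. The labels below are invented so that the rare class (2) is never predicted; they are an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Illustrative labels: class 0 is common, class 2 is rare and always missed
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
f1_micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
acc = accuracy_score(y_true, y_pred)

print(f"Macro F1:    {f1_macro:.3f}")
print(f"Weighted F1: {f1_weighted:.3f}")
print(f"Micro F1:    {f1_micro:.3f}  (accuracy = {acc:.3f})")
```

Macro F1 (about 0.479) is pulled down hardest by the missed rare class, weighted F1 (about 0.662) much less so, and micro F1 matches accuracy at 0.70.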
Code Example
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             fbeta_score, precision_recall_curve)
import matplotlib.pyplot as plt
import numpy as np

# ── Binary metrics ────────────────────────────────────────────
precision = precision_score(y_test, y_pred)
recall    = recall_score(y_test, y_pred)
f1        = f1_score(y_test, y_pred)
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")

# ── F-Beta score — weight recall more than precision ──────────
# beta > 1 → recall more important; beta < 1 → precision more important
f2 = fbeta_score(y_test, y_pred, beta=2)  # Recall is 2x more important
print(f"F2 Score (recall-focused): {f2:.4f}")

# ── Multi-class averaging ─────────────────────────────────────
# y_multi = multiclass labels
# f1_macro    = f1_score(y_multi, y_pred_multi, average='macro')
# f1_weighted = f1_score(y_multi, y_pred_multi, average='weighted')
# f1_micro    = f1_score(y_multi, y_pred_multi, average='micro')

# ── Threshold tuning with Precision-Recall curve ─────────────
y_scores = model.predict_proba(X_test)[:, 1]  # Probability of positive class
precisions, recalls, thresholds = precision_recall_curve(y_test, y_scores)

plt.figure(figsize=(8, 5))
plt.subplot(1, 2, 1)
plt.plot(recalls[:-1], precisions[:-1], 'b-', linewidth=2)
plt.xlabel('Recall'); plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(thresholds, precisions[:-1], 'b-', label='Precision')
plt.plot(thresholds, recalls[:-1], 'r-', label='Recall')
plt.xlabel('Threshold'); plt.ylabel('Score')
plt.title('Precision and Recall vs Threshold')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()

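# ── Find optimal threshold (maximise F1) ─────────────────────
# NOTE: the threshold is picked on the test set here only for brevity.
# In practice choose it on a validation set (or via CV) and report once on test.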
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)
best_threshold = thresholds[np.argmax(f1_scores)]
print(f"Optimal threshold: {best_threshold:.4f}")
y_pred_optimal = (y_scores >= best_threshold).astype(int)
print(f"F1 at optimal threshold: {f1_score(y_test, y_pred_optimal):.4f}")

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain precision, recall, F1-score and averaging strategies, and when each applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 75 — ROC-AUC Curve

ROC-AUC Curve & Precision-Recall AUC

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

ROC Curve

The Receiver Operating Characteristic curve plots TPR (True Positive Rate = Recall) vs FPR (False Positive Rate = 1 - Specificity) at every possible threshold:

$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}$$

AUC Interpretation

| AUC Value | Meaning |
|-----------|---------|
| 1.0       | Perfect classifier: correctly ranks all positives above all negatives |
| 0.9–0.99  | Excellent |
| 0.8–0.9   | Good |
| 0.7–0.8   | Fair |
| 0.5       | Random guessing (diagonal line) |
| < 0.5     | Worse than random: predictions are inverted |
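The "ranks positives above negatives" reading of AUC can be checked directly: AUC equals the fraction of (positive, negative) pairs in which the positive gets the higher score. The labels and scores below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hand-made labels and scores, purely illustrative
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.65, 0.70, 0.85, 0.90])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive with every negative; ties count as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"Pairwise ranking AUC: {pairwise_auc:.4f}")
print(f"roc_auc_score:        {roc_auc_score(y_true, y_score):.4f}")
```

Here 24 of the 25 pairs are ordered correctly, so both numbers come out at 0.96.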
Code Example
from sklearn.metrics import roc_curve, roc_auc_score, average_precision_score
import matplotlib.pyplot as plt

y_scores = model.predict_proba(X_test)[:, 1]

# ── ROC Curve ─────────────────────────────────────────────────
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
auc_score = roc_auc_score(y_test, y_scores)

plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--', label='Random Classifier (AUC = 0.5)')
plt.fill_between(fpr, tpr, alpha=0.1, color='blue')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity / Recall)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"ROC-AUC: {auc_score:.4f}")

# ── Precision-Recall AUC (better for imbalanced datasets) ─────
from sklearn.metrics import precision_recall_curve, auc as sklearn_auc

precisions, recalls, _ = precision_recall_curve(y_test, y_scores)
pr_auc = sklearn_auc(recalls, precisions)
avg_precision = average_precision_score(y_test, y_scores)

plt.figure(figsize=(8, 5))
plt.plot(recalls, precisions, 'g-', linewidth=2, label=f'PR Curve (AP = {avg_precision:.4f})')
plt.axhline(y=y_test.mean(), color='r', linestyle='--', label=f'Random ({y_test.mean():.2f})')
plt.xlabel('Recall'); plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()

# ── Multi-class ROC-AUC ───────────────────────────────────────
# roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro')
# 'ovr' = One-vs-Rest; 'ovo' = One-vs-One
📌
ROC-AUC vs PR-AUC for Imbalanced Data

ROC-AUC can be misleading for heavily imbalanced datasets. With a 99% negative class, even a low FPR can translate into more false positives than true positives, so the ROC curve looks good while precision is poor. For imbalanced classification (fraud, rare disease), prefer PR-AUC (Average Precision) over ROC-AUC as your primary metric.
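A quick back-of-the-envelope calculation illustrates the point. The class counts and the operating point (TPR 0.80, FPR 0.05) are assumed values chosen for illustration, not results from a real model.

```python
n_pos, n_neg = 100, 9_900   # illustrative 1% positive rate
tpr, fpr = 0.80, 0.05       # assumed operating point on the ROC curve

tp = tpr * n_pos            # true positives caught
fp = fpr * n_neg            # negatives wrongly flagged
precision = tp / (tp + fp)

print(f"TP = {tp:.0f}, FP = {fp:.0f}")
print(f"Precision at this operating point: {precision:.3f}")
```

An operating point that looks strong in ROC space (80% TPR at 5% FPR) still yields a precision of only about 14%, because 5% of a very large negative class outnumbers the positives.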

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain ROC-AUC and Precision-Recall AUC and when each applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 76 — Regression Metrics

Regression Metrics

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Core Regression Metrics

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
$$RMSE = \sqrt{MSE}$$
$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$
| Metric | Range    | Unit           | Pros | Cons |
|--------|----------|----------------|------|------|
| MAE    | [0, ∞)   | Same as target | Interpretable, robust to outliers | Not differentiable at 0; penalises large errors only linearly |
| MSE    | [0, ∞)   | Squared target | Differentiable, penalises large errors heavily | Unit is squared (hard to interpret), sensitive to outliers |
| RMSE   | [0, ∞)   | Same as target | Interpretable + penalises large errors | Still sensitive to outliers |
| MAPE   | [0%, ∞%) | Percentage     | Scale-independent, easy to explain to business | Explodes when y_i ≈ 0; asymmetric (over-predictions are penalised more than under-predictions) |
| R²     | (-∞, 1]  | Unitless       | Proportion of variance explained; 1.0 = perfect | Can be negative for worse-than-baseline models |
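To connect the formulas to numbers, the sketch below computes each metric by hand on four invented values and cross-checks against sklearn. The last prediction is deliberately a large miss to show how differently MAE and RMSE react.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented values; the last prediction is off by 10 on purpose
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 19.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R²={r2:.2f}")
print(f"sklearn MAE={mean_absolute_error(y_true, y_pred):.2f}  "
      f"MSE={mean_squared_error(y_true, y_pred):.2f}  "
      f"R²={r2_score(y_true, y_pred):.2f}")
```

The single outlier lifts RMSE to about 5.05 while MAE stays at 3.0, and R² drops to -4.1, meaning the predictions are worse than always predicting the mean.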
Code Example
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, mean_absolute_percentage_error)
from sklearn.linear_model import Ridge
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# ── Compute all metrics ───────────────────────────────────────
mae   = mean_absolute_error(y_test, y_pred)
mse   = mean_squared_error(y_test, y_pred)
rmse  = np.sqrt(mse)
mape  = mean_absolute_percentage_error(y_test, y_pred) * 100  # Convert to %
r2    = r2_score(y_test, y_pred)

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.2f}%")
print(f"R²:   {r2:.4f}")

# ── Adjusted R² — penalises unnecessary features ──────────────
n = len(y_test)      # Number of samples
p = X_test.shape[1]  # Number of features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"Adjusted R²: {adj_r2:.4f}")
# Adjusted R² penalises adding features that don't improve the model

# ── Residual plot — most important diagnostic ─────────────────
import matplotlib.pyplot as plt
residuals = y_test - y_pred
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(y_pred, residuals, alpha=0.6, s=20)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values'); plt.ylabel('Residuals')
plt.title('Residual Plot (should be random around 0)')
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=30, edgecolor='black', color='#d4af37', alpha=0.8)
plt.xlabel('Residual'); plt.ylabel('Frequency')
plt.title('Residual Distribution (should be ~Normal)')
plt.tight_layout(); plt.show()

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain regression metrics and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 77 — Bias-Variance

Bias-Variance Tradeoff & Learning Curves

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

The Fundamental Decomposition

For squared-error loss, the expected generalisation error of a model can be decomposed as:

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$
|            | Bias | Variance |
|------------|------|----------|
| Definition | Error from incorrect assumptions in the model (wrong model family) | Error from sensitivity to fluctuations in training data |
| Symptom    | Underfitting: poor on both train and test | Overfitting: great on train, poor on test |
| Example    | Fitting a line to quadratic data | Decision tree with depth=30 memorising training noise |
| Fix        | More complex model, more features, better features | Regularisation, dropout, more data, pruning, ensemble |
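The "line fitted to quadratic data" example can be reproduced with polynomial regression. Everything here is an assumption for illustration: the quadratic signal, the noise level, the sample sizes, and the degrees 1, 2 and 12.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic quadratic signal plus Gaussian noise (illustrative only)
    x = rng.uniform(-1, 1, size=(n, 1))
    y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2 + rng.normal(0, 0.1, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

train_mse, test_mse = {}, {}
for degree in (1, 2, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(x_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.4f}  "
          f"test MSE={test_mse[degree]:.4f}")
```

Degree 1 has high error on both sets (high bias). Degree 12 drives the training error at least as low as degree 2, and a test error that is worse than degree 2 would be the signature of high variance.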

Learning Curves

Learning curves plot training and cross-validation scores as a function of training set size. They are one of the most practical tools for diagnosing bias vs variance.

Code Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# ── Learning curve ────────────────────────────────────────────
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='rbf', C=1.0, random_state=42))
])

train_sizes, train_scores, val_scores = learning_curve(
    pipeline, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),  # 10% to 100% of training data
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    shuffle=True, random_state=42
)

train_mean = train_scores.mean(axis=1)
train_std  = train_scores.std(axis=1)
val_mean   = val_scores.mean(axis=1)
val_std    = val_scores.std(axis=1)

plt.figure(figsize=(9, 5))
plt.plot(train_sizes, train_mean, 'b-o', label='Training Score', linewidth=2)
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
plt.plot(train_sizes, val_mean, 'r-o', label='CV Score', linewidth=2)
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='red')
plt.xlabel('Training Examples')
plt.ylabel('Accuracy')
plt.title('Learning Curve — SVM with RBF Kernel')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# ── Interpreting Learning Curves ──────────────────────────────
print("""
DIAGNOSIS GUIDE:
• Both curves plateau at HIGH score → Good model, no problem
• Both curves plateau at LOW score  → High Bias (Underfitting)
  Fix: More complex model, better features, reduce regularisation
• Large gap between train and CV    → High Variance (Overfitting)
  Fix: More data, regularisation, simpler model, dropout
• CV score still improving with more data → Get more data!
""")
💡
Validation Curve — Best Parameter for Bias/Variance

Use sklearn.model_selection.validation_curve to plot train/CV scores vs a single hyperparameter (e.g., max_depth, C, alpha). This shows exactly where a parameter transitions from underfitting to overfitting — the optimal value is at the peak CV score.
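A minimal validation_curve sketch, assuming a DecisionTreeClassifier and max_depth from 1 to 10 on the breast cancer dataset (both choices are illustrative, not prescribed by this module):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
depths = np.arange(1, 11)

# One row per max_depth value, one column per CV fold
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
best_depth = depths[np.argmax(val_mean)]

for d, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={d:2d}  train={tr:.4f}  cv={va:.4f}")
print(f"Depth with the highest CV accuracy: {best_depth}")
```

Training accuracy keeps climbing with depth, while the CV score typically levels off or dips. The depth where the CV score peaks is the candidate to carry forward.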

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain bias-variance tradeoff & learning curves and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 78 — GridSearchCV

GridSearchCV & RandomizedSearchCV

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

GridSearchCV — Exhaustive Search

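GridSearchCV tries every combination of hyperparameters in a grid. With cv=5 and 3×3×3=27 parameter combinations, it trains 27×5=135 models (plus one final refit when refit=True). Each combination is scored with cross-validation rather than a single validation split, which reduces the risk of overfitting to one particular validation set.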
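The fit count can be checked before launching a search by expanding the grid with ParameterGrid. The 3×3×3 grid below is a hypothetical one chosen to match the arithmetic in the text.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical 3 x 3 x 3 grid, matching the arithmetic in the text
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
cv_fits = n_candidates * n_folds
total_fits = cv_fits + 1  # refit=True adds one final fit on the full training set

print(f"{n_candidates} candidates x {n_folds} folds = {cv_fits} CV fits")
print(f"Total fits including the final refit: {total_fits}")
```

Running this kind of count first is a cheap way to notice that adding one more parameter with a handful of values multiplies the cost.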

Code Example
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from scipy.stats import uniform, randint
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── GridSearchCV ──────────────────────────────────────────────
param_grid = {
    'n_estimators':    [100, 200, 300],
    'max_depth':       [3, 5, 7],
    'learning_rate':   [0.01, 0.05, 0.1],
    'min_samples_leaf': [1, 5]
}
# Total: 3×3×3×2 = 54 combinations × 5 folds = 270 model fits

grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring='roc_auc',       # Optimise for ROC-AUC
    refit=True,              # Refit best model on all training data
    n_jobs=-1,               # Parallelise across all CPU cores
    verbose=1,               # Print progress
    return_train_score=True  # Also track training scores
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best CV ROC-AUC: {grid_search.best_score_:.4f}")
print(f"Test ROC-AUC: {grid_search.score(X_test, y_test):.4f}")

# Access the best estimator directly
best_model = grid_search.best_estimator_

# Explore all results
import pandas as pd
cv_results = pd.DataFrame(grid_search.cv_results_)
top10 = cv_results.sort_values('mean_test_score', ascending=False).head(10)
print(top10[['params', 'mean_test_score', 'std_test_score', 'mean_train_score']])

# ── RandomizedSearchCV — for large search spaces ──────────────
param_dist = {
    'n_estimators':    randint(50, 500),         # Integers 50-499 (upper bound exclusive)
    'max_depth':       randint(2, 12),            # Integers 2-11
    'learning_rate':   uniform(0.001, 0.3),       # uniform(loc, scale): 0.001-0.301
    'subsample':       uniform(0.5, 0.5),         # 0.5-1.0
    'min_samples_leaf': randint(1, 20),           # Integers 1-19
    'max_features':    uniform(0.3, 0.7)          # Fraction of features: 0.3-1.0
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,          # Try 50 random combinations (vs 54+ for grid)
    cv=5,
    scoring='roc_auc',
    refit=True,
    n_jobs=-1,
    random_state=42,
    verbose=1
)
random_search.fit(X_train, y_train)
print(f"\nRandomized Best params: {random_search.best_params_}")
print(f"Randomized Best CV ROC-AUC: {random_search.best_score_:.4f}")
💡
Grid vs Random Search — When to Use Which
  • GridSearchCV: Small parameter spaces (≤ 50 combinations), when you know the right ballpark for each parameter
  • RandomizedSearchCV: Large spaces with many parameters. For a fixed budget it often finds comparably good solutions in fewer fits, because it does not spend trials on every value of unimportant parameters. Use it for first-pass exploration.
  • Optuna (Day 79): Best for large spaces with 10+ hyperparameters — uses Bayesian optimisation to focus on promising regions.
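
To make the size difference concrete, the sketch below counts the grid from the example above with `ParameterGrid` and draws a fixed number of candidates with `ParameterSampler`, the helpers scikit-learn's two search classes use to generate candidates. The reduced `param_dist` and the variable names are illustrative assumptions, not part of the original example.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same grid as the GridSearchCV example above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'min_samples_leaf': [1, 5],
}
grid_candidates = list(ParameterGrid(param_grid))
n_folds = 5
print(f"Grid candidates: {len(grid_candidates)}")            # 3*3*3*2 = 54
print(f"Grid model fits: {len(grid_candidates) * n_folds}")  # 270, plus one final refit

# Illustrative (assumed) distributions: the budget is fixed by n_iter,
# no matter how many parameters or how wide the ranges are
param_dist = {
    'n_estimators': randint(50, 500),      # integers 50..499
    'learning_rate': uniform(0.001, 0.3),  # floats in [0.001, 0.301]
}
random_candidates = list(ParameterSampler(param_dist, n_iter=50, random_state=42))
print(f"Random candidates: {len(random_candidates)}")
print(f"First sampled candidate: {random_candidates[0]}")
```

Adding another parameter to `param_grid` multiplies the grid size, while the random search budget stays at `n_iter` candidates.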

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
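
One common way leakage enters a search is fitting a preprocessing step on the full training set before cross-validation. A minimal sketch of the usual remedy, assuming a scaler plus logistic regression (both chosen here only for illustration), is to tune a `Pipeline` so the scaler is refit inside every fold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The scaler lives inside the pipeline, so each CV fold fits it
# on that fold's training part only (no validation data leaks in)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
search = GridSearchCV(
    pipe,
    param_grid={'clf__C': [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring='roc_auc',
)
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV ROC-AUC: {search.best_score_:.4f}")
print(f"Test ROC-AUC: {search.score(X_test, y_test):.4f}")
```

The test set is touched once, after the search has finished, which keeps the final score an honest estimate.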

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain GridSearchCV and RandomizedSearchCV and when each applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 79 — Optuna

Optuna — Bayesian Hyperparameter Optimisation

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Why Optuna?

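By default, Optuna samples with the Tree-structured Parzen Estimator (TPE), a Bayesian optimisation algorithm that builds a probabilistic model of which hyperparameter regions have scored well so far and concentrates new trials there. In large search spaces it typically needs fewer trials than random search to reach a comparable score, although the gain depends on the problem and the trial budget.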

Code Example
# pip install optuna
import optuna
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# ── Define the objective function ─────────────────────────────
def objective(trial):
    """Optuna calls this function many times, each with different params."""
    params = {
        'n_estimators':     trial.suggest_int('n_estimators', 50, 500),
        'max_depth':        trial.suggest_int('max_depth', 2, 10),
        'learning_rate':    trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'subsample':        trial.suggest_float('subsample', 0.5, 1.0),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 30),
        'max_features':     trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
        'min_impurity_decrease': trial.suggest_float('min_impurity_decrease', 0.0, 0.1),
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)
    return scores.mean()  # Maximised because the study below sets direction='maximize' (Optuna minimises by default)

# ── Create and run the study ──────────────────────────────────
study = optuna.create_study(
    direction='maximize',                              # We want max ROC-AUC
    sampler=optuna.samplers.TPESampler(seed=42),       # Bayesian (default)
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5)  # Only prunes if the objective calls trial.report(); this one does not
)

# Suppress verbose logging
optuna.logging.set_verbosity(optuna.logging.WARNING)

study.optimize(
    objective,
    n_trials=100,        # Number of hyperparameter configurations to try
    timeout=300,         # Stop after 5 minutes (whichever comes first)
    show_progress_bar=True
)

# ── Results ───────────────────────────────────────────────────
print(f"Best trial: #{study.best_trial.number}")
print(f"Best ROC-AUC (CV): {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# ── Train final model with best params ────────────────────────
best_model = GradientBoostingClassifier(**study.best_params, random_state=42)
best_model.fit(X_train, y_train)
from sklearn.metrics import roc_auc_score
y_scores = best_model.predict_proba(X_test)[:, 1]
print(f"Test ROC-AUC: {roc_auc_score(y_test, y_scores):.4f}")

# ── Optuna visualisations ──────────────────────────────────────
import optuna.visualization as vis
# vis.plot_optimization_history(study).show()   # Objective value (ROC-AUC) over trials
# vis.plot_param_importances(study).show()      # Which params matter most
# vis.plot_contour(study, params=['learning_rate', 'max_depth']).show()

# ── Suggest parameter types reference ─────────────────────────
print("""
trial.suggest_int(name, low, high)               → integer in [low, high]
trial.suggest_float(name, low, high)             → float in [low, high]
trial.suggest_float(name, low, high, log=True)   → float in [low, high] (log scale)
trial.suggest_categorical(name, choices)         → one of the choices
trial.suggest_float(name, low, high, step=q)     → float on a grid with step q (replaces the deprecated suggest_discrete_uniform)
""")
💡
Optuna Integration with XGBoost + Early Stopping

For XGBoost/LightGBM, add a pruning callback inside the objective so Optuna can stop underperforming trials mid-training (saving significant compute):

Code Example
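# Inside objective(trial). Recent Optuna releases ship this callback in the separate optuna-integration package.
# With the scikit-learn API, the first eval_set entry passed to fit() is reported as 'validation_0-auc'.
pruning_callback = optuna.integration.XGBoostPruningCallback(trial, 'validation_0-auc')
model = xgb.XGBClassifier(..., eval_metric='auc', callbacks=[pruning_callback])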

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain Optuna and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 80 — Ensemble Methods

Advanced Ensembles — Stacking, Bagging, Voting

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Ensemble Taxonomy

Method | Strategy | Reduces | Example
Bagging | Train multiple models on bootstrapped subsets; average predictions | Variance | Random Forest, BaggingClassifier
Boosting | Train models sequentially; each corrects errors of the previous | Bias + Variance | XGBoost, LightGBM, AdaBoost
Voting | Combine predictions from diverse models by majority vote or average | Both | VotingClassifier
Stacking | Use model predictions as features for a meta-learner | Both | StackingClassifier
Code Example
from sklearn.ensemble import (StackingClassifier, BaggingClassifier,
                              VotingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── Stacking Classifier ───────────────────────────────────────
# Level-0 (base) estimators
base_estimators = [
    ('rf',  RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svm', Pipeline([('scaler', StandardScaler()), ('svc', SVC(probability=True, kernel='rbf'))])),
    ('knn', Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=11))]))
]
# Level-1 (meta) estimator — learns from base model predictions
meta_learner = LogisticRegression(C=1.0, max_iter=1000)

stacking = StackingClassifier(
    estimators=base_estimators,
    final_estimator=meta_learner,
    cv=5,                   # Cross-validate base estimators to prevent leakage
    stack_method='predict_proba',  # Use probabilities as meta-features
    passthrough=False,      # Set True to also pass original features to meta-learner
    n_jobs=-1
)
stacking.fit(X_train, y_train)
print(f"Stacking Test Accuracy: {stacking.score(X_test, y_test):.4f}")

# ── VotingClassifier ──────────────────────────────────────────
# Hard voting: majority vote of class predictions
# Soft voting: average probabilities (usually better)
voting = VotingClassifier(
    estimators=base_estimators,
    voting='soft',    # 'hard' or 'soft'
    n_jobs=-1
)
voting.fit(X_train, y_train)
print(f"Soft Voting Test Accuracy: {voting.score(X_test, y_test):.4f}")

# ── BaggingClassifier ─────────────────────────────────────────
# Bagging around any base estimator (e.g., Deep Decision Trees)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=None),  # Unpruned tree
    n_estimators=100,
    max_samples=0.8,      # Each tree sees 80% of training samples
    max_features=0.8,     # Each tree uses 80% of features
    bootstrap=True,        # Sample with replacement (bagging)
    bootstrap_features=False,
    random_state=42,
    n_jobs=-1
)
bagging.fit(X_train, y_train)
print(f"Bagging Test Accuracy: {bagging.score(X_test, y_test):.4f}")

# ── Compare all ensembles with cross-validation ───────────────
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    'Random Forest':  RandomForestClassifier(n_estimators=100, random_state=42),
    'Bagging DT':     bagging,
    'Soft Voting':    VotingClassifier(estimators=base_estimators, voting='soft', n_jobs=-1),
    'Stacking':       StackingClassifier(estimators=base_estimators, final_estimator=meta_learner, cv=5, n_jobs=-1)
}

print("\n=== Cross-Validation Comparison ===")
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc', n_jobs=-1)
    print(f"{name:20s}: {scores.mean():.4f} ± {scores.std():.4f}")
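
The hard versus soft voting distinction noted in the VotingClassifier comments above can be seen on a few hand-made numbers. This is a NumPy sketch of the two rules for a binary problem, using hypothetical probabilities rather than output from the models in this section:

```python
import numpy as np

# Hypothetical P(class=1) from three models (rows) for four samples (columns)
proba = np.array([
    [0.45, 0.90, 0.20, 0.60],
    [0.45, 0.40, 0.30, 0.55],
    [0.90, 0.45, 0.10, 0.10],
])

# Hard voting: each model votes with its own thresholded label, majority wins
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=0) >= 2).astype(int)

# Soft voting: average the probabilities first, then threshold once
soft = (proba.mean(axis=0) >= 0.5).astype(int)

print("Hard voting:", hard.tolist())  # [0, 0, 0, 1]
print("Soft voting:", soft.tolist())  # [1, 1, 0, 0]
```

The two rules disagree on three of the four samples: one very confident model can outweigh two lukewarm ones under soft voting, which is why soft voting depends on the base models producing reasonably calibrated probabilities.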
📌
Module 6 Key Takeaways
  • Always use stratified cross-validation for classification — never just a single split
  • Match your evaluation metric to your business goal (F1 ≠ AUC ≠ accuracy)
  • Use learning curves to diagnose bias vs variance before tuning
  • Start with RandomizedSearchCV for exploration, then refine with GridSearchCV or Optuna
  • Stacking often matches or beats voting but is more complex and slower to train; use it as a final step

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain advanced ensembles and when they apply.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Continue to the next day in this module.

Figure: Confusion matrix grid for classification model performance. Rows are Actual Positive and Actual Negative; columns are Predicted Positive and Predicted Negative; cells are True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN).