100 Days of ML · Module 6 (80)

Module 6: Model Evaluation & Hyperparameter Tuning

100 Days of ML Module 6 — Cross-validation, confusion matrix, precision/recall/F1, ROC-AUC, regression metrics, bias-variance tradeoff, GridSearchCV, Optuna, and advanced ensembles.

⏱ 55 Min Read • 80 • Updated: May 2026

Building a model is only half the battle. Knowing whether it's actually good — and systematically improving it — is the other half. This module teaches you to evaluate models rigorously (avoiding data leakage), interpret every major metric, understand the bias-variance tradeoff, and tune hyperparameters efficiently with GridSearchCV, RandomizedSearchCV, and Optuna.

Train/Test Split & Data Leakage

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

The Fundamental Split

Before training any model, you must hold out a portion of your data as a test set — data the model never sees during training. This gives an unbiased estimate of generalisation performance.

Code Example

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# ── Basic train/test split ────────────────────────────────────
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing (80/20 is standard)
    random_state=42,    # Reproducibility seed
    stratify=y          # CRITICAL for classification: maintains class proportions
)
print(f"Train size: {X_train.shape[0]}, Test size: {X_test.shape[0]}")
print(f"Train class balance: {y_train.value_counts(normalize=True).round(3).to_dict()}")
print(f"Test class balance:  {y_test.value_counts(normalize=True).round(3).to_dict()}")
# With stratify=y, both have the same class proportions

# ── Three-way split: Train / Validation / Test ─────────────────
# NEVER tune hyperparameters using the test set!
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp)
# 0.176 of 0.85 ≈ 0.15, giving roughly 70/15/15 split

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

The stratify Parameter

When your dataset has class imbalance (e.g., 95% non-fraud, 5% fraud), a purely random split can leave one split with too few fraud samples, or in small datasets none at all. stratify=y ensures both splits maintain the same class ratio.
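
To make the effect of stratify concrete, here is a minimal sketch on made-up labels. The arrays X_demo and y_demo are hypothetical placeholders (950 non-fraud, 50 fraud), not data from this module.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 950 non-fraud (0) and 50 fraud (1)
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# Stratified split: each split keeps the 5% fraud rate
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Plain random split: the fraud rate in each split depends on the seed
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(f"Stratified   train/test fraud rate: {y_tr_s.mean():.3f} / {y_te_s.mean():.3f}")
print(f"Unstratified train/test fraud rate: {y_tr_r.mean():.3f} / {y_te_r.mean():.3f}")
```

Re-running the unstratified split with different seeds shows how much the test-set fraud rate can drift; the stratified split stays fixed at the overall rate.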

⚠️

Data Leakage — The #1 Evaluation Mistake

Data leakage occurs when information from the test set "leaks" into the training process. Common causes:

Scaling before splitting: Fitting StandardScaler on the full dataset lets test set statistics influence the scaler. Always fit() on train, transform() on test.
Imputation before splitting: Same issue — fit imputers only on training data.
Target leakage: Including features that are derived from or correlated with the target in a way that wouldn't exist at prediction time (e.g., using "amount_refunded" to predict "will_be_refunded").
Temporal leakage: In time-series, using future data to predict the past.

Code Example

# ── WRONG way (data leakage!) ─────────────────────────────────
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # WRONG: uses test set stats!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

# ── CORRECT way ───────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit ONLY on train
X_test_scaled  = scaler.transform(X_test)         # transform only
# Use Pipelines to automate this correctly (prevents leakage automatically)
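
The Pipeline remark above can be sketched as follows. This is one possible setup, not the only one: LogisticRegression is chosen here purely for illustration, and X_demo, y_demo and leak_free are hypothetical names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The scaler is a pipeline step, so it is re-fit on the training part of
# each fold and only applied (transform) to the held-out part.
leak_free = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(leak_free, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Leak-free CV accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```

Because the whole pipeline is passed to cross_val_score, there is no separate scaler object that could accidentally be fitted on the full dataset.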

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain the train/test split and data leakage, and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 72 — Cross-Validation

K-Fold Cross-Validation

Why Cross-Validation?

A single train/test split can be "lucky" or "unlucky" depending on which data ends up in each set. K-Fold Cross-Validation evaluates a model on $k$ different train/test splits of the data, giving a more reliable and less variance-prone estimate of performance.

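5-Fold Cross-Validation Structure

Fold 1: TEST | TRAIN
Fold 2: TRAIN | TEST | TRAIN
Fold 3: TRAIN | TEST | TRAIN
(Folds 4 and 5 continue the pattern: the test block keeps moving toward the end of the data.)

Final score = mean of 5 fold scores; std = reliability of the estimate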

Code Example

from sklearn.model_selection import (cross_val_score, cross_validate,
                                     KFold, StratifiedKFold, LeaveOneOut)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# ── Simple cross_val_score ────────────────────────────────────
cv_scores = cross_val_score(
    model, X, y,
    cv=5,                 # Number of folds (5 or 10 are standard)
    scoring='roc_auc',    # Metric to evaluate
    n_jobs=-1             # Use all CPU cores
)
print(f"CV ROC-AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
print(f"Individual fold scores: {cv_scores.round(4)}")

# ── StratifiedKFold with shuffling (an integer cv already stratifies for classifiers, but does not shuffle) ──
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
print(f"Stratified F1: {stratified_scores.mean():.4f} ± {stratified_scores.std():.4f}")

# ── cross_validate — multiple metrics at once ─────────────────
cv_results = cross_validate(
    model, X, y,
    cv=skf,
    scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'],
    return_train_score=True,  # Also compute training scores
    n_jobs=-1
)
for metric in ['accuracy', 'f1', 'roc_auc']:
    train_mean = cv_results[f'train_{metric}'].mean()
    test_mean  = cv_results[f'test_{metric}'].mean()
    print(f"{metric:12s}: Train={train_mean:.4f}, CV={test_mean:.4f}")
# Large gap between train and CV → overfitting!

# ── Leave-One-Out (LOO) — for very small datasets ─────────────
loo = LeaveOneOut()
loo_scores = cross_val_score(model, X[:50], y[:50], cv=loo)  # Only use 50 samples for speed
print(f"LOO Accuracy (50 samples): {loo_scores.mean():.4f}")

💡

K-Fold Best Practices

Use k=5 or k=10 as the standard. k=5 is faster; k=10 gives slightly less biased estimates because each model trains on 90% of the data, at roughly twice the compute cost.
Always use StratifiedKFold for classification: plain KFold can leave a fold with few or no samples of a rare class.
For time-series data, use TimeSeriesSplit and never shuffle temporal data.
The standard deviation of CV scores tells you how consistent the model is. A high std means performance depends strongly on which samples land in each fold.
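
For the time-series point above, a minimal sketch of how TimeSeriesSplit orders its folds. X_time is a hypothetical array of twelve observations assumed to be sorted by time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve hypothetical observations, already sorted by time
X_time = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_time), start=1):
    # Every training index precedes every test index, so no future data leaks in
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows with each fold while the test window always lies strictly after it, which is why this splitter is the safe choice for temporal data.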

Recap

You can explain k-fold cross-validation and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 73 — Confusion Matrix

Confusion Matrix

Worked example — Confusion matrix counts

TP=80, FP=20, FN=10, TN=90 → Accuracy = 170/200 = 85%, Precision = 80/100 = 80%, Recall = 80/90 ≈ 89%. In fraud detection, FN (missed fraud) often costs more than FP — optimize recall, not accuracy.
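
The arithmetic in the worked example can be reproduced directly from the four counts. This is plain Python with no library calls; the variable names are just for illustration.

```python
# Counts taken from the worked example above
TP, FP, FN, TN = 80, 20, 10, 90

accuracy = (TP + TN) / (TP + FP + FN + TN)  # 170 / 200
precision = TP / (TP + FP)                  # 80 / 100
recall = TP / (TP + FN)                     # 80 / 90

print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} Recall={recall:.3f}")
```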

The Four Outcomes

For a binary classification problem (Positive = disease, Negative = healthy):

	Predicted Positive	Predicted Negative
Actual Positive	TP: True Positive	FN: False Negative (Type II)
Actual Negative	FP: False Positive (Type I)	TN: True Negative

TP (True Positive): Correctly predicted as positive (sick → detected as sick)
TN (True Negative): Correctly predicted as negative (healthy → detected as healthy)
FP (False Positive — Type I Error): Predicted positive, actually negative (healthy → detected as sick)
FN (False Negative — Type II Error): Predicted negative, actually positive (sick → detected as healthy) — usually more dangerous!

Code Example

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             classification_report)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# ── Confusion Matrix ──────────────────────────────────────────
cm = confusion_matrix(y_test, y_pred)
print("Raw confusion matrix:")
print(cm)
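# Rows = actual class, columns = predicted class, ordered by label (0, 1):
# [[TN, FP],
#  [FN, TP]]
# NOTE: in load_breast_cancer, 0 = malignant and 1 = benign, so the
# "positive" class (label 1) is benign here, not cancer.

# ── Visualise with seaborn ────────────────────────────────────
plt.figure(figsize=(6, 5))
sns.heatmap(
    cm, annot=True, fmt='d', cmap='Blues',
    xticklabels=['Predicted malignant (0)', 'Predicted benign (1)'],
    yticklabels=['Actual malignant (0)', 'Actual benign (1)']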
)
plt.title('Confusion Matrix — Cancer Detection')
plt.tight_layout()
plt.show()

# ── sklearn's built-in display ────────────────────────────────
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=load_breast_cancer().target_names
)
disp.plot(cmap='Blues', colorbar=False)
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

# ── Full classification report ────────────────────────────────
print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))
# Example output (exact values can vary with the scikit-learn version):
#               precision    recall  f1-score   support
#    malignant       0.97      0.95      0.96        42
#       benign       0.97      0.99      0.98        72
#     accuracy                           0.97       114

Accuracy Paradox

Problem with Accuracy: If 97% of transactions are legitimate and 3% are fraud, a model that always predicts "legitimate" achieves 97% accuracy — but is completely useless! For imbalanced datasets, always use precision, recall, F1, and ROC-AUC instead of accuracy.
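
The paradox is easy to reproduce. The sketch below uses made-up labels (970 legitimate, 30 fraud) and scikit-learn's DummyClassifier as the "always predict legitimate" model; y_true and X_dummy are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 970 legitimate (0) and 30 fraud (1) transactions
y_true = np.array([0] * 970 + [1] * 30)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by the dummy model

# Baseline that always predicts the majority class ("legitimate")
always_legit = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_hat = always_legit.predict(X_dummy)

print(f"Accuracy:     {accuracy_score(y_true, y_hat):.2f}")
print(f"Fraud recall: {recall_score(y_true, y_hat):.2f}")
```

A baseline like this is a useful sanity check: any real model should beat it on recall and precision, not just match it on accuracy.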
      

Recap

You can explain the confusion matrix and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 74 — Precision & Recall

Precision, Recall, F1-Score & Averaging Strategies

Precision vs. Recall Overlap Mapping

The Core Formulas

$$\text{Precision} = \frac{TP}{TP + FP} \quad \text{(of all positives predicted, how many were correct?)}$$
$$\text{Recall (Sensitivity)} = \frac{TP}{TP + FN} \quad \text{(of all actual positives, how many did we catch?)}$$
$$\text{F1} = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \quad \text{(harmonic mean of Precision and Recall)}$$

Precision-Recall Tradeoff

By adjusting the classification decision threshold (default 0.5), you can trade precision for recall:

Higher threshold (e.g., 0.8): More conservative — fewer false positives, more false negatives → Higher Precision, Lower Recall
Lower threshold (e.g., 0.3): More aggressive — catch more positives, but more false alarms → Lower Precision, Higher Recall

When to Prioritise Recall: Medical diagnosis (missing cancer is worse than a false alarm), fraud detection (missing fraud is costly). 

When to Prioritise Precision: Email spam filter (legitimate emails in spam = bad user experience), recommendation systems.
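
The threshold tradeoff described above can be seen on a handful of made-up probabilities. The labels and scores below (y_true_thr, proba_thr) are hypothetical and chosen only to make the pattern visible.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true_thr = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
proba_thr = np.array([0.05, 0.20, 0.35, 0.45, 0.40, 0.60, 0.55, 0.70, 0.85, 0.95])

by_threshold = {}
for threshold in (0.3, 0.5, 0.8):
    y_hat_thr = (proba_thr >= threshold).astype(int)
    p = precision_score(y_true_thr, y_hat_thr)
    r = recall_score(y_true_thr, y_hat_thr)
    by_threshold[threshold] = (p, r)
    print(f"threshold={threshold:.1f}: precision={p:.3f}, recall={r:.3f}")
```

On this toy data, raising the threshold from 0.3 to 0.8 increases precision at every step while recall falls, which is the tradeoff the bullets describe.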

Averaging for Multi-Class

Average	Calculation	When to Use
Macro	Simple mean across all classes — equal weight to each class	When all classes are equally important; sensitive to minority class performance
Weighted	Mean weighted by class support (number of samples)	When class frequencies should drive the score; note that a large majority class can mask poor minority-class performance
Micro	Aggregate TP/FP/FN across all classes, then compute	Equal weight to each sample; micro-F1 equals accuracy in single-label multi-class problems
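
A small sketch of how the three averages differ on the same predictions. The labels y_true_mc and y_pred_mc are invented for illustration; class 2 is deliberately rare and never predicted.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical 3-class labels: class 0 is common, class 2 is rare
y_true_mc = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred_mc = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

for avg in ("macro", "weighted", "micro"):
    # zero_division=0 scores the never-predicted class as 0 without a warning
    score = f1_score(y_true_mc, y_pred_mc, average=avg, zero_division=0)
    print(f"F1 ({avg:8s}): {score:.3f}")
print(f"Accuracy     : {accuracy_score(y_true_mc, y_pred_mc):.3f}")
```

Macro-F1 is pulled down the most by the missed rare class, while micro-F1 matches accuracy because every sample counts equally.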

Code Example

from sklearn.metrics import (precision_score, recall_score, f1_score,
                             fbeta_score, precision_recall_curve)
import matplotlib.pyplot as plt
import numpy as np

# ── Binary metrics ────────────────────────────────────────────
precision = precision_score(y_test, y_pred)
recall    = recall_score(y_test, y_pred)
f1        = f1_score(y_test, y_pred)
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")

# ── F-Beta score — weight recall more than precision ──────────
# beta > 1 → recall more important; beta < 1 → precision more important
f2 = fbeta_score(y_test, y_pred, beta=2)  # Recall is 2x more important
print(f"F2 Score (recall-focused): {f2:.4f}")

# ── Multi-class averaging ─────────────────────────────────────
# y_multi = multiclass labels
# f1_macro    = f1_score(y_multi, y_pred_multi, average='macro')
# f1_weighted = f1_score(y_multi, y_pred_multi, average='weighted')
# f1_micro    = f1_score(y_multi, y_pred_multi, average='micro')

# ── Threshold tuning with Precision-Recall curve ─────────────
y_scores = model.predict_proba(X_test)[:, 1]  # Probability of positive class
precisions, recalls, thresholds = precision_recall_curve(y_test, y_scores)

plt.figure(figsize=(8, 5))
plt.subplot(1, 2, 1)
plt.plot(recalls[:-1], precisions[:-1], 'b-', linewidth=2)
plt.xlabel('Recall'); plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(thresholds, precisions[:-1], 'b-', label='Precision')
plt.plot(thresholds, recalls[:-1], 'r-', label='Recall')
plt.xlabel('Threshold'); plt.ylabel('Score')
plt.title('Precision and Recall vs Threshold')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()

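# ── Find optimal threshold (maximise F1) ─────────────────────
# NOTE: choosing the threshold on the test set makes the F1 below optimistic;
# in practice select it on a validation set or via cross-validation.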
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)
best_threshold = thresholds[np.argmax(f1_scores)]
print(f"Optimal threshold: {best_threshold:.4f}")
y_pred_optimal = (y_scores >= best_threshold).astype(int)
print(f"F1 at optimal threshold: {f1_score(y_test, y_pred_optimal):.4f}")

Recap

You can explain precision, recall, F1-score and averaging strategies, and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 75 — ROC-AUC Curve

ROC-AUC Curve & Precision-Recall AUC

ROC Curve

The Receiver Operating Characteristic curve plots TPR (True Positive Rate = Recall) vs FPR (False Positive Rate = 1 - Specificity) at every possible threshold:

$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}$$
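
One way to read the area under this curve: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below checks that on invented scores; y_true_rank and scores_rank are hypothetical names.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores for illustration
y_true_rank = np.array([0, 0, 0, 1, 1, 0, 1, 1])
scores_rank = np.array([0.1, 0.3, 0.35, 0.4, 0.45, 0.6, 0.8, 0.9])

pos = scores_rank[y_true_rank == 1]
neg = scores_rank[y_true_rank == 0]

# Fraction of (positive, negative) pairs where the positive scores higher;
# ties count as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:             {roc_auc_score(y_true_rank, scores_rank):.4f}")
```

This is also why ROC-AUC depends only on how samples are ranked, not on the absolute probability values.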

AUC Interpretation

AUC Value	Meaning
1.0	Perfect classifier — correctly ranks all positives above all negatives
0.9–0.99	Excellent
0.8–0.9	Good
0.7–0.8	Fair
0.5	Random guessing (diagonal line)
< 0.5	Worse than random — predictions are inverted

Code Example

from sklearn.metrics import roc_curve, roc_auc_score, average_precision_score
import matplotlib.pyplot as plt

y_scores = model.predict_proba(X_test)[:, 1]

# ── ROC Curve ─────────────────────────────────────────────────
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
auc_score = roc_auc_score(y_test, y_scores)

plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--', label='Random Classifier (AUC = 0.5)')
plt.fill_between(fpr, tpr, alpha=0.1, color='blue')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity / Recall)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"ROC-AUC: {auc_score:.4f}")

# ── Precision-Recall AUC (better for imbalanced datasets) ─────
from sklearn.metrics import precision_recall_curve, auc as sklearn_auc

precisions, recalls, _ = precision_recall_curve(y_test, y_scores)
pr_auc = sklearn_auc(recalls, precisions)
avg_precision = average_precision_score(y_test, y_scores)

plt.figure(figsize=(8, 5))
plt.plot(recalls, precisions, 'g-', linewidth=2, label=f'PR Curve (AP = {avg_precision:.4f})')
plt.axhline(y=y_test.mean(), color='r', linestyle='--', label=f'Random ({y_test.mean():.2f})')
plt.xlabel('Recall'); plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()

# ── Multi-class ROC-AUC ───────────────────────────────────────
# roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro')
# 'ovr' = One-vs-Rest; 'ovo' = One-vs-One

📌

ROC-AUC vs PR-AUC for Imbalanced Data

ROC-AUC can look deceptively good on heavily imbalanced datasets. FPR divides the false positives by the (very large) number of negatives, so with a 99% negative class even a small FPR can mean that false positives outnumber true positives: the ROC curve looks strong while precision is poor. For imbalanced classification (fraud, rare disease), prefer PR-AUC (Average Precision) over ROC-AUC as your primary metric.
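
A constructed example makes the gap visible. The scores below are entirely made up (10 positives among 1,000 samples, with 50 negatives scored above every positive); y_imb and scores_imb are hypothetical names.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical scores: 10 positives among 1,000 samples (1% prevalence)
y_imb = np.array([1] * 10 + [0] * 990)
scores_imb = np.concatenate([
    np.full(10, 0.90),   # all positives score 0.90
    np.full(50, 0.95),   # 50 negatives outrank every positive
    np.full(940, 0.10),  # remaining negatives score low
])

roc = roc_auc_score(y_imb, scores_imb)
ap = average_precision_score(y_imb, scores_imb)
print(f"ROC-AUC: {roc:.3f}, Average Precision: {ap:.3f}")
```

Only about 5% of negatives are ranked wrongly, so ROC-AUC stays high, yet those 50 false alarms swamp the 10 true positives and Average Precision collapses.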

Recap

You can explain the ROC-AUC curve and Precision-Recall AUC, and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 76 — Regression Metrics

Regression Metrics

Core Regression Metrics

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
$$RMSE = \sqrt{MSE}$$
$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

Metric	Range	Unit	Pros	Cons
MAE	[0, ∞)	Same as target	Interpretable, robust to outliers	Not differentiable at 0; weights all errors linearly, so large errors get no extra penalty
MSE	[0, ∞)	Squared target	Differentiable, penalises large errors heavily	Unit is squared (hard to interpret), sensitive to outliers
RMSE	[0, ∞)	Same as target	Interpretable + penalises large errors	Still sensitive to outliers
MAPE	[0%, ∞%)	Percentage	Scale-independent, easy to explain to business	Explodes when y_i ≈ 0; asymmetric for positive targets (over-predictions are penalised more than under-predictions, so it favours models that predict low)
R²	(-∞, 1]	Unitless	Proportion of variance explained; 1.0 = perfect	Can be negative for worse-than-baseline models
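
To see the outlier sensitivity from the table in numbers, here is a tiny sketch with invented values where four predictions miss by 1 and one misses by 10. y_true_reg and y_pred_reg are hypothetical names.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions; the last prediction misses by 10
y_true_reg = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred_reg = np.array([11.0, 11.0, 15.0, 15.0, 28.0])

mae_demo = mean_absolute_error(y_true_reg, y_pred_reg)           # (1+1+1+1+10) / 5
rmse_demo = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))  # sqrt((1+1+1+1+100) / 5)

print(f"MAE:  {mae_demo:.2f}")
print(f"RMSE: {rmse_demo:.2f}")
```

The single large miss contributes 10 of 14 to the MAE sum but 100 of 104 to the squared-error sum, which is why RMSE ends up well above MAE here.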

Code Example

from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, mean_absolute_percentage_error)
from sklearn.linear_model import Ridge
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# ── Compute all metrics ───────────────────────────────────────
mae   = mean_absolute_error(y_test, y_pred)
mse   = mean_squared_error(y_test, y_pred)
rmse  = np.sqrt(mse)
mape  = mean_absolute_percentage_error(y_test, y_pred) * 100  # Convert to %
r2    = r2_score(y_test, y_pred)

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.2f}%")
print(f"R²:   {r2:.4f}")

# ── Adjusted R² — penalises unnecessary features ──────────────
n = len(y_test)      # Number of samples
p = X_test.shape[1]  # Number of features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"Adjusted R²: {adj_r2:.4f}")
# Adjusted R² penalises adding features that don't improve the model

# ── Residual plot — most important diagnostic ─────────────────
import matplotlib.pyplot as plt
residuals = y_test - y_pred
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(y_pred, residuals, alpha=0.6, s=20)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values'); plt.ylabel('Residuals')
plt.title('Residual Plot (should be random around 0)')
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=30, edgecolor='black', color='#d4af37', alpha=0.8)
plt.xlabel('Residual'); plt.ylabel('Frequency')
plt.title('Residual Distribution (should be ~Normal)')
plt.tight_layout(); plt.show()

Recap

You can explain regression metrics and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 77 — Bias-Variance

Bias-Variance Tradeoff & Learning Curves

The Fundamental Decomposition

For squared-error loss, the expected generalisation error of a model can be decomposed as:

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

	Bias	Variance
Definition	Error from incorrect assumptions in the model (wrong model family)	Error from sensitivity to fluctuations in training data
Symptom	Underfitting — poor on both train and test	Overfitting — great on train, poor on test
Example	Fitting a line to quadratic data	Decision tree with depth=30 memorising training noise
Fix	More complex model, more features, better features	Regularisation, dropout, more data, pruning, ensemble
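
The decomposition can be estimated by simulation. The sketch below is an illustration under stated assumptions, not a general recipe: it uses quadratic ground truth with Gaussian noise, a fixed training grid, and swaps in a degree-9 polynomial as the high-variance model in place of the deep decision tree from the table. All names (true_fn, noise_sd, n_repeats) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return x ** 2  # quadratic ground truth, as in the table's bias example

x_train = np.linspace(-1, 1, 20)  # fixed design; only the noise is resampled
x_test = np.linspace(-1, 1, 50)
noise_sd, n_repeats = 0.3, 200

results = {}
for degree in (1, 9):
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        y_noisy = true_fn(x_train) + rng.normal(0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_noisy, degree)
        preds[r] = np.polyval(coefs, x_test)
    # Bias^2: squared gap between the average prediction and the truth
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_test)) ** 2)
    # Variance: spread of predictions across the resampled training sets
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias_sq, variance)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

The straight line should show the larger bias term and the degree-9 polynomial the larger variance term, mirroring the two columns of the table.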

Learning Curves

Learning curves plot training and cross-validation scores as a function of training set size. They are the most powerful tool to diagnose bias vs variance.

Code Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# ── Learning curve ────────────────────────────────────────────
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='rbf', C=1.0, random_state=42))
])

train_sizes, train_scores, val_scores = learning_curve(
    pipeline, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),  # 10% to 100% of training data
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    shuffle=True, random_state=42
)

train_mean = train_scores.mean(axis=1)
train_std  = train_scores.std(axis=1)
val_mean   = val_scores.mean(axis=1)
val_std    = val_scores.std(axis=1)

plt.figure(figsize=(9, 5))
plt.plot(train_sizes, train_mean, 'b-o', label='Training Score', linewidth=2)
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
plt.plot(train_sizes, val_mean, 'r-o', label='CV Score', linewidth=2)
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='red')
plt.xlabel('Training Examples')
plt.ylabel('Accuracy')
plt.title('Learning Curve — SVM with RBF Kernel')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# ── Interpreting Learning Curves ──────────────────────────────
print("""
DIAGNOSIS GUIDE:
• Both curves plateau at HIGH score → Good model, no problem
• Both curves plateau at LOW score  → High Bias (Underfitting)
  Fix: More complex model, better features, remove regularisation
• Large gap between train and CV    → High Variance (Overfitting)
  Fix: More data, regularisation, simpler model, dropout
• CV score still improving with more data → Get more data!
""")

💡

Validation Curve — Best Parameter for Bias/Variance

Use sklearn.model_selection.validation_curve to plot train/CV scores vs a single hyperparameter (e.g., max_depth, C, alpha). This shows exactly where a parameter transitions from underfitting to overfitting — the optimal value is at the peak CV score.
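
A minimal sketch of validation_curve, assuming a DecisionTreeClassifier on the breast cancer dataset with max_depth as the swept parameter. The choice of model and depth range is illustrative only; X_vc, y_vc and depths are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X_vc, y_vc = load_breast_cancer(return_X_y=True)
depths = np.arange(1, 11)

# Returns arrays of shape (n_param_values, n_folds)
train_scores_vc, cv_scores_vc = validation_curve(
    DecisionTreeClassifier(random_state=42), X_vc, y_vc,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy",
)

for depth, tr, cv in zip(depths, train_scores_vc.mean(axis=1), cv_scores_vc.mean(axis=1)):
    print(f"max_depth={depth:2d}: train={tr:.3f}, CV={cv:.3f}")

best_depth = depths[np.argmax(cv_scores_vc.mean(axis=1))]
print(f"Depth with the highest mean CV accuracy: {best_depth}")
```

Training accuracy typically keeps climbing with depth while the CV score levels off or drops; the widening gap marks the move into overfitting.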

Recap

You can explain the bias-variance tradeoff and learning curves, and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 78 — GridSearchCV

GridSearchCV & RandomizedSearchCV

GridSearchCV — Exhaustive Search

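GridSearchCV tries every combination of hyperparameters in a grid. With cv=5 and 3×3×3=27 parameter combinations, it trains 27×5=135 models, plus one final refit of the best combination when refit=True. Each combination is scored by cross-validation rather than on a single validation split, which reduces the risk of overfitting the hyperparameters to one particular split.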
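
The fit count can be checked before launching a search. The sketch below uses scikit-learn's ParameterGrid on a hypothetical 3×3×3 grid (demo_grid) that matches the count in the text.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical 3 x 3 x 3 grid matching the count in the text
demo_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
}
n_folds = 5

n_candidates = len(ParameterGrid(demo_grid))
cv_fits = n_candidates * n_folds
total_fits = cv_fits + 1  # refit=True (the default) adds one fit on the full training set

print(f"{n_candidates} candidates x {n_folds} folds = {cv_fits} CV fits (+1 refit = {total_fits})")
```

Multiplying this count by the time of a single fit gives a rough budget, which is often the deciding factor between grid and randomized search.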

Code Example

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from scipy.stats import uniform, randint
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── GridSearchCV ──────────────────────────────────────────────
param_grid = {
    'n_estimators':    [100, 200, 300],
    'max_depth':       [3, 5, 7],
    'learning_rate':   [0.01, 0.05, 0.1],
    'min_samples_leaf': [1, 5]
}
# Total: 3×3×3×2 = 54 combinations × 5 folds = 270 model fits (+1 final refit)

grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring='roc_auc',       # Optimise for ROC-AUC
    refit=True,              # Refit best model on all training data
    n_jobs=-1,               # Parallelise across all CPU cores
    verbose=1,               # Print progress
    return_train_score=True  # Also track training scores
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best CV ROC-AUC: {grid_search.best_score_:.4f}")
print(f"Test ROC-AUC: {grid_search.score(X_test, y_test):.4f}")

# Access the best estimator directly
best_model = grid_search.best_estimator_

# Explore all results
import pandas as pd
cv_results = pd.DataFrame(grid_search.cv_results_)
top10 = cv_results.sort_values('mean_test_score', ascending=False).head(10)
print(top10[['params', 'mean_test_score', 'std_test_score', 'mean_train_score']])

# ── RandomizedSearchCV — for large search spaces ──────────────
param_dist = {
    'n_estimators':    randint(50, 500),         # Integers 50-499 (upper bound is exclusive)
    'max_depth':       randint(2, 12),            # Integers 2-11
    'learning_rate':   uniform(0.001, 0.3),       # uniform(loc, scale): 0.001-0.301
    'subsample':       uniform(0.5, 0.5),         # 0.5-1.0
    'min_samples_leaf': randint(1, 20),           # Integers 1-19
    'max_features':    uniform(0.3, 0.7)          # Fraction of features: 0.3-1.0
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,          # Try 50 random combinations (vs 54+ for grid)
    cv=5,
    scoring='roc_auc',
    refit=True,
    n_jobs=-1,
    random_state=42,
    verbose=1
)
random_search.fit(X_train, y_train)
print(f"\nRandomized Best params: {random_search.best_params_}")
print(f"Randomized Best CV ROC-AUC: {random_search.best_score_:.4f}")
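
The random-search space above draws learning_rate from uniform(0.001, 0.3), which spends most of its samples on larger values. A log-uniform distribution, like the one the Optuna example in this module requests with log=True, spreads samples evenly across orders of magnitude. The sketch below is an illustrative comparison only, assuming scipy.stats.loguniform is available (SciPy 1.4 or newer); the sample size and the 0.01 threshold are arbitrary choices for the demonstration.

```python
import numpy as np
from scipy.stats import loguniform, uniform

n_samples = 100_000  # arbitrary, just large enough for stable fractions

# uniform(loc, scale) draws from [loc, loc + scale] = [0.001, 0.301]
lr_uniform = uniform(0.001, 0.3).rvs(size=n_samples, random_state=42)

# loguniform(a, b) draws so that log(x) is uniform on [log(a), log(b)]
lr_log = loguniform(1e-3, 0.3).rvs(size=n_samples, random_state=43)

# Share of draws that land on "small" learning rates (below 0.01)
frac_small_uniform = np.mean(lr_uniform < 0.01)
frac_small_log = np.mean(lr_log < 0.01)

print(f"uniform:    {frac_small_uniform:.1%} of draws below 0.01")
print(f"loguniform: {frac_small_log:.1%} of draws below 0.01")
```

With these bounds, about 3% of the uniform draws fall below 0.01, against roughly 40% of the log-uniform draws. If small learning rates matter for your model, passing loguniform(1e-3, 0.3) for learning_rate in param_dist is worth trying.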

💡

Grid vs Random Search — When to Use Which

GridSearchCV: Small parameter spaces (roughly 50 combinations or fewer), when you already know the right ballpark for each parameter.
RandomizedSearchCV: Large spaces with many parameters. With a fixed budget it often finds comparably good settings in far fewer fits, because usually only a few parameters matter much. Use it for first-pass exploration.
Optuna (Day 79): Well suited to large spaces with many hyperparameters. Its default sampler uses the results of earlier trials to focus on promising regions.
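
To make the cost comparison concrete, the sketch below counts model fits for the grid defined earlier and for a random search. It uses scikit-learn's ParameterGrid and ParameterSampler helpers, which the search classes rely on internally. The budget of 20 iterations is an arbitrary illustration, not a recommendation.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

cv_folds = 5

# Same grid as in the GridSearchCV example
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "min_samples_leaf": [1, 5],
}
n_combinations = len(ParameterGrid(param_grid))  # every combination is tried
grid_fits = n_combinations * cv_folds

# Random search: the budget is n_iter, however large the space is
param_dist = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(2, 12),
    "learning_rate": uniform(0.001, 0.3),
}
n_iter = 20  # illustrative budget
samples = list(ParameterSampler(param_dist, n_iter=n_iter, random_state=42))
random_fits = n_iter * cv_folds

print(f"Grid search:   {n_combinations} combinations -> {grid_fits} fits")
print(f"Random search: {n_iter} samples -> {random_fits} fits")
print("First sampled candidate:", samples[0])
```

Adding one more three-value parameter to the grid triples its cost, while the random-search cost stays at n_iter × folds.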

Common mistakes

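Running a search without understanding what each hyperparameter controls, so the grid misses the range that matters.
Copying parameter grids from tutorials without validating them on your data.
Skipping validation: report the tuned model's score on a held-out test set that the search never touched.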

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
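
One way to guard against the leakage failure mode is to put preprocessing inside a Pipeline and tune the whole pipeline, so the scaler is refit on the training part of every fold. The following is a minimal sketch, assuming a scaled logistic regression as a stand-in model and an arbitrary grid for C. The step names scaler and clf are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV ROC-AUC: {search.best_score_:.4f}")
print(f"Test ROC-AUC: {search.score(X_test, y_test):.4f}")
```

Scaling X_train once before the search would let every validation fold influence the scaler's mean and variance, which is a mild form of the leakage the question refers to.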

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain GridSearchCV and RandomizedSearchCV and when each applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 79 — Optuna

Optuna — Bayesian Hyperparameter Optimisation

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Why Optuna?

Optuna's default sampler is the Tree-structured Parzen Estimator (TPE), a Bayesian optimisation algorithm that models which hyperparameter regions have produced good scores so far and concentrates new trials there. On large hyperparameter spaces it typically needs fewer trials than random search to reach a comparable score.

Code Example

# pip install optuna
import optuna
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# ── Define the objective function ─────────────────────────────
def objective(trial):
    """Optuna calls this function many times, each with different params."""
    params = {
        'n_estimators':     trial.suggest_int('n_estimators', 50, 500),
        'max_depth':        trial.suggest_int('max_depth', 2, 10),
        'learning_rate':    trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'subsample':        trial.suggest_float('subsample', 0.5, 1.0),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 30),
        'max_features':     trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
        'min_impurity_decrease': trial.suggest_float('min_impurity_decrease', 0.0, 0.1),
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)
    return scores.mean()  # Maximised because the study sets direction='maximize' (Optuna minimises by default)

# ── Create and run the study ──────────────────────────────────
study = optuna.create_study(
    direction='maximize',                              # We want max ROC-AUC
    sampler=optuna.samplers.TPESampler(seed=42),       # Bayesian (default)
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5)  # Only acts if the objective calls trial.report(); unused by this objective
)

# Suppress verbose logging
optuna.logging.set_verbosity(optuna.logging.WARNING)

study.optimize(
    objective,
    n_trials=100,        # Number of hyperparameter configurations to try
    timeout=300,         # Stop after 5 minutes (whichever comes first)
    show_progress_bar=True
)

# ── Results ───────────────────────────────────────────────────
print(f"Best trial: #{study.best_trial.number}")
print(f"Best ROC-AUC (CV): {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# ── Train final model with best params ────────────────────────
best_model = GradientBoostingClassifier(**study.best_params, random_state=42)
best_model.fit(X_train, y_train)
from sklearn.metrics import roc_auc_score
y_scores = best_model.predict_proba(X_test)[:, 1]
print(f"Test ROC-AUC: {roc_auc_score(y_test, y_scores):.4f}")

# ── Optuna visualisations ──────────────────────────────────────
import optuna.visualization as vis  # Plotly-based; requires plotly to be installed
# vis.plot_optimization_history(study).show()   # Objective value (ROC-AUC) over trials
# vis.plot_param_importances(study).show()      # Which params matter most
# vis.plot_contour(study, params=['learning_rate', 'max_depth']).show()

# ── Suggest parameter types reference ─────────────────────────
print("""
trial.suggest_int(name, low, high)               → integer in [low, high]
trial.suggest_float(name, low, high)             → float in [low, high]
trial.suggest_float(name, low, high, log=True)   → float in [low, high] (log scale)
trial.suggest_categorical(name, choices)         → one of the choices
trial.suggest_float(name, low, high, step=q)     → float on a grid with spacing q (replaces the deprecated suggest_discrete_uniform)
""")

💡

Optuna Integration with XGBoost + Early Stopping

For XGBoost/LightGBM, add a pruning callback inside the objective so Optuna can stop underperforming trials mid-training, which can save significant compute. In recent Optuna releases these callbacks live in the separate optuna-integration package:

Code Example

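# With the scikit-learn API the eval set is named validation_0; pass eval_set=[(X_val, y_val)] to fit()
pruning_callback = optuna.integration.XGBoostPruningCallback(trial, 'validation_0-auc')
model = xgb.XGBClassifier(..., eval_metric='auc', callbacks=[pruning_callback])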

Common mistakes

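Assuming a pruner works on its own: it only acts when the objective calls trial.report() and trial.should_prune().
Copying search ranges and trial counts from tutorials without validating them on your data.
Skipping validation: the best CV score is optimistic after many trials, so confirm it on a held-out test set.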

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain Optuna and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 80 — Ensemble Methods

Advanced Ensembles — Stacking, Bagging, Voting

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Ensemble Taxonomy

Method	Strategy	Reduces	Example
Bagging	Train multiple models on bootstrapped subsets; average predictions	Variance	Random Forest, BaggingClassifier
Boosting	Train models sequentially; each corrects errors of the previous	Mainly bias (can also reduce variance)	XGBoost, LightGBM, AdaBoost
Voting	Combine predictions from diverse models by majority vote or average	Both	VotingClassifier
Stacking	Use model predictions as features for a meta-learner	Both	StackingClassifier
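
The Voting row covers two different rules, majority vote (hard) and probability averaging (soft), and they can disagree. The tiny worked example below uses made-up probabilities from three hypothetical models on four samples. It mirrors the binary case only and is not how VotingClassifier is implemented internally.

```python
import numpy as np

# Hypothetical P(class = 1) from three models (columns) for four samples (rows)
proba = np.array([
    [0.45, 0.45, 0.90],  # two weak "no" votes, one confident "yes"
    [0.60, 0.55, 0.10],  # two weak "yes" votes, one confident "no"
    [0.80, 0.70, 0.60],  # all models agree on "yes"
    [0.20, 0.40, 0.30],  # all models agree on "no"
])

# Hard voting: each model casts a 0/1 vote, the majority wins
votes = (proba >= 0.5).astype(int)
hard_pred = (votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities, then threshold
soft_pred = (proba.mean(axis=1) >= 0.5).astype(int)

print("Hard voting:", hard_pred)
print("Soft voting:", soft_pred)
```

The two rules differ on the first two samples, where one confident model outweighs two uncertain ones under soft voting. This is why soft voting benefits from reasonably calibrated probabilities.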

Code Example

from sklearn.ensemble import (StackingClassifier, BaggingClassifier,
                              VotingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── Stacking Classifier ───────────────────────────────────────
# Level-0 (base) estimators
base_estimators = [
    ('rf',  RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svm', Pipeline([('scaler', StandardScaler()), ('svc', SVC(probability=True, kernel='rbf', random_state=42))])),  # random_state fixes the internal CV behind probability=True
    ('knn', Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=11))]))
]
# Level-1 (meta) estimator — learns from base model predictions
meta_learner = LogisticRegression(C=1.0, max_iter=1000)

stacking = StackingClassifier(
    estimators=base_estimators,
    final_estimator=meta_learner,
    cv=5,                   # Cross-validate base estimators to prevent leakage
    stack_method='predict_proba',  # Use probabilities as meta-features
    passthrough=False,      # Set True to also pass original features to meta-learner
    n_jobs=-1
)
stacking.fit(X_train, y_train)
print(f"Stacking Test Accuracy: {stacking.score(X_test, y_test):.4f}")

# ── VotingClassifier ──────────────────────────────────────────
# Hard voting: majority vote of class predictions
# Soft voting: average probabilities (usually better)
voting = VotingClassifier(
    estimators=base_estimators,
    voting='soft',    # 'hard' or 'soft'
    n_jobs=-1
)
voting.fit(X_train, y_train)
print(f"Soft Voting Test Accuracy: {voting.score(X_test, y_test):.4f}")

# ── BaggingClassifier ─────────────────────────────────────────
# Bagging around any base estimator (e.g., Deep Decision Trees)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=None),  # Unpruned tree (this argument was named base_estimator before scikit-learn 1.2)
    n_estimators=100,
    max_samples=0.8,      # Each tree sees 80% of training samples
    max_features=0.8,     # Each tree uses 80% of features
    bootstrap=True,        # Sample with replacement (bagging)
    bootstrap_features=False,
    random_state=42,
    n_jobs=-1
)
bagging.fit(X_train, y_train)
print(f"Bagging Test Accuracy: {bagging.score(X_test, y_test):.4f}")

# ── Compare all ensembles with cross-validation ───────────────
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    'Random Forest':  RandomForestClassifier(n_estimators=100, random_state=42),
    'Bagging DT':     bagging,
    'Soft Voting':    VotingClassifier(estimators=base_estimators, voting='soft', n_jobs=-1),
    'Stacking':       StackingClassifier(estimators=base_estimators, final_estimator=meta_learner, cv=5, n_jobs=-1)
}

print("\n=== Cross-Validation Comparison ===")
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc', n_jobs=-1)
    print(f"{name:20s}: {scores.mean():.4f} ± {scores.std():.4f}")
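
To see why StackingClassifier cross-validates its base estimators (the cv=5 argument above), it can help to build a stack by hand. The sketch below is a simplified illustration rather than scikit-learn's actual implementation. It uses only two base models, takes the positive-class probability as the meta-feature, and introduces the names base_models, oof, and test_meta for this example only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

base_models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11)),
}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Step 1: out-of-fold predictions. Each training row is scored by a model
# that never saw that row, so the meta-features carry no label leakage.
oof = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=skf, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Step 2: the meta-learner is trained on those out-of-fold predictions
meta = LogisticRegression(max_iter=1000).fit(oof, y_train)

# Step 3: refit each base model on all training data, then build test meta-features
test_meta = np.column_stack([
    model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for model in base_models.values()
])

manual_stack_auc = roc_auc_score(y_test, meta.predict_proba(test_meta)[:, 1])
print(f"Manual stacking Test ROC-AUC: {manual_stack_auc:.4f}")
```

If step 1 used in-sample predictions instead, a base model that memorises the training set would look perfect to the meta-learner, which would then over-trust it on new data.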

📌

Module 6 Key Takeaways

Always use stratified cross-validation for classification — never just a single split
Match your evaluation metric to your business goal (F1 ≠ AUC ≠ accuracy)
Use learning curves to diagnose bias vs variance before tuning
Start with RandomizedSearchCV for exploration, then refine with GridSearchCV or Optuna
Stacking often matches or beats voting but is more complex; treat it as a final step and confirm the gain with cross-validation

Common mistakes

Ensembling near-identical models: the gain comes from base models that make different errors.
Copying ensemble settings from tutorials without validating on your data.
Skipping validation: compare every ensemble against its best single base model with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain advanced ensembles and when each one applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Continue to the next day in this module.

