Module 6: Model Evaluation & Hyperparameter Tuning
100 Days of ML Module 6 — Cross-validation, confusion matrix, precision/recall/F1, ROC-AUC, regression metrics, bias-variance tradeoff, GridSearchCV, Optuna, and advanced ensembles.
Building a model is only half the battle. Knowing whether it's actually good — and systematically improving it — is the other half. This module teaches you to evaluate models rigorously (avoiding data leakage), interpret every major metric, understand the bias-variance tradeoff, and tune hyperparameters efficiently with GridSearchCV, RandomizedSearchCV, and Optuna.
Train/Test Split & Data Leakage
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
The Fundamental Split
Before training any model, you must hold out a portion of your data as a test set — data the model never sees during training. This gives an unbiased estimate of generalisation performance.
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# ── Basic train/test split ────────────────────────────────────
# Assumes df is a pandas DataFrame that contains a 'target' column
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing (80/20 is a common default)
    random_state=42,    # Reproducibility seed
    stratify=y          # Important for classification: maintains class proportions
)
print(f"Train size: {X_train.shape[0]}, Test size: {X_test.shape[0]}")
print(f"Train class balance: {y_train.value_counts(normalize=True).round(3).to_dict()}")
print(f"Test class balance: {y_test.value_counts(normalize=True).round(3).to_dict()}")
# With stratify=y, both have the same class proportions
# ── Three-way split: Train / Validation / Test ─────────────────
# NEVER tune hyperparameters using the test set!
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp)
# 0.176 of 0.85 ≈ 0.15, giving roughly 70/15/15 split
print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
The stratify Parameter
When your dataset has class imbalance (e.g., 95% non-fraud, 5% fraud), a purely random split can leave one side with far fewer fraud samples than expected, and on a small dataset it can leave the test set with none at all. stratify=y ensures both splits keep the same class ratio as the full dataset.
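The effect of stratify=y can be checked directly. The sketch below uses made-up labels (950 negatives, 50 positives) and a placeholder feature column, both assumptions for illustration, and compares the positive rate in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 950 negatives, 50 positives (5% positive)
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature; the values are irrelevant here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# With stratification, both splits keep the 5% positive rate
train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"Positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Dropping stratify=y and rerunning with a few different seeds should show the test positive rate drifting away from 5%.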
Data Leakage — The #1 Evaluation Mistake
Data leakage occurs when information from the test set "leaks" into the training process. Common causes:
- Scaling before splitting: Fitting StandardScaler on the full dataset lets test set statistics influence the scaler. Always fit() on train, transform() on test.
- Imputation before splitting: Same issue. Fit imputers only on training data.
- Target leakage: Including features that are derived from or correlated with the target in a way that wouldn't exist at prediction time (e.g., using "amount_refunded" to predict "will_be_refunded").
- Temporal leakage: In time-series, using future data to predict the past.
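One common guard against the first two causes is to wrap preprocessing and model in a scikit-learn Pipeline, so that fitting only ever sees training rows. The following is a minimal sketch, not the course's own solution: the choice of SimpleImputer, LogisticRegression, and the breast cancer dataset are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Imputer and scaler are fitted only on whatever data the pipeline is fitted on
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# During cross-validation each fold refits the whole pipeline on that fold's training part
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV ROC-AUC: {cv_scores.mean():.4f}")

# Final fit on the training set; the test set is only transformed and scored
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Because the preprocessing steps live inside the estimator, the same object can be passed to cross_val_score or a search class without any manual fit/transform bookkeeping.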
# ── WRONG way (data leakage!) ─────────────────────────────────
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # WRONG: uses test set stats!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
# ── CORRECT way ───────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit ONLY on train
X_test_scaled = scaler.transform(X_test) # transform only
# Use Pipelines to automate this correctly (prevents leakage automatically)
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain train/test split & data leakage and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
K-Fold Cross-Validation
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Why Cross-Validation?
A single train/test split can be "lucky" or "unlucky" depending on which data ends up in each set. K-Fold Cross-Validation evaluates a model on $k$ different train/test splits of the data, giving a more reliable and less variance-prone estimate of performance.
Figure: 5-fold cross-validation structure. The final score is the mean of the 5 fold scores; their standard deviation indicates how reliable that estimate is.
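To make the fold structure concrete, the sketch below prints the index split for each of 5 folds on a toy array of 10 samples (the array is an assumption for illustration). Shuffling is turned off so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 toy samples

kf = KFold(n_splits=5, shuffle=False)
test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    test_folds.append(test_idx.tolist())
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Every sample appears in a test fold exactly once across the 5 folds
all_test = sorted(i for fold in test_folds for i in fold)
```

Each sample is used for testing once and for training four times, which is why the averaged score depends less on any single split.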
from sklearn.model_selection import (cross_val_score, cross_validate,
                                     KFold, StratifiedKFold, LeaveOneOut)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
# ── Simple cross_val_score ────────────────────────────────────
cv_scores = cross_val_score(
    model, X, y,
    cv=5,               # Number of folds (5 or 10 are common choices)
    scoring='roc_auc',  # Metric to evaluate
    n_jobs=-1           # Use all CPU cores
)
print(f"CV ROC-AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
print(f"Individual fold scores: {cv_scores.round(4)}")
# ── StratifiedKFold — ALWAYS use for classification ───────────
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
print(f"Stratified F1: {stratified_scores.mean():.4f} ± {stratified_scores.std():.4f}")
# ── cross_validate — multiple metrics at once ─────────────────
cv_results = cross_validate(
    model, X, y,
    cv=skf,
    scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'],
    return_train_score=True,  # Also compute training scores
    n_jobs=-1
)
for metric in ['accuracy', 'f1', 'roc_auc']:
    train_mean = cv_results[f'train_{metric}'].mean()
    test_mean = cv_results[f'test_{metric}'].mean()
    print(f"{metric:12s}: Train={train_mean:.4f}, CV={test_mean:.4f}")
# A large gap between train and CV scores suggests overfitting
# ── Leave-One-Out (LOO) — for very small datasets ─────────────
loo = LeaveOneOut()
loo_scores = cross_val_score(model, X[:50], y[:50], cv=loo) # Only use 50 samples for speed
print(f"LOO Accuracy (50 samples): {loo_scores.mean():.4f}")
K-Fold Best Practices
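- Use k=5 or k=10 as the default. k=5 is faster; k=10 trains on more data per fold, which gives slightly less biased estimates at roughly twice the compute cost.
- Use StratifiedKFold for classification: plain KFold can leave a fold with few or no rare-class samples. When cv is an integer and the estimator is a classifier, cross_val_score already stratifies, but without shuffling.
- For time-series data, use TimeSeriesSplit and never shuffle temporal data.
- The standard deviation of CV scores tells you how consistent the model is across folds. A high std points to an unstable model or to folds that are too small.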
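For the time-series case in the list above, a short sketch of how TimeSeriesSplit orders its folds may help. The 12-row array stands in for observations sorted by time and is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations, already in time order

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train_idx.tolist(), test_idx.tolist()) for train_idx, test_idx in tscv.split(X)]
for train_idx, test_idx in splits:
    print(f"train={train_idx} test={test_idx}")
```

The training window grows with each split and every test index comes after every training index, so the model is never evaluated on data older than what it was trained on.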
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain k-fold cross-validation and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Confusion Matrix
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Worked example — Confusion matrix counts
TP=80, FP=20, FN=10, TN=90 → Accuracy = 170/200 = 85%, Precision = 80/100 = 80%, Recall = 80/90 ≈ 89%. In fraud detection, FN (missed fraud) often costs more than FP — optimize recall, not accuracy.
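The arithmetic in the worked example can be reproduced in a few lines. This sketch uses the same four counts and adds F1 and specificity, which the example does not state but which follow from the same numbers.

```python
# Counts from the worked example
tp, fp, fn, tn = 80, 20, 10, 90

total = tp + fp + fn + tn
accuracy = (tp + tn) / total                         # 170 / 200
precision = tp / (tp + fp)                           # 80 / 100
recall = tp / (tp + fn)                              # 80 / 90
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
specificity = tn / (tn + fp)                         # 90 / 110

print(f"Accuracy:    {accuracy:.3f}")
print(f"Precision:   {precision:.3f}")
print(f"Recall:      {recall:.3f}")
print(f"F1:          {f1:.3f}")
print(f"Specificity: {specificity:.3f}")
```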
The Four Outcomes
For a binary classification problem (Positive = disease, Negative = healthy):
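| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive | False Negative (Type II) |
| Actual Negative | False Positive (Type I) | True Negative |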
- TP (True Positive): Correctly predicted as positive (sick → detected as sick)
- TN (True Negative): Correctly predicted as negative (healthy → detected as healthy)
- FP (False Positive — Type I Error): Predicted positive, actually negative (healthy → detected as sick)
- FN (False Negative — Type II Error): Predicted negative, actually positive (sick → detected as healthy) — usually more dangerous!
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             classification_report)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# ── Confusion Matrix ──────────────────────────────────────────
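cm = confusion_matrix(y_test, y_pred)
print("Raw confusion matrix:")
print(cm)
# Layout for labels [0, 1]:
# [[TN, FP],
#  [FN, TP]]
# Note: in this dataset 0 = malignant and 1 = benign, so the "positive" class here is benign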
# ── Visualise with seaborn ────────────────────────────────────
plt.figure(figsize=(6, 5))
sns.heatmap(
    cm, annot=True, fmt='d', cmap='Blues',
    xticklabels=['Predicted Negative', 'Predicted Positive'],
    yticklabels=['Actual Negative', 'Actual Positive']
)
plt.title('Confusion Matrix — Cancer Detection')
plt.tight_layout()
plt.show()
# ── sklearn's built-in display ────────────────────────────────
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=load_breast_cancer().target_names
)
disp.plot(cmap='Blues', colorbar=False)
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()
# ── Full classification report ────────────────────────────────
print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))
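# Example output (illustrative; exact numbers can vary with the scikit-learn version):
#              precision  recall  f1-score  support
# malignant         0.97    0.95      0.96       42
# benign            0.97    0.99      0.98       72
# accuracy                            0.97      114
Accuracy Paradox
On imbalanced data, accuracy alone can mislead: a model that always predicts the majority class on a 95%/5% split reaches 95% accuracy while detecting no positive cases at all.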
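A minimal sketch of the accuracy paradox with made-up labels: a "model" that always predicts the majority class is scored on accuracy and on recall. The 95/5 class counts are an assumption chosen for illustration.

```python
# Hypothetical labels: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5
# A baseline that always predicts the majority class
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.2f}")  # looks strong
print(f"Recall:   {recall:.2f}")    # no positive is ever found
```

This is why a majority-class baseline is worth computing before trusting an accuracy figure on imbalanced data.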
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain confusion matrix and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Precision, Recall, F1-Score & Averaging Strategies
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
The Core Formulas
$$\text{Precision} = \frac{TP}{TP + FP} \quad \text{(of all predicted positives, how many were correct?)}$$
$$\text{Recall (Sensitivity)} = \frac{TP}{TP + FN} \quad \text{(of all actual positives, how many did we catch?)}$$
$$\text{F1} = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \quad \text{(harmonic mean of Precision and Recall)}$$
Precision-Recall Tradeoff
By adjusting the classification decision threshold (default 0.5), you can trade precision for recall:
- Higher threshold (e.g., 0.8): More conservative — fewer false positives, more false negatives → Higher Precision, Lower Recall
- Lower threshold (e.g., 0.3): More aggressive — catch more positives, but more false alarms → Lower Precision, Higher Recall
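When to Prioritise Precision: when false positives are costly, e.g., an email spam filter (legitimate emails landing in spam is a bad user experience) or a recommendation system.
When to Prioritise Recall: when false negatives are costly, e.g., fraud detection or disease screening, where a missed positive is worse than a false alarm.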
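The threshold effect can be seen on a handful of hand-picked scores. The labels and probabilities below are invented for illustration; on real data precision usually rises with the threshold but is not guaranteed to do so at every step.

```python
# Hypothetical true labels and predicted probabilities (illustrative values only)
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
y_score = [0.10, 0.35, 0.40, 0.55, 0.60, 0.65, 0.85, 0.90]


def precision_recall_at(threshold):
    """Return (precision, recall) when scores >= threshold are labelled positive."""
    y_pred = [int(s >= threshold) for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


results = {thr: precision_recall_at(thr) for thr in (0.3, 0.5, 0.8)}
for thr, (p, r) in results.items():
    print(f"threshold={thr:.1f}  precision={p:.3f}  recall={r:.3f}")
```

On these values, raising the threshold from 0.3 to 0.8 moves precision up and recall down, which is the tradeoff described above.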
Averaging for Multi-Class
| Average | Calculation | When to Use |
|---|---|---|
| Macro | Simple mean across all classes — equal weight to each class | When all classes are equally important; sensitive to minority class performance |
| Weighted | Mean weighted by class support (number of samples) | Imbalanced datasets where each class should count in proportion to its size; can hide poor minority-class performance |
| Micro | Aggregate TP/FP/FN across all classes, then compute | Equal weight to each sample; for single-label multi-class problems, micro F1 equals accuracy |
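The three averages can diverge noticeably on imbalanced labels. The sketch below uses a made-up 10-sample, 3-class example in which the rarest class is never predicted correctly; zero_division=0 is passed so the undefined precision for that class is treated as 0.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: class 0 is common, class 2 is rare and always missed
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]

f1_per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
f1_micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)

print(f"Per-class F1: {f1_per_class.round(3)}")
print(f"Macro:    {f1_macro:.3f}")     # pulled down by the missed rare class
print(f"Weighted: {f1_weighted:.3f}")  # dominated by the common class
print(f"Micro:    {f1_micro:.3f}  (accuracy = {accuracy:.3f})")
```

Macro F1 is the lowest of the three here because the missed rare class counts as much as the common one.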
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             fbeta_score, precision_recall_curve)
import matplotlib.pyplot as plt
import numpy as np
# ── Binary metrics ────────────────────────────────────────────
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1: {f1:.4f}")
# ── F-Beta score — weight recall more than precision ──────────
# beta > 1 → recall more important; beta < 1 → precision more important
f2 = fbeta_score(y_test, y_pred, beta=2) # Recall is 2x more important
print(f"F2 Score (recall-focused): {f2:.4f}")
# ── Multi-class averaging ─────────────────────────────────────
# y_multi = multiclass labels
# f1_macro = f1_score(y_multi, y_pred_multi, average='macro')
# f1_weighted = f1_score(y_multi, y_pred_multi, average='weighted')
# f1_micro = f1_score(y_multi, y_pred_multi, average='micro')
# ── Threshold tuning with Precision-Recall curve ─────────────
y_scores = model.predict_proba(X_test)[:, 1] # Probability of positive class
precisions, recalls, thresholds = precision_recall_curve(y_test, y_scores)
plt.figure(figsize=(8, 5))
plt.subplot(1, 2, 1)
plt.plot(recalls[:-1], precisions[:-1], 'b-', linewidth=2)
plt.xlabel('Recall'); plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(thresholds, precisions[:-1], 'b-', label='Precision')
plt.plot(thresholds, recalls[:-1], 'r-', label='Recall')
plt.xlabel('Threshold'); plt.ylabel('Score')
plt.title('Precision and Recall vs Threshold')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# ── Find the threshold that maximises F1 ─────────────────────
# Shown on the test set for brevity; in practice choose the threshold on a
# validation set so that the test score stays an unbiased estimate.
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)
best_threshold = thresholds[np.argmax(f1_scores)]
print(f"Optimal threshold: {best_threshold:.4f}")
y_pred_optimal = (y_scores >= best_threshold).astype(int)
print(f"F1 at optimal threshold: {f1_score(y_test, y_pred_optimal):.4f}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain precision, recall, F1-score, and the averaging strategies, and when each applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 75 — ROC-AUC Curve
ROC-AUC Curve & Precision-Recall AUC
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
ROC Curve
The Receiver Operating Characteristic curve plots TPR (True Positive Rate = Recall) vs FPR (False Positive Rate = 1 - Specificity) at every possible threshold:
$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}$$
AUC Interpretation
| AUC Value | Meaning |
|---|---|
| 1.0 | Perfect classifier — correctly ranks all positives above all negatives |
| 0.9–0.99 | Excellent |
| 0.8–0.9 | Good |
| 0.7–0.8 | Fair |
| 0.5 | Random guessing (diagonal line) |
| < 0.5 | Worse than random — predictions are inverted |
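The "ranks positives above negatives" reading of AUC can be computed directly: AUC equals the fraction of (positive, negative) pairs in which the positive gets the higher score, with ties counting as half. A small sketch on invented scores:

```python
# Hypothetical labels and scores (illustrative values only)
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]

pos_scores = [s for t, s in zip(y_true, y_score) if t == 1]
neg_scores = [s for t, s in zip(y_true, y_score) if t == 0]


def pair_credit(pos, neg):
    """1 if the positive outranks the negative, 0.5 on a tie, 0 otherwise."""
    if pos > neg:
        return 1.0
    if pos == neg:
        return 0.5
    return 0.0


wins = sum(pair_credit(p, n) for p in pos_scores for n in neg_scores)
auc = wins / (len(pos_scores) * len(neg_scores))
print(f"Pairwise AUC: {auc:.4f}")  # 6 of 9 pairs are ranked correctly
```

The same value would come out of the area under the ROC curve; the pairwise view simply makes clear that AUC measures ranking quality, not calibration or any single threshold.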
from sklearn.metrics import roc_curve, roc_auc_score, average_precision_score
import matplotlib.pyplot as plt
y_scores = model.predict_proba(X_test)[:, 1]
# ── ROC Curve ─────────────────────────────────────────────────
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
auc_score = roc_auc_score(y_test, y_scores)
plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--', label='Random Classifier (AUC = 0.5)')
plt.fill_between(fpr, tpr, alpha=0.1, color='blue')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity / Recall)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"ROC-AUC: {auc_score:.4f}")
# ── Precision-Recall AUC (better for imbalanced datasets) ─────
from sklearn.metrics import precision_recall_curve, auc as sklearn_auc
precisions, recalls, _ = precision_recall_curve(y_test, y_scores)
pr_auc = sklearn_auc(recalls, precisions)
avg_precision = average_precision_score(y_test, y_scores)
plt.figure(figsize=(8, 5))
plt.plot(recalls, precisions, 'g-', linewidth=2, label=f'PR Curve (AP = {avg_precision:.4f})')
plt.axhline(y=y_test.mean(), color='r', linestyle='--', label=f'Random ({y_test.mean():.2f})')
plt.xlabel('Recall'); plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# ── Multi-class ROC-AUC ───────────────────────────────────────
# roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro')
# 'ovr' = One-vs-Rest; 'ovo' = One-vs-One
ROC-AUC vs PR-AUC for Imbalanced Data
ROC-AUC can be misleading for heavily imbalanced datasets. FPR divides false positives by the number of actual negatives, so with 99% negatives even a large absolute number of false alarms yields a low FPR (and a good-looking ROC curve), while precision, which compares those false alarms against the few true positives, can be terrible. For imbalanced classification (fraud, rare disease), prefer PR-AUC (Average Precision) over ROC-AUC as your primary metric.
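A constructed example shows how far the two metrics can diverge. The scores below are invented: 1000 negatives and 10 positives, where just 5% of the negatives are scored above every positive.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical data: 1000 negatives followed by 10 positives (about 1% positive)
y_true = np.array([0] * 1000 + [1] * 10)
y_score = np.concatenate([
    np.full(50, 0.95),   # 50 negatives scored above every positive
    np.full(950, 0.10),  # the remaining negatives scored low
    np.full(10, 0.90),   # all positives
])

roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)

print(f"ROC-AUC:           {roc_auc:.4f}")
print(f"Average Precision: {avg_precision:.4f}")
```

Only 5% of negatives outrank the positives, so ROC-AUC stays high, yet those 50 false alarms outnumber the 10 true positives five to one, which Average Precision reflects.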
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain ROC-AUC and Precision-Recall AUC, and when each applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Regression Metrics
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Core Regression Metrics
$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
$$RMSE = \sqrt{MSE}$$
$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$
| Metric | Range | Unit | Pros | Cons |
|---|---|---|---|---|
| MAE | [0, ∞) | Same as target | Interpretable, robust to outliers | Not differentiable at 0; treats all errors linearly, so large errors get no extra penalty |
| MSE | [0, ∞) | Squared target | Differentiable, penalises large errors heavily | Unit is squared (hard to interpret), sensitive to outliers |
| RMSE | [0, ∞) | Same as target | Interpretable + penalises large errors | Still sensitive to outliers |
| MAPE | [0%, ∞%) | Percentage | Scale-independent, easy to explain to business | Explodes when y_i ≈ 0; asymmetric, because over-predictions can exceed 100% error while under-predictions of positive targets cannot, so it favours models that under-predict |
| R² | (-∞, 1] | Unitless | Proportion of variance explained; 1.0 = perfect | Can be negative for worse-than-baseline models |
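To connect the formulas to numbers, the sketch below evaluates each metric by hand on four invented target/prediction pairs, using only the standard library.

```python
import math

# Hypothetical targets and predictions (illustrative values only)
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]  # [0.5, 0.0, -2.0, -1.0]

mae = sum(abs(e) for e in errors) / n
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)
mape = 100.0 / n * sum(abs(e / t) for e, t in zip(errors, y_true))

y_mean = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)               # residual sum of squares
ss_tot = sum((t - y_mean) ** 2 for t in y_true)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f}  MSE={mse:.4f}  RMSE={rmse:.4f}  MAPE={mape:.2f}%  R2={r2:.4f}")
```

The single error of 2.0 accounts for about three quarters of the MSE but only a little over half of the MAE, which is the outlier sensitivity noted in the table.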
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, mean_absolute_percentage_error)
from sklearn.linear_model import Ridge
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import numpy as np
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# ── Compute all metrics ───────────────────────────────────────
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_test, y_pred) * 100 # Convert to %
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.2f}%")
print(f"R²: {r2:.4f}")
# ── Adjusted R² — penalises unnecessary features ──────────────
n = len(y_test) # Number of samples
p = X_test.shape[1] # Number of features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"Adjusted R²: {adj_r2:.4f}")
# Adjusted R² penalises adding features that don't improve the model
# ── Residual plot — most important diagnostic ─────────────────
import matplotlib.pyplot as plt
residuals = y_test - y_pred
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(y_pred, residuals, alpha=0.6, s=20)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values'); plt.ylabel('Residuals')
plt.title('Residual Plot (should be random around 0)')
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=30, edgecolor='black', color='#d4af37', alpha=0.8)
plt.xlabel('Residual'); plt.ylabel('Frequency')
plt.title('Residual Distribution (should be ~Normal)')
plt.tight_layout(); plt.show()
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain regression metrics and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 77 — Bias-Variance
Bias-Variance Tradeoff & Learning Curves
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
The Fundamental Decomposition
For squared-error loss, the expected generalisation error of a model can be decomposed as:
$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$
| | Bias | Variance |
|---|---|---|
| Definition | Error from incorrect assumptions in the model (wrong model family) | Error from sensitivity to fluctuations in training data |
| Symptom | Underfitting — poor on both train and test | Overfitting — great on train, poor on test |
| Example | Fitting a line to quadratic data | Decision tree with depth=30 memorising training noise |
| Fix | More complex model, more features, better features | Regularisation, dropout, more data, pruning, ensemble |
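The "line fitted to quadratic data" example from the table can be reproduced with NumPy. This sketch covers only the bias column: the data is noise-free by construction (an assumption made to isolate bias), so any remaining error comes from the model family being too simple.

```python
import numpy as np

# Noise-free quadratic data (a hypothetical toy example)
x = np.linspace(-3, 3, 50)
y = x ** 2


def fit_mse(degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    coeffs = np.polyfit(x, y, deg=degree)
    y_hat = np.polyval(coeffs, x)
    return float(np.mean((y - y_hat) ** 2))


mse_line = fit_mse(1)  # straight line: wrong model family, error stays large
mse_quad = fit_mse(2)  # quadratic: matches the data-generating function

print(f"Degree 1 training MSE: {mse_line:.4f}")
print(f"Degree 2 training MSE: {mse_quad:.2e}")
```

No amount of extra data would fix the degree-1 fit, which is the signature of high bias. Demonstrating variance would require refitting on resampled noisy data, which is not shown here.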
Learning Curves
Learning curves plot training and cross-validation scores as a function of training set size. They are one of the most useful tools for diagnosing bias versus variance.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
X, y = load_breast_cancer(return_X_y=True)
# ── Learning curve ────────────────────────────────────────────
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='rbf', C=1.0, random_state=42))
])
train_sizes, train_scores, val_scores = learning_curve(
    pipeline, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),  # 10% to 100% of each fold's training data
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    shuffle=True, random_state=42
)
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)
plt.figure(figsize=(9, 5))
plt.plot(train_sizes, train_mean, 'b-o', label='Training Score', linewidth=2)
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
plt.plot(train_sizes, val_mean, 'r-o', label='CV Score', linewidth=2)
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='red')
plt.xlabel('Training Examples')
plt.ylabel('Accuracy')
plt.title('Learning Curve — SVM with RBF Kernel')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# ── Interpreting Learning Curves ──────────────────────────────
print("""
DIAGNOSIS GUIDE:
• Both curves plateau at HIGH score → Good model, no problem
• Both curves plateau at LOW score → High Bias (Underfitting)
    Fix: More complex model, better features, less regularisation
• Large gap between train and CV → High Variance (Overfitting)
    Fix: More data, regularisation, simpler model, dropout (for neural networks)
• CV score still improving with more data → Get more data!
""")
Validation Curve: Best Parameter for Bias/Variance
Use sklearn.model_selection.validation_curve to plot train/CV scores vs a single hyperparameter (e.g., max_depth, C, alpha). This shows exactly where a parameter transitions from underfitting to overfitting — the optimal value is at the peak CV score.
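A minimal sketch of validation_curve in use, assuming a decision tree on the breast cancer dataset with max_depth as the swept hyperparameter (both choices are illustrative, not prescribed by this module):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
depths = np.arange(1, 11)  # candidate values for max_depth

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy",
)

# Each row holds the scores of one max_depth value across the 5 folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
best_depth = int(depths[np.argmax(val_mean)])

for d, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={d:2d}  train={tr:.3f}  cv={va:.3f}")
print(f"Depth with the highest mean CV accuracy: {best_depth}")
```

Training accuracy keeps climbing with depth while CV accuracy levels off or dips; the depth where the two start to separate marks the shift from underfitting toward overfitting.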
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain bias-variance tradeoff & learning curves and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 78 — GridSearchCV
GridSearchCV & RandomizedSearchCV
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
GridSearchCV — Exhaustive Search
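GridSearchCV tries every combination of hyperparameters in a grid. With cv=5 and 3×3×3=27 parameter combinations, it trains 27×5=135 models, plus one final refit of the best combination when refit=True. Each combination is scored by cross-validation rather than on a single validation split, which reduces the risk of overfitting the choice of hyperparameters to one particular split.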
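The fit count is easy to verify before launching a search. The sketch below enumerates a hypothetical 3×3×3 grid with itertools.product; the parameter names and values are assumptions for illustration.

```python
from itertools import product

# Hypothetical 3x3x3 grid (names and values chosen for illustration)
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
}
cv_folds = 5

# Cartesian product of all value lists, one dict per combination
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_fits = len(combinations) * cv_folds

print(f"Combinations: {len(combinations)}")
print(f"Model fits with {cv_folds}-fold CV: {n_fits}")
```

Adding a fourth parameter with only two values doubles the count, which is why exhaustive grids become impractical quickly. scikit-learn also ships sklearn.model_selection.ParameterGrid for the same enumeration.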
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from scipy.stats import uniform, randint
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# ── GridSearchCV ──────────────────────────────────────────────
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'min_samples_leaf': [1, 5]
}
# Total: 3×3×3×2 = 54 combinations × 5 folds = 270 model fits
grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring='roc_auc',       # Optimise for ROC-AUC
    refit=True,              # Refit best model on all training data
    n_jobs=-1,               # Parallelise across all CPU cores
    verbose=1,               # Print progress
    return_train_score=True  # Also track training scores
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV ROC-AUC: {grid_search.best_score_:.4f}")
print(f"Test ROC-AUC: {grid_search.score(X_test, y_test):.4f}")
# Access the best estimator directly
best_model = grid_search.best_estimator_
# Explore all results
import pandas as pd
cv_results = pd.DataFrame(grid_search.cv_results_)
top10 = cv_results.sort_values('mean_test_score', ascending=False).head(10)
print(top10[['params', 'mean_test_score', 'std_test_score', 'mean_train_score']])
# ── RandomizedSearchCV — for large search spaces ──────────────
param_dist = {
    'n_estimators': randint(50, 500),      # Integers from 50 to 499 (upper bound is exclusive)
    'max_depth': randint(2, 12),           # Integers from 2 to 11
    'learning_rate': uniform(0.001, 0.3),  # uniform(loc, scale): floats from 0.001 to 0.301
    'subsample': uniform(0.5, 0.5),        # Floats from 0.5 to 1.0
    'min_samples_leaf': randint(1, 20),    # Integers from 1 to 19
    'max_features': uniform(0.3, 0.7)      # Floats from 0.3 to 1.0 (fraction of features)
}
random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,  # Try 50 random combinations (vs 54 for the grid above)
    cv=5,
    scoring='roc_auc',
    refit=True,
    n_jobs=-1,
    random_state=42,
    verbose=1
)
random_search.fit(X_train, y_train)
print(f"\nRandomized Best params: {random_search.best_params_}")
print(f"Randomized Best CV ROC-AUC: {random_search.best_score_:.4f}")
Grid vs Random Search: When to Use Which
- GridSearchCV: Small parameter spaces (≤ 50 combinations), when you know the right ballpark for each parameter.
- RandomizedSearchCV: Large spaces with many parameters. It often finds comparably good solutions in far fewer iterations, which makes it a good first-pass exploration tool.
- Optuna (Day 79): Suited to large spaces with 10+ hyperparameters. It uses Bayesian optimisation to focus on promising regions.
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain GridSearchCV and RandomizedSearchCV, and when each applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 79 — Optuna
Optuna — Bayesian Hyperparameter Optimisation
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Why Optuna?
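Optuna's default sampler is the Tree-structured Parzen Estimator (TPE), a Bayesian optimisation algorithm that uses the results of earlier trials to model which hyperparameter regions score well and then samples new trials from those regions. Because each trial is informed by the previous ones, it typically needs fewer trials than random search to reach a comparable score, and the advantage grows with the size of the search space.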
# pip install optuna
import optuna
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# ── Define the objective function ─────────────────────────────
def objective(trial):
    """Optuna calls this function many times, each with different params."""
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 10),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 30),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
        'min_impurity_decrease': trial.suggest_float('min_impurity_decrease', 0.0, 0.1),
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)
    return scores.mean()  # Maximised only because the study sets direction='maximize' (Optuna minimises by default)
# ── Create and run the study ──────────────────────────────────
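study = optuna.create_study(
    direction='maximize',  # We want max ROC-AUC
    sampler=optuna.samplers.TPESampler(seed=42),  # Bayesian (default)
    # The pruner only acts if the objective calls trial.report() and trial.should_prune();
    # the objective above reports nothing mid-trial, so no trial is pruned here
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5)
)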
# Suppress verbose logging
optuna.logging.set_verbosity(optuna.logging.WARNING)
study.optimize(
    objective,
    n_trials=100,  # Number of hyperparameter configurations to try
    timeout=300,   # Stop after 5 minutes (whichever comes first)
    show_progress_bar=True
)
# ── Results ───────────────────────────────────────────────────
print(f"Best trial: #{study.best_trial.number}")
print(f"Best ROC-AUC (CV): {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
# ── Train final model with best params ────────────────────────
best_model = GradientBoostingClassifier(**study.best_params, random_state=42)
best_model.fit(X_train, y_train)
from sklearn.metrics import roc_auc_score
y_scores = best_model.predict_proba(X_test)[:, 1]
print(f"Test ROC-AUC: {roc_auc_score(y_test, y_scores):.4f}")
# ── Optuna visualisations ──────────────────────────────────────
import optuna.visualization as vis
# vis.plot_optimization_history(study).show() # Objective value (ROC-AUC) over trials; plots require plotly
# vis.plot_param_importances(study).show() # Which params matter most
# vis.plot_contour(study, params=['learning_rate', 'max_depth']).show()
# ── Suggest parameter types reference ─────────────────────────
print("""
trial.suggest_int(name, low, high) → integer in [low, high]
trial.suggest_float(name, low, high) → float in [low, high]
trial.suggest_float(name, low, high, log=True) → float in [low, high] (log scale)
trial.suggest_categorical(name, choices) → one of the choices
trial.suggest_float(name, low, high, step=q) → discrete grid with spacing q (replaces the deprecated suggest_discrete_uniform)
""")
Optuna Integration with XGBoost + Early Stopping
For XGBoost/LightGBM, add a pruning callback inside the objective so Optuna can stop underperforming trials mid-training (saving significant compute):
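# Inside objective(trial): the sklearn API names the first eval_set 'validation_0'
pruning_callback = optuna.integration.XGBoostPruningCallback(trial, 'validation_0-auc')
model = xgb.XGBClassifier(..., eval_metric='auc', callbacks=[pruning_callback])
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)])  # X_tr/X_val: hypothetical train/validation split
Common mistakes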
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain Optuna and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Advanced Ensembles — Stacking, Bagging, Voting
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Ensemble Taxonomy
| Method | Strategy | Reduces | Example |
|---|---|---|---|
| Bagging | Train multiple models on bootstrapped subsets; average predictions | Variance | Random Forest, BaggingClassifier |
| Boosting | Train models sequentially; each corrects errors of the previous | Mainly bias (can also reduce variance) | XGBoost, LightGBM, AdaBoost |
| Voting | Combine predictions from diverse models by majority vote or average | Both | VotingClassifier |
| Stacking | Use model predictions as features for a meta-learner | Both | StackingClassifier |
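The difference between majority voting and probability averaging in the Voting row is easiest to see with numbers. The sketch below uses made-up probabilities from three hypothetical models; it only illustrates the arithmetic and is not how `VotingClassifier` is implemented internally.

```python
import numpy as np

# Hypothetical P(class 1) from three models (rows) for two samples (columns)
proba = np.array([
    [0.45, 0.20],
    [0.45, 0.30],
    [0.90, 0.40],
])

# Hard voting: each model casts a 0/1 vote at the 0.5 threshold; the majority wins
votes = (proba >= 0.5).astype(int)
hard_pred = (votes.sum(axis=0) > proba.shape[0] / 2).astype(int)

# Soft voting: average the probabilities first, then apply the threshold
mean_proba = proba.mean(axis=0)
soft_pred = (mean_proba >= 0.5).astype(int)

print("Hard voting predictions:", hard_pred)
print("Mean probabilities:     ", mean_proba)
print("Soft voting predictions:", soft_pred)
```

For the first sample, two models lean slightly towards class 0 and one is very confident in class 1. Hard voting ignores that confidence, while soft voting lets it tip the average over the threshold, which is why soft voting depends on reasonably calibrated probabilities.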
from sklearn.ensemble import (StackingClassifier, BaggingClassifier,
VotingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# ── Stacking Classifier ───────────────────────────────────────
# Level-0 (base) estimators
base_estimators = [
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('svm', Pipeline([('scaler', StandardScaler()), ('svc', SVC(probability=True, kernel='rbf'))])),
('knn', Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=11))]))
]
# Level-1 (meta) estimator — learns from base model predictions
meta_learner = LogisticRegression(C=1.0, max_iter=1000)
stacking = StackingClassifier(
estimators=base_estimators,
final_estimator=meta_learner,
cv=5, # Cross-validate base estimators to prevent leakage
stack_method='predict_proba', # Use probabilities as meta-features
passthrough=False, # Set True to also pass original features to meta-learner
n_jobs=-1
)
stacking.fit(X_train, y_train)
print(f"Stacking Test Accuracy: {stacking.score(X_test, y_test):.4f}")
# ── VotingClassifier ──────────────────────────────────────────
# Hard voting: majority vote of class predictions
# Soft voting: average probabilities (usually better)
voting = VotingClassifier(
estimators=base_estimators,
voting='soft', # 'hard' or 'soft'
n_jobs=-1
)
voting.fit(X_train, y_train)
print(f"Soft Voting Test Accuracy: {voting.score(X_test, y_test):.4f}")
# ── BaggingClassifier ─────────────────────────────────────────
# Bagging around any base estimator (e.g., Deep Decision Trees)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=None),  # Unpruned tree (scikit-learn >= 1.2; older versions use base_estimator=)
    n_estimators=100,
    max_samples=0.8,           # Each tree draws 0.8 * n_train samples
    max_features=0.8,          # Each tree uses a random 80% of the features
    bootstrap=True,            # Draw samples with replacement (bagging), so each tree sees fewer unique rows than it draws
    bootstrap_features=False,  # Draw features without replacement
    random_state=42,
    n_jobs=-1
)
bagging.fit(X_train, y_train)
print(f"Bagging Test Accuracy: {bagging.score(X_test, y_test):.4f}")
# ── Compare all ensembles with cross-validation ───────────────
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Bagging DT': bagging,
'Soft Voting': VotingClassifier(estimators=base_estimators, voting='soft', n_jobs=-1),
'Stacking': StackingClassifier(estimators=base_estimators, final_estimator=meta_learner, cv=5, n_jobs=-1)
}
print("\n=== Cross-Validation Comparison ===")
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc', n_jobs=-1)
    print(f"{name:20s}: {scores.mean():.4f} ± {scores.std():.4f}")
Module 6 Key Takeaways
- Always use stratified cross-validation for classification — never just a single split
- Match your evaluation metric to your business goal (F1 ≠ AUC ≠ accuracy)
- Use learning curves to diagnose bias vs variance before tuning
- Start with RandomizedSearchCV for exploration, then refine with GridSearchCV or Optuna
- Stacking often matches or beats voting, but it is more complex and slower to train; use it as a final step and confirm the gain with cross-validation
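The `cv=5` argument of `StackingClassifier` exists so that the meta-learner is trained on predictions the base models made for rows they had not seen. A rough manual sketch of that idea, using `cross_val_predict` to build out-of-fold meta-features, is shown below. It mimics the mechanism rather than reproducing scikit-learn's exact internals, and the two base models and the shuffled folds are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

base_models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11)),
}
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Step 1: out-of-fold probabilities. Each training row is predicted by a model
# that never saw that row, so the meta-features are not inflated by memorisation.
oof_features = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=inner_cv, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Step 2: fit the meta-learner on the out-of-fold features
meta_learner = LogisticRegression(max_iter=1000).fit(oof_features, y_train)

# Step 3: refit each base model on the full training set, then build test meta-features
test_features = np.column_stack([
    model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for model in base_models.values()
])

test_auc = roc_auc_score(y_test, meta_learner.predict_proba(test_features)[:, 1])
print(f"Manual stacking Test ROC-AUC: {test_auc:.4f}")
```

If step 1 used in-sample predictions instead, a base model that memorises the training set (such as an unpruned forest) would look near-perfect to the meta-learner and receive too much weight.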
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain advanced ensembles and when they apply.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
