Module 4: Supervised Learning Algorithms
100 Days of ML Module 4 — Master Supervised Learning: Linear/Logistic Regression, Decision Trees, KNN, Naive Bayes, SVM, Random Forests, XGBoost, LightGBM, CatBoost, Stacking, and Ensemble Methods.
This is the core algorithms module — the heart of classical Machine Learning. You'll learn every major supervised algorithm from first principles: the mathematical intuition, how they work, their strengths and weaknesses, key hyperparameters, and practical sklearn implementation. By the end you'll know when to use each algorithm and why.
Linear Regression — The Foundation of Supervised Learning
Why this matters
Linear regression is the foundation of supervised learning: cost functions, gradients, assumptions, and residuals appear in every advanced algorithm interview.
Intuition
Linear regression fits a straight line (or hyperplane) through data points that minimizes the total squared distance between predictions and actual values. It models the relationship between input features $X$ and a continuous target $y$ as a linear function.
Worked example — One feature
Suppose y = 2 + 3x. For x = 4, ŷ = 2 + 3·4 = 14. The MSE for predictions [14, 8] against actual values [15, 7] is ((14−15)² + (8−7)²) / 2 = (1 + 1) / 2 = 1.0. Because errors are squared, a single outlier with a large error dominates the MSE, which is why robust metrics (MAE) and residual checks matter.
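The arithmetic above can be checked in a few lines of plain Python. This is a minimal sketch: the helper names `mse` and `mae` and the third "outlier" point (actual 27, predicted 17) are illustrative assumptions, not part of the original example.

```python
# Hypothetical helpers for the worked example; the names are illustrative.
def mse(y_true, y_pred):
    """Mean of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_hat = 2 + 3 * 4                    # prediction of y = 2 + 3x at x = 4
base_mse = mse([15, 7], [14, 8])     # the two-point example from the text
base_mae = mae([15, 7], [14, 8])

# Assumed outlier: a third point whose actual value is 27 but is predicted as 17
out_mse = mse([15, 7, 27], [14, 8, 17])
out_mae = mae([15, 7, 27], [14, 8, 17])

print(y_hat, base_mse, base_mae)
print(out_mse, out_mae)
```

With this assumed outlier, one error of 10 raises the MSE from 1.0 to 34.0 but the MAE only from 1.0 to 4.0.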
Linear Regression Model:
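$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \boldsymbol{\theta}^T \mathbf{x}$$
Here $\theta_0$ is the bias (intercept) and $\theta_1, \ldots, \theta_n$ are the feature weights (slope parameters). The vector form assumes $\mathbf{x}$ carries a leading constant feature $x_0 = 1$, so that $\theta_0$ is part of $\boldsymbol{\theta}^T \mathbf{x}$.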
Cost Function — Mean Squared Error (MSE)
We want to find parameters $\boldsymbol{\theta}$ that minimize the Mean Squared Error — the average of squared differences between predictions and actuals:
$$J(\boldsymbol{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2 = \frac{1}{2m} \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2$$
Closed-Form Solution (OLS, Ordinary Least Squares)
For linear regression, there is an exact analytical solution — the Normal Equation:
$$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
Inverting the $n \times n$ matrix $\mathbf{X}^T \mathbf{X}$ costs roughly $O(n^3)$ in the number of features $n$, which becomes impractical for very large feature sets (on the order of $n > 10{,}000$). In those cases, use gradient descent instead.
Assumptions of Linear Regression (LINE)
- L — Linearity: The relationship between X and y is linear.
- I — Independence: Observations are independent of each other.
- N — Normality: Residuals ($y - \hat{y}$) are normally distributed.
- E — Equal Variance (Homoscedasticity): Residual variance is constant across all fitted values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
# ── Load Dataset ────────────────────────────────────────────────
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target # median house value in $100k
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ── Train Linear Regression ──────────────────────────────────────
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
# ── Evaluation Metrics ───────────────────────────────────────────
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Linear Regression Results:")
print(f" MSE: {mse:.4f}")
print(f" RMSE: {rmse:.4f} (in same units as target)")
print(f" MAE: {mae:.4f} (more robust to outliers)")
print(f" R²: {r2:.4f} (1.0 = perfect, 0 = baseline mean model)")
# ── Model Coefficients ───────────────────────────────────────────
coef_df = pd.DataFrame({
'Feature': data.feature_names,
'Coefficient': lr.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(f"\nIntercept: {lr.intercept_:.4f}")
print("\nFeature Coefficients (sorted by absolute magnitude):")
print(coef_df.to_string(index=False))
# ── OLS Normal Equation (from scratch) ──────────────────────────
def normal_equation(X, y):
    """Closed-form OLS solution; pinv keeps it usable even if XᵀX is singular."""
    X_b = np.c_[np.ones(X.shape[0]), X]  # add bias column
    theta = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
    return theta[0], theta[1:]  # intercept, coefficients
# y_train is already a NumPy array (data.target), so it has no .values attribute
intercept_ols, coefs_ols = normal_equation(X_train.values, y_train)
print(f"\nNormal Equation intercept: {intercept_ols:.4f}")
print(f"sklearn intercept: {lr.intercept_:.4f}")
# ── Residual Analysis ────────────────────────────────────────────
residuals = y_test - y_pred
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Plot 1: Predicted vs Actual
axes[0].scatter(y_test, y_pred, alpha=0.4, color='#d4af37', s=20)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Values')
axes[0].set_ylabel('Predicted Values')
axes[0].set_title(f'Predicted vs Actual\nR² = {r2:.4f}')
axes[0].legend()
# Plot 2: Residuals vs Fitted
axes[1].scatter(y_pred, residuals, alpha=0.4, color='#3a7bd5', s=20)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[1].set_xlabel('Fitted Values')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residuals vs Fitted\n(Check: random scatter = good)')
# Plot 3: Residual distribution
axes[2].hist(residuals, bins=40, color='#d4af37', alpha=0.7, edgecolor='black')
axes[2].set_xlabel('Residuals')
axes[2].set_ylabel('Count')
axes[2].set_title('Residual Distribution\n(Should be normal, mean≈0)')
axes[2].axvline(x=0, color='red', linestyle='--')
plt.suptitle('Linear Regression Diagnostics', fontweight='bold')
plt.tight_layout()
plt.show()
✅ Pros
- Highly interpretable (coefficients = feature impact)
- Extremely fast to train and predict
- No hyperparameters (OLS)
- Works well when relationship is truly linear
❌ Cons
- Assumes linear relationship — fails on complex data
- Sensitive to outliers (squares errors)
- Requires feature scaling for gradient descent
- Assumptions (normality, homoscedasticity) rarely perfectly met
Common mistakes
- Using linear regression on clearly non-linear targets without transforms or other models.
- Ignoring residual plots — heteroscedasticity and non-linearity hide in aggregates.
- Applying OLS when features are collinear without regularization (Ridge/Lasso).
Interview checkpoints
- Q: Interpret a positive coefficient on `MedInc`. A: Holding other features fixed, one unit increase in MedInc associates with β increase in target.
- Q: R² vs RMSE? A: R² = variance explained (unitless); RMSE = error in target units, sensitive to outliers.
- Q: When does the normal equation fail? A: When XᵀX is singular (perfectly collinear features) or the number of features is very large; use the pseudoinverse or gradient methods instead.
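The collinearity failure mode can be reproduced on a tiny synthetic matrix. The sketch below uses made-up data (the duplicated column and the noiseless target are assumptions for illustration) together with NumPy's `matrix_rank`, `pinv`, and `lstsq`.

```python
import numpy as np

# Made-up design matrix: the third column duplicates the second, so XᵀX is singular
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones(5), x1, x1])
y = 2 + 3 * x1                          # noiseless target, for illustration only

gram = X.T @ X
rank = np.linalg.matrix_rank(gram)      # rank is below the number of columns

# The pseudoinverse returns the minimum-norm solution among all least-squares fits
theta_pinv = np.linalg.pinv(X) @ y
# lstsq solves the same problem without forming XᵀX explicitly
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("rank of XᵀX:", rank, "for", gram.shape[0], "columns")
print("pinv solution:", theta_pinv.round(4))
print("predictions match y:", np.allclose(X @ theta_pinv, y))
```

The fitted values are still exact, but the weight of 3 is shared between the two identical columns: the individual coefficients are no longer uniquely determined, which is the practical cost of collinearity.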
Practice
- Basic: Fit LinearRegression on California housing; report RMSE and R².
- Intermediate: Plot residuals vs fitted; list two assumption violations if patterns appear.
- Advanced: Implement OLS with np.linalg.pinv; compare coefficients to sklearn within 1e-6.
Recap
- ŷ = θᵀx; minimize MSE for OLS solution.
- Check LINE assumptions via residuals.
- Compare coefficient magnitudes across features only when the features are on comparable scales (for example, after standardization).
Gradient Descent — Batch, Stochastic, and Mini-Batch
Why this matters
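Gradient descent is the optimizer behind models that have no closed-form solution, such as logistic regression and neural networks. Knowing how the learning rate and batch size affect convergence is essential for debugging training, and it is a frequent interview topic.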
Worked example — Learning rate
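On a simple quadratic loss, a learning rate η that is too large makes the iterates oscillate or diverge, while one that is too small converges slowly. Mini-batch GD (commonly 32–256 samples per batch) is the practical default: its gradient estimates are far less noisy than single-sample SGD, and it makes many parameter updates per epoch instead of the single update of full-batch GD.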
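To make the learning-rate behaviour concrete, here is a minimal sketch on the one-dimensional quadratic loss J(θ) = θ², whose gradient is 2θ. The function name `run_gd`, the starting point, and the three η values are illustrative choices.

```python
def run_gd(eta, theta0=1.0, n_steps=20):
    """Gradient descent on J(theta) = theta**2; returns the list of iterates."""
    theta = theta0
    path = [theta]
    for _ in range(n_steps):
        grad = 2 * theta            # dJ/dtheta
        theta = theta - eta * grad  # update: theta <- theta - eta * grad
        path.append(theta)
    return path

slow = run_gd(eta=0.01)      # each step multiplies theta by (1 - 2*0.01) = 0.98
good = run_gd(eta=0.4)       # factor 0.2: fast convergence
diverge = run_gd(eta=1.1)    # factor -1.2: sign flips and magnitude grows

print(f"eta=0.01 -> |theta| after 20 steps: {abs(slow[-1]):.4f}")
print(f"eta=0.4  -> |theta| after 20 steps: {abs(good[-1]):.2e}")
print(f"eta=1.1  -> |theta| after 20 steps: {abs(diverge[-1]):.2f}")
```

For this particular loss each step multiplies θ by (1 − 2η), so the iteration converges only when |1 − 2η| < 1, that is, for 0 < η < 1. Real losses have no such simple bound, which is why the learning rate is tuned empirically.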
The Core Idea
Gradient descent is an iterative optimization algorithm that finds the minimum of a cost function by repeatedly taking steps in the direction of steepest descent (negative gradient). Think of it as blindly walking down a hilly landscape in the direction that goes downhill most steeply.
Gradient Descent Update Rule:
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$$
$\eta$ = learning rate (step size), $\nabla_{\boldsymbol{\theta}} J$ = gradient of the cost function with respect to $\boldsymbol{\theta}$
For linear regression with the cost $J(\boldsymbol{\theta})$ defined earlier (MSE with the $\frac{1}{2m}$ factor), the gradient is: $$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) \cdot x_j^{(i)}$$
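A quick way to trust a hand-derived gradient is to compare it with finite differences. The sketch below does this for the formula above on random synthetic data; the data, the seed, and the function names are assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = np.c_[np.ones(m), rng.normal(size=(m, n))]   # leading column of ones for the bias
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

def cost(theta):
    """J(theta) = (1 / 2m) * sum of squared residuals."""
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

def analytic_gradient(theta):
    """dJ/dtheta_j = (1 / m) * sum_i (y_hat_i - y_i) * x_ij, for all j at once."""
    return X.T @ (X @ theta - y) / m

def numerical_gradient(theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step) - cost(theta - step)) / (2 * eps)
    return grad

max_diff = np.max(np.abs(analytic_gradient(theta) - numerical_gradient(theta)))
print(f"max |analytic - numerical| = {max_diff:.2e}")
```

If the cost is written as plain MSE without the ½, the same derivation gives a factor of 2/m instead of 1/m; the two conventions differ only by a constant that can be absorbed into η.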
| Variant | Batch Size | Update Frequency | Noise Level | Best For |
|---|---|---|---|---|
| Batch GD | All $m$ samples | Once per epoch | Low (smooth) | Small datasets, convex problems |
| Stochastic GD (SGD) | 1 sample | Every sample | High (noisy) | Online learning, very large datasets |
| Mini-Batch GD | 32–512 samples | Every batch | Medium | Default for deep learning; best balance |
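The "Update Frequency" column translates directly into a count of parameter updates per epoch. A small sketch, with `updates_per_epoch` as a hypothetical helper name and m = 1000 chosen for illustration:

```python
import math

def updates_per_epoch(m, batch_size):
    """Number of parameter updates in one pass over m samples."""
    return math.ceil(m / batch_size)

m = 1000
batch_updates = updates_per_epoch(m, batch_size=m)   # batch GD: all samples at once
sgd_updates = updates_per_epoch(m, batch_size=1)     # SGD: one sample per update
mini_updates = updates_per_epoch(m, batch_size=32)   # 31 full batches + 1 batch of 8

print(batch_updates, sgd_updates, mini_updates)
```

Note the ceiling: when the batch size does not divide m, the last batch is smaller but still triggers an update, which matters when averaging per-batch losses.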
import numpy as np
import matplotlib.pyplot as plt
# ══════════════════════════════════════
# GRADIENT DESCENT FROM SCRATCH
# ══════════════════════════════════════
np.random.seed(42)
m = 1000
X_gd = np.random.randn(m, 1)
y_gd = 3 + 2 * X_gd + np.random.randn(m, 1) * 0.5 # y = 3 + 2x + noise
X_b = np.c_[np.ones(m), X_gd] # add bias column
def batch_gradient_descent(X, y, eta=0.01, n_epochs=1000):
    """Full-batch GD: one update per epoch using all m samples."""
    m = len(y)
    theta = np.random.randn(X.shape[1], 1)  # random initialization
    cost_history = []
    for epoch in range(n_epochs):
        y_pred = X @ theta
        residuals = y_pred - y
        # Gradient of mean(residuals**2); the factor 2 appears because this
        # cost omits the 1/2 used in the definition of J
        gradients = (2 / m) * X.T @ residuals
        theta -= eta * gradients
        cost = np.mean(residuals**2)  # cost measured before this epoch's update
        cost_history.append(cost)
    return theta, cost_history

def stochastic_gradient_descent(X, y, eta=0.01, n_epochs=50):
    """SGD: one update per sample, m updates per epoch."""
    m = len(y)
    theta = np.random.randn(X.shape[1], 1)
    cost_history = []
    for epoch in range(n_epochs):
        epoch_cost = 0.0
        indices = np.random.permutation(m)  # shuffle each epoch
        for i in indices:
            xi = X[i:i+1]
            yi = y[i:i+1]
            y_pred = xi @ theta
            gradient = 2 * xi.T @ (y_pred - yi)
            theta -= eta * gradient
            # .item() extracts the scalar; float() on a 1x1 array is deprecated in NumPy
            epoch_cost += ((y_pred - yi) ** 2).item()
        cost_history.append(epoch_cost / m)
    return theta, cost_history

def mini_batch_gradient_descent(X, y, eta=0.01, n_epochs=100, batch_size=32):
    """Mini-batch GD: one update per batch of batch_size samples."""
    m = len(y)
    theta = np.random.randn(X.shape[1], 1)
    cost_history = []
    for epoch in range(n_epochs):
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        epoch_cost = 0.0
        n_batches = 0
        for i in range(0, m, batch_size):
            xi = X_shuffled[i:i+batch_size]
            yi = y_shuffled[i:i+batch_size]
            y_pred = xi @ theta
            gradients = (2 / len(yi)) * xi.T @ (y_pred - yi)
            theta -= eta * gradients
            epoch_cost += np.mean((y_pred - yi) ** 2)
            n_batches += 1
        # Divide by the actual batch count; m // batch_size misses a final partial batch
        cost_history.append(epoch_cost / n_batches)
    return theta, cost_history
# Run all three
theta_batch, cost_batch = batch_gradient_descent(X_b, y_gd, eta=0.01, n_epochs=200)
theta_sgd, cost_sgd = stochastic_gradient_descent(X_b, y_gd, eta=0.01, n_epochs=200)
theta_mini, cost_mini = mini_batch_gradient_descent(X_b, y_gd, eta=0.01, n_epochs=200, batch_size=32)
print("Batch GD — True: [3, 2], Learned:", theta_batch.T.round(3))
print("SGD — True: [3, 2], Learned:", theta_sgd.T.round(3))
print("Mini-Batch — True: [3, 2], Learned:", theta_mini.T.round(3))
# ── Effect of Learning Rate ──────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Convergence curves
axes[0].plot(cost_batch, label='Batch GD', color='#d4af37', linewidth=2)
axes[0].plot(cost_sgd, label='SGD', color='#e74c3c', linewidth=1.5, alpha=0.7)
axes[0].plot(cost_mini, label='Mini-Batch', color='#2ecc71', linewidth=1.5)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('MSE Loss')
axes[0].set_title('Convergence Comparison\n(Note: SGD is noisy, Mini-Batch balances both)')
axes[0].legend()
axes[0].set_yscale('log')
# Learning rate comparison
for eta, color in [(0.001, '#e74c3c'), (0.01, '#d4af37'), (0.1, '#2ecc71')]:
    _, costs = batch_gradient_descent(X_b, y_gd, eta=eta, n_epochs=200)
    axes[1].plot(costs, label=f'η = {eta}', color=color, linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('MSE Loss')
axes[1].set_title('Effect of Learning Rate\n(Too small = slow, too large can diverge)')
axes[1].legend()
axes[1].set_ylim(0, 5)
plt.suptitle('Gradient Descent Variants Comparison', fontweight='bold')
plt.tight_layout()
plt.show()
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain gradient descent and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 43 — Ridge & Lasso
Regularization — Ridge (L2), Lasso (L1), and ElasticNet
Why this matters
Regularization is the standard remedy for overfitting in linear models. Choosing between Ridge, Lasso, and ElasticNet and tuning the penalty strength α are routine modeling tasks and common interview questions.
The Overfitting Problem
When a model is too complex (too many parameters) relative to the training data, it memorizes noise instead of learning the true pattern — this is overfitting. Regularization adds a penalty term to the cost function that discourages large parameter values, forcing the model to learn simpler, more generalizable patterns.
Ridge (L2): Penalize sum of squared coefficients
$$J_{Ridge}(\boldsymbol{\theta}) = MSE(\boldsymbol{\theta}) + \alpha \sum_{j=1}^{n} \theta_j^2$$
Lasso (L1): Penalize sum of absolute coefficients
$$J_{Lasso}(\boldsymbol{\theta}) = MSE(\boldsymbol{\theta}) + \alpha \sum_{j=1}^{n} |\theta_j|$$
ElasticNet: Combination of L1 and L2
$$J_{EN}(\boldsymbol{\theta}) = MSE(\boldsymbol{\theta}) + \alpha \cdot r \sum_{j=1}^{n} |\theta_j| + \alpha \cdot \frac{1-r}{2} \sum_{j=1}^{n} \theta_j^2$$
Why Lasso Creates Sparsity (L1 ≠ L2 Behavior)
The key geometric insight: the L1 penalty (diamond shape in 2D) has corners at the axes. When the MSE loss contours touch the constraint region, they're likely to touch a corner — which lies exactly on one axis, meaning all other coefficients are exactly zero. L2's smooth circular constraint rarely yields exact zeros.
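The same contrast shows up algebraically in the simplest possible case. Assume a single coefficient whose unregularized least-squares estimate is z, and write the penalized objectives as ½(θ − z)² + λ|θ| for L1 and ½(θ − z)² + (λ/2)θ² for L2. This scaling is a simplification chosen for illustration, not the exact one sklearn uses. Both minimizers then have closed forms: soft-thresholding for L1 and proportional shrinkage for L2. The function names below are hypothetical.

```python
def lasso_1d(z, lam):
    """Minimizer of 0.5*(theta - z)**2 + lam*abs(theta): soft-thresholding."""
    if abs(z) <= lam:
        return 0.0                      # small estimates are set exactly to zero
    return z - lam if z > 0 else z + lam

def ridge_1d(z, lam):
    """Minimizer of 0.5*(theta - z)**2 + 0.5*lam*theta**2: proportional shrinkage."""
    return z / (1.0 + lam)

lam = 1.0
for z in [3.0, 0.8, -0.5, -2.0]:
    print(f"z={z:+.1f}  lasso={lasso_1d(z, lam):+.2f}  ridge={ridge_1d(z, lam):+.2f}")
```

Any estimate with |z| ≤ λ becomes exactly zero under the L1 penalty, while the L2 penalty only rescales it, so it reaches zero only if z itself is zero.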
| Property | Ridge (L2) | Lasso (L1) | ElasticNet |
|---|---|---|---|
| Coefficient shrinkage | Shrinks toward zero but never exactly zero | Can drive coefficients to exactly zero (sparse) | Some zero, some shrunk |
| Feature selection | No (keeps all features, just small) | Yes (automatic feature selection) | Yes (partial) |
| Best for | Many small effects; correlated features | Few important features; high-dimensional data | Both large+small effects |
| Hyperparameter | $\alpha$ (strength of penalty) | $\alpha$ (strength of penalty) | $\alpha$ + $r$ (L1 ratio) |
| Correlated features | Distributes weight across correlated group | Arbitrarily picks one from correlated group | Handles better than Lasso alone |
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# ══════════════════════════════════════
# EFFECT OF ALPHA ON COEFFICIENTS
# ══════════════════════════════════════
alphas = np.logspace(-3, 3, 100)
ridge_coefs = []
lasso_coefs = []
for alpha in alphas:
    ridge_coefs.append(Ridge(alpha=alpha).fit(X_train_sc, y_train).coef_)
    lasso_coefs.append(Lasso(alpha=alpha, max_iter=10000).fit(X_train_sc, y_train).coef_)
ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for i, name in enumerate(data.feature_names):
    axes[0].plot(alphas, ridge_coefs[:, i], linewidth=1.5, label=name)
    axes[1].plot(alphas, lasso_coefs[:, i], linewidth=1.5, label=name)
for ax, title in zip(axes, ['Ridge (L2): coefficients shrink, never exactly zero',
                            'Lasso (L1): coefficients reach exact zero (sparse)']):
    ax.set_xscale('log')
    ax.axhline(y=0, color='white', linewidth=0.5, alpha=0.3)
    ax.set_xlabel('Alpha (regularization strength →)')
    ax.set_ylabel('Coefficient Value')
    ax.set_title(title)
    ax.legend(fontsize=7, loc='upper right')
plt.suptitle('Regularization Path — Effect of Alpha on Coefficients', fontweight='bold')
plt.tight_layout()
plt.show()
# ══════════════════════════════════════
# CROSS-VALIDATED ALPHA SELECTION
# ══════════════════════════════════════
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 100), cv=5)
ridge_cv.fit(X_train_sc, y_train)
print(f"RidgeCV best alpha: {ridge_cv.alpha_:.4f}")
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000)
lasso_cv.fit(X_train_sc, y_train)
print(f"LassoCV best alpha: {lasso_cv.alpha_:.6f}")
en_cv = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5, random_state=42)
en_cv.fit(X_train_sc, y_train)
print(f"ElasticNetCV best alpha: {en_cv.alpha_:.6f}, l1_ratio: {en_cv.l1_ratio_:.2f}")
# ══════════════════════════════════════
# PERFORMANCE COMPARISON
# ══════════════════════════════════════
print("\nPerformance Comparison:")
print(f"{'Model':20} {'R² Test':10} {'RMSE Test':10} {'Non-zero Coefs':15}")
print("-" * 60)
from sklearn.linear_model import LinearRegression  # unregularized baseline
for name, model in [
    ('LinearRegression', LinearRegression()),
    ('Ridge (best α)', Ridge(alpha=ridge_cv.alpha_)),
    ('Lasso (best α)', Lasso(alpha=lasso_cv.alpha_, max_iter=10000)),
    ('ElasticNet', ElasticNet(alpha=en_cv.alpha_, l1_ratio=en_cv.l1_ratio_, max_iter=10000))
]:
    model.fit(X_train_sc, y_train)
    y_pred = model.predict(X_test_sc)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    n_nonzero = np.sum(model.coef_ != 0)
    # Column widths match the header row above
    print(f"{name:20} {r2:<10.4f} {rmse:<10.4f} {n_nonzero}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain regularization and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Logistic Regression — Classification with Probabilities
Why this matters
Logistic regression is the standard baseline for classification: it outputs probabilities, its coefficients can be read as changes in log-odds, and its loss (cross-entropy) reappears in neural networks.
Worked example — Odds and log-odds
If P(y=1) = 0.8, odds = 0.8/0.2 = 4, log-odds = ln(4) ≈ 1.39. Logistic regression models log-odds as a linear function of features; threshold 0.5 corresponds to log-odds = 0.
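The conversion between probability, odds, and log-odds is easy to verify numerically. A minimal sketch using only the standard library; the function names are illustrative.

```python
import math

def odds(p):
    """Ratio of the probability of the event to the probability of its complement."""
    return p / (1 - p)

def log_odds(p):
    """Logit: the quantity logistic regression models as a linear function."""
    return math.log(odds(p))

def sigmoid(z):
    """Inverse of the logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.8
print(f"odds              = {odds(p):.2f}")
print(f"log-odds          = {log_odds(p):.2f}")
print(f"sigmoid(log-odds) = {sigmoid(log_odds(p)):.2f}")
print(f"log-odds at p=0.5 = {log_odds(0.5):.2f}")
```

Applying the sigmoid to the log-odds recovers the original probability, which is exactly the mapping logistic regression uses to turn a linear score into P(y=1).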
From Linear to Logistic
Logistic regression uses the sigmoid function to squeeze the linear combination of features into a probability between 0 and 1:
Sigmoid (Logistic) Function:
$$\sigma(z) = \frac{1}{1 + e^{-z}} \quad \text{where } z = \boldsymbol{\theta}^T \mathbf{x}$$
Prediction Probability:
$$\hat{p} = P(y=1 | \mathbf{x}) = \sigma(\boldsymbol{\theta}^T \mathbf{x})$$
Binary Cross-Entropy Loss (Log Loss):
$$J(\boldsymbol{\theta}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{p}^{(i)}) + (1-y^{(i)})\log(1-\hat{p}^{(i)})\right]$$
The decision boundary is where $\hat{p} = 0.5$, i.e., where $\boldsymbol{\theta}^T \mathbf{x} = 0$. This is a hyperplane in feature space.
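The binary cross-entropy formula can be evaluated by hand to see how it rewards and punishes confidence. The sketch below uses made-up labels and predicted probabilities (all values are assumptions for illustration) and requires every probability to lie strictly between 0 and 1.

```python
import math

def binary_cross_entropy(y_true, p_hat):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]   # assumed predictions, all on the correct side
confident_wrong = [0.1, 0.9, 0.2, 0.8]   # same confidence, wrong side
uninformed = [0.5, 0.5, 0.5, 0.5]        # always exactly on the decision boundary

for name, p_hat in [("right", confident_right),
                    ("wrong", confident_wrong),
                    ("p=0.5", uninformed)]:
    print(f"{name:6s} loss = {binary_cross_entropy(y_true, p_hat):.4f}")
```

Predicting 0.5 everywhere gives a loss of ln 2 ≈ 0.693, a useful reference point for balanced classes: a model scoring above that is doing worse than an uninformed guess.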
Multi-class Strategies
| Strategy | How It Works | Num Classifiers | sklearn parameter |
|---|---|---|---|
| One-vs-Rest (OvR) | Train N binary classifiers: each class vs all others. Predict class with highest score. | N (one per class) | `OneVsRestClassifier(LogisticRegression())`; `multi_class='ovr'` is deprecated since scikit-learn 1.5 |
| One-vs-One (OvO) | Train a binary classifier for every pair of classes. Majority vote wins. | N(N-1)/2 | SVM default for multi-class |
| Softmax (Multinomial) | Single model with K output nodes. Use softmax to normalize to probability over all classes. | 1 (unified model) | Default for multiclass targets with `solver='lbfgs'`; `multi_class='multinomial'` is deprecated since scikit-learn 1.5 |
Softmax Function (K classes):
$$P(y=k | \mathbf{x}) = \frac{e^{\boldsymbol{\theta}_k^T \mathbf{x}}}{\sum_{j=1}^{K} e^{\boldsymbol{\theta}_j^T \mathbf{x}}}$$
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (accuracy_score, classification_report,
confusion_matrix, roc_auc_score, roc_curve)
from sklearn.preprocessing import StandardScaler
# ── Binary Classification Example ────────────────────────────────
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# Train Logistic Regression
lr = LogisticRegression(
C=1.0, # Inverse regularization: smaller C = stronger regularization
penalty='l2', # L2 regularization (Ridge-style)
solver='lbfgs', # optimization algorithm
max_iter=1000,
random_state=42
)
lr.fit(X_train_sc, y_train)
# Predictions
y_pred = lr.predict(X_test_sc)
y_proba = lr.predict_proba(X_test_sc)[:, 1] # probability of class 1
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
# ── Visualize Sigmoid and Decision Boundary ──────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Sigmoid function
z = np.linspace(-10, 10, 200)
sigma = 1 / (1 + np.exp(-z))
axes[0].plot(z, sigma, color='#d4af37', linewidth=2.5)
axes[0].axhline(y=0.5, color='red', linestyle='--', label='threshold=0.5')
axes[0].axvline(x=0, color='white', linestyle='--', alpha=0.3)
axes[0].fill_between(z, sigma, 0.5, where=(sigma > 0.5), alpha=0.15, color='#2ecc71')
axes[0].fill_between(z, sigma, 0.5, where=(sigma < 0.5), alpha=0.15, color='#e74c3c')
axes[0].set_xlabel('z = θᵀx')
axes[0].set_ylabel('σ(z) = P(y=1|x)')
axes[0].set_title('Sigmoid Function')
axes[0].legend()
axes[0].grid(True, alpha=0.2)
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)
axes[1].plot(fpr, tpr, color='#d4af37', linewidth=2, label=f'AUC = {auc:.4f}')
axes[1].plot([0,1], [0,1], 'r--', linewidth=1, label='Random (AUC=0.5)')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()
axes[1].fill_between(fpr, tpr, alpha=0.15, color='#d4af37')
# Confusion Matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[2],
xticklabels=data.target_names, yticklabels=data.target_names)
axes[2].set_title('Confusion Matrix')
axes[2].set_xlabel('Predicted')
axes[2].set_ylabel('Actual')
plt.suptitle('Logistic Regression — Performance Analysis', fontweight='bold')
plt.tight_layout()
plt.show()
# ── Multi-class: Iris dataset ────────────────────────────────────
from sklearn.datasets import load_iris
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)
scaler2 = StandardScaler()
# With the lbfgs solver, multiclass targets are fit as a multinomial (softmax) model by default;
# the explicit multi_class argument is deprecated since scikit-learn 1.5, so it is omitted here
lr_multi = LogisticRegression(solver='lbfgs', max_iter=1000, C=1.0)
lr_multi.fit(scaler2.fit_transform(X_tr), y_tr)
y_pred_multi = lr_multi.predict(scaler2.transform(X_te))
print(f"\nIris Multi-class Accuracy: {accuracy_score(y_te, y_pred_multi):.4f}")
print(classification_report(y_te, y_pred_multi, target_names=iris.target_names))
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain logistic regression and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 45 — Decision Trees
Decision Trees — Gini Impurity, Entropy, and the CART Algorithm
Why this matters
Decision trees are the gateway to Random Forest and boosting — understanding splits, impurity, and overfitting is essential for tabular ML mastery.
How Decision Trees Work
A decision tree recursively partitions the feature space by asking binary questions: "Is feature X ≤ threshold?" At each node, the algorithm greedily selects the split that maximizes the purity of resulting child nodes.
Splitting Criteria
Gini Impurity (used by CART — sklearn's default):
$$G = 1 - \sum_{k=1}^{K} p_k^2$$
$p_k$ = proportion of class $k$ in the node. $G=0$ means the node is perfectly pure; $G=0.5$ is the maximum for a binary problem, reached at a 50-50 split.
Entropy (used by ID3, C4.5 algorithms):
$$H = -\sum_{k=1}^{K} p_k \log_2(p_k)$$
Information Gain = $H(\text{parent}) - \frac{n_L}{n} H(\text{left}) - \frac{n_R}{n} H(\text{right})$
In practice, Gini and Entropy produce very similar trees. Gini is slightly faster (no log computation). sklearn uses CART (Classification and Regression Trees) which uses Gini for classification and MSE for regression.
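Both criteria and the resulting gain of a split can be computed by hand from class counts. The sketch below uses an assumed toy node (60/40 class counts and one hypothetical split); the function names are illustrative and this is not how sklearn implements its splitter internally.

```python
import math

def gini(counts):
    """Gini impurity from class counts: 1 - sum(p_k**2)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits from class counts; empty classes contribute 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def impurity_decrease(parent, left, right, measure):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n, n_left, n_right = sum(parent), sum(left), sum(right)
    return measure(parent) - (n_left / n) * measure(left) - (n_right / n) * measure(right)

# Assumed toy node: 60 samples of class A and 40 of class B,
# split into a left child (50 A, 10 B) and a right child (10 A, 30 B)
parent, left, right = [60, 40], [50, 10], [10, 30]

print(f"Gini(parent)     = {gini(parent):.4f}")
print(f"Entropy(parent)  = {entropy(parent):.4f}")
print(f"Gini decrease    = {impurity_decrease(parent, left, right, gini):.4f}")
print(f"Information gain = {impurity_decrease(parent, left, right, entropy):.4f}")
```

A greedy tree builder evaluates this decrease for every candidate threshold and keeps the split with the largest value; with `measure=entropy` the decrease is exactly the information gain defined above.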
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# ══════════════════════════════════════
# TRAIN DECISION TREE
# ══════════════════════════════════════
dt = DecisionTreeClassifier(
criterion='gini', # 'gini' or 'entropy'
max_depth=4, # maximum depth of tree (controls complexity)
min_samples_split=5, # minimum samples required to split a node
min_samples_leaf=2, # minimum samples required in a leaf
random_state=42
)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# ── Text representation of the tree ─────────────────────────────
print("\nTree Rules (text format):")
print(export_text(dt, feature_names=iris.feature_names))
# ── Visualize the tree ───────────────────────────────────────────
fig, ax = plt.subplots(figsize=(20, 8))
plot_tree(
dt,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, # color nodes by class
rounded=True,
ax=ax,
fontsize=9
)
ax.set_title('Decision Tree — Iris Dataset (max_depth=4)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# ── Feature Importances ──────────────────────────────────────────
importances = pd.Series(dt.feature_importances_, index=iris.feature_names)
importances.sort_values().plot(kind='barh', color='#d4af37', alpha=0.8, edgecolor='black')
plt.title('Decision Tree Feature Importances\n(Based on total Gini impurity reduction)')
plt.tight_layout()
plt.show()
# ── Decision Boundary Visualization (2D) ────────────────────────
X_2d = X[:, 2:] # petal length and petal width (most discriminative)
dt_2d = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_2d.fit(X_2d, y)
x1_range = np.linspace(X_2d[:,0].min()-0.5, X_2d[:,0].max()+0.5, 200)
x2_range = np.linspace(X_2d[:,1].min()-0.5, X_2d[:,1].max()+0.5, 200)
xx1, xx2 = np.meshgrid(x1_range, x2_range)
Z = dt_2d.predict(np.c_[xx1.ravel(), xx2.ravel()]).reshape(xx1.shape)
fig, ax = plt.subplots(figsize=(8, 6))
ax.contourf(xx1, xx2, Z, alpha=0.3, cmap='Set3')
colors = ['#e74c3c', '#3a7bd5', '#2ecc71']
for i, (cls, col) in enumerate(zip(iris.target_names, colors)):
    mask = y == i
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], c=col, label=cls, s=50, edgecolors='black', linewidth=0.5)
ax.set_xlabel('Petal Length (cm)')
ax.set_ylabel('Petal Width (cm)')
ax.set_title('Decision Tree: Decision Boundaries (max_depth=3)\nBoundaries are always axis-aligned')
ax.legend()
plt.tight_layout()
plt.show()
Common mistakes
- Unlimited tree depth on noisy data — memorization and terrible generalization.
- Using Gini vs entropy as if they always pick different trees (usually similar splits).
- Expecting stable predictions from a single deep tree (high variance).
Interview checkpoints
- Q: Gini vs entropy? A: Both measure impurity; entropy penalizes uncertain distributions slightly more; results often similar.
- Q: How to control overfitting in trees? A: max_depth, min_samples_leaf, min_samples_split, pruning, ensembles.
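- Q: Why can trees handle mixed data types? A: Each split is an axis-aligned test on a single feature (a numeric threshold or a categorical partition), so features need no common scale. Note that sklearn's tree estimators expect numeric input, so categorical features must be encoded first.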
Practice
- Basic: Train DecisionTreeClassifier on iris; print depth and feature importances.
- Intermediate: Plot validation accuracy vs max_depth; pick depth with best bias-variance tradeoff.
- Advanced: Manually compute Gini for a binary node with 60/40 class counts.
Recap
- Trees partition feature space with greedy impurity reduction.
- Single trees overfit; ensembles (RF, boosting) fix variance.
- Always tune depth and leaf size on validation data.
Next: Day 46 — Gini vs Entropy
Decision Tree Hyperparameters — Controlling Overfitting
Why this matters
A decision tree's hyperparameters determine whether it underfits or memorizes the training set. Knowing what each one controls makes tuning trees, and the ensembles built from them, far more systematic.
Unconstrained decision trees will grow until every leaf is pure — perfectly memorizing training data (overfit). Hyperparameters control tree complexity via pre-pruning (stop early) or post-pruning (grow full then prune).
| Hyperparameter | Effect | Increase → | Decrease → | Typical Range |
|---|---|---|---|---|
| max_depth | Maximum depth of tree | More complex, more overfitting | Simpler, more underfitting | 3–15 |
| min_samples_split | Min samples to split a node | Simpler tree, less overfitting | More complex | 2–20 |
| min_samples_leaf | Min samples in leaf node | Smoother boundaries, less overfitting | More complex | 1–10 |
| max_features | Max features considered per split | Uses more features (expensive) | More randomness (used in Random Forest) | 'sqrt', 'log2', None |
| max_leaf_nodes | Max number of leaf nodes | More complex | Simpler | None or 10–50 |
| ccp_alpha | Minimal cost-complexity pruning | More pruning (simpler) | Less pruning | 0–0.05 |
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, validation_curve, GridSearchCV
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ══════════════════════════════════════
# BIAS-VARIANCE TRADEOFF — max_depth
# ══════════════════════════════════════
train_scores = []
test_scores = []
depths = range(1, 25)
for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, dt.predict(X_train)))
    test_scores.append(accuracy_score(y_test, dt.predict(X_test)))
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(depths, train_scores, 'o-', color='#d4af37', label='Train Accuracy', linewidth=2)
axes[0].plot(depths, test_scores, 's-', color='#3a7bd5', label='Test Accuracy', linewidth=2)
axes[0].axvline(x=depths[test_scores.index(max(test_scores))],
color='red', linestyle='--', alpha=0.7, label='Optimal depth')
axes[0].set_xlabel('max_depth')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Bias-Variance Tradeoff vs max_depth\n(Training keeps rising, Test peaks then falls)')
axes[0].legend()
axes[0].grid(True, alpha=0.2)
# ══════════════════════════════════════
# COST-COMPLEXITY PRUNING (ccp_alpha)
# ══════════════════════════════════════
dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train, y_train)
path = dt_full.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas
prune_train_scores = []
prune_test_scores = []
for alpha in alphas:
    dt = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    dt.fit(X_train, y_train)
    prune_train_scores.append(accuracy_score(y_train, dt.predict(X_train)))
    prune_test_scores.append(accuracy_score(y_test, dt.predict(X_test)))
axes[1].plot(alphas, prune_train_scores, 'o-', color='#d4af37', label='Train', linewidth=2)
axes[1].plot(alphas, prune_test_scores, 's-', color='#3a7bd5', label='Test', linewidth=2)
best_alpha = alphas[np.argmax(prune_test_scores)]
axes[1].axvline(x=best_alpha, color='red', linestyle='--', alpha=0.7,
label=f'Best α={best_alpha:.5f}')
axes[1].set_xlabel('ccp_alpha (pruning strength)')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Post-Pruning via Cost-Complexity\n(ccp_alpha controls how much to prune)')
axes[1].legend()
axes[1].grid(True, alpha=0.2)
plt.suptitle('Decision Tree Overfitting Control', fontweight='bold')
plt.tight_layout()
plt.show()
# ══════════════════════════════════════
# GRID SEARCH — Optimal Hyperparameters
# ══════════════════════════════════════
param_grid = {
'max_depth': [3, 4, 5, 6, 8, 10, None],
'min_samples_split': [2, 5, 10, 20],
'min_samples_leaf': [1, 2, 5, 10],
'criterion': ['gini', 'entropy']
}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
cv=5, scoring='accuracy', n_jobs=-1, verbose=0)
gs.fit(X_train, y_train)
print(f"Best hyperparameters: {gs.best_params_}")
print(f"Best CV accuracy: {gs.best_score_:.4f}")
print(f"Test accuracy: {accuracy_score(y_test, gs.predict(X_test)):.4f}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain the main decision tree hyperparameters and when each applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 47 — KNN Algorithm
K-Nearest Neighbors — Distance-Based Classification
Why this matters
K-Nearest Neighbors is the simplest distance-based learner. It makes the effects of feature scaling, the choice of $k$, and dimensionality easy to see, which is why it is a common baseline and a frequent interview topic.
The Algorithm
KNN is a non-parametric, lazy learning algorithm — it doesn't build an explicit model during training. Instead, it memorizes all training examples and at prediction time finds the $k$ closest training points to the query and takes a majority vote (classification) or average (regression).
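The prediction step described above can be sketched in a few lines of NumPy. This is a minimal illustration for a single query point, assuming Euclidean distance and uniform votes; `knn_predict` and the toy arrays are hypothetical names, not a scikit-learn API.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.0, 5.1]), k=3)
print(pred_a, pred_b)
```

Note that "training" is only storing `X_train` and `y_train`; all the work happens at prediction time, which is what "lazy learning" means here.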
Distance Metrics:
Euclidean: $d(p, q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}$ (L2 norm, default)
Manhattan: $d(p, q) = \sum_{i=1}^{n}|p_i - q_i|$ (L1 norm)
Minkowski: $d(p, q) = \left(\sum_{i=1}^{n}|p_i - q_i|^p\right)^{1/p}$ (generalizes both; p=2→Euclidean, p=1→Manhattan)
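To see how the Minkowski form generalizes the other two, the sketch below evaluates it for $p=1$ and $p=2$ on one pair of points. The function name `minkowski` is a hypothetical helper written for this example.

```python
import numpy as np

def minkowski(p_vec, q_vec, p=2):
    """Minkowski distance of order p between two points."""
    diff = np.abs(np.asarray(p_vec, dtype=float) - np.asarray(q_vec, dtype=float))
    return float((diff ** p).sum() ** (1.0 / p))

a, b = [0, 0], [3, 4]
d_manhattan = minkowski(a, b, p=1)   # |3| + |4|
d_euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2)
print(d_manhattan, d_euclidean)
```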
The Curse of Dimensionality
In high-dimensional spaces, all points become approximately equidistant from each other, so the concept of "nearest" loses meaning. The volume of a hypersphere inscribed in a hypercube, relative to the volume of that hypercube, approaches 0 as dimensions increase. Most of the volume therefore lies in the "corners", and the contrast between the nearest and the farthest neighbor shrinks. This is why KNN degrades badly with many features and why feature selection or PCA helps.
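The volume claim can be checked directly with the closed-form volume of a $d$-dimensional ball, $V_d = \pi^{d/2} / \Gamma(d/2 + 1)$ for radius 1. The sketch below compares it with the volume of the enclosing cube $[-1, 1]^d$; the helper name `inscribed_ball_fraction` is hypothetical.

```python
import math

def inscribed_ball_fraction(d):
    """Fraction of the cube [-1, 1]^d occupied by the unit ball inscribed in it."""
    ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube_volume = 2.0 ** d
    return ball_volume / cube_volume

fractions = {d: inscribed_ball_fraction(d) for d in (2, 3, 5, 10, 20)}
for d, frac in fractions.items():
    print(f"d={d:>2}  ball/cube volume = {frac:.8f}")
```

In two dimensions the ball covers about 79% of the square; by twenty dimensions the fraction is negligible.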
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# ALWAYS scale for KNN!
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# ══════════════════════════════════════
# FINDING OPTIMAL K — Elbow Method
# ══════════════════════════════════════
k_range = range(1, 51)
train_scores = []
test_scores = []
cv_scores = []
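for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean', weights='uniform')
    knn.fit(X_train_sc, y_train)
    train_scores.append(accuracy_score(y_train, knn.predict(X_train_sc)))
    test_scores.append(accuracy_score(y_test, knn.predict(X_test_sc)))
    # 5-fold CV on the training set, so K is not chosen by peeking at the test set
    cv_scores.append(cross_val_score(knn, X_train_sc, y_train, cv=5, scoring='accuracy').mean())
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(k_range, train_scores, 'o-', color='#d4af37', label='Train', linewidth=2, ms=4)
axes[0].plot(k_range, test_scores, 's-', color='#3a7bd5', label='Test', linewidth=2, ms=4)
axes[0].plot(k_range, cv_scores, '^-', color='#2ecc71', label='CV (train folds)', linewidth=2, ms=4)
# Pick K from the cross-validated scores; the test curve is shown for reference only
optimal_k = k_range[int(np.argmax(cv_scores))]
axes[0].axvline(x=optimal_k, color='red', linestyle='--',
                label=f'Optimal K={optimal_k}', alpha=0.8)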
axes[0].set_xlabel('K (number of neighbors)')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Finding Optimal K\n(Small K=complex/overfit, Large K=simple/underfit)')
axes[0].legend()
axes[0].grid(True, alpha=0.2)
# Distance metrics comparison
metrics = ['euclidean', 'manhattan', 'chebyshev']
metric_scores = {}
for metric in metrics:
    knn = KNeighborsClassifier(n_neighbors=optimal_k, metric=metric)
    scores = cross_val_score(knn, X_train_sc, y_train, cv=5, scoring='accuracy')
    metric_scores[metric] = scores
# Matplotlib does not parse CSS 'rgba(...)' strings; pass an RGBA tuple with values in [0, 1]
axes[1].boxplot([metric_scores[m] for m in metrics], patch_artist=True,
                boxprops=dict(facecolor=(212/255, 175/255, 55/255, 0.3), color='#d4af37'))
axes[1].set_xticklabels(metrics)
axes[1].set_title(f'Distance Metric Comparison (K={optimal_k})\n5-fold CV Accuracy')
axes[1].set_ylabel('CV Accuracy')
plt.suptitle('KNN Hyperparameter Analysis', fontweight='bold')
plt.tight_layout()
plt.show()
# ══════════════════════════════════════
# BEST KNN MODEL
# ══════════════════════════════════════
best_knn = KNeighborsClassifier(
n_neighbors=optimal_k,
weights='distance', # closer neighbors have more influence
metric='euclidean',
algorithm='ball_tree', # tree-based neighbor search; scales better than a KD-tree as dimensions grow
n_jobs=-1
)
best_knn.fit(X_train_sc, y_train)
print(f"KNN (K={optimal_k}, weighted) Accuracy: {accuracy_score(y_test, best_knn.predict(X_test_sc)):.4f}")
# ── Curse of Dimensionality Demo ─────────────────────────────────
print("\nCurse of Dimensionality:")
print(f"{'Dimensions':12} {'Mean Distance':15} {'Std Distance':12} {'Ratio':10}")
print("-" * 55)
for n_dim in [2, 5, 10, 50, 100, 500]:
    np.random.seed(42)
    points = np.random.randn(1000, n_dim)
    ref = np.random.randn(1, n_dim)
    dists = np.sqrt(np.sum((points - ref)**2, axis=1))
    ratio = dists.std() / dists.mean() if dists.mean() > 0 else 0
    print(f" {n_dim:<12} {dists.mean():<15.4f} {dists.std():<12.4f} {ratio:.4f}")
print("→ As dimensions grow, std/mean → 0 (all points same distance!)")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain k-nearest neighbors and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 48 — Naive Bayes
Naive Bayes — Probabilistic Classification
Why this matters
Naive Bayes offers near-instant inference and is a standard baseline for text classification, so it pays to understand both its independence assumption and why it often works anyway.
Bayes' Theorem
Bayes' Theorem:
$$P(y | \mathbf{x}) = \frac{P(\mathbf{x} | y) \cdot P(y)}{P(\mathbf{x})}$$
Posterior ∝ Likelihood × Prior
$P(y|\mathbf{x})$ = posterior (what we want), $P(\mathbf{x}|y)$ = likelihood, $P(y)$ = prior, $P(\mathbf{x})$ = evidence (a normalizing constant that is the same for every class)
The "Naive" assumption: all features are conditionally independent given the class. This rarely holds in practice but the algorithm still works surprisingly well:
$$P(y | x_1, x_2, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i | y)$$
| Variant | Assumption about $P(x_i \mid y)$ | Use For |
|---|---|---|
| GaussianNB | Each feature is normally distributed within each class | Continuous features (measurements, sensor data) |
| MultinomialNB | Features are discrete counts or frequencies | Text classification (word counts, TF-IDF) |
| BernoulliNB | Features are binary (0/1) | Binary text features (word present/absent) |
| ComplementNB | Complement class statistics (more robust) | Imbalanced text classification |
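The factorized posterior above can be worked through by hand on a toy example. The sketch below is illustrative only: the priors and per-word likelihoods are made-up numbers, the helper `naive_bayes_posterior` is hypothetical, and only the words present in the message are multiplied in.

```python
# Assumed (made-up) class priors and per-word likelihoods P(word | class)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.2},
    "ham": {"free": 0.1, "meeting": 0.7},
}

def naive_bayes_posterior(words, priors, likelihoods):
    """Return P(class | words) under the conditional-independence assumption."""
    unnormalized = {}
    for label, prior in priors.items():
        score = prior
        for w in words:
            score *= likelihoods[label][w]   # multiply in P(x_i | y)
        unnormalized[label] = score
    evidence = sum(unnormalized.values())     # P(x), the normalizing constant
    return {label: s / evidence for label, s in unnormalized.items()}

posterior = naive_bayes_posterior(["free", "meeting"], priors, likelihoods)
print(posterior)
```

Here the spam score is 0.4 × 0.8 × 0.2 = 0.064 and the ham score is 0.6 × 0.1 × 0.7 = 0.042, so the normalized spam posterior is 0.064 / 0.106, roughly 0.60.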
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer, fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import MinMaxScaler
# ══════════════════════════════════════
# GAUSSIAN NAIVE BAYES — Continuous Features
# ══════════════════════════════════════
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
gnb = GaussianNB(
var_smoothing=1e-9 # small value added to variance for numerical stability
)
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
y_proba = gnb.predict_proba(X_test)
print("GaussianNB:")
print(f" Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f" CV Score: {cross_val_score(gnb, X, y, cv=5, scoring='accuracy').mean():.4f}")
# GaussianNB learns class statistics:
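print(f"\n Class priors (P(y)): {gnb.class_prior_.round(4)}")
# Column 0 of the breast cancer data is 'mean radius', so read the name from data.feature_names
print(f" Mean of '{data.feature_names[0]}' per class: {gnb.theta_[:, 0].round(3)}")
print(f" Variance of '{data.feature_names[0]}' per class: {gnb.var_[:, 0].round(3)}")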
# ══════════════════════════════════════
# MULTINOMIAL NAIVE BAYES — Text Classification
# ══════════════════════════════════════
categories = ['rec.sport.baseball', 'rec.sport.hockey', 'sci.med', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories, random_state=42)
vectorizer = CountVectorizer(stop_words='english', max_features=5000)
X_text = vectorizer.fit_transform(newsgroups.data)
y_text = newsgroups.target
X_tr, X_te, y_tr, y_te = train_test_split(X_text, y_text, test_size=0.2, random_state=42)
mnb = MultinomialNB(alpha=1.0) # alpha = Laplace smoothing (prevents zero probabilities)
mnb.fit(X_tr, y_tr)
y_pred_text = mnb.predict(X_te)
print("\nMultinomialNB — 20 Newsgroups Text Classification:")
print(f" Accuracy: {accuracy_score(y_te, y_pred_text):.4f}")
print(classification_report(y_te, y_pred_text, target_names=newsgroups.target_names))
# Top words for each category
feature_names = vectorizer.get_feature_names_out()
print("Top 5 words per category:")
# Iterate over target_names so row i of feature_log_prob_ matches class index i
for i, category in enumerate(newsgroups.target_names):
    top5 = feature_names[np.argsort(mnb.feature_log_prob_[i])[-5:]][::-1]
    print(f" {category}: {', '.join(top5)}")
# ══════════════════════════════════════
# NAIVE BAYES FOR REAL-TIME PREDICTION
# (near-instant inference makes it great for spam filtering)
# ══════════════════════════════════════
def classify_text(text, vectorizer, model, categories):
    X = vectorizer.transform([text])
    pred_class = model.predict(X)[0]
    pred_probas = model.predict_proba(X)[0]
    print(f"Text: '{text[:60]}...'")
    print(f"Predicted: {categories[pred_class]}")
    for cat, prob in zip(categories, pred_probas):
        bar = '█' * int(prob * 30)
        print(f" {cat:25} {bar} {prob:.4f}")
test_text = "The pitcher threw a fastball and the batter hit a home run"
classify_text(test_text, vectorizer, mnb, categories)
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain Naive Bayes and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 49 — SVM Theory
Support Vector Machines — Maximum Margin Classifier
Why this matters
Support Vector Machines introduce margins, support vectors, and the regularization parameter $C$. These ideas come up often in interviews and in any discussion of how a classifier generalizes.
The Core Idea: Maximize the Margin
SVM finds the hyperplane that maximizes the margin — the distance between the hyperplane and the closest training points from each class. These closest points are called support vectors. A wider margin → better generalization.
SVM Decision Boundary (linear):
$$\mathbf{w}^T \mathbf{x} + b = 0$$
Margin Width:
$$\text{margin} = \frac{2}{\|\mathbf{w}\|}$$
Hard-margin SVM Optimization Problem (labels $y^{(i)} \in \{-1, +1\}$):
$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b) \geq 1 \; \forall i$$
Soft-margin SVM (with slack variables $\xi_i$):
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{subject to} \quad y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \; \forall i$$
The $C$ hyperparameter controls the trade-off: large $C$ approaches a hard margin (few violations allowed, narrower margin, more complex boundary); small $C$ gives a wide margin (more violations allowed, simpler boundary).
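The margin formula can be verified on a tiny separable dataset. In the sketch below the two classes sit on the vertical lines $x_1 = 0$ and $x_1 = 2$, so the widest possible margin is 2. Using a very large `C` to approximate the hard-margin problem is an assumption of this example; `SVC` itself always solves the soft-margin form.

```python
import numpy as np
from sklearn.svm import SVC

# Class -1 lies on x1 = 0, class +1 lies on x1 = 2
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)    # margin width = 2 / ||w||

print("w =", w, "b =", b)
print("margin width =", margin)
print("support vectors:\n", clf.support_vectors_)
```

The fitted boundary should be close to $x_1 = 1$, with a margin width close to 2.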
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# ══════════════════════════════════════
# LINEAR SVM
# ══════════════════════════════════════
svm_linear = SVC(
kernel='linear',
C=1.0, # regularization — smaller = wider margin
random_state=42
)
svm_linear.fit(X_train_sc, y_train)
y_pred = svm_linear.predict(X_test_sc)
print("Linear SVM:")
print(f" Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f" Number of support vectors: {svm_linear.n_support_.sum()}")
print(f" Support vectors per class: {svm_linear.n_support_}")
# ══════════════════════════════════════
# EFFECT OF C PARAMETER
# ══════════════════════════════════════
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
train_accs = []
test_accs = []
for C in C_values:
    svm = SVC(kernel='rbf', C=C, gamma='scale', random_state=42)
    svm.fit(X_train_sc, y_train)
    train_accs.append(accuracy_score(y_train, svm.predict(X_train_sc)))
    test_accs.append(accuracy_score(y_test, svm.predict(X_test_sc)))
fig, ax = plt.subplots(figsize=(8, 5))
ax.semilogx(C_values, train_accs, 'o-', color='#d4af37', label='Train', linewidth=2)
ax.semilogx(C_values, test_accs, 's-', color='#3a7bd5', label='Test', linewidth=2)
ax.set_xlabel('C (inverse regularization strength)')
ax.set_ylabel('Accuracy')
ax.set_title('SVM — Effect of C Parameter (RBF kernel)\nSmall C = wide margin; Large C = narrow margin')
ax.legend()
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()
# ══════════════════════════════════════
# 2D VISUALIZATION OF DECISION BOUNDARY
# ══════════════════════════════════════
from sklearn.datasets import make_classification
X_vis, y_vis = make_classification(n_samples=200, n_features=2, n_redundant=0,
n_informative=2, random_state=42)
# Fixed random_state so the split (and the reported accuracies) are reproducible
X_vis_tr, X_vis_te, y_vis_tr, y_vis_te = train_test_split(X_vis, y_vis, test_size=0.2,
                                                          random_state=42)
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
for ax, C, title in zip(axes,
                        [0.01, 1.0, 100.0],
                        ['C=0.01 (Wide margin,\nmore violations OK)',
                         'C=1.0 (Balanced)',
                         'C=100 (Narrow margin,\nfew violations)']):
    svm = SVC(kernel='rbf', C=C, gamma='scale')
    svm.fit(X_vis_tr, y_vis_tr)
    # Evaluate the classifier on a dense grid to draw the decision regions
    x1 = np.linspace(X_vis[:,0].min()-1, X_vis[:,0].max()+1, 200)
    x2 = np.linspace(X_vis[:,1].min()-1, X_vis[:,1].max()+1, 200)
    xx1, xx2 = np.meshgrid(x1, x2)
    Z = svm.predict(np.c_[xx1.ravel(), xx2.ravel()]).reshape(xx1.shape)
    ax.contourf(xx1, xx2, Z, alpha=0.3, cmap='bwr')
    ax.scatter(X_vis_tr[y_vis_tr==0, 0], X_vis_tr[y_vis_tr==0, 1], c='#e74c3c', s=30, label='Class 0')
    ax.scatter(X_vis_tr[y_vis_tr==1, 0], X_vis_tr[y_vis_tr==1, 1], c='#3a7bd5', s=30, label='Class 1')
    # Highlight support vectors
    sv = svm.support_vectors_
    ax.scatter(sv[:, 0], sv[:, 1], s=150, facecolors='none', edgecolors='#d4af37',
               linewidths=2, label=f'SVs ({len(sv)})')
    acc = accuracy_score(y_vis_te, svm.predict(X_vis_te))
    ax.set_title(f'{title}\nTest Acc={acc:.3f}')
    ax.legend(fontsize=7)
plt.suptitle('SVM: Effect of C Hyperparameter on Decision Boundary', fontweight='bold')
plt.tight_layout()
plt.show()
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain support vector machines and when they apply.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 50 — SVM Kernels
SVM Kernels — The Kernel Trick
Why this matters
Many datasets are not linearly separable. Kernels let an SVM learn non-linear boundaries, and choosing the kernel and its $\gamma$ is usually the main tuning decision.
The Kernel Trick
When data is not linearly separable in the original space, map it to a higher-dimensional space where it becomes separable. The kernel trick does this implicitly without computing the actual high-dimensional mapping — by replacing dot products $\mathbf{x}^T \mathbf{z}$ with a kernel function $K(\mathbf{x}, \mathbf{z})$.
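A small numerical check makes the "implicit mapping" concrete. For 2-D inputs, the degree-2 polynomial kernel $(\mathbf{x}^T\mathbf{z} + 1)^2$ equals an ordinary dot product in a 6-dimensional feature space. The explicit map `phi` below is one standard choice written out for this sketch; the function names are hypothetical.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input (6 output dimensions)."""
    x1, x2 = v
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel with gamma = 1 and r = 1."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # dot product after mapping to 6-D
implicit = poly_kernel(x, z)        # same value, computed in the original 2-D space
print(explicit, implicit)
```

Both routes give 4 for this pair, but the kernel never builds the 6-D vectors. For higher degrees or for the RBF kernel the explicit space is far larger (infinite-dimensional for RBF), which is where the saving matters.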
Common Kernel Functions:
Linear: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z}$
Polynomial: $K(\mathbf{x}, \mathbf{z}) = (\gamma \mathbf{x}^T\mathbf{z} + r)^d$
RBF (Gaussian): $K(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma\|\mathbf{x} - \mathbf{z}\|^2\right)$
Sigmoid: $K(\mathbf{x}, \mathbf{z}) = \tanh(\gamma \mathbf{x}^T\mathbf{z} + r)$
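As a sanity check on the RBF formula, the sketch below computes the kernel matrix by hand and compares it with `sklearn.metrics.pairwise.rbf_kernel`. The helper `rbf_manual`, the random data, and `gamma=0.5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_manual(X, Z, gamma):
    """K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2), computed with broadcasting."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

K_manual = rbf_manual(X, X, gamma=0.5)
K_sklearn = rbf_kernel(X, X, gamma=0.5)
print(np.allclose(K_manual, K_sklearn))
```

Every diagonal entry is 1 because each point is at distance 0 from itself. Larger $\gamma$ makes off-diagonal entries decay faster, so each training point influences a smaller neighborhood.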
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# ══════════════════════════════════════
# KERNEL COMPARISON ON NON-LINEAR DATA
# ══════════════════════════════════════
datasets = {
'Moons': make_moons(n_samples=200, noise=0.2, random_state=42),
'Circles': make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=42),
'Linear blobs': make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=42)
}
kernels = ['linear', 'poly', 'rbf']
fig, axes = plt.subplots(len(datasets), len(kernels), figsize=(15, 12))
fig.suptitle('SVM Kernels on Different Datasets', fontsize=14, fontweight='bold')
for row, (ds_name, (X_ds, y_ds)) in enumerate(datasets.items()):
    X_tr, X_te, y_tr, y_te = train_test_split(X_ds, y_ds, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_tr_sc = scaler.fit_transform(X_tr)
    X_te_sc = scaler.transform(X_te)
    for col, kernel in enumerate(kernels):
        ax = axes[row][col]
        params = {'linear': {'C':1}, 'poly': {'C':1,'degree':3,'coef0':1,'gamma':'scale'},
                  'rbf': {'C':1,'gamma':'scale'}}
        svm = SVC(kernel=kernel, **params[kernel], random_state=42)
        svm.fit(X_tr_sc, y_tr)
        # Evaluate the classifier on a dense grid to draw the decision regions
        x1_range = np.linspace(X_tr_sc[:,0].min()-0.5, X_tr_sc[:,0].max()+0.5, 150)
        x2_range = np.linspace(X_tr_sc[:,1].min()-0.5, X_tr_sc[:,1].max()+0.5, 150)
        xx1, xx2 = np.meshgrid(x1_range, x2_range)
        Z = svm.predict(np.c_[xx1.ravel(), xx2.ravel()]).reshape(xx1.shape)
        ax.contourf(xx1, xx2, Z, alpha=0.3, cmap='bwr')
        ax.scatter(X_tr_sc[y_tr==0,0], X_tr_sc[y_tr==0,1], c='#e74c3c', s=20, alpha=0.8)
        ax.scatter(X_tr_sc[y_tr==1,0], X_tr_sc[y_tr==1,1], c='#3a7bd5', s=20, alpha=0.8)
        acc = accuracy_score(y_te, svm.predict(X_te_sc))
        if row == 0:
            ax.set_title(f'Kernel: {kernel.upper()}\nAcc={acc:.3f}', fontweight='bold')
        else:
            ax.set_title(f'Acc={acc:.3f}')
        if col == 0:
            ax.set_ylabel(ds_name, fontweight='bold')
plt.tight_layout()
plt.show()
# ══════════════════════════════════════
# RBF GAMMA EFFECT
# ══════════════════════════════════════
X_moon, y_moon = make_moons(n_samples=300, noise=0.15, random_state=42)
# Fixed random_state so the split (and the reported accuracies) are reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_moon, y_moon, test_size=0.2, random_state=42)
sc = StandardScaler()
X_tr = sc.fit_transform(X_tr)
X_te = sc.transform(X_te)
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for ax, gamma in zip(axes, [0.01, 0.1, 1.0, 10.0]):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X_tr, y_tr)
    x1 = np.linspace(X_tr[:,0].min()-0.3, X_tr[:,0].max()+0.3, 150)
    x2 = np.linspace(X_tr[:,1].min()-0.3, X_tr[:,1].max()+0.3, 150)
    xx1, xx2 = np.meshgrid(x1, x2)
    Z = svm.predict(np.c_[xx1.ravel(), xx2.ravel()]).reshape(xx1.shape)
    ax.contourf(xx1, xx2, Z, alpha=0.3, cmap='bwr')
    ax.scatter(X_tr[y_tr==0,0], X_tr[y_tr==0,1], c='#e74c3c', s=25)
    ax.scatter(X_tr[y_tr==1,0], X_tr[y_tr==1,1], c='#3a7bd5', s=25)
    acc = accuracy_score(y_te, svm.predict(X_te))
    ax.set_title(f'γ={gamma}\nAcc={acc:.3f}')
plt.suptitle('RBF Kernel: Effect of Gamma\n(Low γ=smooth, High γ=complex/overfit)', fontweight='bold')
plt.tight_layout()
plt.show()
# ══════════════════════════════════════
# SVM HYPERPARAMETER TUNING
# ══════════════════════════════════════
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
'kernel': ['rbf', 'poly']
}
gs = GridSearchCV(SVC(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
gs.fit(X_tr, y_tr)
print(f"Best SVM params: {gs.best_params_}")
print(f"Best CV score: {gs.best_score_:.4f}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain SVM kernels and when each applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 51 — Random Forests
Random Forests — Bagging + Feature Subsampling
Why this matters
Random forests are often a strong first model for tabular data: they need little tuning, provide a built-in validation estimate (the OOB score), and report feature importances.
Bagging (Bootstrap Aggregating)
Train multiple models on different bootstrap samples (random sampling with replacement) of the training data, then aggregate their predictions. Each bootstrap sample contains ~63.2% of unique training examples (the rest are out-of-bag).
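The 63.2% figure comes from the probability that a given example is drawn at least once in $m$ draws with replacement: $1 - (1 - 1/m)^m \to 1 - 1/e \approx 0.632$. The sketch below checks this by simulation; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000

# One bootstrap sample: m indices drawn with replacement from range(m)
idx = rng.integers(0, m, size=m)

unique_fraction = np.unique(idx).size / m   # share of distinct examples in the sample
theoretical = 1 - (1 - 1 / m) ** m          # approaches 1 - 1/e as m grows

print(f"simulated: {unique_fraction:.4f}, theoretical: {theoretical:.4f}")
```

The examples that never appear in `idx` are the out-of-bag samples for that bootstrap draw.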
Random Forest Additions to Bagging
Random Forest further de-correlates trees by also randomly subsampling features at each split:
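- At each split, consider only a random subset of the $p$ features: commonly $\sqrt{p}$ for classification and $p/3$ for regression (the classic recommendation; scikit-learn's RandomForestRegressor uses all features by default).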
- This prevents all trees from making the same top split (e.g., always splitting on the most important feature), leading to more diverse trees and better ensemble performance.
Out-of-Bag (OOB) Error
Since each tree is trained on ~63.2% of data, the remaining ~36.8% (OOB samples) can be used as a free validation set for that tree. The OOB error averages these errors across all trees — a nearly free cross-validation estimate without additional computation!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
# ══════════════════════════════════════
# SINGLE TREE vs RANDOM FOREST
# ══════════════════════════════════════
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
rf = RandomForestClassifier(
n_estimators=100, # number of trees
max_features='sqrt', # sqrt(p) features per split
bootstrap=True, # use bootstrap samples
oob_score=True, # compute OOB score
max_depth=None, # no depth limit per tree
min_samples_leaf=1,
n_jobs=-1,
random_state=42
)
rf.fit(X_train, y_train)
print(f"Single Decision Tree — Test Accuracy: {accuracy_score(y_test, dt.predict(X_test)):.4f}")
print(f"Random Forest — Test Accuracy: {accuracy_score(y_test, rf.predict(X_test)):.4f}")
print(f"Random Forest — OOB Score: {rf.oob_score_:.4f}")
# ══════════════════════════════════════
# EFFECT OF N_ESTIMATORS
# ══════════════════════════════════════
oob_errors = []
test_errors = []
n_trees_range = range(1, 201, 5)
rf_growing = RandomForestClassifier(warm_start=True, oob_score=True, random_state=42, n_jobs=-1)
for n_trees in n_trees_range:
    rf_growing.n_estimators = n_trees
    rf_growing.fit(X_train, y_train)
    oob_errors.append(1 - rf_growing.oob_score_)
    test_errors.append(1 - accuracy_score(y_test, rf_growing.predict(X_test)))
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(n_trees_range, oob_errors, 'o-', color='#d4af37', label='OOB Error', linewidth=2, ms=4)
ax.plot(n_trees_range, test_errors, 's-', color='#3a7bd5', label='Test Error', linewidth=2, ms=4)
ax.axhline(y=1 - accuracy_score(y_test, dt.predict(X_test)), color='red',
linestyle='--', label='Single Tree Error', alpha=0.7)
ax.set_xlabel('Number of Trees')
ax.set_ylabel('Error Rate')
ax.set_title('Random Forest: Effect of n_estimators\n(OOB error ≈ test error — free validation!)')
ax.legend()
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()
# ══════════════════════════════════════
# FEATURE IMPORTANCES
# ══════════════════════════════════════
rf_final = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1, random_state=42)
rf_final.fit(X_train, y_train)
importances = pd.Series(rf_final.feature_importances_, index=data.feature_names)
importances_std = pd.Series(
np.std([tree.feature_importances_ for tree in rf_final.estimators_], axis=0),
index=data.feature_names
)
top_features = importances.sort_values(ascending=False).head(10)
fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(range(10), top_features.values,
xerr=importances_std[top_features.index].values,
color='#d4af37', alpha=0.8, edgecolor='black', capsize=4)
ax.set_yticks(range(10))
ax.set_yticklabels(top_features.index, fontsize=9)
ax.invert_yaxis()
ax.set_title('Random Forest — Feature Importances (Gini)\nError bars show std across trees')
ax.set_xlabel('Mean Decrease in Gini Impurity')
plt.tight_layout()
plt.show()
print(f"\nOOB Score: {rf_final.oob_score_:.4f}")
print(f"Test Score: {accuracy_score(y_test, rf_final.predict(X_test)):.4f}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain random forests and when they apply.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 52 — Bagging
Bagging vs Boosting — The Two Ensemble Paradigms
Why this matters
Bagging and boosting attack different sources of error (variance versus bias), so knowing which one a model suffers from tells you which ensemble to reach for.
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Trees trained in parallel, independently | Trees trained sequentially, each corrects previous |
| Focus | Reduces variance (overfitting) | Reduces bias (underfitting) |
| Base learners | Strong, complex trees (low bias, high variance) | Weak learners — shallow trees (high bias, low variance) |
| Weight of samples | Equal weight (random sampling) | Misclassified samples get higher weight |
| Combination | Majority vote / simple average | Weighted vote / additive model |
| Speed | Parallelizable → Fast | Sequential → Slower |
| Overfitting risk | Low (averaging reduces variance) | Medium (can overfit if too many rounds) |
| Algorithms | Random Forest, BaggingClassifier | AdaBoost, GBM, XGBoost, LightGBM, CatBoost |
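To make the variance-versus-bias contrast in the table concrete, here is a minimal sketch that bags deep trees and boosts decision stumps on the same toy data. The dataset (`make_moons`), the sizes, and the seeds are illustrative assumptions, not part of the course material, and the exact scores will vary with them.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data: two noisy interleaving half-moons (assumed for illustration)
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

models = {
    # High-variance base learner, alone and bagged
    "single deep tree": DecisionTreeClassifier(random_state=0),
    "bagging (deep trees)": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=100, random_state=0),
    # High-bias base learner, alone and boosted
    "single stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    "boosting (stumps)": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:22s} CV accuracy = {scores[name]:.3f}")
```

Comparing each ensemble with its own base learner (deep tree vs bagged deep trees, stump vs boosted stumps) is usually more informative than comparing the two ensembles with each other.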
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain bagging vs boosting and when each applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 53 — AdaBoost
AdaBoost — Adaptive Boosting
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Algorithm Intuition
AdaBoost trains a sequence of weak learners (typically decision stumps — trees with depth=1). After each round, it increases the weight of misclassified samples so the next learner focuses on them. Final prediction is a weighted vote of all learners.
- Initialize all sample weights: $w_i = 1/m$ (labels are encoded as $y_i \in \{-1, +1\}$)
- Train weak learner $h_t$ on weighted samples
- Compute weighted error: $\epsilon_t = \sum_{i=1}^m w_i \cdot \mathbb{1}[h_t(x_i) \neq y_i]$
- Compute learner weight: $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
- Update sample weights: $w_i \leftarrow w_i \cdot \exp(-\alpha_t y_i h_t(x_i))$, then normalize
- Final model: $H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$
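The six steps above can be sketched from scratch in a few lines. This is a minimal illustration of discrete AdaBoost, not scikit-learn's implementation: the helper names `adaboost_fit` and `adaboost_predict`, the toy dataset, and the clipping constant are assumptions made for the example. Labels are mapped to $\{-1, +1\}$ so the weight update matches the formula.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost with decision stumps; y must be in {-1, +1}."""
    m = len(y)
    w = np.full(m, 1.0 / m)                     # step 1: uniform sample weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)        # step 2: train on weighted samples
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))           # step 3: weighted error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)    # guard against log(0) (assumed constant)
        alpha = 0.5 * np.log((1 - eps) / eps)   # step 4: learner weight
        w = w * np.exp(-alpha * y * pred)       # step 5: up-weight the mistakes
        w /= w.sum()                            # ... and normalize
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_predict(X, learners, alphas):
    """Step 6: sign of the alpha-weighted vote (ties broken toward +1)."""
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.where(votes >= 0, 1, -1)

X, y01 = make_moons(n_samples=400, noise=0.25, random_state=0)
y = 2 * y01 - 1                                 # map {0, 1} -> {-1, +1}

learners, alphas = adaboost_fit(X, y, n_rounds=50)
train_acc = np.mean(adaboost_predict(X, learners, alphas) == y)
stump_acc = np.mean(learners[0].predict(X) == y)
print(f"First stump alone: {stump_acc:.3f}")
print(f"50-round ensemble: {train_acc:.3f}")
```

Because every stump predicts the weighted majority class in each leaf, its weighted error stays at or below 0.5, so each $\alpha_t$ is non-negative.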
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ══════════════════════════════════════
# ADABOOST — Base: Decision Stump (depth=1)
# ══════════════════════════════════════
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner
    n_estimators=200,    # number of boosting rounds
    learning_rate=0.5,   # shrinks contribution of each learner (like regularization)
    # `algorithm` is omitted: SAMME.R and the argument itself are deprecated in recent scikit-learn releases
    random_state=42
)
ada.fit(X_train, y_train)
print(f"AdaBoost Accuracy: {accuracy_score(y_test, ada.predict(X_test)):.4f}")
# ── Track accuracy vs number of estimators ──────────────────────
train_staged = list(ada.staged_score(X_train, y_train))
test_staged = list(ada.staged_score(X_test, y_test))
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(train_staged, color='#d4af37', label='Train Accuracy', linewidth=2)
ax.plot(test_staged, color='#3a7bd5', label='Test Accuracy', linewidth=2)
ax.set_xlabel('Number of Estimators')
ax.set_ylabel('Accuracy')
ax.set_title('AdaBoost — Staged Score vs Number of Estimators\n(Note: Unlike random forests, boosting CAN overfit!)')
ax.legend()
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()
# ── Effect of max_depth (stump vs deeper trees) ──────────────────
print("\nDepth Comparison (200 estimators):")
for depth in [1, 2, 3, 5]:
    ada_d = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=depth),
        n_estimators=200, learning_rate=0.5, random_state=42
    )
    cv = cross_val_score(ada_d, X_train, y_train, cv=5, scoring='accuracy')
    print(f" max_depth={depth}: CV={cv.mean():.4f} ± {cv.std():.4f}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain AdaBoost and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Gradient Boosting Machines — Residual Fitting
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
The Gradient Boosting Idea
Gradient Boosting builds an additive model by fitting each new tree to the negative gradient of the loss function with respect to the current prediction — i.e., the residuals (for regression with MSE loss).
Algorithm:
- Initialize with a constant prediction: $F_0(x) = \arg\min_\gamma \sum_i L(y_i, \gamma)$ (e.g., mean for regression)
- For $m = 1, 2, \ldots, M$:
  - Compute pseudo-residuals: $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$
  - Fit a tree $h_m(x)$ to the pseudo-residuals
  - Update: $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$, where $\nu$ is the learning rate
- Final prediction: $F_M(x)$
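For squared-error loss the pseudo-residuals are simply $y_i - F_{m-1}(x_i)$, so the algorithm above reduces to repeatedly fitting small trees to the current residuals. The sketch below shows that special case only. The function names `gbm_fit` and `gbm_predict` and the synthetic sine data are assumptions for illustration, and this is not how scikit-learn implements it internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared-error loss."""
    f0 = y.mean()                         # F_0: best constant under squared error
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred              # negative gradient of 1/2 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)            # h_m fits the pseudo-residuals
        pred = pred + learning_rate * tree.predict(X)   # F_m = F_{m-1} + nu * h_m
        trees.append(tree)
    return f0, trees

def gbm_predict(X, f0, trees, learning_rate=0.1):
    """F_M(x): initial constant plus the shrunken sum of all trees."""
    return f0 + learning_rate * sum(t.predict(X) for t in trees)

# Synthetic regression problem (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

f0, trees = gbm_fit(X, y)
mse_start = np.mean((y - f0) ** 2)
mse_end = np.mean((y - gbm_predict(X, f0, trees)) ** 2)
print(f"Training MSE with F_0 only:   {mse_start:.4f}")
print(f"Training MSE after 100 trees: {mse_end:.4f}")
```

The same `learning_rate` must be used in `gbm_fit` and `gbm_predict`; for other losses only the `residuals` line changes.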
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor, HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
# ══════════════════════════════════════
# SKLEARN GRADIENT BOOSTING
# ══════════════════════════════════════
gbm = GradientBoostingClassifier(
n_estimators=200, # number of boosting stages (trees)
learning_rate=0.1, # shrinkage — smaller = more trees needed, more robust
max_depth=3, # depth of each tree (typically 3-5)
subsample=0.8, # stochastic GBM: use 80% of training data per tree
max_features='sqrt', # feature subsampling per tree
min_samples_leaf=5,
random_state=42
)
gbm.fit(X_train, y_train)
print(f"GBM Accuracy: {accuracy_score(y_test, gbm.predict(X_test)):.4f}")
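# ── Staged predictions to find optimal n_estimators ──────────────
# (Illustration only: choosing N on the test set leaks information; use a validation split in practice)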
train_staged = [accuracy_score(y_train, y_p) for y_p in gbm.staged_predict(X_train)]
test_staged = [accuracy_score(y_test, y_p) for y_p in gbm.staged_predict(X_test)]
optimal_n = np.argmax(test_staged) + 1
print(f"Optimal n_estimators: {optimal_n}")
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(train_staged, color='#d4af37', label='Train', linewidth=2)
ax.plot(test_staged, color='#3a7bd5', label='Test', linewidth=2)
ax.axvline(x=optimal_n, color='red', linestyle='--',
label=f'Optimal N={optimal_n}', alpha=0.8)
ax.set_xlabel('Number of Boosting Rounds')
ax.set_ylabel('Accuracy')
ax.set_title('GBM — Staged Accuracy\n(Use early stopping to find optimal N automatically)')
ax.legend()
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()
# ══════════════════════════════════════
# HISTOGRAM-BASED GBM (stable in sklearn ≥ 1.0; experimental before that)
# Much faster for large datasets (like LightGBM)
# ══════════════════════════════════════
hgbm = HistGradientBoostingClassifier(
max_iter=200,
learning_rate=0.1,
max_depth=6,
l2_regularization=0.1,
random_state=42,
early_stopping=True, # automatically stop when validation score plateaus
validation_fraction=0.1, # fraction of training data for early stopping
n_iter_no_change=20 # patience
)
hgbm.fit(X_train, y_train)
print(f"\nHistGBM Accuracy: {accuracy_score(y_test, hgbm.predict(X_test)):.4f}")
print(f"Actual iterations used: {hgbm.n_iter_}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain gradient boosting machines and when they apply.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 55 — XGBoost
XGBoost — Regularized Gradient Boosting
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Failure mode — Overfitting with too many trees
Validation error drops then rises as n_estimators grows. Fix with early stopping (early_stopping_rounds), lower max_depth, or higher reg_lambda — do not tune on the test set.
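A minimal sketch of that workflow with a three-way split follows: the number of rounds is chosen on a validation set and the test set is touched only once at the end. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so the example has no extra dependency. The synthetic data, the split sizes, and the deliberately high learning rate are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Noisy synthetic data (flip_y adds label noise so overfitting is visible)
X, y = make_classification(n_samples=1500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# 60% train / 20% validation / 20% test
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.2,
                                 max_depth=3, random_state=0)
gbm.fit(X_tr, y_tr)

# Validation log loss after each boosting round
val_loss = [log_loss(y_val, p) for p in gbm.staged_predict_proba(X_val)]
best_n = int(np.argmin(val_loss)) + 1

# Test loss is only reported, never used to choose best_n
test_loss = [log_loss(y_test, p) for p in gbm.staged_predict_proba(X_test)]
print(f"Best number of rounds on validation: {best_n}")
print(f"Validation log loss at best_n / at 300: {val_loss[best_n - 1]:.4f} / {val_loss[-1]:.4f}")
print(f"Test log loss at best_n / at 300:       {test_loss[best_n - 1]:.4f} / {test_loss[-1]:.4f}")
```

With XGBoost the same idea is expressed by passing the validation split as `eval_set` together with `early_stopping_rounds`.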
XGBoost Innovations Over Traditional GBM
- Regularized Objective: Adds L1 and L2 penalties on leaf weights and tree complexity directly to the objective function
- Second-order gradients: Uses both first ($g_i$) and second ($h_i$) derivatives of the loss for better optimization
- Tree pruning: Grows each tree level by level up to max_depth, then prunes back splits whose gain falls below the threshold $\gamma$
- Sparse-aware algorithm: Handles missing values natively by learning optimal direction for missing values
- Column and row subsampling: Samples features and rows for each tree, as Random Forest and LightGBM do, which reduces overfitting and speeds up training
- Parallel computation: Parallelizes split finding across features
XGBoost Regularized Objective:
$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2$$
$T$ = number of leaves, $w_j$ = leaf weight, $\gamma$ = min gain to split, $\lambda$ = L2 on weights
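With the first and second derivatives $g_i$ and $h_i$ summed per leaf ($G = \sum g_i$, $H = \sum h_i$), minimizing this objective gives the optimal leaf weight $w^* = -G / (H + \lambda)$ and the split gain $\frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$, as derived in the XGBoost paper. The sketch below evaluates both on a tiny hand-made example. The function names and the data are assumptions for illustration, only the L2 term is covered, and this is not XGBoost's internal code.

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Loss reduction from splitting a node into left/right children."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + lam)
    left = score(g[left_mask], h[left_mask])
    right = score(g[~left_mask], h[~left_mask])
    parent = score(g, h)
    return 0.5 * (left + right - parent) - gamma

# Squared-error loss l = 1/2 * (y - y_hat)^2  ->  g = y_hat - y, h = 1
y = np.array([1.0, 1.2, 0.9, 3.0, 3.1, 2.9])
y_hat = np.zeros_like(y)              # current prediction before this tree
g = y_hat - y
h = np.ones_like(y)

# Candidate split: first three samples go left, last three go right
left = np.array([True, True, True, False, False, False])

gain = split_gain(g, h, left, lam=1.0, gamma=0.1)
w_left = leaf_weight(g[left], h[left], lam=1.0)
w_right = leaf_weight(g[~left], h[~left], lam=1.0)
print(f"gain = {gain:.4f}, w_left = {w_left:.3f}, w_right = {w_right:.3f}")
```

The split is kept only when the gain is positive, which is why a larger `gamma` prunes more aggressively, and a larger `lam` shrinks the leaf weights toward zero (here 0.775 and 2.25 instead of the unregularized leaf means of about 1.03 and 3.0).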
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
import xgboost as xgb
from xgboost import XGBClassifier, XGBRegressor
data = load_breast_cancer()
X, y = pd.DataFrame(data.data, columns=data.feature_names), data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
# ══════════════════════════════════════
# XGBOOST CLASSIFIER
# ══════════════════════════════════════
xgb_clf = XGBClassifier(
    n_estimators=500,           # max trees (use early stopping)
    learning_rate=0.05,         # eta — smaller = more robust, needs more trees
    max_depth=4,                # tree depth (3-6 for classification)
    min_child_weight=1,         # minimum sum of instance weight in child (min_samples_leaf analog)
    gamma=0.1,                  # minimum loss reduction to make a split (tree pruning)
    subsample=0.8,              # row subsampling per tree
    colsample_bytree=0.8,       # column subsampling per tree
    colsample_bylevel=0.8,      # column subsampling per level
    reg_alpha=0.1,              # L1 regularization on weights
    reg_lambda=1.0,             # L2 regularization on weights
    scale_pos_weight=1,         # for imbalanced: sum(neg)/sum(pos)
    eval_metric='logloss',
    early_stopping_rounds=50,   # stop when the last eval_set stops improving (constructor argument in XGBoost >= 1.6)
    random_state=42,
    n_jobs=-1
)
# Train with early stopping, monitored on the last entry of eval_set.
# The test set doubles as the eval set here for brevity; use a separate validation split in practice.
xgb_clf.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=50
)
y_pred = xgb_clf.predict(X_test)
y_proba = xgb_clf.predict_proba(X_test)[:, 1]
print(f"\nXGBoost — Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"XGBoost — ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
# ── Learning Curves ──────────────────────────────────────────────
results = xgb_clf.evals_result()
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(results['validation_0']['logloss'], color='#d4af37', label='Train Log Loss', linewidth=2)
ax.plot(results['validation_1']['logloss'], color='#3a7bd5', label='Test Log Loss', linewidth=2)
ax.set_xlabel('Boosting Round')
ax.set_ylabel('Log Loss')
ax.set_title('XGBoost — Training Curves\n(Early stopping prevents overfitting)')
ax.legend()
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()
# ── XGBoost Feature Importance (3 types) ────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
importance_types = ['weight', 'gain', 'cover']
labels = [
'weight = times feature used in split',
'gain = avg gain per use (most useful)',
'cover = avg samples covered'
]
for ax, imp_type, label in zip(axes, importance_types, labels):
    importance = pd.Series(xgb_clf.get_booster().get_score(importance_type=imp_type))
    importance.sort_values(ascending=False).head(10).plot(kind='barh', ax=ax,
                                                          color='#d4af37', alpha=0.8)
    ax.invert_yaxis()
    ax.set_title(f'Importance Type: {imp_type}\n({label})', fontsize=9)
plt.suptitle('XGBoost Feature Importance — Three Types', fontweight='bold')
plt.tight_layout()
plt.show()
# ── Cross-Validated Hyperparameter Tuning ───────────────────────
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
param_dist = {
'n_estimators': randint(100, 500),
'max_depth': randint(3, 8),
'learning_rate': uniform(0.01, 0.2),
'subsample': uniform(0.6, 0.4),
'colsample_bytree': uniform(0.6, 0.4),
'gamma': uniform(0, 0.3),
'reg_alpha': uniform(0, 1),
'reg_lambda': uniform(0, 2)
}
xgb_base = XGBClassifier(eval_metric='logloss', random_state=42, n_jobs=-1)
random_search = RandomizedSearchCV(xgb_base, param_distributions=param_dist,
n_iter=30, cv=5, scoring='roc_auc',
random_state=42, n_jobs=-1, verbose=0)
random_search.fit(X_train, y_train)
print(f"\nBest XGBoost params: {random_search.best_params_}")
print(f"Best CV ROC-AUC: {random_search.best_score_:.4f}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain XGBoost and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 56 — LightGBM
LightGBM — Faster, Better Gradient Boosting
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
LightGBM's Key Innovations over XGBoost
| Innovation | XGBoost | LightGBM | Benefit |
|---|---|---|---|
| Tree growth | Level-wise (breadth-first) | Leaf-wise (best-first) | Faster convergence; lower loss |
| GOSS | Uses all instances for gradient computation | Gradient-based One-Side Sampling (keeps all large-gradient instances, randomly samples the small-gradient ones) | Less data used → faster |
| EFB | Uses all features | Exclusive Feature Bundling (bundles mutually exclusive sparse features) | Fewer features → faster |
| Histogram | Originally a pre-sorted exact algorithm (slow on large data); a histogram method (`tree_method='hist'`) is also available | Histogram-based (bins continuous values) | Memory efficient; faster splits |
| Categorical | Typically encoded manually (native support via `enable_categorical` is a later addition) | Native categorical handling | No manual encoding needed |
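The GOSS row can be illustrated with a short sampling sketch: keep the top `a` fraction of instances by absolute gradient, draw a further `b` fraction of the data at random from the remainder, and up-weight those random draws by $(1 - a) / b$ so the gradient sums stay approximately unbiased. The function name `goss_sample`, the values of `a` and `b`, and the random gradients are assumptions for illustration. This shows the idea from the LightGBM paper, not LightGBM's internal code.

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Return (indices, weights) selected by Gradient-based One-Side Sampling."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(gradients)
    n_top, n_rand = int(a * n), int(b * n)

    order = np.argsort(-np.abs(gradients))    # largest |gradient| first
    top_idx = order[:n_top]                   # always kept
    rest_idx = order[n_top:]
    rand_idx = rng.choice(rest_idx, size=n_rand, replace=False)

    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(idx))
    weights[n_top:] = (1 - a) / b             # compensate for under-sampling
    return idx, weights

rng = np.random.default_rng(0)
grads = rng.normal(size=1000)                 # stand-in for per-sample gradients
idx, weights = goss_sample(grads, a=0.2, b=0.1, rng=rng)

print(f"Kept {len(idx)} of {len(grads)} samples")
print(f"Weight on sampled small-gradient rows: {weights[-1]:.1f}")
```

With `a=0.2` and `b=0.1` only 30% of the rows are used to evaluate splits, which is where the speed-up in the table comes from.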
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
# ══════════════════════════════════════
# LIGHTGBM CLASSIFIER
# ══════════════════════════════════════
lgbm = LGBMClassifier(
    n_estimators=500,           # max trees
    learning_rate=0.05,         # shrinkage
    max_depth=-1,               # -1 = unlimited (leaf-wise growth is bounded by num_leaves)
    num_leaves=31,              # key parameter: max leaves per tree (keep below 2^max_depth when max_depth is set)
    min_child_samples=20,       # min samples per leaf
    colsample_bytree=0.8,       # column subsampling per tree (alias: feature_fraction)
    subsample=0.8,              # row subsampling (alias: bagging_fraction)
    subsample_freq=5,           # apply row subsampling every 5 iterations (alias: bagging_freq)
    reg_alpha=0.1,              # L1 regularization
    reg_lambda=0.1,             # L2 regularization
    subsample_for_bin=200000,   # samples for constructing histograms
    class_weight='balanced',    # handle class imbalance
    random_state=42,
    n_jobs=-1,
    verbose=-1                  # suppress verbose output
)
# LightGBM's native early stopping via callbacks
callbacks = [
lgb.early_stopping(stopping_rounds=50, verbose=True),
lgb.log_evaluation(period=100)
]
lgbm.fit(
X_train, y_train,
eval_set=[(X_test, y_test)],
callbacks=callbacks
)
y_pred = lgbm.predict(X_test)
y_proba = lgbm.predict_proba(X_test)[:, 1]
print(f"\nLightGBM — Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"LightGBM — ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
print(f"Best iteration: {lgbm.best_iteration_}")
# ── Speed Benchmark: LightGBM vs XGBoost vs sklearn GBM ─────────
import time
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
from sklearn.datasets import make_classification
X_large, y_large = make_classification(n_samples=50000, n_features=30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_large, y_large, test_size=0.2, random_state=42)
models = {
    'sklearn GBM': GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=42),
    'XGBoost': xgb.XGBClassifier(n_estimators=100, max_depth=4,
                                 eval_metric='logloss', n_jobs=-1, random_state=42),
    'LightGBM': LGBMClassifier(n_estimators=100, num_leaves=31, n_jobs=-1,
                               random_state=42, verbose=-1)
}
print("\nSpeed Benchmark (n=50,000, p=30):")
print(f"{'Model':15} {'Train Time':12} {'Accuracy':10}")
print("-" * 40)
for name, model in models.items():
    start = time.time()
    model.fit(X_tr, y_tr)
    elapsed = time.time() - start
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f" {name:15} {elapsed:8.3f}s {acc:.4f}")
# LightGBM is often many times faster than sklearn's GradientBoostingClassifier at this size;
# the exact ratio depends on hardware, thread count, and settings, so read it off the timings above
# ── num_leaves — the most important LightGBM param ──────────────
print("\nnum_leaves tuning (controls model complexity):")
for nl in [7, 15, 31, 63, 127]:
    m = LGBMClassifier(n_estimators=100, num_leaves=nl, random_state=42, verbose=-1)
    cv = cross_val_score(m, X_train, y_train, cv=5, scoring='accuracy')
    print(f" num_leaves={nl:3d}: CV={cv.mean():.4f} ± {cv.std():.4f}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain LightGBM and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 57 — CatBoost
CatBoost — Native Categorical Handling & Ordered Boosting
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
CatBoost's Key Innovations
- Ordered Boosting: Uses a permutation-based approach to avoid target leakage during training — each object is predicted using only models trained on previous objects in a random order.
- Native Categorical Features: Automatically handles categorical features using statistics from the target (similar to target encoding) without manual preprocessing.
- Symmetric Trees: All nodes at the same depth share one split condition (same feature and threshold), which makes prediction faster and acts as a regularizer.
- No Feature Scaling Needed: Works with raw features without normalization.
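The ordered target statistic behind the first two points can be sketched directly: rows are visited in a fixed order, and each row's category is encoded as $\frac{\text{sum of earlier targets with the same category} + a \cdot p}{\text{count of earlier rows with the same category} + a}$, where $p$ is a prior and $a$ a smoothing weight. Because a row never sees its own target, the encoding does not leak it. The function name `ordered_target_stats`, the smoothing value, and the toy data are assumptions for illustration. CatBoost itself averages over several random permutations and adds further refinements.

```python
import numpy as np

def ordered_target_stats(categories, targets, order, prior, a=1.0):
    """Encode each row using only rows that come earlier in `order`."""
    sums, counts = {}, {}
    encoded = np.empty(len(categories))
    for i in order:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + a * prior) / (k + a)   # earlier rows only
        sums[c] = s + targets[i]                 # now reveal this row's target
        counts[c] = k + 1
    return encoded

categories = np.array(["S", "C", "S", "Q", "S", "C"])
targets = np.array([0, 1, 1, 0, 1, 1])
prior = targets.mean()                           # 4/6

# A fixed order keeps the example reproducible; CatBoost uses random permutations
order = np.arange(len(categories))
encoded = ordered_target_stats(categories, targets, order, prior)

for c, t, e in zip(categories, targets, encoded):
    print(f"category={c}  target={t}  encoded={e:.4f}")
```

The first occurrence of every category receives the prior, and later occurrences move toward the running mean of that category's earlier targets.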
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from catboost import CatBoostClassifier, Pool, cv as catboost_cv
# ══════════════════════════════════════
# CATBOOST WITH CATEGORICAL FEATURES
# (No manual encoding needed!)
# ══════════════════════════════════════
df = pd.read_csv('titanic.csv')
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
cat_features = ['Sex', 'Embarked', 'Pclass']
X = df[features].copy()
y = df['Survived']
mask = y.notna()
X, y = X[mask], y[mask]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
# Fill missing values using training-set statistics (CatBoost can also handle NaN in numeric features natively)
train_medians = X_train.median(numeric_only=True)
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)
X_train['Embarked'] = X_train['Embarked'].fillna('S')
X_test['Embarked'] = X_test['Embarked'].fillna('S')
# CatBoost can receive category names as strings — no encoding needed!
cat_feature_indices = [X_train.columns.tolist().index(col) for col in cat_features]
cb = CatBoostClassifier(
iterations=500, # n_estimators
learning_rate=0.05,
depth=6, # tree depth (symmetric trees)
l2_leaf_reg=3.0, # L2 regularization
border_count=254, # number of bins for numeric features
bagging_temperature=1.0, # controls Bayesian bootstrap
random_strength=1, # adds randomness to split selection
cat_features=cat_feature_indices, # indices of categorical features
eval_metric='AUC',
random_seed=42,
verbose=100,
early_stopping_rounds=50
)
cb.fit(
X_train, y_train,
eval_set=(X_test, y_test)
)
y_pred = cb.predict(X_test)
y_proba = cb.predict_proba(X_test)[:, 1]
print(f"\nCatBoost — Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"CatBoost — ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
# ── Feature Importance ───────────────────────────────────────────
import matplotlib.pyplot as plt
fi = pd.Series(cb.feature_importances_, index=X_train.columns)
fi.sort_values(ascending=False).plot(kind='barh', color='#d4af37', alpha=0.8, figsize=(8,4))
plt.title('CatBoost Feature Importances')
plt.tight_layout()
plt.show()
# ── SHAP Values for model explanation ───────────────────────────
try:
    import shap
    explainer = shap.TreeExplainer(cb)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test, plot_type='bar')
    shap.summary_plot(shap_values, X_test)
except ImportError:
    print("Install shap: pip install shap")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain CatBoost and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 58 — Stacking
Stacking & Blending — Meta-Learner Ensembles
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Stacking Architecture
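In outline, stacking has two levels: base learners (level 0) produce out-of-fold predictions on the training data, and a meta-learner (level 1) is trained on those predictions instead of the raw features. The sketch below builds that pipeline by hand with `cross_val_predict`. The choice of two base models, the seeds, and the variable names are assumptions for illustration, and `StackingClassifier` does the equivalent work for you.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

base_models = [
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    RandomForestClassifier(n_estimators=100, random_state=0),
]

# Level 0: out-of-fold probabilities, so no row is predicted by a model that saw it
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on all training data to produce test-time meta-features
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for m in base_models
])

# Level 1: the meta-learner sees only the base models' predictions
meta = LogisticRegression().fit(oof, y_train)
stack_acc = meta.score(test_meta, y_test)
print(f"Meta-feature matrix shape: {oof.shape}")
print(f"Stacked test accuracy: {stack_acc:.4f}")
```

Training the meta-learner on in-sample predictions instead of out-of-fold ones would let it trust overfit base models, which is the leakage the cross-validation step avoids.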
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
# ══════════════════════════════════════
# SKLEARN StackingClassifier
# Uses cross-val predictions to avoid data leakage
# ══════════════════════════════════════
base_learners = [
('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, C=1.0))),
('rf', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)),
('svm', make_pipeline(StandardScaler(), SVC(probability=True, kernel='rbf', C=1.0))),
('knn', make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
('gbm', GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42))
]
meta_learner = LogisticRegression(C=1.0, max_iter=1000)
stacking = StackingClassifier(
estimators=base_learners,
final_estimator=meta_learner,
cv=5, # 5-fold CV for out-of-fold predictions
stack_method='predict_proba', # pass probabilities to meta-learner
n_jobs=-1,
passthrough=False # True = also pass original features to meta-learner
)
stacking.fit(X_train, y_train)
print("Stacking Ensemble:")
print(f" Accuracy: {accuracy_score(y_test, stacking.predict(X_test)):.4f}")
print(f" ROC-AUC: {roc_auc_score(y_test, stacking.predict_proba(X_test)[:,1]):.4f}")
# ── Compare base learners vs stack ──────────────────────────────
print("\nIndividual vs Stacked Performance:")
print(f"{'Model':30} {'CV Accuracy':15}")
print("-" * 48)
for name, model in base_learners:
    cv = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f" {name:30} {cv.mean():.4f} ± {cv.std():.4f}")
cv_stack = cross_val_score(stacking, X_train, y_train, cv=5, scoring='accuracy')
print(f" {'STACKING (all combined)':30} {cv_stack.mean():.4f} ± {cv_stack.std():.4f}")
# ══════════════════════════════════════
# BLENDING — Simpler than stacking
# Use a holdout set instead of cross-validation
# ══════════════════════════════════════
X_tr_bl, X_hold, y_tr_bl, y_hold = train_test_split(X_train, y_train,
test_size=0.2, random_state=42)
# Train base models on X_tr_bl (the pipelines in base_learners already scale where needed)
blend_predictions_hold = np.zeros((len(X_hold), len(base_learners)))
blend_predictions_test = np.zeros((len(X_test), len(base_learners)))
for i, (name, model) in enumerate(base_learners):
    model.fit(X_tr_bl, y_tr_bl)
    blend_predictions_hold[:, i] = model.predict_proba(X_hold)[:, 1]
    blend_predictions_test[:, i] = model.predict_proba(X_test)[:, 1]
# Train meta-learner on holdout predictions
meta_blend = LogisticRegression(C=1.0)
meta_blend.fit(blend_predictions_hold, y_hold)
y_blend_pred = meta_blend.predict(blend_predictions_test)
print(f"\nBlending Accuracy: {accuracy_score(y_test, y_blend_pred):.4f}")
print(f"Meta-learner weights: {meta_blend.coef_[0].round(3)}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain stacking & blending and when each applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Voting Classifiers — Hard and Soft Voting
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
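Hard voting counts predicted class labels, while soft voting averages predicted probabilities, so the two can disagree when one model is very confident and the others are only mildly leaning the other way. The tiny example below shows such a case. The three probabilities and the weights are made-up values for illustration.

```python
import numpy as np

# P(class = 1) from three hypothetical classifiers for a single sample
p1 = np.array([0.45, 0.45, 0.90])

# Hard voting: each model casts one vote for its predicted label
hard_votes = (p1 >= 0.5).astype(int)              # votes: 0, 0, 1
hard_pred = int(hard_votes.sum() > len(p1) / 2)   # majority says class 0

# Soft voting: average the probabilities, then threshold
soft_pred = int(p1.mean() >= 0.5)                 # mean is 0.60, so class 1

# Weighted soft voting: more trusted models count for more
weights = np.array([1, 2, 3])
weighted_pred = int(np.average(p1, weights=weights) >= 0.5)

print(f"hard={hard_pred}, soft={soft_pred}, weighted soft={weighted_pred}")
```

Soft voting only helps when the probabilities are meaningful, which is why it requires `predict_proba` from every estimator.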
import numpy as np
from sklearn.ensemble import VotingClassifier, VotingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
# ══════════════════════════════════════
# HARD VOTING — Majority class wins
# ══════════════════════════════════════
hard_vote = VotingClassifier(
estimators=[
('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('gbm', GradientBoostingClassifier(n_estimators=100, random_state=42)),
('svm', make_pipeline(StandardScaler(), SVC(kernel='rbf', probability=False)))
],
voting='hard'
)
hard_vote.fit(X_train, y_train)
print(f"Hard Voting Accuracy: {accuracy_score(y_test, hard_vote.predict(X_test)):.4f}")
# ══════════════════════════════════════
# SOFT VOTING — Average probabilities (often better when models are well calibrated)
# Requires probability estimates from all models
# ══════════════════════════════════════
soft_vote = VotingClassifier(
estimators=[
('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('gbm', GradientBoostingClassifier(n_estimators=100, random_state=42)),
('svm', make_pipeline(StandardScaler(), SVC(kernel='rbf', probability=True)))
],
voting='soft'
)
soft_vote.fit(X_train, y_train)
print(f"Soft Voting Accuracy: {accuracy_score(y_test, soft_vote.predict(X_test)):.4f}")
# ══════════════════════════════════════
# WEIGHTED SOFT VOTING
# Give more weight to stronger models
# ══════════════════════════════════════
weighted_vote = VotingClassifier(
estimators=[
('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('gbm', GradientBoostingClassifier(n_estimators=100, random_state=42)),
],
voting='soft',
weights=[1, 2, 3] # GBM gets 3x weight (assumed strongest here; confirm with CV)
)
weighted_vote.fit(X_train, y_train)
print(f"Weighted Soft Voting Accuracy: {accuracy_score(y_test, weighted_vote.predict(X_test)):.4f}")
# ── Find optimal weights via CV ──────────────────────────────────
from itertools import product
best_score = 0
best_weights = None
for w1, w2, w3 in product([1, 2, 3], repeat=3):
    wv = VotingClassifier(
        estimators=[
            ('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))),
            ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
            ('gbm', GradientBoostingClassifier(n_estimators=50, random_state=42)),
        ],
        voting='soft',
        weights=[w1, w2, w3]
    )
    score = cross_val_score(wv, X_train, y_train, cv=5, scoring='accuracy').mean()
    if score > best_score:
        best_score = score
        best_weights = [w1, w2, w3]
print(f"\nOptimal weights: {best_weights}, CV Score: {best_score:.4f}")
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain voting classifiers and when they apply.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
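To make the difference between hard, soft, and weighted soft voting in this section concrete, the sketch below computes all three by hand with NumPy. The probabilities are made-up numbers chosen for illustration, not outputs of the models above, and the 0.5 threshold assumes a binary problem.

```python
import numpy as np

# Hypothetical positive-class probabilities from three classifiers
# (columns) on four samples (rows). Illustrative values only.
proba_pos = np.array([
    [0.90, 0.40, 0.45],  # sample 0: one confident "yes", two weak "no"
    [0.20, 0.30, 0.10],  # sample 1: all models agree on "no"
    [0.55, 0.60, 0.05],  # sample 2: two weak "yes", one confident "no"
    [0.49, 0.48, 0.95],  # sample 3: two borderline "no", one confident "yes"
])

# Hard voting: each model casts one vote at threshold 0.5, majority wins.
votes = (proba_pos >= 0.5).astype(int)
hard_pred = (votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities, then threshold the mean.
soft_pred = (proba_pos.mean(axis=1) >= 0.5).astype(int)

# Weighted soft voting: same idea with a weighted average.
weights = np.array([1, 2, 3])
weighted_mean = np.average(proba_pos, axis=1, weights=weights)
weighted_pred = (weighted_mean >= 0.5).astype(int)

print("hard:    ", hard_pred)
print("soft:    ", soft_pred)
print("weighted:", weighted_pred)
```

Hard and soft voting disagree on samples 0, 2, and 3 because soft voting lets one confident model outweigh two borderline ones. That only helps when the confidence is trustworthy, which is why calibration of the base models matters.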
Algorithm Selection Framework — Which Algorithm for Which Problem?
Why this matters
Matching the algorithm to your data size, feature types, and deployment constraints directly affects model quality, debugging effort, and interview performance. Master this framework before moving to the next day.
The Master Decision Guide
| Scenario | First Try | If Still Needs Work | Avoid |
|---|---|---|---|
| Tabular data, <10k rows, classification | Logistic Regression + Random Forest | XGBoost, LightGBM | Deep Learning (overkill) |
| Tabular data, >100k rows | LightGBM, XGBoost | CatBoost, Stacking | KNN (too slow), SVM (scaling issues) |
| High-cardinality categoricals | CatBoost, LightGBM (native categorical support) | Target encoding + XGBoost | OneHot + LR (too many features) |
| High-dimensional sparse data (text) | Naive Bayes, Logistic Regression | Linear SVM | Random Forest (slow on sparse), KNN |
| Small dataset (<1k samples) | SVM (RBF), Logistic Regression | Decision Tree with pruning | Deep Learning (insufficient data) |
| Need probability estimates | Logistic Regression, Random Forest | Calibrated SVM, GBM | Hard-margin SVM, uncalibrated models |
| Need full interpretability | Logistic Regression, Decision Tree | XGBoost + SHAP values | Black-box ensembles in regulated domains |
| Fast inference needed | Logistic Regression, Decision Tree | Random Forest (parallel) | KNN (stores all data), SVM on large datasets |
| Noisy, many irrelevant features | Random Forest, Lasso | Feature selection + any model | KNN (cursed by irrelevant dims) |
| Regression, continuous target | Linear Regression, Ridge/Lasso | XGBoost/LightGBM regressor | Logistic Regression |
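The "Calibrated SVM" entry in the guide above can be sketched as follows. This is a minimal example, assuming a synthetic dataset and a linear SVM wrapped in scikit-learn's `CalibratedClassifierCV` with sigmoid (Platt) calibration. The dataset, `C`, and `cv=5` are illustrative choices rather than recommendations.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary classification data (assumption: stands in for your own data)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# LinearSVC only exposes decision_function, not predict_proba
svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))

# Wrap it: 5-fold CV fits the SVM and a sigmoid map from scores to probabilities
calibrated_svm = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
calibrated_svm.fit(X_train, y_train)

proba = calibrated_svm.predict_proba(X_test)
print(proba[:3])  # each row: [P(class 0), P(class 1)]
```

The same wrapper can be applied to other models whose raw scores are poorly calibrated. Whether sigmoid or isotonic calibration works better depends on the data and the amount of it.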
Algorithm Complexity Summary
| Algorithm | Train Time | Predict Time (per sample) | Memory | Scales to Big Data |
|---|---|---|---|---|
| Linear/Logistic Regression | O(mn) per pass, fast | O(n) | O(n) | ✅ Very well (SGD) |
| KNN | O(1), lazy (only stores data) | O(mn), slow | O(mn) | ❌ Very poorly |
| Naive Bayes | O(mn), fast | O(nk) | O(nk) | ✅ Very well |
| SVM (kernel) | O(m²) to O(m³), slow | O(m_sv × n) | O(m_sv × n) | ❌ Poor (>100k) |
| Decision Tree | O(mn log m) | O(depth) | O(m) worst case | ⚠️ Moderate |
| Random Forest | O(B × mn log m) | O(B × depth) | O(B × m) worst case | ✅ Good (parallel) |
| XGBoost/LightGBM | O(B × mn) | O(B × depth) | O(B × leaves) | ✅ Excellent |
m = samples, n = features, k = classes, B = trees, m_sv = support vectors
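The KNN row in the complexity table (near-free training, expensive prediction) can be checked empirically. The sketch below is a rough timing comparison on synthetic data. The dataset size and the choice of `algorithm="brute"` are assumptions made so that the lazy-learner behaviour is easy to see, and absolute timings will vary by machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 15,000 rows to fit on, 5,000 rows to query
X, y = make_classification(n_samples=20000, n_features=30, random_state=0)
X_fit, y_fit, X_query = X[:15000], y[:15000], X[15000:]


def time_fit_predict(model):
    """Return (fit_seconds, predict_seconds) for one model."""
    t0 = time.perf_counter()
    model.fit(X_fit, y_fit)
    t1 = time.perf_counter()
    model.predict(X_query)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1


timings = {
    # Brute-force KNN: fit only stores the data, predict scans all of it
    "KNN (brute force)": time_fit_predict(
        KNeighborsClassifier(n_neighbors=7, algorithm="brute")
    ),
    # Logistic regression: the work happens in fit, predict is a dot product
    "Logistic Regression": time_fit_predict(LogisticRegression(max_iter=1000)),
}

for name, (fit_s, pred_s) in timings.items():
    print(f"{name:22} fit: {fit_s:.4f}s   predict: {pred_s:.4f}s")
```

On most machines the KNN fit time should be close to zero while its predict time dominates, and the pattern is reversed for logistic regression. Treat the numbers as indicative only.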
- Always start with a baseline: DummyClassifier or simple heuristic. Know your floor.
- Quickly try Logistic Regression (with scaling) — often surprises you with strong performance.
- Try Random Forest — usually strong, requires no scaling, provides feature importances.
- If you need top performance: XGBoost or LightGBM with proper hyperparameter tuning.
- Reserve stacking for the final 0.5-2% gain in competitive settings (Kaggle).
- Use SHAP values to explain any model's predictions for stakeholders.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
import time
# ══════════════════════════════════════
# COMPLETE ALGORITHM COMPARISON FRAMEWORK
# ══════════════════════════════════════
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15,
                           n_redundant=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale-sensitive models (LR, KNN, SVM) are wrapped in a pipeline with StandardScaler
models = {
    'Baseline (Most Frequent)': DummyClassifier(strategy='most_frequent'),
    'Naive Bayes': GaussianNB(),
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, C=1)),
    'Decision Tree (d=5)': DecisionTreeClassifier(max_depth=5, random_state=42),
    'KNN (k=7)': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)),
    'SVM (RBF)': make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, probability=True)),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42),
}
# Optional boosters: added only if the package is installed
try:
    from xgboost import XGBClassifier
    models['XGBoost'] = XGBClassifier(n_estimators=100, max_depth=4,
                                      eval_metric='logloss', random_state=42, n_jobs=-1)
except ImportError:
    pass
try:
    from lightgbm import LGBMClassifier
    models['LightGBM'] = LGBMClassifier(n_estimators=100, num_leaves=31, random_state=42,
                                        n_jobs=-1, verbose=-1)
except ImportError:
    pass
print("="*70)
print(f"{'Algorithm':35} {'CV Acc':10} {'Test Acc':10} {'CV Time':12}")
print("="*70)
results = {}
for name, model in models.items():
    # Time the 5-fold cross-validation (five fits plus scoring), not a single fit
    start = time.time()
    cv_score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)
    elapsed_cv = time.time() - start
    # Refit on the full training set, then score once on the held-out test set
    model.fit(X_train, y_train)
    test_score = accuracy_score(y_test, model.predict(X_test))
    results[name] = {'cv': cv_score.mean(), 'test': test_score}
    print(f"{name:35} {cv_score.mean():<10.4f} {test_score:<10.4f} {elapsed_cv:.2f}s")
# ── Rank models ──────────────────────────────────────────────────
# Ranked by test accuracy for reporting; choose the model to deploy by CV score
best = sorted(results.items(), key=lambda x: x[1]['test'], reverse=True)
print(f"\n{'='*40}")
print("RANKING (by Test Accuracy):")
for rank, (name, scores) in enumerate(best, 1):
    print(f"  {rank}. {name}: {scores['test']:.4f}")
Module 4 Complete: You're Now a Supervised ML Expert!
You've covered every major supervised learning algorithm from first principles. The next step is learning how to properly evaluate these models (Module 6: Evaluation & Tuning) and discovering structure in data without labels (Module 5: Unsupervised Learning). The algorithms you learned here, especially XGBoost and LightGBM, are staples of Kaggle competitions on tabular data and are widely used in real-world ML projects.
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain the algorithm selection framework and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
