100 Days of ML · Module 4 (60)

Module 4: Supervised Learning Algorithms

100 Days of ML Module 4 — Master Supervised Learning: Linear/Logistic Regression, Decision Trees, KNN, Naive Bayes, SVM, Random Forests, XGBoost, LightGBM, CatBoost, Stacking, and Ensemble Methods.

⏱ 120 Min Read • 60 • Updated: May 2026

This is the core algorithms module — the heart of classical Machine Learning. You'll learn every major supervised algorithm from first principles: the mathematical intuition, how they work, their strengths and weaknesses, key hyperparameters, and practical sklearn implementation. By the end you'll know when to use each algorithm and why.

Linear Regression — The Foundation of Supervised Learning

Why this matters

Linear regression is the foundation of supervised learning: cost functions, gradients, assumptions, and residuals appear in every advanced algorithm interview.

Intuition

Linear regression fits a straight line (or hyperplane) through data points that minimizes the total squared distance between predictions and actual values. It models the relationship between input features $X$ and a continuous target $y$ as a linear function.

Worked example — One feature

Suppose the fitted model is ŷ = 2 + 3x. For x = 4, ŷ = 2 + 3·4 = 14. The MSE for predictions [14, 8] against actual values [15, 7] is ((14−15)² + (8−7)²) / 2 = (1 + 1) / 2 = 1.0. Because errors are squared, a single outlier can dominate the MSE, which is why robust metrics (MAE) and residual checks matter.
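The arithmetic in this worked example can be checked in a few lines. This is a minimal sketch using only NumPy; the third data point (an actual value of 30 against a prediction of 10) is a made-up outlier, added only to illustrate how squaring inflates MSE relative to MAE.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared errors."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Values from the worked example
y_true = [15, 7]
y_pred = [14, 8]
print("no outlier  :", mse(y_true, y_pred), mae(y_true, y_pred))

# Hypothetical outlier (not from the text): actual 30, predicted 10
y_true_out = y_true + [30]
y_pred_out = y_pred + [10]
print("with outlier:", mse(y_true_out, y_pred_out), mae(y_true_out, y_pred_out))
```

The single outlier has an absolute error of 20 but a squared error of 400, so it moves MSE far more than MAE.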

Linear Regression Model:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \boldsymbol{\theta}^T \mathbf{x}$$

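$\theta_0$ = bias/intercept, $\theta_1 \ldots \theta_n$ = feature weights (slope parameters). The vector form $\boldsymbol{\theta}^T \mathbf{x}$ assumes a constant $x_0 = 1$ is prepended to $\mathbf{x}$ so that $\theta_0$ is included.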

Cost Function — Mean Squared Error (MSE)

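We want to find parameters $\boldsymbol{\theta}$ that minimize the Mean Squared Error, the average of squared differences between predictions and actuals. The extra factor of $\frac{1}{2}$ below is a common convention that cancels when differentiating; it does not change the minimizing $\boldsymbol{\theta}$: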

$$J(\boldsymbol{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2 = \frac{1}{2m} \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2$$

Closed-Form Solution (OLS — Ordinary Least Squares)

For linear regression, there is an exact analytical solution — the Normal Equation:

$$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

Solving this costs roughly $O(n^3)$ in the number of features $n$ because of the matrix inversion, which becomes impractical for very large feature sets ($n > 10{,}000$). In those cases, use gradient descent instead.
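For readers who want the intermediate step, the Normal Equation follows from setting the gradient of the cost $J(\boldsymbol{\theta})$ to zero. The derivation below assumes $\mathbf{X}^T\mathbf{X}$ is invertible (no perfectly collinear features):

$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{m}\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta} - \mathbf{y}) = \mathbf{0} \;\Longrightarrow\; \mathbf{X}^T\mathbf{X}\,\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y} \;\Longrightarrow\; \hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$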

Assumptions of Linear Regression (LINE)

L — Linearity: The relationship between X and y is linear.
I — Independence: Observations are independent of each other.
N — Normality: Residuals ($y - \hat{y}$) are normally distributed.
E — Equal Variance (Homoscedasticity): Residual variance is constant across all fitted values.

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

# ── Load Dataset ────────────────────────────────────────────────
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target  # median house value in $100k

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Train Linear Regression ──────────────────────────────────────
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# ── Evaluation Metrics ───────────────────────────────────────────
mse  = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae  = mean_absolute_error(y_test, y_pred)
r2   = r2_score(y_test, y_pred)

print("Linear Regression Results:")
print(f"  MSE:  {mse:.4f}")
print(f"  RMSE: {rmse:.4f}  (in same units as target)")
print(f"  MAE:  {mae:.4f}  (more robust to outliers)")
print(f"  R²:   {r2:.4f}  (1.0 = perfect, 0 = baseline mean model)")

# ── Model Coefficients ───────────────────────────────────────────
coef_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Coefficient': lr.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print(f"\nIntercept: {lr.intercept_:.4f}")
print("\nFeature Coefficients (sorted by absolute magnitude):")
print(coef_df.to_string(index=False))

# ── OLS Normal Equation (from scratch) ──────────────────────────
def normal_equation(X, y):
    """Closed-form OLS solution"""
    X_b = np.c_[np.ones(X.shape[0]), X]  # add bias column
    theta = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
    return theta[0], theta[1:]  # intercept, coefficients

intercept_ols, coefs_ols = normal_equation(X_train.values, y_train.values)
print(f"\nNormal Equation intercept: {intercept_ols:.4f}")
print(f"sklearn intercept:          {lr.intercept_:.4f}")

# ── Residual Analysis ────────────────────────────────────────────
residuals = y_test - y_pred

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Predicted vs Actual
axes[0].scatter(y_test, y_pred, alpha=0.4, color='#d4af37', s=20)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Values')
axes[0].set_ylabel('Predicted Values')
axes[0].set_title(f'Predicted vs Actual\nR² = {r2:.4f}')
axes[0].legend()

# Plot 2: Residuals vs Fitted
axes[1].scatter(y_pred, residuals, alpha=0.4, color='#3a7bd5', s=20)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[1].set_xlabel('Fitted Values')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residuals vs Fitted\n(Check: random scatter = good)')

# Plot 3: Residual distribution
axes[2].hist(residuals, bins=40, color='#d4af37', alpha=0.7, edgecolor='black')
axes[2].set_xlabel('Residuals')
axes[2].set_ylabel('Count')
axes[2].set_title('Residual Distribution\n(Should be normal, mean≈0)')
axes[2].axvline(x=0, color='red', linestyle='--')

plt.suptitle('Linear Regression Diagnostics', fontweight='bold')
plt.tight_layout()
plt.show()

✅ Pros

Highly interpretable (coefficients = feature impact)
Extremely fast to train and predict
No hyperparameters (OLS)
Works well when relationship is truly linear

❌ Cons

Assumes linear relationship — fails on complex data
Sensitive to outliers (squares errors)
Requires feature scaling for gradient descent
Assumptions (normality, homoscedasticity) rarely perfectly met

Common mistakes

Using linear regression on clearly non-linear targets without transforms or other models.
Ignoring residual plots — heteroscedasticity and non-linearity hide in aggregates.
Applying OLS when features are collinear without regularization (Ridge/Lasso).

Interview checkpoints

Q: Interpret a positive coefficient on `MedInc`. A: Holding other features fixed, a one-unit increase in MedInc is associated with an increase of β in the predicted target.
Q: R² vs RMSE? A: R² = variance explained (unitless); RMSE = error in target units, sensitive to outliers.
Q: When does normal equation fail? A: Singular XᵀX (collinearity) or very large p — use pseudoinverse or gradient methods.

Practice

Basic: Fit LinearRegression on California housing; report RMSE and R².
Intermediate: Plot residuals vs fitted; list two assumption violations if patterns appear.
Advanced: Implement OLS with np.linalg.pinv; compare coefficients to sklearn within 1e-6.

Recap

ŷ = θᵀx; minimize MSE for OLS solution.
Check LINE assumptions via residuals.
Compare coefficient magnitudes across features only when the features are on comparable scales (e.g., standardized).

Next: Day 42 — Gradient Descent

Gradient Descent — Batch, Stochastic, and Mini-Batch

Why this matters

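Gradient-based optimization trains most models that have no closed-form solution, including logistic regression and neural networks. Knowing how batch size and learning rate affect convergence is essential for debugging training.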

Worked example — Learning rate

On a simple quadratic loss, an η that is too large oscillates or diverges, while an η that is too small converges slowly. Mini-batch (32–256 samples) is the practical default: its updates are less noisy than SGD's, and it makes many parameter updates per epoch instead of the single update of full-batch GD.
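A minimal sketch of this learning-rate behavior, assuming the simplest possible quadratic loss J(θ) = θ² (gradient 2θ). The η values, starting point, and step count are illustrative choices, not taken from the text.

```python
def run_gd(eta, theta0=5.0, n_steps=20):
    """Gradient descent on J(theta) = theta**2, whose gradient is 2*theta."""
    theta = theta0
    for _ in range(n_steps):
        theta -= eta * 2 * theta   # theta <- theta - eta * dJ/dtheta
    return theta

for eta in (0.01, 0.4, 1.1):
    print(f"eta={eta}: theta after 20 steps = {run_gd(eta):.4g}")
```

Each step multiplies θ by (1 − 2η), so on this loss the iterates shrink only when 0 < η < 1. With η = 1.1 the factor is −1.2: the sign flips every step and the magnitude grows.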

The Core Idea

Gradient descent is an iterative optimization algorithm that finds the minimum of a cost function by repeatedly taking steps in the direction of steepest descent (negative gradient). Think of it as blindly walking down a hilly landscape in the direction that goes downhill most steeply.

Gradient Descent Update Rule:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$$

$\eta$ = learning rate (step size), $\nabla J$ = gradient of cost function with respect to $\boldsymbol{\theta}$

For linear regression with MSE cost, the gradient is: $$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) \cdot x_j^{(i)}$$
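One way to gain confidence in this gradient formula is a finite-difference check. The sketch below assumes the half-MSE cost $J = \frac{1}{2m}\|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2$ defined earlier; the synthetic data and the helper names (`cost`, `analytic_grad`, `numeric_grad`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = np.c_[np.ones(m), rng.normal(size=(m, n))]   # bias column + n features
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

def cost(theta):
    """J(theta) = 1/(2m) * ||X theta - y||^2"""
    r = X @ theta - y
    return (r @ r) / (2 * m)

def analytic_grad(theta):
    """dJ/dtheta_j = (1/m) * sum_i (y_hat_i - y_i) * x_ij"""
    return X.T @ (X @ theta - y) / m

def numeric_grad(theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        g[j] = (cost(theta + step) - cost(theta - step)) / (2 * eps)
    return g

print("max abs difference:", np.max(np.abs(analytic_grad(theta) - numeric_grad(theta))))
```

The from-scratch code in this section minimizes plain MSE (no factor of 1/2), so its gradient carries 2/m instead of 1/m. The two differ only by a constant that the learning rate absorbs.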

Variant	Batch Size	Update Frequency	Noise Level	Best For
Batch GD	All $m$ samples	Once per epoch	Low (smooth)	Small datasets, convex problems
Stochastic GD (SGD)	1 sample	Every sample	High (noisy)	Online learning, very large datasets
Mini-Batch GD	32–512 samples	Every batch	Medium	Default for deep learning; best balance

Code Example

import numpy as np
import matplotlib.pyplot as plt

# ══════════════════════════════════════
# GRADIENT DESCENT FROM SCRATCH
# ══════════════════════════════════════
np.random.seed(42)
m = 1000
X_gd = np.random.randn(m, 1)
y_gd = 3 + 2 * X_gd + np.random.randn(m, 1) * 0.5   # y = 3 + 2x + noise
X_b = np.c_[np.ones(m), X_gd]   # add bias column

def batch_gradient_descent(X, y, eta=0.01, n_epochs=1000):
    m = len(y)
    theta = np.random.randn(X.shape[1], 1)
    cost_history = []
    
    for epoch in range(n_epochs):
        y_pred = X @ theta
        residuals = y_pred - y
        gradients = (2/m) * X.T @ residuals
        theta -= eta * gradients
        
        cost = np.mean(residuals**2)
        cost_history.append(cost)
    
    return theta, cost_history

def stochastic_gradient_descent(X, y, eta=0.01, n_epochs=50):
    m = len(y)
    theta = np.random.randn(X.shape[1], 1)
    cost_history = []
    
    for epoch in range(n_epochs):
        epoch_cost = 0
        indices = np.random.permutation(m)   # shuffle each epoch
        
        for i in indices:
            xi = X[i:i+1]
            yi = y[i:i+1]
            y_pred = xi @ theta
            gradient = 2 * xi.T @ (y_pred - yi)
            theta -= eta * gradient
            epoch_cost += ((y_pred - yi) ** 2).item()  # accumulate a plain float, not a 1x1 array
        
        cost_history.append(epoch_cost / m)
    
    return theta, cost_history

def mini_batch_gradient_descent(X, y, eta=0.01, n_epochs=100, batch_size=32):
    m = len(y)
    theta = np.random.randn(X.shape[1], 1)
    cost_history = []
    
    for epoch in range(n_epochs):
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        epoch_cost = 0
        
        for i in range(0, m, batch_size):
            xi = X_shuffled[i:i+batch_size]
            yi = y_shuffled[i:i+batch_size]
            y_pred = xi @ theta
            gradients = (2/len(yi)) * xi.T @ (y_pred - yi)
            theta -= eta * gradients
            epoch_cost += np.mean((y_pred - yi)**2)
        
        n_batches = int(np.ceil(m / batch_size))  # counts the final partial batch too
        cost_history.append(epoch_cost / n_batches)
    
    return theta, cost_history

# Run all three
theta_batch, cost_batch   = batch_gradient_descent(X_b, y_gd, eta=0.01, n_epochs=200)
theta_sgd,   cost_sgd     = stochastic_gradient_descent(X_b, y_gd, eta=0.01, n_epochs=200)
theta_mini,  cost_mini    = mini_batch_gradient_descent(X_b, y_gd, eta=0.01, n_epochs=200, batch_size=32)

print("Batch GD   — True: [3, 2], Learned:", theta_batch.T.round(3))
print("SGD        — True: [3, 2], Learned:", theta_sgd.T.round(3))
print("Mini-Batch — True: [3, 2], Learned:", theta_mini.T.round(3))

# ── Effect of Learning Rate ──────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Convergence curves
axes[0].plot(cost_batch, label='Batch GD',    color='#d4af37', linewidth=2)
axes[0].plot(cost_sgd,   label='SGD',         color='#e74c3c', linewidth=1.5, alpha=0.7)
axes[0].plot(cost_mini,  label='Mini-Batch',  color='#2ecc71', linewidth=1.5)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('MSE Loss')
axes[0].set_title('Convergence Comparison\n(Note: SGD is noisy, Mini-Batch balances both)')
axes[0].legend()
axes[0].set_yscale('log')

# Learning rate comparison
for eta, color in [(0.001, '#e74c3c'), (0.01, '#d4af37'), (0.1, '#2ecc71')]:
    _, costs = batch_gradient_descent(X_b, y_gd, eta=eta, n_epochs=200)
    axes[1].plot(costs, label=f'η = {eta}', color=color, linewidth=2)

axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('MSE Loss')
axes[1].set_title('Effect of Learning Rate\n(Too small=slow, Too large=diverges)')
axes[1].legend()
axes[1].set_ylim(0, 5)

plt.suptitle('Gradient Descent Variants Comparison', fontweight='bold')
plt.tight_layout()
plt.show()

Common mistakes

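Running gradient descent on unscaled features: very different feature scales stretch the loss surface and slow convergence.
Picking a learning rate once and never checking the loss curve: too large oscillates or diverges, too small converges slowly.
Forgetting to shuffle between epochs in SGD or mini-batch GD, so every epoch sees the samples in the same order.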

Interview checkpoints

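Q: Batch vs stochastic vs mini-batch? A: Batch uses all m samples per update (smooth, one update per epoch); SGD uses one sample (noisy, many updates); mini-batch uses 32–512 samples and balances the two.
Q: What happens if the learning rate is too large? A: Updates overshoot the minimum, so the loss oscillates or diverges; lower η and re-check the loss curve.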

Practice

Basic: Explain the update rule θ ← θ − η∇J in plain language using the downhill-walk analogy.
Intermediate: Run the batch GD code above with η = 0.001, 0.01, and 0.1 and compare the loss curves.
Advanced: Compare batch, SGD, and mini-batch on the same synthetic data and document the tradeoffs in noise and updates per epoch.

Recap

Gradient descent repeatedly steps against the gradient: θ ← θ − η∇J.
The learning rate η controls step size: too large diverges, too small is slow.
Mini-batch GD is the practical default between smooth batch GD and noisy SGD.

Next: Day 43 — Ridge & Lasso

Regularization — Ridge (L2), Lasso (L1), and ElasticNet

Why this matters

Regularization is the standard remedy for overfitting in linear models. Knowing when to choose Ridge, Lasso, or ElasticNet, and how to tune α, matters in both practice and interviews.

The Overfitting Problem

When a model is too complex (too many parameters) relative to the training data, it memorizes noise instead of learning the true pattern — this is overfitting. Regularization adds a penalty term to the cost function that discourages large parameter values, forcing the model to learn simpler, more generalizable patterns.

Ridge (L2): Penalize sum of squared coefficients

$$J_{Ridge}(\boldsymbol{\theta}) = MSE(\boldsymbol{\theta}) + \alpha \sum_{j=1}^{n} \theta_j^2$$

Lasso (L1): Penalize sum of absolute coefficients

$$J_{Lasso}(\boldsymbol{\theta}) = MSE(\boldsymbol{\theta}) + \alpha \sum_{j=1}^{n} |\theta_j|$$

ElasticNet: Combination of L1 and L2

$$J_{EN}(\boldsymbol{\theta}) = MSE(\boldsymbol{\theta}) + \alpha \cdot r \sum_{j=1}^{n} |\theta_j| + \alpha \cdot \frac{1-r}{2} \sum_{j=1}^{n} \theta_j^2$$
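For Ridge, the penalized problem still has a closed-form solution. The sketch below assumes scikit-learn's convention, in which `Ridge` minimizes the sum (not the mean) of squared errors plus $\alpha\sum_j \theta_j^2$. It also drops the intercept (`fit_intercept=False`) so the comparison with the formula is exact. The synthetic data and `true_theta` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
true_theta = np.array([1.5, 0.0, -2.0, 0.5, 0.0])   # hypothetical ground truth
y = X @ true_theta + rng.normal(scale=0.1, size=200)

alpha = 10.0

# Closed form: theta = (X^T X + alpha * I)^(-1) X^T y
theta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Same problem solved by scikit-learn (no intercept, to match the formula)
theta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print("closed form:", np.round(theta_closed, 4))
print("sklearn    :", np.round(theta_sklearn, 4))
```

For any α > 0 the matrix $\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I}$ is invertible, which is one reason Ridge stays well-behaved with collinear features.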

Why Lasso Creates Sparsity (L1 ≠ L2 Behavior)

The key geometric insight: the L1 penalty (diamond shape in 2D) has corners at the axes. When the MSE loss contours touch the constraint region, they're likely to touch a corner — which lies exactly on one axis, meaning all other coefficients are exactly zero. L2's smooth circular constraint rarely yields exact zeros.
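The same contrast shows up algebraically in one dimension. As a simplified sketch (a single coefficient whose unregularized estimate is `z`, an assumption made for illustration), minimizing $\frac{1}{2}(\theta - z)^2 + \alpha|\theta|$ gives the soft-thresholding rule, while minimizing $\frac{1}{2}(\theta - z)^2 + \frac{\alpha}{2}\theta^2$ gives proportional shrinkage.

```python
import numpy as np

def lasso_1d(z, alpha):
    """Soft-thresholding: argmin of 0.5*(theta - z)**2 + alpha*|theta|."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def ridge_1d(z, alpha):
    """Proportional shrinkage: argmin of 0.5*(theta - z)**2 + 0.5*alpha*theta**2."""
    return z / (1.0 + alpha)

z = np.array([3.0, 0.8, -0.3, -2.0])   # hypothetical unregularized estimates
alpha = 1.0

lasso_est = lasso_1d(z, alpha)
ridge_est = ridge_1d(z, alpha)
print("lasso:", lasso_est)
print("ridge:", ridge_est)
print("exact zeros -> lasso:", int(np.sum(lasso_est == 0)), "ridge:", int(np.sum(ridge_est == 0)))
```

Any estimate with |z| ≤ α is set exactly to zero by the L1 rule, whereas the L2 rule only rescales it.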

Property	Ridge (L2)	Lasso (L1)	ElasticNet
Coefficient shrinkage	Shrinks toward zero but never exactly zero	Can drive coefficients to exactly zero (sparse)	Some zero, some shrunk
Feature selection	No (keeps all features, just small)	Yes (automatic feature selection)	Yes (partial)
Best for	Many small effects; correlated features	Few important features; high-dimensional data	Both large+small effects
Hyperparameter	$\alpha$ (strength of penalty)	$\alpha$ (strength of penalty)	$\alpha$ + $r$ (L1 ratio)
Correlated features	Distributes weight across correlated group	Arbitrarily picks one from correlated group	Handles better than Lasso alone

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# ══════════════════════════════════════
# EFFECT OF ALPHA ON COEFFICIENTS
# ══════════════════════════════════════
alphas = np.logspace(-3, 3, 100)

ridge_coefs = []
lasso_coefs = []

for alpha in alphas:
    ridge_coefs.append(Ridge(alpha=alpha).fit(X_train_sc, y_train).coef_)
    lasso_coefs.append(Lasso(alpha=alpha, max_iter=10000).fit(X_train_sc, y_train).coef_)

ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for i, name in enumerate(data.feature_names):
    axes[0].plot(alphas, ridge_coefs[:, i], linewidth=1.5, label=name)
    axes[1].plot(alphas, lasso_coefs[:, i], linewidth=1.5, label=name)

for ax, title in zip(axes, ['Ridge (L2) — Coefficients shrink, never zero',
                              'Lasso (L1) — Coefficients → exact zero (sparse)']):
    ax.set_xscale('log')
    ax.axhline(y=0, color='white', linewidth=0.5, alpha=0.3)
    ax.set_xlabel('Alpha (regularization strength →)')
    ax.set_ylabel('Coefficient Value')
    ax.set_title(title)
    ax.legend(fontsize=7, loc='upper right')

plt.suptitle('Regularization Path — Effect of Alpha on Coefficients', fontweight='bold')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# CROSS-VALIDATED ALPHA SELECTION
# ══════════════════════════════════════
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 100), cv=5)
ridge_cv.fit(X_train_sc, y_train)
print(f"RidgeCV best alpha: {ridge_cv.alpha_:.4f}")

lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000)
lasso_cv.fit(X_train_sc, y_train)
print(f"LassoCV best alpha: {lasso_cv.alpha_:.6f}")

en_cv = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5, random_state=42)
en_cv.fit(X_train_sc, y_train)
print(f"ElasticNetCV best alpha: {en_cv.alpha_:.6f}, l1_ratio: {en_cv.l1_ratio_:.2f}")

# ══════════════════════════════════════
# PERFORMANCE COMPARISON
# ══════════════════════════════════════
from sklearn.linear_model import LinearRegression  # unregularized OLS baseline

print("\nPerformance Comparison:")
print(f"{'Model':20} {'R² Test':10} {'RMSE Test':10} {'Non-zero Coefs':15}")
print("-" * 60)
for name, model in [
    ('LinearRegression', LinearRegression()),
    ('Ridge (best α)',   Ridge(alpha=ridge_cv.alpha_)),
    ('Lasso (best α)',   Lasso(alpha=lasso_cv.alpha_, max_iter=10000)),
    ('ElasticNet',       ElasticNet(alpha=en_cv.alpha_, l1_ratio=en_cv.l1_ratio_, max_iter=10000))
]:
    model.fit(X_train_sc, y_train)
    y_pred = model.predict(X_test_sc)
    r2   = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    n_nonzero = np.sum(model.coef_ != 0)
    print(f"  {name:20} {r2:.4f}     {rmse:.4f}     {n_nonzero}")

Common mistakes

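Fitting Ridge or Lasso on unscaled features: the penalty depends on coefficient size, so features on different scales are penalized unevenly. Standardize first.
Using the default α instead of selecting it with cross-validation (RidgeCV, LassoCV, ElasticNetCV).
Reading Lasso's pick among correlated features as importance: it arbitrarily keeps one from a correlated group.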

Interview checkpoints

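Q: Ridge vs Lasso? A: Ridge shrinks all coefficients toward zero but keeps them nonzero; Lasso can set coefficients exactly to zero, giving automatic feature selection.
Q: Why does L1 produce sparsity? A: The L1 constraint region has corners on the axes, and the loss contours tend to touch it at a corner, where some coefficients are exactly zero.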

Practice

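Basic: Explain in plain language how the penalty term discourages large coefficients.
Intermediate: Fit Ridge and Lasso on California housing with standardized features; report which Lasso coefficients become zero as α grows.
Advanced: Compare RidgeCV, LassoCV, and ElasticNetCV on the same split and document R², RMSE, and the number of nonzero coefficients.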

Recap

Regularization adds a penalty on coefficient size to the cost to reduce overfitting.
Ridge (L2) shrinks; Lasso (L1) shrinks and selects; ElasticNet mixes both via the L1 ratio r.
Choose α by cross-validation, on standardized features.

Next: Day 44 — Logistic Regression

Logistic Regression — Classification with Probabilities

Why this matters

Logistic regression is the standard interpretable baseline for classification: it outputs probabilities, and its coefficients describe changes in log-odds.

Worked example — Odds and log-odds

If P(y=1) = 0.8, odds = 0.8/0.2 = 4, log-odds = ln(4) ≈ 1.39. Logistic regression models log-odds as a linear function of features; threshold 0.5 corresponds to log-odds = 0.
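The conversion can be verified numerically. This is a small sketch using only the standard library; `sigmoid` is a helper defined here for illustration.

```python
import math

def sigmoid(z):
    """Maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
odds = p / (1 - p)            # probability -> odds
log_odds = math.log(odds)     # odds -> log-odds (the quantity the model is linear in)

print(f"odds = {odds:.2f}, log-odds = {log_odds:.3f}")
print(f"sigmoid(log-odds) = {sigmoid(log_odds):.3f}")   # recovers p
print(f"sigmoid(0) = {sigmoid(0.0):.1f}")               # log-odds 0 <-> probability 0.5
```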

From Linear to Logistic

Logistic regression uses the sigmoid function to squeeze the linear combination of features into a probability between 0 and 1:

Sigmoid (Logistic) Function:

$$\sigma(z) = \frac{1}{1 + e^{-z}} \quad \text{where } z = \boldsymbol{\theta}^T \mathbf{x}$$

Prediction Probability:

$$\hat{p} = P(y=1 | \mathbf{x}) = \sigma(\boldsymbol{\theta}^T \mathbf{x})$$

Binary Cross-Entropy Loss (Log Loss):

$$J(\boldsymbol{\theta}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{p}^{(i)}) + (1-y^{(i)})\log(1-\hat{p}^{(i)})\right]$$
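To make the loss concrete, the sketch below evaluates the formula by hand on five made-up predictions and compares the result with `sklearn.metrics.log_loss`. The labels and probabilities are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_hat  = np.array([0.9, 0.2, 0.6, 0.3, 0.1])   # hypothetical predicted P(y=1)

# Direct translation of the binary cross-entropy formula
per_sample = -(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
bce = per_sample.mean()

print("per-sample losses:", per_sample.round(3))
print(f"manual BCE:        {bce:.4f}")
print(f"sklearn log_loss:  {log_loss(y_true, p_hat):.4f}")
```

The fourth sample (a positive predicted at 0.3) contributes the largest term, −ln 0.3 ≈ 1.20. Cross-entropy penalizes confident mistakes much more heavily than mild ones.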

The decision boundary is where $\hat{p} = 0.5$, i.e., where $\boldsymbol{\theta}^T \mathbf{x} = 0$. This is a hyperplane in feature space.

Multi-class Strategies

Strategy	How It Works	Num Classifiers	sklearn parameter
One-vs-Rest (OvR)	Train N binary classifiers: each class vs all others. Predict class with highest score.	N (one per class)	`OneVsRestClassifier(LogisticRegression())`; legacy `multi_class='ovr'`
One-vs-One (OvO)	Train a binary classifier for every pair of classes. Majority vote wins.	N(N-1)/2	SVM default for multi-class
Softmax (Multinomial)	Single model with K output nodes. Use softmax to normalize to probability over all classes.	1 (unified model)	Default for multi-class targets with `solver='lbfgs'`; legacy `multi_class='multinomial'`

Softmax Function (K classes):

$$P(y=k | \mathbf{x}) = \frac{e^{\boldsymbol{\theta}_k^T \mathbf{x}}}{\sum_{j=1}^{K} e^{\boldsymbol{\theta}_j^T \mathbf{x}}}$$
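A minimal NumPy sketch of the softmax formula follows. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the result unchanged; the scores are hypothetical values of $\boldsymbol{\theta}_k^T \mathbf{x}$.

```python
import numpy as np

def softmax(scores):
    """Softmax over the last axis; subtracting the max avoids overflow in exp."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical scores theta_k^T x for K = 3 classes
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print("probabilities:", probs.round(3), "sum:", probs.sum())

# Very large scores would overflow a naive exp(), but the shifted version is fine
big = softmax(np.array([1000.0, 1001.0, 1002.0]))
print("large scores :", big.round(3))
```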

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (accuracy_score, classification_report, 
                              confusion_matrix, roc_auc_score, roc_curve)
from sklearn.preprocessing import StandardScaler

# ── Binary Classification Example ────────────────────────────────
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# Train Logistic Regression
lr = LogisticRegression(
    C=1.0,           # Inverse regularization: smaller C = stronger regularization
    penalty='l2',    # L2 regularization (Ridge-style)
    solver='lbfgs',  # optimization algorithm
    max_iter=1000,
    random_state=42
)
lr.fit(X_train_sc, y_train)

# Predictions
y_pred  = lr.predict(X_test_sc)
y_proba = lr.predict_proba(X_test_sc)[:, 1]  # probability of class 1

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# ── Visualize Sigmoid and Decision Boundary ──────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Sigmoid function
z = np.linspace(-10, 10, 200)
sigma = 1 / (1 + np.exp(-z))
axes[0].plot(z, sigma, color='#d4af37', linewidth=2.5)
axes[0].axhline(y=0.5, color='red', linestyle='--', label='threshold=0.5')
axes[0].axvline(x=0, color='white', linestyle='--', alpha=0.3)
axes[0].fill_between(z, sigma, 0.5, where=(sigma > 0.5), alpha=0.15, color='#2ecc71')
axes[0].fill_between(z, sigma, 0.5, where=(sigma < 0.5), alpha=0.15, color='#e74c3c')
axes[0].set_xlabel('z = θᵀx')
axes[0].set_ylabel('σ(z) = P(y=1|x)')
axes[0].set_title('Sigmoid Function')
axes[0].legend()
axes[0].grid(True, alpha=0.2)

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)
axes[1].plot(fpr, tpr, color='#d4af37', linewidth=2, label=f'AUC = {auc:.4f}')
axes[1].plot([0,1], [0,1], 'r--', linewidth=1, label='Random (AUC=0.5)')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()
axes[1].fill_between(fpr, tpr, alpha=0.15, color='#d4af37')

# Confusion Matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[2],
            xticklabels=data.target_names, yticklabels=data.target_names)
axes[2].set_title('Confusion Matrix')
axes[2].set_xlabel('Predicted')
axes[2].set_ylabel('Actual')

plt.suptitle('Logistic Regression — Performance Analysis', fontweight='bold')
plt.tight_layout()
plt.show()

# ── Multi-class: Iris dataset ────────────────────────────────────
from sklearn.datasets import load_iris
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)
scaler2 = StandardScaler()

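# With a multi-class target, the lbfgs solver fits a multinomial (softmax) model by default;
# the explicit multi_class argument is deprecated in recent scikit-learn releases.
lr_multi = LogisticRegression(solver='lbfgs', max_iter=1000, C=1.0)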
lr_multi.fit(scaler2.fit_transform(X_tr), y_tr)
y_pred_multi = lr_multi.predict(scaler2.transform(X_te))
print(f"\nIris Multi-class Accuracy: {accuracy_score(y_te, y_pred_multi):.4f}")
print(classification_report(y_te, y_pred_multi, target_names=iris.target_names))

Common mistakes

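Treating the 0.5 threshold as fixed: it is a choice and can be moved when false positives and false negatives have different costs.
Forgetting that C is the inverse of regularization strength: smaller C means stronger regularization.
Skipping feature scaling, which slows solver convergence and makes the penalty treat features unevenly.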

Interview checkpoints

Q: What does a coefficient mean? A: Holding other features fixed, a one-unit increase in the feature changes the log-odds of y=1 by that coefficient.
Q: OvR vs multinomial? A: OvR trains one binary classifier per class and picks the highest score; multinomial fits a single softmax model over all classes.

Practice

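Basic: Convert P(y=1) = 0.8 to odds and log-odds by hand, then back to a probability with the sigmoid.
Intermediate: Fit LogisticRegression on the breast cancer dataset; report accuracy, ROC-AUC, and the confusion matrix.
Advanced: Compare C = 0.01, 1, and 100 on the same split and document the effect on coefficients and test metrics.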

Recap

Logistic regression models log-odds as a linear function and maps them to probabilities with the sigmoid.
It is trained by minimizing binary cross-entropy; the decision boundary θᵀx = 0 is a hyperplane.
Multi-class problems use OvR or a single softmax (multinomial) model.

Next: Day 45 — Decision Trees

Decision Trees — Gini Impurity, Entropy, and the CART Algorithm

Why this matters

Decision trees are the gateway to Random Forest and boosting — understanding splits, impurity, and overfitting is essential for tabular ML mastery.

How Decision Trees Work

A decision tree recursively partitions the feature space by asking binary questions: "Is feature X ≤ threshold?" At each node, the algorithm greedily selects the split that maximizes the purity of resulting child nodes.

Splitting Criteria

Gini Impurity (used by CART — sklearn's default):

$$G = 1 - \sum_{k=1}^{K} p_k^2$$

$p_k$ = proportion of class $k$ in node. $G=0$ means perfectly pure. $G=0.5$ for equal 50-50 split (max impurity for binary).

Entropy (used by ID3, C4.5 algorithms):

$$H = -\sum_{k=1}^{K} p_k \log_2(p_k)$$

Information Gain = $H(\text{parent}) - \frac{n_L}{n} H(\text{left}) - \frac{n_R}{n} H(\text{right})$
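The three quantities above are easy to compute by hand. The sketch below works from class counts; the 60/40 parent node matches the practice exercise later in this section, while the particular left/right split is a hypothetical example.

```python
import numpy as np

def gini(counts):
    """Gini impurity G = 1 - sum_k p_k^2 from class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy H = -sum_k p_k log2(p_k) from class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]                      # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """H(parent) minus the size-weighted entropy of the two children."""
    n, n_l, n_r = np.sum(parent), np.sum(left), np.sum(right)
    return entropy(parent) - (n_l / n) * entropy(left) - (n_r / n) * entropy(right)

parent = [60, 40]                     # 60/40 binary node
left, right = [50, 10], [10, 30]      # hypothetical split of that node

print(f"Gini(parent)     = {gini(parent):.4f}")
print(f"Entropy(parent)  = {entropy(parent):.4f}")
print(f"Information gain = {information_gain(parent, left, right):.4f}")
```

For the 60/40 node, G = 1 − (0.6² + 0.4²) = 0.48, just below the binary maximum of 0.5.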

In practice, Gini and Entropy produce very similar trees. Gini is slightly faster (no log computation). sklearn uses CART (Classification and Regression Trees) which uses Gini for classification and MSE for regression.

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# ══════════════════════════════════════
# TRAIN DECISION TREE
# ══════════════════════════════════════
dt = DecisionTreeClassifier(
    criterion='gini',       # 'gini' or 'entropy'
    max_depth=4,            # maximum depth of tree (controls complexity)
    min_samples_split=5,    # minimum samples required to split a node
    min_samples_leaf=2,     # minimum samples required in a leaf
    random_state=42
)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# ── Text representation of the tree ─────────────────────────────
print("\nTree Rules (text format):")
print(export_text(dt, feature_names=iris.feature_names))

# ── Visualize the tree ───────────────────────────────────────────
fig, ax = plt.subplots(figsize=(20, 8))
plot_tree(
    dt,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,          # color nodes by class
    rounded=True,
    ax=ax,
    fontsize=9
)
ax.set_title('Decision Tree — Iris Dataset (max_depth=4)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# ── Feature Importances ──────────────────────────────────────────
importances = pd.Series(dt.feature_importances_, index=iris.feature_names)
importances.sort_values().plot(kind='barh', color='#d4af37', alpha=0.8, edgecolor='black')
plt.title('Decision Tree Feature Importances\n(Based on total Gini impurity reduction)')
plt.tight_layout()
plt.show()

# ── Decision Boundary Visualization (2D) ────────────────────────
X_2d = X[:, 2:]  # petal length and petal width (most discriminative)
dt_2d = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_2d.fit(X_2d, y)

x1_range = np.linspace(X_2d[:,0].min()-0.5, X_2d[:,0].max()+0.5, 200)
x2_range = np.linspace(X_2d[:,1].min()-0.5, X_2d[:,1].max()+0.5, 200)
xx1, xx2 = np.meshgrid(x1_range, x2_range)
Z = dt_2d.predict(np.c_[xx1.ravel(), xx2.ravel()]).reshape(xx1.shape)

fig, ax = plt.subplots(figsize=(8, 6))
ax.contourf(xx1, xx2, Z, alpha=0.3, cmap='Set3')
colors = ['#e74c3c', '#3a7bd5', '#2ecc71']
for i, (cls, col) in enumerate(zip(iris.target_names, colors)):
    mask = y == i
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], c=col, label=cls, s=50, edgecolors='black', linewidth=0.5)
ax.set_xlabel('Petal Length (cm)')
ax.set_ylabel('Petal Width (cm)')
ax.set_title('Decision Tree: Decision Boundaries (max_depth=3)\nBoundaries are always axis-aligned')
ax.legend()
plt.tight_layout()
plt.show()

Common mistakes

Unlimited tree depth on noisy data — memorization and terrible generalization.
Using Gini vs entropy as if they always pick different trees (usually similar splits).
Expecting stable predictions from a single deep tree (high variance).

Interview checkpoints

Q: Gini vs entropy? A: Both measure impurity; entropy penalizes uncertain distributions slightly more; results often similar.
Q: How to control overfitting in trees? A: max_depth, min_samples_leaf, min_samples_split, pruning, ensembles.
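Q: Why can trees handle mixed data types? A: Each split tests one feature at a time, using a numeric threshold or a categorical partition, so no feature scaling is needed. Note that sklearn's trees need categorical features to be numerically encoded first.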

Practice

Basic: Train DecisionTreeClassifier on iris; print depth and feature importances.
Intermediate: Plot validation accuracy vs max_depth; pick depth with best bias-variance tradeoff.
Advanced: Manually compute Gini for a binary node with 60/40 class counts.

Recap

Trees partition feature space with greedy impurity reduction.
Single trees overfit; ensembles (RF, boosting) fix variance.
Always tune depth and leaf size on validation data.

Next: Day 46 — Gini vs Entropy

Model Fitting: Underfitting (High Bias) vs. Overfitting (High Variance)

Decision Tree Hyperparameters — Controlling Overfitting

Why this matters

Depth and leaf-size limits decide whether a tree underfits or overfits, and the same parameters reappear in Random Forests and gradient boosting.

Unconstrained decision trees will grow until every leaf is pure — perfectly memorizing training data (overfit). Hyperparameters control tree complexity via pre-pruning (stop early) or post-pruning (grow full then prune).

Hyperparameter	Effect	Increase →	Decrease →	Typical Range
`max_depth`	Maximum depth of tree	More complex, more overfitting	Simpler, more underfitting	3–15
`min_samples_split`	Min samples to split a node	Simpler tree, less overfitting	More complex	2–20
`min_samples_leaf`	Min samples in leaf node	Smoother boundaries, less overfitting	More complex	1–10
`max_features`	Max features considered per split	Uses more features (expensive)	More randomness (used in Random Forest)	'sqrt', 'log2', None
`max_leaf_nodes`	Max number of leaf nodes	More complex	Simpler	None or 10–50
`ccp_alpha`	Minimal cost-complexity pruning	More pruning (simpler)	Less pruning	0–0.05
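As a quick illustration of the table, the sketch below fits the same tree under a few different constraints and reports the resulting size and accuracy. The synthetic dataset and the specific parameter values are assumptions chosen only for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

configs = {
    "unconstrained":       {},
    "max_depth=3":         {"max_depth": 3},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
    "max_leaf_nodes=10":   {"max_leaf_nodes": 10},
}

results = {}
for name, params in configs.items():
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X_tr, y_tr)
    results[name] = {
        "depth": tree.get_depth(),
        "leaves": tree.get_n_leaves(),
        "train_acc": tree.score(X_tr, y_tr),
        "test_acc": tree.score(X_te, y_te),
    }
    r = results[name]
    print(f"{name:22} depth={r['depth']:2d} leaves={r['leaves']:3d} "
          f"train={r['train_acc']:.3f} test={r['test_acc']:.3f}")
```

The unconstrained tree fits the training set perfectly, while each constraint caps the tree's size. Which constraint generalizes best depends on the data, so the values should be tuned rather than copied.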

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, validation_curve, GridSearchCV
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ══════════════════════════════════════
# BIAS-VARIANCE TRADEOFF — max_depth
# ══════════════════════════════════════
train_scores = []
test_scores  = []
depths = range(1, 25)

for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, dt.predict(X_train)))
    test_scores.append(accuracy_score(y_test, dt.predict(X_test)))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(depths, train_scores, 'o-', color='#d4af37', label='Train Accuracy', linewidth=2)
axes[0].plot(depths, test_scores,  's-', color='#3a7bd5', label='Test Accuracy',  linewidth=2)
axes[0].axvline(x=depths[test_scores.index(max(test_scores))], 
                color='red', linestyle='--', alpha=0.7, label='Optimal depth')
axes[0].set_xlabel('max_depth')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Bias-Variance Tradeoff vs max_depth\n(Training keeps rising, Test peaks then falls)')
axes[0].legend()
axes[0].grid(True, alpha=0.2)

# ══════════════════════════════════════
# COST-COMPLEXITY PRUNING (ccp_alpha)
# ══════════════════════════════════════
dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train, y_train)

path = dt_full.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas

prune_train_scores = []
prune_test_scores  = []
for alpha in alphas:
    dt = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    dt.fit(X_train, y_train)
    prune_train_scores.append(accuracy_score(y_train, dt.predict(X_train)))
    prune_test_scores.append(accuracy_score(y_test, dt.predict(X_test)))

axes[1].plot(alphas, prune_train_scores, 'o-', color='#d4af37', label='Train', linewidth=2)
axes[1].plot(alphas, prune_test_scores,  's-', color='#3a7bd5', label='Test',  linewidth=2)
best_alpha = alphas[np.argmax(prune_test_scores)]
axes[1].axvline(x=best_alpha, color='red', linestyle='--', alpha=0.7,
                label=f'Best α={best_alpha:.5f}')
axes[1].set_xlabel('ccp_alpha (pruning strength)')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Post-Pruning via Cost-Complexity\n(ccp_alpha controls how much to prune)')
axes[1].legend()
axes[1].grid(True, alpha=0.2)

plt.suptitle('Decision Tree Overfitting Control', fontweight='bold')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# GRID SEARCH — Optimal Hyperparameters
# ══════════════════════════════════════
param_grid = {
    'max_depth':          [3, 4, 5, 6, 8, 10, None],
    'min_samples_split':  [2, 5, 10, 20],
    'min_samples_leaf':   [1, 2, 5, 10],
    'criterion':          ['gini', 'entropy']
}

gs = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                  cv=5, scoring='accuracy', n_jobs=-1, verbose=0)
gs.fit(X_train, y_train)

print(f"Best hyperparameters: {gs.best_params_}")
print(f"Best CV accuracy: {gs.best_score_:.4f}")
print(f"Test accuracy: {accuracy_score(y_test, gs.predict(X_test)):.4f}")

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain decision tree hyperparameters and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 47 — KNN Algorithm

K-Nearest Neighbors — Distance-Based Classification

Why this matters

K-Nearest Neighbors is the simplest distance-based learner, and it exposes three issues that recur in debugging and interviews: feature scaling, the choice of $k$, and the curse of dimensionality.

The Algorithm

KNN is a non-parametric, lazy learning algorithm — it doesn't build an explicit model during training. Instead, it memorizes all training examples and at prediction time finds the $k$ closest training points to the query and takes a majority vote (classification) or average (regression).
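The description above can be sketched in a few lines of plain Python. This is a minimal illustration, not sklearn's implementation: the function name `knn_predict` and the toy points are hypothetical, and ties are broken by whichever label `Counter` lists first.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" only stored the data; all the work happens at prediction time.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters in 2D
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, query=(2.0, 2.0), k=3))  # A
print(knn_predict(train_points, train_labels, query=(6.0, 6.5), k=3))  # B
```

Every prediction scans the whole training set, which is why prediction cost grows with the amount of stored data.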

Distance Metrics:

Euclidean: $d(p, q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}$ (L2 norm, default)

Manhattan: $d(p, q) = \sum_{i=1}^{n}|p_i - q_i|$ (L1 norm)

Minkowski: $d(p, q) = \left(\sum_{i=1}^{n}|p_i - q_i|^r\right)^{1/r}$ (generalizes both; $r=2$ → Euclidean, $r=1$ → Manhattan; the exponent $r$ is the `p` parameter in sklearn)
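A small sketch of how the three formulas relate. The exponent is named `r` in the code to keep it distinct from the point `p`; the function name and the sample points are illustrative.

```python
def minkowski(p, q, r):
    """Minkowski distance of order r between two equal-length points."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

p, q = (0.0, 0.0), (3.0, 4.0)

euclidean = minkowski(p, q, r=2)   # straight-line distance
manhattan = minkowski(p, q, r=1)   # sum of per-axis differences
large_r   = minkowski(p, q, r=50)  # approaches the largest per-axis difference

print(euclidean, manhattan, round(large_r, 4))
```

For these points the Euclidean distance is 5 and the Manhattan distance is 7. As `r` grows, the result approaches the largest single-coordinate difference (4 here), which is the Chebyshev distance.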

The Curse of Dimensionality

In high-dimensional spaces, pairwise distances concentrate: all points become approximately equidistant from each other, so the concept of "nearest" loses meaning. The volume of the hypersphere inscribed in a unit hypercube, relative to the cube itself, approaches 0 as dimensions increase, meaning most of the cube's volume lies in its "corners". This is why KNN degrades badly with many features and why feature selection or PCA helps.
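The volume claim can be checked directly with the closed-form volume of an $n$-dimensional ball, $V_n(r) = \frac{\pi^{n/2}}{\Gamma(n/2 + 1)} r^n$. The sketch below assumes the ball is inscribed in the unit hypercube (radius 0.5); the function name is illustrative.

```python
import math

def inscribed_ball_fraction(n_dim):
    """Fraction of the unit hypercube's volume covered by the inscribed ball (radius 0.5)."""
    radius = 0.5
    ball_volume = math.pi ** (n_dim / 2) / math.gamma(n_dim / 2 + 1) * radius ** n_dim
    return ball_volume  # the unit hypercube has volume 1, so this is already a fraction

for n_dim in (2, 3, 5, 10, 20):
    print(f"d={n_dim:2d}  fraction={inscribed_ball_fraction(n_dim):.8f}")
```

In 2D the inscribed circle covers about 79% of the square; by 10 dimensions the fraction is well under 1%, so almost all of the volume sits outside the ball.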

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# ALWAYS scale for KNN!
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# ══════════════════════════════════════
# FINDING OPTIMAL K — Elbow Method
# ══════════════════════════════════════
k_range = range(1, 51)
train_scores = []
test_scores  = []
cv_scores    = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean', weights='uniform')
    knn.fit(X_train_sc, y_train)
    train_scores.append(accuracy_score(y_train, knn.predict(X_train_sc)))
    test_scores.append(accuracy_score(y_test, knn.predict(X_test_sc)))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(k_range, train_scores, 'o-', color='#d4af37', label='Train', linewidth=2, ms=4)
axes[0].plot(k_range, test_scores,  's-', color='#3a7bd5', label='Test',  linewidth=2, ms=4)
optimal_k = k_range[test_scores.index(max(test_scores))]
axes[0].axvline(x=optimal_k, color='red', linestyle='--', 
                label=f'Optimal K={optimal_k}', alpha=0.8)
axes[0].set_xlabel('K (number of neighbors)')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Finding Optimal K\n(Small K=complex/overfit, Large K=simple/underfit)')
axes[0].legend()
axes[0].grid(True, alpha=0.2)

# Distance metrics comparison
metrics = ['euclidean', 'manhattan', 'chebyshev']
metric_scores = {}
for metric in metrics:
    knn = KNeighborsClassifier(n_neighbors=optimal_k, metric=metric)
    scores = cross_val_score(knn, X_train_sc, y_train, cv=5, scoring='accuracy')
    metric_scores[metric] = scores

# matplotlib needs an RGBA tuple here; CSS-style 'rgba(...)' strings raise an error
axes[1].boxplot([metric_scores[m] for m in metrics], patch_artist=True,
                boxprops=dict(facecolor=(212/255, 175/255, 55/255, 0.3), edgecolor='#d4af37'))
axes[1].set_xticklabels(metrics)
axes[1].set_title(f'Distance Metric Comparison (K={optimal_k})\n5-fold CV Accuracy')
axes[1].set_ylabel('CV Accuracy')

plt.suptitle('KNN Hyperparameter Analysis', fontweight='bold')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# BEST KNN MODEL
# ══════════════════════════════════════
best_knn = KNeighborsClassifier(
    n_neighbors=optimal_k,
    weights='distance',       # closer neighbors have more influence
    metric='euclidean',
    algorithm='ball_tree',    # tree-based neighbor search; tends to cope with more dimensions than kd_tree
    n_jobs=-1
)
best_knn.fit(X_train_sc, y_train)
print(f"KNN (K={optimal_k}, weighted) Accuracy: {accuracy_score(y_test, best_knn.predict(X_test_sc)):.4f}")

# ── Curse of Dimensionality Demo ─────────────────────────────────
print("\nCurse of Dimensionality:")
print(f"{'Dimensions':12} {'Mean Distance':15} {'Std Distance':12} {'Ratio':10}")
print("-" * 55)
for n_dim in [2, 5, 10, 50, 100, 500]:
    np.random.seed(42)
    points = np.random.randn(1000, n_dim)
    ref    = np.random.randn(1, n_dim)
    dists  = np.sqrt(np.sum((points - ref)**2, axis=1))
    ratio  = dists.std() / dists.mean() if dists.mean() > 0 else 0
    print(f"  {n_dim:<12} {dists.mean():<15.4f} {dists.std():<12.4f} {ratio:.4f}")
print("→ As dimensions grow, std/mean → 0 (all points same distance!)")

Common mistakes

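Forgetting to scale features: distances are then dominated by whichever feature has the largest range.
Choosing $k$ by test accuracy and then reporting that same test accuracy; pick $k$ with cross-validation instead.
Using KNN on high-dimensional data without feature selection or PCA, where distances stop being informative.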

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain K-Nearest Neighbors and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 48 — Naive Bayes

Naive Bayes — Probabilistic Classification

Why this matters

Naive Bayes is a fast probabilistic baseline, especially for text. Understanding its independence assumption and its variants explains both why it works so often and where it breaks.

Bayes' Theorem

Bayes' Theorem:

$$P(y | \mathbf{x}) = \frac{P(\mathbf{x} | y) \cdot P(y)}{P(\mathbf{x})}$$

Posterior ∝ Likelihood × Prior
$P(y|\mathbf{x})$ = posterior (what we want), $P(\mathbf{x}|y)$ = likelihood, $P(y)$ = prior

The "Naive" assumption: all features are conditionally independent given the class. This rarely holds in practice but the algorithm still works surprisingly well:

$$P(y | x_1, x_2, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i | y)$$
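A worked example may help here. The sketch below applies the factorized formula to a two-class toy problem; the priors and per-word likelihoods are hypothetical numbers chosen only to illustrate the arithmetic, and the computation is done in log space, as is common in practice, to avoid underflow when many probabilities are multiplied.

```python
import math

# Hypothetical numbers, chosen only to illustrate the arithmetic
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {                      # P(word present | class)
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham":  {"free": 0.05, "meeting": 0.20},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    # log P(y) + sum_i log P(x_i | y)
    log_scores = {
        cls: math.log(priors[cls]) + sum(math.log(likelihoods[cls][w]) for w in words)
        for cls in priors
    }
    # Normalizing so the scores sum to 1 is the division by P(x)
    total = sum(math.exp(s) for s in log_scores.values())
    return {cls: math.exp(s) / total for cls, s in log_scores.items()}

print(posterior(["free"]))
print(posterior(["free", "meeting"]))
```

With only "free" observed, spam wins with probability 0.8 (0.12 against 0.03 before normalizing). Adding "meeting" flips the decision to ham at roughly 0.71, because each word contributes its own independent factor.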

Variant	Assumption about P(xᵢ | y)	Use For
GaussianNB	Each feature is normally distributed within each class	Continuous features (measurements, sensor data)
MultinomialNB	Features are discrete counts or frequencies	Text classification (word counts, TF-IDF)
BernoulliNB	Features are binary (0/1)	Binary text features (word present/absent)
ComplementNB	Like MultinomialNB, but each class's parameters are estimated from the complement of that class (all other classes)	Imbalanced text classification

Code Example

import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer, fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import MinMaxScaler

# ══════════════════════════════════════
# GAUSSIAN NAIVE BAYES — Continuous Features
# ══════════════════════════════════════
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gnb = GaussianNB(
    var_smoothing=1e-9   # small value added to variance for numerical stability
)
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
y_proba = gnb.predict_proba(X_test)

print("GaussianNB:")
print(f"  Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"  CV Score: {cross_val_score(gnb, X, y, cv=5, scoring='accuracy').mean():.4f}")

# GaussianNB learns class statistics:
# Column 0 of this dataset is 'mean radius' ('worst radius' is column 20)
print(f"\n  Class priors (P(y)): {gnb.class_prior_.round(4)}")
print(f"  Mean of '{data.feature_names[0]}' per class: {gnb.theta_[:, 0].round(3)}")
print(f"  Variance of '{data.feature_names[0]}' per class: {gnb.var_[:, 0].round(3)}")

# ══════════════════════════════════════
# MULTINOMIAL NAIVE BAYES — Text Classification
# ══════════════════════════════════════
categories = ['rec.sport.baseball', 'rec.sport.hockey', 'sci.med', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories, random_state=42)

vectorizer = CountVectorizer(stop_words='english', max_features=5000)
X_text = vectorizer.fit_transform(newsgroups.data)
y_text = newsgroups.target

X_tr, X_te, y_tr, y_te = train_test_split(X_text, y_text, test_size=0.2, random_state=42)

mnb = MultinomialNB(alpha=1.0)   # alpha = Laplace smoothing (prevents zero probabilities)
mnb.fit(X_tr, y_tr)
y_pred_text = mnb.predict(X_te)

print("\nMultinomialNB on 20 Newsgroups Text Classification:")
print(f"  Accuracy: {accuracy_score(y_te, y_pred_text):.4f}")
print(classification_report(y_te, y_pred_text, target_names=newsgroups.target_names))

# Top words for each category
feature_names = vectorizer.get_feature_names_out()
print("Top 5 words per category:")
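# Iterate over newsgroups.target_names: that is the order of the model's class indices
for i, category in enumerate(newsgroups.target_names):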
    top5 = feature_names[np.argsort(mnb.feature_log_prob_[i])[-5:]][::-1]
    print(f"  {category}: {', '.join(top5)}")

# ══════════════════════════════════════
# NAIVE BAYES FOR REAL-TIME PREDICTION
# (near-instant inference makes it great for spam filtering)
# ══════════════════════════════════════
def classify_text(text, vectorizer, model, categories):
    X = vectorizer.transform([text])
    pred_class  = model.predict(X)[0]
    pred_probas = model.predict_proba(X)[0]
    print(f"Text: '{text[:60]}...'")
    print(f"Predicted: {categories[pred_class]}")
    for cat, prob in zip(categories, pred_probas):
        bar = '█' * int(prob * 30)
        print(f"  {cat:25} {bar} {prob:.4f}")

test_text = "The pitcher threw a fastball and the batter hit a home run"
classify_text(test_text, vectorizer, mnb, categories)

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain Naive Bayes and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 49 — SVM Theory

Support Vector Machines — Maximum Margin Classifier

Why this matters

Support Vector Machines introduce margin maximization and the regularization parameter $C$, two ideas that come up repeatedly in model debugging and interviews.

The Core Idea: Maximize the Margin

SVM finds the hyperplane that maximizes the margin — the distance between the hyperplane and the closest training points from each class. These closest points are called support vectors. A wider margin → better generalization.

SVM Decision Boundary (linear):

$$\mathbf{w}^T \mathbf{x} + b = 0$$

Margin Width:

$$\text{margin} = \frac{2}{\|\mathbf{w}\|}$$

Hard-margin SVM Optimization Problem:

$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b) \geq 1 \; \forall i$$

Soft-margin SVM (with slack variables $\xi_i$):

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{subject to} \quad y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b) \geq 1 - \xi_i$$

The $C$ hyperparameter controls the trade-off. A large $C$ penalizes violations heavily and approaches the hard-margin solution (narrower margin, few violations, more complex boundary). A small $C$ tolerates more violations in exchange for a wider margin and a simpler boundary.
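The relationship between $C$ and the margin can be checked numerically. The sketch below fits a linear SVM at several values of $C$ and computes the margin width $2/\|\mathbf{w}\|$ from the fitted weights. The synthetic clusters are an assumption made for illustration, and `coef_` is only available for the linear kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, well-separated clusters (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.8, size=(50, 2)),
               rng.normal(+2.0, 0.8, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

margins = {}
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    w = svm.coef_[0]                      # weight vector of the separating hyperplane
    margins[C] = 2.0 / np.linalg.norm(w)  # margin width = 2 / ||w||
    print(f"C={C:<6} margin={margins[C]:.3f} support_vectors={svm.n_support_.sum()}")
```

The smallest $C$ should give the widest margin and the most support vectors, since more points are allowed to sit inside the margin.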

Code Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# ══════════════════════════════════════
# LINEAR SVM
# ══════════════════════════════════════
svm_linear = SVC(
    kernel='linear',
    C=1.0,           # regularization — smaller = wider margin
    random_state=42
)
svm_linear.fit(X_train_sc, y_train)
y_pred = svm_linear.predict(X_test_sc)

print("Linear SVM:")
print(f"  Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"  Number of support vectors: {svm_linear.n_support_.sum()}")
print(f"  Support vectors per class: {svm_linear.n_support_}")

# ══════════════════════════════════════
# EFFECT OF C PARAMETER
# ══════════════════════════════════════
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
train_accs = []
test_accs  = []

for C in C_values:
    svm = SVC(kernel='rbf', C=C, gamma='scale', random_state=42)
    svm.fit(X_train_sc, y_train)
    train_accs.append(accuracy_score(y_train, svm.predict(X_train_sc)))
    test_accs.append(accuracy_score(y_test, svm.predict(X_test_sc)))

fig, ax = plt.subplots(figsize=(8, 5))
ax.semilogx(C_values, train_accs, 'o-', color='#d4af37', label='Train', linewidth=2)
ax.semilogx(C_values, test_accs,  's-', color='#3a7bd5', label='Test',  linewidth=2)
ax.set_xlabel('C (inverse regularization strength)')
ax.set_ylabel('Accuracy')
ax.set_title('SVM: Effect of C Parameter (RBF kernel)\nSmall C = wide margin; Large C = narrow margin')
ax.legend()
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# 2D VISUALIZATION OF DECISION BOUNDARY
# ══════════════════════════════════════
from sklearn.datasets import make_classification

X_vis, y_vis = make_classification(n_samples=200, n_features=2, n_redundant=0,
                                    n_informative=2, random_state=42)
X_vis_tr, X_vis_te, y_vis_tr, y_vis_te = train_test_split(X_vis, y_vis, test_size=0.2, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for ax, C, title in zip(axes,
                         [0.01, 1.0, 100.0],
                         ['C=0.01 (Wide margin,\nmore violations OK)',
                          'C=1.0 (Balanced)',
                          'C=100 (Narrow margin,\nfew violations)']):
    svm = SVC(kernel='rbf', C=C, gamma='scale')
    svm.fit(X_vis_tr, y_vis_tr)
    
    x1 = np.linspace(X_vis[:,0].min()-1, X_vis[:,0].max()+1, 200)
    x2 = np.linspace(X_vis[:,1].min()-1, X_vis[:,1].max()+1, 200)
    xx1, xx2 = np.meshgrid(x1, x2)
    Z = svm.predict(np.c_[xx1.ravel(), xx2.ravel()]).reshape(xx1.shape)
    
    ax.contourf(xx1, xx2, Z, alpha=0.3, cmap='bwr')
    ax.scatter(X_vis_tr[y_vis_tr==0, 0], X_vis_tr[y_vis_tr==0, 1], c='#e74c3c', s=30, label='Class 0')
    ax.scatter(X_vis_tr[y_vis_tr==1, 0], X_vis_tr[y_vis_tr==1, 1], c='#3a7bd5', s=30, label='Class 1')
    # Highlight support vectors
    sv = svm.support_vectors_
    ax.scatter(sv[:, 0], sv[:, 1], s=150, facecolors='none', edgecolors='#d4af37', 
               linewidths=2, label=f'SVs ({len(sv)})')
    acc = accuracy_score(y_vis_te, svm.predict(X_vis_te))
    ax.set_title(f'{title}\nTest Acc={acc:.3f}')
    ax.legend(fontsize=7)

plt.suptitle('SVM: Effect of C Hyperparameter on Decision Boundary', fontweight='bold')
plt.tight_layout()
plt.show()

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain support vector machines and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 50 — SVM Kernels

SVM Kernels — The Kernel Trick

Why this matters

Kernels are what let SVMs fit non-linear boundaries, and the choice of kernel and $\gamma$ largely determines whether the model underfits or overfits.

The Kernel Trick

When data is not linearly separable in the original space, map it to a higher-dimensional space where it becomes separable. The kernel trick does this implicitly without computing the actual high-dimensional mapping — by replacing dot products $\mathbf{x}^T \mathbf{z}$ with a kernel function $K(\mathbf{x}, \mathbf{z})$.
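The identity behind the trick can be verified by hand for a small case. The sketch below uses the degree-2 polynomial kernel $(\mathbf{x}^T\mathbf{z} + 1)^2$ in 2D, whose explicit feature map has six components. The function names `phi` and `poly_kernel` and the sample vectors are illustrative.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel (x.z + 1)^2 on 2D inputs."""
    x1, x2 = v
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

def poly_kernel(x, z):
    """The same quantity, computed without leaving the original 2D space."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # dot product in the 6D feature space
implicit = poly_kernel(x, z)        # kernel evaluated in 2D
print(explicit, implicit)
```

Both routes give the same value (4 for these vectors), but the kernel needs only one 2D dot product. For higher degrees, or for the RBF kernel whose feature space is infinite-dimensional, the explicit route becomes impractical while the kernel stays cheap.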

Common Kernel Functions:

Linear: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z}$

Polynomial: $K(\mathbf{x}, \mathbf{z}) = (\gamma \mathbf{x}^T\mathbf{z} + r)^d$

RBF (Gaussian): $K(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma\|\mathbf{x} - \mathbf{z}\|^2\right)$

Sigmoid: $K(\mathbf{x}, \mathbf{z}) = \tanh(\gamma \mathbf{x}^T\mathbf{z} + r)$
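To see what $\gamma$ does in the RBF formula, the sketch below evaluates the kernel by hand with NumPy on three hypothetical points. The helper `rbf_kernel_matrix` is written here for illustration and is not a library function.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

# Three hypothetical points: two close together, one far away
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])

for gamma in (0.01, 1.0, 10.0):
    K = rbf_kernel_matrix(X, gamma)
    print(f"gamma={gamma}: K(x0,x1)={K[0, 1]:.4f}  K(x0,x2)={K[0, 2]:.4f}")
```

With a small $\gamma$ even the distant point still looks similar, which gives smooth boundaries. With a large $\gamma$ similarity collapses to nearly zero for anything but near-duplicates, so each training point influences only its immediate neighborhood.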

Code Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# ══════════════════════════════════════
# KERNEL COMPARISON ON NON-LINEAR DATA
# ══════════════════════════════════════
datasets = {
    'Moons':        make_moons(n_samples=200, noise=0.2, random_state=42),
    'Circles':      make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=42),
    'Linear blobs': make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=42)
}

kernels = ['linear', 'poly', 'rbf']

fig, axes = plt.subplots(len(datasets), len(kernels), figsize=(15, 12))
fig.suptitle('SVM Kernels on Different Datasets', fontsize=14, fontweight='bold')

for row, (ds_name, (X_ds, y_ds)) in enumerate(datasets.items()):
    X_tr, X_te, y_tr, y_te = train_test_split(X_ds, y_ds, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_tr_sc = scaler.fit_transform(X_tr)
    X_te_sc = scaler.transform(X_te)
    
    for col, kernel in enumerate(kernels):
        ax = axes[row][col]
        
        params = {'linear': {'C':1}, 'poly': {'C':1,'degree':3,'coef0':1,'gamma':'scale'}, 
                  'rbf': {'C':1,'gamma':'scale'}}
        svm = SVC(kernel=kernel, **params[kernel], random_state=42)
        svm.fit(X_tr_sc, y_tr)
        
        x1_range = np.linspace(X_tr_sc[:,0].min()-0.5, X_tr_sc[:,0].max()+0.5, 150)
        x2_range = np.linspace(X_tr_sc[:,1].min()-0.5, X_tr_sc[:,1].max()+0.5, 150)
        xx1, xx2 = np.meshgrid(x1_range, x2_range)
        Z = svm.predict(np.c_[xx1.ravel(), xx2.ravel()]).reshape(xx1.shape)
        
        ax.contourf(xx1, xx2, Z, alpha=0.3, cmap='bwr')
        ax.scatter(X_tr_sc[y_tr==0,0], X_tr_sc[y_tr==0,1], c='#e74c3c', s=20, alpha=0.8)
        ax.scatter(X_tr_sc[y_tr==1,0], X_tr_sc[y_tr==1,1], c='#3a7bd5', s=20, alpha=0.8)
        
        acc = accuracy_score(y_te, svm.predict(X_te_sc))
        if row == 0:
            ax.set_title(f'Kernel: {kernel.upper()}\nAcc={acc:.3f}', fontweight='bold')
        else:
            ax.set_title(f'Acc={acc:.3f}')
        if col == 0:
            ax.set_ylabel(ds_name, fontweight='bold')

plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# RBF GAMMA EFFECT
# ══════════════════════════════════════
X_moon, y_moon = make_moons(n_samples=300, noise=0.15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_moon, y_moon, test_size=0.2, random_state=42)
sc = StandardScaler()
X_tr = sc.fit_transform(X_tr)
X_te = sc.transform(X_te)

fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for ax, gamma in zip(axes, [0.01, 0.1, 1.0, 10.0]):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X_tr, y_tr)
    x1 = np.linspace(X_tr[:,0].min()-0.3, X_tr[:,0].max()+0.3, 150)
    x2 = np.linspace(X_tr[:,1].min()-0.3, X_tr[:,1].max()+0.3, 150)
    xx1, xx2 = np.meshgrid(x1, x2)
    Z = svm.predict(np.c_[xx1.ravel(), xx2.ravel()]).reshape(xx1.shape)
    ax.contourf(xx1, xx2, Z, alpha=0.3, cmap='bwr')
    ax.scatter(X_tr[y_tr==0,0], X_tr[y_tr==0,1], c='#e74c3c', s=25)
    ax.scatter(X_tr[y_tr==1,0], X_tr[y_tr==1,1], c='#3a7bd5', s=25)
    acc = accuracy_score(y_te, svm.predict(X_te))
    ax.set_title(f'γ={gamma}\nAcc={acc:.3f}')

plt.suptitle('RBF Kernel: Effect of Gamma\n(Low γ=smooth, High γ=complex/overfit)', fontweight='bold')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# SVM HYPERPARAMETER TUNING
# ══════════════════════════════════════
param_grid = {
    'C':     [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
    'kernel': ['rbf', 'poly']
}
gs = GridSearchCV(SVC(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
gs.fit(X_tr, y_tr)
print(f"Best SVM params: {gs.best_params_}")
print(f"Best CV score:   {gs.best_score_:.4f}")

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain SVM kernels and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 51 — Random Forests

Random Forests — Bagging + Feature Subsampling

Why this matters

Random forests turn high-variance decision trees into a stable ensemble through bagging and feature subsampling; OOB error and feature importances are common debugging and interview topics.

Bagging (Bootstrap Aggregating)

Train multiple models on different bootstrap samples (random sampling with replacement) of the training data, then aggregate their predictions. Each bootstrap sample contains ~63.2% of unique training examples (the rest are out-of-bag).
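The 63.2% figure comes from the chance that a given row is never drawn in $n$ draws with replacement: $(1 - 1/n)^n \to 1/e \approx 0.368$, so about $1 - 1/e \approx 0.632$ of rows appear at least once. A quick simulation, with the sample size and seed chosen arbitrarily, shows this:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One bootstrap sample: n draws with replacement from row indices 0..n-1
sample = rng.integers(0, n, size=n)

in_bag_fraction = np.unique(sample).size / n
print(f"in-bag:     {in_bag_fraction:.4f}")
print(f"out-of-bag: {1 - in_bag_fraction:.4f}")
print(f"1 - 1/e:    {1 - np.exp(-1):.4f}")
```

The simulated in-bag fraction should land very close to $1 - 1/e$, and the remainder is the out-of-bag share for that sample.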

Random Forest Additions to Bagging

Random Forest further de-correlates trees by also randomly subsampling features at each split:

At each split, consider only $\sqrt{p}$ (classification) or $p/3$ (regression) features instead of all $p$.
This prevents all trees from making the same top split (e.g., always splitting on the most important feature), leading to more diverse trees and better ensemble performance.
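One way to see the de-correlation effect is to count which feature each tree picks for its root split, with and without feature subsampling. The sketch below is an illustrative experiment on the breast cancer dataset; the helper name `root_features` and the forest size are assumptions, and exact counts will vary with the seed.

```python
from collections import Counter
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

def root_features(max_features):
    """Count which feature each tree uses for its first (root) split."""
    forest = RandomForestClassifier(
        n_estimators=100, max_features=max_features, random_state=0
    ).fit(X, y)
    # tree_.feature[0] is the index of the feature tested at the root node
    return Counter(int(est.tree_.feature[0]) for est in forest.estimators_)

bagging_roots = root_features(None)     # every split sees all features (plain bagging)
forest_roots = root_features("sqrt")    # every split sees a random subset of features

print("distinct root features, all features:", len(bagging_roots))
print("distinct root features, sqrt(p):     ", len(forest_roots))
```

With all features available, the bootstrap samples tend to agree on a handful of dominant root features. With $\sqrt{p}$ subsampling, the root split is spread across many more features, which is the source of the extra tree diversity.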

Out-of-Bag (OOB) Error

Since each tree is trained on a bootstrap sample covering ~63.2% of the distinct training rows, the remaining ~36.8% (OOB samples) act as a validation set for that tree. The OOB error scores each training row using only the trees that never saw it, giving a cross-validation-like estimate at almost no extra cost.

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

# ══════════════════════════════════════
# SINGLE TREE vs RANDOM FOREST
# ══════════════════════════════════════
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_features='sqrt',   # sqrt(p) features per split
    bootstrap=True,        # use bootstrap samples
    oob_score=True,        # compute OOB score
    max_depth=None,        # no depth limit per tree
    min_samples_leaf=1,
    n_jobs=-1,
    random_state=42
)
rf.fit(X_train, y_train)

print(f"Single Decision Tree — Test Accuracy: {accuracy_score(y_test, dt.predict(X_test)):.4f}")
print(f"Random Forest       — Test Accuracy: {accuracy_score(y_test, rf.predict(X_test)):.4f}")
print(f"Random Forest       — OOB Score:     {rf.oob_score_:.4f}")

# ══════════════════════════════════════
# EFFECT OF N_ESTIMATORS
# ══════════════════════════════════════
oob_errors = []
test_errors = []
n_trees_range = range(1, 201, 5)

rf_growing = RandomForestClassifier(warm_start=True, oob_score=True, random_state=42, n_jobs=-1)
for n_trees in n_trees_range:
    rf_growing.n_estimators = n_trees
    rf_growing.fit(X_train, y_train)
    oob_errors.append(1 - rf_growing.oob_score_)
    test_errors.append(1 - accuracy_score(y_test, rf_growing.predict(X_test)))

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(n_trees_range, oob_errors,  'o-', color='#d4af37', label='OOB Error',  linewidth=2, ms=4)
ax.plot(n_trees_range, test_errors, 's-', color='#3a7bd5', label='Test Error', linewidth=2, ms=4)
ax.axhline(y=1 - accuracy_score(y_test, dt.predict(X_test)), color='red', 
           linestyle='--', label='Single Tree Error', alpha=0.7)
ax.set_xlabel('Number of Trees')
ax.set_ylabel('Error Rate')
ax.set_title('Random Forest: Effect of n_estimators\n(OOB error ≈ test error: free validation!)')
ax.legend()
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# FEATURE IMPORTANCES
# ══════════════════════════════════════
rf_final = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1, random_state=42)
rf_final.fit(X_train, y_train)

importances = pd.Series(rf_final.feature_importances_, index=data.feature_names)
importances_std = pd.Series(
    np.std([tree.feature_importances_ for tree in rf_final.estimators_], axis=0),
    index=data.feature_names
)

top_features = importances.sort_values(ascending=False).head(10)

fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(range(10), top_features.values, 
        xerr=importances_std[top_features.index].values,
        color='#d4af37', alpha=0.8, edgecolor='black', capsize=4)
ax.set_yticks(range(10))
ax.set_yticklabels(top_features.index, fontsize=9)
ax.invert_yaxis()
ax.set_title('Random Forest — Feature Importances (Gini)\nError bars show std across trees')
ax.set_xlabel('Mean Decrease in Gini Impurity')
plt.tight_layout()
plt.show()

print(f"\nOOB Score: {rf_final.oob_score_:.4f}")
print(f"Test Score: {accuracy_score(y_test, rf_final.predict(X_test)):.4f}")

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain random forests and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 52 — Bagging

Bagging vs Boosting — The Two Ensemble Paradigms

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Aspect	Bagging	Boosting
Training	Trees trained in parallel, independently	Trees trained sequentially, each corrects previous
Focus	Reduces variance (overfitting)	Reduces bias (underfitting)
Base learners	Strong, complex trees (low bias, high variance)	Weak learners — shallow trees (high bias, low variance)
Weight of samples	Equal weight (random sampling)	Hard samples get more focus (reweighting in AdaBoost, residual fitting in GBM)
Combination	Majority vote / simple average	Weighted vote / additive model
Speed	Parallelizable → Fast	Sequential → Slower
Overfitting risk	Low (averaging reduces variance)	Medium (can overfit if too many rounds)
Algorithms	Random Forest, BaggingClassifier	AdaBoost, GBM, XGBoost, LightGBM, CatBoost

Key Insight: Bagging works best when base learners are high variance (deep trees that overfit). By averaging many overfit models, variance cancels out. Boosting works best when base learners are high bias (shallow trees/stumps). By combining many biased models each correcting the last, bias is gradually reduced.
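
The variance-cancellation claim can be checked numerically. The sketch below is a toy simulation rather than a real ensemble: it assumes each model's prediction error has variance $\sigma^2$ and pairwise correlation $\rho$ (both values are made up for illustration) and compares the variance of the averaged error with the textbook expression $\rho\sigma^2 + (1-\rho)\sigma^2/B$.

```python
import numpy as np

rng = np.random.default_rng(0)

n_trials = 20_000   # number of simulated prediction points
n_models = 25       # B: number of averaged models (illustrative value)
sigma2 = 1.0        # variance of one model's error (assumed)
rho = 0.3           # pairwise correlation between model errors (assumed)

# Each error = shared component (creates the correlation) + independent component
shared = rng.normal(0.0, np.sqrt(rho * sigma2), size=(n_trials, 1))
indep = rng.normal(0.0, np.sqrt((1 - rho) * sigma2), size=(n_trials, n_models))
errors = shared + indep                      # shape: (n_trials, n_models)

var_single = errors[:, 0].var()              # variance of one model
var_bagged = errors.mean(axis=1).var()       # variance after averaging B models
var_theory = rho * sigma2 + (1 - rho) * sigma2 / n_models

print(f"Single model variance : {var_single:.3f}")
print(f"Averaged variance     : {var_bagged:.3f}")
print(f"Theoretical value     : {var_theory:.3f}")
```

Averaging removes only the independent part of the error. The correlated part remains, which is one reason Random Forest adds feature subsampling to decorrelate its trees.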
      

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain bagging vs boosting and when each applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 53 — AdaBoost

AdaBoost — Adaptive Boosting

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Algorithm Intuition

AdaBoost trains a sequence of weak learners (typically decision stumps — trees with depth=1). After each round, it increases the weight of misclassified samples so the next learner focuses on them. Final prediction is a weighted vote of all learners.

Initialize all sample weights: $w_i = 1/m$
Train weak learner $h_t$ on weighted samples
Compute weighted error: $\epsilon_t = \sum_{i=1}^m w_i \cdot \mathbb{1}[h_t(x_i) \neq y_i]$
Compute learner weight: $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
Update sample weights: $w_i \leftarrow w_i \cdot \exp(-\alpha_t y_i h_t(x_i))$, then normalize
Final model: $H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$
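
The steps above can be sketched from scratch. This is a minimal illustration, assuming binary labels encoded as $\{-1, +1\}$ (which the weight-update formula requires) and sklearn decision stumps as weak learners. The helper names adaboost_fit and adaboost_predict are hypothetical, and the clipping of $\epsilon_t$ is an added safeguard, not part of the algorithm as stated.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost with decision stumps. Labels y must be in {-1, +1}."""
    m = len(y)
    w = np.full(m, 1.0 / m)                       # initialize w_i = 1/m
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)          # train weak learner on weighted samples
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))             # weighted error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      # safeguard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)     # learner weight
        w = w * np.exp(-alpha * y * pred)         # up-weight misclassified samples
        w = w / w.sum()                           # normalize
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas), w


def adaboost_predict(X, learners, alphas):
    """Sign of the alpha-weighted vote (ties resolved as +1)."""
    scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.where(scores >= 0, 1, -1)


X_demo, y01 = make_moons(n_samples=500, noise=0.3, random_state=0)
y_demo = 2 * y01 - 1                              # map {0, 1} -> {-1, +1}

learners, alphas, final_w = adaboost_fit(X_demo, y_demo, n_rounds=50)
train_pred = adaboost_predict(X_demo, learners, alphas)
train_acc = np.mean(train_pred == y_demo)
print(f"Training accuracy after 50 rounds: {train_acc:.3f}")
```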

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ══════════════════════════════════════
# ADABOOST — Base: Decision Stump (depth=1)
# ══════════════════════════════════════
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner
    n_estimators=200,      # number of boosting rounds
    learning_rate=0.5,     # shrinks contribution of each learner (like regularization)
    algorithm='SAMME',     # discrete AdaBoost; 'SAMME.R' was deprecated in sklearn 1.4, and this argument is itself deprecated from 1.6 (omit it there)
    random_state=42
)
ada.fit(X_train, y_train)
print(f"AdaBoost Accuracy: {accuracy_score(y_test, ada.predict(X_test)):.4f}")

# ── Track accuracy vs number of estimators ──────────────────────
train_staged = list(ada.staged_score(X_train, y_train))
test_staged  = list(ada.staged_score(X_test,  y_test))

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(train_staged, color='#d4af37', label='Train Accuracy', linewidth=2)
ax.plot(test_staged,  color='#3a7bd5', label='Test Accuracy',  linewidth=2)
ax.set_xlabel('Number of Estimators')
ax.set_ylabel('Accuracy')
ax.set_title('AdaBoost — Staged Score vs Number of Estimators\n(Note: Unlike random forests, boosting CAN overfit!)')
ax.legend()
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

# ── Effect of max_depth (stump vs deeper trees) ──────────────────
print("\nDepth Comparison (200 estimators):")
for depth in [1, 2, 3, 5]:
    ada_d = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=depth),
        n_estimators=200, learning_rate=0.5, random_state=42
    )
    cv = cross_val_score(ada_d, X_train, y_train, cv=5, scoring='accuracy')
    print(f"  max_depth={depth}: CV={cv.mean():.4f} ± {cv.std():.4f}")

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain AdaBoost and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 54 — Gradient Boosting

Gradient Boosting Machines — Residual Fitting

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

The Gradient Boosting Idea

Gradient Boosting builds an additive model by fitting each new tree to the negative gradient of the loss function with respect to the current prediction — i.e., the residuals (for regression with MSE loss).

Algorithm:

Initialize with a constant prediction: $F_0(x) = \arg\min_\gamma \sum_i L(y_i, \gamma)$ (e.g., mean for regression)
For $m = 1, 2, \ldots, M$:
1. Compute pseudo-residuals: $r_{im} = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}$
2. Fit a tree $h_m(x)$ to the pseudo-residuals
3. Update: $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$ where $\nu$ is the learning rate
Final prediction: $F_M(x)$
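
For squared-error loss $L = \frac{1}{2}(y - F)^2$ the pseudo-residual is simply $y - F$, so the algorithm reduces to repeatedly fitting trees to residuals. The sketch below shows that special case only. It assumes synthetic sine data and hypothetical helper names (gbm_fit, gbm_predict), and it is not how sklearn implements gradient boosting internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gbm_fit(X, y, n_rounds=100, nu=0.1, max_depth=2):
    """Gradient boosting for squared-error loss (residual fitting)."""
    f0 = float(np.mean(y))                        # F_0: the mean minimizes squared error
    pred = np.full(len(y), f0)
    trees = []
    mse_history = [float(np.mean((y - pred) ** 2))]
    for _ in range(n_rounds):
        residuals = y - pred                      # negative gradient of 1/2 (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)                    # fit h_m to the pseudo-residuals
        pred = pred + nu * tree.predict(X)        # F_m = F_{m-1} + nu * h_m
        trees.append(tree)
        mse_history.append(float(np.mean((y - pred) ** 2)))
    return f0, trees, mse_history


def gbm_predict(X, f0, trees, nu=0.1):
    """Sum the constant start value and all shrunken tree outputs."""
    return f0 + nu * sum(t.predict(X) for t in trees)


rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.2, size=300)

f0, trees, mse_history = gbm_fit(X_demo, y_demo)
print(f"Train MSE before boosting: {mse_history[0]:.4f}")
print(f"Train MSE after 100 trees: {mse_history[-1]:.4f}")
```

Training error can only go down with this update, which is exactly why a held-out set is needed to decide when to stop.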

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor, HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

# ══════════════════════════════════════
# SKLEARN GRADIENT BOOSTING
# ══════════════════════════════════════
gbm = GradientBoostingClassifier(
    n_estimators=200,          # number of boosting stages (trees)
    learning_rate=0.1,         # shrinkage — smaller = more trees needed, more robust
    max_depth=3,               # depth of each tree (typically 3-5)
    subsample=0.8,             # stochastic GBM: use 80% of training data per tree
    max_features='sqrt',       # feature subsampling per tree
    min_samples_leaf=5,
    random_state=42
)
gbm.fit(X_train, y_train)
print(f"GBM Accuracy: {accuracy_score(y_test, gbm.predict(X_test)):.4f}")

# ── Staged predictions to see how accuracy evolves with n_estimators ──
train_staged = [accuracy_score(y_train, y_p) for y_p in gbm.staged_predict(X_train)]
test_staged  = [accuracy_score(y_test,  y_p) for y_p in gbm.staged_predict(X_test)]

# Illustration only: picking N on the test set leaks information.
# In practice, choose it on a separate validation split or via CV.
optimal_n = int(np.argmax(test_staged)) + 1
print(f"Optimal n_estimators: {optimal_n}")

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(train_staged, color='#d4af37', label='Train', linewidth=2)
ax.plot(test_staged,  color='#3a7bd5', label='Test',  linewidth=2)
ax.axvline(x=optimal_n, color='red', linestyle='--', 
           label=f'Optimal N={optimal_n}', alpha=0.8)
ax.set_xlabel('Number of Boosting Rounds')
ax.set_ylabel('Accuracy')
ax.set_title('GBM — Staged Accuracy\n(Use early stopping to find optimal N automatically)')
ax.legend()
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# HISTOGRAM-BASED GBM (stable API since sklearn 1.0)
# Much faster for large datasets (like LightGBM)
# ══════════════════════════════════════
hgbm = HistGradientBoostingClassifier(
    max_iter=200,
    learning_rate=0.1,
    max_depth=6,
    l2_regularization=0.1,
    random_state=42,
    early_stopping=True,          # automatically stop when validation score plateaus
    validation_fraction=0.1,      # fraction of training data for early stopping
    n_iter_no_change=20           # patience
)
hgbm.fit(X_train, y_train)
print(f"\nHistGBM Accuracy: {accuracy_score(y_test, hgbm.predict(X_test)):.4f}")
print(f"Actual iterations used: {hgbm.n_iter_}")

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain gradient boosting machines and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 55 — XGBoost

XGBoost — Regularized Gradient Boosting

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Failure mode — Overfitting with too many trees

Validation error drops then rises as n_estimators grows. Fix with early stopping (early_stopping_rounds), lower max_depth, or higher reg_lambda — do not tune on the test set.
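
A minimal sketch of this workflow, with the boosting-round count chosen on a validation split and the test set touched only once at the end. To avoid an extra dependency it uses sklearn's GradientBoostingClassifier as a stand-in for XGBoost. The split sizes and hyperparameters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Three-way split: train / validation (for choosing N) / test (final report only)
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp)

gb = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1,
                                max_depth=3, random_state=42)
gb.fit(X_tr, y_tr)

# Validation log loss after each boosting round
val_loss = [log_loss(y_val, proba) for proba in gb.staged_predict_proba(X_val)]
best_n = int(np.argmin(val_loss)) + 1

# Refit with the chosen number of rounds, then evaluate once on the test set
final = GradientBoostingClassifier(n_estimators=best_n, learning_rate=0.1,
                                   max_depth=3, random_state=42).fit(X_tr, y_tr)
test_loss = log_loss(y_test, final.predict_proba(X_test))

print(f"Best number of rounds (validation): {best_n}")
print(f"Validation loss at best round: {val_loss[best_n - 1]:.4f}")
print(f"Validation loss at round 300:  {val_loss[-1]:.4f}")
print(f"Test log loss with best_n:     {test_loss:.4f}")
```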

XGBoost Innovations Over Traditional GBM

Regularized Objective: Adds L1 and L2 penalties on leaf weights and tree complexity directly to the objective function
Second-order gradients: Uses both first ($g_i$) and second ($h_i$) derivatives of the loss for better optimization
Tree pruning: Grows each tree level by level up to max_depth, then prunes back splits whose gain falls below the threshold $\gamma$
Sparse-aware algorithm: Handles missing values natively by learning optimal direction for missing values
Column and row subsampling: Like LightGBM and Random Forest
Parallel computation: Parallelizes split finding across features

XGBoost Regularized Objective:

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2$$

$T$ = number of leaves, $w_j$ = leaf weight, $\gamma$ = min gain to split, $\lambda$ = L2 on weights. Setting reg_alpha adds an L1 term $\alpha\sum_{j=1}^{T} |w_j|$ to $\Omega(f)$.
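
A second-order Taylor expansion of this objective gives the optimal weight of a leaf, $w_j^* = -G_j / (H_j + \lambda)$, and the gain of a split, $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{G^2}{H+\lambda}\right] - \gamma$, where $G$ and $H$ are sums of $g_i$ and $h_i$ over the samples in a node. The sketch below evaluates both on four toy targets under squared-error loss. The data and function names are illustrative, and the L1 term is left out for simplicity.

```python
import numpy as np


def leaf_weight(g, h, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -np.sum(g) / (np.sum(h) + lam)


def leaf_score(g, h, lam):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return np.sum(g) ** 2 / (np.sum(h) + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction from splitting one leaf into two, minus the penalty gamma."""
    g_all = np.concatenate([g_left, g_right])
    h_all = np.concatenate([h_left, h_right])
    return 0.5 * (leaf_score(g_left, h_left, lam)
                  + leaf_score(g_right, h_right, lam)
                  - leaf_score(g_all, h_all, lam)) - gamma


# Toy data: squared error l = 1/2 (y - y_hat)^2 gives g = y_hat - y and h = 1
y = np.array([1.0, 1.0, 3.0, 3.0])
y_hat = np.zeros_like(y)            # current prediction before this tree
g = y_hat - y
h = np.ones_like(y)

w_root = leaf_weight(g, h, lam=1.0)
gain = split_gain(g[:2], h[:2], g[2:], h[2:], lam=1.0, gamma=0.1)

print(f"Leaf weight without split: {w_root:.3f}")   # 8 / (4 + 1) = 1.600
print(f"Gain of splitting [1,1] | [3,3]: {gain:.4f}")
```

With $\lambda = 0$ the unsplit leaf weight would be the plain mean of the residuals (2.0). The L2 penalty shrinks it to 1.6, and a split is kept only if its gain stays positive after subtracting $\gamma$.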

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
import xgboost as xgb
from xgboost import XGBClassifier, XGBRegressor

data = load_breast_cancer()
X, y = pd.DataFrame(data.data, columns=data.feature_names), data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

# ══════════════════════════════════════
# XGBOOST CLASSIFIER
# ══════════════════════════════════════
xgb_clf = XGBClassifier(
    n_estimators=500,         # max trees (use early stopping)
    learning_rate=0.05,       # eta — smaller = more robust, needs more trees
    max_depth=4,              # tree depth (3-6 for classification)
    min_child_weight=1,       # minimum sum of instance weight in child (min_samples_leaf analog)
    gamma=0.1,                # minimum loss reduction to make a split (tree pruning)
    subsample=0.8,            # row subsampling per tree
    colsample_bytree=0.8,     # column subsampling per tree
    colsample_bylevel=0.8,    # column subsampling per level
    reg_alpha=0.1,            # L1 regularization on weights
    reg_lambda=1.0,           # L2 regularization on weights
    scale_pos_weight=1,       # for imbalanced: sum(neg)/sum(pos)
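    early_stopping_rounds=50, # stop when the last eval_set entry stops improving (XGBoost >= 1.6)
    eval_metric='logloss',
    random_state=42,
    n_jobs=-1
)

# Train with early stopping. The last entry of eval_set drives the stopping rule.
# The test set is used here only to keep the demo short; use a separate
# validation split in practice so the test score stays unbiased.
xgb_clf.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=50
)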

y_pred  = xgb_clf.predict(X_test)
y_proba = xgb_clf.predict_proba(X_test)[:, 1]
print(f"\nXGBoost — Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"XGBoost — ROC-AUC:  {roc_auc_score(y_test, y_proba):.4f}")

# ── Learning Curves ──────────────────────────────────────────────
results = xgb_clf.evals_result()
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(results['validation_0']['logloss'], color='#d4af37', label='Train Log Loss', linewidth=2)
ax.plot(results['validation_1']['logloss'], color='#3a7bd5', label='Test Log Loss',  linewidth=2)
ax.set_xlabel('Boosting Round')
ax.set_ylabel('Log Loss')
ax.set_title('XGBoost — Training Curves\n(Early stopping prevents overfitting)')
ax.legend()
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

# ── XGBoost Feature Importance (3 types) ────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
importance_types = ['weight', 'gain', 'cover']
labels = [
    'weight = times feature used in split',
    'gain = avg gain per use (most useful)',
    'cover = avg samples covered'
]
for ax, imp_type, label in zip(axes, importance_types, labels):
    importance = pd.Series(xgb_clf.get_booster().get_score(importance_type=imp_type))
    importance.sort_values(ascending=False).head(10).plot(kind='barh', ax=ax, 
                                                           color='#d4af37', alpha=0.8)
    ax.invert_yaxis()
    ax.set_title(f'Importance Type: {imp_type}\n({label})', fontsize=9)

plt.suptitle('XGBoost Feature Importance — Three Types', fontweight='bold')
plt.tight_layout()
plt.show()

# ── Cross-Validated Hyperparameter Tuning ───────────────────────
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

param_dist = {
    'n_estimators':      randint(100, 500),
    'max_depth':         randint(3, 8),
    'learning_rate':     uniform(0.01, 0.2),
    'subsample':         uniform(0.6, 0.4),
    'colsample_bytree':  uniform(0.6, 0.4),
    'gamma':             uniform(0, 0.3),
    'reg_alpha':         uniform(0, 1),
    'reg_lambda':        uniform(0, 2)
}

xgb_base = XGBClassifier(eval_metric='logloss', random_state=42, n_jobs=-1)  # use_label_encoder is no longer needed in recent XGBoost
random_search = RandomizedSearchCV(xgb_base, param_distributions=param_dist,
                                   n_iter=30, cv=5, scoring='roc_auc',
                                   random_state=42, n_jobs=-1, verbose=0)
random_search.fit(X_train, y_train)
print(f"\nBest XGBoost params: {random_search.best_params_}")
print(f"Best CV ROC-AUC: {random_search.best_score_:.4f}")

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain XGBoost and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 56 — LightGBM

LightGBM — Faster, Better Gradient Boosting

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

LightGBM's Key Innovations over XGBoost

Innovation	XGBoost	LightGBM	Benefit
Tree growth	Level-wise (breadth-first)	Leaf-wise (best-first)	Faster convergence; lower loss
GOSS	Uses all instances for gradient computation	Gradient-based One-Side Sampling (keeps high-gradient instances)	Less data used → faster
EFB	Uses all features	Exclusive Feature Bundling (bundles mutually exclusive sparse features)	Fewer features → faster
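Histogram	Originally a pre-sorted exact algorithm; histogram mode via tree_method='hist' (the default in recent releases)	Histogram-based from the start (bins continuous values)	Memory efficient; faster splits
Categorical	Manual encoding, or opt-in enable_categorical=True in newer releases	Native categorical handling	No manual encoding needed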
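
The histogram row is the easiest of these ideas to see in code. The sketch below is a simplified NumPy illustration, not LightGBM's implementation: it bins one feature by quantiles, accumulates gradient sums per bin, and scans the bin boundaries for the best split. It assumes squared-error loss (so Hessian sums equal sample counts) and uses a hypothetical helper name and synthetic data.

```python
import numpy as np


def best_histogram_split(x, grad, n_bins=32, lam=1.0):
    """Find the best split threshold for one feature using binned gradients."""
    interior_q = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.quantile(x, interior_q)                 # n_bins - 1 candidate thresholds
    bins = np.searchsorted(edges, x, side="right")     # bin index in [0, n_bins - 1]

    # Build the histogram once: gradient sum and sample count per bin
    g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    n_hist = np.bincount(bins, minlength=n_bins)

    # Prefix sums give left-child statistics for every candidate threshold
    g_left = np.cumsum(g_hist)[:-1]
    n_left = np.cumsum(n_hist)[:-1]
    g_tot, n_tot = g_hist.sum(), n_hist.sum()

    def score(g, n):
        return g ** 2 / (n + lam)                      # h_i = 1 for squared error

    gain = (score(g_left, n_left)
            + score(g_tot - g_left, n_tot - n_left)
            - score(g_tot, n_tot))
    k = int(np.argmax(gain))
    return edges[k], gain[k]                           # left child is x < edges[k]


rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10_000)
y = (x > 0.5).astype(float) + rng.normal(0, 0.1, size=x.size)
grad = 0.0 - y                                         # gradient at initial prediction 0

threshold, gain = best_histogram_split(x, grad)
print(f"Best threshold: {threshold:.3f} (true step at 0.5)")
print(f"Candidates scanned: 31 bin edges instead of {x.size - 1} sorted values")
```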

Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

# ══════════════════════════════════════
# LIGHTGBM CLASSIFIER
# ══════════════════════════════════════
lgbm = LGBMClassifier(
    n_estimators=500,           # max trees
    learning_rate=0.05,         # shrinkage
    max_depth=-1,               # -1 = unlimited (leaf-wise growth handles this)
    num_leaves=31,              # key parameter: max leaves per tree (keep below 2^max_depth when max_depth is set)
    min_child_samples=20,       # min samples per leaf
    colsample_bytree=0.8,       # column subsampling per tree (alias: feature_fraction)
    subsample=0.8,              # row subsampling (alias: bagging_fraction)
    subsample_freq=5,           # apply row subsampling every 5 iterations (alias: bagging_freq)
    reg_alpha=0.1,              # L1 regularization
    reg_lambda=0.1,             # L2 regularization
    subsample_for_bin=200000,   # samples for constructing histograms
    class_weight='balanced',    # handle class imbalance
    random_state=42,
    n_jobs=-1,
    verbose=-1                  # suppress verbose output
)

# LightGBM's native early stopping via callbacks
callbacks = [
    lgb.early_stopping(stopping_rounds=50, verbose=True),
    lgb.log_evaluation(period=100)
]

lgbm.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=callbacks
)

y_pred  = lgbm.predict(X_test)
y_proba = lgbm.predict_proba(X_test)[:, 1]
print(f"\nLightGBM — Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"LightGBM — ROC-AUC:  {roc_auc_score(y_test, y_proba):.4f}")
print(f"Best iteration: {lgbm.best_iteration_}")

# ── Speed Benchmark: LightGBM vs XGBoost vs sklearn GBM ─────────
import time
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb

from sklearn.datasets import make_classification
X_large, y_large = make_classification(n_samples=50000, n_features=30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_large, y_large, test_size=0.2, random_state=42)

models = {
    'sklearn GBM':  GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=42),
    'XGBoost':      xgb.XGBClassifier(n_estimators=100, max_depth=4,
                                       eval_metric='logloss', n_jobs=-1, random_state=42),
    'LightGBM':     LGBMClassifier(n_estimators=100, num_leaves=31, n_jobs=-1, 
                                    random_state=42, verbose=-1)
}

print("\nSpeed Benchmark (n=50,000, p=30):")
print(f"{'Model':15} {'Train Time':12} {'Accuracy':10}")
print("-" * 40)
for name, model in models.items():
    start = time.time()
    model.fit(X_tr, y_tr)
    elapsed = time.time() - start
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"  {name:15} {elapsed:8.3f}s    {acc:.4f}")
# LightGBM is usually much faster than sklearn's GradientBoostingClassifier here; the exact speedup depends on data size and hardware

# ── num_leaves — the most important LightGBM param ──────────────
print("\nnum_leaves tuning (controls model complexity):")
for nl in [7, 15, 31, 63, 127]:
    m = LGBMClassifier(n_estimators=100, num_leaves=nl, random_state=42, verbose=-1)
    cv = cross_val_score(m, X_train, y_train, cv=5, scoring='accuracy')
    print(f"  num_leaves={nl:3d}: CV={cv.mean():.4f} ± {cv.std():.4f}")

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain LightGBM and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 57 — CatBoost

CatBoost — Native Categorical Handling & Ordered Boosting

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

CatBoost's Key Innovations

Ordered Boosting: Uses a permutation-based approach to avoid target leakage during training — each object is predicted using only models trained on previous objects in a random order.
Native Categorical Features: Automatically handles categorical features using statistics from the target (similar to target encoding) without manual preprocessing.
Symmetric Trees: All splits at the same depth use the same splitting criterion — faster prediction and more regularized.
No Feature Scaling Needed: Works with raw features without normalization.
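
The first two ideas share one mechanism: a row's categorical value is encoded using target statistics from rows that come before it in a random order, so its own label never leaks into its feature. The sketch below is a simplified illustration of that ordered target statistic, $(\text{sum of earlier targets} + a \cdot \text{prior}) / (\text{count of earlier rows} + a)$. The function name, the smoothing weight $a$, and the toy data are assumptions, and CatBoost's actual implementation differs in detail.

```python
import numpy as np
import pandas as pd


def ordered_target_stat(cat, y, prior, a=1.0, order=None, seed=0):
    """Encode a categorical column using only targets of earlier rows in `order`."""
    cat = np.asarray(cat)
    y = np.asarray(y, dtype=float)
    if order is None:
        order = np.random.default_rng(seed).permutation(len(y))

    df = pd.DataFrame({"cat": cat[order], "y": y[order]})
    grp = df.groupby("cat")["y"]
    prev_sum = grp.cumsum() - df["y"]      # target sum over earlier rows of the same category
    prev_cnt = grp.cumcount()              # number of earlier rows of the same category
    enc_in_order = ((prev_sum + a * prior) / (prev_cnt + a)).to_numpy()

    enc = np.empty(len(y))
    enc[order] = enc_in_order              # map back to the original row positions
    return enc


cats = ["a", "b", "a", "a", "b"]
target = [1, 0, 0, 1, 1]

# Identity order keeps the arithmetic easy to follow by hand
encoded = ordered_target_stat(cats, target, prior=0.5, order=np.arange(5))
print(encoded)
```

With the identity order, the first "a" and first "b" rows see no history and receive the prior (0.5). The second "a" row sees one earlier target of 1, giving (1 + 0.5) / (1 + 1) = 0.75.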

Code Example

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from catboost import CatBoostClassifier, Pool, cv as catboost_cv

# ══════════════════════════════════════
# CATBOOST WITH CATEGORICAL FEATURES
# (No manual encoding needed!)
# ══════════════════════════════════════
df = pd.read_csv('titanic.csv')
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
cat_features = ['Sex', 'Embarked', 'Pclass']

X = df[features].copy()
y = df['Survived']

mask = y.notna()
X, y = X[mask], y[mask]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

# Fill missing values with training-set statistics
# (CatBoost can also handle NaN in numeric features natively)
num_medians = X_train.median(numeric_only=True)
X_train = X_train.fillna(num_medians)
X_test  = X_test.fillna(num_medians)
X_train['Embarked'] = X_train['Embarked'].fillna('S')
X_test['Embarked']  = X_test['Embarked'].fillna('S')

# CatBoost can receive category names as strings — no encoding needed!
cat_feature_indices = [X_train.columns.tolist().index(col) for col in cat_features]

cb = CatBoostClassifier(
    iterations=500,          # n_estimators
    learning_rate=0.05,
    depth=6,                 # tree depth (symmetric trees)
    l2_leaf_reg=3.0,         # L2 regularization
    border_count=254,        # number of bins for numeric features
    bagging_temperature=1.0, # controls Bayesian bootstrap
    random_strength=1,       # adds randomness to split selection
    cat_features=cat_feature_indices,  # indices of categorical features
    eval_metric='AUC',
    random_seed=42,
    verbose=100,
    early_stopping_rounds=50
)

cb.fit(
    X_train, y_train,
    eval_set=(X_test, y_test)
)

y_pred  = cb.predict(X_test)
y_proba = cb.predict_proba(X_test)[:, 1]
print(f"\nCatBoost — Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"CatBoost — ROC-AUC:  {roc_auc_score(y_test, y_proba):.4f}")

# ── Feature Importance ───────────────────────────────────────────
import matplotlib.pyplot as plt

fi = pd.Series(cb.feature_importances_, index=X_train.columns)
fi.sort_values(ascending=False).plot(kind='barh', color='#d4af37', alpha=0.8, figsize=(8,4))
plt.title('CatBoost Feature Importances')
plt.tight_layout()
plt.show()

# ── SHAP Values for model explanation ───────────────────────────
try:
    import shap
    explainer = shap.TreeExplainer(cb)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test, plot_type='bar')
    shap.summary_plot(shap_values, X_test)
except ImportError:
    print("Install shap: pip install shap")

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain CatBoost and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 58 — Stacking

Stacking & Blending — Meta-Learner Ensembles

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Stacking Architecture

LR (Level 0)

RF (Level 0)

XGBoost (Level 0)

SVM (Level 0)

↓ predictions as new features ↓

Meta-Learner: Logistic Regression (Level 1)

↓

Final Prediction
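
The arrow labelled "predictions as new features" hides the key detail: the meta-learner must be trained on out-of-fold predictions, otherwise it sees base-model outputs for rows those models were fitted on. A minimal manual sketch of that step follows, assuming two base models and the breast cancer dataset purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

base_models = [
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    RandomForestClassifier(n_estimators=100, random_state=42),
]

# Level 0: out-of-fold probabilities, each row predicted by a model that never saw it
oof_features = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Level 1: the meta-learner is trained on the out-of-fold predictions
meta = LogisticRegression().fit(oof_features, y_train)

# At prediction time the base models are refit on the full training set
test_features = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])
stack_acc = accuracy_score(y_test, meta.predict(test_features))
print(f"Manual stacking accuracy: {stack_acc:.4f}")
```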

Code Example

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

# ══════════════════════════════════════
# SKLEARN StackingClassifier
# Uses cross-val predictions to avoid data leakage
# ══════════════════════════════════════
base_learners = [
    ('lr',  make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, C=1.0))),
    ('rf',  RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)),
    ('svm', make_pipeline(StandardScaler(), SVC(probability=True, kernel='rbf', C=1.0))),
    ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ('gbm', GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42))
]

meta_learner = LogisticRegression(C=1.0, max_iter=1000)

stacking = StackingClassifier(
    estimators=base_learners,
    final_estimator=meta_learner,
    cv=5,                         # 5-fold CV for out-of-fold predictions
    stack_method='predict_proba', # pass probabilities to meta-learner
    n_jobs=-1,
    passthrough=False             # True = also pass original features to meta-learner
)
stacking.fit(X_train, y_train)

print("Stacking Ensemble:")
print(f"  Accuracy: {accuracy_score(y_test, stacking.predict(X_test)):.4f}")
print(f"  ROC-AUC:  {roc_auc_score(y_test, stacking.predict_proba(X_test)[:,1]):.4f}")

# ── Compare base learners vs stack ──────────────────────────────
print("\nIndividual vs Stacked Performance:")
print(f"{'Model':30} {'CV Accuracy':15}")
print("-" * 48)
for name, model in base_learners:
    cv = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f"  {name:30} {cv.mean():.4f} ± {cv.std():.4f}")

cv_stack = cross_val_score(stacking, X_train, y_train, cv=5, scoring='accuracy')
print(f"  {'STACKING (all combined)':30} {cv_stack.mean():.4f} ± {cv_stack.std():.4f}")

# ══════════════════════════════════════
# BLENDING — Simpler than stacking
# Use a holdout set instead of cross-validation
# ══════════════════════════════════════
X_tr_bl, X_hold, y_tr_bl, y_hold = train_test_split(X_train, y_train, 
                                                      test_size=0.2, random_state=42)

# Train base models on X_tr_bl, then collect their predictions on the holdout and test sets
blend_predictions_hold = np.zeros((len(X_hold), len(base_learners)))
blend_predictions_test = np.zeros((len(X_test), len(base_learners)))

# No separate scaler is needed: the LR, SVM, and KNN entries are pipelines that scale internally
for i, (name, model) in enumerate(base_learners):
    model.fit(X_tr_bl, y_tr_bl)
    blend_predictions_hold[:, i] = model.predict_proba(X_hold)[:, 1]
    blend_predictions_test[:, i] = model.predict_proba(X_test)[:, 1]

# Train meta-learner on holdout predictions
meta_blend = LogisticRegression(C=1.0)
meta_blend.fit(blend_predictions_hold, y_hold)
y_blend_pred = meta_blend.predict(blend_predictions_test)
print(f"\nBlending Accuracy: {accuracy_score(y_test, y_blend_pred):.4f}")
print(f"Meta-learner weights: {meta_blend.coef_[0].round(3)}")
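
To make the `cv=5` argument of StackingClassifier less of a black box, the sketch below builds out-of-fold meta-features by hand with `cross_val_predict`. It is an illustration of the idea, not scikit-learn's internal code. The two-model `demo_learners` list and the variable names are hypothetical, chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical two-model base layer, used only for this illustration.
demo_learners = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
]

# Out-of-fold (OOF) probabilities: each training row is scored by a model
# that did not see that row while fitting.
oof_features = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for _, model in demo_learners
])

# Test-set meta-features come from base models refit on all training rows.
test_features = np.column_stack([
    model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for _, model in demo_learners
])

# The meta-learner only ever sees base-model probabilities, never raw features.
meta = LogisticRegression(C=1.0, max_iter=1000)
meta.fit(oof_features, y_train)
stack_accuracy = meta.score(test_features, y_test)
print(f"Manual OOF stacking accuracy: {stack_accuracy:.4f}")
```

Blending differs only in where the meta-features come from: a single holdout split replaces the five folds, so the meta-learner trains on fewer rows.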

Common mistakes

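Training the meta-learner on in-sample base-model predictions instead of out-of-fold or holdout predictions, which leaks the training labels.
Stacking near-identical base models: the meta-learner gains little when the base predictions are highly correlated.
Skipping validation: compare the stack against its best base learner on the same CV splits before accepting the extra complexity.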

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
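
The leakage failure mode can be reproduced on pure noise. The sketch below is an assumption-laden toy, not part of the original lesson: it uses synthetic random features and SelectKBest as the leaky step. It compares feature selection done before cross-validation with the same selection refit inside each fold through a pipeline.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # pure noise: no feature is predictive
y_noise = np.repeat([0, 1], 50)          # balanced labels unrelated to X_noise

# Leaky: features are selected using ALL labels before cross-validation,
# so every validation fold has already influenced the feature set.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=5
).mean()

# Honest: selection is refit inside every training fold via a pipeline.
honest_model = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(honest_model, X_noise, y_noise, cv=5).mean()

print(f"Leaky CV accuracy:  {leaky_score:.3f}")
print(f"Honest CV accuracy: {honest_score:.3f}")
```

Because the labels carry no signal, the honest score should sit near chance, while the leaky score looks far better than it has any right to.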

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain stacking and blending and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 59 — Voting Classifier

Voting Classifiers — Hard and Soft Voting

Why this matters

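Voting is the simplest way to combine different models: each model predicts, and the ensemble returns either the majority class (hard voting) or the class with the highest averaged probability (soft voting). It is a common interview topic and a cheap first ensemble to try before stacking.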

Code Example

import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

# ══════════════════════════════════════
# HARD VOTING — Majority class wins
# ══════════════════════════════════════
hard_vote = VotingClassifier(
    estimators=[
        ('lr',  make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ('rf',  RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gbm', GradientBoostingClassifier(n_estimators=100, random_state=42)),
        ('svm', make_pipeline(StandardScaler(), SVC(kernel='rbf', probability=False)))
    ],
    voting='hard'
)
hard_vote.fit(X_train, y_train)
print(f"Hard Voting Accuracy: {accuracy_score(y_test, hard_vote.predict(X_test)):.4f}")

# ══════════════════════════════════════
# SOFT VOTING: average predicted probabilities
# Often beats hard voting when the probabilities are reasonably calibrated
# Requires probability estimates from all models
# ══════════════════════════════════════
soft_vote = VotingClassifier(
    estimators=[
        ('lr',  make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ('rf',  RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gbm', GradientBoostingClassifier(n_estimators=100, random_state=42)),
        ('svm', make_pipeline(StandardScaler(), SVC(kernel='rbf', probability=True)))
    ],
    voting='soft'
)
soft_vote.fit(X_train, y_train)
print(f"Soft Voting Accuracy: {accuracy_score(y_test, soft_vote.predict(X_test)):.4f}")

# ══════════════════════════════════════
# WEIGHTED SOFT VOTING
# Give more weight to stronger models
# ══════════════════════════════════════
weighted_vote = VotingClassifier(
    estimators=[
        ('lr',  make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ('rf',  RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gbm', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ],
    voting='soft',
    weights=[1, 2, 3]   # GBM gets 3x weight; assumes it is the strongest model, so verify with CV
)
weighted_vote.fit(X_train, y_train)
print(f"Weighted Soft Voting Accuracy: {accuracy_score(y_test, weighted_vote.predict(X_test)):.4f}")

# ── Find optimal weights via CV ──────────────────────────────────
from itertools import product
best_score = 0
best_weights = None

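# All weights are >= 1, so no all-zero combination needs to be skipped.
# Only ratios matter: (1, 1, 1), (2, 2, 2) and (3, 3, 3) are equivalent.
for w1, w2, w3 in product([1, 2, 3], repeat=3):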
    wv = VotingClassifier(
        estimators=[
            ('lr',  make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))),
            ('rf',  RandomForestClassifier(n_estimators=50, random_state=42)),
            ('gbm', GradientBoostingClassifier(n_estimators=50, random_state=42)),
        ],
        voting='soft',
        weights=[w1, w2, w3]
    )
    score = cross_val_score(wv, X_train, y_train, cv=5, scoring='accuracy').mean()
    if score > best_score:
        best_score = score
        best_weights = [w1, w2, w3]

print(f"\nOptimal weights: {best_weights}, CV Score: {best_score:.4f}")
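
Hard and soft voting can disagree on the same inputs. The toy calculation below uses made-up probabilities from three hypothetical classifiers (the numbers are invented for illustration) and reproduces each rule with plain NumPy. For a binary problem, thresholding the averaged class-1 probability at 0.5 matches picking the class with the highest averaged probability.

```python
import numpy as np

# Hypothetical P(class 1) from three classifiers (columns) for four samples (rows).
proba_class1 = np.array([
    [0.90, 0.40, 0.45],   # one confident "yes", two weak "no" votes
    [0.60, 0.55, 0.10],   # two weak "yes" votes, one confident "no"
    [0.20, 0.30, 0.40],   # all models say "no"
    [0.51, 0.52, 0.95],   # all models say "yes"
])

# Hard voting: each model casts a 0/1 vote at the 0.5 threshold; majority wins.
votes = (proba_class1 >= 0.5).astype(int)
hard_pred = (votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities, then threshold.
soft_pred = (proba_class1.mean(axis=1) >= 0.5).astype(int)

# Weighted soft voting: weighted average, analogous to weights=[1, 2, 3].
weights = np.array([1, 2, 3])
weighted_avg = proba_class1 @ weights / weights.sum()
weighted_pred = (weighted_avg >= 0.5).astype(int)

print("hard:    ", hard_pred.tolist())      # [0, 1, 0, 1]
print("soft:    ", soft_pred.tolist())      # [1, 0, 0, 1]
print("weighted:", weighted_pred.tolist())  # [1, 0, 0, 1]
```

The first two rows flip between the rules because soft voting lets one confident model outweigh two lukewarm ones, which helps only if that confidence is trustworthy.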

Common mistakes

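Using voting='soft' with a model that has no predict_proba (for example SVC with probability=False), or averaging poorly calibrated probabilities.
Choosing the weights on the test set: tune them with cross-validation on the training data only.
Combining models that make the same errors: voting helps most when the base models disagree.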

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain voting classifiers and when they apply.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 60 — Algorithm Comparison

Algorithm Selection Framework — Which Algorithm for Which Problem?

Why this matters

Picking a reasonable first algorithm for the data at hand (size, sparsity, feature types, latency and interpretability constraints) saves tuning time, and justifying that choice is a frequent interview question.

The Master Decision Guide

Scenario	First Try	If Still Needs Work	Avoid
Tabular data, <10k rows, classification	Logistic Regression + Random Forest	XGBoost, LightGBM	Deep Learning (overkill)
Tabular data, >100k rows	LightGBM, XGBoost	CatBoost, Stacking	KNN (too slow), SVM (scaling issues)
High-cardinality categoricals	CatBoost, LightGBM (native categorical support)	Target encoding + XGBoost	One-hot encoding + LR (too many features)
High-dimensional sparse data (text)	Naive Bayes, Logistic Regression	Linear SVM	Random Forest (slow on sparse), KNN
Small dataset (<1k samples)	SVM (RBF), Logistic Regression	Decision Tree with pruning	Deep Learning (insufficient data)
Need probability estimates	Logistic Regression, Random Forest	Calibrated SVM, GBM	Hard-margin SVM, uncalibrated models
Need full interpretability	Logistic Regression, Decision Tree	XGBoost + SHAP values	Black-box ensembles in regulated domains
Fast inference needed	Logistic Regression, Decision Tree	Random Forest (parallel)	KNN (stores all data), SVM on large datasets
Noisy, many irrelevant features	Random Forest, Lasso	Feature selection + any model	KNN (cursed by irrelevant dims)
Regression, continuous target	Linear Regression, Ridge/Lasso	XGBoost/LightGBM regressor	Logistic Regression
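
The "Calibrated SVM" entry in the table can be sketched with scikit-learn's CalibratedClassifierCV. This is a minimal example under stated assumptions: the synthetic dataset, the choice of LinearSVC, and sigmoid calibration are illustrative picks, not recommendations from the guide.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# LinearSVC exposes decision_function scores but no predict_proba.
svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))

# Sigmoid calibration maps the SVM scores to probabilities; with cv=5 the
# mapping is fit on folds the SVM was not trained on.
calibrated_svm = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
calibrated_svm.fit(X_train, y_train)

proba = calibrated_svm.predict_proba(X_test)
brier = brier_score_loss(y_test, proba[:, 1])   # lower is better
print(f"Brier score of calibrated SVM: {brier:.4f}")
```

The same wrapper can be applied to other estimators whose raw scores are not meant to be read as probabilities.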

Algorithm Complexity Summary

Algorithm	Train Time	Predict Time (per sample)	Model Memory	Scales to Big Data
Linear/Logistic Regression	O(mnT), fast	O(n)	O(n)	✅ Very well (SGD)
KNN (brute force)	O(1), lazy	O(mn), slow	O(mn)	❌ Very poorly
Naive Bayes	O(mn), fast	O(nk)	O(nk)	✅ Very well
SVM (kernel)	O(m²n) to O(m³n), slow	O(m_sv × n)	O(m_sv × n)	❌ Poor (>100k)
Decision Tree	O(mn log m)	O(depth)	O(m)	⚠️ Moderate
Random Forest	O(B × mn log m)	O(B × depth)	O(B × m)	✅ Good (parallel)
XGBoost/LightGBM	O(B × mn)	O(B × depth)	O(B × leaves)	✅ Excellent

m = samples, n = features, k = classes, T = gradient-descent iterations, B = trees, m_sv = support vectors, leaves = leaves per tree. All entries are rough orders of magnitude, not exact bounds.

The Practical Rule of Thumb:
Always start with a baseline: DummyClassifier or simple heuristic. Know your floor.
Quickly try Logistic Regression (with scaling) — often surprises you with strong performance.
Try Random Forest — usually strong, requires no scaling, provides feature importances.
If you need top performance: XGBoost or LightGBM with proper hyperparameter tuning.
Stacking for final 0.5-2% gain in competitive settings (Kaggle).
Use SHAP values to explain any model's predictions for stakeholders.

Code Example

import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# ══════════════════════════════════════
# COMPLETE ALGORITHM COMPARISON FRAMEWORK
# ══════════════════════════════════════
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15,
                            n_redundant=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Baseline (Most Frequent)':  DummyClassifier(strategy='most_frequent'),
    'Naive Bayes':               GaussianNB(),
    'Logistic Regression':       make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, C=1)),
    'Decision Tree (d=5)':       DecisionTreeClassifier(max_depth=5, random_state=42),
    'KNN (k=7)':                 make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)),
    'SVM (RBF)':                 make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, probability=True)),
    'Random Forest':             RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting':         GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42),
}

# Add XGBoost and LightGBM only if they are installed
try:
    from xgboost import XGBClassifier
    # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
    models['XGBoost'] = XGBClassifier(n_estimators=100, max_depth=4,
                                       eval_metric='logloss', random_state=42, n_jobs=-1)
except ImportError:
    pass  # xgboost not installed; skip it

try:
    from lightgbm import LGBMClassifier
    models['LightGBM'] = LGBMClassifier(n_estimators=100, num_leaves=31, random_state=42,
                                         n_jobs=-1, verbose=-1)
except ImportError:
    pass  # lightgbm not installed; skip it

print("="*70)
print(f"{'Algorithm':35} {'CV Acc':>10} {'Test Acc':>10} {'CV Time':>10}")
print("="*70)

results = {}
for name, model in models.items():
    # Time the full 5-fold cross-validation (five fits plus scoring), not a single fit
    start = time.time()
    cv_score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)
    elapsed_cv = time.time() - start

    # Refit on the full training set and score once on the held-out test set
    model.fit(X_train, y_train)
    test_score = accuracy_score(y_test, model.predict(X_test))
    results[name] = {'cv': cv_score.mean(), 'test': test_score}

    print(f"{name:35} {cv_score.mean():>10.4f} {test_score:>10.4f} {elapsed_cv:>9.2f}s")

# ── Rank models by CV accuracy; the test set is only for the final estimate ──
best = sorted(results.items(), key=lambda x: x[1]['cv'], reverse=True)
print(f"\n{'='*40}")
print("RANKING (by CV Accuracy):")
for rank, (name, scores) in enumerate(best, 1):
    print(f"  {rank}. {name}: CV {scores['cv']:.4f} | Test {scores['test']:.4f}")

Module 4 Complete — You're Now a Supervised ML Expert!

You've covered the major supervised learning algorithms from first principles. Next come discovering structure in data without labels (Module 5: Unsupervised Learning) and evaluating and tuning models properly (Module 6: Evaluation & Tuning). The gradient-boosting libraries covered here, especially XGBoost and LightGBM, are widely used in Kaggle competitions and in production systems built on tabular data.

Common mistakes

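Skipping the baseline, which leaves no reference point for judging whether a complex model earns its cost.
Picking the winner by test accuracy or from a single split: use cross-validation for selection and keep the test set for the final estimate.
Forgetting to scale features for distance-based and margin-based models (KNN, SVM, regularized Logistic Regression); tree-based models do not need it.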

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain the algorithm selection framework and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Module 5: Unsupervised Learning
