
Module 3: Data Preprocessing & Feature Engineering

100 Days of ML Module 3 — Master Data Preprocessing and Feature Engineering: imputation, scaling, encoding, pipelines, class imbalance, SMOTE, and feature selection.

⏱ 80 Min Read · Updated: May 2026

Raw data is almost never in a form that algorithms can consume. Preprocessing transforms raw data into clean, consistent, properly scaled, and properly encoded representations. This module bridges EDA (understanding data) and modeling (learning from data). Poor preprocessing leads to poor models regardless of algorithm sophistication.

Why Preprocessing Matters — The Pipeline Overview

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

The Preprocessing Pipeline

📥 Raw Data (missing values, wrong types, outliers)
🔍 Handling Missing Values (Imputation)
🏷️ Encoding Categorical Variables
📏 Feature Scaling (Normalization/Standardization)
🔄 Feature Transformation (Log, Box-Cox)
⚖️ Handling Class Imbalance
🎯 Feature Selection
✅ Clean, Model-Ready Feature Matrix X

Why Does Preprocessing Matter So Much?

| Problem | What Happens Without Fixing It | Solution |
|---|---|---|
| Missing values | Most scikit-learn estimators raise errors on NaN; some tree-based models handle them natively, but performance can still suffer | Imputation (mean/median/KNN/MICE) |
| Unscaled features | Features with large ranges dominate distance-based algorithms (KNN, SVM, K-means), and gradient descent converges slowly | StandardScaler / MinMaxScaler |
| Categorical strings | Models require numeric input; "Male"/"Female" can't be used directly | OneHotEncoder / OrdinalEncoder |
| Skewed distributions | Linear models typically assume normally distributed residuals, which heavy skew often violates; extreme values pull the fit | Log transform, Box-Cox, Yeo-Johnson |
| Class imbalance | Model always predicts the majority class: good accuracy, terrible recall on the minority class | SMOTE, class weights, resampling |
| Irrelevant features | Curse of dimensionality; overfitting; slower training; confuses some models | Feature selection (filter/wrapper/embedded) |
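
To make the class-imbalance row concrete, here is a minimal sketch of the "good accuracy, terrible recall" effect. The 95/5 label split is a made-up assumption for illustration, and DummyClassifier stands in for a model that has collapsed to the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 950 negatives, 50 positives (5% minority class)
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# A "model" that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
minority_recall = recall_score(y, y_pred, pos_label=1)
print(f"Accuracy: {accuracy:.2f}, minority recall: {minority_recall:.2f}")
```

Accuracy looks strong only because 95% of the labels are negative; recall on the minority class shows that no positive case is ever found.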
⚠️
The Golden Rule: Fit on Train, Transform on Test

Always compute preprocessing statistics (mean, std, categories, etc.) from the training set only, then apply those same statistics to transform the test set. Fitting on test data = data leakage — your model sees future information, inflating metrics and causing poor production performance.
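
One common way to follow this rule is to wrap preprocessing and model in a scikit-learn Pipeline, so that fitting only ever sees training data. The sketch below is one possible setup, not the only one; the breast cancer dataset and LogisticRegression are stand-ins chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline refits the scaler inside every CV fold, so validation folds
# never contribute to the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

model.fit(X_train, y_train)                  # scaler statistics come from X_train only
test_accuracy = model.score(X_test, y_test)  # X_test is transformed, never fitted on
print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_accuracy:.3f}")
```

Because the scaler lives inside the pipeline, there is no way to accidentally call fit on the test set.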

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain why preprocessing matters and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 27 — SimpleImputer

SimpleImputer — Mean, Median, Mode, and Constant Strategies

Choosing the Right Imputation Strategy

| Strategy | Formula / Method | Best For | Weakness |
|---|---|---|---|
| Mean | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | Normally distributed, MCAR data | Sensitive to outliers; reduces variance |
| Median | Middle value of sorted data | Skewed distributions, data with outliers | Ignores feature relationships |
| Most Frequent (Mode) | Most common category/value | Categorical features | Can create artificial imbalance in categories |
| Constant | Replace with a fixed value (0, "Unknown", etc.) | When missing = a meaningful category (MNAR) | Needs domain knowledge to choose the right constant |
Code Example
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Load the Titanic dataset, which contains missing values
# (assumes titanic.csv is in the working directory)
df = pd.read_csv('titanic.csv')
X = df[['Age', 'Fare', 'Pclass', 'SibSp', 'Embarked', 'Sex']].copy()
y = df['Survived']

# Split FIRST — imputer stats learned from train only!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Missing values in X_train:\n{X_train.isnull().sum()}")

# ══════════════════════════════════════
# NUMERICAL IMPUTATION
# ══════════════════════════════════════

# Strategy 1: Mean imputation (good for normal distributions)
mean_imputer = SimpleImputer(strategy='mean')
X_train_num = X_train[['Age', 'Fare']].copy()
X_test_num  = X_test[['Age', 'Fare']].copy()

X_train_num_imp = mean_imputer.fit_transform(X_train_num)  # learn mean from train
X_test_num_imp  = mean_imputer.transform(X_test_num)       # apply train mean to test

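print(f"\nMean imputer statistics: {mean_imputer.statistics_}")
# On the full Titanic data the means are roughly Age ≈ 29.7 and Fare ≈ 32.2;
# values learned from the training split will differ slightly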

# Strategy 2: Median imputation (better for skewed data)
median_imputer = SimpleImputer(strategy='median')
X_age_train = X_train[['Age']].copy()
X_age_train_imp = median_imputer.fit_transform(X_age_train)
print(f"Median of Age: {median_imputer.statistics_[0]:.2f}")

# ══════════════════════════════════════
# CATEGORICAL IMPUTATION
# ══════════════════════════════════════

# Strategy 3: Most frequent (mode) for categorical
mode_imputer = SimpleImputer(strategy='most_frequent')
X_embarked_train = X_train[['Embarked']].copy()
X_embarked_train_imp = mode_imputer.fit_transform(X_embarked_train)
print(f"\nMost frequent Embarked value: {mode_imputer.statistics_[0]}")  # 'S'

# Strategy 4: Constant — fill with 'Unknown' for MNAR data
const_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
X_embarked_const = const_imputer.fit_transform(X_train[['Embarked']])

# ══════════════════════════════════════
# PANDAS ALTERNATIVES (quick and easy)
# ══════════════════════════════════════

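df_copy = df.copy()
# NOTE: for modeling, compute these statistics on the training split only;
# the full DataFrame is used here purely to show the pandas syntax.

# Mean imputation (stored in a separate column so the strategies
# below do not overwrite each other)
df_copy['Age_mean'] = df_copy['Age'].fillna(df_copy['Age'].mean())

# Median imputation (assignment instead of chained inplace=True,
# which is deprecated in recent pandas versions)
df_copy['Fare'] = df_copy['Fare'].fillna(df_copy['Fare'].median())

# Group-specific imputation: often a better approach!
# Fill age with the median age of the passenger's class (more realistic)
df_copy['Age_group'] = df_copy.groupby('Pclass')['Age'].transform(
    lambda x: x.fillna(x.median())
)
print("\nGroup-based imputation (Age by Class):")
print(df_copy.groupby('Pclass')['Age'].median())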

# Forward fill (for time-series data)
df_copy['Age_ffill'] = df['Age'].ffill()  # propagate last valid observation

# Backward fill
df_copy['Age_bfill'] = df['Age'].bfill()  # use next valid observation

# ══════════════════════════════════════
# COMPARING IMPUTATION STRATEGIES
# ══════════════════════════════════════
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

original = df['Age'].dropna()
mean_val = df['Age'].mean()
median_val = df['Age'].median()

strategies = {
    'Original (no missing)': original,
    f'Mean imputed ({mean_val:.1f})': df['Age'].fillna(mean_val),
    f'Median imputed ({median_val:.1f})': df['Age'].fillna(median_val)
}

for ax, (title, data) in zip(axes, strategies.items()):
    ax.hist(data.dropna(), bins=30, color='#d4af37', alpha=0.7, edgecolor='black')
    ax.axvline(data.mean(), color='red', linestyle='--', label=f'Mean: {data.mean():.1f}')
    ax.set_title(title, fontsize=9)
    ax.legend(fontsize=8)

plt.suptitle('Effect of Different Imputation Strategies on Age Distribution', fontweight='bold')
plt.tight_layout()
plt.show()

Recap

  • You can explain SimpleImputer's strategies and when each applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 28 — KNN Imputer

KNN Imputer & Iterative Imputer (MICE) — Advanced Imputation

Why Simple Imputation Falls Short

Mean/median imputation ignores relationships between features. For example, if Age is missing, imputing it from related features such as Pclass and Fare is usually more accurate than using the overall mean.

KNN Imputer

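Finds the $k$ nearest neighbors of a row (scikit-learn uses a NaN-aware Euclidean distance computed over the features both rows have observed) and imputes the missing value with the average of the neighbors' values, either uniform or distance-weighted. It captures relationships between features but is slow on large datasets, since distances are computed between all pairs of rows ($O(n^2)$).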
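
A tiny worked example can make the neighbor averaging visible. The 4×3 array below is illustrative toy data; with n_neighbors=2 and the default uniform weights, each missing entry becomes the mean of that column over the two closest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: two missing entries, one in column 2 and one in column 0
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)  # default weights='uniform'
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# Row 0, column 2: nearest rows are 1 and 2 -> (3 + 5) / 2 = 4.0
# Row 2, column 0: nearest rows are 1 and 3 -> (3 + 8) / 2 = 5.5
```

Observed values are left untouched; only the NaN cells are replaced.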

Iterative Imputer (MICE — Multiple Imputation by Chained Equations)

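Models each feature with missing values as a function of the other features. Starting from a simple initial fill, it trains a regression model (e.g., BayesianRidge) for each such feature in turn and repeats the cycle until the imputed values stabilize or max_iter is reached. It is widely regarded as one of the most accurate general-purpose approaches when features are strongly related. Note that scikit-learn's IterativeImputer returns a single imputation by default; true multiple imputation requires sample_posterior=True and repeated runs with different seeds.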
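
The chained-equations idea can be sketched by hand in a few lines. This is a simplified illustration, not scikit-learn's implementation: the noise-free data (b = 2a + 1), the choice of LinearRegression, and the fixed 10 rounds are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: b depends linearly on a (b = 2a + 1), no noise
a = np.linspace(0, 10, 50)
X_true = np.column_stack([a, 2 * a + 1])

X_missing = X_true.copy()
X_missing[[5, 20], 0] = np.nan   # hide two values of a
X_missing[[10, 40], 1] = np.nan  # hide two values of b
missing_mask = np.isnan(X_missing)

# Step 1: start from a simple guess (column means of the observed values)
X_filled = np.where(missing_mask, np.nanmean(X_missing, axis=0), X_missing)

# Step 2: repeatedly re-predict each column's missing entries from the others
for _ in range(10):
    for col in range(X_filled.shape[1]):
        rows_missing = missing_mask[:, col]
        if not rows_missing.any():
            continue
        other_cols = np.delete(X_filled, col, axis=1)
        reg = LinearRegression().fit(other_cols[~rows_missing],
                                     X_filled[~rows_missing, col])
        X_filled[rows_missing, col] = reg.predict(other_cols[rows_missing])

max_error = np.abs(X_filled - X_true).max()
print(f"Largest imputation error after 10 rounds: {max_error:.4f}")
```

Each round uses the previous round's imputations as predictors, which is why the estimates improve over iterations. Real data is noisy, so the recovery will not be this clean.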

Code Example
import pandas as pd
import numpy as np
# The experimental flag must be imported BEFORE IterativeImputer,
# otherwise the import below raises an ImportError
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

df = pd.read_csv('titanic.csv')

# Keep only numeric columns: these imputers need numeric input
# (categorical columns would have to be encoded first)
df_enc = df[['Age', 'Fare', 'Pclass', 'SibSp', 'Parch']].copy()

print(f"Missing before imputation:\n{df_enc.isnull().sum()}")

# ══════════════════════════════════════
# KNN IMPUTER
# ══════════════════════════════════════
# n_neighbors: number of similar samples to use
# weights: 'uniform' (equal) or 'distance' (closer = more weight)
# Note: distances are computed on raw values here, so large-range columns
# such as Fare dominate; consider scaling features before KNN imputation.
knn_imputer = KNNImputer(n_neighbors=5, weights='distance')
df_knn = pd.DataFrame(
    knn_imputer.fit_transform(df_enc),
    columns=df_enc.columns
)

print(f"\nKNN Imputed Age - mean: {df_knn['Age'].mean():.2f}, std: {df_knn['Age'].std():.2f}")
print(f"Missing after KNN: {df_knn.isnull().sum().sum()}")

# ══════════════════════════════════════
# ITERATIVE IMPUTER (MICE)
# ══════════════════════════════════════
# max_iter: how many rounds of imputation
# estimator: the regression model used for each feature
# random_state: for reproducibility

# With BayesianRidge (default, fast)
mice_imputer = IterativeImputer(
    estimator=BayesianRidge(),
    max_iter=10,
    random_state=42,
    sample_posterior=False
)
df_mice = pd.DataFrame(
    mice_imputer.fit_transform(df_enc),
    columns=df_enc.columns
)

print(f"\nMICE Imputed Age - mean: {df_mice['Age'].mean():.2f}, std: {df_mice['Age'].std():.2f}")

# With Random Forest (more powerful but slower)
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    max_iter=5,
    random_state=42
)
df_rf_imp = pd.DataFrame(
    rf_imputer.fit_transform(df_enc),
    columns=df_enc.columns
)

# ══════════════════════════════════════
# COMPARING ALL IMPUTATION METHODS
# ══════════════════════════════════════
fig, axes = plt.subplots(1, 4, figsize=(18, 4))

original_age = df['Age'].dropna()
methods = {
    'Original\n(no missing)': original_age,
    'Mean Imputed': df['Age'].fillna(df['Age'].mean()),
    'KNN Imputed': df_knn['Age'],
    'MICE Imputed': df_mice['Age']
}

for ax, (title, data) in zip(axes, methods.items()):
    ax.hist(data, bins=25, color='#d4af37', alpha=0.7, edgecolor='black', density=True)
    ax.set_title(title, fontsize=10)
    ax.set_xlabel('Age')
    stat = f"μ={data.mean():.1f}\nσ={data.std():.1f}"
    ax.text(0.95, 0.95, stat, transform=ax.transAxes, ha='right', va='top', fontsize=8)

plt.suptitle('Comparison: Imputation Methods Effect on Age Distribution', fontweight='bold')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# BEST PRACTICE: Evaluate imputation quality
# ══════════════════════════════════════
# Artificially introduce missing values and check recovery
np.random.seed(42)
df_complete = df_enc.dropna()

# Randomly remove 20% of Age values
mask = np.random.rand(len(df_complete)) < 0.2
df_masked = df_complete.copy()
df_masked.loc[mask, 'Age'] = np.nan

true_ages = df_complete.loc[mask, 'Age']

# Test different methods
results = {}
for name, imputer in [('KNN-5', KNNImputer(n_neighbors=5)),
                       ('MICE', IterativeImputer(max_iter=10, random_state=42))]:
    imputed = imputer.fit_transform(df_masked)
    imputed_df = pd.DataFrame(imputed, columns=df_masked.columns)
    predicted_ages = imputed_df.loc[mask, 'Age']
    rmse = np.sqrt(np.mean((true_ages.values - predicted_ages.values)**2))
    mae  = np.mean(np.abs(true_ages.values - predicted_ages.values))
    results[name] = {'RMSE': rmse, 'MAE': mae}

print("\nImputation Method Comparison (lower = better):")
for method, metrics in results.items():
    print(f"  {method}: RMSE={metrics['RMSE']:.3f}, MAE={metrics['MAE']:.3f}")
💡
Imputation Hierarchy

  • Simple first: mean/median imputation is fast and often works well.
  • KNN: when features are correlated and the dataset has fewer than about 50k rows.
  • MICE: when many features have complex relationships and you need the most accurate imputation.
  • For MNAR data, always add a feature_was_missing indicator column.
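
A missing-value indicator does not have to be built by hand: SimpleImputer can append one via add_indicator=True. The ages below are made-up values for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data (ages) with two missing entries
X = np.array([[25.0], [np.nan], [40.0], [np.nan], [31.0]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

# Column 0: imputed value (median of 25, 40, 31 is 31)
# Column 1: 1.0 where the value was originally missing, else 0.0
print(X_out)
```

The extra column lets a downstream model learn from the fact that a value was missing, which is exactly what matters when missingness itself carries signal.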

Recap

  • You can explain KNN Imputer and Iterative Imputer (MICE) and when each applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 29 — Feature Scaling

Feature Scaling — StandardScaler, MinMaxScaler, RobustScaler

Why Scaling Matters

Consider a dataset where Income ranges from $20,000 to $200,000 and Age ranges from 18 to 80. Without scaling, Income dominates any distance calculation: a 1-year difference in Age contributes 1 to the distance, while a $10,000 difference in Income contributes 10,000. This creates a fundamental bias in:

  • K-Nearest Neighbors: Distance = dominated by large-scale features
  • Support Vector Machines: Kernel functions are distance-sensitive
  • Gradient Descent: Elongated loss surface → oscillations → slow convergence
  • PCA: Principal components dominated by high-variance features
  • Neural Networks: Activation saturation, gradient vanishing

Note: Tree-based models (Decision Trees, Random Forests, XGBoost) do NOT require feature scaling because they split on per-feature thresholds, not distances. Gaussian Naive Bayes is also largely unaffected, since it models each feature's distribution separately.

StandardScaler (Z-score Normalization)

Transforms each feature to have zero mean and unit variance. Doesn't bound values — outliers can still be extreme (just in std units).

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ = feature mean, $\sigma$ = feature standard deviation (computed on training data)
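
As a quick check of the formula, the sketch below computes the z-score by hand and compares it with StandardScaler. The five training values and the single test value are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
X_test = np.array([[60.0]])

mu = X_train.mean(axis=0)     # 30.0
sigma = X_train.std(axis=0)   # population std (ddof=0), which StandardScaler also uses
z_manual = (X_test - mu) / sigma

scaler = StandardScaler().fit(X_train)   # statistics come from the training data only
z_sklearn = scaler.transform(X_test)

print(f"manual z = {z_manual[0, 0]:.4f}, sklearn z = {z_sklearn[0, 0]:.4f}")
```

Here $\sigma = \sqrt{200} \approx 14.14$, so the test value 60 sits about 2.12 standard deviations above the training mean.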

MinMaxScaler

Scales all values to a fixed range, typically [0, 1]. Sensitive to outliers since min/max are used.

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

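On the training data the result is always in [0, 1]; test values below the training minimum or above the training maximum fall outside this range. Use feature_range=(-1, 1) to scale to [-1, 1].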
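
The sketch below shows how unseen values can leave the [0, 1] range, and one way to handle it with the clip option. The age values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[18.0], [30.0], [80.0]])  # hypothetical ages seen in training
X_test = np.array([[16.0], [49.0], [95.0]])   # two values lie outside the training range

scaler = MinMaxScaler().fit(X_train)          # learns min=18, max=80
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.ravel())                  # first value < 0, last value > 1

# clip=True truncates transformed values to the feature_range
clipped = MinMaxScaler(clip=True).fit(X_train).transform(X_test)
print(clipped.ravel())
```

Whether clipping is appropriate depends on the model; it hides the fact that the test point lies outside anything seen during training.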

RobustScaler

Uses the median and IQR instead of mean and std — robust to outliers. Best when your data has significant outliers you want to keep.

$$x' = \frac{x - \text{median}(x)}{IQR} \quad \text{where } IQR = Q_3 - Q_1$$
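
To see the formula at work, this sketch computes the median/IQR scaling by hand and compares it with RobustScaler. The five values, including one extreme outlier, are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one extreme outlier

median = np.median(X, axis=0)                 # 3.0
q1, q3 = np.percentile(X, [25, 75], axis=0)   # 2.0 and 4.0, so IQR = 2.0
x_manual = (X - median) / (q3 - q1)

x_sklearn = RobustScaler().fit_transform(X)
print(x_manual.ravel())   # [-1.  -0.5  0.   0.5 48.5]
```

The outlier is still far away after scaling (48.5), but it did not distort the center or spread used to scale the other four points.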
Code Example
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
import matplotlib.pyplot as plt

# Create example dataset with outliers
np.random.seed(42)
normal_data = np.random.normal(50, 15, 1000)
# Add some outliers
data_with_outliers = np.concatenate([normal_data, [150, 160, 175, -20, -30]])

df = pd.DataFrame({'value': data_with_outliers})

# ══════════════════════════════════════
# APPLY ALL SCALERS
# ══════════════════════════════════════
scalers = {
    'Original': None,
    'StandardScaler\n(z-score)': StandardScaler(),
    'MinMaxScaler\n([0,1])': MinMaxScaler(),
    'RobustScaler\n(median/IQR)': RobustScaler(),
}

fig, axes = plt.subplots(1, 4, figsize=(18, 4))

for ax, (name, scaler) in zip(axes, scalers.items()):
    if scaler is None:
        values = df['value'].values
    else:
        values = scaler.fit_transform(df[['value']]).flatten()
    
    ax.hist(values, bins=40, color='#d4af37', alpha=0.7, edgecolor='black')
    ax.set_title(f'{name}\nRange: [{values.min():.1f}, {values.max():.1f}]', fontsize=9)
    ax.set_xlabel('Value')
    ax.text(0.95, 0.95, f'μ={values.mean():.2f}\nσ={values.std():.2f}',
            transform=ax.transAxes, ha='right', va='top', fontsize=8)

plt.suptitle('Effect of Scaling on Distribution (data has outliers)', fontweight='bold')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# PRACTICAL SKLEARN USAGE
# ══════════════════════════════════════
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# StandardScaler — ALWAYS fit only on training data!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns μ and σ from train
X_test_scaled  = scaler.transform(X_test)        # applies train μ and σ to test

print("After StandardScaling:")
print(f"  Training mean (should be ≈0): {X_train_scaled.mean():.6f}")
print(f"  Training std  (should be ≈1): {X_train_scaled.std():.6f}")
print(f"  Test mean (NOT 0, that's normal!): {X_test_scaled.mean():.6f}")

# Access learned statistics
print(f"\nLearned means (first 5): {scaler.mean_[:5].round(3)}")
print(f"Learned stds  (first 5): {scaler.scale_[:5].round(3)}")

# Inverse transform: get back original values
X_original = scaler.inverse_transform(X_train_scaled)
print(f"\nInverse transform check: {np.allclose(X_original, X_train)}")  # True

# ══════════════════════════════════════
# COMPARISON: Effect on KNN Performance
# ══════════════════════════════════════
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)

# Without scaling
knn.fit(X_train, y_train)
acc_unscaled = accuracy_score(y_test, knn.predict(X_test))

# With StandardScaler
knn.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn.predict(X_test_scaled))

print(f"\nKNN Accuracy WITHOUT scaling: {acc_unscaled:.4f}")
print(f"KNN Accuracy WITH StandardScaler: {acc_scaled:.4f}")
print(f"Change: {(acc_scaled - acc_unscaled)*100:+.2f} percentage points")
# The size (and even the sign) of the change depends on the dataset and split;
# gains tend to be largest when feature scales differ widely.

# ══════════════════════════════════════
# WHICH SCALER TO USE — Decision Guide
# ══════════════════════════════════════
print("""
SCALER SELECTION GUIDE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
StandardScaler:  Default choice; works for most algorithms
                 Works best on roughly symmetric features
                 (normality is not strictly required)
                 
MinMaxScaler:    When you need bounded output [0,1]
                 Good for image pixels (0-255 → 0-1)
                 Neural network inputs
                 
RobustScaler:    When you have significant outliers
                 you cannot or don't want to remove
                 
MaxAbsScaler:    When data is already centered at 0
                 Good for sparse matrices (doesn't break sparsity)
                 
None needed:     Decision Trees, Random Forests, XGBoost,
                 LightGBM, CatBoost — scale-invariant!
""")
| Scaler | Formula | Output Range | Outlier Robust? | Use When |
|---|---|---|---|---|
| StandardScaler | $(x - \mu) / \sigma$ | Unbounded (mean 0, std 1; mostly within about [-3, 3]) | No ❌ | Default; roughly symmetric features; SVM, LR, KNN |
| MinMaxScaler | $(x - x_{min}) / (x_{max} - x_{min})$ | [0, 1] on training data | No ❌ | Neural networks; image data; known bounded range |
| RobustScaler | $(x - Q_{0.5}) / IQR$ | Unbounded (median 0, IQR 1) | Yes ✅ | Data with outliers you cannot remove |
| MaxAbsScaler | $x / \max \lvert x \rvert$ | [-1, 1] on training data | No ❌ | Sparse data; data already centered at 0 |

Recap

  • You can explain feature scaling and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 30 — Encoding Categoricals

Introduction to Categorical Encoding — Nominal vs Ordinal

Types of Categorical Variables

| Type | Definition | Examples | Key Property | Encoding |
|---|---|---|---|---|
| Nominal | No natural order between categories | Color (Red, Green, Blue), Country, Gender, Embarked port | No ordering relationship | OneHotEncoder, TargetEncoder |
| Ordinal | Categories have a natural meaningful order | Education (High School < Bachelor's < Master's < PhD), Rating (Low/Medium/High) | Order matters; distances may not be equal | OrdinalEncoder with explicit mapping |
| High Cardinality | Very many unique categories (100+) | City, ZIP code, Product ID, User ID | OneHot creates too many columns | TargetEncoder, FrequencyEncoder, Hashing |
| Binary | Exactly two categories | Gender (Male/Female), Yes/No, True/False | Only one bit of information | A single 0/1 column via OrdinalEncoder or OneHotEncoder(drop='if_binary'); LabelEncoder is meant for the target |
⚠️
Never Use LabelEncoder for Nominal Features in Linear Models

Encoding [Red=0, Green=1, Blue=2] implies Blue > Green > Red — a mathematical relationship that doesn't exist! Linear models and distance-based models will treat this as a numeric ordering. Use OneHotEncoder for nominal features in these models. Tree-based models can sometimes handle ordinal integer encoding for nominals.

Recap

  • You can explain nominal vs ordinal categorical features and which encoding suits each.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 31 — OHE & Label Encoding

OneHotEncoder, LabelEncoder, and OrdinalEncoder

Code Example
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    'Color':     ['Red', 'Green', 'Blue', 'Red', 'Green'],
    'Size':      ['Small', 'Large', 'Medium', 'Large', 'Small'],
    'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
    'Survived':  [0, 1, 1, 0, 1]
})

# ══════════════════════════════════════
# 1. OneHotEncoder — for NOMINAL features
# Creates N binary columns (one per category)
# ══════════════════════════════════════
ohe = OneHotEncoder(
    sparse_output=False,       # return dense array, not sparse matrix
    handle_unknown='ignore',   # unseen test categories become all zeros (no error);
                               # with drop='first' that row looks like the baseline
    drop='first'               # drop first category to avoid multicollinearity
)

color_encoded = ohe.fit_transform(df[['Color']])
print("OneHotEncoder — Color (drop='first'):")
print(pd.DataFrame(color_encoded, columns=ohe.get_feature_names_out(['Color'])))
#    Color_Green  Color_Red
# 0          0.0        1.0   ← Red
# 1          1.0        0.0   ← Green
# 2          0.0        0.0   ← Blue (dropped — baseline category)
# ...

# All categories (no drop)
ohe_full = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
color_full = ohe_full.fit_transform(df[['Color']])
print("\nOneHotEncoder: Color (all categories):")
print(pd.DataFrame(color_full, columns=ohe_full.get_feature_names_out(['Color'])))

# ══════════════════════════════════════
# 2. Pandas get_dummies — Quick alternative
# ══════════════════════════════════════
# For multiple columns at once
dummies = pd.get_dummies(df, columns=['Color', 'Size'], drop_first=True, dtype=int)
print("\npd.get_dummies result:")
print(dummies.head())

# ══════════════════════════════════════
# 3. LabelEncoder: designed for the TARGET variable (y)
# WARNING: applied to a feature here only to show the arbitrary ordering;
# for input features use OrdinalEncoder or OneHotEncoder instead.
# ══════════════════════════════════════
le = LabelEncoder()
size_encoded = le.fit_transform(df['Size'])
print(f"\nLabelEncoder: Size: {dict(zip(le.classes_, range(len(le.classes_))))}")
print(f"Encoded: {size_encoded}")
# Large=0, Medium=1, Small=2 (alphabetical, not a meaningful order!)

# ══════════════════════════════════════
# 4. OrdinalEncoder — for ORDINAL features
# Requires explicit ordering specification!
# ══════════════════════════════════════
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]
oe = OrdinalEncoder(categories=education_order, handle_unknown='use_encoded_value', unknown_value=-1)
edu_encoded = oe.fit_transform(df[['Education']])
print("\nOrdinalEncoder: Education (explicit order):")
for orig, enc in zip(df['Education'], edu_encoded.flatten()):
    print(f"  {orig:12} → {enc:.0f}")
# High School → 0, Bachelor → 1, Master → 2, PhD → 3

# Size with explicit order
size_order = [['Small', 'Medium', 'Large']]
oe_size = OrdinalEncoder(categories=size_order)
size_ordinal = oe_size.fit_transform(df[['Size']])
print("\nOrdinalEncoder: Size (explicit Small < Medium < Large):")
print(pd.DataFrame({'Original': df['Size'], 'Encoded': size_ordinal.flatten()}))

Recap

  • You can explain OneHotEncoder, LabelEncoder, and OrdinalEncoder and when each applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 32 — Target Encoding

Target Encoding, Frequency Encoding & Binary Encoding

When a categorical feature has high cardinality (e.g., 500 unique cities), OneHotEncoding creates 500 columns — too many. These advanced techniques solve this problem.
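
To get a feel for the column counts involved, the sketch below builds a synthetic 500-city feature and compares the widths of three encodings. The data is randomly generated for illustration, and the binary-encoding width is an approximate bit count (codes 1..500), not the exact output of any particular library.

```python
import math
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cities = 500
# Hypothetical data: every city appears at least once, plus random repeats
city_ids = np.concatenate([np.arange(n_cities), rng.integers(0, n_cities, size=4500)])
cities = pd.Series(city_ids).map(lambda i: f"city_{i}")

one_hot_cols = pd.get_dummies(cities).shape[1]        # one column per city
binary_cols = math.ceil(math.log2(n_cities + 1))      # bits needed for codes 1..500
freq_encoded = cities.map(cities.value_counts(normalize=True))  # one numeric column

print(f"One-hot: {one_hot_cols} columns, binary: ~{binary_cols} columns, frequency: 1 column")
```

The trade-off is that the compact encodings lose some information: two cities with the same frequency become indistinguishable, and binary codes share bits between unrelated categories.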

Code Example
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

df = pd.read_csv('titanic.csv')
# Simulate a high-cardinality feature
df['Ticket_code'] = df['Ticket'].apply(lambda x: x.split()[0] if len(x.split()) > 1 else 'NUM')
print(f"Unique Ticket codes: {df['Ticket_code'].nunique()}")  # ~50+ unique

# ══════════════════════════════════════
# 1. FREQUENCY ENCODING
# Replace category with its frequency (proportion) in dataset
# Works well when frequency correlates with target
# ══════════════════════════════════════
def frequency_encode(df_train, df_test, column):
    freq_map = df_train[column].value_counts(normalize=True).to_dict()
    df_train_enc = df_train.copy()
    df_test_enc  = df_test.copy()
    df_train_enc[f'{column}_freq'] = df_train[column].map(freq_map)
    df_test_enc[f'{column}_freq']  = df_test[column].map(freq_map).fillna(0)
    return df_train_enc, df_test_enc

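# For illustration the map is built on the full DataFrame; in a real workflow,
# call frequency_encode() on your train/test split so the map comes from train only.
freq_map = df['Ticket_code'].value_counts(normalize=True).to_dict()
df['Ticket_code_freq'] = df['Ticket_code'].map(freq_map)

print("\nFrequency Encoding: Ticket_code (sample):")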
sample = df[['Ticket_code', 'Ticket_code_freq']].drop_duplicates().head(10)
print(sample.to_string(index=False))

# ══════════════════════════════════════
# 2. TARGET ENCODING (Mean Encoding)
# Replace category with mean of target for that category
# ⚠️  ALWAYS use cross-validation to prevent target leakage!
# ══════════════════════════════════════
def target_encode_cv(df, feature, target, n_folds=5, smoothing=20):
    """
    Proper target encoding with cross-validation to prevent leakage.
    Smoothing regularizes rare categories toward the global mean.
    """
    df_encoded = df.copy()
    enc_col = f'{feature}_target_enc'
    df_encoded[enc_col] = np.nan
    col_pos = df_encoded.columns.get_loc(enc_col)
    
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    
    for train_idx, val_idx in kf.split(df):
        train_fold = df.iloc[train_idx]
        val_fold   = df.iloc[val_idx]
        
        # Compute all statistics from the training fold only (including the
        # global mean, so the validation fold's targets never leak in)
        global_mean = train_fold[target].mean()
        stats = train_fold.groupby(feature)[target].agg(['mean', 'count'])
        
        # Apply smoothing: blend category mean with global mean
        # (prevents overfitting on rare categories)
        smoothed = (stats['count'] * stats['mean'] + smoothing * global_mean) / \
                   (stats['count'] + smoothing)
        
        # Apply to validation fold. kf.split yields POSITIONS, so use iloc;
        # loc would misalign whenever the index is not 0..n-1 (e.g., after dropna)
        df_encoded.iloc[val_idx, col_pos] = \
            val_fold[feature].map(smoothed).fillna(global_mean).to_numpy()
    
    return df_encoded

df_with_target_enc = target_encode_cv(df.dropna(subset=['Survived']), 
                                       'Embarked', 'Survived')

print("\nTarget Encoding — Embarked vs Survived rate:")
comp = df_with_target_enc.groupby('Embarked').agg(
    survival_rate=('Survived', 'mean'),
    target_encoded=('Embarked_target_enc', 'mean')
).round(3)
print(comp)

# ══════════════════════════════════════
# 3. BINARY ENCODING
# Converts category to integer then to binary bits
# Fewer columns than OHE for high cardinality features
# ══════════════════════════════════════
# pip install category_encoders
try:
    import category_encoders as ce
    
    # Binary encoding
    be = ce.BinaryEncoder(cols=['Embarked', 'Ticket_code'])
    df_binary = be.fit_transform(df[['Embarked', 'Ticket_code']].dropna())
    print("\nBinary Encoding — Embarked (3 categories → 2 binary columns):")
    print(df_binary[['Embarked_0', 'Embarked_1']].drop_duplicates().head(5))
    
    # Hashing encoder — fixed number of columns regardless of cardinality!
    he = ce.HashingEncoder(cols=['Ticket_code'], n_components=8)  # 8 hash columns
    df_hash = he.fit_transform(df[['Ticket_code']].dropna())
    print(f"\nHashing Encoder output shape: {df_hash.shape}")  # always (n, 8)
    
    # Leave-One-Out encoding — another leakage-safe target encoding
    loo = ce.LeaveOneOutEncoder(cols=['Embarked'])
    df_loo = loo.fit_transform(df[['Embarked']].dropna(), 
                               df.loc[df['Embarked'].notna(), 'Survived'])
    print("\nLeaveOneOut Encoding — Embarked sample:")
    print(df_loo.head())

except ImportError:
    print("Install category_encoders: pip install category_encoders")
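
The smoothing step inside `target_encode_cv` can be checked by hand. The following is a minimal sketch, assuming a made-up six-row frame (the `city` and `bought` columns and the helper name `smoothed_target_mean` are hypothetical, not part of the Titanic data). It illustrates the smoothing formula only; it computes statistics on the full frame, so it is not leakage-safe and is no substitute for the cross-validated version.

```python
import pandas as pd

# Hypothetical toy data: 'city' and 'bought' are illustrative names only
toy = pd.DataFrame({
    "city":   ["A", "A", "A", "A", "B", "C"],
    "bought": [1,   1,   1,   0,   1,   0],
})

def smoothed_target_mean(df, feature, target, smoothing):
    """Blend each category's target mean with the global mean.

    Categories with few rows are pulled strongly toward the global mean;
    categories with many rows stay close to their own mean.
    """
    global_mean = df[target].mean()
    stats = df.groupby(feature)[target].agg(["mean", "count"])
    return (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )

encoded = smoothed_target_mean(toy, "city", "bought", smoothing=2)
print(encoded.round(3))
```

Category B has a raw mean of 1.0 from a single row, yet its smoothed value lands much closer to the global mean of 4/6. That is the regularizing effect the `smoothing` parameter is meant to provide.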

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain target encoding, frequency encoding, and binary encoding, and when each applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 33 — Feature Transforms

Feature Transformations — Log, Box-Cox, and Yeo-Johnson

Why this matters

Feature Transformations: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

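Linear models (linear regression, logistic regression, linear SVM) do not strictly require normally distributed features, but they are sensitive to heavy skew: a few extreme values can dominate the fit, and inference for linear regression assumes roughly normal residuals. Transformations pull skewed distributions toward an approximately symmetric, normal-like shape, which often improves these models.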

When to Transform

  • Feature is right-skewed (long tail to the right) → log, square root, or Box-Cox
  • Feature is left-skewed → square, cube, or reflect then log
  • Feature has negative values → Yeo-Johnson (handles negatives, unlike Box-Cox)

Box-Cox Transformation:

$$y(\lambda) = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases}$$

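$\lambda$ is estimated from the data by maximum likelihood (this is what sklearn's PowerTransformer does), so that the transformed values look as close to normal as possible. Box-Cox requires all values to be strictly positive (> 0).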
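
The Box-Cox formula is short enough to write out directly. The sketch below is an illustration, not a replacement for library code: the helper name `box_cox` and the synthetic log-normal sample are assumptions made for the example. It applies the formula by hand and compares the result with `scipy.stats.boxcox`, which also returns the fitted $\lambda$.

```python
import numpy as np
from scipy import stats

def box_cox(x, lam):
    """Apply the Box-Cox formula for a given lambda (x must be > 0)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

# Synthetic right-skewed, strictly positive sample (illustrative only)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# With no lambda supplied, scipy fits it by maximum likelihood
scipy_values, fitted_lambda = stats.boxcox(sample)
manual_values = box_cox(sample, fitted_lambda)

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skew before: {stats.skew(sample):.3f}, after: {stats.skew(manual_values):.3f}")
print("manual matches scipy:", np.allclose(manual_values, scipy_values))
```

For log-normal data the fitted $\lambda$ should come out near 0, which is the case where Box-Cox reduces to a plain log transform.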

Yeo-Johnson Transformation (extends Box-Cox to negatives):

$$\psi(x, \lambda) = \begin{cases} \frac{(x+1)^\lambda - 1}{\lambda} & x \geq 0, \lambda \neq 0 \\ \ln(x+1) & x \geq 0, \lambda = 0 \\ \frac{-((-x+1)^{2-\lambda} - 1)}{2-\lambda} & x < 0, \lambda \neq 2 \\ -\ln(-x+1) & x < 0, \lambda = 2 \end{cases}$$
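
The four Yeo-Johnson cases map directly onto a small function. This is a sketch under stated assumptions: the helper name `yeo_johnson` and the five test values are invented for illustration, and the comparison uses `scipy.stats.yeojohnson` with a fixed `lmbda` rather than a fitted one.

```python
import numpy as np
from scipy import stats

def yeo_johnson(x, lam):
    """Apply the Yeo-Johnson formula case by case for a given lambda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Non-negative inputs: Box-Cox applied to (x + 1)
    if lam != 0:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(x[pos])

    # Negative inputs: mirrored Box-Cox with exponent (2 - lambda)
    if lam != 2:
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

values = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])  # includes zero and negatives
for lam in (0.0, 0.5, 1.0, 2.0):
    ours = yeo_johnson(values, lam)
    theirs = stats.yeojohnson(values, lmbda=lam)
    print(f"lambda={lam}: matches scipy = {np.allclose(ours, theirs)}")
```

Note that $\lambda = 1$ leaves the data unchanged in both branches, which is a handy sanity check on any implementation.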
Code Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
from scipy import stats

df = pd.read_csv('titanic.csv')
fare = df['Fare'].dropna()

# ══════════════════════════════════════
# MANUAL TRANSFORMS
# ══════════════════════════════════════
transforms = {
    'Original': fare,
    'Log (log1p)': np.log1p(fare),        # log(x+1) — handles zeros safely
    'Sqrt': np.sqrt(fare),                 # sqrt(x) — moderate skew reduction
    'Cube root': np.cbrt(fare),            # works for negative values too
    'Reciprocal': 1 / (fare + 1),         # very aggressive, flips direction
}

fig, axes = plt.subplots(1, 5, figsize=(22, 4))
for ax, (name, data) in zip(axes, transforms.items()):
    skew = data.skew()
    ax.hist(data, bins=30, color='#d4af37', alpha=0.7, edgecolor='black')
    ax.set_title(f'{name}\nSkewness: {skew:.3f}', fontsize=9)

plt.suptitle('Effect of Transformations on Fare Distribution', fontweight='bold')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# SKLEARN PowerTransformer
# ══════════════════════════════════════
from sklearn.model_selection import train_test_split

fare_df = df[['Fare', 'Age']].dropna()

# Box-Cox transform (requires strictly positive values)
# Fare can be 0 in the Titanic data, so shift by 1 before fitting
fare_shifted = fare_df[['Fare']] + 1
pt_boxcox = PowerTransformer(method='box-cox', standardize=True)
fare_boxcox = pt_boxcox.fit_transform(fare_shifted)
print(f"Box-Cox lambda (Fare + 1): {pt_boxcox.lambdas_[0]:.4f}")
print(f"Before transform — skew: {fare_df['Fare'].skew():.3f}")
print(f"After Box-Cox   — skew: {pd.Series(fare_boxcox.flatten()).skew():.3f}")

# Yeo-Johnson transform (handles zeros and negatives)
pt_yj = PowerTransformer(method='yeo-johnson', standardize=True)
combined_yj = pt_yj.fit_transform(fare_df)
print(f"\nYeo-Johnson lambdas: Fare={pt_yj.lambdas_[0]:.4f}, Age={pt_yj.lambdas_[1]:.4f}")

# ══════════════════════════════════════
# QUANTILE TRANSFORMER
# Maps data to uniform or normal distribution
# ══════════════════════════════════════
qt_normal  = QuantileTransformer(output_distribution='normal',  random_state=42, n_quantiles=100)
qt_uniform = QuantileTransformer(output_distribution='uniform', random_state=42, n_quantiles=100)

fare_normal  = qt_normal.fit_transform(fare_df[['Fare']])
fare_uniform = qt_uniform.fit_transform(fare_df[['Fare']])

fig, axes = plt.subplots(1, 4, figsize=(18, 4))
datasets = [
    (fare_df['Fare'], 'Original Fare'),
    (fare_boxcox.flatten(), 'Box-Cox'),
    (fare_normal.flatten(), 'Quantile (Normal)'),
    (fare_uniform.flatten(), 'Quantile (Uniform)')
]
for ax, (data, title) in zip(axes, datasets):
    ax.hist(data, bins=30, color='#d4af37', alpha=0.7, edgecolor='black')
    ax.set_title(f'{title}\nskew={pd.Series(data).skew():.3f}')

plt.suptitle('Comparing Transformation Methods', fontweight='bold')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# NORMALITY TESTING — Shapiro-Wilk Test
# ══════════════════════════════════════
print("\nNormality Tests (H0: data is normally distributed):")
print("p > 0.05 → no evidence against normality; p ≤ 0.05 → reject normality")
print("-" * 55)

# Shapiro-Wilk p-values become unreliable above ~5000 samples, hence the [:5000] slices
for name, data in [('Original', fare_df['Fare'].values[:5000]),
                    ('Log1p',    np.log1p(fare_df['Fare'].values[:5000])),
                    ('Box-Cox',  fare_boxcox.flatten()[:5000])]:
    stat, p = stats.shapiro(data)
    normal = "✅ Normal" if p > 0.05 else "❌ Not Normal"
    print(f"  {name:12} W={stat:.4f}, p={p:.6f}  {normal}")

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain feature transformations and when they apply.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 34 — Pipelines

Sklearn Pipelines — Preventing Data Leakage

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

A Pipeline chains multiple preprocessing steps and a final estimator into a single object. It is arguably the most important sklearn concept for production ML: it guards against preprocessing leakage, because transformers are fit only on the data passed to fit, and it makes the whole workflow deployable as one object.

Why Pipelines?

  • Prevent leakage: pipeline.fit(X_train, y_train) fits every transformer on the training data only; predict and score then reuse those fitted statistics on new data
  • One object for deployment: joblib.dump(pipeline, 'model.pkl') saves everything
  • Clean code: No manual fit/transform calls scattered throughout code
  • Cross-validation safe: cross_val_score(pipeline, X, y) correctly applies transforms per fold
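
The "cross-validation safe" claim can be inspected directly. The sketch below is one way to do it, assuming a small synthetic dataset from `make_classification`: `cross_validate(..., return_estimator=True)` keeps the fitted pipeline from every fold, so each fold's scaler mean can be compared with the mean of that fold's training rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to illustrate per-fold fitting
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# A fixed random_state makes cv.split reproduce the same folds on every call
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(pipe, X, y, cv=cv, return_estimator=True)

full_data_mean = X.mean(axis=0)
fold_checks = []
for fold, ((train_idx, _), fitted) in enumerate(zip(cv.split(X), results["estimator"])):
    fold_mean = fitted.named_steps["scaler"].mean_
    # True when the scaler was fit on this fold's training rows only
    fold_checks.append(np.allclose(fold_mean, X[train_idx].mean(axis=0)))
    print(f"fold {fold}: scaler mean[0] = {fold_mean[0]:+.4f} "
          f"(full data: {full_data_mean[0]:+.4f})")

print("every scaler fit on its own training fold:", all(fold_checks))
```

Scaling the full dataset once before calling `cross_val_score` would instead give every fold the same full-data mean, which is exactly the leak a pipeline avoids.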
Code Example
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, classification_report
import joblib

df = pd.read_csv('titanic.csv')
X = df[['Age', 'Fare', 'Pclass', 'SibSp']].copy()
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# ══════════════════════════════════════
# BASIC PIPELINE — Numeric features only
# ══════════════════════════════════════
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Step 1: fill NaN with median
    ('scaler',  StandardScaler()),                  # Step 2: standardize
    ('model',   LogisticRegression(max_iter=1000))  # Step 3: train model
])

# Train — Pipeline handles all steps
numeric_pipeline.fit(X_train, y_train)

# Predict — all transforms applied automatically!
y_pred = numeric_pipeline.predict(X_test)
print(f"Pipeline Accuracy: {accuracy_score(y_test, y_pred):.4f}")

# Access individual steps
print(f"\nImputer medians: {numeric_pipeline.named_steps['imputer'].statistics_.round(2)}")
print(f"Scaler means:    {numeric_pipeline.named_steps['scaler'].mean_.round(2)}")
print(f"Model coefs:     {numeric_pipeline.named_steps['model'].coef_[0].round(3)}")

# ══════════════════════════════════════
# PIPELINE WITH CROSS-VALIDATION
# This is the CORRECT way — transforms are re-fit per fold!
# ══════════════════════════════════════
cv_scores = cross_val_score(numeric_pipeline, X_train, y_train, cv=5, scoring='accuracy')
print(f"\nCross-validation scores: {cv_scores.round(3)}")
print(f"Mean CV score: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# ══════════════════════════════════════
# SAVE AND LOAD PIPELINE (deployment)
# ══════════════════════════════════════
joblib.dump(numeric_pipeline, 'titanic_pipeline.pkl')
loaded_pipeline = joblib.load('titanic_pipeline.pkl')

# Predict on raw data — no need to preprocess manually!
new_passenger = pd.DataFrame({
    'Age': [25], 'Fare': [72.5], 'Pclass': [1], 'SibSp': [0]
})
survival_prob = loaded_pipeline.predict_proba(new_passenger)
print(f"\nNew passenger survival probability: {survival_prob[0][1]:.4f}")

# ══════════════════════════════════════
# PIPELINE WITH HYPERPARAMETER TUNING
# Access parameters using __ notation
# ══════════════════════════════════════
from sklearn.model_selection import GridSearchCV

param_grid = {
    'imputer__strategy': ['mean', 'median'],           # step_name__param_name
    'model__C': [0.01, 0.1, 1, 10],
    'model__solver': ['liblinear', 'lbfgs']
}

grid_search = GridSearchCV(numeric_pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain sklearn pipelines and when they apply.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 35 — ColumnTransformer

ColumnTransformer — Different Preprocessing per Column Type

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Real datasets have mixed feature types: some are numeric (need scaling), some are categorical (need encoding), some are ordinal. ColumnTransformer applies different transformations to different column groups — then concatenates the results.

Code Example
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

df = pd.read_csv('titanic.csv')

# Define feature groups
numeric_features  = ['Age', 'Fare', 'SibSp', 'Parch']
nominal_features  = ['Sex', 'Embarked']
ordinal_features  = ['Pclass']   # ordinal: 1st class ranks above 2nd, which ranks above 3rd

X = df[numeric_features + nominal_features + ordinal_features].copy()
y = df['Survived'].copy()

# Remove rows where target is missing
mask = y.notna()
X, y = X[mask], y[mask]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

# ══════════════════════════════════════
# BUILDING THE COLUMN TRANSFORMER
# ══════════════════════════════════════

# Pipeline for numeric features
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill with median
    ('scaler',  StandardScaler())                   # standardize
])

# Pipeline for nominal categorical features
nominal_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # fill with mode
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first'))
])

# Pipeline for ordinal features
ordinal_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=[[1, 2, 3]]))  # Pclass: 1 > 2 > 3
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('nom', nominal_transformer, nominal_features),
        ('ord', ordinal_transformer, ordinal_features)
    ],
    remainder='drop'   # drop any other columns
)

# ══════════════════════════════════════
# COMPLETE PIPELINE: Preprocessor + Model
# ══════════════════════════════════════
full_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier',    LogisticRegression(max_iter=1000, C=1.0))
])

# Train
full_pipeline.fit(X_train, y_train)

# Evaluate
y_pred = full_pipeline.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Died', 'Survived']))

# Cross-validation
cv_scores = cross_val_score(full_pipeline, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1 Score: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# ══════════════════════════════════════
# INSPECT FEATURE NAMES AFTER TRANSFORM
# ══════════════════════════════════════
preprocessor.fit(X_train)
feature_names = preprocessor.get_feature_names_out()
print("\nTransformed feature names:")
for name in feature_names:
    print(f"  {name}")

# ══════════════════════════════════════
# COMPARING MULTIPLE MODELS (same preprocessor)
# ══════════════════════════════════════
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree':       DecisionTreeClassifier(max_depth=5, random_state=42),
    'SVM':                 SVC(probability=True, random_state=42),
    'Gradient Boosting':   GradientBoostingClassifier(n_estimators=100, random_state=42)
}

print("\nModel Comparison (5-fold CV on training data):")
print("=" * 55)
for model_name, model in models.items():
    pipe = Pipeline([('prep', preprocessor), ('model', model)])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy')
    print(f"  {model_name:25}: {scores.mean():.4f} ± {scores.std():.4f}")

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain ColumnTransformer and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 36 — Class Imbalance

Class Imbalance — Understanding and Initial Solutions

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

The Problem

Class imbalance occurs when the target variable's classes are not represented equally. For example, in fraud detection, only 0.1% of transactions are fraudulent — a model that always predicts "not fraud" achieves 99.9% accuracy but is useless.
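
The arithmetic behind that claim is easy to reproduce. A minimal sketch, assuming a hypothetical batch of 10,000 transactions with exactly 10 fraud cases (0.1%) and a "model" that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical batch: 10,000 transactions, 10 of them fraudulent (0.1%)
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A "model" that always predicts "not fraud"
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
fraud_recall = recall_score(y_true, y_pred, zero_division=0)
print(f"accuracy: {accuracy:.1%}, fraud recall: {fraud_recall:.1%}")
```

Accuracy counts the 9,990 correct "not fraud" calls, while recall on the fraud class looks only at the 10 cases that matter, all of which are missed.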

Domain | Typical Imbalance Ratio | Impact
Credit Card Fraud | 1:500 to 1:2000 | Model ignores fraud entirely
Medical Diagnosis (rare disease) | 1:100 to 1:1000 | Misses critical positive cases
Customer Churn | 1:5 to 1:50 | Underperforms on churning class
Titanic (survival) | 1:1.6 | Mild; standard metrics still work

Code Example
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns

# Create a highly imbalanced dataset
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    weights=[0.95, 0.05],   # 95% class 0, 5% class 1
    random_state=42
)
print(f"Class distribution: {dict(zip(*np.unique(y, return_counts=True)))}")
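# Roughly {0: 9500, 1: 500}; exact counts differ slightly because the default
# flip_y=0.01 assigns a random class to about 1% of samples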

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

# ══════════════════════════════════════
# PROBLEM DEMONSTRATION
# ══════════════════════════════════════
# Naive model — just predicts majority class always
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_dummy = dummy.predict(X_test)
print("\n--- Dummy Classifier (always predicts 0) ---")
print(classification_report(y_test, y_dummy, target_names=['Normal', 'Fraud']))
# Accuracy is about 95%, but recall for Fraud is 0, so the model is useless

# Standard LR without handling imbalance
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train)
y_lr = lr.predict(X_test)
print("\n--- Logistic Regression (no handling) ---")
print(classification_report(y_test, y_lr, target_names=['Normal', 'Fraud']))

# ══════════════════════════════════════
# SOLUTION 1: CLASS WEIGHTS
# Tell the model to penalize minority class errors more
# ══════════════════════════════════════
lr_weighted = LogisticRegression(
    max_iter=1000,
    class_weight='balanced',   # automatically set weights inversely proportional to frequency
    random_state=42
)
lr_weighted.fit(X_train, y_train)
y_weighted = lr_weighted.predict(X_test)
print("\n--- Logistic Regression (class_weight='balanced') ---")
print(classification_report(y_test, y_weighted, target_names=['Normal', 'Fraud']))
# Better recall for minority class!

# Manual class weights
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
weight_dict = dict(enumerate(class_weights))
print(f"\nComputed class weights: {weight_dict}")
# Roughly {0: 0.526, 1: 10.0}: the minority class is weighted about 19× more

# ══════════════════════════════════════
# SOLUTION 2: THRESHOLD ADJUSTMENT
# Default threshold is 0.5 — lower it to catch more minority class
# ══════════════════════════════════════
y_proba = lr.predict_proba(X_test)[:, 1]

print("\nThreshold Comparison:")
print(f"{'Threshold':10} {'Precision':10} {'Recall':10} {'F1':10}")
print("-" * 45)
from sklearn.metrics import precision_score, recall_score, f1_score
for threshold in [0.3, 0.4, 0.5, 0.6]:
    y_thresh = (y_proba >= threshold).astype(int)
    p = precision_score(y_test, y_thresh, pos_label=1, zero_division=0)
    r = recall_score(y_test, y_thresh, pos_label=1)
    f = f1_score(y_test, y_thresh, pos_label=1)
    print(f"{threshold:<10.1f} {p:<10.3f} {r:<10.3f} {f:<10.3f}")

# ══════════════════════════════════════
# CORRECT METRICS FOR IMBALANCED DATA
# ══════════════════════════════════════
print("\n--- Proper Metrics for Imbalanced Classification ---")
print(f"ROC-AUC:          {roc_auc_score(y_test, y_proba):.4f}")

from sklearn.metrics import average_precision_score, matthews_corrcoef
print(f"Avg Precision:    {average_precision_score(y_test, y_proba):.4f}")  # PR-AUC
print(f"Matthews Coef:    {matthews_corrcoef(y_test, y_lr):.4f}")          # balanced metric
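
The 'balanced' weights used above follow a simple formula: `n_samples / (n_classes * count_per_class)`. The sketch below recomputes them by hand on a made-up label vector with an exact 950/50 split (an assumption for illustration) and compares the result with `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels with an exact 95% / 5% split
y = np.array([0] * 950 + [1] * 50)

classes = np.unique(y)
counts = np.bincount(y)

# 'balanced' heuristic: n_samples / (n_classes * count of each class)
manual_weights = len(y) / (len(classes) * counts)
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print("manual: ", manual_weights.round(3))
print("sklearn:", sklearn_weights.round(3))
print("minority/majority weight ratio:", round(manual_weights[1] / manual_weights[0], 1))
```

The weight ratio equals the inverse of the class ratio, so each class contributes the same total weight to the loss.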

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain class imbalance and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 37 — SMOTE

SMOTE and Variations — Synthetic Minority Oversampling

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

How SMOTE Works

SMOTE (Synthetic Minority Oversampling TEchnique) generates synthetic minority samples instead of simply duplicating existing ones:

  1. For each minority sample $x_i$, find its $k$ nearest minority neighbors
  2. Randomly select one neighbor $x_{nn}$
  3. Generate a new synthetic sample: $x_{new} = x_i + \lambda \cdot (x_{nn} - x_i)$ where $\lambda \in [0,1]$
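
The three steps above can be sketched in plain NumPy. This is an illustration of the interpolation idea only, not the imbalanced-learn implementation: the function name `smote_sketch`, the six toy points, and the choice to pick the base sample $x_i$ at random for each new point are all assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(n)                  # step 1: pick a minority sample x_i
        nn = rng.choice(neighbors[i])        # step 2: pick one of its k neighbors
        lam = rng.random()                   # step 3: lambda drawn from [0, 1)
        synthetic[row] = X_min[i] + lam * (X_min[nn] - X_min[i])
    return synthetic

# Toy minority class in 2D (illustrative values)
minority = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [2, 2]])
new_points = smote_sketch(minority, n_new=20, k=3, seed=0)
print(new_points.round(2)[:5])
```

Every synthetic point lies on a line segment between two real minority samples, so SMOTE never creates points outside the region the minority class already spans.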
Code Example
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, SVMSMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, EditedNearestNeighbours
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.pipeline import Pipeline as ImbPipeline

# Create imbalanced dataset
X, y = make_classification(
    n_samples=10000, n_features=20, weights=[0.95, 0.05], random_state=42
)

# CRITICAL: Split BEFORE resampling! Never resample test data!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

print(f"Training set class counts: {dict(zip(*np.unique(y_train, return_counts=True)))}")
# Roughly {0: 7600, 1: 400}; make_classification flips a few labels by default

# ══════════════════════════════════════
# SMOTE — Standard synthetic oversampling
# ══════════════════════════════════════
smote = SMOTE(
    sampling_strategy=0.5,    # minority:majority = 1:2 after resampling
    k_neighbors=5,            # number of nearest neighbors to use
    random_state=42
)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(f"\nAfter SMOTE: {dict(zip(*np.unique(y_smote, return_counts=True)))}")

# ══════════════════════════════════════
# ADASYN — Adaptive Synthetic Sampling
# Generates more synthetic samples in harder-to-learn regions
# ══════════════════════════════════════
adasyn = ADASYN(sampling_strategy=0.5, random_state=42)
X_adasyn, y_adasyn = adasyn.fit_resample(X_train, y_train)
print(f"After ADASYN: {dict(zip(*np.unique(y_adasyn, return_counts=True)))}")

# ══════════════════════════════════════
# BorderlineSMOTE — Only oversample borderline samples
# Focuses on samples near the decision boundary
# ══════════════════════════════════════
bl_smote = BorderlineSMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)
X_bl, y_bl = bl_smote.fit_resample(X_train, y_train)

# ══════════════════════════════════════
# COMBINATION: SMOTE + Tomek Links
# Oversample minority AND clean majority class boundary
# ══════════════════════════════════════
smote_tomek = SMOTETomek(random_state=42)
X_st, y_st = smote_tomek.fit_resample(X_train, y_train)
print(f"After SMOTETomek: {dict(zip(*np.unique(y_st, return_counts=True)))}")

# ══════════════════════════════════════
# UNDERSAMPLING — RandomUnderSampler
# Removes majority class samples randomly
# ══════════════════════════════════════
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print(f"After Random Undersampling: {dict(zip(*np.unique(y_rus, return_counts=True)))}")

# ══════════════════════════════════════
# COMPARE ALL METHODS
# ══════════════════════════════════════
methods = {
    'No Resampling':     (X_train, y_train),
    'SMOTE':             (X_smote, y_smote),
    'ADASYN':            (X_adasyn, y_adasyn),
    'BorderlineSMOTE':   (X_bl, y_bl),
    'SMOTETomek':        (X_st, y_st),
    'RandomUnder':       (X_rus, y_rus),
}

print("\nComparison of Resampling Methods (LR on test set):")
print(f"{'Method':20} {'ROC-AUC':10} {'F1-Minority':12} {'Recall-Min':10}")
print("-" * 55)
for name, (X_res, y_res) in methods.items():
    lr = LogisticRegression(max_iter=1000, random_state=42)
    lr.fit(X_res, y_res)
    y_proba = lr.predict_proba(X_test)[:, 1]
    y_pred  = lr.predict(X_test)
    
    auc = roc_auc_score(y_test, y_proba)
    report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
    f1_min  = report['1']['f1-score']
    rec_min = report['1']['recall']
    print(f"  {name:20} {auc:.4f}     {f1_min:.4f}       {rec_min:.4f}")

# ══════════════════════════════════════
# IMBALANCED-LEARN PIPELINE
# Properly handles resampling in cross-validation
# ══════════════════════════════════════
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

# Scale first: SMOTE picks neighbors by distance, so features should share a scale
imb_pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42, sampling_strategy=0.5)),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(imb_pipeline, X_train, y_train, cv=skf, scoring='roc_auc')
print(f"\nCross-validated AUC with SMOTE pipeline: {cv_auc.mean():.4f} ± {cv_auc.std():.4f}")

⚠️ CRITICAL: Never Apply SMOTE Before Train/Test Split

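If you apply SMOTE to the entire dataset before splitting, synthetic samples interpolated from rows that end up in the training set land in your test set (and vice versa), which inflates metrics drastically. Always split first, then apply SMOTE only to the training set. For cross-validation, use imblearn.pipeline.Pipeline: sklearn's Pipeline does not accept samplers, while imblearn's version resamples only the training portion of each fold and leaves the validation portion untouched.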

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain SMOTE and its variations and when they apply.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 38 — Feature Selection

Feature Selection — Filter, Wrapper, and Embedded Methods

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Why Feature Selection?

  • Curse of Dimensionality: Performance degrades as irrelevant dimensions increase
  • Faster Training: Fewer features = faster model training and prediction
  • Better Generalization: Removing noise features reduces overfitting
  • Interpretability: Simpler models with fewer features are easier to explain
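
🔵 Filter Methods: score each feature with a statistical test (Chi2, ANOVA F-test, mutual information). Model-independent and the fastest option, but each feature is judged in isolation.

🟡 Wrapper Methods: search over feature subsets using model performance (RFE, RFECV). Model-dependent and slower, but they account for how features work together.

🟢 Embedded Methods: selection happens during training (Lasso/L1 penalties, tree feature importances; Ridge shrinks coefficients but does not zero them out). A middle ground between speed and quality, and the most commonly used in practice.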
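
To make the three families concrete, the sketch below runs one representative of each on the breast cancer dataset and keeps five features per method. The specific choices (ANOVA F-test for the filter, RFE around logistic regression for the wrapper, random forest importances for the embedded method, and `k=5`) are assumptions picked for brevity, not recommendations.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
# Selectors are fit on training rows only, so the test set stays unseen
X_train_sc = StandardScaler().fit_transform(X_train)

selectors = {
    "filter (ANOVA F-test)": SelectKBest(f_classif, k=5),
    "wrapper (RFE + logistic regression)": RFE(
        LogisticRegression(max_iter=5000), n_features_to_select=5
    ),
    "embedded (random forest importances)": SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold=-np.inf,   # disable the threshold so max_features decides
        max_features=5,
    ),
}

selected = {}
for label, selector in selectors.items():
    selector.fit(X_train_sc, y_train)
    selected[label] = list(data.feature_names[selector.get_support()])
    print(f"{label}: {selected[label]}")
```

The three lists usually overlap but rarely agree completely, which is a useful reminder that "the best features" depends on the method and the model doing the judging.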

Code Example
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_sc = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test_sc  = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

# ══════════════════════════════════════
# FILTER METHODS
# ══════════════════════════════════════

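# 1. Chi-Square Test: requires non-negative features and a categorical target.
# chi2 is designed for counts/frequencies; applying it to min-max-scaled continuous
# features, as here, is a rough heuristic rather than a true test of independence.
from sklearn.feature_selection import chi2, SelectKBest, f_classif, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X_pos = MinMaxScaler().fit_transform(X_train)  # rescale to [0, 1] so all values are non-negative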
chi2_selector = SelectKBest(chi2, k=10)
chi2_selector.fit(X_pos, y_train)
chi2_scores = pd.Series(chi2_selector.scores_, index=X.columns).sort_values(ascending=False)

print("Top 10 features by Chi-Square test:")
print(chi2_scores.head(10).round(3))

# 2. ANOVA F-test — for continuous features vs categorical target
f_selector = SelectKBest(f_classif, k=10)
f_selector.fit(X_train, y_train)
f_scores = pd.Series(f_selector.scores_, index=X.columns).sort_values(ascending=False)
f_pvalues = pd.Series(f_selector.pvalues_, index=X.columns)

print("\nTop 10 features by ANOVA F-test:")
for feat, score, pval in zip(f_scores.head(10).index, 
                              f_scores.head(10).values,
                              f_pvalues[f_scores.head(10).index].values):
    sig = "✅" if pval < 0.05 else "❌"
    print(f"  {feat:30} F={score:8.2f}  p={pval:.4f}  {sig}")

# 3. Mutual Information — captures non-linear relationships
mi_selector = SelectKBest(mutual_info_classif, k=10)
mi_selector.fit(X_train_sc, y_train)
mi_scores = pd.Series(mi_selector.scores_, index=X.columns).sort_values(ascending=False)

# Visualize filter method comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, (title, scores) in zip(axes, [
    ('Chi-Square (Top 15)', chi2_scores.head(15)),
    ('ANOVA F-test (Top 15)', f_scores.head(15)),
    ('Mutual Information (Top 15)', mi_scores.head(15))
]):
    ax.barh(range(len(scores)), scores.values, color='#d4af37', alpha=0.8)
    ax.set_yticks(range(len(scores)))
    ax.set_yticklabels(scores.index, fontsize=8)
    ax.invert_yaxis()
    ax.set_title(title, fontweight='bold')
plt.suptitle('Filter Methods — Feature Importance Scores', fontweight='bold')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# WRAPPER METHODS: RFE and RFECV
# ══════════════════════════════════════
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

# RFE — Recursive Feature Elimination
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000, C=1.0),
    n_features_to_select=10,  # how many features to keep
    step=1                    # remove one feature per iteration
)
rfe.fit(X_train_sc, y_train)

rfe_ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
selected = rfe_ranking[rfe_ranking == 1].index.tolist()
print(f"\nRFE selected {len(selected)} features:")
print(selected)

# RFECV — RFE with Cross-Validation to find optimal number of features
from sklearn.model_selection import StratifiedKFold
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    min_features_to_select=1,
    cv=StratifiedKFold(5),
    scoring='accuracy',
    n_jobs=-1
)
rfecv.fit(X_train_sc, y_train)

print(f"\nRFECV optimal number of features: {rfecv.n_features_}")

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(range(1, len(rfecv.cv_results_['mean_test_score'])+1), 
        rfecv.cv_results_['mean_test_score'], 'o-', color='#d4af37', linewidth=2)
ax.fill_between(range(1, len(rfecv.cv_results_['mean_test_score'])+1),
                rfecv.cv_results_['mean_test_score'] - rfecv.cv_results_['std_test_score'],
                rfecv.cv_results_['mean_test_score'] + rfecv.cv_results_['std_test_score'],
                alpha=0.2, color='#d4af37')
ax.axvline(rfecv.n_features_, color='red', linestyle='--', label=f'Optimal: {rfecv.n_features_}')
ax.set_xlabel('Number of Features')
ax.set_ylabel('Cross-Validation Accuracy')
ax.set_title('RFECV — Optimal Number of Features')
ax.legend()
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# EMBEDDED METHODS
# ══════════════════════════════════════

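# 1. Lasso (L1): drives some coefficients to exactly zero
from sklearn.linear_model import LassoCV

# LassoCV picks alpha by cross-validation. Lasso is a regression model, so the 0/1
# labels are fit as numeric targets here. That works as a rough screen, but an
# L1-penalized LogisticRegression is the more natural choice for classification.
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000)
lasso_cv.fit(X_train_sc, y_train)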
print(f"\nLassoCV best alpha: {lasso_cv.alpha_:.6f}")

lasso_coefs = pd.Series(np.abs(lasso_cv.coef_), index=X.columns).sort_values(ascending=False)
lasso_selected = lasso_coefs[lasso_coefs > 0].index.tolist()
lasso_zeroed   = lasso_coefs[lasso_coefs == 0].index.tolist()

print(f"Lasso kept {len(lasso_selected)} features (coef ≠ 0):")
print(lasso_coefs.head(10).round(4).to_string())
print(f"\nLasso zeroed out {len(lasso_zeroed)} features: {lasso_zeroed}")

# 2. Tree Feature Importances (Embedded - Gini importance)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

rf_importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

print("\nTop 10 features by Random Forest importance:")
print(rf_importance.head(10).round(4).to_string())

# 3. SelectFromModel — automated threshold-based selection
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(rf, threshold='mean')  # keep features above mean importance
sfm.fit(X_train, y_train)
sfm_selected = X.columns[sfm.get_support()].tolist()
print(f"\nSelectFromModel (threshold=mean) kept {len(sfm_selected)} features: {sfm_selected}")

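# 4. Permutation Importance: model-agnostic (not an embedded method; shown here for
# comparison) and less biased toward high-cardinality features than Gini importance.
# It is scored on the test set for illustration only; if you select features from
# these scores, compute them on a validation split so the test set stays untouched.
from sklearn.inspection import permutation_importance
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)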
perm_scores = pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False)

# Comparison plot: Gini vs Permutation
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
rf_importance.head(15).plot(kind='barh', ax=axes[0], color='#d4af37', alpha=0.8)
axes[0].set_title('Random Forest — Gini Importance', fontweight='bold')
axes[0].invert_yaxis()

perm_scores.head(15).plot(kind='barh', ax=axes[1], color='#3a7bd5', alpha=0.8)
axes[1].set_title('Permutation Importance (held-out data)', fontweight='bold')
axes[1].invert_yaxis()

plt.suptitle('Feature Importance: Gini vs Permutation', fontweight='bold')
plt.tight_layout()
plt.show()

print("""
FEATURE SELECTION SUMMARY:
Filter:   Fast; model-independent; good for initial screening
Wrapper:  Often the best subset for one specific estimator; expensive, so let RFECV pick the count
Embedded: Best balance; Lasso + tree importances in practice
""")

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain feature selection and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 39 — Wrapper Methods

Wrapper Methods — RFE and RFECV

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Wrapper methods treat feature selection as a search problem: train a model, score subsets, and keep the set that maximizes performance. They are slower than filters but often find better subsets for a specific estimator.
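
To make the "search problem" framing concrete, here is a minimal sketch of greedy forward selection using scikit-learn's `SequentialFeatureSelector`. It is not part of the lesson's own code; the choice of 3 features and 5 folds is an arbitrary assumption to keep the run short. At each step it tries adding every remaining feature, scores the candidate subset with cross-validation, and keeps the best addition.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# The wrapped estimator is a full pipeline, so scaling is refit inside every CV fold.
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

sfs = SequentialFeatureSelector(
    estimator,
    n_features_to_select=3,   # assumption: small target to keep the search cheap
    direction="forward",      # start empty, add one feature per step
    scoring="accuracy",
    cv=5,
)
sfs.fit(data.data, data.target)

chosen = data.feature_names[sfs.get_support()].tolist()
print("Forward selection chose:", chosen)
```

Unlike RFE, this search only needs a score, not coefficients or importances, so it works with any estimator. The cost is many more model fits.
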

Recursive Feature Elimination (RFE)

RFE repeatedly trains the model, ranks features by importance or coefficients, drops the weakest, and repeats until n_features_to_select remain.
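
The loop described above can be written by hand in a few lines. This is a simplified sketch for intuition, not scikit-learn's implementation: `manual_rfe` is a hypothetical helper that ranks features by absolute logistic-regression coefficient and removes one per round.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_demo = StandardScaler().fit_transform(data.data)  # coefficients are only comparable on scaled features
y_demo = data.target


def manual_rfe(X, y, n_keep):
    """Hypothetical helper: drop the feature with the smallest |coefficient| each round."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_[0])))  # position within `remaining`
        del remaining[weakest]
    return remaining


kept_idx = manual_rfe(X_demo, y_demo, n_keep=10)
print("Kept:", data.feature_names[kept_idx].tolist())
```

The model is refit after every removal because coefficients change once a correlated feature is gone; ranking once and cutting the bottom 20 would be a filter, not RFE.
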

Code Example
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Reuses X, X_train_sc and y_train from the Day 38 example.
# RFE: keep a fixed number of features, dropping the weakest one per iteration.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_train_sc, y_train)
print("Selected:", X.columns[rfe.support_].tolist())

# RFECV: let stratified 5-fold cross-validation choose how many features to keep.
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=StratifiedKFold(5), scoring='accuracy')
rfecv.fit(X_train_sc, y_train)
print("Optimal feature count:", rfecv.n_features_)
⚠️
Failure mode

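Running RFE once on the full training set and then cross-validating on the selected features leaks validation-fold information into the selection, which inflates the CV score. Wrap the scaler, RFE, and the estimator in a Pipeline so selection is refit on the training portion of each fold only.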
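
A minimal sketch of the leak-free setup, assuming the same breast cancer dataset used in this module (the step names `scale`, `select`, and `clf` are arbitrary labels):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_all = load_breast_cancer(return_X_y=True)

# Every step is refit on the training portion of each fold,
# so the validation fold never influences scaling or selection.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_raw, y_all, cv=cv, scoring="accuracy")
print(f"CV accuracy with selection inside the pipeline: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The selected features may differ from fold to fold. That is expected: the score estimates how well the whole procedure generalizes, not one particular subset.
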

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain wrapper methods and when they apply.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 40 — Embedded Methods

Embedded Methods — Lasso, Tree Importances, SelectFromModel

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

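Embedded methods perform selection as a side effect of training: an L1 penalty drives some coefficients to exactly zero, and tree ensembles accumulate impurity-based importances. They cost one model fit rather than many, which is why they are a common choice in tabular pipelines.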
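
For a classification target, the L1 idea can be applied directly through logistic regression. The sketch below is an illustration, assuming a scikit-learn release that accepts `penalty="l1"` with the `liblinear` solver; the three `C` values are arbitrary. It shows how the regularization strength controls how many features survive.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_demo = StandardScaler().fit_transform(data.data)  # L1 penalties assume comparable feature scales
y_demo = data.target

# C is the inverse regularization strength: smaller C zeroes out more coefficients.
kept_by_C = {}
for C in (0.01, 0.1, 1.0):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X_demo, y_demo)
    kept_by_C[C] = int(np.sum(clf.coef_[0] != 0))
    print(f"C={C:<5} keeps {kept_by_C[C]} of {X_demo.shape[1]} features")
```

In practice `C` would be tuned by cross-validation rather than fixed by hand.
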

Code Example
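from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Reuses X, X_train, X_train_sc and y_train from the Day 38 example.
# Lasso: features whose coefficient is shrunk to exactly zero are dropped.
# LassoCV is a regressor, so the 0/1 labels are treated as numeric targets here.
lasso = LassoCV(cv=5).fit(X_train_sc, y_train)
kept = X.columns[lasso.coef_ != 0]
print(f"Lasso kept {len(kept)} features")

# Random forest: keep features whose impurity importance is above the mean.
# SelectFromModel fits its own copy of the estimator, so no separate rf.fit is needed.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
sfm = SelectFromModel(rf, threshold='mean')
sfm.fit(X_train, y_train)
print("RF selected:", X.columns[sfm.get_support()].tolist())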
Pro tip: Use filter methods for a fast screen (top 50%), then embedded selection for the final feature set. Reserve wrappers for high-stakes final tuning.
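
One way to wire up that two-stage recipe is a single Pipeline, sketched below under stated assumptions: `SelectPercentile` with the ANOVA F-test is the filter, a random forest with `threshold="mean"` is the embedded stage, and logistic regression is the final model. None of these specific choices come from the tip itself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_all = load_breast_cancer(return_X_y=True)

two_stage = Pipeline([
    ("scale", StandardScaler()),
    # Stage 1 (filter): keep the top 50% of features by F-score.
    ("screen", SelectPercentile(f_classif, percentile=50)),
    # Stage 2 (embedded): keep survivors with above-mean forest importance.
    ("embed", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold="mean",
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(two_stage, X_raw, y_all, cv=5, scoring="accuracy")

# Fit once on all data just to inspect how many features each stage keeps.
two_stage.fit(X_raw, y_all)
n_after_screen = int(two_stage.named_steps["screen"].get_support().sum())
n_final = int(two_stage.named_steps["embed"].get_support().sum())
print(f"{X_raw.shape[1]} -> {n_after_screen} -> {n_final} features, CV accuracy {scores.mean():.3f}")
```

Because both selectors sit inside the pipeline, they are refit within each cross-validation fold, so the reported score does not benefit from selection leakage.
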
🚀
Module 3 Complete — What's Next?

You now have a complete preprocessing toolkit. Your clean, scaled, encoded, and selected feature matrix is ready for modeling. Move on to Module 4: Supervised Learning Algorithms.

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain embedded methods and when they apply.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Module 4: Supervised Learning Algorithms
