100 Days of ML · Module 1 (10)

Module 1: Machine Learning Foundations

100 Days of ML Module 1 — Master Machine Learning foundations: what is ML, AI vs ML vs DL, types of learning, Python setup, NumPy, Pandas, and the ML Life Cycle.

⏱ 45 Min Read • 10 • Updated: May 2026

This module lays the conceptual bedrock of Machine Learning. By the end you will understand exactly what ML is, when to use it, how it differs from traditional programming, the key types of learning, and how to set up your Python data science environment.

What is Machine Learning?

Why this matters

Every ML career path starts here: you must know when learning beats hand-coded rules, and how the ML product lifecycle differs from writing one-off scripts.

Formal Definition

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. (Commonly attributed to Arthur Samuel, 1959.)

Core Intuition: Instead of writing explicit rules, you feed data (examples of inputs + correct outputs) to an algorithm. The algorithm finds patterns and produces a learned program (model) that can generalize to new, unseen inputs.
      

Traditional Programming vs Machine Learning

Aspect	Traditional Programming	Machine Learning
Input	Data + Hand-written Logic/Rules	Data + Expected Outputs (Labels)
Output	Answers	A learned Model / Program
Logic	Written explicitly by programmer	Discovered automatically by algorithm
Adaptability	Breaks when rules change	Retrain on new data → model adapts
Example	if "Congratulations" in email: spam = True	Train on 50,000 labeled emails → model learns all spam patterns

When Should You Use Machine Learning?

ML is the right tool when:

Too many rules to code: Spam classification, fraud detection — the rules are nearly infinite and constantly shifting.
Rules impossible to enumerate: Image/face recognition — no human can write rules for all dog breeds, lighting, angles.
Data mining: Finding hidden patterns in large datasets (recommendation systems, customer segmentation).
Dynamic environments: Personalization systems that must adapt to changing user behavior in real-time.

📌

Don't Use ML When…

The problem has simple deterministic rules (calculating a square root, sorting a list), you have very little data, or you need 100% explainability and zero tolerance for error (e.g., surgical equipment control).

Why ML is Booming Now

ML has existed since the 1950s but only recently became practical, driven by three converging factors:

Big Data: The internet, social media, and IoT devices generate petabytes of data daily, though much of it is unlabeled.
Hardware: GPUs and TPUs enabling parallel matrix operations — training that took months now takes hours.
Better Algorithms: Deep learning, transformers, and refined optimization techniques.

Worked example — Spam vs rules

Rules: If email contains "winner" OR "free money" → spam. Spammers rename tokens; you patch rules weekly.

ML: Train on 50k labeled emails. Model learns combinations of words, headers, and senders. When spammers adapt, retrain on new labels — no manual rule explosion.
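As a rough illustration of the contrast above, the sketch below puts a two-phrase rule next to a tiny bag-of-words classifier. The six toy emails, the helper name `rule_based_is_spam`, and the choice of `CountVectorizer` with `MultinomialNB` are assumptions made for this example; a real filter would need far more data and careful evaluation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def rule_based_is_spam(text: str) -> bool:
    """Hand-written rule from the example: two fixed trigger phrases."""
    lowered = text.lower()
    return "winner" in lowered or "free money" in lowered

# Toy labeled data (made up for illustration): 1 = spam, 0 = not spam
emails = [
    "free money now",
    "you are a winner claim prize",
    "claim your free prize",
    "meeting at noon tomorrow",
    "project report attached",
    "lunch tomorrow at noon",
]
labels = [1, 1, 1, 0, 0, 0]

# The model learns word/label associations instead of fixed phrases
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

new_email = "claim free prize"   # contains neither trigger phrase
rule_says_spam = rule_based_is_spam(new_email)
model_says_spam = bool(model.predict([new_email])[0])
print(rule_says_spam, model_says_spam)
```

The rule misses the reworded message, while the learned model can still flag it because the individual words appeared in spam examples during training.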

ML Job Market Context

ML talent has been scarce relative to demand. Salaries have been favorable during the growth phase but are likely to normalize as supply catches up; learning the full ML lifecycle (not just algorithms) is what separates ordinary from strong ML engineers.

Common mistakes

Using ML for deterministic problems (tax calculation, sorting) — traditional code is cheaper and exact.
Expecting magic without data — no labels, no volume, no signal means no reliable model.
Confusing training success with business success — high offline accuracy does not guarantee production value.

Interview checkpoints

Q: Define ML in one sentence. A: Systems that improve performance on a task by learning patterns from data rather than explicit rules.
Q: Spam filter: rules or ML? A: ML — adversaries change patterns; rules rot quickly.
Q: Why did ML explode after 2010? A: Data scale + compute (GPU) + better algorithms converged.

Practice

Basic: List 3 problems in your domain where rules would be painful to maintain.
Intermediate: Draw the traditional-programming vs ML flow for one use case.
Advanced: Write a 1-page ML problem brief: objective, data source, success metric, constraints.

Recap

ML learns patterns from data; it does not encode every rule manually.
Use ML when cases are too many, too fuzzy, or hidden in data.
Success today needs data, hardware, and sound ML lifecycle thinking.

Next: Day 2 — AI vs ML vs DL

AI vs ML vs Deep Learning

Why this matters

Interviewers and stakeholders constantly conflate AI, ML, and DL. Clear vocabulary prevents wrong architecture choices (e.g., using deep learning on tiny tabular data).

From the 100 Days curriculum: Focus on the ML product lifecycle — preprocessing, analysis, model selection, feature engineering, bias-variance, deployment — not isolated algorithms. That full-stack view is what employers reward.

Relationship: AI ⊃ ML ⊃ Deep Learning

Diagram: Deep Learning sits inside Machine Learning, which sits inside Artificial Intelligence.

Term	Definition	Examples	Key Techniques
Artificial Intelligence (AI)	Simulating human intelligence in machines — any technique enabling machines to mimic human behavior	Chess engines, expert systems, route planners	Rule-based systems, Search, Logic
Machine Learning (ML)	Subset of AI — systems that learn from data without being explicitly programmed	Email spam filter, credit scoring, recommendation engines	Linear models, Trees, SVM, Clustering
Deep Learning (DL)	Subset of ML using multi-layered neural networks that automatically learn hierarchical feature representations	Face recognition, GPT, DALL-E, AlphaFold	CNNs, RNNs, Transformers, Diffusion

Why Deep Learning Now?

Traditional ML requires manual feature engineering — domain experts hand-craft features (e.g., "count capital letters in email").
Deep Learning does automatic feature learning — the network learns its own features from raw pixels, raw text, raw audio.
DL outperforms classical ML when you have large amounts of data and GPU compute.

💡

Which to Use?

Tabular structured data (CSV files): XGBoost / LightGBM usually wins over deep learning.
Images, Text, Audio: Deep Learning (CNNs, Transformers) dominates.
Small data (<10k samples): Classical ML + feature engineering is safer.

Common mistakes

Calling every neural network project "AI" without specifying the learning paradigm.
Choosing deep learning first on small tabular datasets where gradient boosting wins.
Ignoring that classical ML still powers most production tabular systems.

Interview checkpoints

Q: Is GPT "AI" or "ML"? A: Both — it's ML (learned from data) within the broader AI goal of intelligent behavior.
Q: When is DL overkill? A: Small data, need interpretability, or simple structured features.
Q: Feature engineering — classical ML or DL? A: Classical ML needs it; DL learns features from raw inputs.

Practice

Basic: Classify 5 products (spam filter, chatbot, fraud score, house price, face unlock) into AI/ML/DL buckets.
Intermediate: For a tabular churn dataset (50k rows), justify ML family choice.
Advanced: Compare bias-variance and data needs for logistic regression vs a small MLP.

Recap

AI ⊃ ML ⊃ DL: each inner level is a narrower, more data-driven subset of the one around it.
Tabular → often classical ML; images/text/audio → often DL.
Terminology clarity saves months of wrong tooling.

Next: Day 3 — Types of ML

Types of Machine Learning

Why this matters

Choosing supervised vs unsupervised vs reinforcement learning defines your entire project: data labeling budget, metrics, and deployment loop.

1. Supervised Learning

The algorithm is trained on a labeled dataset — every training example has an input $X$ and a known correct output $y$. The model learns a mapping $f: X \to y$.

Task Type	Output	Examples	Algorithms
Classification	Discrete category / class label	Spam detection, disease diagnosis, sentiment analysis	Logistic Regression, SVM, Random Forest, XGBoost
Regression	Continuous numeric value	House price prediction, sales forecasting, temperature prediction	Linear Regression, Ridge, GBDT, SVR
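To make the two task types concrete, here is a minimal sketch using scikit-learn. The data is synthetic (a noise-free line for regression, a simple threshold for classification) and the model choices are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Regression: continuous target generated from y = 3x + 2 (no noise)
X_reg = np.arange(10, dtype=float).reshape(-1, 1)   # shape (10, 1)
y_reg = 3 * X_reg.ravel() + 2
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.coef_[0], reg.intercept_)                 # close to 3 and 2

# Classification: discrete target, 1 when the feature is at least 5
y_clf = (X_reg.ravel() >= 5).astype(int)
clf = DecisionTreeClassifier(random_state=0).fit(X_reg, y_clf)
print(clf.predict([[1.0], [8.0]]))                  # class labels, not real numbers
```

Both models share the same `fit` / `predict` calls; what differs is the type of `y` and the metric you would use to judge the result.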

2. Unsupervised Learning

No labels are provided. The algorithm discovers inherent structure, patterns, or groupings in the data on its own.

Clustering: K-Means, DBSCAN, Hierarchical — group similar data points together (customer segments).
Dimensionality Reduction: PCA, t-SNE, UMAP — compress high-dimensional data into fewer features while preserving structure.
Anomaly Detection: Isolation Forest, One-Class SVM — find outliers (fraud, manufacturing defects).
Density Estimation: Gaussian Mixture Models — model the probability distribution of data.
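A small clustering sketch shows what "no labels" means in practice. The two synthetic groups and the K-Means settings below are assumptions chosen so the structure is easy to see.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))    # points near (0, 0)
group_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))   # points near (10, 10)
X = np.vstack([group_a, group_b])                         # no labels are passed to the model

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_
print(cluster_ids)
```

The cluster ids are arbitrary (0 and 1 could be swapped), so any business meaning still has to be assigned by a person after inspecting each group.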

3. Semi-Supervised Learning

A middle ground: a small amount of labeled data plus a large amount of unlabeled data. The model uses the unlabeled data to learn structure, then refines with the labeled examples. Common where labeling is expensive, such as medical imaging. (BERT-style pre-training is usually classed as self-supervised; see section 5.)

4. Reinforcement Learning (RL)

An agent learns by interacting with an environment. It takes actions, receives rewards or penalties, and learns a policy that maximizes cumulative reward over time.

Agent → action → Environment → state, reward → Agent

Examples: AlphaGo, ChatGPT RLHF, trading bots, robotics control
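The agent/environment loop can be sketched with tabular Q-learning on a toy problem. Everything here is an assumption for illustration: a five-cell corridor, a reward of 1 for reaching the right end, purely random exploration, and hand-picked values for the discount factor and learning rate.

```python
import numpy as np

N_STATES = 5             # corridor cells 0..4; reaching cell 4 ends the episode
LEFT, RIGHT = 0, 1
GAMMA, ALPHA = 0.9, 0.5  # discount factor and learning rate (illustrative values)

def step(state: int, action: int) -> tuple[int, float, bool]:
    """Environment: move one cell, reward 1.0 only on reaching the last cell."""
    next_state = min(state + 1, N_STATES - 1) if action == RIGHT else max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, 2))             # action-value table, one row per state

for _ in range(300):                    # episodes
    state = 0
    for _ in range(200):                # step cap per episode
        action = int(rng.integers(2))   # explore with uniformly random actions
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q toward reward + discounted best next value
        target = reward + (0.0 if done else GAMMA * Q[next_state].max())
        Q[state, action] += ALPHA * (target - Q[state, action])
        state = next_state
        if done:
            break

greedy_policy = Q[:-1].argmax(axis=1)   # best action in each non-terminal state
print(greedy_policy)
```

The learned policy is read off the table by taking the highest-valued action in each state; real problems replace the table with a function approximator and a smarter exploration strategy.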

5. Self-Supervised Learning

A special case of unsupervised learning where the model generates its own labels from the data. GPT models are pre-trained by predicting the next word — no human labeling needed. BERT is pre-trained by masking random words and predicting them.
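The idea that labels come from the data itself can be shown in a few lines. This is a simplified sketch: it splits on whitespace instead of using a real tokenizer, and the helper names `next_token_pairs` and `masked_pairs` are hypothetical.

```python
def next_token_pairs(text: str) -> list[tuple[list[str], str]]:
    """Next-word objective: each prefix of the sentence predicts the following word."""
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_pairs(text: str, mask_token: str = "[MASK]") -> list[tuple[list[str], str]]:
    """Masked-word objective: hide one position at a time and predict the hidden word.
    Simplified: BERT masks a random subset of tokens, not every position in turn."""
    tokens = text.split()
    return [
        (tokens[:i] + [mask_token] + tokens[i + 1:], tokens[i])
        for i in range(len(tokens))
    ]

sentence = "the cat sat"
print(next_token_pairs(sentence))
print(masked_pairs(sentence))
```

Each pair is an (input, target) training example produced without any human annotation.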

Common mistakes

Treating clustering output as ground truth without business validation.
Using classification metrics on regression problems (or vice versa).
Forgetting that semi-supervised and self-supervised exist for label-scarce settings.

Interview checkpoints

Q: Customer segmentation — supervised or unsupervised? A: Usually unsupervised (clustering) unless you have defined segments as labels.
Q: What is RLHF? A: Reinforcement learning from human feedback to align LLM behavior.
Q: Self-supervised example? A: BERT masked language modeling; next-token prediction in GPT.

Practice

Basic: Match 6 scenarios to supervised / unsupervised / RL.
Intermediate: Design a label strategy for a new fraud product (what is y?).
Advanced: Explain when semi-supervised beats pure supervised for 1M unlabeled + 5k labeled samples.

Recap

Supervised needs labels; unsupervised finds structure; RL optimizes sequential rewards.
Pick the paradigm before picking an algorithm.
Many modern NLP/CV systems combine self-supervised pretraining + supervised fine-tuning.

Next: Day 4 — Batch vs Online Learning

Batch vs Online Learning · Instance vs Model-Based

Why this matters

Production systems fail when teams train offline but need online adaptation (or the reverse). This day sets your deployment and retraining strategy.

Batch Learning (Offline Learning)

The model is trained once on the entire available dataset, then deployed. To adapt to new data, you must retrain from scratch on the full updated dataset and redeploy.

Pros: Simpler, stable performance, full data access during training.
Cons: Can't adapt in real-time, expensive to retrain if dataset is huge, stale model between retrains.
Use when: Data changes slowly, resources exist for full retraining (e.g., monthly product recommendation updates).

Online Learning (Incremental Learning)

The model is updated continuously as new data arrives — either one sample at a time or in small mini-batches (mini-batch learning). The learning rate $\eta$ (eta) controls how fast the model adapts.

Pros: Adapts to new patterns immediately, memory-efficient (old data can be discarded), handles concept drift.
Cons: If bad data arrives, model quality degrades quickly (requires data validation). Harder to debug.
Use when: Real-time systems (stock trading, fraud detection, search ranking), very large datasets that won't fit in memory.

⚠️

Concept Drift

The statistical properties of the input data (or the relationship between input and output) change over time. For example, spam tactics change weekly — a model trained in January may perform poorly by March. Online learning + monitoring addresses this.
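A minimal sketch of online updates under drift, assuming a one-weight linear model, a noise-free synthetic stream, and a learning rate of 0.1 (all chosen for illustration). The helper `online_fit` is hypothetical.

```python
import numpy as np

def online_fit(w: float, xs: np.ndarray, ys: np.ndarray, eta: float) -> float:
    """Update a one-weight linear model y = w * x one sample at a time."""
    for x, y in zip(xs, ys):
        error = y - w * x
        w += eta * error * x        # SGD step on squared error; eta sets the adaptation speed
    return w

rng = np.random.default_rng(0)
eta = 0.1                           # illustrative learning rate

# Phase 1: the stream follows y = 2x
x1 = rng.uniform(0.0, 1.0, size=2000)
w = online_fit(0.0, x1, 2.0 * x1, eta)
w_before_drift = w

# Phase 2: concept drift, the relationship changes to y = -1x
x2 = rng.uniform(0.0, 1.0, size=2000)
w = online_fit(w, x2, -1.0 * x2, eta)
w_after_drift = w

print(round(float(w_before_drift), 3), round(float(w_after_drift), 3))
```

A batch model trained only on phase 1 would keep predicting with the old slope; the incremental model follows the change, at the price of also following any bad data that arrives.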

Instance-Based Learning

The model memorizes training examples and makes predictions by comparing new inputs to stored examples using a similarity/distance measure. No explicit model parameters are learned.

Example — K-Nearest Neighbors (KNN): To classify a new point, find the $k$ most similar training points and take a majority vote.

$$d(p, q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}$$
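The distance formula and the majority vote translate almost directly into NumPy. This is a from-scratch sketch on made-up points, not the scikit-learn implementation; `knn_predict` is a hypothetical helper.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train: np.ndarray, y_train: np.ndarray, query: np.ndarray, k: int = 3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))   # Euclidean distance d(p, q)
    nearest = np.argsort(distances)[:k]                          # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up): two small groups with string labels
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
y_train = np.array(["red", "red", "red", "blue", "blue", "blue"])

print(knn_predict(X_train, y_train, np.array([1.2, 1.4])))
print(knn_predict(X_train, y_train, np.array([8.4, 8.6])))
```

Note that every prediction scans all stored training points, which is why instance-based methods get expensive as the training set grows.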

Model-Based Learning

The algorithm learns explicit model parameters (e.g., weights, thresholds, split rules) from the training data. Predictions use only these learned parameters — training data is not needed at inference time.

Example (Linear Regression): Learns parameters $\theta_0$ (bias) and $\theta_1, \theta_2, …$ (weights). Once trained, a new prediction is just a dot product: $\hat{y} = \theta^T x$, where $x$ is the feature vector with $x_0 = 1$ prepended so the bias term is included.
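The prediction step can be written out in NumPy. The parameter and feature values below are made up; the only convention assumed is that a constant 1 is prepended to the features so the bias takes part in the dot product.

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0])      # [theta_0 (bias), theta_1, theta_2], values made up
features = np.array([4.0, 5.0])        # one new example with two features

x = np.concatenate(([1.0], features))  # prepend x_0 = 1 so the bias joins the dot product
y_hat = theta @ x                      # 1*1 + 2*4 + 3*5
print(y_hat)

# Same computation for many rows at once: X has shape (m, n + 1)
X = np.array([[1.0, 4.0, 5.0],
              [1.0, 0.0, 0.0]])
y_hat_batch = X @ theta
print(y_hat_batch)
```

No training data appears anywhere in this computation, which is the practical difference from the instance-based approach.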

Common mistakes

Using batch-trained models for rapidly drifting fraud without monitoring.
Online updating on unvalidated streams — one bad batch poisons the model.
Confusing instance-based (KNN) latency at scale with model-based inference cost.

Interview checkpoints

Q: What is concept drift? A: Input/output relationship changes over time; old data misleads the model.
Q: KNN — instance or model-based? A: Instance-based; stores data, compares at query time.
Q: When prefer mini-batch online learning? A: For large streams where you want a middle ground between noisy single-sample SGD updates and the cost of full-batch retraining.

Practice

Basic: Label each system batch vs online: monthly recommender, live click ranking, yearly census model.
Intermediate: Sketch a retraining policy for a drift-prone classifier.
Advanced: Compare inference cost of KNN vs linear model for 10M training points.

Recap

Batch = retrain on full data; online = incremental updates.
Instance-based memorizes; model-based learns parameters.
Match learning mode to how fast your world changes.

Next: Day 5 — Python Setup

Python Development Setup for ML

Why this matters

Broken environments are among the most common sources of beginner frustration. A reproducible stack (conda + Jupyter + core libs) is non-negotiable for serious ML work.

Recommended Tools

Tool	Purpose	Why It Matters
Anaconda / Miniconda	Python environment manager	Isolates project dependencies, prevents version conflicts
Jupyter Notebook / JupyterLab	Interactive computation	Run code cell-by-cell, visualize data inline, document analysis
VS Code	Full IDE with Python extension	Best for production code, debugging, Git integration
Google Colab	Free cloud Jupyter + GPU	No setup needed, free GPU/TPU for training
Git + GitHub	Version control	Track experiments, collaborate, build portfolio

Essential ML Python Libraries

Installation — One Command

pip install numpy pandas matplotlib seaborn scikit-learn \
           xgboost lightgbm scipy jupyterlab plotly

Library	Purpose	Key Use
NumPy	Numerical computing	Arrays, matrix ops, linear algebra
Pandas	Data manipulation	DataFrames, CSV loading, groupby, merge
Matplotlib	Plotting	Line charts, bar charts, any customizable plot
Seaborn	Statistical visualization	Beautiful statistical plots built on Matplotlib
Scikit-Learn	ML algorithms	Models, pipelines, preprocessing, metrics
XGBoost / LightGBM	Gradient boosting	State-of-the-art tabular ML
SciPy	Scientific computing	Statistical tests, optimization, distributions
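After installing, a quick check of what actually landed in the active environment can save debugging time later. This sketch uses only the standard library; the package list mirrors the install command, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used by pip (note: scikit-learn, not sklearn)
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn", "scikit-learn",
            "xgboost", "lightgbm", "scipy", "jupyterlab", "plotly"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report

report = installed_versions(PACKAGES)
for name, ver in report.items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```

Run it from the same interpreter or kernel you plan to use for notebooks, since each environment has its own set of installed packages.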

Common mistakes

Installing everything in base Python without virtual environments.
Version conflicts between TensorFlow and PyTorch in one env without need.
Skipping Git — losing experiments and non-reproducible results.

Interview checkpoints

Q: conda vs pip? A: conda manages environments + binary deps; pip installs Python packages (often use both).
Q: Why Jupyter for ML? A: Iterative EDA, inline plots, narrative + code together.
Q: Colab limitations? A: Session limits, data privacy, less control than local/proper CI.

Practice

Basic: Create a conda env `ml100` with Python 3.11 and install numpy, pandas, and scikit-learn (the package is named `scikit-learn`; `sklearn` is only the import name).
Intermediate: Export `conda env export > environment.yml` and recreate on another machine.
Advanced: Set up VS Code + Jupyter kernel linked to your env; run a one-cell smoke test.

Recap

Isolate projects with conda/venv.
Core stack: NumPy → Pandas → viz → scikit-learn.
Version control experiments from day one.

Next: Day 6 — NumPy

NumPy Essentials for ML

Why this matters

NumPy is the array layer under Pandas and sklearn, and PyTorch and TensorFlow tensors follow the same array model. Weak NumPy = slow loops and confused matrix math in interviews.

NumPy is the foundation of scientific computing in Python. Pandas and Scikit-Learn are built on NumPy arrays (ndarray), while PyTorch and TensorFlow use their own tensor types that closely mirror the ndarray interface and convert to and from it. Understanding NumPy makes you faster and more efficient.

Core Concepts

Code Example

import numpy as np

# Creating arrays
a = np.array([1, 2, 3, 4, 5])          # 1D array (vector)
b = np.array([[1,2,3],[4,5,6]])         # 2D array (matrix)
c = np.zeros((3, 4))                    # 3×4 matrix of zeros
d = np.ones((2, 3))                     # 2×3 matrix of ones
e = np.arange(0, 10, 2)                # [0, 2, 4, 6, 8]
f = np.linspace(0, 1, 5)              # [0.0, 0.25, 0.5, 0.75, 1.0]
g = np.random.randn(3, 3)             # 3×3 matrix of random normal values

# Shape and dimensions
print(b.shape)   # (2, 3)
print(b.ndim)    # 2
print(b.dtype)   # int64 on most platforms (the default integer type is platform-dependent)

# Indexing & Slicing
arr = np.array([[10, 20, 30], [40, 50, 60]])
print(arr[0, 1])      # 20
print(arr[:, 1])      # [20, 50] — entire second column
print(arr[0, :])      # [10, 20, 30] — entire first row
print(arr[arr > 25])  # [30, 40, 50, 60] — boolean masking

# Vectorized operations (no loops needed!)
x = np.array([1, 2, 3, 4])
print(x * 2)         # [2, 4, 6, 8]
print(x ** 2)        # [1, 4, 9, 16]
print(np.sqrt(x))    # [1., 1.414, 1.732, 2.]
print(np.log(x))     # natural log of each element

Essential Operations for ML

Code Example

import numpy as np

# Matrix multiplication (dot product) — core of neural nets
A = np.array([[1,2],[3,4]])   # 2×2
B = np.array([[5,6],[7,8]])   # 2×2
print(A @ B)                  # matrix multiply: [[19,22],[43,50]]
print(np.dot(A, B))           # same result

# Transpose
print(A.T)                    # [[1,3],[2,4]]

# Statistics — key for EDA
data = np.array([2, 5, 7, 1, 8, 3, 9])
print(np.mean(data))     # 5.0
print(np.median(data))   # 5.0
print(np.std(data))      # standard deviation
print(np.var(data))      # variance
print(np.min(data), np.max(data))  # 1, 9
print(np.percentile(data, [25, 50, 75]))  # quartiles

# Reshaping
arr = np.arange(12)
matrix = arr.reshape(3, 4)     # 3 rows × 4 columns
flat = matrix.flatten()        # back to 1D

# Stacking arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.vstack([a, b]))   # vertical stack → 2×3
print(np.hstack([a, b]))   # horizontal stack → [1,2,3,4,5,6]

🚀

Vectorization vs For Loops

NumPy operations are implemented in C and run on entire arrays at once; this is called vectorization. A NumPy dot product on 1M elements is typically around 100× faster than a pure-Python for-loop, though the exact speedup depends on the operation and hardware. Avoid looping over NumPy arrays when a vectorized operation exists.
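To see what replacing a loop looks like, the sketch below standardizes the columns of a random matrix twice: once element by element, once with a single vectorized expression that relies on broadcasting. The data and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # 1000 samples, 3 features

# Loop version: standardize each column one element at a time
X_loop = np.empty_like(X)
for j in range(X.shape[1]):
    col_mean, col_std = X[:, j].mean(), X[:, j].std()
    for i in range(X.shape[0]):
        X_loop[i, j] = (X[i, j] - col_mean) / col_std

# Vectorized version: the (3,) mean and std broadcast across all 1000 rows
X_vec = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_loop, X_vec))
```

Both versions produce the same numbers; wrapping each in `time.perf_counter()` calls is an easy way to measure the gap on your own machine.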

Common mistakes

Python for-loops over rows instead of vectorized operations (often 100× or more slower).
Wrong shapes in matrix multiply (features vs samples axis confusion).
Mutable views vs copies — silent bugs when slicing arrays.
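The view-versus-copy and shape pitfalls are easiest to remember after seeing them once. A short sketch with arbitrary values:

```python
import numpy as np

a = np.arange(6)            # [0 1 2 3 4 5]
view = a[:3]                # basic slicing returns a view onto the same memory
view[0] = 99                # also changes a[0]

b = np.arange(6)
safe = b[:3].copy()         # an explicit copy owns its data
safe[0] = 99                # b is untouched

print(a[0], b[0])
print(np.shares_memory(a, view), np.shares_memory(b, safe))

# Shape check before a matrix multiply: (m, n) @ (n,) -> (m,)
X = np.ones((4, 3))         # 4 samples, 3 features
w = np.ones(3)
print((X @ w).shape)
```

Printing `.shape` before a multiply and calling `.copy()` when you intend to modify a slice independently catches most of these bugs early.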

Interview checkpoints

Q: Shape of X with m samples, n features? A: (m, n) in sklearn convention.
Q: Vectorization benefit? A: C-level loops in NumPy, SIMD, no Python per-element overhead.
Q: dot vs element-wise *? A: dot (or @) = matrix multiplication; * = element-wise (Hadamard) product on same-shape or broadcastable arrays.

Practice

Basic: Create 5×3 random matrix; compute column means without loops.
Intermediate: Implement batch dot product for two matrices; verify with np.dot.
Advanced: Benchmark loop vs vectorized mean on 1M-element array.

Recap

ndarray = fast homogeneous arrays.
Think in shapes and vectorization.
Broadcasting rules prevent many explicit loops.

Next: Day 7 — Pandas

Pandas Basics — The ML Workhorse

Why this matters

Much of real ML work is data wrangling (often estimated at 60% or more of project time). Pandas is how you load, clean, join, and aggregate before any model sees data.

Pandas is the most important library for ML practitioners working with real-world tabular data. Every step from data loading to feature engineering uses Pandas.

DataFrame Fundamentals

Code Example

import pandas as pd
import numpy as np

# ── Creating DataFrames ──────────────────────────────────────
# File paths below are placeholders; each reader returns a DataFrame
df = pd.read_csv('data.csv')                    # Load CSV
df = pd.read_excel('data.xlsx')                  # Load Excel
df = pd.read_json('data.json')                   # Load JSON

# Create from dictionary
df = pd.DataFrame({
    'Name':   ['Alice', 'Bob', 'Charlie', 'David'],
    'Age':    [25, 30, 35, 28],
    'Salary': [50000, 60000, 75000, 55000],
    'Dept':   ['HR', 'IT', 'Finance', 'IT']
})

# ── First Look at Data ────────────────────────────────────────
df.head(3)           # first 3 rows (default: 5)
df.tail(3)           # last 3 rows
df.shape             # (4, 4) — rows, cols
df.columns           # Index(['Name','Age','Salary','Dept'])
df.dtypes            # data type of each column
df.info()            # non-null count + dtype summary
df.describe()        # count, mean, std, min, quartiles, max

# ── Selecting Data ────────────────────────────────────────────
df['Age']                        # Series — single column
df[['Name', 'Salary']]           # DataFrame — multiple columns
df.loc[0]                        # Row by index label
df.iloc[0]                       # Row by integer position
df.loc[df['Dept'] == 'IT']       # Rows where Dept is 'IT'
df[df['Age'] > 28]               # Rows where Age > 28
df.loc[df['Dept']=='IT', 'Salary']  # IT employees' salaries only

Data Manipulation

Code Example

# ── Adding & Modifying Columns ──────────────────────────────
df['Bonus'] = df['Salary'] * 0.1
df['Seniority'] = df['Age'].apply(lambda x: 'Senior' if x >= 30 else 'Junior')

# ── Sorting ──────────────────────────────────────────────────
df.sort_values('Salary', ascending=False)
df.sort_values(['Dept', 'Age'], ascending=[True, False])

# ── Aggregation ──────────────────────────────────────────────
df['Salary'].mean()               # 60000.0
df.groupby('Dept')['Salary'].mean()   # avg salary per department
df.groupby('Dept').agg({'Salary': ['mean', 'max'], 'Age': 'mean'})

# ── Missing Values ───────────────────────────────────────────
df.isnull().sum()              # count nulls per column
df.isnull().sum() / len(df)   # null percentage
df.dropna()                    # drop rows with any null
df.fillna(df.mean(numeric_only=True))   # fill numeric nulls with column means
df['Age'] = df['Age'].fillna(df['Age'].median())   # assign back instead of inplace=True

# ── Filtering & Conditions ───────────────────────────────────
senior_it = df[(df['Dept'] == 'IT') & (df['Age'] >= 30)]
high_earners = df[df['Salary'].between(60000, 80000)]

Common mistakes

Not setting index or dtypes on load — silent object columns break models.
Chained indexing (`df[df.A>0]['B']=1`) causing SettingWithCopyWarning.
Merging without validating row counts — duplicate keys explode row count.
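Two of these mistakes have direct fixes in the pandas API: a single `.loc` assignment instead of chained indexing, and the `validate` argument of `merge`. The tables below are made up for illustration.

```python
import pandas as pd
from pandas.errors import MergeError

df = pd.DataFrame({"A": [-1, 2, 3], "B": [0, 0, 0]})
df.loc[df["A"] > 0, "B"] = 1          # single .loc call: filter rows and assign in one step

orders = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [10, 20, 30]})
customers = pd.DataFrame({"cust_id": [1, 2, 2], "city": ["Pune", "Delhi", "Goa"]})  # key 2 duplicated

# Without validation the duplicate key silently adds a row
merged = orders.merge(customers, on="cust_id", how="left")
print(len(orders), len(merged))

# validate= makes pandas check the expected key relationship and raise if it is violated
try:
    orders.merge(customers, on="cust_id", how="left", validate="many_to_one")
    duplicate_keys_detected = False
except MergeError:
    duplicate_keys_detected = True
print(duplicate_keys_detected)
```

Comparing row counts before and after a join, or passing `validate`, turns a silent data bug into an immediate error.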

Interview checkpoints

Q: loc vs iloc? A: loc = label-based; iloc = integer position.
Q: groupby mental model? A: Split-apply-combine: partition by key, aggregate/transform each group.
Q: Handle missing values in Pandas? A: dropna, fillna, interpolate — choice depends on MCAR/MAR/MNAR.

Practice

Basic: Load a CSV; show head, info(), describe(), missing counts.
Intermediate: groupby one categorical; compute mean of numeric cols per group.
Advanced: Merge two tables on key; assert no duplicate inflation; document join type.

Recap

DataFrame = labeled tables for ML pipelines.
Master selection, filtering, groupby, merge.
Always inspect dtypes and missingness after load.

Next: Day 8 — Pandas Advanced

Pandas Advanced — Merging, Pivoting, and Apply

Why this matters

Advanced Pandas (apply, pivot, time series, performance) separates analysts who script from engineers who ship reliable pipelines.

Merging DataFrames (SQL-style Joins)

Code Example

import pandas as pd

employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name':   ['Alice', 'Bob', 'Charlie', 'David'],
    'dept_id': [10, 20, 10, 30]
})
departments = pd.DataFrame({
    'dept_id':   [10, 20, 30],
    'dept_name': ['HR', 'IT', 'Finance']
})

# INNER JOIN — only rows that match in both
result = pd.merge(employees, departments, on='dept_id', how='inner')

# LEFT JOIN — keep all employees, nulls for missing dept
result = pd.merge(employees, departments, on='dept_id', how='left')

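# Merge on different column names (the left key is renamed here to show the pattern)
left_df = employees.rename(columns={'dept_id': 'emp_dept'})
result = pd.merge(
    left_df, departments,
    left_on='emp_dept', right_on='dept_id',
    how='inner'
)

# Concatenation: stack vertically (df1 and df2 share the same columns)
df1, df2 = employees.iloc[:2], employees.iloc[2:]
combined = pd.concat([df1, df2], ignore_index=True)

# Concatenation: stack horizontally (pieces are aligned on the shared index)
combined = pd.concat([employees[['emp_id', 'name']], employees[['dept_id']]], axis=1)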

Pivot Tables & GroupBy Advanced

Code Example

import pandas as pd

# Pivot Table — like Excel pivot
pivot = df.pivot_table(
    values='Salary',
    index='Dept',
    columns='Seniority',
    aggfunc='mean',
    fill_value=0
)

# GroupBy with transform — adds group stats back to original df
df['dept_avg_salary'] = df.groupby('Dept')['Salary'].transform('mean')
df['salary_diff_from_avg'] = df['Salary'] - df['dept_avg_salary']

# Apply — custom functions on columns or rows
def salary_band(x):
    if x < 50000: return 'Low'
    elif x < 70000: return 'Medium'
    else: return 'High'

df['salary_band'] = df['Salary'].apply(salary_band)

# Apply on multiple columns
df['name_age'] = df.apply(lambda row: f"{row['Name']} ({row['Age']})", axis=1)

String Operations & DateTime

Code Example

# String operations (vectorized — no loops!)
df['Name'].str.lower()
df['Name'].str.upper()
df['Name'].str.contains('a', case=False)  # boolean mask
df['email'].str.extract(r'@(.+)')         # extract domain (assumes df has an 'email' column)

# DateTime handling: critical for time-series ML (assumes df has a 'date_string' column)
df['date'] = pd.to_datetime(df['date_string'])
df['year']  = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek  # 0=Monday
df['is_weekend'] = df['day_of_week'] >= 5

Common mistakes

apply(axis=1) on huge DataFrames when vectorization possible.
Not using categorical dtype for low-cardinality strings (memory blowup).
Parsing dates late — breaks time-series features and sorting.
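As one possible way to avoid the first two mistakes, the salary banding done with `apply` can be expressed with `pd.cut`, which is vectorized and returns a categorical column. The salaries are made up, and `pd.cut` is a substitute chosen for this sketch rather than the only option.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Salary": [45000, 50000, 60000, 75000, 55000]})

def salary_band(x):
    if x < 50000:
        return "Low"
    elif x < 70000:
        return "Medium"
    return "High"

# Row-by-row Python calls
applied = df["Salary"].apply(salary_band)

# Vectorized binning with the same boundaries; right=False makes bins [low, high)
binned = pd.cut(
    df["Salary"],
    bins=[-np.inf, 50000, 70000, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)

print((binned.astype(str) == applied).all())
print(binned.dtype)     # categorical, which stores repeated labels compactly
```

On a handful of rows the difference is invisible; on millions of rows the vectorized form is usually much faster and uses less memory.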

Interview checkpoints

Q: When is apply OK? A: Complex row logic with no vectorized alternative; not default choice.
Q: melt vs pivot? A: pivot reshapes long data to wide; melt reshapes wide data to long (tidy) form for plotting/sklearn.
Q: read_csv speed tips? A: usecols, dtype, parse_dates, and chunksize for large files.

Practice

Basic: pivot_table: average tip by day and sex from the seaborn `tips` dataset.
Intermediate: Resample monthly sales from daily data with DatetimeIndex.
Advanced: Refactor a row-wise apply into vectorized ops; compare runtime.

Recap

Prefer vectorized ops over apply when possible.
Pivot/melt reshape data for modeling and viz.
Datetime handling is critical for time-series ML.

Next: Day 9 — ML Life Cycle

The Machine Learning Life Cycle

Why this matters

Extraordinary ML engineers optimize the full lifecycle — problem framing, EDA, deployment, monitoring — not just algorithm trivia.

Understanding the full ML project workflow separates ordinary practitioners from exceptional ML engineers. Each step is interconnected, and skipping any one step typically leads to poor real-world performance.

Standard ML Project Workflow

1. Problem Definition & Success Metrics
2. Data Collection & Understanding
3. Exploratory Data Analysis (EDA)
4. Data Preprocessing & Feature Engineering
5. Model Selection & Training
6. Evaluation & Hyperparameter Tuning
7. Deployment & Monitoring

Step 1: Problem Definition

Before touching any data, answer these questions:

Business goal: What decision will this model inform? (Reduce churn? Detect fraud?)
ML framing: Is this classification, regression, clustering, or ranking?
Success metric: What number defines success? (Precision > 95%? RMSE < 100?)
Baseline: What does the current non-ML solution achieve? (Always compare against this.)
Data availability: Do we have enough labeled data? Is it representative?
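The baseline question can be answered in code before any real modeling. This sketch uses scikit-learn's `DummyClassifier` on made-up labels with a 70/30 class split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up imbalanced labels: 70 negatives, 30 positives
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((len(y), 1))              # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)               # any real model must beat this number
```

On imbalanced data a high accuracy can be matched by always predicting the majority class, which is why the success metric and the baseline should be fixed together.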

Step 2–3: Data Collection & EDA

Spend 60–70% of your project time here. Most ML failures stem from data quality issues, not algorithm choice.

Profile each feature: distribution, range, missing values, unique values.
Detect and handle outliers and anomalies.
Understand feature-target correlations.
Check for train/test distribution shift.
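A first-pass profile of each feature takes only a few lines of pandas. The helper name `profile` and the toy table are assumptions for this sketch; real profiling would add distribution plots and outlier checks.

```python
import numpy as np
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, share of missing values, number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(),
    })

# Toy data (made up) with one missing value per column
df = pd.DataFrame({
    "age": [25, 30, np.nan, 28],
    "dept": ["HR", "IT", "IT", None],
})
summary = profile(df)
print(summary)
print(df.describe())      # distribution and range of the numeric columns
```

Running the same profile on the training and test splits and comparing the outputs is a simple first check for distribution shift.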

Step 5: Model Selection Framework

Scenario	Recommended Starting Point
Small tabular dataset (<10k)	Logistic Regression, Random Forest
Large tabular dataset	LightGBM / XGBoost
Image data	CNN (ResNet, EfficientNet)
Text/NLP	BERT, RoBERTa, or GPT fine-tuned
Time-series forecasting	LSTM, Prophet, N-BEATS
Needs interpretability	Decision Tree, Logistic Regression, SHAP on XGBoost

Common mistakes

Jumping to XGBoost before defining the business metric.
Training on all data without holdout — false confidence.
No monitoring plan after deployment (silent drift).

Interview checkpoints

Q: First step in an ML project? A: Define problem, metric, constraints, baseline — before modeling.
Q: CRISP-DM vs agile ML? A: CRISP-DM defines sequential phases (business understanding through deployment); agile ML iterates experiments in sprints. The two are often combined.
Q: What is a baseline model? A: Simple heuristic (mean, majority class) to beat before claiming ML value.

Practice

Basic: Draw the 7-step ML lifecycle for a churn prediction product.
Intermediate: Write success metrics (business + ML) for a recommendation feature.
Advanced: Design monitoring: what alerts fire when fraud model drift occurs?

Recap

Lifecycle: problem → data → EDA → model → evaluate → deploy → monitor.
Align ML metrics with business outcomes.
Baselines and iteration beat one-shot perfect models.

Next: Day 10 — scikit-learn

Scikit-Learn Introduction — The Unified API

Why this matters

scikit-learn's unified API (fit/predict/transform) is the industry standard for classical ML. Master it once, use dozens of algorithms.

Scikit-Learn is the most important library for classical ML. Its genius is a consistent, unified API across all algorithms — if you know how to use one model, you know how to use all of them.

The 3-Step Pattern (Estimator API)

Every Scikit-Learn estimator follows the exact same interface:

Universal Pattern — Works for ALL sklearn models

from sklearn.linear_model import LinearRegression   # import
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# ── Step 1: Instantiate ─────────────────────────────────────
model = RandomForestClassifier(n_estimators=100, random_state=42)

# ── Step 2: Fit (train) ──────────────────────────────────────
model.fit(X_train, y_train)

# ── Step 3: Predict ──────────────────────────────────────────
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)  # for classifiers

A Complete ML Pipeline in 30 Lines

Code Example

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# ── 1. Load Data ─────────────────────────────────────────────
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target  # 0 = malignant, 1 = benign

# ── 2. Split ─────────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ── 3. Preprocess ────────────────────────────────────────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn stats + scale
X_test_scaled  = scaler.transform(X_test)         # only scale (don't refit!)

# ── 4. Train ─────────────────────────────────────────────────
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# ── 5. Evaluate ──────────────────────────────────────────────
y_pred = model.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# ── 6. Feature Importance ────────────────────────────────────
importances = pd.Series(model.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(10))

⚠️

Critical: fit_transform vs transform

Always call fit_transform() on training data only, then transform() (not fit_transform) on the test set. Fitting on test data causes data leakage — the model indirectly "sees" test statistics during training, inflating performance metrics.
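The difference is visible in the scaler's stored statistics. This sketch uses synthetic train and test arrays; the sizes and distribution are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # mean and std come from the training set
X_test_scaled = scaler.transform(X_test)         # reuse those statistics, no refit

print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
print(X_test_scaled.mean(axis=0))                # near 0 but not exactly 0, as expected
```

The scaled test set not having a mean of exactly zero is the correct outcome: it shows the test data was treated as unseen.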

Key Scikit-Learn Modules

Module	Purpose	Key Classes
`sklearn.model_selection`	Data splitting, cross-validation, tuning	train_test_split, KFold, GridSearchCV, cross_val_score
`sklearn.preprocessing`	Scaling, encoding, transforms	StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
`sklearn.linear_model`	Linear models	LinearRegression, LogisticRegression, Ridge, Lasso
`sklearn.ensemble`	Ensemble methods	RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
`sklearn.metrics`	Evaluation metrics	accuracy_score, f1_score, roc_auc_score, mean_squared_error
`sklearn.pipeline`, `sklearn.compose`	Chain preprocessing + model	Pipeline, make_pipeline (pipeline); ColumnTransformer (compose)
`sklearn.datasets`	Built-in toy datasets and generators	load_iris, load_breast_cancer, make_classification, make_regression

Common mistakes

fit_transform on test data — data leakage.
Not using Pipeline + ColumnTransformer for mixed feature types.
Tuning on test set instead of validation/CV.
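A minimal sketch of the Pipeline plus ColumnTransformer pattern on mixed feature types. The eight-row table, the column names, and the choice of `LogisticRegression` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy mixed-type data (made up for illustration)
X = pd.DataFrame({
    "age":    [25, 30, 35, 28, 45, 50, 23, 41],
    "salary": [50, 60, 75, 55, 90, 95, 40, 85],
    "dept":   ["HR", "IT", "Finance", "IT", "Finance", "IT", "HR", "Finance"],
})
y = [0, 0, 1, 0, 1, 1, 0, 1]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "salary"]),                # scale numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["dept"]),   # encode the categorical column
])
pipe = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])

pipe.fit(X, y)                      # scaler and encoder are fitted inside the pipeline
preds = pipe.predict(X)
print(preds.shape)
```

Because preprocessing lives inside the estimator, passing `pipe` to `cross_val_score` or `GridSearchCV` refits the scaler and encoder on each training fold, which keeps validation data out of the fitted statistics.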

Interview checkpoints

Q: fit vs transform vs fit_transform? A: fit learns params; transform applies; fit_transform only on train.
Q: Why Pipeline? A: Prevents leakage, bundles preprocess+model, enables CV on full flow.
Q: random_state purpose? A: Reproducible splits and stochastic algorithms.

Practice

Basic: Train RandomForest on breast cancer; report accuracy.
Intermediate: Add StandardScaler in Pipeline; compare with/without scaling on logistic regression.
Advanced: Build ColumnTransformer for numeric + categorical features on a mixed dataset.

Recap

Estimator API: instantiate → fit → predict.
Never leak test info into preprocessing.
Pipelines are production and interview best practice.

Next: Module 2 — Exploratory Data Analysis →
