Module 1: Machine Learning Foundations
100 Days of ML Module 1 — Master Machine Learning foundations: what is ML, AI vs ML vs DL, types of learning, Python setup, NumPy, Pandas, and the ML Life Cycle.
This module lays the conceptual bedrock of Machine Learning. By the end you will understand exactly what ML is, when to use it, how it differs from traditional programming, the key types of learning, and how to set up your Python data science environment.
What is Machine Learning?
Why this matters
Every ML career path starts here: you must know when learning beats hand-coded rules, and how the ML product lifecycle differs from writing one-off scripts.
Formal Definition
Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to learn from data, without being explicitly programmed. — Arthur Samuel, 1959.
Traditional Programming vs Machine Learning
| Aspect | Traditional Programming | Machine Learning |
|---|---|---|
| Input | Data + Hand-written Logic/Rules | Data + Expected Outputs (Labels) |
| Output | Answers | A learned Model / Program |
| Logic | Written explicitly by programmer | Discovered automatically by algorithm |
| Adaptability | Breaks when rules change | Retrain on new data → model adapts |
| Example | if "Congratulations" in email: spam = True | Train on 50,000 labeled emails → model learns all spam patterns |
When Should You Use Machine Learning?
ML is the right tool when:
- Too many rules to code: Spam classification, fraud detection — the rules are nearly infinite and constantly shifting.
- Rules impossible to enumerate: Image/face recognition — no human can write rules for all dog breeds, lighting, angles.
- Data mining: Finding hidden patterns in large datasets (recommendation systems, customer segmentation).
- Dynamic environments: Personalization systems that must adapt to changing user behavior in real-time.
Don't Use ML When…
The problem has simple deterministic rules (calculating a square root, sorting a list), you have very little data, or you need 100% explainability and zero tolerance for error (e.g., surgical equipment control).
Why ML is Booming Now
ML has existed since the 1950s but only recently became practical due to three converging factors:
- Big Data: Internet, social media, and IoT devices generate petabytes of data daily, some of it labeled and much of it usable for training.
- Hardware: GPUs and TPUs enabling parallel matrix operations — training that took months now takes hours.
- Better Algorithms: Deep learning, transformers, and refined optimization techniques.
Worked example — Spam vs rules
Rules: If email contains "winner" OR "free money" → spam. Spammers rename tokens; you patch rules weekly.
ML: Train on 50k labeled emails. Model learns combinations of words, headers, and senders. When spammers adapt, retrain on new labels — no manual rule explosion.
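The contrast in this worked example can be sketched in a few lines. This is a minimal illustration, not a production filter: the six emails and their labels are made up for the sketch, and `CountVectorizer` plus `MultinomialNB` from scikit-learn stand in for whatever model a real system would use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hand-written rule from the worked example: brittle once spammers rename tokens.
def rule_based_is_spam(text: str) -> bool:
    lowered = text.lower()
    return "winner" in lowered or "free money" in lowered

# Tiny illustrative dataset (invented for this sketch; 1 = spam, 0 = not spam).
emails = [
    "You are a winner claim your free money now",
    "Free money waiting for you click here",
    "Claim your prize now click this link",
    "Meeting moved to 3pm see agenda attached",
    "Can you review my pull request today",
    "Lunch tomorrow with the project team",
]
labels = [1, 1, 1, 0, 0, 0]

# Learn word-count patterns from the labeled examples.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

# A reworded spam message that avoids the exact tokens in the rule.
new_email = "Claim your prize click here now"
rule_verdict = rule_based_is_spam(new_email)
model_verdict = int(model.predict(vectorizer.transform([new_email]))[0])

print("rule says spam: ", rule_verdict)
print("model says spam:", bool(model_verdict))
```

The rule misses the reworded message because neither trigger phrase appears, while the learned model still flags it from word combinations seen in training. Adapting to new tactics means adding labeled examples and refitting, not writing another rule.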
ML Job Market Context
ML talent has been scarce relative to demand. Salaries tend to be favorable during a growth phase and are likely to normalize as supply catches up. Learning the full ML lifecycle (not just algorithms) is what separates ordinary from strong ML engineers.
Common mistakes
- Using ML for deterministic problems (tax calculation, sorting) — traditional code is cheaper and exact.
- Expecting magic without data — no labels, no volume, no signal means no reliable model.
- Confusing training success with business success — high offline accuracy does not guarantee production value.
Interview checkpoints
- Q: Define ML in one sentence. A: Systems that improve performance on a task by learning patterns from data rather than explicit rules.
- Q: Spam filter: rules or ML? A: ML — adversaries change patterns; rules rot quickly.
- Q: Why did ML explode after 2010? A: Data scale + compute (GPU) + better algorithms converged.
Practice
- Basic: List 3 problems in your domain where rules would be painful to maintain.
- Intermediate: Draw the traditional-programming vs ML flow for one use case.
- Advanced: Write a 1-page ML problem brief: objective, data source, success metric, constraints.
Recap
- ML learns patterns from data; it does not encode every rule manually.
- Use ML when cases are too many, too fuzzy, or hidden in data.
- Success today needs data, hardware, and sound ML lifecycle thinking.
Next: Day 2 — AI vs ML vs DL
AI vs ML vs Deep Learning
Why this matters
Interviewers and stakeholders constantly conflate AI, ML, and DL. Clear vocabulary prevents wrong architecture choices (e.g., using deep learning on tiny tabular data).
Relationship: AI ⊃ ML ⊃ Deep Learning
| Term | Definition | Examples | Key Techniques |
|---|---|---|---|
| Artificial Intelligence (AI) | Simulating human intelligence in machines — any technique enabling machines to mimic human behavior | Chess engines, expert systems, route planners | Rule-based systems, Search, Logic |
| Machine Learning (ML) | Subset of AI — systems that learn from data without being explicitly programmed | Email spam filter, credit scoring, recommendation engines | Linear models, Trees, SVM, Clustering |
| Deep Learning (DL) | Subset of ML using multi-layered neural networks that automatically learn hierarchical feature representations | Face recognition, GPT, DALL-E, AlphaFold | CNNs, RNNs, Transformers, Diffusion |
Why Deep Learning Now?
- Traditional ML requires manual feature engineering — domain experts hand-craft features (e.g., "count capital letters in email").
- Deep Learning does automatic feature learning — the network learns its own features from raw pixels, raw text, raw audio.
- DL outperforms classical ML when you have large amounts of data and GPU compute.
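To make "manual feature engineering" concrete, here is a small sketch of the kind of hand-crafted features a domain expert might write for a spam model. The feature names and the sample email are illustrative assumptions, not part of any library.

```python
def handcrafted_email_features(text: str) -> dict:
    """Manually engineered features for a classical ML spam model (illustrative)."""
    words = text.split()
    return {
        "n_capital_letters": sum(ch.isupper() for ch in text),
        "n_exclamations": text.count("!"),
        "n_words": len(words),
        "has_url": int("http://" in text or "https://" in text),
    }

features = handcrafted_email_features("WIN cash NOW!!! Visit http://example.com")
print(features)
```

A classical model would be trained on these four numbers per email. A deep learning model would instead take the raw text and learn its own internal representation, which is why it typically needs far more data.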
Which to Use?
Tabular structured data (CSV files): XGBoost / LightGBM usually wins over deep learning.
Images, Text, Audio: Deep Learning (CNNs, Transformers) dominates.
Small data (<10k samples): Classical ML + feature engineering is safer.
Common mistakes
- Calling every neural network project "AI" without specifying the learning paradigm.
- Choosing deep learning first on small tabular datasets where gradient boosting wins.
- Ignoring that classical ML still powers most production tabular systems.
Interview checkpoints
- Q: Is GPT "AI" or "ML"? A: Both. It is a deep learning model (a Transformer), so it is ML (learned from data) within the broader AI goal of intelligent behavior.
- Q: When is DL overkill? A: Small data, need interpretability, or simple structured features.
- Q: Feature engineering — classical ML or DL? A: Classical ML needs it; DL learns features from raw inputs.
Practice
- Basic: Classify 5 products (spam filter, chatbot, fraud score, house price, face unlock) into AI/ML/DL buckets.
- Intermediate: For a tabular churn dataset (50k rows), justify ML family choice.
- Advanced: Compare bias-variance and data needs for logistic regression vs a small MLP.
Recap
- AI ⊃ ML ⊃ DL — each level adds data-driven learning depth.
- Tabular → often classical ML; images/text/audio → often DL.
- Terminology clarity saves months of wrong tooling.
Next: Day 3 — Types of ML
Types of Machine Learning
Why this matters
Choosing supervised vs unsupervised vs reinforcement learning defines your entire project: data labeling budget, metrics, and deployment loop.
1. Supervised Learning
The algorithm is trained on a labeled dataset — every training example has an input $X$ and a known correct output $y$. The model learns a mapping $f: X \to y$.
| Task Type | Output | Examples | Algorithms |
|---|---|---|---|
| Classification | Discrete category / class label | Spam detection, disease diagnosis, sentiment analysis | Logistic Regression, SVM, Random Forest, XGBoost |
| Regression | Continuous numeric value | House price prediction, sales forecasting, temperature prediction | Linear Regression, Ridge, GBDT, SVR |
2. Unsupervised Learning
No labels are provided. The algorithm discovers inherent structure, patterns, or groupings in the data on its own.
- Clustering: K-Means, DBSCAN, Hierarchical — group similar data points together (customer segments).
- Dimensionality Reduction: PCA, t-SNE, UMAP — compress high-dimensional data into fewer features while preserving structure.
- Anomaly Detection: Isolation Forest, One-Class SVM — find outliers (fraud, manufacturing defects).
- Density Estimation: Gaussian Mixture Models — model the probability distribution of data.
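As a minimal sketch of clustering without labels, the example below runs scikit-learn's `KMeans` on two synthetic, well-separated groups. The data is generated just for this illustration; real customer data is rarely this clean.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic groups; no labels are given to the algorithm.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Ask for 2 clusters; KMeans assigns each point an arbitrary cluster id (0 or 1).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print("cluster ids of first group: ", set(cluster_ids[:50].tolist()))
print("cluster ids of second group:", set(cluster_ids[50:].tolist()))
print("cluster centers:\n", kmeans.cluster_centers_)
```

The cluster ids carry no meaning on their own. Deciding what each cluster represents (and whether two clusters is the right number) is a business and validation question, not something the algorithm answers.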
3. Semi-Supervised Learning
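A middle ground: a small amount of labeled data plus a large amount of unlabeled data. The model uses unlabeled data to learn structure, then refines with labeled examples. Used where labels are expensive, such as medical imaging; the pre-train-then-fine-tune recipe in NLP (e.g., BERT) follows the same idea, although its pre-training step is self-supervised.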
4. Reinforcement Learning (RL)
An agent learns by interacting with an environment. It takes actions, receives rewards or penalties, and learns a policy that maximizes cumulative reward over time.
Examples: AlphaGo, ChatGPT RLHF, trading bots, robotics control
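The action-reward loop can be sketched with the simplest special case of RL, a multi-armed bandit, which has actions and rewards but no environment state. The reward probabilities, the epsilon value, and the number of steps below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hidden from the agent: probability that each action pays a reward (made-up values).
true_reward_prob = np.array([0.2, 0.5, 0.8])
n_arms = len(true_reward_prob)

q_estimates = np.zeros(n_arms)            # agent's running value estimate per action
pull_counts = np.zeros(n_arms, dtype=int)
epsilon = 0.1                             # fraction of steps spent exploring
total_reward = 0.0

for step in range(5000):
    if rng.random() < epsilon:
        action = int(rng.integers(n_arms))       # explore: try a random action
    else:
        action = int(np.argmax(q_estimates))     # exploit: best action so far
    reward = float(rng.random() < true_reward_prob[action])
    pull_counts[action] += 1
    # Incremental mean update of the value estimate for the chosen action.
    q_estimates[action] += (reward - q_estimates[action]) / pull_counts[action]
    total_reward += reward

print("estimated values:", np.round(q_estimates, 2))
print("pulls per action:", pull_counts)
print("average reward:  ", round(total_reward / 5000, 3))
```

The policy here is epsilon-greedy: mostly pick the action with the highest estimated value, occasionally explore. Full RL methods extend the same idea to sequences of states where rewards may arrive long after the action that earned them.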
5. Self-Supervised Learning
A special case of unsupervised learning where the model generates its own labels from the data. GPT models are pre-trained by predicting the next word — no human labeling needed. BERT is pre-trained by masking random words and predicting them.
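How a model "generates its own labels" can be shown without any ML library. The sketch below builds training pairs from raw text in the two styles mentioned above. Whitespace tokenization, the `context_size` parameter, and the `[MASK]` string are simplifying assumptions; real models use subword tokenizers and much longer contexts.

```python
def next_token_pairs(text: str, context_size: int = 3):
    """GPT-style objective: predict the next token from the preceding context."""
    tokens = text.split()
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]
        target = tokens[i]
        pairs.append((context, target))
    return pairs

def masked_example(text: str, mask_index: int, mask_token: str = "[MASK]"):
    """BERT-style objective: hide one token and use it as the label."""
    tokens = text.split()
    target = tokens[mask_index]
    tokens[mask_index] = mask_token
    return tokens, target

sentence = "the cat sat on the mat"
pairs = next_token_pairs(sentence)
masked_tokens, masked_target = masked_example(sentence, mask_index=2)

for context, target in pairs:
    print(context, "->", target)
print(masked_tokens, "->", masked_target)
```

Every label comes from the text itself, so any unlabeled corpus becomes training data.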
Common mistakes
- Treating clustering output as ground truth without business validation.
- Using classification metrics on regression problems (or vice versa).
- Forgetting that semi-supervised and self-supervised exist for label-scarce settings.
Interview checkpoints
- Q: Customer segmentation — supervised or unsupervised? A: Usually unsupervised (clustering) unless you have defined segments as labels.
- Q: What is RLHF? A: Reinforcement learning from human feedback to align LLM behavior.
- Q: Self-supervised example? A: BERT masked language modeling; next-token prediction in GPT.
Practice
- Basic: Match 6 scenarios to supervised / unsupervised / RL.
- Intermediate: Design a label strategy for a new fraud product (what is y?).
- Advanced: Explain when semi-supervised beats pure supervised for 1M unlabeled + 5k labeled samples.
Recap
- Supervised needs labels; unsupervised finds structure; RL optimizes sequential rewards.
- Pick the paradigm before picking an algorithm.
- Many modern NLP/CV systems combine self-supervised pretraining + supervised fine-tuning.
Batch vs Online Learning · Instance vs Model-Based
Why this matters
Production systems fail when teams train offline but need online adaptation (or the reverse). This day sets your deployment and retraining strategy.
Batch Learning (Offline Learning)
The model is trained once on the entire available dataset, then deployed. To adapt to new data, you must retrain from scratch on the full updated dataset and redeploy.
- Pros: Simpler, stable performance, full data access during training.
- Cons: Can't adapt in real-time, expensive to retrain if dataset is huge, stale model between retrains.
- Use when: Data changes slowly, resources exist for full retraining (e.g., monthly product recommendation updates).
Online Learning (Incremental Learning)
The model is updated continuously as new data arrives — either one sample at a time or in small mini-batches (mini-batch learning). The learning rate $\eta$ (eta) controls how fast the model adapts.
- Pros: Adapts to new patterns immediately, memory-efficient (old data can be discarded), handles concept drift.
- Cons: If bad data arrives, model quality degrades quickly (requires data validation). Harder to debug.
- Use when: Real-time systems (stock trading, fraud detection, search ranking), very large datasets that won't fit in memory.
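A minimal sketch of online learning, assuming a noise-free stream where the hidden relationship is y = 1 + 2x. The model sees one sample at a time, takes a single gradient step scaled by the learning rate `eta`, and then discards the sample. The stream generator and the specific values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1                  # learning rate: how strongly each sample moves the model
theta = np.zeros(2)        # parameters [bias, weight], starting from zero

def stream(n_samples):
    """Simulated data stream (hypothetical); yields one (x, y) pair at a time."""
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        y = 1.0 + 2.0 * x
        yield x, y

for x, y in stream(2000):
    features = np.array([1.0, x])      # constant 1.0 lets theta[0] act as the bias
    error = y - theta @ features       # prediction error on this single sample
    theta += eta * error * features    # one SGD step; the sample is not stored

print("learned [bias, weight]:", np.round(theta, 3))
```

A larger `eta` adapts faster but makes the model more sensitive to bad or noisy samples, which is the trade-off behind the "requires data validation" warning above. In scikit-learn, estimators that expose `partial_fit` (for example `SGDClassifier`) follow the same incremental pattern.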
Concept Drift
Concept drift means the relationship between inputs and outputs changes over time; a shift in the input distribution alone is usually called data drift (covariate shift). For example, spam tactics change weekly, so a model trained in January may perform poorly by March. Online learning plus monitoring addresses this.
Instance-Based Learning
The model memorizes training examples and makes predictions by comparing new inputs to stored examples using a similarity/distance measure. No explicit model parameters are learned.
Example — K-Nearest Neighbors (KNN): To classify a new point, find the $k$ most similar training points and take a majority vote.
$$d(p, q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}$$
Model-Based Learning
The algorithm learns explicit model parameters (e.g., weights, thresholds, split rules) from the training data. Predictions use only these learned parameters — training data is not needed at inference time.
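Example (Linear Regression): Learns parameters $\theta_0$ (bias) and $\theta_1, \theta_2, …$ (weights). Once trained, a new prediction is just a dot product, $\hat{y} = \theta^T x$, where $x$ is the feature vector with a constant $x_0 = 1$ prepended so that $\theta_0$ acts as the bias.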
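The difference between the two approaches shows up at prediction time. In this sketch, the KNN function needs the full training set for every query (using the Euclidean distance $d(p, q)$), while the linear model needs only its learned $\theta$. The toy datasets are invented for illustration, and least squares via `np.linalg.lstsq` is one of several ways to fit the linear model.

```python
import numpy as np

# Instance-based: KNN keeps the training data and compares at query time.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array([0, 0, 1, 1])

def knn_predict(x_new, X, y, k=3):
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))   # Euclidean d(p, q)
    nearest = np.argsort(distances)[:k]                    # indices of k closest points
    return int(np.bincount(y[nearest]).argmax())           # majority vote

knn_label = knn_predict(np.array([5.5, 5.0]), X_train, y_train, k=3)

# Model-based: linear regression compresses the data into parameters theta.
x_reg = np.array([0.0, 1.0, 2.0, 3.0])
y_reg = np.array([1.0, 3.0, 5.0, 7.0])                     # generated by y = 1 + 2x
design = np.column_stack([np.ones_like(x_reg), x_reg])     # prepend x_0 = 1 for the bias
theta, *_ = np.linalg.lstsq(design, y_reg, rcond=None)

def linear_predict(x_new, theta):
    return float(theta @ np.array([1.0, x_new]))           # y_hat = theta^T x

linear_prediction = linear_predict(4.0, theta)

print("KNN label:", knn_label)
print("theta:", np.round(theta, 3), "prediction at x=4:", round(linear_prediction, 3))
```

After fitting, `x_reg` and `y_reg` could be deleted and `linear_predict` would still work. `knn_predict` cannot run without `X_train`, which is why KNN inference cost grows with the size of the training set.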
Common mistakes
- Using batch-trained models for rapidly drifting fraud without monitoring.
- Online updating on unvalidated streams — one bad batch poisons the model.
- Confusing instance-based (KNN) latency at scale with model-based inference cost.
Interview checkpoints
- Q: What is concept drift? A: Input/output relationship changes over time; old data misleads the model.
- Q: KNN — instance or model-based? A: Instance-based; stores data, compares at query time.
- Q: When prefer mini-batch online learning? A: For large streams where single-sample SGD updates are too noisy and full-batch retraining is too expensive; mini-batches trade off the two.
Practice
- Basic: Label each system batch vs online: monthly recommender, live click ranking, yearly census model.
- Intermediate: Sketch a retraining policy for a drift-prone classifier.
- Advanced: Compare inference cost of KNN vs linear model for 10M training points.
Recap
- Batch = retrain on full data; online = incremental updates.
- Instance-based memorizes; model-based learns parameters.
- Match learning mode to how fast your world changes.
Next: Day 5 — Python Setup
Python Development Setup for ML
Why this matters
Broken environments are one of the most common sources of beginner frustration. A reproducible stack (conda + Jupyter + core libs) is non-negotiable for serious ML work.
Recommended Tools
| Tool | Purpose | Why It Matters |
|---|---|---|
| Anaconda / Miniconda | Python environment manager | Isolates project dependencies, prevents version conflicts |
| Jupyter Notebook / JupyterLab | Interactive computation | Run code cell-by-cell, visualize data inline, document analysis |
| VS Code | Full IDE with Python extension | Best for production code, debugging, Git integration |
| Google Colab | Free cloud Jupyter + GPU | No setup needed, free GPU/TPU for training |
| Git + GitHub | Version control | Track experiments, collaborate, build portfolio |
Essential ML Python Libraries
pip install numpy pandas matplotlib seaborn scikit-learn \
xgboost lightgbm scipy jupyterlab plotly
| Library | Purpose | Key Use |
|---|---|---|
| NumPy | Numerical computing | Arrays, matrix ops, linear algebra |
| Pandas | Data manipulation | DataFrames, CSV loading, groupby, merge |
| Matplotlib | Plotting | Line charts, bar charts, any customizable plot |
| Seaborn | Statistical visualization | Beautiful statistical plots built on Matplotlib |
| Scikit-Learn | ML algorithms | Models, pipelines, preprocessing, metrics |
| XGBoost / LightGBM | Gradient boosting | State-of-the-art tabular ML |
| SciPy | Scientific computing | Statistical tests, optimization, distributions |
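After installing, a quick smoke test can confirm which libraries the active environment can actually see. This sketch uses only the standard library; the module list is an assumption you can edit. Note that import names can differ from pip names (scikit-learn is imported as `sklearn`).

```python
import importlib.util
import sys

# Import names, not pip package names.
REQUIRED_MODULES = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "scipy"]

def check_environment(modules):
    """Return {module_name: True/False} depending on whether it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

report = check_environment(REQUIRED_MODULES)

print(f"Python {sys.version_info.major}.{sys.version_info.minor} at {sys.executable}")
for name, installed in report.items():
    print(f"{name:12s} {'OK' if installed else 'MISSING'}")
```

Printing `sys.executable` is useful when a notebook kernel points at a different interpreter than the one you installed into, which is a common cause of "module not found" errors.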
Common mistakes
- Installing everything in base Python without virtual environments.
- Version conflicts between TensorFlow and PyTorch in one env without need.
- Skipping Git — losing experiments and non-reproducible results.
Interview checkpoints
- Q: conda vs pip? A: conda manages environments + binary deps; pip installs Python packages (often use both).
- Q: Why Jupyter for ML? A: Iterative EDA, inline plots, narrative + code together.
- Q: Colab limitations? A: Session limits, data privacy, less control than local/proper CI.
Practice
- Basic: Create a conda env `ml100` with Python 3.11 and install numpy, pandas, scikit-learn.
- Intermediate: Export `conda env export > environment.yml` and recreate on another machine.
- Advanced: Set up VS Code + Jupyter kernel linked to your env; run a one-cell smoke test.
Recap
- Isolate projects with conda/venv.
- Core stack: NumPy → Pandas → viz → scikit-learn.
- Version control experiments from day one.
Next: Day 6 — NumPy
NumPy Essentials for ML
Why this matters
NumPy is the array layer under Pandas and sklearn, and PyTorch and TensorFlow tensors follow the same conventions. Weak NumPy = slow loops and confused matrix math in interviews.
NumPy is the foundation of scientific computing in Python. Pandas and Scikit-Learn are built on NumPy arrays (ndarray), while PyTorch and TensorFlow use their own tensor types that mirror the ndarray API and convert to and from it. Understanding NumPy makes you faster and more efficient.
Core Concepts
import numpy as np
# Creating arrays
a = np.array([1, 2, 3, 4, 5]) # 1D array (vector)
b = np.array([[1,2,3],[4,5,6]]) # 2D array (matrix)
c = np.zeros((3, 4)) # 3×4 matrix of zeros
d = np.ones((2, 3)) # 2×3 matrix of ones
e = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
f = np.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
g = np.random.randn(3, 3) # 3×3 matrix of random normal values
# Shape and dimensions
print(b.shape) # (2, 3)
print(b.ndim) # 2
print(b.dtype) # int64 on most platforms (can be int32 on some systems)
# Indexing & Slicing
arr = np.array([[10, 20, 30], [40, 50, 60]])
print(arr[0, 1]) # 20
print(arr[:, 1]) # [20, 50] — entire second column
print(arr[0, :]) # [10, 20, 30] — entire first row
print(arr[arr > 25]) # [30, 40, 50, 60] — boolean masking
# Vectorized operations (no loops needed!)
x = np.array([1, 2, 3, 4])
print(x * 2) # [2, 4, 6, 8]
print(x ** 2) # [1, 4, 9, 16]
print(np.sqrt(x)) # [1., 1.414, 1.732, 2.]
print(np.log(x)) # natural log of each element
Essential Operations for ML
import numpy as np
# Matrix multiplication (dot product) — core of neural nets
A = np.array([[1,2],[3,4]]) # 2×2
B = np.array([[5,6],[7,8]]) # 2×2
print(A @ B) # matrix multiply: [[19,22],[43,50]]
print(np.dot(A, B)) # same result
# Transpose
print(A.T) # [[1,3],[2,4]]
# Statistics — key for EDA
data = np.array([2, 5, 7, 1, 8, 3, 9])
print(np.mean(data)) # 5.0
print(np.median(data)) # 5.0
print(np.std(data)) # standard deviation
print(np.var(data)) # variance
print(np.min(data), np.max(data)) # 1, 9
print(np.percentile(data, [25, 50, 75])) # quartiles
# Reshaping
arr = np.arange(12)
matrix = arr.reshape(3, 4) # 3 rows × 4 columns
flat = matrix.flatten() # back to 1D
# Stacking arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.vstack([a, b])) # vertical stack → 2×3
print(np.hstack([a, b])) # horizontal stack → [1,2,3,4,5,6]
Vectorization vs For Loops
NumPy operations are implemented in C and run on entire arrays at once; this is called vectorization. A NumPy dot product on 1M elements is typically around 100× faster than a Python for-loop, though the exact factor depends on hardware and dtype. Avoid looping over NumPy arrays when a vectorized operation exists.
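You can measure the gap on your own machine with a sketch like the one below. It computes the same dot product twice, once with a Python loop and once with `@`, and times both. The array size matches the 1M figure above; the timings will vary by hardware, so no specific speedup is claimed here.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(1_000_000)
b = rng.random(1_000_000)

def loop_dot(u, v):
    """Dot product with an explicit Python loop (slow: per-element interpreter overhead)."""
    total = 0.0
    for i in range(len(u)):
        total += u[i] * v[i]
    return total

start = time.perf_counter()
slow = loop_dot(a, b)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = float(a @ b)                    # vectorized: the loop runs in compiled code
vectorized_seconds = time.perf_counter() - start

print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vectorized_seconds:.6f} s")
print(f"speedup:    {loop_seconds / max(vectorized_seconds, 1e-9):.0f}x")
print("results match:", np.isclose(slow, fast))
```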
Common mistakes
- Python for-loops over rows instead of vectorized operations (100–1000× slower).
- Wrong shapes in matrix multiply (features vs samples axis confusion).
- Mutable views vs copies — silent bugs when slicing arrays.
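The view-versus-copy pitfall is easy to reproduce. In this small sketch, basic slicing returns a view that shares memory with the original array, while `.copy()` and fancy (list-based) indexing return independent data.

```python
import numpy as np

original = np.arange(6)            # [0 1 2 3 4 5]

view = original[:3]                # basic slicing returns a view
view[0] = 99                       # this also changes original[0]

safe = original[:3].copy()         # explicit copy owns its own memory
safe[1] = -1                       # original is unaffected

fancy = original[[0, 1]]           # indexing with a list returns a copy
fancy[0] = 0                       # original is unaffected

print("original:", original)
print("view shares memory:", np.shares_memory(original, view))
print("copy shares memory:", np.shares_memory(original, safe))
```

When a function modifies a slice of its input, calling `.copy()` first is a cheap way to avoid silently mutating the caller's data.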
Interview checkpoints
- Q: Shape of X with m samples, n features? A: (m, n) in sklearn convention.
- Q: Vectorization benefit? A: C-level loops in NumPy, SIMD, no Python per-element overhead.
- Q: dot vs element-wise *? A: dot (or @) = matrix multiplication; * = element-wise (Hadamard) product on same-shape or broadcast-compatible arrays.
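The last checkpoint is worth running once. The sketch below contrasts `*` with `@` on the same two matrices, then uses broadcasting to standardize the columns of a small, made-up feature matrix without any loop.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B     # multiplies matching positions
matmul = A @ B          # rows of A dotted with columns of B

print("A * B =\n", elementwise)
print("A @ B =\n", matmul)

# Broadcasting: a (3, 2) matrix combined with (2,) vectors of column statistics.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # 3 samples, 2 features
col_means = X.mean(axis=0)                               # shape (2,)
col_stds = X.std(axis=0)                                 # shape (2,)
X_standardized = (X - col_means) / col_stds              # stats stretched across rows

print("standardized column means:", X_standardized.mean(axis=0))
print("standardized column stds: ", X_standardized.std(axis=0))
```

This is the same per-column operation that `StandardScaler` performs in scikit-learn, which by default also uses the population standard deviation.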
Practice
- Basic: Create 5×3 random matrix; compute column means without loops.
- Intermediate: Implement batch dot product for two matrices; verify with np.dot.
- Advanced: Benchmark loop vs vectorized mean on 1M-element array.
Recap
- ndarray = fast homogeneous arrays.
- Think in shapes and vectorization.
- Broadcasting rules prevent many explicit loops.
Next: Day 7 — Pandas
Pandas Basics — The ML Workhorse
Why this matters
Real ML is 60%+ data wrangling. Pandas is how you load, clean, join, and aggregate before any model sees data.
Pandas is the most important library for ML practitioners working with real-world tabular data. Every step from data loading to feature engineering uses Pandas.
DataFrame Fundamentals
import pandas as pd
import numpy as np
# ── Creating DataFrames ──────────────────────────────────────
# Loading from files (file names are placeholders; uncomment the one you need)
# df = pd.read_csv('data.csv') # Load CSV
# df = pd.read_excel('data.xlsx') # Load Excel
# df = pd.read_json('data.json') # Load JSON
# Create from dictionary
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 75000, 55000],
    'Dept': ['HR', 'IT', 'Finance', 'IT']
})
# ── First Look at Data ────────────────────────────────────────
df.head(3) # first 3 rows (default: 5)
df.tail(3) # last 3 rows
df.shape # (4, 4) — rows, cols
df.columns # Index(['Name','Age','Salary','Dept'])
df.dtypes # data type of each column
df.info() # non-null count + dtype summary
df.describe() # count, mean, std, min, quartiles, max
# ── Selecting Data ────────────────────────────────────────────
df['Age'] # Series — single column
df[['Name', 'Salary']] # DataFrame — multiple columns
df.loc[0] # Row by index label
df.iloc[0] # Row by integer position
df.loc[df['Dept'] == 'IT'] # Rows where Dept is 'IT'
df[df['Age'] > 28] # Rows where Age > 28
df.loc[df['Dept']=='IT', 'Salary'] # IT employees' salaries only
Data Manipulation
# ── Adding & Modifying Columns ──────────────────────────────
df['Bonus'] = df['Salary'] * 0.1
df['Seniority'] = df['Age'].apply(lambda x: 'Senior' if x >= 30 else 'Junior')
# ── Sorting ──────────────────────────────────────────────────
df.sort_values('Salary', ascending=False)
df.sort_values(['Dept', 'Age'], ascending=[True, False])
# ── Aggregation ──────────────────────────────────────────────
df['Salary'].mean() # 60000.0
df.groupby('Dept')['Salary'].mean() # avg salary per department
df.groupby('Dept').agg({'Salary': ['mean', 'max'], 'Age': 'mean'})
# ── Missing Values ───────────────────────────────────────────
df.isnull().sum() # count nulls per column
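df.isnull().sum() / len(df) # null fraction per column (multiply by 100 for a percentage)
df.dropna() # drop rows with any null
df.fillna(df.mean(numeric_only=True)) # fill nulls in numeric columns with the column mean
df['Age'] = df['Age'].fillna(df['Age'].median()) # assign back instead of inplace on a column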
# ── Filtering & Conditions ───────────────────────────────────
senior_it = df[(df['Dept'] == 'IT') & (df['Age'] >= 30)]
high_earners = df[df['Salary'].between(60000, 80000)]
Common mistakes
- Not setting index or dtypes on load — silent object columns break models.
- Chained indexing (`df[df.A>0]['B']=1`) causing SettingWithCopyWarning.
- Merging without validating row counts — duplicate keys explode row count.
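Two of these mistakes have direct fixes in Pandas. The sketch below replaces chained indexing with a single `.loc` assignment, and uses the `validate` argument of `merge` to fail loudly when duplicate keys would inflate the row count. The small tables are made up for illustration.

```python
import pandas as pd

# Fix for chained indexing: select rows and the column in one .loc call.
df = pd.DataFrame({"A": [-1, 2, 3], "B": [0, 0, 0]})
df.loc[df["A"] > 0, "B"] = 1

# Merge with validation: each order should match at most one customer row.
orders = pd.DataFrame({"order_id": [1, 2, 3], "cust_id": [10, 10, 20]})
customers = pd.DataFrame({"cust_id": [10, 20], "name": ["Alice", "Bob"]})
merged = orders.merge(customers, on="cust_id", how="left", validate="many_to_one")
assert len(merged) == len(orders)      # a left join should not change the row count here

# Duplicate keys on the right side would silently duplicate orders without validate.
customers_dup = pd.DataFrame(
    {"cust_id": [10, 10, 20], "name": ["Alice", "Alicia", "Bob"]}
)
try:
    orders.merge(customers_dup, on="cust_id", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(df)
print(merged)
print("duplicate keys detected:", duplicate_detected)
```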
Interview checkpoints
- Q: loc vs iloc? A: loc = label-based; iloc = integer position.
- Q: groupby mental model? A: Split-apply-combine: partition by key, aggregate/transform each group.
- Q: Handle missing values in Pandas? A: dropna, fillna, interpolate — choice depends on MCAR/MAR/MNAR.
Practice
- Basic: Load a CSV; show head, info(), describe(), missing counts.
- Intermediate: groupby one categorical; compute mean of numeric cols per group.
- Advanced: Merge two tables on key; assert no duplicate inflation; document join type.
Recap
- DataFrame = labeled tables for ML pipelines.
- Master selection, filtering, groupby, merge.
- Always inspect dtypes and missingness after load.
Next: Day 8 — Pandas Advanced
Pandas Advanced — Merging, Pivoting, and Apply
Why this matters
Advanced Pandas (apply, pivot, time series, performance) separates analysts who script from engineers who ship reliable pipelines.
Merging DataFrames (SQL-style Joins)
import pandas as pd
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'dept_id': [10, 20, 10, 30]
})
departments = pd.DataFrame({
    'dept_id': [10, 20, 30],
    'dept_name': ['HR', 'IT', 'Finance']
})
# INNER JOIN — only rows that match in both
result = pd.merge(employees, departments, on='dept_id', how='inner')
# LEFT JOIN — keep all employees, nulls for missing dept
result = pd.merge(employees, departments, on='dept_id', how='left')
# Merge on different column names
# (left_df and right_df are placeholders for your own DataFrames)
result = pd.merge(
    left_df, right_df,
    left_on='emp_dept', right_on='dept_id',
    how='inner'
)
# Concatenation: stack vertically (df1 and df2 are placeholder DataFrames)
combined = pd.concat([df1, df2], ignore_index=True)
# Concatenation: stack horizontally
combined = pd.concat([df1, df2], axis=1)
Pivot Tables & GroupBy Advanced
import pandas as pd
# Pivot Table — like Excel pivot
pivot = df.pivot_table(
    values='Salary',
    index='Dept',
    columns='Seniority',
    aggfunc='mean',
    fill_value=0
)
# GroupBy with transform — adds group stats back to original df
df['dept_avg_salary'] = df.groupby('Dept')['Salary'].transform('mean')
df['salary_diff_from_avg'] = df['Salary'] - df['dept_avg_salary']
# Apply — custom functions on columns or rows
def salary_band(x):
    if x < 50000: return 'Low'
    elif x < 70000: return 'Medium'
    else: return 'High'
df['salary_band'] = df['Salary'].apply(salary_band)
# Apply on multiple columns
df['name_age'] = df.apply(lambda row: f"{row['Name']} ({row['Age']})", axis=1)
String Operations & DateTime
# String operations (vectorized — no loops!)
df['Name'].str.lower()
df['Name'].str.upper()
df['Name'].str.contains('a', case=False) # boolean mask
df['email'].str.extract(r'@(.+)') # extract domain (assumes an 'email' column)
# DateTime handling: critical for time-series ML (assumes a 'date_string' column)
df['date'] = pd.to_datetime(df['date_string'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek # 0=Monday
df['is_weekend'] = df['day_of_week'] >= 5
Common mistakes
- Using apply(axis=1) on huge DataFrames when a vectorized alternative is possible.
- Not using categorical dtype for low-cardinality strings (memory blowup).
- Parsing dates late — breaks time-series features and sorting.
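The first two mistakes can be addressed together. This sketch rewrites the `salary_band` logic with `np.select`, which evaluates whole columns at once, and then stores the result as a categorical. The five salary values are made up; `np.select` is one option among several (`pd.cut` is another).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Salary": [45000, 50000, 69999, 70000, 90000]})

# Row-by-row version, for comparison.
def salary_band(x):
    if x < 50000:
        return "Low"
    elif x < 70000:
        return "Medium"
    else:
        return "High"

via_apply = df["Salary"].apply(salary_band)

# Vectorized version: conditions are checked in order, first match wins.
conditions = [df["Salary"] < 50000, df["Salary"] < 70000]
choices = ["Low", "Medium"]
via_select = pd.Series(np.select(conditions, choices, default="High"), index=df.index)

# Low-cardinality strings are stored more compactly as a categorical dtype.
banded = via_select.astype("category")

print(via_select.tolist())
print(banded.dtype, list(banded.cat.categories))
```

Both versions give the same labels; the vectorized one avoids calling a Python function once per row, which matters on large tables.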
Interview checkpoints
- Q: When is apply OK? A: Complex row logic with no vectorized alternative; not default choice.
- Q: melt vs pivot? A: pivot widens; melt lengthens. Long (tidy) data suits plotting and sklearn.
- Q: read_csv speed tips? A: usecols, dtype, parse_dates, chunks for large files.
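The melt-versus-pivot answer is easiest to remember as a round trip. The sketch below takes a small wide table (made up for illustration), melts it to long format, and pivots it back.

```python
import pandas as pd

# Wide format: one column per month.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan": [100, 80],
    "feb": [120, 90],
})

# melt: wide -> long (one row per store-month observation).
long = wide.melt(id_vars="store", var_name="month", value_name="sales")

# pivot: long -> wide again (one column per month, indexed by store).
back_to_wide = long.pivot(index="store", columns="month", values="sales")

print(long)
print(back_to_wide)
```

Note that `pivot` requires each index/column pair to be unique; when there are duplicates that need aggregating, `pivot_table` is the tool to reach for.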
Practice
- Basic: Use pivot_table to compute the average tip by day and sex in seaborn's `tips` dataset.
- Intermediate: Resample monthly sales from daily data with DatetimeIndex.
- Advanced: Refactor a row-wise apply into vectorized ops; compare runtime.
Recap
- Prefer vectorized ops over apply when possible.
- Pivot/melt reshape data for modeling and viz.
- Datetime handling is critical for time-series ML.
Next: Day 9 — ML Life Cycle
The Machine Learning Life Cycle
Why this matters
Extraordinary ML engineers optimize the full lifecycle — problem framing, EDA, deployment, monitoring — not just algorithm trivia.
Understanding the full ML project workflow separates ordinary practitioners from exceptional ML engineers. Each step is interconnected, and skipping any one step typically leads to poor real-world performance.
Standard ML Project Workflow
Step 1: Problem Definition
Before touching any data, answer these questions:
- Business goal: What decision will this model inform? (Reduce churn? Detect fraud?)
- ML framing: Is this classification, regression, clustering, or ranking?
- Success metric: What number defines success? (Precision > 95%? RMSE < 100?)
- Baseline: What does the current non-ML solution achieve? (Always compare against this.)
- Data availability: Do we have enough labeled data? Is it representative?
Step 2–3: Data Collection & EDA
Spend 60–70% of your project time here. Most ML failures stem from data quality issues, not algorithm choice.
- Profile each feature: distribution, range, missing values, unique values.
- Detect and handle outliers and anomalies.
- Understand feature-target correlations.
- Check for train/test distribution shift.
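The first profiling step can be wrapped in a small helper. The function name `profile` and the toy DataFrame below are assumptions for illustration; the helper only summarizes dtype, missingness, and cardinality, and leaves outliers and correlations to follow-up analysis.

```python
import numpy as np
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """One row per feature: dtype, missing count, missing percent, unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),          # NaN is not counted as a value
    })

# Toy data with a missing value in one numeric and one categorical column.
df = pd.DataFrame({
    "age": [25, 30, np.nan, 28],
    "dept": ["HR", "IT", "IT", None],
    "salary": [50000, 60000, 75000, 55000],
})

summary = profile(df)
print(summary)
```

Running the same helper on the training and test sets side by side gives a quick first check for distribution shift in missingness and cardinality.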
Step 5: Model Selection Framework
| Scenario | Recommended Starting Point |
|---|---|
| Small tabular dataset (<10k) | Logistic Regression, Random Forest |
| Large tabular dataset | LightGBM / XGBoost |
| Image data | CNN (ResNet, EfficientNet) |
| Text/NLP | BERT, RoBERTa, or GPT fine-tuned |
| Time-series forecasting | LSTM, Prophet, N-BEATS |
| Needs interpretability | Decision Tree, Logistic Regression, SHAP on XGBoost |
Common mistakes
- Jumping to XGBoost before defining the business metric.
- Training on all data without holdout — false confidence.
- No monitoring plan after deployment (silent drift).
Interview checkpoints
- Q: First step in an ML project? A: Define problem, metric, constraints, baseline — before modeling.
- Q: CRISP-DM vs agile ML? A: CRISP-DM phases; agile iterates experiments in sprints — often combined.
- Q: What is a baseline model? A: Simple heuristic (mean, majority class) to beat before claiming ML value.
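A baseline comparison takes only a few lines in scikit-learn. This sketch uses `DummyClassifier` (always predict the majority class) as the baseline and a scaled logistic regression as the candidate model on the built-in breast cancer dataset. The dataset and model choice are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: ignore the features and always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Candidate: scaling + logistic regression, fit on the same training split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
```

The gap between the two numbers, not the model's accuracy alone, is the evidence that ML adds value. On an imbalanced problem the majority-class baseline can look deceptively strong, which is one reason to pick the success metric before modeling.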
Practice
- Basic: Draw the 6-step ML lifecycle for a churn prediction product.
- Intermediate: Write success metrics (business + ML) for a recommendation feature.
- Advanced: Design monitoring: what alerts fire when fraud model drift occurs?
Recap
- Lifecycle: problem → data → EDA → model → evaluate → deploy → monitor.
- Align ML metrics with business outcomes.
- Baselines and iteration beat one-shot perfect models.
Next: Day 10 — scikit-learn
Scikit-Learn Introduction — The Unified API
Why this matters
scikit-learn's unified API (fit/predict/transform) is the industry standard for classical ML. Master it once, use dozens of algorithms.
Scikit-Learn is the most important library for classical ML. Its genius is a consistent, unified API across all algorithms — if you know how to use one model, you know how to use all of them.
The 3-Step Pattern (Estimator API)
Every Scikit-Learn estimator follows the exact same interface:
from sklearn.ensemble import RandomForestClassifier
# X_train, y_train, and X_test are assumed to come from an earlier train/test split
# ── Step 1: Instantiate ─────────────────────────────────────
model = RandomForestClassifier(n_estimators=100, random_state=42)
# ── Step 2: Fit (train) ──────────────────────────────────────
model.fit(X_train, y_train)
# ── Step 3: Predict ──────────────────────────────────────────
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test) # for classifiers
A Complete ML Pipeline in 30 Lines
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# ── 1. Load Data ─────────────────────────────────────────────
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target # 0 = malignant, 1 = benign
# ── 2. Split ─────────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# ── 3. Preprocess ────────────────────────────────────────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # learn stats + scale
X_test_scaled = scaler.transform(X_test) # only scale (don't refit!)
# ── 4. Train ─────────────────────────────────────────────────
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# ── 5. Evaluate ──────────────────────────────────────────────
y_pred = model.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=data.target_names))
# ── 6. Feature Importance ────────────────────────────────────
importances = pd.Series(model.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(10))
Critical: fit_transform vs transform
Always call fit_transform() on training data only, then transform() (not fit_transform) on the test set. Fitting on test data causes data leakage — the model indirectly "sees" test statistics during training, inflating performance metrics.
Key Scikit-Learn Modules
| Module | Purpose | Key Classes |
|---|---|---|
| sklearn.model_selection | Data splitting, cross-validation, tuning | train_test_split, KFold, GridSearchCV, cross_val_score |
| sklearn.preprocessing | Scaling, encoding, transforms | StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder |
| sklearn.linear_model | Linear models | LinearRegression, LogisticRegression, Ridge, Lasso |
| sklearn.ensemble | Ensemble methods | RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier |
| sklearn.metrics | Evaluation metrics | accuracy_score, f1_score, roc_auc_score, mean_squared_error |
| sklearn.pipeline | Chain preprocessing + model | Pipeline, make_pipeline |
| sklearn.compose | Per-column preprocessing | ColumnTransformer |
| sklearn.datasets | Built-in toy datasets | load_iris, load_diabetes, make_classification, make_regression |
Common mistakes
- fit_transform on test data — data leakage.
- Not using Pipeline + ColumnTransformer for mixed feature types.
- Tuning on test set instead of validation/CV.
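All three mistakes are easier to avoid when preprocessing and model live in one object. The sketch below builds a `ColumnTransformer` for numeric and categorical columns, wraps it with a model in a `Pipeline`, and evaluates with cross-validation so the scaler and encoder are refit on each training fold only. The 12-row dataset, its column names, and the choice of `LogisticRegression` are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small invented dataset with mixed feature types.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44, 35, 53, 27, 41],
    "salary": [40000, 52000, 88000, 91000, 60000, 45000,
               99000, 72000, 58000, 94000, 43000, 70000],
    "dept": ["HR", "IT", "Finance", "IT", "HR", "IT",
             "Finance", "HR", "IT", "Finance", "HR", "IT"],
    "churned": [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0],
})
X = df[["age", "salary", "dept"]]
y = df["churned"]

# Different preprocessing per column type.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["dept"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each fold fits the preprocessing on its own training part, so nothing leaks.
scores = cross_val_score(pipeline, X, y, cv=3)
print("fold accuracies:", scores)

# Fit on all available training data and predict on new rows.
pipeline.fit(X, y)
predictions = pipeline.predict(X.head(2))
print("predictions:", predictions)
```

Because the pipeline is a single estimator, it can be passed to `GridSearchCV` for tuning on validation folds, leaving the test set untouched until the final evaluation.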
Interview checkpoints
- Q: fit vs transform vs fit_transform? A: fit learns params; transform applies; fit_transform only on train.
- Q: Why Pipeline? A: Prevents leakage, bundles preprocess+model, enables CV on full flow.
- Q: random_state purpose? A: Reproducible splits and stochastic algorithms.
Practice
- Basic: Train RandomForest on breast cancer; report accuracy.
- Intermediate: Add StandardScaler in Pipeline; compare with/without scaling on logistic regression.
- Advanced: Build ColumnTransformer for numeric + categorical features on a mixed dataset.
Recap
- Estimator API: instantiate → fit → predict.
- Never leak test info into preprocessing.
- Pipelines are production and interview best practice.
