Module 2: Exploratory Data Analysis (EDA)

100 Days of ML Module 2 — Master Exploratory Data Analysis: univariate/bivariate/multivariate analysis, missing data, outlier detection, matplotlib, seaborn, plotly, correlation, and case studies.

⏱ 90 Min Read 25 Updated: May 2026

EDA is often the most critical and time-consuming phase of an ML project; a commonly quoted rule of thumb is that experienced practitioners spend 60–70% of project time here. By the end of this module you will be able to deeply understand any dataset, uncover hidden patterns, detect problems before they ruin your models, and tell data-driven stories through powerful visualizations.

What is EDA? Why is it the Most Important Step?

Why this matters

EDA is where most production ML failures are prevented — you discover leakage, bad dtypes, and useless features before wasting weeks on modeling.

Definition and Philosophy

Exploratory Data Analysis (EDA) is the process of investigating datasets to summarize their main characteristics, discover patterns, spot anomalies, test hypotheses, and check assumptions — primarily through statistical summaries and visualizations. The term was coined by statistician John Tukey in his 1977 book of the same name.

Core Philosophy: Don't make assumptions about your data. Let the data tell its own story. A model trained on poorly understood data will fail in production no matter how sophisticated the algorithm.

Why EDA is Non-Negotiable

  • Understand what you have: Feature types, ranges, distributions, and business meaning.
  • Detect data quality issues: Missing values, duplicates, typos, wrong dtypes, impossible values (e.g., negative age).
  • Discover feature-target relationships: Which features actually predict the target? Are they linear, non-linear, or conditional?
  • Identify feature engineering opportunities: Should you log-transform income? Create age buckets? Combine features?
  • Choose the right algorithm: Roughly linear relationships → linear model. Strong non-linearities or interactions → tree-based models. High-dimensional sparse data → regularized linear models.
  • Prevent data leakage: Features that are derived from the target variable will inflate performance metrics.
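
One rough way to screen for the leakage described in the last bullet is to rank features by absolute correlation with the target and flag anything suspiciously close to 1. The sketch below uses synthetic data; the column names (`basket_size`, `refund_issued`, `churned`), the `flag_possible_leakage` helper, and the 0.95 threshold are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic example: 'churned' is the target.
churned = rng.integers(0, 2, size=n)
df_leak = pd.DataFrame({
    "basket_size": rng.normal(50, 10, size=n),  # unrelated to the target
    "refund_issued": churned,                   # recorded after the outcome, so it leaks the target
    "churned": churned,
})

def flag_possible_leakage(frame, target, threshold=0.95):
    """Return numeric features whose |correlation| with the target is >= threshold."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target).abs()
    return corr[corr >= threshold].sort_values(ascending=False)

suspects = flag_possible_leakage(df_leak, "churned")
print(suspects)
```

A high correlation is only a prompt to ask when and how the feature was recorded; a legitimately strong predictor can also score high, and leakage through non-linear or categorical features will not show up here.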

The Three Pillars of EDA

📊 Univariate
One variable at a time
📈 Bivariate
Relationship between two variables
🌐 Multivariate
Interactions among 3+ variables
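
To make the three pillars concrete, here is a minimal sketch on a tiny hand-made DataFrame (the rows and column names are invented for illustration): one column summarized alone, two columns compared, then a third variable added.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex":      ["male", "female", "female", "male", "female", "male"],
    "pclass":   [3, 1, 3, 1, 2, 3],
    "survived": [0, 1, 1, 0, 1, 0],
    "fare":     [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
})

# Univariate: one variable at a time
fare_summary = toy["fare"].describe()

# Bivariate: relationship between two variables
survival_by_sex = toy.groupby("sex")["survived"].mean()

# Multivariate: interaction among three variables (NaN where a combination has no rows)
survival_by_sex_class = toy.pivot_table(
    index="sex", columns="pclass", values="survived", aggfunc="mean"
)

print(fare_summary, survival_by_sex, survival_by_sex_class, sep="\n\n")
```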

The First Commands: Pandas EDA Toolkit

Every EDA starts with these fundamental Pandas commands to get an initial lay of the land:

Essential EDA First Steps
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('your_dataset.csv')

# ── Shape & Basic Info ──────────────────────────────────────────
print(f"Shape: {df.shape}")           # (rows, columns)
print(f"Size: {df.size}")             # total number of elements
print(f"Columns: {list(df.columns)}")

# ── .info() — THE most important first look ─────────────────────
# Shows: column name, non-null count, dtype
# Immediately reveals: missing values + wrong dtypes
df.info()
# Output:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 891 entries, 0 to 890
# Data columns (total 12 columns):
#  #   Column       Non-Null Count  Dtype
# ---  ------       --------------  -----
#  0   PassengerId  891 non-null    int64
#  1   Survived     891 non-null    int64
#  2   Pclass       891 non-null    int64
#  3   Name         891 non-null    object
#  4   Sex          891 non-null    object
#  5   Age          714 non-null    float64   ← MISSING! (177 nulls)
#  6   SibSp        891 non-null    int64
#  7   Parch        891 non-null    int64
#  8   Ticket       891 non-null    object
#  9   Fare         891 non-null    float64
#  10  Cabin        204 non-null    object    ← HEAVILY MISSING!
#  11  Embarked     889 non-null    object    ← 2 missing

# ── .describe() — Statistical summary ──────────────────────────
# For numeric columns: count, mean, std, min, 25%, 50%, 75%, max
df.describe()

# For ALL columns including categoricals:
df.describe(include='all')

# ── .value_counts() — Frequency of categories ──────────────────
df['Sex'].value_counts()
# male      577
# female    314
# dtype: int64

df['Sex'].value_counts(normalize=True)  # proportions (0 to 1)
df['Pclass'].value_counts().sort_index()  # sort by class

# ── Checking Data Types ──────────────────────────────────────────
df.dtypes
df.select_dtypes(include=['object']).columns   # categorical cols
df.select_dtypes(include=['number']).columns   # numeric cols

# ── Checking for Duplicates ──────────────────────────────────────
print(f"Duplicate rows: {df.duplicated().sum()}")
df[df.duplicated()]  # view duplicate rows
df.drop_duplicates(inplace=True)  # remove them

# ── Unique Value Counts ──────────────────────────────────────────
df.nunique()  # how many unique values per column
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")
💡
EDA Mindset

Always ask: "Does this make sense from a business perspective?" A customer with age = -5 or salary = $999,999,999 is almost certainly a data entry error. Never blindly trust your data. Validate against domain knowledge.
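
Domain checks like these can be written down as explicit plausibility rules and run on every new dataset. The following is a minimal sketch; the `customers` frame, the `rules` dictionary, and the bounds in it are hypothetical and should come from your own domain knowledge.

```python
import pandas as pd

customers = pd.DataFrame({
    "age":    [34, -5, 51, 29],
    "salary": [52_000, 48_000, 999_999_999, 61_000],
})

# Hypothetical plausibility bounds (inclusive); set these from domain knowledge, not from the data.
rules = {
    "age":    (0, 120),
    "salary": (0, 10_000_000),
}

violations = {}
for col, (low, high) in rules.items():
    # between() is False for NaN, so missing values are reported as violations too
    bad_rows = customers[~customers[col].between(low, high)]
    violations[col] = bad_rows.index.tolist()

print(violations)
```

Rows that fail a rule are candidates for review, not automatic deletion: some may turn out to be genuine extreme values.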

Common mistakes

  • Skipping EDA and jumping straight to modeling on dirty data.
  • Treating correlation as causation without domain checks.
  • Ignoring class imbalance (visible in a simple count plot of the target) or duplicate rows (found only by an explicit df.duplicated() check).

Interview checkpoints

  • Q: Why is EDA non-negotiable? A: It validates data quality, distributions, and signal before any algorithm choice.
  • Q: EDA on train only or full dataset? A: Explore train deeply; compare test only for drift, never tune on test.

Practice

  1. Basic: Summarize shape, dtypes, missing %, and target distribution for a CSV.
  2. Intermediate: Build a 6-panel EDA dashboard (hist, box, corr heatmap, missing bar).
  3. Advanced: Write an EDA report with 3 actionable feature-engineering ideas.

Recap

  • You can explain what EDA is, why it is the most important step, and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 12 — Univariate Analysis

Univariate Analysis — Understanding Each Feature Individually

Why this matters

Univariate analysis is the first check on every feature: it reveals each variable's distribution, scale, and data-quality problems (skew, impossible values, rare categories) before you study relationships between variables.

Univariate analysis examines each variable in isolation. For numerical features, we study distributions. For categorical features, we study frequency counts.

Distribution Shapes — What to Look For

Shape | Description | Skewness Value | Implication for ML
Normal (Gaussian) | Symmetric bell curve, mean ≈ median ≈ mode | ≈ 0 | Ideal for linear models; no transformation needed
Right-skewed (positive) | Long tail to the right; most values are low (e.g., income) | > 0 (often 1–3+) | Apply log transform; can hurt linear models
Left-skewed (negative) | Long tail to the left; most values are high | < 0 | Apply square/cube transform
Bimodal | Two distinct peaks; suggests two sub-populations | Variable | Consider separating into two groups; hidden categorical
Uniform | All values equally likely | ≈ 0 | Feature may not be very discriminative
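
The effect of the log transform suggested for right-skewed features can be seen on simulated data. This sketch draws a synthetic, income-like lognormal sample (the parameters are arbitrary assumptions) and compares skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "income": most values are low, with a long right tail
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=5_000))

skew_raw = income.skew()
skew_log = np.log1p(income).skew()

print(f"Skewness before log transform: {skew_raw:.2f}")
print(f"Skewness after  log transform: {skew_log:.2f}")
```

On a sample like this the raw skewness is strongly positive and the transformed one is close to zero; real features rarely behave this cleanly, so re-plot the distribution after transforming.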

Statistical Measures of Shape

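Skewness measures the asymmetry of the distribution. For a sample $x_1, x_2, \ldots, x_n$ with mean $\bar{x}$ and sample standard deviation $s$, the bias-corrected estimator (the one pandas' .skew() reports) is: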

$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$$

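Kurtosis measures the "tailedness": how heavy the tails are relative to a normal distribution. High kurtosis (leptokurtic) means heavy tails with more extreme outliers. The estimator below is the bias-corrected excess kurtosis, for which a normal distribution scores approximately 0; this is what pandas' .kurtosis() reports.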

$$\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$
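
The two formulas above can be checked numerically. The sketch below implements them directly with NumPy and compares the results with pandas' `Series.skew()` and `Series.kurt()`; the function names and the sample values are made up for the illustration.

```python
import numpy as np
import pandas as pd

def sample_skewness(x):
    """Bias-corrected skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    z = (x - x.mean()) / x.std(ddof=1)   # s is the sample standard deviation
    return n / ((n - 1) * (n - 2)) * np.sum(z ** 3)

def sample_excess_kurtosis(x):
    """Bias-corrected excess kurtosis, following the second formula term by term."""
    x = np.asarray(x, dtype=float)
    n = x.size
    z = (x - x.mean()) / x.std(ddof=1)
    leading = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * np.sum(z ** 4)
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return leading - correction

values = pd.Series([2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 40.0])

print(sample_skewness(values), values.skew())
print(sample_excess_kurtosis(values), values.kurt())
```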

Histograms, KDE Plots, and Box Plots

Code Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load Titanic dataset
df = pd.read_csv('titanic.csv')

# ── 1. Histogram — shows frequency distribution ─────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].hist(df['Age'].dropna(), bins=30, color='#d4af37', alpha=0.7, edgecolor='black')
axes[0].set_title('Age Distribution (Histogram)')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Count')

# ── 2. KDE (Kernel Density Estimate) — smooth probability density
sns.kdeplot(df['Age'].dropna(), ax=axes[1], color='#d4af37', fill=True, alpha=0.4)
axes[1].set_title('Age Distribution (KDE)')
axes[1].set_xlabel('Age')

# ── 3. Box Plot — shows quartiles and outliers ───────────────────
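# Matplotlib does not parse CSS-style 'rgba(...)' strings; pass an RGBA tuple with values in 0-1.
# Use edgecolor (not color) for the outline, because color would override facecolor on a patch.
axes[2].boxplot(df['Age'].dropna(), patch_artist=True,
                boxprops=dict(facecolor=(212/255, 175/255, 55/255, 0.2), edgecolor='#d4af37'),
                medianprops=dict(color='black', linewidth=2))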
axes[2].set_title('Age Distribution (Box Plot)')
axes[2].set_ylabel('Age')

plt.tight_layout()
plt.savefig('univariate_age.png', dpi=150, bbox_inches='tight')
plt.show()

# ── Combined: Histogram + KDE overlay ───────────────────────────
fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(df['Age'].dropna(), kde=True, bins=30, color='#d4af37', ax=ax, 
             edgecolor='black', alpha=0.6)
ax.axvline(df['Age'].mean(), color='red', linestyle='--', label=f"Mean: {df['Age'].mean():.1f}")
ax.axvline(df['Age'].median(), color='blue', linestyle='--', label=f"Median: {df['Age'].median():.1f}")
ax.legend()
ax.set_title('Age Distribution with Mean and Median')
plt.show()

# ── Measuring Skewness and Kurtosis ─────────────────────────────
for col in df.select_dtypes(include='number').columns:
    skew = df[col].skew()
    kurt = df[col].kurtosis()
    print(f"{col:15} | Skewness: {skew:7.3f} | Kurtosis: {kurt:7.3f}")

# ── Seaborn displot — the modern all-in-one distribution plot ──
sns.displot(df, x='Age', hue='Survived', kind='kde', fill=True, height=5, aspect=1.5)
plt.title('Age Distribution by Survival Status')
plt.show()

# ── For Categorical Variables ────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart of value counts
df['Pclass'].value_counts().sort_index().plot(kind='bar', ax=axes[0], 
    color='#d4af37', alpha=0.8, edgecolor='black')
axes[0].set_title('Passenger Class Distribution')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Count')

# Pie chart
df['Sex'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%',
    colors=['#d4af37', '#3a7bd5'])
axes[1].set_title('Gender Distribution')
axes[1].set_ylabel('')
plt.tight_layout()
plt.show()
⚠️
Histogram Bin Count Matters

Too few bins hide the true shape; too many bins create noise. Use Sturges' rule: $k = \lceil \log_2 n \rceil + 1$ or just try 20–50 bins for most datasets. Seaborn's histplot with kde=True is usually the best default choice.
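
Sturges' rule is easy to evaluate by hand. A minimal sketch using only the standard library (the helper name `sturges_bins` is invented here):

```python
import math

def sturges_bins(n):
    """Number of histogram bins by Sturges' rule: ceil(log2(n)) + 1."""
    if n < 1:
        raise ValueError("n must be a positive sample size")
    return math.ceil(math.log2(n)) + 1

for n in (100, 891, 10_000):
    print(f"n = {n:>6}: {sturges_bins(n)} bins")
```

Because the rule grows only logarithmically, it tends to give too few bins on large or strongly skewed datasets, which is why trying a wider range of bin counts is still worthwhile. NumPy exposes the same rule through `np.histogram_bin_edges(x, bins='sturges')`.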

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain univariate analysis and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 13 — Missing Data

Missing Data Analysis — Types, Detection, and Visualization

Why this matters

How you should handle a missing value depends on why it is missing. Dropping or imputing without checking the mechanism can silently bias every downstream model.

Types of Missingness (Rubin's Classification, 1976)

Understanding why data is missing is just as important as knowing how much is missing. The type of missingness determines the appropriate handling strategy.

Type | Full Name | Meaning | Example | Handling
MCAR | Missing Completely At Random | Probability of missing is unrelated to any observed or unobserved data | A survey respondent randomly skips a question; sensor randomly malfunctions | Simple imputation (mean/median/mode) is safe
MAR | Missing At Random | Probability of missing depends on observed variables, but NOT on the missing value itself | Older people less likely to report income, but given their age, income missingness is random | Imputation using other features (KNN, MICE) is appropriate
MNAR | Missing Not At Random | Probability of missing IS related to the unobserved missing value itself | High earners less likely to report income; severely ill patients don't complete health surveys | Most dangerous; needs domain expertise; create an "is_missing" indicator feature
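
A small simulation can make the three mechanisms tangible. The sketch below is built entirely on synthetic data (the age/income relationship and the missingness probabilities are arbitrary assumptions): it hides income values under each mechanism and compares the mean of the values that remain with the true mean.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

age = rng.uniform(20, 70, size=n)
income = 20_000 + 1_000 * (age - 20) + rng.normal(0, 5_000, size=n)  # income rises with age

# MCAR: every value has the same 30% chance of being missing
mcar_missing = rng.random(n) < 0.30
# MAR: missingness depends only on the OBSERVED age (older people skip the question more often)
mar_missing = rng.random(n) < np.where(age > 50, 0.60, 0.10)
# MNAR: missingness depends on the UNOBSERVED income itself (high earners hide it)
mnar_missing = rng.random(n) < np.where(income > np.median(income), 0.60, 0.10)

true_mean = income.mean()
observed_means = {
    "MCAR": income[~mcar_missing].mean(),
    "MAR": income[~mar_missing].mean(),
    "MNAR": income[~mnar_missing].mean(),
}
for name, value in observed_means.items():
    print(f"{name}: observed mean {value:,.0f} vs true mean {true_mean:,.0f}")

# Under MAR the bias disappears once we condition on the observed driver (age):
older = age > 50
mar_gap_within_older = income[older & ~mar_missing].mean() - income[older].mean()
print(f"MAR gap within the over-50 group: {mar_gap_within_older:,.0f}")
```

In this setup the MCAR mean stays close to the truth, while the naive MAR and MNAR means are both too low. The difference is that the MAR bias can be corrected using age, whereas nothing observed explains the MNAR bias.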

Detecting and Visualizing Missing Data

Code Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('titanic.csv')

# ── Basic Missing Value Analysis ────────────────────────────────
missing_count = df.isnull().sum()
missing_pct   = df.isnull().sum() / len(df) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing %': missing_pct
}).sort_values('Missing %', ascending=False)

print(missing_df[missing_df['Missing Count'] > 0])
# Output:
#          Missing Count  Missing %
# Cabin              687  77.104377
# Age                177  19.865320
# Embarked             2   0.224467

# ── Visual 1: Bar chart of missing values ───────────────────────
fig, ax = plt.subplots(figsize=(10, 5))
missing_df['Missing %'].plot(kind='bar', ax=ax, color='#d4af37', alpha=0.8, edgecolor='black')
ax.axhline(y=50, color='red', linestyle='--', alpha=0.7, label='50% threshold')
ax.axhline(y=20, color='orange', linestyle='--', alpha=0.7, label='20% threshold')
ax.set_title('Missing Data Percentage by Column', fontsize=14, fontweight='bold')
ax.set_xlabel('Columns')
ax.set_ylabel('Missing %')
ax.legend()
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# ── Visual 2: Heatmap of nulls (manual, no missingno needed) ───
fig, ax = plt.subplots(figsize=(12, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis', ax=ax)
ax.set_title('Missing Value Heatmap — Yellow = Missing', fontsize=14)
plt.tight_layout()
plt.show()

# ── Visual 3: missingno library (if installed) ──────────────────
# pip install missingno
try:
    import missingno as msno
    
    # Matrix plot — white lines = missing values
    msno.matrix(df, figsize=(10, 5), color=(0.83, 0.69, 0.22))
    plt.title('Missing Value Matrix')
    plt.show()
    
    # Bar chart — bar height = % complete
    msno.bar(df, figsize=(10, 5), color='#d4af37')
    plt.title('Data Completeness by Column')
    plt.show()
    
    # Heatmap — correlation of missingness between columns
    # High value = they tend to be missing together (informative!)
    msno.heatmap(df, figsize=(8, 6))
    plt.title('Missingness Correlation Heatmap')
    plt.show()
    
    # Dendrogram — clusters columns by missingness similarity
    msno.dendrogram(df, figsize=(10, 5))
    plt.show()
    
except ImportError:
    print("Install missingno: pip install missingno")

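# ── Missingness-pattern check (a heuristic, NOT Little's MCAR test) ──
# Little's MCAR test is a formal chi-square test (p > 0.05 means MCAR is not rejected).
# This snippet does not implement it; for the formal test use a dedicated package such as pyampute.
# Here we only correlate the 0/1 missingness indicators of the numeric columns:
# strongly correlated indicators mean columns go missing together, which argues against MCAR.
def missingness_correlation(df):
    """Correlation matrix of missingness indicators for numeric columns that contain nulls."""
    numeric_df = df.select_dtypes(include='number')
    pattern = numeric_df.isnull().astype(int)
    # Drop columns that are never missing: a constant indicator has undefined correlation
    pattern = pattern.loc[:, pattern.sum() > 0]
    corr = pattern.corr()
    print("Missingness correlation matrix:")
    print(corr.round(3))
    return corr

missingness_correlation(df)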

# ── Creating an 'is_missing' indicator feature ───────────────────
# Preserves the fact that a value was missing, which helps when missingness itself carries signal
df['Age_is_missing']   = df['Age'].isnull().astype(int)
df['Cabin_is_missing'] = df['Cabin'].isnull().astype(int)

# Check if missingness correlates with target
print(df.groupby('Age_is_missing')['Survived'].mean())
# If the survival rate differs noticeably → missingness is informative, so the data is not MCAR.
# This check cannot separate MAR from MNAR; that distinction needs domain knowledge.
📌
Missing Value Rules of Thumb

<5% missing: Safe to drop rows or use simple imputation.
5–20% missing: Use KNN or iterative imputation; consider is_missing indicator.
>20% missing: Dropping the column is often better; or use advanced imputation + indicator.
>50% missing: Almost always drop the column unless domain knowledge says otherwise (like Cabin in Titanic telling us about cabin class).

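The rule-of-thumb bands above can be turned into a quick per-column report. This is only a sketch: the helper name `suggest_missing_strategy` and the demo frame are invented, and the thresholds are the heuristics from the box above, not hard rules.

```python
import numpy as np
import pandas as pd

def suggest_missing_strategy(frame):
    """Map each column's missing percentage to the rule-of-thumb bands."""
    pct = frame.isnull().mean() * 100

    def band(p):
        if p == 0:
            return "complete"
        if p < 5:
            return "drop rows or simple imputation"
        if p <= 20:
            return "KNN/iterative imputation + indicator"
        if p <= 50:
            return "consider dropping column, or advanced imputation + indicator"
        return "drop column unless domain knowledge says otherwise"

    return pd.DataFrame({"missing_pct": pct.round(1), "suggestion": pct.map(band)})

# Synthetic 100-row frame with different amounts of missing data per column
demo = pd.DataFrame({
    "embarked": ["S"] * 98 + [None] * 2,
    "age":      [30.0] * 81 + [np.nan] * 19,
    "cabin":    ["C85"] * 23 + [None] * 77,
    "fare":     [10.0] * 100,
})

report = suggest_missing_strategy(demo)
print(report)
```

Treat the suggestion column as a starting point for discussion; the missingness mechanism and the column's business meaning should have the final say.
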
Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain missing data analysis and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 14 — Outlier Detection

Outlier Detection — IQR, Z-Score, and Visualization

Why this matters

A handful of extreme values can dominate means, standard deviations, and the fit of linear or distance-based models, so you need to find them and decide deliberately whether to remove, cap, transform, or keep them.

What is an Outlier?

An outlier is an observation that lies an abnormal distance from other values in a sample. Outliers can be:

  • Genuine extreme values: A billionaire in a salary dataset — real, but extreme.
  • Measurement errors: A person recorded as 250 years old.
  • Data entry errors: Salary entered as $10,000,000 instead of $100,000.
  • Rare events: A flash crash in stock prices — real and important to model.

Method 1: IQR (Interquartile Range) Method

The IQR method is robust because it uses quartiles, not the mean (which is itself affected by outliers). This is the default method used in box plots.

$$IQR = Q_3 - Q_1$$ $$\text{Lower Fence} = Q_1 - 1.5 \times IQR \quad \text{Upper Fence} = Q_3 + 1.5 \times IQR$$

Any value outside these fences is considered an outlier. Using 3.0 instead of 1.5 gives "extreme outliers".
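
A worked example of the fence arithmetic on ten made-up values, using NumPy's default (linear-interpolation) percentiles:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])   # 3.25 and 7.75 for this sample
iqr = q3 - q1                              # 4.5
lower_fence = q1 - 1.5 * iqr               # 3.25 - 6.75 = -3.5
upper_fence = q3 + 1.5 * iqr               # 7.75 + 6.75 = 14.5

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print("Outliers:", outliers)
```

Only the value 100 falls outside the fences. Different quantile interpolation methods can shift Q1 and Q3 slightly on small samples, so fences may differ a little between tools.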

Method 2: Z-Score Method

Z-score measures how many standard deviations a value is from the mean. Assumes data is approximately normally distributed.

$$z = \frac{x - \mu}{\sigma}$$

Values with $|z| > 3$ are typically considered outliers (only about 0.27% of a normal distribution lies beyond 3σ).
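
The tail probability behind the 3σ rule, and one weakness of the method on small samples, can both be checked in a few lines. The sample values below are made up for illustration.

```python
import math
import numpy as np

# Two-sided tail mass of a standard normal beyond 3 sigma: P(|Z| > 3) = erfc(3 / sqrt(2))
tail_beyond_3_sigma = math.erfc(3 / math.sqrt(2))
print(f"P(|Z| > 3) = {tail_beyond_3_sigma:.4%}")

# Small sample with one obvious extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Population standard deviation (ddof=0), the default used by scipy.stats.zscore
z = (values - values.mean()) / values.std()
flagged = values[np.abs(z) > 3]

print(f"z-score of 100: {z[-1]:.3f}")
print("Flagged by |z| > 3:", flagged)
```

The extreme value inflates the mean and standard deviation it is measured against, so its z-score stays just below 3 and nothing is flagged. With ddof=0, the largest possible |z| in a sample of n points is sqrt(n - 1), which means the |z| > 3 rule cannot fire at all when n is 10 or fewer.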

Code Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

df = pd.read_csv('titanic.csv')

# ════════════════════════════════════════════
# METHOD 1: IQR Method
# ════════════════════════════════════════════
def detect_outliers_iqr(data, column, threshold=1.5):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_fence = Q1 - threshold * IQR
    upper_fence = Q3 + threshold * IQR
    
    outliers = data[(data[column] < lower_fence) | (data[column] > upper_fence)]
    
    print(f"Column: {column}")
    print(f"  Q1={Q1:.2f}, Q3={Q3:.2f}, IQR={IQR:.2f}")
    print(f"  Lower fence: {lower_fence:.2f}")
    print(f"  Upper fence: {upper_fence:.2f}")
    print(f"  Outliers detected: {len(outliers)} ({len(outliers)/len(data)*100:.2f}%)")
    return outliers, lower_fence, upper_fence

fare_outliers, fare_lower, fare_upper = detect_outliers_iqr(df, 'Fare')
age_outliers, age_lower, age_upper    = detect_outliers_iqr(df, 'Age')

# ════════════════════════════════════════════
# METHOD 2: Z-Score Method
# ════════════════════════════════════════════
def detect_outliers_zscore(data, column, threshold=3):
    z_scores = np.abs(stats.zscore(data[column].dropna()))
    outlier_mask = z_scores > threshold
    outliers = data[column].dropna()[outlier_mask]
    
    print(f"Column: {column}")
    print(f"  Mean: {data[column].mean():.2f}, Std: {data[column].std():.2f}")
    print(f"  Z-score outliers (|z|>{threshold}): {len(outliers)} ({len(outliers)/len(data)*100:.2f}%)")
    return outliers

detect_outliers_zscore(df, 'Fare')

# ════════════════════════════════════════════
# VISUALIZATION: Box Plots
# ════════════════════════════════════════════
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Basic box plot
df.boxplot(column='Fare', ax=axes[0])
axes[0].set_title('Fare — Box Plot\n(Outliers shown as dots)')

# Grouped box plot — outliers by class
df.boxplot(column='Fare', by='Pclass', ax=axes[1])
axes[1].set_title('Fare by Passenger Class')

# Seaborn box plot with strip overlay
sns.boxplot(data=df, y='Fare', x='Pclass', ax=axes[2], palette='viridis')
sns.stripplot(data=df, y='Fare', x='Pclass', ax=axes[2], color='#d4af37', size=3, alpha=0.5)
axes[2].set_title('Fare Distribution + Individual Points')

plt.suptitle('Outlier Visualization — Box Plots', fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# ════════════════════════════════════════════
# HANDLING OUTLIERS
# ════════════════════════════════════════════
df_clean = df.copy()

# Strategy 1: Remove outliers (use when errors)
df_no_outliers = df_clean[df_clean['Fare'] <= fare_upper]
print(f"Rows after removing Fare outliers: {len(df_no_outliers)}")

# Strategy 2: Cap/Winsorize (use when genuine extreme values)
df_clean['Fare_capped'] = df_clean['Fare'].clip(lower=fare_lower, upper=fare_upper)

# Strategy 3: Log transformation (use for right-skewed data)
df_clean['Fare_log'] = np.log1p(df_clean['Fare'])  # log1p handles 0 values safely

# Compare distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(df['Fare'].dropna(), bins=50, color='#d4af37', alpha=0.7)
axes[0].set_title('Original Fare (Heavy right skew)')
axes[1].hist(df_clean['Fare_capped'].dropna(), bins=50, color='blue', alpha=0.7)
axes[1].set_title('Fare — Winsorized (Capped)')
axes[2].hist(df_clean['Fare_log'].dropna(), bins=50, color='green', alpha=0.7)
axes[2].set_title('Fare — Log Transformed')
plt.tight_layout()
plt.show()
💡
When to Remove vs Keep Outliers

Remove: When they are clear data entry errors or measurement failures.
Cap (Winsorize): When they are genuine but extreme — preserves the observation but limits its influence.
Keep: When outliers are the most important signal (fraud detection, anomaly detection).
Transform: Log/sqrt transforms reduce the effect of outliers without removing data.
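
When the outliers may be the signal, one option is to keep every row and record the outlier status as a feature, optionally alongside a capped copy of the column. A minimal sketch with invented transaction amounts (the column names are hypothetical):

```python
import pandas as pd

tx = pd.DataFrame({"amount": [12.0, 15.5, 9.9, 14.2, 11.8, 13.1, 950.0, 10.4]})

q1, q3 = tx["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep: no rows are dropped; the flag preserves the information for the model
tx["amount_is_outlier"] = ~tx["amount"].between(lower, upper)

# Cap (winsorize): a separate column, so the raw value stays available
tx["amount_capped"] = tx["amount"].clip(lower=lower, upper=upper)

print(tx)
```

Writing the capped values to a new column rather than overwriting `amount` keeps the decision reversible while you compare model performance across strategies.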

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain outlier detection and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 15 — Matplotlib

Matplotlib Deep Dive — Figure/Axes Architecture

Why this matters

Every pandas and Seaborn plot is a Matplotlib Figure underneath, so understanding the Figure/Axes model is what lets you customize, combine, and save any EDA chart.

Understanding the Object-Oriented API

Matplotlib has two interfaces: the pyplot interface (quick and simple) and the object-oriented interface (recommended for complex plots). Every plot consists of a Figure (the canvas) containing one or more Axes (individual plot panels).

Figure (the whole canvas)
  contains: Axes (an individual plot area)
    each Axes has: Title, x/y-axis, Legend, Tick labels, Artists (lines, bars, etc.)
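
The containment hierarchy can be inspected directly on the objects that `plt.subplots` returns. This is a small sketch; the `Agg` backend line is an assumption made so the snippet runs without a display, and it can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; omit in a notebook
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(8, 3))   # one Figure holding two Axes

line, = axes[0].plot([0, 1, 2], [0, 1, 4], label="y = x^2")   # a Line2D artist
axes[0].set_title("Left panel")
axes[0].legend()

axes[1].bar(["A", "B"], [3, 5])                  # two Rectangle artists
axes[1].set_title("Right panel")

print(type(fig).__name__)          # the canvas object
print(len(fig.axes))               # how many Axes the Figure contains
print(line.axes is axes[0])        # each artist knows which Axes it belongs to
print(axes[0].figure is fig)       # each Axes knows its parent Figure
```

Calling methods on a specific `Axes` object, rather than relying on pyplot's notion of the "current" axes, is what keeps multi-panel figures unambiguous.
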
Code Example
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import pandas as pd

# ── Basic Figure/Axes Architecture ──────────────────────────────
fig, ax = plt.subplots(figsize=(8, 5))    # fig=canvas, ax=single plot area

x = np.linspace(0, 2*np.pi, 100)
ax.plot(x, np.sin(x), color='#d4af37', linewidth=2, label='sin(x)')
ax.plot(x, np.cos(x), color='#3a7bd5', linewidth=2, linestyle='--', label='cos(x)')

# Complete customization
ax.set_title('Sine and Cosine Functions', fontsize=14, fontweight='bold', pad=15)
ax.set_xlabel('Angle (radians)', fontsize=12)
ax.set_ylabel('Value', fontsize=12)
ax.legend(loc='upper right', fontsize=11)
ax.grid(True, alpha=0.3, linestyle='--')
ax.axhline(y=0, color='white', alpha=0.3, linewidth=1)
ax.set_xlim(0, 2*np.pi)
ax.set_ylim(-1.2, 1.2)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig('sine_cosine.png', dpi=150, bbox_inches='tight', facecolor='#0e0e18')
plt.show()

# ── Multiple Subplots: plt.subplots ─────────────────────────────
fig, axes = plt.subplots(2, 3, figsize=(15, 8))   # 2 rows × 3 cols

# Access individual axes by [row][col]
axes[0][0].hist(np.random.normal(0, 1, 1000), bins=30, color='#d4af37')
axes[0][0].set_title('Normal Distribution')

axes[0][1].scatter(np.random.rand(100), np.random.rand(100), color='#3a7bd5', alpha=0.6)
axes[0][1].set_title('Scatter Plot')

axes[0][2].bar(['A','B','C','D'], [4, 7, 2, 9], color=['#d4af37','#3a7bd5','#e74c3c','#2ecc71'])
axes[0][2].set_title('Bar Chart')

x = np.linspace(-3, 3, 100)
axes[1][0].plot(x, x**2, color='#e74c3c', linewidth=2)
axes[1][0].fill_between(x, x**2, alpha=0.2, color='#e74c3c')
axes[1][0].set_title('Parabola with Fill')

data = [np.random.normal(loc, 0.5, 100) for loc in [1, 2, 3, 4]]
axes[1][1].boxplot(data, patch_artist=True)
axes[1][1].set_xticklabels(['G1','G2','G3','G4'])  # the 'labels=' argument is deprecated since Matplotlib 3.9
axes[1][1].set_title('Box Plots')

theta = np.linspace(0, 2*np.pi, 100)
axes[1][2].plot(np.cos(theta), np.sin(theta), color='#d4af37', linewidth=2)
axes[1][2].set_aspect('equal')
axes[1][2].set_title('Circle')

plt.suptitle('Matplotlib Subplot Gallery', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# ── Advanced: GridSpec for unequal subplot sizes ────────────────
fig = plt.figure(figsize=(12, 8))
gs = gridspec.GridSpec(2, 3, figure=fig)

ax_main  = fig.add_subplot(gs[0, :2])   # spans first 2 cols, first row
ax_side  = fig.add_subplot(gs[0, 2])    # third col, first row
ax_bot1  = fig.add_subplot(gs[1, 0])    # bottom left
ax_bot2  = fig.add_subplot(gs[1, 1])    # bottom middle
ax_bot3  = fig.add_subplot(gs[1, 2])    # bottom right

data = np.random.randn(500)
ax_main.hist(data, bins=40, color='#d4af37', alpha=0.8, edgecolor='black')
ax_main.set_title('Main: Full Distribution', fontweight='bold')

# RGBA tuple with values in 0-1; Matplotlib does not accept CSS-style 'rgba(...)' strings
ax_side.boxplot(data, patch_artist=True, boxprops=dict(facecolor=(212/255, 175/255, 55/255, 0.3)))
ax_side.set_title('Box Plot')

for ax, title in zip([ax_bot1, ax_bot2, ax_bot3], ['Q1', 'Q2-Q3', 'Q4']):
    ax.set_title(title)

plt.suptitle('GridSpec Layout Example', fontweight='bold')
plt.tight_layout()
plt.show()

# ── Saving Figures — Production Quality ─────────────────────────
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot([1,2,3,4,5], [1,4,9,16,25], 'o-', color='#d4af37', linewidth=2, markersize=8)
ax.set_title('Save Figure Demo', fontsize=14)

# Save options
plt.savefig('plot.png', dpi=300, bbox_inches='tight')          # PNG for web
plt.savefig('plot.pdf', bbox_inches='tight')                    # PDF for papers  
plt.savefig('plot.svg', bbox_inches='tight')                    # SVG for scaling
plt.savefig('plot_dark.png', dpi=300, bbox_inches='tight',     # Dark background
            facecolor='#0d0d1a', edgecolor='none')
plt.show()

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain Matplotlib's Figure/Axes architecture and when to use the object-oriented API over pyplot.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 16 — Seaborn

Seaborn Deep Dive — Statistical Visualization

Why this matters

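Seaborn turns common statistical comparisons (distributions by group, relationships with color and size encodings, correlation heatmaps) into one-line calls, which makes it one of the fastest ways to explore a DataFrame visually.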

Seaborn is built on top of Matplotlib and provides statistically aware visualizations with minimal code. It has three "figure-level" functions (relplot, displot, and catplot), each of which can split the data into multiple subplots through its col and row parameters.

Function | Type | Use For | Kind Options
displot() | Distribution | Univariate/bivariate distributions | hist, kde, ecdf
relplot() | Relational | Relationships between numerical variables | scatter, line
catplot() | Categorical | Distribution of numerical variable across categories | strip, swarm, box, violin, bar, count, point
Code Example
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Set professional theme
sns.set_theme(style="darkgrid", palette="deep")

df = sns.load_dataset('titanic')   # built-in Titanic dataset
tips = sns.load_dataset('tips')    # built-in tips dataset

# ══════════════════════════════════════
# DISTRIBUTION PLOTS (displot)
# ══════════════════════════════════════

# 1. Histogram with KDE
sns.displot(df, x='age', kde=True, hue='survived', bins=25, height=5, aspect=1.4)
plt.title('Age Distribution by Survival')
plt.show()

# 2. KDE plot — great for comparing distributions
sns.displot(df, x='fare', hue='class', kind='kde', fill=True, height=5, aspect=1.4)
plt.title('Fare Distribution by Passenger Class')
plt.show()

# 3. ECDF — cumulative distribution
sns.displot(df, x='age', hue='sex', kind='ecdf', height=5, aspect=1.4)
plt.title('Cumulative Distribution of Age by Gender')
plt.show()

# ══════════════════════════════════════
# RELATIONAL PLOTS (relplot)
# ══════════════════════════════════════

# 4. Scatter with color and size encoding
sns.relplot(
    data=df, x='age', y='fare',
    hue='survived',        # color = survival
    size='pclass',         # size = passenger class
    style='sex',           # marker style = gender
    height=6, aspect=1.2,
    palette={0: '#e74c3c', 1: '#2ecc71'},
    sizes=(30, 200)
)
plt.title('Age vs Fare (color=Survival, size=Class, style=Gender)')
plt.show()

# ══════════════════════════════════════
# CATEGORICAL PLOTS (catplot)
# ══════════════════════════════════════

# 5. Box plot
sns.catplot(data=df, x='class', y='age', kind='box', hue='sex', height=5, aspect=1.3)
plt.title('Age Distribution by Class and Gender')
plt.show()

# 6. Violin plot — shows full distribution shape
sns.catplot(data=df, x='class', y='fare', kind='violin', height=5, aspect=1.3,
            hue='survived', split=True, inner='quart')
plt.title('Fare Distribution by Class (split by survival)')
plt.show()

# 7. Strip plot — shows all individual data points
sns.catplot(data=df, x='pclass', y='age', kind='strip', hue='survived', 
            jitter=True, height=5, aspect=1.3, palette={0:'#e74c3c', 1:'#2ecc71'})
plt.title('Age by Class — Individual Points (jittered)')
plt.show()

# 8. Bar plot — mean + confidence interval
sns.catplot(data=df, x='class', y='survived', kind='bar', 
            hue='sex', height=5, aspect=1.3, palette='Set2')
plt.title('Survival Rate by Class and Gender\n(95% Confidence Intervals shown)')
plt.ylabel('Survival Rate')
plt.show()

# ══════════════════════════════════════
# PAIRPLOT — EDA SUPERWEAPON
# ══════════════════════════════════════

# 9. Pairplot — all pairwise relationships at once!
numeric_df = df[['age', 'fare', 'sibsp', 'parch', 'survived']].dropna()
g = sns.pairplot(
    numeric_df, 
    hue='survived',          # color by target variable
    diag_kind='kde',         # diagonal: KDE plots
    plot_kws={'alpha': 0.5, 's': 30},
    palette={0: '#e74c3c', 1: '#2ecc71'}
)
g.fig.suptitle('Pairplot — All Features vs All Features (color=Survival)', y=1.02)
plt.show()

# ══════════════════════════════════════
# HEATMAP — Correlation Matrix
# ══════════════════════════════════════

# 10. Correlation heatmap
corr_matrix = numeric_df.corr()
fig, ax = plt.subplots(figsize=(8, 6))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))   # hide upper triangle
sns.heatmap(
    corr_matrix,
    annot=True,          # show values in cells
    fmt='.2f',           # 2 decimal places
    cmap='RdYlGn',       # red=negative, green=positive
    vmin=-1, vmax=1,     # fix color scale
    mask=mask,           # show only lower triangle
    square=True,
    linewidths=0.5,
    ax=ax
)
ax.set_title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
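
The `mask` passed to the heatmap above is just a boolean array the same shape as the correlation matrix. A minimal sketch of what it hides, assuming a small made-up frame (`toy` and its columns are hypothetical); note that the default `np.triu` call also hides the diagonal, while `k=1` keeps it visible:

```python
import numpy as np
import pandas as pd

# Hypothetical 3-feature frame, used only to show the shape of the mask
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})
corr = toy.corr()

# True marks a cell that the heatmap will NOT draw
mask_with_diag = np.triu(np.ones_like(corr, dtype=bool))        # upper triangle + diagonal
mask_keep_diag = np.triu(np.ones_like(corr, dtype=bool), k=1)   # strict upper triangle only

# Same effect without plotting: masked cells become NaN
visible = corr.mask(mask_with_diag)
print(visible.round(2))
```

Because the matrix is symmetric, the lower triangle alone carries every pairwise value, so nothing is lost by hiding the rest.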

Common mistakes

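  • Calling plt.title() after a figure-level function (displot, relplot, catplot) that draws several facets: only the last axes gets the title. Use the grid's suptitle for the whole figure.
  • Reading bar plot error bars as the spread of the data; by default they show a confidence interval for the mean.
  • Calling dropna() on the whole frame before plotting, which silently discards rows that are only missing an unrelated column.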

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain Seaborn's figure-level functions (displot, relplot, catplot), pairplot, and heatmap, and when each applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 17 — Plotly

Plotly — Interactive Visualizations

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Plotly creates interactive, web-ready charts with hover tooltips, zoom, pan, and animations. Essential for dashboards, presentations, and exploring large datasets where static charts are insufficient.

Code Example
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd
import numpy as np

df = px.data.gapminder()    # built-in gapminder dataset
tips = px.data.tips()       # built-in tips dataset
titanic = pd.read_csv('titanic.csv')

# ══════════════════════════════════════
# PLOTLY EXPRESS — High-level API
# ══════════════════════════════════════

# 1. Interactive Scatter Plot
fig = px.scatter(
    df.query("year == 2007"),   # only 2007 data
    x='gdpPercap', y='lifeExp',
    size='pop',                 # bubble size = population
    color='continent',          # color by continent
    hover_name='country',       # tooltip title
    hover_data=['year', 'pop'], # additional tooltip info
    log_x=True,                 # log scale for GDP
    size_max=60,
    title='Life Expectancy vs GDP per Capita (2007)',
    labels={'gdpPercap': 'GDP per Capita (log scale)', 'lifeExp': 'Life Expectancy'}
)
fig.update_layout(
    template='plotly_dark',   # dark theme
    title_x=0.5
)
fig.show()

# 2. Animated Bubble Chart — Hans Rosling style!
fig = px.scatter(
    df, x='gdpPercap', y='lifeExp',
    animation_frame='year',     # play animation by year
    animation_group='country',  # track country across frames
    size='pop', color='continent',
    hover_name='country',
    log_x=True, size_max=55,
    range_x=[100, 100000], range_y=[25, 90],
    title='World Development 1952–2007 (Animated)',
    template='plotly_dark'
)
fig.show()

# 3. Interactive Bar Chart
survival_by_class = titanic.groupby('Pclass')['Survived'].mean().reset_index()
fig = px.bar(
    survival_by_class, x='Pclass', y='Survived',
    color='Survived', color_continuous_scale='RdYlGn',
    title='Survival Rate by Passenger Class',
    labels={'Survived': 'Survival Rate', 'Pclass': 'Passenger Class'},
    text='Survived',
    template='plotly_dark'
)
fig.update_traces(texttemplate='%{text:.1%}', textposition='outside')
fig.update_layout(coloraxis_showscale=False)
fig.show()

# 4. Interactive Histogram with marginal plots
fig = px.histogram(
    tips, x='total_bill',           # counts per bin; passing y= would plot sum(y) per bin instead
    color='sex', marginal='box',    # marginal box plot on top
    hover_data=tips.columns,
    barmode='overlay',
    title='Total Bill Distribution with Marginal Box Plots',
    template='plotly_dark'
)
fig.show()

# 5. Choropleth Map
fig = px.choropleth(
    df.query("year == 2007"),
    locations='iso_alpha',
    color='lifeExp',
    hover_name='country',
    color_continuous_scale=px.colors.sequential.Viridis,
    title='Life Expectancy by Country (2007)',
    template='plotly_dark'
)
fig.show()

# ══════════════════════════════════════
# PLOTLY GO — Full control
# ══════════════════════════════════════

# 6. Subplots with plotly
fig = make_subplots(rows=2, cols=2,
    subplot_titles=('Histogram', 'Box Plot', 'Violin', 'Scatter'))

# Add traces to specific subplot positions
fig.add_trace(go.Histogram(x=tips['total_bill'], marker_color='#d4af37', name='Total Bill'), 
              row=1, col=1)
fig.add_trace(go.Box(y=tips['tip'], name='Tip', marker_color='#3a7bd5'), 
              row=1, col=2)
fig.add_trace(go.Violin(y=tips['total_bill'], x=tips['day'], name='By Day',
                        box_visible=True, meanline_visible=True), 
              row=2, col=1)
fig.add_trace(go.Scatter(x=tips['total_bill'], y=tips['tip'], mode='markers',
                         marker=dict(color=tips['size'], colorscale='Viridis', size=8),
                         name='Scatter'), 
              row=2, col=2)

fig.update_layout(height=700, title_text='Plotly Subplot Gallery', 
                  template='plotly_dark', showlegend=False)
fig.show()

# 7. Interactive Pie Chart
fig = px.pie(titanic, names='Pclass', values='Survived',
             title='Total Survivors by Passenger Class',
             template='plotly_dark', hole=0.3)  # donut chart
fig.update_traces(textinfo='percent+label')
fig.show()
💡
Plotly in Jupyter vs HTML

In Jupyter Notebooks, fig.show() renders inline. To save as interactive HTML: fig.write_html('chart.html'). For static PNG export: fig.write_image('chart.png') (requires pip install kaleido). Use plotly.offline.plot() to open in a browser without Jupyter.

Common mistakes

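  • Passing y to px.histogram when you want counts: with y set, each bar shows the sum of y in that bin.
  • Plotting every row of a very large DataFrame in one figure; the saved HTML grows with the data and the browser slows down. Sample or aggregate first.
  • Calling fig.write_image() without the kaleido package installed.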

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain when Plotly's interactive charts are a better fit than static plots.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 18 — Correlation

Correlation Analysis — Pearson, Spearman, Kendall, and VIF

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Types of Correlation

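Pearson Correlation ($r$) measures the strength of the linear relationship between two continuous variables. It is sensitive to outliers, and the usual significance test for $r$ additionally assumes approximately normal data.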

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
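
The formula above can be computed term by term. A minimal sketch in plain Python, assuming small made-up lists (`x`, `y_linear`, and `y_reversed` are hypothetical data chosen so the result is easy to check by hand):

```python
import math

def pearson_r(x, y):
    """Pearson's r computed directly from the deviation sums in the formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the two means
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: product of the square roots of the summed squared deviations
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / (math.sqrt(ss_x) * math.sqrt(ss_y))

# Hypothetical toy data
x = [1, 2, 3, 4, 5]
y_linear = [2 * v + 1 for v in x]   # exact increasing line
y_reversed = [-v for v in x]        # exact decreasing line

r_linear = pearson_r(x, y_linear)
r_reversed = pearson_r(x, y_reversed)
print(round(r_linear, 6), round(r_reversed, 6))
```

In practice `numeric_df.corr(method='pearson')` does the same calculation for every column pair; the hand-written version is only meant to make the formula concrete.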

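Spearman Correlation ($\rho$) is rank-based, so it captures monotonic relationships even when they are non-linear, and it is robust to outliers. The shortcut formula below is exact only when there are no tied ranks; with ties, compute Pearson's $r$ on the ranks instead.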

$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)} \quad \text{where } d_i = \text{rank}(x_i) - \text{rank}(y_i)$$
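
A minimal sketch of the rank-difference formula, assuming data without ties (the helper names `ranks` and `spearman_rho` and the toy lists are hypothetical). It shows why Spearman reports a perfect relationship for a curve that is monotonic but not linear:

```python
def ranks(values):
    """Rank values 1..n. Assumes no ties, which the shortcut formula requires."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho from the sum of squared rank differences."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical toy data
x = [1, 2, 3, 4, 5, 6]
y_cubic = [v ** 3 for v in x]        # monotonic increasing, clearly non-linear
y_descending = [6, 5, 4, 3, 2, 1]    # ranks fully reversed

rho_cubic = spearman_rho(x, y_cubic)
rho_descending = spearman_rho(x, y_descending)
print(rho_cubic, rho_descending)
```

For real data with ties, `scipy.stats.spearmanr` or `DataFrame.corr(method='spearman')` assign average ranks and avoid the no-ties restriction.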

Kendall's Tau ($\tau$) — concordance-based correlation, better for small samples and many tied ranks.
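
To make "concordance-based" concrete, here is a sketch of the simplest variant, tau-a, which counts concordant and discordant pairs. The function name and toy data are hypothetical. Note that this is not exactly what pandas and SciPy report: they default to tau-b, which adds a correction for ties.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1      # both variables move in the same direction
        elif product < 0:
            discordant += 1      # the variables move in opposite directions
        # product == 0 is a tie in x or y; tau-a counts it as neither
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical toy data: one swapped pair out of six
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
tau = kendall_tau_a(x, y)
print(round(tau, 4))
```

With four observations there are six pairs; five agree in direction and one disagrees, so tau-a is (5 - 1) / 6.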

Method | Best For | Assumption | Robust to Outliers
Pearson | Linear relationships, normally distributed data | Linearity, normality, homoscedasticity | No ❌
Spearman | Monotonic (not necessarily linear) relationships | Ordinal or continuous data | Yes ✅
Kendall | Small samples, many ties | Ordinal data | Yes ✅
Code Example
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv('titanic.csv')
numeric_df = df.select_dtypes(include='number').drop(['PassengerId'], axis=1, errors='ignore')

# ══════════════════════════════════════
# CORRELATION MATRICES
# ══════════════════════════════════════

# Pearson correlation (default in pandas)
pearson_corr = numeric_df.corr(method='pearson')

# Spearman rank correlation
spearman_corr = numeric_df.corr(method='spearman')

# Kendall Tau correlation
kendall_corr = numeric_df.corr(method='kendall')

# ══════════════════════════════════════
# CORRELATION HEATMAP (Production Quality)
# ══════════════════════════════════════

def plot_correlation_heatmap(corr_matrix, title, figsize=(9, 7)):
    fig, ax = plt.subplots(figsize=figsize)
    
    # Create mask for upper triangle
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
    
    # Draw heatmap
    sns.heatmap(
        corr_matrix,
        annot=True,
        fmt='.2f',
        cmap='RdYlGn',
        vmin=-1, vmax=1,
        mask=mask,
        square=True,
        linewidths=0.5,
        cbar_kws={'shrink': 0.8},
        ax=ax,
        annot_kws={'size': 9}
    )
    
    # Add correlation strength indicators
    ax.set_title(f'{title}\n(|r| > 0.7 = Strong, 0.3–0.7 = Moderate, < 0.3 = Weak)',
                 fontsize=12, fontweight='bold', pad=15)
    plt.tight_layout()
    plt.show()

plot_correlation_heatmap(pearson_corr, 'Pearson Correlation Matrix')
plot_correlation_heatmap(spearman_corr, 'Spearman Correlation Matrix')

# ══════════════════════════════════════
# POINT BISERIAL CORRELATION
# (Continuous variable vs Binary variable)
# ══════════════════════════════════════

from scipy.stats import pointbiserialr

# Correlation of each feature with the binary target
target = 'Survived'
print("\nPoint-Biserial Correlation with Survived:")
print("=" * 50)
for col in numeric_df.columns:
    if col != target:
        clean = numeric_df[[col, target]].dropna()
        corr, pval = pointbiserialr(clean[col], clean[target])
        sig = "✅ Significant" if pval < 0.05 else "❌ Not significant"
        print(f"{col:15} r={corr:+.3f}  p={pval:.4f}  {sig}")

# ══════════════════════════════════════
# VIF — Variance Inflation Factor
# Detects MULTICOLLINEARITY among features
# ══════════════════════════════════════

from sklearn.linear_model import LinearRegression

def calculate_vif(df):
    """
    VIF for feature X_i = 1 / (1 - R²_i)
    where R²_i = R² from regressing X_i on all other features.
    VIF > 5: concerning; VIF > 10: severe multicollinearity
    """
    vif_data = []
    cols = df.columns.tolist()
    
    for i, col in enumerate(cols):
        X = df.drop(columns=[col])
        y = df[col]
        
        # Drop rows with NaN
        mask = (~X.isnull().any(axis=1)) & (~y.isnull())
        X, y = X[mask], y[mask]
        
        r2 = LinearRegression().fit(X, y).score(X, y)
        vif = 1 / (1 - r2) if r2 < 1 else float('inf')
        vif_data.append({'Feature': col, 'VIF': round(vif, 2)})
    
    return pd.DataFrame(vif_data).sort_values('VIF', ascending=False)

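# VIF measures redundancy among predictors, so exclude the target column first
clean_numeric = numeric_df.drop(columns=['Survived'], errors='ignore').dropna()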
vif_df = calculate_vif(clean_numeric)
print("\nVariance Inflation Factors:")
print(vif_df.to_string(index=False))
# VIF > 10 → severe multicollinearity → consider removing the feature
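
To see the VIF thresholds react to a known redundancy, the sketch below recomputes VIF with NumPy least squares on synthetic data. Everything here is an assumption for illustration: `vif_numpy` is a hypothetical helper, and columns `a`, `b`, `c` are generated so that `c` is almost a copy of `a` while `b` is unrelated.

```python
import numpy as np

def vif_numpy(X):
    """VIF per column of a 2-D array: 1 / (1 - R^2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])   # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                  # independent of a
c = a + 0.1 * rng.normal(size=500)        # nearly a copy of a

vifs = vif_numpy(np.column_stack([a, b, c]))
for name, value in zip("abc", vifs):
    print(f"{name}: VIF = {value:.1f}")
```

Both `a` and `c` should land far above 10, while `b` stays close to 1: multicollinearity always shows up on every member of the redundant group, so dropping one of them is usually enough.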

Common mistakes

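  • Reading a low Pearson r as "no relationship"; a strong non-linear pattern can still have r near 0, so check a scatter plot.
  • Including the target column when computing VIF; VIF describes redundancy among predictors.
  • Treating p < 0.05 as a sign of importance; with many rows, very small correlations become statistically significant.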

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain correlation analysis and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 19 — Pandas Profiling

Automated EDA — ydata-profiling (Pandas Profiling)

Why this matters

EDA is where most production ML failures are prevented — you discover leakage, bad dtypes, and useless features before wasting weeks on modeling.

ydata-profiling (formerly Pandas Profiling) generates a comprehensive HTML EDA report with a single line of code. It covers distributions, missing values, correlations, and more — perfect for quickly understanding a new dataset.

Code Example
# pip install ydata-profiling

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv('titanic.csv')

# ── Basic Report ─────────────────────────────────────────────────
profile = ProfileReport(df, title="Titanic EDA Report", explorative=True)
profile.to_file("titanic_eda_report.html")  # saves interactive HTML
profile.to_notebook_iframe()                # display in Jupyter

# ── Minimal Report (faster for large datasets) ───────────────────
profile_minimal = ProfileReport(df, title="Quick EDA", minimal=True)
profile_minimal.to_file("quick_report.html")

# ── What the Report Covers: ──────────────────────────────────────
# 1. Dataset Overview: shape, dtypes, missing values, duplicate rows
# 2. Per-Variable Analysis:
#    - Numeric: min, max, mean, std, quartiles, histogram, KDE
#    - Categorical: unique values, top categories, bar chart
#    - Date: date range, trend
# 3. Correlations: Pearson, Spearman, Kendall, Cramér's V (categorical)
# 4. Missing Values: heatmap, bar chart, matrix (missingno style)
# 5. Interactions: scatter plots for all numeric pairs
# 6. Duplicates: identifies exact duplicate rows

# ── Comparing Two Datasets (train vs test drift detection) ───────
df_train = pd.read_csv('train.csv')
df_test  = pd.read_csv('test.csv')

report_train = ProfileReport(df_train, title='Train')
report_test  = ProfileReport(df_test,  title='Test')

comparison_report = report_train.compare(report_test)
comparison_report.to_file('train_test_comparison.html')

# ── sweetviz — Another automated EDA library ────────────────────
# pip install sweetviz
import sweetviz as sv

report = sv.analyze(df)
report.show_html('sweetviz_report.html')

# Compare with a different dataset or split
report_compare = sv.compare([df_train, "Train"], [df_test, "Test"])
report_compare.show_html('train_test_sweetviz.html')

# ── D-Tale — Interactive EDA Dashboard ──────────────────────────
# pip install dtale
# import dtale
# d = dtale.show(df)
# d.open_browser()   # opens a full interactive web dashboard!
📌
Automated EDA vs Manual EDA

Automated tools like ydata-profiling are great for initial exploration but should not replace manual EDA. They generate generic insights — your domain expertise and business context will reveal things that automation misses. Use automated EDA to quickly identify where to dig deeper, then use matplotlib/seaborn for targeted investigation.

Common mistakes

  • Skipping EDA and jumping straight to modeling on dirty data.
  • Treating correlation as causation without domain checks.
  • Ignoring class imbalance or duplicate rows visible only in plots.

Interview checkpoints

  • Q: Why is EDA non-negotiable? A: It validates data quality, distributions, and signal before any algorithm choice.
  • Q: EDA on train only or full dataset? A: Explore train deeply; compare test only for drift, never tune on test.

Practice

  1. Basic: Summarize shape, dtypes, missing %, and target distribution for a CSV.
  2. Intermediate: Build a 6-panel EDA dashboard (hist, box, corr heatmap, missing bar).
  3. Advanced: Write an EDA report with 3 actionable feature-engineering ideas.

Recap

  • You can explain automated EDA and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 20 — EDA Case Study

EDA Case Study — End-to-End Titanic Exploration

Why this matters

EDA is where most production ML failures are prevented — you discover leakage, bad dtypes, and useless features before wasting weeks on modeling.

This case study ties together univariate, bivariate, and automated EDA on the Titanic dataset. Follow the checklist: understand schema → missingness → target balance → feature relationships → modeling hypotheses.

Worked example — EDA checklist

  1. Load & profile: shape, dtypes, df.info(), duplicate rows.
  2. Target: survival rate by class/sex — bar plots + crosstab.
  3. Numeric: age/fare distributions; outliers via box plots.
  4. Missing: cabin mostly null → drop or engineer HasCabin; impute age carefully.
  5. Hypothesis: women and 1st class more likely to survive — validate before modeling.
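
Step 4 of the checklist can be sketched with pandas. The frame below is a hypothetical miniature of the Titanic schema with invented values; the column names `has_cabin` and `age_imputed` are illustrative choices, and imputing by passenger class is one reasonable reading of "impute age carefully", not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic schema (values invented for illustration)
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age":    [40.0, np.nan, 50.0, 20.0, 24.0, np.nan],
    "cabin":  ["C85", None, "E46", None, None, None],
})

# 1) Keep the signal in a mostly-null column as a binary indicator
toy["has_cabin"] = toy["cabin"].notna().astype(int)

# 2) Impute age with the median of the passenger's own class,
#    rather than one global median that ignores class differences
class_median = toy.groupby("pclass")["age"].transform("median")
toy["age_imputed"] = toy["age"].fillna(class_median)

print(toy[["pclass", "age", "age_imputed", "has_cabin"]])
```

In a real project the medians should be computed on the training split and then applied to validation and test rows, so that imputation does not leak information across splits.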
Code Example
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('titanic')
print(df.shape, df['survived'].mean())

# Missingness heatmap
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing values — Titanic')
plt.show()

# Bivariate: survival by sex and class
sns.catplot(data=df, x='sex', y='survived', hue='class', kind='bar')
plt.show()

# Numeric vs target
sns.boxplot(data=df, x='survived', y='age')
plt.show()
⚠️
Failure mode

Running ydata-profiling on the full dataset including test labels, then choosing features from the report, leaks information. Always EDA on train split only; use test for final validation once.
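
One way to follow this rule is to split first and only ever hand the training part to your EDA code. A minimal pandas-only sketch; `split_before_eda` is a hypothetical helper, the toy frame is invented, and it assumes the DataFrame has a unique index:

```python
import pandas as pd

def split_before_eda(df, test_fraction=0.2, seed=42):
    """Hold out a random test set before any exploration. Assumes a unique index."""
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(index=test.index)
    return train, test

# Hypothetical data standing in for a real dataset
toy = pd.DataFrame({
    "feature": range(100),
    "target": [i % 2 for i in range(100)],
})

train, test = split_before_eda(toy)

# All profiling, plotting, and feature decisions use `train` only
train_summary = train.describe()
print(len(train), len(test))
```

For classification problems with imbalanced targets, a stratified split (for example scikit-learn's `train_test_split` with `stratify=`) is usually preferable to plain random sampling.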

Common mistakes

  • Skipping EDA and jumping straight to modeling on dirty data.
  • Treating correlation as causation without domain checks.
  • Ignoring class imbalance or duplicate rows visible only in plots.

Interview checkpoints

  • Q: Why is EDA non-negotiable? A: It validates data quality, distributions, and signal before any algorithm choice.
  • Q: EDA on train only or full dataset? A: Explore train deeply; compare test only for drift, never tune on test.

Practice

  1. Basic: Summarize shape, dtypes, missing %, and target distribution for a CSV.
  2. Intermediate: Build a 6-panel EDA dashboard (hist, box, corr heatmap, missing bar).
  3. Advanced: Write an EDA report with 3 actionable feature-engineering ideas.

Recap

  • You can walk through the EDA case study checklist and explain when each step applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 21 — Bivariate+Seaborn

Bivariate Analysis — Relationships Between Two Variables

Why this matters

Bivariate EDA is how you choose the right chart and statistical test before modeling — wrong pairings (e.g., Pearson on ordinal data) hide real relationships and mislead feature engineering.

Key question framework

Before plotting, ask: (1) What are the data types? (numerical↔numerical, numerical↔categorical, categorical↔categorical) (2) What relationship are you exploring — correlation, distribution, comparison, or trend? Extend to three or more variables for multivariate analysis.

Types of Bivariate Analysis

Variable Types | Best Charts | Statistical Test
Numerical × Numerical | Scatter plot, line plot, hex bin plot | Pearson/Spearman correlation
Numerical × Categorical | Box plot, violin plot, bar + error, strip plot | t-test, ANOVA, Mann-Whitney U
Categorical × Categorical | Grouped bar, stacked bar, heatmap of counts | Chi-square test, Cramér's V
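
The table can be turned into a small lookup so the first question (what are the data types?) is answered by the data itself. This is a rough sketch: `variable_kind`, `SUGGESTIONS`, and `suggest` are hypothetical names, and the dtype rule is a simplifying assumption. Integer-coded categories such as a passenger class stored as 1/2/3 would be labeled numerical and need a manual override.

```python
import pandas as pd

def variable_kind(series):
    """Classify a column as 'numerical' or 'categorical' from its dtype alone."""
    if pd.api.types.is_bool_dtype(series):
        return "categorical"
    if pd.api.types.is_numeric_dtype(series):
        return "numerical"
    return "categorical"

# (chart suggestion, test suggestion) keyed by the sorted pair of kinds
SUGGESTIONS = {
    ("numerical", "numerical"): ("scatter or hex bin plot", "Pearson/Spearman correlation"),
    ("categorical", "numerical"): ("box or violin plot", "t-test, ANOVA, or Mann-Whitney U"),
    ("categorical", "categorical"): ("grouped bar or count heatmap", "chi-square test, Cramér's V"),
}

def suggest(df, col_a, col_b):
    kinds = tuple(sorted([variable_kind(df[col_a]), variable_kind(df[col_b])]))
    return SUGGESTIONS[kinds]

# Hypothetical rows in the style of the Titanic data
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "fare": [7.25, 71.28, 7.92],
    "sex": ["male", "female", "female"],
    "survived": [False, True, True],
})

print(suggest(toy, "age", "fare"))
print(suggest(toy, "sex", "age"))
print(suggest(toy, "sex", "survived"))
```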
Code Example
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

df = sns.load_dataset('titanic')

# ══════════════════════════════════════
# NUMERICAL vs NUMERICAL
# ══════════════════════════════════════

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 1. Basic Scatter
axes[0].scatter(df['age'], df['fare'], alpha=0.4, color='#d4af37', s=20)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Fare')
axes[0].set_title('Age vs Fare')

# 2. Scatter with regression line
clean = df[['age','fare']].dropna()
axes[1].scatter(clean['age'], clean['fare'], alpha=0.4, color='#3a7bd5', s=20)
m, b, r, p, se = stats.linregress(clean['age'], clean['fare'])
x_line = np.array([clean['age'].min(), clean['age'].max()])
axes[1].plot(x_line, m*x_line + b, color='#d4af37', linewidth=2, 
             label=f'r={r:.2f}, p={p:.4f}')
axes[1].set_title(f'Age vs Fare with Regression\n(r={r:.2f})')
axes[1].legend()

# 3. Hex Bin — better for overlapping points
axes[2].hexbin(clean['age'], clean['fare'], gridsize=20, cmap='YlOrRd')
axes[2].set_xlabel('Age')
axes[2].set_ylabel('Fare')
axes[2].set_title('Age vs Fare (Hex Bin)\nBetter for dense data')
plt.colorbar(axes[2].collections[0], ax=axes[2], label='Count')

plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# NUMERICAL vs CATEGORICAL
# ══════════════════════════════════════

fig, axes = plt.subplots(1, 4, figsize=(20, 5))

# 4. Box plot — shows distribution per category
sns.boxplot(data=df, x='class', y='age', ax=axes[0], palette='Set2')
axes[0].set_title('Age by Passenger Class\n(Box Plot)')

# 5. Violin plot — shows full distribution shape
sns.violinplot(data=df, x='class', y='fare', ax=axes[1], palette='Set2',
               inner='quart', density_norm='width')  # seaborn >= 0.13; older versions use scale='width'
axes[1].set_title('Fare by Passenger Class\n(Violin Plot)')
axes[1].set_yscale('log')  # log scale for skewed data

# 6. Strip plot — all individual data points
sns.stripplot(data=df, x='class', y='age', ax=axes[2], palette='Set2', 
              jitter=True, size=3, alpha=0.6)
axes[2].set_title('Age by Passenger Class\n(Strip Plot)')

# 7. Bar plot with CI — means + confidence intervals
sns.barplot(data=df, x='class', y='survived', ax=axes[3], palette='Set2',
            estimator=np.mean, errorbar=('ci', 95))  # seaborn >= 0.12; older versions use ci=95
axes[3].set_title('Survival Rate by Class\n(95% CI shown as error bars)')
axes[3].set_ylabel('Survival Rate')

plt.suptitle('Numerical vs Categorical Bivariate Analysis', fontweight='bold')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# CATEGORICAL vs CATEGORICAL
# ══════════════════════════════════════

# 8. Crosstab + Heatmap
crosstab = pd.crosstab(df['class'], df['survived'], 
                        normalize='index')  # normalize by row = rates
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.heatmap(crosstab, annot=True, fmt='.2%', cmap='RdYlGn', 
            vmin=0, vmax=1, ax=axes[0])
axes[0].set_title('Survival Rate: Class vs Survived\n(row-normalized %)')

# 9. Grouped bar chart
ct_raw = pd.crosstab(df['class'], df['sex'])
ct_raw.plot(kind='bar', ax=axes[1], color=['#3a7bd5','#e74c3c'], alpha=0.8)
axes[1].set_title('Count: Passenger Class vs Gender')
axes[1].set_xlabel('Passenger Class')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=0)
axes[1].legend(title='Sex')

plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# STATISTICAL TESTS for Bivariate
# ══════════════════════════════════════

# Test if age differs significantly across classes
first  = df[df['class']=='First']['age'].dropna()
second = df[df['class']=='Second']['age'].dropna()
third  = df[df['class']=='Third']['age'].dropna()

# ANOVA F-test: are the means of 3+ groups significantly different?
f_stat, p_value = stats.f_oneway(first, second, third)
print(f"One-way ANOVA: F={f_stat:.3f}, p={p_value:.6f}")
print("→ Classes have significantly different ages!" if p_value < 0.05 else "→ No significant age difference")

# Chi-square test: are class and sex independent?
chi2, p, dof, expected = stats.chi2_contingency(pd.crosstab(df['class'], df['sex']))
print(f"\nChi-square test (Class vs Sex): χ²={chi2:.3f}, p={p:.6f}, df={dof}")
print("→ Class and Sex are NOT independent!" if p < 0.05 else "→ Class and Sex appear independent")
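
A chi-square p-value says whether two categorical variables are associated, not how strongly. Cramér's V, listed in the table of tests, rescales chi-square to a 0 to 1 effect size. Below is a NumPy-only sketch under stated assumptions: `cramers_v` is a hypothetical helper, it applies no bias or continuity correction, and the two count tables are invented extremes.

```python
import numpy as np

def cramers_v(table):
    """Cramér's V for a contingency table of counts (no bias correction)."""
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical count tables
independent = [[20, 20], [30, 30]]   # column split is identical in every row
dependent = [[50, 0], [0, 50]]       # row fully determines column

v_independent = cramers_v(independent)
v_dependent = cramers_v(dependent)
print(v_independent, v_dependent)
```

On real data you could pass `pd.crosstab(...).to_numpy()` to the same function. For 2×2 tables, `stats.chi2_contingency` applies a continuity correction by default, so its chi-square statistic can differ slightly from the uncorrected one used here.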

Common mistakes

  • Using Pearson correlation on non-linear relationships without checking scatter plots first.
  • Plotting numerical vs categorical with a scatter plot instead of box/violin plots.
  • Ignoring Simpson's paradox — aggregate correlations can reverse within subgroups.
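
Simpson's paradox is easy to reproduce with a handful of points. In this sketch (all values are invented, and `r` is a small hypothetical wrapper around `np.corrcoef`), each subgroup has a perfectly negative trend, yet pooling them gives a clearly positive correlation because the second group sits higher on both axes:

```python
import numpy as np

def r(x, y):
    """Pearson correlation between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical data: two subgroups, each with a perfectly negative trend
group_a_x, group_a_y = [1, 2, 3], [6, 5, 4]
group_b_x, group_b_y = [6, 7, 8], [11, 10, 9]

r_a = r(group_a_x, group_a_y)
r_b = r(group_b_x, group_b_y)
r_pooled = r(group_a_x + group_b_x, group_a_y + group_b_y)

print(f"group A: {r_a:+.2f}  group B: {r_b:+.2f}  pooled: {r_pooled:+.2f}")
```

This is why a scatter plot colored by a grouping variable (the `hue` argument in Seaborn) is worth checking before trusting an aggregate correlation.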

Interview checkpoints

  • Q: Numerical × categorical — best plot and test? A: Box/violin plot; t-test or ANOVA (or Mann-Whitney if non-normal).
  • Q: Two key questions before any bivariate plot? A: (1) data types of both variables, (2) relationship type (correlation, comparison, trend).
  • Q: When Spearman over Pearson? A: Ordinal data or monotonic but non-linear relationships.

Practice

  1. Basic: For Titanic, plot survival vs sex and vs age with appropriate chart types.
  2. Intermediate: Build a Seaborn pairplot on 4 numeric columns; note strongest off-diagonal patterns.
  3. Advanced: Find one feature pair where Pearson and Spearman disagree; explain why with a scatter plot.

Recap

  • Match chart type to variable-type pairing (num×num, num×cat, cat×cat).
  • Ask data types and relationship goal before plotting.
  • Use heatmaps and pairplots to scale to multivariate exploration.

Next: Day 22 — Multivariate

Multivariate Analysis — Three or More Variables at Once

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Code Example
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

df = sns.load_dataset('titanic')

# ══════════════════════════════════════
# 1. PAIRPLOT with HUE (target variable)
# ══════════════════════════════════════
numeric_cols = ['age', 'fare', 'sibsp', 'parch']
pair_data = df[numeric_cols + ['survived', 'class']].dropna()

g = sns.pairplot(
    pair_data, hue='survived', diag_kind='kde',
    plot_kws={'alpha': 0.4, 's': 20},
    diag_kws={'fill': True},
    palette={0: '#e74c3c', 1: '#2ecc71'}
)
g.fig.suptitle('Pairplot: All Numeric Features (color=Survival)', y=1.02, fontsize=14)
plt.show()

# ══════════════════════════════════════
# 2. CLUSTERMAP — Hierarchical clustering of rows AND columns
# ══════════════════════════════════════
# Group features by similarity AND group passengers by similarity simultaneously
sample = df[numeric_cols].dropna().sample(200, random_state=42)
sample_scaled = (sample - sample.mean()) / sample.std()   # standardize

g = sns.clustermap(
    sample_scaled,
    cmap='RdYlGn',
    figsize=(10, 8),
    dendrogram_ratio=0.2,
    cbar_pos=(0.02, 0.2, 0.03, 0.4),
    linewidths=0.01
)
g.fig.suptitle('Clustermap: Passengers Clustered by Feature Similarity', 
               y=1.02, fontsize=12)
plt.show()

# ══════════════════════════════════════
# 3. MULTI-DIMENSIONAL SCATTER
# Encode 4+ variables on a single scatter plot
# ══════════════════════════════════════
fig, ax = plt.subplots(figsize=(10, 7))

survived_colors = {0: '#e74c3c', 1: '#2ecc71'}
class_sizes = {1: 200, 2: 100, 3: 40}

for cls in [1, 2, 3]:
    for surv in [0, 1]:
        subset = df[(df['pclass'] == cls) & (df['survived'] == surv)].dropna(subset=['age','fare'])
        ax.scatter(
            subset['age'], subset['fare'],
            c=survived_colors[surv],
            s=class_sizes[cls],
            alpha=0.5,
            label=f'Class {cls}, {"Survived" if surv else "Died"}',
            marker='o' if surv else '^'
        )

ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Fare', fontsize=12)
ax.set_title('Age vs Fare\n(Color=Survival, Size=Class, Shape=Survived/Died)', fontsize=12)
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=8)
ax.set_yscale('log')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# 4. 3D SCATTER PLOT
# ══════════════════════════════════════
clean = df[['age', 'fare', 'pclass', 'survived']].dropna()

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

colors = clean['survived'].map({0: '#e74c3c', 1: '#2ecc71'})
scatter = ax.scatter(
    clean['age'], clean['fare'], clean['pclass'],
    c=colors, alpha=0.5, s=30
)
ax.set_xlabel('Age')
ax.set_ylabel('Fare')
ax.set_zlabel('Passenger Class')
ax.set_title('3D Scatter: Age × Fare × Class (color=Survival)')

from matplotlib.lines import Line2D
legend_elements = [
    Line2D([0],[0], marker='o', color='w', markerfacecolor='#e74c3c', markersize=10, label='Died'),
    Line2D([0],[0], marker='o', color='w', markerfacecolor='#2ecc71', markersize=10, label='Survived')
]
ax.legend(handles=legend_elements)
plt.show()

# ══════════════════════════════════════
# 5. FACET GRID — Small multiples
# Same plot for each category
# ══════════════════════════════════════
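# Drop only rows missing age; a bare dropna() would also discard every row with a missing deck
g = sns.FacetGrid(df.dropna(subset=['age']), col='class', row='sex', hue='survived',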
                  palette={0:'#e74c3c', 1:'#2ecc71'}, height=3, aspect=1.2)
g.map(sns.histplot, 'age', bins=15, kde=True, alpha=0.6)
g.add_legend(title='Survived')
g.set_titles(row_template='{row_name}', col_template='Class: {col_name}')
g.fig.suptitle('Age Distribution — Faceted by Class & Gender', y=1.05, fontsize=13)
plt.show()

Common mistakes

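  • Calling dropna() on the whole frame before faceting; on the Seaborn Titanic data the mostly-missing deck column removes most rows.
  • Encoding so many variables in one scatter plot (color, size, marker, facet) that it becomes unreadable.
  • Feeding unscaled features to clustermap, so the feature with the largest range dominates the clustering.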

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain multivariate analysis and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 23 — Feature Insights

Feature Insights — Identifying Important vs Redundant Features

Why this matters

Feature Insights: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

What to Look For in EDA

  • High correlation with target: Features that clearly separate classes or correlate with the target variable are valuable.
  • High inter-feature correlation (multicollinearity): Two features that are highly correlated with each other provide redundant information — one can often be dropped.
  • Low or near-zero variance: Features where almost all values are the same provide no discriminative information.
  • Leaky features: Features that are derived from the target or that won't be available at prediction time.
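
The leaky-feature bullet is the hardest one to check by eye. One rough screen, sketched here on synthetic data, is to flag numeric features whose correlation with the target is suspiciously close to perfect. The helper `flag_possible_leaks`, the 0.95 cutoff, and the column names are illustrative assumptions rather than part of this module's code, and a flagged column still needs a domain check to confirm it would be unavailable at prediction time.

```python
import numpy as np
import pandas as pd

def flag_possible_leaks(df, target, threshold=0.95):
    """Return numeric features whose |Pearson r| with the target exceeds threshold.

    Hypothetical helper: the 0.95 cutoff is an arbitrary starting point.
    """
    numeric = df.select_dtypes(include='number')
    corr = numeric.corr()[target].drop(target).abs()
    return corr[corr > threshold].sort_values(ascending=False)

# Synthetic data (assumption: none of these columns come from a real dataset)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    'age': rng.integers(18, 80, size=n),
    'income': rng.normal(50_000, 15_000, size=n),
})
toy['churned'] = (rng.random(n) < 0.3).astype(int)

# A column that is only filled in after the outcome is known: an exact copy of the target
toy['refund_issued'] = toy['churned']

suspects = flag_possible_leaks(toy, 'churned')
print(suspects)
```

A screen like this only catches numeric, near-deterministic leaks. Subtler leaks, such as timestamps recorded after the event, usually surface only when you ask how and when each column is produced.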
Code Example
import pandas as pd
import numpy as np
from sklearn.feature_selection import mutual_info_classif
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('titanic.csv')

# ══════════════════════════════════════
# 1. VARIANCE ANALYSIS — Remove near-zero variance features
# ══════════════════════════════════════
from sklearn.feature_selection import VarianceThreshold

numeric_df = df.select_dtypes(include='number').dropna()
variance = numeric_df.var().sort_values()

print("Feature Variances:")
print(variance.round(3).to_string())

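# Low-variance filter. Variance depends on each feature's scale, so a threshold of 0.01
# corresponds to "~99% same value" only for 0/1 features, where variance = p * (1 - p)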
selector = VarianceThreshold(threshold=0.01)
selector.fit(numeric_df)
low_var_mask = ~selector.get_support()
low_var_features = numeric_df.columns[low_var_mask].tolist()
print(f"\nNear-zero variance features to consider dropping: {low_var_features}")

# ══════════════════════════════════════
# 2. MUTUAL INFORMATION — Non-linear correlation with target
# ══════════════════════════════════════
features_for_mi = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
target = 'Survived'

clean = df[features_for_mi + [target]].dropna()
X = clean[features_for_mi]
y = clean[target]

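# Pclass, SibSp and Parch (column indices 0, 2, 3) are discrete; flag them so MI is estimated accordingly
mi_scores = mutual_info_classif(X, y, discrete_features=[0, 2, 3], random_state=42)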
mi_df = pd.DataFrame({'Feature': features_for_mi, 'MI Score': mi_scores})
mi_df = mi_df.sort_values('MI Score', ascending=False)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(mi_df['Feature'], mi_df['MI Score'], color='#d4af37', alpha=0.8, edgecolor='black')
ax.set_xlabel('Mutual Information Score')
ax.set_title('Feature Importance via Mutual Information\n(Higher = More Informative about Survival)')
ax.invert_yaxis()
for i, (v, f) in enumerate(zip(mi_df['MI Score'], mi_df['Feature'])):
    ax.text(v + 0.002, i, f'{v:.4f}', va='center', fontsize=9)
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# 3. CORRELATION with TARGET (for regression/classification)
# ══════════════════════════════════════
target_corr = clean.corr()['Survived'].drop('Survived').sort_values(key=abs, ascending=False)

fig, ax = plt.subplots(figsize=(8, 4))
colors = ['#2ecc71' if v > 0 else '#e74c3c' for v in target_corr.values]
ax.barh(target_corr.index, target_corr.values, color=colors, alpha=0.8)
ax.axvline(x=0, color='gray', linewidth=0.8)  # gray stays visible on both light and dark backgrounds
ax.set_xlabel('Pearson Correlation with Survived')
ax.set_title('Feature-Target Correlation\n(Green=Positive, Red=Negative)')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

# ══════════════════════════════════════
# 4. IDENTIFYING REDUNDANT FEATURES
# ══════════════════════════════════════
corr_matrix = clean[features_for_mi].corr().abs()

# Find pairs with high correlation
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if corr_matrix.iloc[i, j] > 0.7:  # threshold
            high_corr_pairs.append({
                'Feature 1': corr_matrix.columns[i],
                'Feature 2': corr_matrix.columns[j],
                'Correlation': corr_matrix.iloc[i, j]
            })

if high_corr_pairs:
    print("\nHighly correlated feature pairs (|r| > 0.7):")
    print(pd.DataFrame(high_corr_pairs).sort_values('Correlation', ascending=False).to_string(index=False))
    print("→ Consider dropping one feature from each pair")
else:
    print("\nNo highly correlated feature pairs found.")
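
The pairwise check above can miss a feature that is a combination of several others. A common complement is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features. The following is a minimal NumPy sketch on synthetic data; `vif_table` is a hypothetical helper, and the "VIF above 10" rule of thumb mentioned in the comments is a convention, not a hard limit.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """Compute VIF_j = 1 / (1 - R^2_j) for every column of a numeric DataFrame.

    Hypothetical helper: R^2_j is obtained by least-squares regression of
    column j on all other columns plus an intercept.
    """
    X = X.dropna()
    rows = []
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(others)), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        rows.append({'Feature': col, 'VIF': 1.0 / (1.0 - r2) if r2 < 1 else np.inf})
    return pd.DataFrame(rows).sort_values('VIF', ascending=False).reset_index(drop=True)

# Synthetic example: 'a_plus_b' is almost exactly a + b, 'c' is independent noise
rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = rng.normal(size=300)
demo = pd.DataFrame({
    'a': a,
    'b': b,
    'a_plus_b': a + b + rng.normal(scale=0.05, size=300),
    'c': rng.normal(size=300),
})

vif = vif_table(demo)
print(vif)  # a, b and a_plus_b land far above the common "VIF > 10" rule of thumb
```

In this toy data the pairwise correlation between `a` and `a_plus_b` sits near the 0.7 cutoff, so a pairwise scan may or may not flag it, while the VIF makes the redundancy obvious.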

Common mistakes

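  • Dropping a feature because its Pearson correlation with the target is near zero. Pearson measures only linear association, so check mutual information as well.
  • Reusing cutoffs such as |r| > 0.7 or a variance threshold of 0.01 without checking that they suit the scale and size of your data.
  • Skipping validation: always measure the impact of dropping a feature with a proper holdout or CV.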

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain how to identify important, redundant, and leaky features, and when each check applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 24 — Titanic EDA

Titanic Survival — Complete EDA Case Study

Why this matters

This is a full, end-to-end EDA of the Titanic dataset — one of the most famous ML datasets. This walkthrough demonstrates the complete EDA workflow on a real problem: predicting passenger survival.

Code Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# ══════════════════════════════════════════════════════════════════
# STEP 1: LOAD AND INITIAL INSPECTION
# ══════════════════════════════════════════════════════════════════
df = pd.read_csv('titanic.csv')

print("=" * 60)
print(f"Dataset Shape: {df.shape}")
print(f"Total passengers: {len(df)}")
print(f"\nColumn dtypes:\n{df.dtypes}")
print(f"\nMissing Values:\n{df.isnull().sum()[df.isnull().sum() > 0]}")
print(f"\nBasic Statistics:\n{df.describe().round(2)}")

# ══════════════════════════════════════════════════════════════════
# STEP 2: TARGET VARIABLE ANALYSIS
# ══════════════════════════════════════════════════════════════════
survival_rate = df['Survived'].mean()
print(f"\nOverall Survival Rate: {survival_rate:.2%}")
print(f"Survived: {df['Survived'].sum()}, Died: {(df['Survived']==0).sum()}")
# → 38.38% survived: a moderate class imbalance, so stratify any train/test split

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
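# sort_index() fixes the bar order to 0 (Died), 1 (Survived) so the tick labels below always match
df['Survived'].value_counts().sort_index().plot(kind='bar', ax=axes[0],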
    color=['#e74c3c', '#2ecc71'], alpha=0.8, edgecolor='black')
axes[0].set_xticklabels(['Died', 'Survived'], rotation=0)
axes[0].set_title(f'Target Variable Distribution\n(Imbalance: {survival_rate:.1%} survived)')
axes[0].set_ylabel('Count')

axes[1].pie([df['Survived'].sum(), (df['Survived']==0).sum()],
            labels=['Survived', 'Died'],  # percentages come from autopct, so none are hard-coded here
            colors=['#2ecc71', '#e74c3c'], autopct='%1.1f%%',
            startangle=90, pctdistance=0.85)
axes[1].set_title('Survival Distribution')
plt.tight_layout()
plt.show()

# ══════════════════════════════════════════════════════════════════
# STEP 3: FEATURE-WISE ANALYSIS & KEY INSIGHTS
# ══════════════════════════════════════════════════════════════════

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Titanic EDA — Key Feature Insights', fontsize=16, fontweight='bold')

# 1. Gender — "Women and Children First"
gender_survival = df.groupby('Sex')['Survived'].mean()
bars = axes[0][0].bar(gender_survival.index, gender_survival.values, 
                       color=['#3a7bd5', '#e74c3c'], alpha=0.8, edgecolor='black')
axes[0][0].set_title(f'Survival Rate by Gender\n(Female: {gender_survival["female"]:.1%}, Male: {gender_survival["male"]:.1%})')
axes[0][0].set_ylabel('Survival Rate')
axes[0][0].set_ylim(0, 1)
for bar, val in zip(bars, gender_survival.values):
    axes[0][0].text(bar.get_x() + bar.get_width()/2, val + 0.02, f'{val:.1%}', 
                    ha='center', fontweight='bold')

# 2. Passenger Class
class_survival = df.groupby('Pclass')['Survived'].mean()
axes[0][1].bar(['1st Class', '2nd Class', '3rd Class'], class_survival.values,
                color=['#d4af37', '#aaa', '#cd7f32'], alpha=0.8, edgecolor='black')
axes[0][1].set_title(f'Survival by Passenger Class\n(1st: {class_survival[1]:.0%}, 2nd: {class_survival[2]:.0%}, 3rd: {class_survival[3]:.0%})')
axes[0][1].set_ylabel('Survival Rate')

# 3. Age Distribution by Survival
survived     = df[df['Survived']==1]['Age'].dropna()
died         = df[df['Survived']==0]['Age'].dropna()
axes[0][2].hist(died, bins=25, alpha=0.6, color='#e74c3c', label=f'Died (n={len(died)})', density=True)
axes[0][2].hist(survived, bins=25, alpha=0.6, color='#2ecc71', label=f'Survived (n={len(survived)})', density=True)
axes[0][2].axvline(x=16, color='yellow', linestyle='--', alpha=0.8, label='Age 16 (children)')
axes[0][2].set_title('Age Distribution by Survival')
axes[0][2].set_xlabel('Age')
axes[0][2].legend(fontsize=8)

# 4. Fare Distribution by Survival
axes[1][0].boxplot([
    df[df['Survived']==0]['Fare'].dropna(),
    df[df['Survived']==1]['Fare'].dropna()
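], labels=['Died', 'Survived'], patch_artist=True,
   # Matplotlib does not parse CSS-style 'rgba(...)' strings; pass an RGBA tuple in the 0-1 range
   boxprops=dict(facecolor=(212/255, 175/255, 55/255, 0.2)))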
axes[1][0].set_title('Fare by Survival\nHigher fare → higher survival')
axes[1][0].set_ylabel('Fare (£)')

# 5. Embarked port
emb_survival = df.groupby('Embarked')['Survived'].mean()
port_names = {'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'}
emb_survival.index = [port_names.get(e, e) for e in emb_survival.index]
axes[1][1].bar(emb_survival.index, emb_survival.values, 
               color=['#d4af37', '#3a7bd5', '#e74c3c'], alpha=0.8, edgecolor='black')
axes[1][1].set_title('Survival Rate by Port of Embarkation')
axes[1][1].set_ylabel('Survival Rate')

# 6. Family size analysis
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
family_survival = df.groupby('FamilySize')['Survived'].agg(['mean','count'])
axes[1][2].bar(family_survival.index, family_survival['mean'], 
               alpha=0.8, color='#d4af37', edgecolor='black')
axes[1][2].set_title('Survival Rate by Family Size\n(Alone=1, Sweet spot=2-4)')
axes[1][2].set_xlabel('Family Size')
axes[1][2].set_ylabel('Survival Rate')

plt.tight_layout()
plt.show()

# ══════════════════════════════════════════════════════════════════
# STEP 4: KEY INSIGHTS SUMMARY
# ══════════════════════════════════════════════════════════════════
print("\n" + "="*60)
print("KEY INSIGHTS FROM TITANIC EDA")
print("="*60)
print(f"✅ Gender: Women had {gender_survival['female']:.0%} survival rate vs {gender_survival['male']:.0%} for men")
print(f"✅ Class: 1st class passengers survived at {class_survival[1]:.0%} vs {class_survival[3]:.0%} for 3rd class")
print(f"✅ Age: Children (<16) had higher survival rate than adults")
print(f"✅ Fare: Survivors paid higher fares on average (proxy for class)")
print(f"✅ Family: Solo travelers and very large families had lower survival")
print(f"⚠️  Missing Data: Age={df['Age'].isnull().mean():.0%}, Cabin={df['Cabin'].isnull().mean():.0%}")
print(f"⚠️  Class Imbalance: Only {survival_rate:.1%} survived; use stratified splits!")
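
With a column as sparse as Cabin, it is worth checking whether the missingness itself carries signal before deciding to drop it. The sketch below shows one way to do that: compare the target rate between rows where the value is missing and rows where it is present. The data is synthetic, and the link between missingness and survival is built in purely for the demonstration. Only the column names mirror the Titanic CSV used above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Titanic columns (not the real data)
rng = np.random.default_rng(1)
n = 800
toy = pd.DataFrame({'Survived': (rng.random(n) < 0.4).astype(int)})

# Assumption for the demo: Cabin is missing more often for non-survivors
p_missing = np.where(toy['Survived'] == 1, 0.5, 0.9)
toy['Cabin'] = 'C85'
toy.loc[rng.random(n) < p_missing, 'Cabin'] = np.nan

# Compare the target rate between rows with and without a recorded Cabin
status = toy['Cabin'].isna().map({True: 'missing', False: 'present'})
rate_by_missing = toy.groupby(status)['Survived'].agg(['mean', 'count'])
print(rate_by_missing.round(3))
```

If the two rates differ clearly on your real data, a binary "is missing" indicator may be a more useful feature than the imputed value.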

Common mistakes

  • Skipping EDA and jumping straight to modeling on dirty data.
  • Treating correlation as causation without domain checks.
  • Ignoring class imbalance or duplicate rows visible only in plots.

Interview checkpoints

  • Q: Why is EDA non-negotiable? A: It validates data quality, distributions, and signal before any algorithm choice.
  • Q: EDA on train only or full dataset? A: Explore train deeply; compare test only for drift, never tune on test.
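
One way to act on the train-only advice is to split first, stratifying on the target so both parts keep the same class balance, and then run the exploration on the training part. This is a minimal sketch with scikit-learn's `train_test_split`. The DataFrame is synthetic, with a positive rate chosen to resemble the roughly 38% reported above, and the 80/20 ratio is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption: not the real Titanic data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Fare': rng.exponential(30, size=1000),
    'Survived': (rng.random(1000) < 0.38).astype(int),
})

# stratify keeps the class proportions (up to rounding) identical in both parts
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy['Survived'], random_state=42
)

rates = {
    'full': toy['Survived'].mean(),
    'train': train['Survived'].mean(),
    'test': test['Survived'].mean(),
}
print({k: round(v, 3) for k, v in rates.items()})

# From here on, explore `train` only; touch `test` just to compare distributions for drift
```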

Practice

  1. Basic: Summarize shape, dtypes, missing %, and target distribution for a CSV.
  2. Intermediate: Build a 6-panel EDA dashboard (hist, box, corr heatmap, missing bar).
  3. Advanced: Write an EDA report with 3 actionable feature-engineering ideas.

Recap

  • You can walk through a complete EDA of the Titanic dataset, from initial inspection to a summary of key insights.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 25 — EDA Project

EDA Project Template and Checklist

Why this matters

Use this standardized template for every new ML project. The checklist ensures you don't miss any critical EDA step.

Code Example
"""
=====================================
STANDARD EDA TEMPLATE
GenAIWallah — 100 Days of ML
=====================================
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

plt.style.use('dark_background')
sns.set_palette('Set2')

def run_eda(filepath, target_col, task='classification'):
    """
    Complete EDA template. 
    task: 'classification' or 'regression'
    """
    print("=" * 70)
    print("EDA REPORT — GenAIWallah Standard Template")
    print("=" * 70)
    
    # ─────────────────────────────────────────────────────
    # SECTION 1: LOAD AND OVERVIEW
    # ─────────────────────────────────────────────────────
    df = pd.read_csv(filepath)
    print(f"\n📂 Dataset: {filepath}")
    print(f"📐 Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"📦 Memory Usage: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB")
    print(f"🔄 Duplicate Rows: {df.duplicated().sum()}")
    
    # ─────────────────────────────────────────────────────
    # SECTION 2: FEATURE CLASSIFICATION
    # ─────────────────────────────────────────────────────
    numeric_cols = df.select_dtypes(include='number').columns.tolist()
    cat_cols     = df.select_dtypes(include='object').columns.tolist()
    bool_cols    = df.select_dtypes(include='bool').columns.tolist()
    
    print(f"\n🔢 Numeric features ({len(numeric_cols)}): {numeric_cols}")
    print(f"🏷️  Categorical features ({len(cat_cols)}): {cat_cols}")
    
    # ─────────────────────────────────────────────────────
    # SECTION 3: TARGET VARIABLE
    # ─────────────────────────────────────────────────────
    print(f"\n🎯 TARGET: {target_col}")
    if task == 'classification':
        vc = df[target_col].value_counts()
        print(f"   Class distribution:\n{vc}")
        print(f"   Imbalance ratio: {vc.max()/vc.min():.2f}:1")
    else:
        print(f"   Mean: {df[target_col].mean():.4f}")
        print(f"   Std:  {df[target_col].std():.4f}")
        print(f"   Min:  {df[target_col].min():.4f}")
        print(f"   Max:  {df[target_col].max():.4f}")
        print(f"   Skew: {df[target_col].skew():.4f}")
    
    # ─────────────────────────────────────────────────────
    # SECTION 4: MISSING DATA
    # ─────────────────────────────────────────────────────
    missing = df.isnull().sum()
    missing = missing[missing > 0].sort_values(ascending=False)
    if len(missing) > 0:
        print("\n⚠️  Missing Values:")
        for col, cnt in missing.items():
            pct = cnt / len(df) * 100
            severity = "🔴 CRITICAL" if pct > 50 else ("🟡 HIGH" if pct > 20 else "🟢 LOW")
            print(f"   {col:20} {cnt:5d} ({pct:5.1f}%) {severity}")
    else:
        print("\n✅ No missing values!")
    
    # ─────────────────────────────────────────────────────
    # SECTION 5: OUTLIER SUMMARY
    # ─────────────────────────────────────────────────────
    print("\n📊 Outlier Analysis (IQR Method):")
    for col in numeric_cols:
        if col == target_col:
            continue
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outliers = df[(df[col] < q1 - 1.5*iqr) | (df[col] > q3 + 1.5*iqr)]
        if len(outliers) > 0:
            pct = len(outliers)/len(df)*100
            print(f"   {col:20} {len(outliers):4d} outliers ({pct:.1f}%)")
    
    # ─────────────────────────────────────────────────────
    # SECTION 6: DISTRIBUTION ANALYSIS (Auto-plots)
    # ─────────────────────────────────────────────────────
    n_num = len(numeric_cols)
    if n_num > 0:
        ncols = 3
        nrows = (n_num + ncols - 1) // ncols
        fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*4, nrows*3))
        axes = axes.flatten()
        
        for i, col in enumerate(numeric_cols):
            skew = df[col].skew()
            axes[i].hist(df[col].dropna(), bins=25, color='#d4af37', alpha=0.7, edgecolor='black')
            axes[i].set_title(f'{col}\nSkew: {skew:.2f}', fontsize=9)
            axes[i].axvline(df[col].mean(), color='red', linestyle='--', linewidth=1, alpha=0.8)
        
        # Hide unused axes
        for i in range(len(numeric_cols), len(axes)):
            axes[i].set_visible(False)
        
        plt.suptitle('Univariate Distributions (Red line = Mean)', fontweight='bold')
        plt.tight_layout()
        plt.show()
    
    # ─────────────────────────────────────────────────────
    # SECTION 7: CORRELATION HEATMAP
    # ─────────────────────────────────────────────────────
    if n_num > 1:
        fig, ax = plt.subplots(figsize=(max(8, n_num), max(6, n_num-1)))
        corr = df[numeric_cols].corr()
        mask = np.triu(np.ones_like(corr, dtype=bool))
        sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdYlGn', 
                    mask=mask, square=True, vmin=-1, vmax=1, ax=ax)
        ax.set_title('Correlation Heatmap (lower triangle)')
        plt.tight_layout()
        plt.show()
    
    print("\n✅ EDA Complete! Review plots and insights above.")
    return df

# ─────────────────────────────────────────────────────────────────
# EDA CHECKLIST (copy for every project)
# ─────────────────────────────────────────────────────────────────
"""
□ Load data and check .info(), .head(), .describe()
□ Count rows, columns, memory usage
□ Check for duplicate rows
□ Classify features: numeric, categorical, datetime, boolean
□ Analyze target variable distribution and class balance
□ Missing value analysis: count, percentage, type (MCAR/MAR/MNAR)
□ Univariate analysis for all numeric features (histogram + KDE)
□ Univariate analysis for all categorical features (value_counts + bar)
□ Outlier detection: IQR method + visualization for all numeric features
□ Check skewness and kurtosis for all numeric features
□ Bivariate analysis: all features vs target variable
□ Correlation matrix heatmap (Pearson + Spearman)
□ Check for multicollinearity (VIF)
□ Multivariate analysis: pairplot with target hue
□ Statistical tests for significant relationships
□ Document all insights with business interpretation
□ List features to engineer / transform / drop
□ List preprocessing steps needed (imputation, scaling, encoding)
"""
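
The checklist asks for both Pearson and Spearman correlations, while the template's heatmap uses the pandas default (Pearson). The short sketch below, on synthetic data, illustrates why the two can disagree: for a monotonic but strongly non-linear relationship, rank-based Spearman reports perfect association while Pearson does not. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data: y grows monotonically with x, but exponentially rather than linearly
rng = np.random.default_rng(7)
x = rng.uniform(0, 5, size=200)
demo = pd.DataFrame({'x': x, 'y': np.exp(x)})

pearson = demo.corr(method='pearson').loc['x', 'y']    # measures linear association
spearman = demo.corr(method='spearman').loc['x', 'y']  # measures monotonic (rank) association

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two is a hint that a transform such as a log may linearize the feature. Inside the template, passing `method='spearman'` to `df[numeric_cols].corr()` would produce the rank-based heatmap.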
🚀
Module 2 Complete — What's Next?

You've mastered EDA! The patterns you discovered now drive Module 3: Preprocessing. Every decision in Module 3 (which imputation strategy, which scaler, which encoding) should be backed by EDA insights. Move on to Module 3: Data Preprocessing & Feature Engineering.

Common mistakes

  • Skipping EDA and jumping straight to modeling on dirty data.
  • Treating correlation as causation without domain checks.
  • Ignoring class imbalance or duplicate rows visible only in plots.

Interview checkpoints

  • Q: Why is EDA non-negotiable? A: It validates data quality, distributions, and signal before any algorithm choice.
  • Q: EDA on train only or full dataset? A: Explore train deeply; compare test only for drift, never tune on test.

Practice

  1. Basic: Summarize shape, dtypes, missing %, and target distribution for a CSV.
  2. Intermediate: Build a 6-panel EDA dashboard (hist, box, corr heatmap, missing bar).
  3. Advanced: Write an EDA report with 3 actionable feature-engineering ideas.

Recap

  • You can apply the EDA project template and checklist to any new dataset.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.


[Figure: Exploratory Data Analysis, feature correlation matrix grid (Age vs Salary vs Purchased)]