100 Days of ML · Module 5 (70)

Module 5: Unsupervised Learning

100 Days of ML Module 5 — Master unsupervised learning: K-Means, DBSCAN, hierarchical clustering, PCA, t-SNE, UMAP, anomaly detection, and association rules.

⏱ 50 Min Read • 70 • Updated: May 2026

Unsupervised learning finds hidden structure in unlabeled data. No labels, no feedback — the algorithm discovers groupings, compressed representations, anomalies, and patterns entirely on its own. This module covers the most widely-used unsupervised algorithms in industry: K-Means, hierarchical clustering, DBSCAN, PCA, t-SNE, UMAP, anomaly detection, and association rule mining.

K-Means Clustering

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

What is K-Means?

K-Means partitions $n$ data points into $k$ clusters, where each point belongs to the cluster with the nearest mean (centroid). It minimises the Within-Cluster Sum of Squares (WCSS), also called inertia:

$$WCSS = \sum_{i=1}^{k}\sum_{x \in C_i} \|x - \mu_i\|^2$$

where $C_i$ is cluster $i$, $\mu_i$ is its centroid, and $\|\cdot\|$ is the Euclidean distance.

Algorithm Steps (Lloyd's Algorithm)

Initialise: Randomly select $k$ points from the dataset as initial centroids $\mu_1, \mu_2, \ldots, \mu_k$.
Assign: For every data point $x_i$, assign it to the closest centroid: $C_j = \arg\min_j \|x_i - \mu_j\|^2$
Update: Recompute each centroid as the mean of all points assigned to it: $\mu_j = \frac{1}{|C_j|}\sum_{x \in C_j} x$
Repeat: Iterate steps 2–3 until the centroids stop moving (convergence) or a max iteration count is reached.

Init K centroids

→

Assign points

→

Update centroids

→

Converged?

→ No → Assign

Typical convergence: 10–300 iterations

K-Means++ Initialisation

Standard random initialisation can converge to poor local minima. K-Means++ (default in sklearn) chooses initial centroids probabilistically — farther points are more likely to be chosen:

Choose the first centroid $\mu_1$ uniformly at random from the data.
For each subsequent centroid, choose $x$ with probability proportional to $D(x)^2$, where $D(x)$ is the distance to the nearest already-chosen centroid.
Repeat until $k$ centroids are chosen.

💡

K-Means++ Effect

K-Means++ guarantees an $O(\log k)$ approximation to the optimal solution and typically converges 2–5× faster than random initialisation. Always use init='k-means++' (the sklearn default).

Python Implementation

sklearn K-Means — Full Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# ── Generate sample data ──────────────────────────────────────
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.7, random_state=42)

# ── Scale features (IMPORTANT for K-Means — it's distance-based) ─
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ── Fit K-Means ───────────────────────────────────────────────
kmeans = KMeans(
    n_clusters=4,       # Number of clusters k
    init='k-means++',   # Smart initialisation (default)
    n_init=10,          # Run 10 times, keep best result
    max_iter=300,       # Max iterations per run
    random_state=42
)
kmeans.fit(X_scaled)

# ── Inspect results ───────────────────────────────────────────
print("Cluster labels:", kmeans.labels_[:10])   # e.g. [2, 0, 1, 3, ...]
print("Inertia (WCSS):", kmeans.inertia_)        # 45.23
print("Centroids shape:", kmeans.cluster_centers_.shape)  # (4, 2)
print("Iterations to converge:", kmeans.n_iter_)  # e.g. 12

# ── Predict new points ────────────────────────────────────────
new_points = np.array([[1.5, -0.5], [-2.0, 3.0]])
new_scaled = scaler.transform(new_points)
predictions = kmeans.predict(new_scaled)
print("New point clusters:", predictions)

# ── Visualise ─────────────────────────────────────────────────
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='tab10', s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='X', s=200, c='red', label='Centroids')
plt.title("K-Means Clusters (k=4)")
plt.legend()
plt.tight_layout()
plt.show()

Key Limitations

Limitation	Explanation	Solution
Must specify K	You must provide the number of clusters in advance	Use Elbow Method or Silhouette Score (Day 62)
Assumes spherical clusters	Fails on non-convex or elongated clusters	Use DBSCAN or Gaussian Mixture Models
Sensitive to outliers	Outliers pull centroids away from true cluster centers	Use K-Medoids or remove outliers first
Sensitive to scale	Features with larger ranges dominate distance	Always StandardScale before K-Means
Local minima	Random init may converge to suboptimal solution	Use K-Means++ and n_init≥10

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain k-means clustering and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 62 — Elbow Method

Choosing K — Elbow Method, Silhouette Score, Davies-Bouldin

Why this matters

Choosing K: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

The Elbow Method

Plot WCSS (inertia) against different values of $k$. As $k$ increases, inertia always decreases. The elbow point — where the rate of decrease sharply slows — suggests the optimal $k$.

Intuition: Adding more clusters always reduces inertia, but after the true number of clusters, the improvement becomes marginal. The "elbow" is where diminishing returns kick in.
      

Code Example

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# ── Compute inertia for k = 1..10 ────────────────────────────
inertias = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

# ── Plot Elbow Curve ──────────────────────────────────────────
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method — Find Optimal K')
plt.axvline(x=4, color='red', linestyle='--', label='Elbow at k=4')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Silhouette Score

The silhouette score for a single point $i$ is:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

where $a(i)$ = mean distance to points in the same cluster, $b(i)$ = mean distance to points in the nearest other cluster.

$s(i) = +1$: Point is perfectly inside its own cluster, far from others
$s(i) = 0$: Point is on the boundary between two clusters
$s(i) = -1$: Point is probably in the wrong cluster

Code Example

from sklearn.metrics import silhouette_score, davies_bouldin_score
import numpy as np

silhouette_scores = []
db_scores = []
K_range = range(2, 11)  # Silhouette undefined for k=1

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    sil = silhouette_score(X, labels)
    db  = davies_bouldin_score(X, labels)
    silhouette_scores.append(sil)
    db_scores.append(db)
    print(f"k={k:2d}  Silhouette={sil:.4f}  Davies-Bouldin={db:.4f}")

# Best k: highest silhouette score, lowest Davies-Bouldin
best_k_sil = K_range[np.argmax(silhouette_scores)]
best_k_db  = K_range[np.argmin(db_scores)]
print(f"
Best k by Silhouette: {best_k_sil}")
print(f"Best k by Davies-Bouldin: {best_k_db}")

Davies-Bouldin Index

Measures the average ratio of within-cluster scatter to between-cluster separation. Lower is better (unlike silhouette which is higher=better). No negative values — range is $[0, \infty)$.

📌

Choosing the Right Method

Use all three methods together: Elbow for a quick visual, Silhouette for mathematical optimum, Davies-Bouldin as a cross-check. When they agree, you have high confidence. When they disagree, domain knowledge should guide the final choice.

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain choosing k and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 63 — Hierarchical Clustering

Hierarchical Clustering

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Two Approaches

Approach	Direction	Process	Result
Agglomerative (bottom-up)	Merge upward	Start with each point as its own cluster. Iteratively merge the two closest clusters until one cluster remains.	Most commonly used
Divisive (top-down)	Split downward	Start with all points in one cluster. Recursively split the least cohesive cluster.	Computationally expensive; rarely used

Linkage Criteria

When merging clusters, how do we measure the distance between two clusters?

Linkage	Distance Measure	Pros	Cons
Single	Minimum distance between any two points in the two clusters	Can handle non-elliptic shapes	Chaining effect — elongated "sausage" clusters
Complete	Maximum distance between any two points	Produces compact, spherical clusters	Sensitive to outliers
Average	Mean distance between all pairs of points	Good compromise	Biased toward globular clusters
Ward	Minimises increase in total WCSS when merging	Most popular; produces compact equal-size clusters	Only valid with Euclidean distance

Dendrograms

A dendrogram is a tree diagram showing the hierarchical merge sequence. The y-axis shows the distance (dissimilarity) at which clusters merge. You choose $k$ by drawing a horizontal line — the number of vertical lines it crosses is the number of clusters.

Code Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# ── Data ────────────────────────────────────────────────────
X, _ = make_blobs(n_samples=80, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# ── Dendrogram with scipy ────────────────────────────────────
linked = linkage(X, method='ward')   # Ward linkage matrix

plt.figure(figsize=(14, 5))
dendrogram(
    linked,
    orientation='top',
    distance_sort='descending',
    show_leaf_counts=True,
    leaf_font_size=6,
    color_threshold=4.0   # Draw horizontal line at this height
)
plt.title("Hierarchical Clustering Dendrogram (Ward Linkage)")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.axhline(y=4.0, color='red', linestyle='--', label='Cut line → 4 clusters')
plt.legend()
plt.tight_layout()
plt.show()

# ── Agglomerative Clustering with sklearn ───────────────────
agg = AgglomerativeClustering(
    n_clusters=4,
    linkage='ward',          # 'single', 'complete', 'average', 'ward'
    metric='euclidean'       # Only 'euclidean' valid for Ward
)
labels = agg.fit_predict(X)
print("Cluster labels:", np.unique(labels))  # [0, 1, 2, 3]

# With affinity='precomputed' you can use custom distance matrices
# Useful for text clustering with cosine similarity

💡

When to Choose Hierarchical over K-Means

Use hierarchical clustering when: (1) you don't know the number of clusters in advance and want to explore the dendrogram, (2) you need a tree-structured output, (3) you have small-to-medium data (<10,000 points — it's O(n² log n) memory/time). For large data, K-Means or DBSCAN scale better.

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain hierarchical clustering and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 64 — DBSCAN

DBSCAN — Density-Based Spatial Clustering

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Core Idea

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters as dense regions of points separated by sparse regions. Unlike K-Means, it does not require specifying the number of clusters and can detect arbitrarily-shaped clusters.

Two Parameters

ε (epsilon): The radius of the neighbourhood around each point. A point's neighbourhood = all points within distance ε.
min_samples: The minimum number of points required to form a dense region (including the point itself). Typical values: 3–10.

Three Types of Points

Point Type	Definition	Label
Core Point	Has at least `min_samples` points (including itself) within radius ε	Belongs to a cluster
Border Point	Within ε of a core point but has fewer than `min_samples` neighbours	Belongs to the cluster of its core point
Noise Point	Not a core point AND not within ε of any core point	Label = -1 (outlier)

Key Insight: DBSCAN automatically labels outliers as -1. This makes it a natural anomaly detector for free! Any point that doesn't fit in a dense cluster is flagged as noise.
      

Code Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons, make_blobs
from sklearn.preprocessing import StandardScaler

# ── K-Means fails on crescent shapes; DBSCAN handles them ────
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

dbscan = DBSCAN(
    eps=0.3,           # Neighbourhood radius
    min_samples=5,     # Minimum points to be a core point
    metric='euclidean' # Distance metric
)
labels = dbscan.fit_predict(X)

# ── Analyse results ──────────────────────────────────────────
n_clusters  = len(set(labels)) - (1 if -1 in labels else 0)
n_noise     = np.sum(labels == -1)
core_mask   = np.zeros_like(labels, dtype=bool)
core_mask[dbscan.core_sample_indices_] = True

print(f"Clusters found: {n_clusters}")
print(f"Noise points:   {n_noise}")

# ── Visualise ────────────────────────────────────────────────
plt.figure(figsize=(8, 4))
unique_labels = set(labels)
colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))

for label, col in zip(unique_labels, colors):
    if label == -1:
        col = 'black'  # Noise = black
    mask = labels == label
    plt.scatter(X[mask & core_mask, 0], X[mask & core_mask, 1],
                c=[col], s=50, label=f'Cluster {label}' if label != -1 else 'Noise')
    plt.scatter(X[mask & ~core_mask, 0], X[mask & ~core_mask, 1],
                c=[col], s=15, alpha=0.5)

plt.title(f"DBSCAN — {n_clusters} clusters, {n_noise} noise points")
plt.legend()
plt.tight_layout()
plt.show()

# ── Choosing epsilon with the k-distance plot ────────────────
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=5).fit(X)
distances, _ = nbrs.kneighbors(X)
k_distances = np.sort(distances[:, -1])[::-1]  # Sort descending

plt.figure(figsize=(8, 3))
plt.plot(k_distances)
plt.xlabel('Points (sorted)')
plt.ylabel(f'5th nearest neighbour distance')
plt.title('K-Distance Plot — elbow suggests optimal epsilon')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

DBSCAN vs K-Means Comparison

Feature	K-Means	DBSCAN
Cluster shape	Spherical / convex only	Arbitrary (crescents, rings, blobs)
Number of clusters	Must be specified (k)	Auto-determined from data density
Outliers	Every point assigned to a cluster	Outliers explicitly labelled as -1
Parameters	k, n_init	eps, min_samples
Scalability	O(nkI) — fast for large data	O(n log n) with spatial index; O(n²) naive
Varying density	Handles uniform density	Struggles with highly varying density (use HDBSCAN)

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain dbscan and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 65 — PCA

PCA — Principal Component Analysis

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

What is PCA?

PCA is a linear dimensionality reduction technique. It finds new axes (principal components) that capture the maximum variance in the data. Each principal component is a linear combination of original features and is orthogonal to all other components.

Mathematical Foundation

Given a data matrix $X \in \mathbb{R}^{n \times p}$ (n samples, p features):

Centre the data: Subtract the mean: $\tilde{X} = X - \bar{X}$
Compute covariance matrix: $\Sigma = \frac{1}{n-1}\tilde{X}^T\tilde{X}$
Eigen-decomposition: $\Sigma v = \lambda v$ — find eigenvalues $\lambda$ and eigenvectors $v$
Sort by variance: Order eigenvectors by their eigenvalues (largest first) — these are the principal components
Project: $Z = \tilde{X} W$ where $W$ is the matrix of top-$d$ eigenvectors

Explained Variance Ratio: Each principal component explains $\frac{\lambda_i}{\sum_j \lambda_j}$ of the total variance. Aim to retain components explaining ≥ 95% cumulative variance for most tasks.
      

Code Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# ── Load high-dimensional dataset (64 features) ──────────────
digits = load_digits()
X = digits.data          # (1797, 64) — 64 pixel features
y = digits.target

# ── Always scale before PCA ──────────────────────────────────
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ── Full PCA to examine variance ──────────────────────────────
pca_full = PCA()   # All components
pca_full.fit(X_scaled)

# ── Scree Plot ────────────────────────────────────────────────
explained_var = pca_full.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.bar(range(1, 21), explained_var[:20], color='#d4af37', alpha=0.8)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot — Individual Variance')

plt.subplot(1, 2, 2)
plt.plot(range(1, len(cumulative_var)+1), cumulative_var, 'b-o', markersize=3)
plt.axhline(y=0.95, color='red', linestyle='--', label='95% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance')
plt.legend()
plt.tight_layout()
plt.show()

# Find n_components for 95% variance
n_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components for 95% variance: {n_95} (out of 64)")
# Typically ~29 components needed for digits dataset

# ── Apply PCA for compression ─────────────────────────────────
pca = PCA(n_components=n_95)
X_pca = pca.fit_transform(X_scaled)
print(f"Original shape: {X_scaled.shape}")    # (1797, 64)
print(f"Reduced shape:  {X_pca.shape}")       # (1797, 29)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.3f}")

# ── 2D Visualisation ─────────────────────────────────────────
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=10, alpha=0.7)
plt.colorbar(scatter, label='Digit Class')
plt.title('Digits Dataset — PCA 2D Projection')
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)')
plt.tight_layout()
plt.show()

# ── PCA for feature preprocessing ────────────────────────────
# Use PCA-compressed features in downstream ML
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

lr = LogisticRegression(max_iter=2000)
scores_raw = cross_val_score(lr, X_scaled, y, cv=5, scoring='accuracy')
scores_pca = cross_val_score(lr, X_pca, y, cv=5, scoring='accuracy')
print(f"Accuracy (64 features): {scores_raw.mean():.4f}")
print(f"Accuracy (PCA {n_95}): {scores_pca.mean():.4f}")
# PCA often achieves similar accuracy with far fewer features

⚠️

PCA is for Linear Relationships Only

PCA finds linear combinations of features. If your data has non-linear structure, PCA may miss important patterns. Use Kernel PCA (sklearn.decomposition.KernelPCA) or t-SNE/UMAP for non-linear dimensionality reduction.

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain pca and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 66 — t-SNE

t-SNE — t-Distributed Stochastic Neighbour Embedding

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

What is t-SNE?

t-SNE is a non-linear dimensionality reduction technique designed exclusively for 2D or 3D visualisation. It preserves local neighbourhood structure — points that are close in high-dimensional space remain close in the 2D embedding.

How it Works (Intuition)

High-D similarities: For each pair of points, compute the probability $p_{ij}$ that point $j$ would be picked as a neighbour of $i$ using a Gaussian distribution centered at $i$.
Low-D similarities: Place all points randomly in 2D. Compute $q_{ij}$ — the probability of $j$ being picked as a neighbour of $i$ in 2D, using a t-distribution (heavier tails than Gaussian to avoid crowding).
Minimise KL-divergence: Move points in 2D to minimise $KL(P||Q) = \sum_{ij} p_{ij} \log\frac{p_{ij}}{q_{ij}}$. Similar points are pulled together; dissimilar points are pushed apart.

The Perplexity Parameter

Perplexity ≈ the effective number of neighbours each point considers. Rule of thumb: perplexity = 5–50 (default 30). Larger datasets can use larger perplexity. Try multiple values and pick the most meaningful visualisation.

Code Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target

# ── Run t-SNE ────────────────────────────────────────────────
# Step 1: Reduce to ~50D with PCA first (speeds up t-SNE significantly)
from sklearn.decomposition import PCA
X_pca50 = PCA(n_components=50).fit_transform(X)

tsne = TSNE(
    n_components=2,    # ALWAYS 2 or 3 — not for ML features!
    perplexity=30,     # Effective neighbourhood size (5–50)
    learning_rate=200, # Step size for gradient descent (auto in newer sklearn)
    n_iter=1000,       # Number of optimisation iterations (≥250)
    random_state=42,
    n_jobs=-1          # Use all CPU cores
)
X_tsne = tsne.fit_transform(X_pca50)

print(f"t-SNE KL divergence: {tsne.kl_divergence_:.4f}")  # Lower = better fit

# ── Visualise ────────────────────────────────────────────────
plt.figure(figsize=(9, 7))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=15, alpha=0.8)
plt.colorbar(scatter, label='Digit Class')
plt.title('Digits Dataset — t-SNE 2D (perplexity=30)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.tight_layout()
plt.show()

# ── Compare different perplexities ───────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, perp in zip(axes, [5, 30, 100]):
    tsne_p = TSNE(n_components=2, perplexity=perp, random_state=42)
    X_t = tsne_p.fit_transform(X_pca50)
    ax.scatter(X_t[:, 0], X_t[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
    ax.set_title(f'Perplexity = {perp}')
    ax.set_xticks([]); ax.set_yticks([])
plt.suptitle('t-SNE with Different Perplexity Values')
plt.tight_layout()
plt.show()

⚠️

Critical t-SNE Rules

NEVER use t-SNE components as ML features. The distances are not meaningful across different runs or different perplexity values.
Cluster sizes in t-SNE are meaningless. Dense clusters look the same as sparse ones.
Distance between clusters is not meaningful. Only neighbourhood structure is preserved.
Always run multiple times. t-SNE is non-deterministic — set random_state for reproducibility.

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain t-sne and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 67 — UMAP

UMAP — Uniform Manifold Approximation and Projection

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

UMAP vs t-SNE

UMAP (McInnes et al., 2018) has largely replaced t-SNE in many workflows due to its superior speed and better preservation of global structure.

Property	t-SNE	UMAP
Speed	Slow (hours on large data)	Fast (minutes on same data)
Global structure	Poor — only local structure	Better — preserves global relationships
Scalability	~50,000 points max	Millions of points
Output dims	2D or 3D only	Any dimension (can use for ML features!)
Stability	Variable across runs	More stable
Theory basis	KL divergence minimisation	Riemannian geometry + algebraic topology

Key Parameters

n_neighbors: Controls the balance between local and global structure. Low (2–5) = very local; High (50–200) = more global. Default: 15.
min_dist: Minimum distance between points in the embedding (0–1). Low = tighter clusters; High = more spread out. Default: 0.1.
n_components: Number of output dimensions. Use 2 for visualisation, higher for feature compression before ML.
metric: Distance metric — 'euclidean', 'cosine', 'manhattan', etc.

Code Example

# pip install umap-learn
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target

# ── Basic UMAP visualisation ──────────────────────────────────
reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,     # Local neighborhood size
    min_dist=0.1,       # Minimum distance in embedding
    metric='euclidean',
    random_state=42,
    n_jobs=-1
)
X_umap = reducer.fit_transform(X)

plt.figure(figsize=(9, 7))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=12, alpha=0.85)
plt.colorbar(scatter, label='Digit')
plt.title('Digits — UMAP 2D (n_neighbors=15, min_dist=0.1)')
plt.tight_layout()
plt.show()

# ── UMAP for feature engineering (unlike t-SNE this is valid!) ──
reducer_50d = umap.UMAP(n_components=10, n_neighbors=30, random_state=42)
X_umap_10d = reducer_50d.fit_transform(X)
print(f"UMAP 10D features shape: {X_umap_10d.shape}")  # (1797, 10)

# These 10D features CAN be used as input to a classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                         X_umap_10d, y, cv=5, scoring='accuracy')
print(f"Random Forest on UMAP 10D: {scores.mean():.4f} ± {scores.std():.4f}")

# ── UMAP with transform() — apply to new data ─────────────────
# Unlike t-SNE, UMAP supports transform() for new data!
new_data = np.random.randn(5, 64)  # 5 new samples
new_scaled = StandardScaler().fit(digits.data).transform(new_data)
new_umap = reducer.transform(new_scaled)  # Project into learned 2D space
print("New data projected:", new_umap.shape)  # (5, 2)

💡

UMAP for Feature Engineering

Unlike t-SNE, UMAP's output is a valid coordinate system — you can use UMAP-reduced features as input to downstream ML models. This is particularly useful for high-dimensional data like TF-IDF text vectors or image embeddings. Use UMAP → XGBoost pipelines for Kaggle competitions with tabular embeddings.

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain umap and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 68 — Anomaly Detection

Anomaly Detection

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

What is Anomaly Detection?

Anomaly detection (outlier detection) finds data points that deviate significantly from the majority pattern. Applications: fraud detection, network intrusion, manufacturing defects, medical diagnosis, system monitoring.

Isolation Forest

Isolation Forest works by randomly isolating observations. Anomalies are isolated earlier (fewer splits needed) than normal points, because they are few and different. The anomaly score is the average path length to isolate a point across many trees — shorter = more anomalous.

Code Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs

# ── Generate data with outliers ───────────────────────────────
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.5, random_state=42)
X_outliers = np.random.uniform(low=-6, high=6, size=(20, 2))
X = np.vstack([X_normal, X_outliers])

# ── Fit Isolation Forest ─────────────────────────────────────
iso_forest = IsolationForest(
    n_estimators=100,       # Number of trees
    max_samples='auto',     # Subsample size per tree (default: min(256, n))
    contamination=0.06,     # Expected proportion of outliers (0.06 = 6%)
    random_state=42,
    n_jobs=-1
)
labels = iso_forest.fit_predict(X)
# Returns: +1 for inliers, -1 for outliers

# ── Anomaly scores ────────────────────────────────────────────
scores = iso_forest.decision_function(X)
# More negative = more anomalous; 0 = boundary
print(f"Inliers:  {(labels == 1).sum()}")
print(f"Outliers: {(labels == -1).sum()}")

# ── Visualise ────────────────────────────────────────────────
plt.figure(figsize=(8, 6))
plt.scatter(X[labels==1, 0], X[labels==1, 1], c='steelblue', s=20, label='Normal', alpha=0.7)
plt.scatter(X[labels==-1, 0], X[labels==-1, 1], c='red', s=80, marker='x', label='Anomaly', linewidths=2)
plt.title('Isolation Forest — Anomaly Detection')
plt.legend()
plt.tight_layout()
plt.show()

Local Outlier Factor (LOF)

LOF measures the local density deviation of a point relative to its neighbours. A point with much lower density than its neighbours gets a high LOF score (anomalous). It is effective for detecting anomalies in datasets with varying densities.

Code Example

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(
    n_neighbors=20,        # Size of local neighbourhood
    contamination=0.06,    # Expected proportion of outliers
    novelty=False          # Set True for predict() on new data
)
lof_labels = lof.fit_predict(X)
# +1 = normal, -1 = outlier

lof_scores = -lof.negative_outlier_factor_  # Higher score = more anomalous
print(f"LOF outliers: {(lof_labels == -1).sum()}")

One-Class SVM

One-Class SVM learns a boundary around the "normal" training data. Points outside this boundary are classified as anomalies. Useful when you only have normal data during training (semi-supervised anomaly detection).

Code Example

from sklearn.svm import OneClassSVM

oc_svm = OneClassSVM(
    kernel='rbf',    # RBF kernel for non-linear boundary
    gamma='auto',    # Kernel coefficient
    nu=0.05          # Upper bound on fraction of outliers (≈ contamination)
)
oc_svm.fit(X_normal)           # Train ONLY on normal data
svm_labels = oc_svm.predict(X) # +1 = normal, -1 = outlier
print(f"One-Class SVM outliers: {(svm_labels == -1).sum()}")

Method	Best For	Speed	Note
Isolation Forest	High-dimensional data, large datasets	Fast	Usually the best default choice
LOF	Local anomalies in varying-density data	Medium	Doesn't support predict() by default
One-Class SVM	When only normal data available at training	Slow	Doesn't scale well to large datasets

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain anomaly detection and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 69 — Association Rules

Association Rules — Apriori Algorithm

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Core Concepts

Association rule mining finds interesting relationships between variables in large datasets. Classic application: market basket analysis (which products are frequently bought together).

An association rule has the form: IF {A} THEN {B} — written as $A \Rightarrow B$

Three Key Metrics

Metric	Formula	Interpretation
Support	$\text{supp}(A \cup B) = \frac{\|A \cup B\|}{N}$	Fraction of transactions containing both A and B. Measures frequency/popularity.
Confidence	$\text{conf}(A \Rightarrow B) = \frac{\text{supp}(A \cup B)}{\text{supp}(A)}$	Probability of B given A. "60% of customers who bought A also bought B."
Lift	$\text{lift}(A \Rightarrow B) = \frac{\text{conf}(A \Rightarrow B)}{\text{supp}(B)}$	How much more likely B is given A vs. random. Lift > 1 = positive correlation.

Lift Interpretation: Lift = 1 → A and B are independent. Lift > 1 → A promotes B. Lift < 1 → A discourages B. In practice, aim for lift > 2.0 for meaningful rules.
      

Apriori Algorithm

The Apriori algorithm efficiently mines frequent itemsets using the anti-monotone property: if an itemset is infrequent, all its supersets are also infrequent (allowing pruning of the search space).

Code Example

# pip install mlxtend
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# ── Sample transaction data ───────────────────────────────────
transactions = [
    ['milk', 'bread', 'butter'],
    ['beer', 'bread', 'diaper'],
    ['milk', 'diaper', 'beer', 'cola'],
    ['bread', 'milk', 'diaper', 'beer'],
    ['bread', 'milk', 'butter', 'cola'],
    ['milk', 'bread', 'butter', 'cola'],
    ['bread', 'butter'],
    ['milk', 'bread'],
    ['beer', 'cola', 'diaper'],
    ['milk', 'butter', 'cola'],
]

# ── Encode transactions to binary matrix ─────────────────────
te = TransactionEncoder()
te_array = te.fit_transform(transactions)
df = pd.DataFrame(te_array, columns=te.columns_)
print(df.head())
#    beer  bread  butter  cola  diaper  milk
# 0 False   True    True False   False  True
# ...

# ── Find frequent itemsets ────────────────────────────────────
freq_itemsets = apriori(
    df,
    min_support=0.3,       # Minimum support threshold (30%)
    use_colnames=True,     # Use column names instead of indices
    max_len=4              # Maximum itemset length
)
freq_itemsets = freq_itemsets.sort_values('support', ascending=False)
print(f"
Frequent itemsets ({len(freq_itemsets)} found):")
print(freq_itemsets.head(10))

# ── Generate association rules ────────────────────────────────
rules = association_rules(
    freq_itemsets,
    metric='lift',          # Filter by 'support', 'confidence', or 'lift'
    min_threshold=1.0       # Minimum lift value
)
rules = rules.sort_values('lift', ascending=False)
print(f"
Association rules ({len(rules)} found):")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10))

# ── Filter high-quality rules ─────────────────────────────────
strong_rules = rules[
    (rules['confidence'] >= 0.6) &   # 60%+ confidence
    (rules['lift'] >= 1.5)           # 1.5x more likely than random
]
print(f"
Strong rules: {len(strong_rules)}")
print(strong_rules[['antecedents', 'consequents', 'confidence', 'lift']])

📌

FP-Growth — Faster than Apriori

For large datasets, Apriori is slow because it generates candidate itemsets. The FP-Growth algorithm uses a compressed FP-tree structure — no candidate generation needed, typically 10–100× faster. Use from mlxtend.frequent_patterns import fpgrowth as a drop-in replacement for apriori().

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain association rules and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Day 70 — Unsupervised Projects

Customer Segmentation — K-Means Case Study

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

End-to-end customer segmentation using the classic RFM (Recency, Frequency, Monetary) framework with K-Means. This is one of the most common unsupervised learning applications in e-commerce and retail.

RFM Features

Recency (R): Days since the customer's last purchase (lower = better)
Frequency (F): Total number of purchases (higher = better)
Monetary (M): Total amount spent (higher = better)

Code Example

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')

# ── Simulate retail transaction data ─────────────────────────
np.random.seed(42)
n_customers = 500

data = pd.DataFrame({
    'CustomerID': range(1, n_customers + 1),
    'Recency':    np.random.exponential(scale=30, size=n_customers).astype(int) + 1,
    'Frequency':  np.random.poisson(lam=5, size=n_customers) + 1,
    'Monetary':   np.random.lognormal(mean=4.5, sigma=1.2, size=n_customers).round(2)
})

print("RFM Summary:")
print(data[['Recency', 'Frequency', 'Monetary']].describe())

# ── Feature Engineering & Scaling ────────────────────────────
rfm = data[['Recency', 'Frequency', 'Monetary']].copy()

# Log-transform Monetary (right-skewed distribution)
rfm['Monetary_log'] = np.log1p(rfm['Monetary'])
rfm_features = rfm[['Recency', 'Frequency', 'Monetary_log']]

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_features)

# ── Find Optimal K ────────────────────────────────────────────
inertias, sil_scores = [], []
K_range = range(2, 9)

for k in K_range:
    km = KMeans(n_clusters=k, n_init=15, random_state=42)
    labels = km.fit_predict(rfm_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(rfm_scaled, labels))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(K_range, inertias, 'bo-')
axes[0].set_title('Elbow Method'); axes[0].set_xlabel('k'); axes[0].set_ylabel('Inertia')
axes[1].plot(K_range, sil_scores, 'ro-')
axes[1].set_title('Silhouette Score'); axes[1].set_xlabel('k'); axes[1].set_ylabel('Score')
plt.tight_layout(); plt.show()

# ── Apply K-Means with k=4 ────────────────────────────────────
k_optimal = 4
kmeans = KMeans(n_clusters=k_optimal, n_init=15, random_state=42)
data['Segment'] = kmeans.fit_predict(rfm_scaled)

# ── Interpret Segments ────────────────────────────────────────
segment_analysis = data.groupby('Segment').agg({
    'Recency':   ['mean', 'median'],
    'Frequency': ['mean', 'median'],
    'Monetary':  ['mean', 'median'],
    'CustomerID': 'count'
}).round(1)
segment_analysis.columns = ['_'.join(c) for c in segment_analysis.columns]
segment_analysis.rename(columns={'CustomerID_count': 'Count'}, inplace=True)
print("
Segment Analysis:")
print(segment_analysis)

# ── Name Segments Based on RFM Profile ───────────────────────
segment_names = {
    0: 'Champions',      # Low recency, high freq, high monetary
    1: 'At Risk',        # High recency, medium freq/monetary
    2: 'Loyal Customers',# Medium recency, high freq
    3: 'New Customers'   # Low recency, low freq (recently acquired)
}
# NOTE: Mapping depends on actual centroid values — always inspect!
data['Segment_Name'] = data['Segment'].map(segment_names)

# ── 3D Visualisation ─────────────────────────────────────────
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
colors = ['#d4af37', '#e74c3c', '#3498db', '#2ecc71']

for seg, color in zip(range(k_optimal), colors):
    mask = data['Segment'] == seg
    ax.scatter(data.loc[mask, 'Recency'],
               data.loc[mask, 'Frequency'],
               data.loc[mask, 'Monetary'],
               c=color, s=20, alpha=0.6,
               label=segment_names.get(seg, f'Segment {seg}'))

ax.set_xlabel('Recency (days)'); ax.set_ylabel('Frequency'); ax.set_zlabel('Monetary ($)')
ax.set_title('Customer Segments — RFM Analysis')
ax.legend()
plt.tight_layout(); plt.show()

# ── Business Recommendations per Segment ─────────────────────
print("
=== Business Actions by Segment ===")
print("Champions:       Send loyalty rewards, ask for reviews, referral program")
print("Loyal Customers: Offer membership, upsell premium products")
print("At Risk:         Send win-back emails, special discount offers")
print("New Customers:   Welcome series, onboarding, first-purchase discount")

💡

Module 5 Summary — When to Use Which Algorithm

K-Means: Fast, scalable, interpretable — best for spherical clusters, large datasets
Hierarchical: When you need a dendrogram or don't know k; small-medium data
DBSCAN: Arbitrary cluster shapes, automatic outlier detection
PCA: Linear compression, preprocessing for ML, multicollinearity removal
t-SNE: 2D/3D visualisation only — no other use
UMAP: Fast visualisation + feature engineering; better than t-SNE
Isolation Forest: Best general-purpose anomaly detector
Apriori/FP-Growth: Market basket analysis, recommendation systems

Common mistakes

Applying the technique without understanding its assumptions.
Copying defaults from tutorials without validating on your data.
Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

Basic: Explain the concept in plain language with one real-world example.
Intermediate: Implement on a sklearn toy dataset and interpret outputs.
Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

You can explain customer segmentation and when it applies.
You know the main pitfalls and how to detect them in practice.
You can connect this topic to the next step in the ML workflow.

Next: Continue to the next day in this module.

Unsupervised Learning: K-Means Clustering Centroids

Supervised Learning → Evaluation & Tuning →