Module 5: Unsupervised Learning
100 Days of ML Module 5 — Master unsupervised learning: K-Means, DBSCAN, hierarchical clustering, PCA, t-SNE, UMAP, anomaly detection, and association rules.
Unsupervised learning finds hidden structure in unlabeled data. No labels, no feedback — the algorithm discovers groupings, compressed representations, anomalies, and patterns entirely on its own. This module covers the most widely-used unsupervised algorithms in industry: K-Means, hierarchical clustering, DBSCAN, PCA, t-SNE, UMAP, anomaly detection, and association rule mining.
K-Means Clustering
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
What is K-Means?
K-Means partitions $n$ data points into $k$ clusters, where each point belongs to the cluster with the nearest mean (centroid). It minimises the Within-Cluster Sum of Squares (WCSS), also called inertia:
$$WCSS = \sum_{i=1}^{k}\sum_{x \in C_i} \|x - \mu_i\|^2$$where $C_i$ is cluster $i$, $\mu_i$ is its centroid, and $\|\cdot\|$ is the Euclidean distance.
Algorithm Steps (Lloyd's Algorithm)
- Initialise: Randomly select $k$ points from the dataset as initial centroids $\mu_1, \mu_2, \ldots, \mu_k$.
- Assign: For every data point $x_i$, assign it to the closest centroid: $C_j = \arg\min_j \|x_i - \mu_j\|^2$
- Update: Recompute each centroid as the mean of all points assigned to it: $\mu_j = \frac{1}{|C_j|}\sum_{x \in C_j} x$
- Repeat: Iterate steps 2–3 until the centroids stop moving (convergence) or a max iteration count is reached.
Typical convergence: 10–300 iterations
K-Means++ Initialisation
Standard random initialisation can converge to poor local minima. K-Means++ (default in sklearn) chooses initial centroids probabilistically — farther points are more likely to be chosen:
- Choose the first centroid $\mu_1$ uniformly at random from the data.
- For each subsequent centroid, choose $x$ with probability proportional to $D(x)^2$, where $D(x)$ is the distance to the nearest already-chosen centroid.
- Repeat until $k$ centroids are chosen.
K-Means++ Effect
K-Means++ guarantees an $O(\log k)$ approximation to the optimal solution and typically converges 2–5× faster than random initialisation. Always use init='k-means++' (the sklearn default).
Python Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
# ── Generate sample data ──────────────────────────────────────
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.7, random_state=42)
# ── Scale features (IMPORTANT for K-Means — it's distance-based) ─
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# ── Fit K-Means ───────────────────────────────────────────────
kmeans = KMeans(
n_clusters=4, # Number of clusters k
init='k-means++', # Smart initialisation (default)
n_init=10, # Run 10 times, keep best result
max_iter=300, # Max iterations per run
random_state=42
)
kmeans.fit(X_scaled)
# ── Inspect results ───────────────────────────────────────────
print("Cluster labels:", kmeans.labels_[:10]) # e.g. [2, 0, 1, 3, ...]
print("Inertia (WCSS):", kmeans.inertia_) # 45.23
print("Centroids shape:", kmeans.cluster_centers_.shape) # (4, 2)
print("Iterations to converge:", kmeans.n_iter_) # e.g. 12
# ── Predict new points ────────────────────────────────────────
new_points = np.array([[1.5, -0.5], [-2.0, 3.0]])
new_scaled = scaler.transform(new_points)
predictions = kmeans.predict(new_scaled)
print("New point clusters:", predictions)
# ── Visualise ─────────────────────────────────────────────────
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='tab10', s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
marker='X', s=200, c='red', label='Centroids')
plt.title("K-Means Clusters (k=4)")
plt.legend()
plt.tight_layout()
plt.show()Key Limitations
| Limitation | Explanation | Solution |
|---|---|---|
| Must specify K | You must provide the number of clusters in advance | Use Elbow Method or Silhouette Score (Day 62) |
| Assumes spherical clusters | Fails on non-convex or elongated clusters | Use DBSCAN or Gaussian Mixture Models |
| Sensitive to outliers | Outliers pull centroids away from true cluster centers | Use K-Medoids or remove outliers first |
| Sensitive to scale | Features with larger ranges dominate distance | Always StandardScale before K-Means |
| Local minima | Random init may converge to suboptimal solution | Use K-Means++ and n_init≥10 |
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain k-means clustering and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 62 — Elbow Method
Choosing K — Elbow Method, Silhouette Score, Davies-Bouldin
Why this matters
Choosing K: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
The Elbow Method
Plot WCSS (inertia) against different values of $k$. As $k$ increases, inertia always decreases. The elbow point — where the rate of decrease sharply slows — suggests the optimal $k$.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)
# ── Compute inertia for k = 1..10 ────────────────────────────
inertias = []
K_range = range(1, 11)
for k in K_range:
km = KMeans(n_clusters=k, n_init=10, random_state=42)
km.fit(X)
inertias.append(km.inertia_)
# ── Plot Elbow Curve ──────────────────────────────────────────
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method — Find Optimal K')
plt.axvline(x=4, color='red', linestyle='--', label='Elbow at k=4')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()Silhouette Score
The silhouette score for a single point $i$ is:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$where $a(i)$ = mean distance to points in the same cluster, $b(i)$ = mean distance to points in the nearest other cluster.
- $s(i) = +1$: Point is perfectly inside its own cluster, far from others
- $s(i) = 0$: Point is on the boundary between two clusters
- $s(i) = -1$: Point is probably in the wrong cluster
from sklearn.metrics import silhouette_score, davies_bouldin_score
import numpy as np
silhouette_scores = []
db_scores = []
K_range = range(2, 11) # Silhouette undefined for k=1
for k in K_range:
km = KMeans(n_clusters=k, n_init=10, random_state=42)
labels = km.fit_predict(X)
sil = silhouette_score(X, labels)
db = davies_bouldin_score(X, labels)
silhouette_scores.append(sil)
db_scores.append(db)
print(f"k={k:2d} Silhouette={sil:.4f} Davies-Bouldin={db:.4f}")
# Best k: highest silhouette score, lowest Davies-Bouldin
best_k_sil = K_range[np.argmax(silhouette_scores)]
best_k_db = K_range[np.argmin(db_scores)]
print(f"
Best k by Silhouette: {best_k_sil}")
print(f"Best k by Davies-Bouldin: {best_k_db}")Davies-Bouldin Index
Measures the average ratio of within-cluster scatter to between-cluster separation. Lower is better (unlike silhouette which is higher=better). No negative values — range is $[0, \infty)$.
Choosing the Right Method
Use all three methods together: Elbow for a quick visual, Silhouette for mathematical optimum, Davies-Bouldin as a cross-check. When they agree, you have high confidence. When they disagree, domain knowledge should guide the final choice.
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain choosing k and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Hierarchical Clustering
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Two Approaches
| Approach | Direction | Process | Result |
|---|---|---|---|
| Agglomerative (bottom-up) | Merge upward | Start with each point as its own cluster. Iteratively merge the two closest clusters until one cluster remains. | Most commonly used |
| Divisive (top-down) | Split downward | Start with all points in one cluster. Recursively split the least cohesive cluster. | Computationally expensive; rarely used |
Linkage Criteria
When merging clusters, how do we measure the distance between two clusters?
| Linkage | Distance Measure | Pros | Cons |
|---|---|---|---|
| Single | Minimum distance between any two points in the two clusters | Can handle non-elliptic shapes | Chaining effect — elongated "sausage" clusters |
| Complete | Maximum distance between any two points | Produces compact, spherical clusters | Sensitive to outliers |
| Average | Mean distance between all pairs of points | Good compromise | Biased toward globular clusters |
| Ward | Minimises increase in total WCSS when merging | Most popular; produces compact equal-size clusters | Only valid with Euclidean distance |
Dendrograms
A dendrogram is a tree diagram showing the hierarchical merge sequence. The y-axis shows the distance (dissimilarity) at which clusters merge. You choose $k$ by drawing a horizontal line — the number of vertical lines it crosses is the number of clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# ── Data ────────────────────────────────────────────────────
X, _ = make_blobs(n_samples=80, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)
# ── Dendrogram with scipy ────────────────────────────────────
linked = linkage(X, method='ward') # Ward linkage matrix
plt.figure(figsize=(14, 5))
dendrogram(
linked,
orientation='top',
distance_sort='descending',
show_leaf_counts=True,
leaf_font_size=6,
color_threshold=4.0 # Draw horizontal line at this height
)
plt.title("Hierarchical Clustering Dendrogram (Ward Linkage)")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.axhline(y=4.0, color='red', linestyle='--', label='Cut line → 4 clusters')
plt.legend()
plt.tight_layout()
plt.show()
# ── Agglomerative Clustering with sklearn ───────────────────
agg = AgglomerativeClustering(
n_clusters=4,
linkage='ward', # 'single', 'complete', 'average', 'ward'
metric='euclidean' # Only 'euclidean' valid for Ward
)
labels = agg.fit_predict(X)
print("Cluster labels:", np.unique(labels)) # [0, 1, 2, 3]
# With affinity='precomputed' you can use custom distance matrices
# Useful for text clustering with cosine similarityWhen to Choose Hierarchical over K-Means
Use hierarchical clustering when: (1) you don't know the number of clusters in advance and want to explore the dendrogram, (2) you need a tree-structured output, (3) you have small-to-medium data (<10,000 points — it's O(n² log n) memory/time). For large data, K-Means or DBSCAN scale better.
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain hierarchical clustering and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 64 — DBSCAN
DBSCAN — Density-Based Spatial Clustering
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Core Idea
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters as dense regions of points separated by sparse regions. Unlike K-Means, it does not require specifying the number of clusters and can detect arbitrarily-shaped clusters.
Two Parameters
- ε (epsilon): The radius of the neighbourhood around each point. A point's neighbourhood = all points within distance ε.
- min_samples: The minimum number of points required to form a dense region (including the point itself). Typical values: 3–10.
Three Types of Points
| Point Type | Definition | Label |
|---|---|---|
| Core Point | Has at least min_samples points (including itself) within radius ε | Belongs to a cluster |
| Border Point | Within ε of a core point but has fewer than min_samples neighbours | Belongs to the cluster of its core point |
| Noise Point | Not a core point AND not within ε of any core point | Label = -1 (outlier) |
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons, make_blobs
from sklearn.preprocessing import StandardScaler
# ── K-Means fails on crescent shapes; DBSCAN handles them ────
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)
dbscan = DBSCAN(
eps=0.3, # Neighbourhood radius
min_samples=5, # Minimum points to be a core point
metric='euclidean' # Distance metric
)
labels = dbscan.fit_predict(X)
# ── Analyse results ──────────────────────────────────────────
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = np.sum(labels == -1)
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[dbscan.core_sample_indices_] = True
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
# ── Visualise ────────────────────────────────────────────────
plt.figure(figsize=(8, 4))
unique_labels = set(labels)
colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))
for label, col in zip(unique_labels, colors):
if label == -1:
col = 'black' # Noise = black
mask = labels == label
plt.scatter(X[mask & core_mask, 0], X[mask & core_mask, 1],
c=[col], s=50, label=f'Cluster {label}' if label != -1 else 'Noise')
plt.scatter(X[mask & ~core_mask, 0], X[mask & ~core_mask, 1],
c=[col], s=15, alpha=0.5)
plt.title(f"DBSCAN — {n_clusters} clusters, {n_noise} noise points")
plt.legend()
plt.tight_layout()
plt.show()
# ── Choosing epsilon with the k-distance plot ────────────────
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=5).fit(X)
distances, _ = nbrs.kneighbors(X)
k_distances = np.sort(distances[:, -1])[::-1] # Sort descending
plt.figure(figsize=(8, 3))
plt.plot(k_distances)
plt.xlabel('Points (sorted)')
plt.ylabel(f'5th nearest neighbour distance')
plt.title('K-Distance Plot — elbow suggests optimal epsilon')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()DBSCAN vs K-Means Comparison
| Feature | K-Means | DBSCAN |
|---|---|---|
| Cluster shape | Spherical / convex only | Arbitrary (crescents, rings, blobs) |
| Number of clusters | Must be specified (k) | Auto-determined from data density |
| Outliers | Every point assigned to a cluster | Outliers explicitly labelled as -1 |
| Parameters | k, n_init | eps, min_samples |
| Scalability | O(nkI) — fast for large data | O(n log n) with spatial index; O(n²) naive |
| Varying density | Handles uniform density | Struggles with highly varying density (use HDBSCAN) |
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain dbscan and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 65 — PCA
PCA — Principal Component Analysis
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
What is PCA?
PCA is a linear dimensionality reduction technique. It finds new axes (principal components) that capture the maximum variance in the data. Each principal component is a linear combination of original features and is orthogonal to all other components.
Mathematical Foundation
Given a data matrix $X \in \mathbb{R}^{n \times p}$ (n samples, p features):
- Centre the data: Subtract the mean: $\tilde{X} = X - \bar{X}$
- Compute covariance matrix: $\Sigma = \frac{1}{n-1}\tilde{X}^T\tilde{X}$
- Eigen-decomposition: $\Sigma v = \lambda v$ — find eigenvalues $\lambda$ and eigenvectors $v$
- Sort by variance: Order eigenvectors by their eigenvalues (largest first) — these are the principal components
- Project: $Z = \tilde{X} W$ where $W$ is the matrix of top-$d$ eigenvectors
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
# ── Load high-dimensional dataset (64 features) ──────────────
digits = load_digits()
X = digits.data # (1797, 64) — 64 pixel features
y = digits.target
# ── Always scale before PCA ──────────────────────────────────
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# ── Full PCA to examine variance ──────────────────────────────
pca_full = PCA() # All components
pca_full.fit(X_scaled)
# ── Scree Plot ────────────────────────────────────────────────
explained_var = pca_full.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.bar(range(1, 21), explained_var[:20], color='#d4af37', alpha=0.8)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot — Individual Variance')
plt.subplot(1, 2, 2)
plt.plot(range(1, len(cumulative_var)+1), cumulative_var, 'b-o', markersize=3)
plt.axhline(y=0.95, color='red', linestyle='--', label='95% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance')
plt.legend()
plt.tight_layout()
plt.show()
# Find n_components for 95% variance
n_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components for 95% variance: {n_95} (out of 64)")
# Typically ~29 components needed for digits dataset
# ── Apply PCA for compression ─────────────────────────────────
pca = PCA(n_components=n_95)
X_pca = pca.fit_transform(X_scaled)
print(f"Original shape: {X_scaled.shape}") # (1797, 64)
print(f"Reduced shape: {X_pca.shape}") # (1797, 29)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.3f}")
# ── 2D Visualisation ─────────────────────────────────────────
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=10, alpha=0.7)
plt.colorbar(scatter, label='Digit Class')
plt.title('Digits Dataset — PCA 2D Projection')
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)')
plt.tight_layout()
plt.show()
# ── PCA for feature preprocessing ────────────────────────────
# Use PCA-compressed features in downstream ML
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(max_iter=2000)
scores_raw = cross_val_score(lr, X_scaled, y, cv=5, scoring='accuracy')
scores_pca = cross_val_score(lr, X_pca, y, cv=5, scoring='accuracy')
print(f"Accuracy (64 features): {scores_raw.mean():.4f}")
print(f"Accuracy (PCA {n_95}): {scores_pca.mean():.4f}")
# PCA often achieves similar accuracy with far fewer featuresPCA is for Linear Relationships Only
PCA finds linear combinations of features. If your data has non-linear structure, PCA may miss important patterns. Use Kernel PCA (sklearn.decomposition.KernelPCA) or t-SNE/UMAP for non-linear dimensionality reduction.
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain pca and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 66 — t-SNE
t-SNE — t-Distributed Stochastic Neighbour Embedding
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
What is t-SNE?
t-SNE is a non-linear dimensionality reduction technique designed exclusively for 2D or 3D visualisation. It preserves local neighbourhood structure — points that are close in high-dimensional space remain close in the 2D embedding.
How it Works (Intuition)
- High-D similarities: For each pair of points, compute the probability $p_{ij}$ that point $j$ would be picked as a neighbour of $i$ using a Gaussian distribution centered at $i$.
- Low-D similarities: Place all points randomly in 2D. Compute $q_{ij}$ — the probability of $j$ being picked as a neighbour of $i$ in 2D, using a t-distribution (heavier tails than Gaussian to avoid crowding).
- Minimise KL-divergence: Move points in 2D to minimise $KL(P||Q) = \sum_{ij} p_{ij} \log\frac{p_{ij}}{q_{ij}}$. Similar points are pulled together; dissimilar points are pushed apart.
The Perplexity Parameter
Perplexity ≈ the effective number of neighbours each point considers. Rule of thumb: perplexity = 5–50 (default 30). Larger datasets can use larger perplexity. Try multiple values and pick the most meaningful visualisation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
# ── Run t-SNE ────────────────────────────────────────────────
# Step 1: Reduce to ~50D with PCA first (speeds up t-SNE significantly)
from sklearn.decomposition import PCA
X_pca50 = PCA(n_components=50).fit_transform(X)
tsne = TSNE(
n_components=2, # ALWAYS 2 or 3 — not for ML features!
perplexity=30, # Effective neighbourhood size (5–50)
learning_rate=200, # Step size for gradient descent (auto in newer sklearn)
n_iter=1000, # Number of optimisation iterations (≥250)
random_state=42,
n_jobs=-1 # Use all CPU cores
)
X_tsne = tsne.fit_transform(X_pca50)
print(f"t-SNE KL divergence: {tsne.kl_divergence_:.4f}") # Lower = better fit
# ── Visualise ────────────────────────────────────────────────
plt.figure(figsize=(9, 7))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=15, alpha=0.8)
plt.colorbar(scatter, label='Digit Class')
plt.title('Digits Dataset — t-SNE 2D (perplexity=30)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.tight_layout()
plt.show()
# ── Compare different perplexities ───────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, perp in zip(axes, [5, 30, 100]):
tsne_p = TSNE(n_components=2, perplexity=perp, random_state=42)
X_t = tsne_p.fit_transform(X_pca50)
ax.scatter(X_t[:, 0], X_t[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
ax.set_title(f'Perplexity = {perp}')
ax.set_xticks([]); ax.set_yticks([])
plt.suptitle('t-SNE with Different Perplexity Values')
plt.tight_layout()
plt.show()Critical t-SNE Rules
- NEVER use t-SNE components as ML features. The distances are not meaningful across different runs or different perplexity values.
- Cluster sizes in t-SNE are meaningless. Dense clusters look the same as sparse ones.
- Distance between clusters is not meaningful. Only neighbourhood structure is preserved.
- Always run multiple times. t-SNE is non-deterministic — set random_state for reproducibility.
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain t-sne and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Day 67 — UMAP
UMAP — Uniform Manifold Approximation and Projection
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
UMAP vs t-SNE
UMAP (McInnes et al., 2018) has largely replaced t-SNE in many workflows due to its superior speed and better preservation of global structure.
| Property | t-SNE | UMAP |
|---|---|---|
| Speed | Slow (hours on large data) | Fast (minutes on same data) |
| Global structure | Poor — only local structure | Better — preserves global relationships |
| Scalability | ~50,000 points max | Millions of points |
| Output dims | 2D or 3D only | Any dimension (can use for ML features!) |
| Stability | Variable across runs | More stable |
| Theory basis | KL divergence minimisation | Riemannian geometry + algebraic topology |
Key Parameters
- n_neighbors: Controls the balance between local and global structure. Low (2–5) = very local; High (50–200) = more global. Default: 15.
- min_dist: Minimum distance between points in the embedding (0–1). Low = tighter clusters; High = more spread out. Default: 0.1.
- n_components: Number of output dimensions. Use 2 for visualisation, higher for feature compression before ML.
- metric: Distance metric — 'euclidean', 'cosine', 'manhattan', etc.
# pip install umap-learn
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
# ── Basic UMAP visualisation ──────────────────────────────────
reducer = umap.UMAP(
n_components=2,
n_neighbors=15, # Local neighborhood size
min_dist=0.1, # Minimum distance in embedding
metric='euclidean',
random_state=42,
n_jobs=-1
)
X_umap = reducer.fit_transform(X)
plt.figure(figsize=(9, 7))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=12, alpha=0.85)
plt.colorbar(scatter, label='Digit')
plt.title('Digits — UMAP 2D (n_neighbors=15, min_dist=0.1)')
plt.tight_layout()
plt.show()
# ── UMAP for feature engineering (unlike t-SNE this is valid!) ──
reducer_50d = umap.UMAP(n_components=10, n_neighbors=30, random_state=42)
X_umap_10d = reducer_50d.fit_transform(X)
print(f"UMAP 10D features shape: {X_umap_10d.shape}") # (1797, 10)
# These 10D features CAN be used as input to a classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
X_umap_10d, y, cv=5, scoring='accuracy')
print(f"Random Forest on UMAP 10D: {scores.mean():.4f} ± {scores.std():.4f}")
# ── UMAP with transform() — apply to new data ─────────────────
# Unlike t-SNE, UMAP supports transform() for new data!
new_data = np.random.randn(5, 64) # 5 new samples
new_scaled = StandardScaler().fit(digits.data).transform(new_data)
new_umap = reducer.transform(new_scaled) # Project into learned 2D space
print("New data projected:", new_umap.shape) # (5, 2)UMAP for Feature Engineering
Unlike t-SNE, UMAP's output is a valid coordinate system — you can use UMAP-reduced features as input to downstream ML models. This is particularly useful for high-dimensional data like TF-IDF text vectors or image embeddings. Use UMAP → XGBoost pipelines for Kaggle competitions with tabular embeddings.
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain umap and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Anomaly Detection
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
What is Anomaly Detection?
Anomaly detection (outlier detection) finds data points that deviate significantly from the majority pattern. Applications: fraud detection, network intrusion, manufacturing defects, medical diagnosis, system monitoring.
Isolation Forest
Isolation Forest works by randomly isolating observations. Anomalies are isolated earlier (fewer splits needed) than normal points, because they are few and different. The anomaly score is the average path length to isolate a point across many trees — shorter = more anomalous.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
# ── Generate data with outliers ───────────────────────────────
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.5, random_state=42)
X_outliers = np.random.uniform(low=-6, high=6, size=(20, 2))
X = np.vstack([X_normal, X_outliers])
# ── Fit Isolation Forest ─────────────────────────────────────
iso_forest = IsolationForest(
n_estimators=100, # Number of trees
max_samples='auto', # Subsample size per tree (default: min(256, n))
contamination=0.06, # Expected proportion of outliers (0.06 = 6%)
random_state=42,
n_jobs=-1
)
labels = iso_forest.fit_predict(X)
# Returns: +1 for inliers, -1 for outliers
# ── Anomaly scores ────────────────────────────────────────────
scores = iso_forest.decision_function(X)
# More negative = more anomalous; 0 = boundary
print(f"Inliers: {(labels == 1).sum()}")
print(f"Outliers: {(labels == -1).sum()}")
# ── Visualise ────────────────────────────────────────────────
plt.figure(figsize=(8, 6))
plt.scatter(X[labels==1, 0], X[labels==1, 1], c='steelblue', s=20, label='Normal', alpha=0.7)
plt.scatter(X[labels==-1, 0], X[labels==-1, 1], c='red', s=80, marker='x', label='Anomaly', linewidths=2)
plt.title('Isolation Forest — Anomaly Detection')
plt.legend()
plt.tight_layout()
plt.show()Local Outlier Factor (LOF)
LOF measures the local density deviation of a point relative to its neighbours. A point with much lower density than its neighbours gets a high LOF score (anomalous). It is effective for detecting anomalies in datasets with varying densities.
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(
n_neighbors=20, # Size of local neighbourhood
contamination=0.06, # Expected proportion of outliers
novelty=False # Set True for predict() on new data
)
lof_labels = lof.fit_predict(X)
# +1 = normal, -1 = outlier
lof_scores = -lof.negative_outlier_factor_ # Higher score = more anomalous
print(f"LOF outliers: {(lof_labels == -1).sum()}")One-Class SVM
One-Class SVM learns a boundary around the "normal" training data. Points outside this boundary are classified as anomalies. Useful when you only have normal data during training (semi-supervised anomaly detection).
from sklearn.svm import OneClassSVM
oc_svm = OneClassSVM(
kernel='rbf', # RBF kernel for non-linear boundary
gamma='auto', # Kernel coefficient
nu=0.05 # Upper bound on fraction of outliers (≈ contamination)
)
oc_svm.fit(X_normal) # Train ONLY on normal data
svm_labels = oc_svm.predict(X) # +1 = normal, -1 = outlier
print(f"One-Class SVM outliers: {(svm_labels == -1).sum()}")| Method | Best For | Speed | Note |
|---|---|---|---|
| Isolation Forest | High-dimensional data, large datasets | Fast | Usually the best default choice |
| LOF | Local anomalies in varying-density data | Medium | Doesn't support predict() by default |
| One-Class SVM | When only normal data available at training | Slow | Doesn't scale well to large datasets |
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain anomaly detection and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Association Rules — Apriori Algorithm
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
Core Concepts
Association rule mining finds interesting relationships between variables in large datasets. Classic application: market basket analysis (which products are frequently bought together).
An association rule has the form: IF {A} THEN {B} — written as $A \Rightarrow B$
Three Key Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| Support | $\text{supp}(A \cup B) = \frac{|A \cup B|}{N}$ | Fraction of transactions containing both A and B. Measures frequency/popularity. |
| Confidence | $\text{conf}(A \Rightarrow B) = \frac{\text{supp}(A \cup B)}{\text{supp}(A)}$ | Probability of B given A. "60% of customers who bought A also bought B." |
| Lift | $\text{lift}(A \Rightarrow B) = \frac{\text{conf}(A \Rightarrow B)}{\text{supp}(B)}$ | How much more likely B is given A vs. random. Lift > 1 = positive correlation. |
Apriori Algorithm
The Apriori algorithm efficiently mines frequent itemsets using the anti-monotone property: if an itemset is infrequent, all its supersets are also infrequent (allowing pruning of the search space).
# pip install mlxtend
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
# ── Sample transaction data ───────────────────────────────────
transactions = [
['milk', 'bread', 'butter'],
['beer', 'bread', 'diaper'],
['milk', 'diaper', 'beer', 'cola'],
['bread', 'milk', 'diaper', 'beer'],
['bread', 'milk', 'butter', 'cola'],
['milk', 'bread', 'butter', 'cola'],
['bread', 'butter'],
['milk', 'bread'],
['beer', 'cola', 'diaper'],
['milk', 'butter', 'cola'],
]
# ── Encode transactions to binary matrix ─────────────────────
te = TransactionEncoder()
te_array = te.fit_transform(transactions)
df = pd.DataFrame(te_array, columns=te.columns_)
print(df.head())
# beer bread butter cola diaper milk
# 0 False True True False False True
# ...
# ── Find frequent itemsets ────────────────────────────────────
freq_itemsets = apriori(
df,
min_support=0.3, # Minimum support threshold (30%)
use_colnames=True, # Use column names instead of indices
max_len=4 # Maximum itemset length
)
freq_itemsets = freq_itemsets.sort_values('support', ascending=False)
print(f"
Frequent itemsets ({len(freq_itemsets)} found):")
print(freq_itemsets.head(10))
# ── Generate association rules ────────────────────────────────
rules = association_rules(
freq_itemsets,
metric='lift', # Filter by 'support', 'confidence', or 'lift'
min_threshold=1.0 # Minimum lift value
)
rules = rules.sort_values('lift', ascending=False)
print(f"
Association rules ({len(rules)} found):")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10))
# ── Filter high-quality rules ─────────────────────────────────
strong_rules = rules[
(rules['confidence'] >= 0.6) & # 60%+ confidence
(rules['lift'] >= 1.5) # 1.5x more likely than random
]
print(f"
Strong rules: {len(strong_rules)}")
print(strong_rules[['antecedents', 'consequents', 'confidence', 'lift']])FP-Growth — Faster than Apriori
For large datasets, Apriori is slow because it generates candidate itemsets. The FP-Growth algorithm uses a compressed FP-tree structure — no candidate generation needed, typically 10–100× faster. Use from mlxtend.frequent_patterns import fpgrowth as a drop-in replacement for apriori().
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain association rules and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Customer Segmentation — K-Means Case Study
Why this matters
This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.
End-to-end customer segmentation using the classic RFM (Recency, Frequency, Monetary) framework with K-Means. This is one of the most common unsupervised learning applications in e-commerce and retail.
RFM Features
- Recency (R): Days since the customer's last purchase (lower = better)
- Frequency (F): Total number of purchases (higher = better)
- Monetary (M): Total amount spent (higher = better)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')
# ── Simulate retail transaction data ─────────────────────────
np.random.seed(42)
n_customers = 500
data = pd.DataFrame({
'CustomerID': range(1, n_customers + 1),
'Recency': np.random.exponential(scale=30, size=n_customers).astype(int) + 1,
'Frequency': np.random.poisson(lam=5, size=n_customers) + 1,
'Monetary': np.random.lognormal(mean=4.5, sigma=1.2, size=n_customers).round(2)
})
print("RFM Summary:")
print(data[['Recency', 'Frequency', 'Monetary']].describe())
# ── Feature Engineering & Scaling ────────────────────────────
rfm = data[['Recency', 'Frequency', 'Monetary']].copy()
# Log-transform Monetary (right-skewed distribution)
rfm['Monetary_log'] = np.log1p(rfm['Monetary'])
rfm_features = rfm[['Recency', 'Frequency', 'Monetary_log']]
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_features)
# ── Find Optimal K ────────────────────────────────────────────
inertias, sil_scores = [], []
K_range = range(2, 9)
for k in K_range:
km = KMeans(n_clusters=k, n_init=15, random_state=42)
labels = km.fit_predict(rfm_scaled)
inertias.append(km.inertia_)
sil_scores.append(silhouette_score(rfm_scaled, labels))
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(K_range, inertias, 'bo-')
axes[0].set_title('Elbow Method'); axes[0].set_xlabel('k'); axes[0].set_ylabel('Inertia')
axes[1].plot(K_range, sil_scores, 'ro-')
axes[1].set_title('Silhouette Score'); axes[1].set_xlabel('k'); axes[1].set_ylabel('Score')
plt.tight_layout(); plt.show()
# ── Apply K-Means with k=4 ────────────────────────────────────
k_optimal = 4
kmeans = KMeans(n_clusters=k_optimal, n_init=15, random_state=42)
data['Segment'] = kmeans.fit_predict(rfm_scaled)
# ── Interpret Segments ────────────────────────────────────────
segment_analysis = data.groupby('Segment').agg({
'Recency': ['mean', 'median'],
'Frequency': ['mean', 'median'],
'Monetary': ['mean', 'median'],
'CustomerID': 'count'
}).round(1)
segment_analysis.columns = ['_'.join(c) for c in segment_analysis.columns]
segment_analysis.rename(columns={'CustomerID_count': 'Count'}, inplace=True)
print("
Segment Analysis:")
print(segment_analysis)
# ── Name Segments Based on RFM Profile ───────────────────────
segment_names = {
0: 'Champions', # Low recency, high freq, high monetary
1: 'At Risk', # High recency, medium freq/monetary
2: 'Loyal Customers',# Medium recency, high freq
3: 'New Customers' # Low recency, low freq (recently acquired)
}
# NOTE: Mapping depends on actual centroid values — always inspect!
data['Segment_Name'] = data['Segment'].map(segment_names)
# ── 3D Visualisation ─────────────────────────────────────────
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
colors = ['#d4af37', '#e74c3c', '#3498db', '#2ecc71']
for seg, color in zip(range(k_optimal), colors):
mask = data['Segment'] == seg
ax.scatter(data.loc[mask, 'Recency'],
data.loc[mask, 'Frequency'],
data.loc[mask, 'Monetary'],
c=color, s=20, alpha=0.6,
label=segment_names.get(seg, f'Segment {seg}'))
ax.set_xlabel('Recency (days)'); ax.set_ylabel('Frequency'); ax.set_zlabel('Monetary ($)')
ax.set_title('Customer Segments — RFM Analysis')
ax.legend()
plt.tight_layout(); plt.show()
# ── Business Recommendations per Segment ─────────────────────
print("
=== Business Actions by Segment ===")
print("Champions: Send loyalty rewards, ask for reviews, referral program")
print("Loyal Customers: Offer membership, upsell premium products")
print("At Risk: Send win-back emails, special discount offers")
print("New Customers: Welcome series, onboarding, first-purchase discount")Module 5 Summary — When to Use Which Algorithm
- K-Means: Fast, scalable, interpretable — best for spherical clusters, large datasets
- Hierarchical: When you need a dendrogram or don't know k; small-medium data
- DBSCAN: Arbitrary cluster shapes, automatic outlier detection
- PCA: Linear compression, preprocessing for ML, multicollinearity removal
- t-SNE: 2D/3D visualisation only — no other use
- UMAP: Fast visualisation + feature engineering; better than t-SNE
- Isolation Forest: Best general-purpose anomaly detector
- Apriori/FP-Growth: Market basket analysis, recommendation systems
Common mistakes
- Applying the technique without understanding its assumptions.
- Copying defaults from tutorials without validating on your data.
- Skipping validation — always measure impact with a proper holdout or CV.
Interview checkpoints
- Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
- Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.
Practice
- Basic: Explain the concept in plain language with one real-world example.
- Intermediate: Implement on a sklearn toy dataset and interpret outputs.
- Advanced: Compare two approaches on the same split and document tradeoffs.
Recap
- You can explain customer segmentation and when it applies.
- You know the main pitfalls and how to detect them in practice.
- You can connect this topic to the next step in the ML workflow.
Next: Continue to the next day in this module.
