Search topics…
Tutorials
Explore
June 6 Offline Event →
100 Days of ML · Module 5 (70)

Module 5: Unsupervised Learning

100 Days of ML Module 5 — Master unsupervised learning: K-Means, DBSCAN, hierarchical clustering, PCA, t-SNE, UMAP, anomaly detection, and association rules.

⏱ 50 Min Read 70 Updated: May 2026

Unsupervised learning finds hidden structure in unlabeled data. No labels, no feedback — the algorithm discovers groupings, compressed representations, anomalies, and patterns entirely on its own. This module covers the most widely-used unsupervised algorithms in industry: K-Means, hierarchical clustering, DBSCAN, PCA, t-SNE, UMAP, anomaly detection, and association rule mining.

K-Means Clustering

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

What is K-Means?

K-Means partitions $n$ data points into $k$ clusters, where each point belongs to the cluster with the nearest mean (centroid). It minimises the Within-Cluster Sum of Squares (WCSS), also called inertia:

$$WCSS = \sum_{i=1}^{k}\sum_{x \in C_i} \|x - \mu_i\|^2$$

where $C_i$ is cluster $i$, $\mu_i$ is its centroid, and $\|\cdot\|$ is the Euclidean distance.

Algorithm Steps (Lloyd's Algorithm)

  1. Initialise: Randomly select $k$ points from the dataset as initial centroids $\mu_1, \mu_2, \ldots, \mu_k$.
  2. Assign: For every data point $x_i$, assign it to the closest centroid: $C_j = \arg\min_j \|x_i - \mu_j\|^2$
  3. Update: Recompute each centroid as the mean of all points assigned to it: $\mu_j = \frac{1}{|C_j|}\sum_{x \in C_j} x$
  4. Repeat: Iterate steps 2–3 until the centroids stop moving (convergence) or a max iteration count is reached.
Init K centroids
Assign points
Update centroids
Converged?
→ No → Assign

Typical convergence: 10–300 iterations

K-Means++ Initialisation

Standard random initialisation can converge to poor local minima. K-Means++ (default in sklearn) chooses initial centroids probabilistically — farther points are more likely to be chosen:

  1. Choose the first centroid $\mu_1$ uniformly at random from the data.
  2. For each subsequent centroid, choose $x$ with probability proportional to $D(x)^2$, where $D(x)$ is the distance to the nearest already-chosen centroid.
  3. Repeat until $k$ centroids are chosen.
💡
K-Means++ Effect

K-Means++ guarantees an $O(\log k)$ approximation to the optimal solution and typically converges 2–5× faster than random initialisation. Always use init='k-means++' (the sklearn default).

Python Implementation

sklearn K-Means — Full Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# ── Generate sample data ──────────────────────────────────────
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.7, random_state=42)

# ── Scale features (IMPORTANT for K-Means — it's distance-based) ─
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ── Fit K-Means ───────────────────────────────────────────────
kmeans = KMeans(
    n_clusters=4,       # Number of clusters k
    init='k-means++',   # Smart initialisation (default)
    n_init=10,          # Run 10 times, keep best result
    max_iter=300,       # Max iterations per run
    random_state=42
)
kmeans.fit(X_scaled)

# ── Inspect results ───────────────────────────────────────────
print("Cluster labels:", kmeans.labels_[:10])   # e.g. [2, 0, 1, 3, ...]
print("Inertia (WCSS):", kmeans.inertia_)        # 45.23
print("Centroids shape:", kmeans.cluster_centers_.shape)  # (4, 2)
print("Iterations to converge:", kmeans.n_iter_)  # e.g. 12

# ── Predict new points ────────────────────────────────────────
new_points = np.array([[1.5, -0.5], [-2.0, 3.0]])
new_scaled = scaler.transform(new_points)
predictions = kmeans.predict(new_scaled)
print("New point clusters:", predictions)

# ── Visualise ─────────────────────────────────────────────────
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='tab10', s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='X', s=200, c='red', label='Centroids')
plt.title("K-Means Clusters (k=4)")
plt.legend()
plt.tight_layout()
plt.show()

Key Limitations

LimitationExplanationSolution
Must specify KYou must provide the number of clusters in advanceUse Elbow Method or Silhouette Score (Day 62)
Assumes spherical clustersFails on non-convex or elongated clustersUse DBSCAN or Gaussian Mixture Models
Sensitive to outliersOutliers pull centroids away from true cluster centersUse K-Medoids or remove outliers first
Sensitive to scaleFeatures with larger ranges dominate distanceAlways StandardScale before K-Means
Local minimaRandom init may converge to suboptimal solutionUse K-Means++ and n_init≥10

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain k-means clustering and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 62 — Elbow Method

Choosing K — Elbow Method, Silhouette Score, Davies-Bouldin

Why this matters

Choosing K: This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

The Elbow Method

Plot WCSS (inertia) against different values of $k$. As $k$ increases, inertia always decreases. The elbow point — where the rate of decrease sharply slows — suggests the optimal $k$.

Intuition: Adding more clusters always reduces inertia, but after the true number of clusters, the improvement becomes marginal. The "elbow" is where diminishing returns kick in.
Code Example
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# ── Compute inertia for k = 1..10 ────────────────────────────
inertias = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

# ── Plot Elbow Curve ──────────────────────────────────────────
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method — Find Optimal K')
plt.axvline(x=4, color='red', linestyle='--', label='Elbow at k=4')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Silhouette Score

The silhouette score for a single point $i$ is:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

where $a(i)$ = mean distance to points in the same cluster, $b(i)$ = mean distance to points in the nearest other cluster.

  • $s(i) = +1$: Point is perfectly inside its own cluster, far from others
  • $s(i) = 0$: Point is on the boundary between two clusters
  • $s(i) = -1$: Point is probably in the wrong cluster
Code Example
from sklearn.metrics import silhouette_score, davies_bouldin_score
import numpy as np

silhouette_scores = []
db_scores = []
K_range = range(2, 11)  # Silhouette undefined for k=1

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    sil = silhouette_score(X, labels)
    db  = davies_bouldin_score(X, labels)
    silhouette_scores.append(sil)
    db_scores.append(db)
    print(f"k={k:2d}  Silhouette={sil:.4f}  Davies-Bouldin={db:.4f}")

# Best k: highest silhouette score, lowest Davies-Bouldin
best_k_sil = K_range[np.argmax(silhouette_scores)]
best_k_db  = K_range[np.argmin(db_scores)]
print(f"
Best k by Silhouette: {best_k_sil}")
print(f"Best k by Davies-Bouldin: {best_k_db}")

Davies-Bouldin Index

Measures the average ratio of within-cluster scatter to between-cluster separation. Lower is better (unlike silhouette which is higher=better). No negative values — range is $[0, \infty)$.

📌
Choosing the Right Method

Use all three methods together: Elbow for a quick visual, Silhouette for mathematical optimum, Davies-Bouldin as a cross-check. When they agree, you have high confidence. When they disagree, domain knowledge should guide the final choice.

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain choosing k and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 63 — Hierarchical Clustering

Hierarchical Clustering

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Two Approaches

ApproachDirectionProcessResult
Agglomerative (bottom-up) Merge upward Start with each point as its own cluster. Iteratively merge the two closest clusters until one cluster remains. Most commonly used
Divisive (top-down) Split downward Start with all points in one cluster. Recursively split the least cohesive cluster. Computationally expensive; rarely used

Linkage Criteria

When merging clusters, how do we measure the distance between two clusters?

LinkageDistance MeasureProsCons
SingleMinimum distance between any two points in the two clustersCan handle non-elliptic shapesChaining effect — elongated "sausage" clusters
CompleteMaximum distance between any two pointsProduces compact, spherical clustersSensitive to outliers
AverageMean distance between all pairs of pointsGood compromiseBiased toward globular clusters
WardMinimises increase in total WCSS when mergingMost popular; produces compact equal-size clustersOnly valid with Euclidean distance

Dendrograms

A dendrogram is a tree diagram showing the hierarchical merge sequence. The y-axis shows the distance (dissimilarity) at which clusters merge. You choose $k$ by drawing a horizontal line — the number of vertical lines it crosses is the number of clusters.

Code Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# ── Data ────────────────────────────────────────────────────
X, _ = make_blobs(n_samples=80, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# ── Dendrogram with scipy ────────────────────────────────────
linked = linkage(X, method='ward')   # Ward linkage matrix

plt.figure(figsize=(14, 5))
dendrogram(
    linked,
    orientation='top',
    distance_sort='descending',
    show_leaf_counts=True,
    leaf_font_size=6,
    color_threshold=4.0   # Draw horizontal line at this height
)
plt.title("Hierarchical Clustering Dendrogram (Ward Linkage)")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.axhline(y=4.0, color='red', linestyle='--', label='Cut line → 4 clusters')
plt.legend()
plt.tight_layout()
plt.show()

# ── Agglomerative Clustering with sklearn ───────────────────
agg = AgglomerativeClustering(
    n_clusters=4,
    linkage='ward',          # 'single', 'complete', 'average', 'ward'
    metric='euclidean'       # Only 'euclidean' valid for Ward
)
labels = agg.fit_predict(X)
print("Cluster labels:", np.unique(labels))  # [0, 1, 2, 3]

# With affinity='precomputed' you can use custom distance matrices
# Useful for text clustering with cosine similarity
💡
When to Choose Hierarchical over K-Means

Use hierarchical clustering when: (1) you don't know the number of clusters in advance and want to explore the dendrogram, (2) you need a tree-structured output, (3) you have small-to-medium data (<10,000 points — it's O(n² log n) memory/time). For large data, K-Means or DBSCAN scale better.

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain hierarchical clustering and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 64 — DBSCAN

DBSCAN — Density-Based Spatial Clustering

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Core Idea

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters as dense regions of points separated by sparse regions. Unlike K-Means, it does not require specifying the number of clusters and can detect arbitrarily-shaped clusters.

Two Parameters

  • ε (epsilon): The radius of the neighbourhood around each point. A point's neighbourhood = all points within distance ε.
  • min_samples: The minimum number of points required to form a dense region (including the point itself). Typical values: 3–10.

Three Types of Points

Point TypeDefinitionLabel
Core PointHas at least min_samples points (including itself) within radius εBelongs to a cluster
Border PointWithin ε of a core point but has fewer than min_samples neighboursBelongs to the cluster of its core point
Noise PointNot a core point AND not within ε of any core pointLabel = -1 (outlier)
Key Insight: DBSCAN automatically labels outliers as -1. This makes it a natural anomaly detector for free! Any point that doesn't fit in a dense cluster is flagged as noise.
Code Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons, make_blobs
from sklearn.preprocessing import StandardScaler

# ── K-Means fails on crescent shapes; DBSCAN handles them ────
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

dbscan = DBSCAN(
    eps=0.3,           # Neighbourhood radius
    min_samples=5,     # Minimum points to be a core point
    metric='euclidean' # Distance metric
)
labels = dbscan.fit_predict(X)

# ── Analyse results ──────────────────────────────────────────
n_clusters  = len(set(labels)) - (1 if -1 in labels else 0)
n_noise     = np.sum(labels == -1)
core_mask   = np.zeros_like(labels, dtype=bool)
core_mask[dbscan.core_sample_indices_] = True

print(f"Clusters found: {n_clusters}")
print(f"Noise points:   {n_noise}")

# ── Visualise ────────────────────────────────────────────────
plt.figure(figsize=(8, 4))
unique_labels = set(labels)
colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))

for label, col in zip(unique_labels, colors):
    if label == -1:
        col = 'black'  # Noise = black
    mask = labels == label
    plt.scatter(X[mask & core_mask, 0], X[mask & core_mask, 1],
                c=[col], s=50, label=f'Cluster {label}' if label != -1 else 'Noise')
    plt.scatter(X[mask & ~core_mask, 0], X[mask & ~core_mask, 1],
                c=[col], s=15, alpha=0.5)

plt.title(f"DBSCAN — {n_clusters} clusters, {n_noise} noise points")
plt.legend()
plt.tight_layout()
plt.show()

# ── Choosing epsilon with the k-distance plot ────────────────
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=5).fit(X)
distances, _ = nbrs.kneighbors(X)
k_distances = np.sort(distances[:, -1])[::-1]  # Sort descending

plt.figure(figsize=(8, 3))
plt.plot(k_distances)
plt.xlabel('Points (sorted)')
plt.ylabel(f'5th nearest neighbour distance')
plt.title('K-Distance Plot — elbow suggests optimal epsilon')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

DBSCAN vs K-Means Comparison

FeatureK-MeansDBSCAN
Cluster shapeSpherical / convex onlyArbitrary (crescents, rings, blobs)
Number of clustersMust be specified (k)Auto-determined from data density
OutliersEvery point assigned to a clusterOutliers explicitly labelled as -1
Parametersk, n_initeps, min_samples
ScalabilityO(nkI) — fast for large dataO(n log n) with spatial index; O(n²) naive
Varying densityHandles uniform densityStruggles with highly varying density (use HDBSCAN)

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain dbscan and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 65 — PCA

PCA — Principal Component Analysis

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

What is PCA?

PCA is a linear dimensionality reduction technique. It finds new axes (principal components) that capture the maximum variance in the data. Each principal component is a linear combination of original features and is orthogonal to all other components.

Mathematical Foundation

Given a data matrix $X \in \mathbb{R}^{n \times p}$ (n samples, p features):

  1. Centre the data: Subtract the mean: $\tilde{X} = X - \bar{X}$
  2. Compute covariance matrix: $\Sigma = \frac{1}{n-1}\tilde{X}^T\tilde{X}$
  3. Eigen-decomposition: $\Sigma v = \lambda v$ — find eigenvalues $\lambda$ and eigenvectors $v$
  4. Sort by variance: Order eigenvectors by their eigenvalues (largest first) — these are the principal components
  5. Project: $Z = \tilde{X} W$ where $W$ is the matrix of top-$d$ eigenvectors
Explained Variance Ratio: Each principal component explains $\frac{\lambda_i}{\sum_j \lambda_j}$ of the total variance. Aim to retain components explaining ≥ 95% cumulative variance for most tasks.
Code Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# ── Load high-dimensional dataset (64 features) ──────────────
digits = load_digits()
X = digits.data          # (1797, 64) — 64 pixel features
y = digits.target

# ── Always scale before PCA ──────────────────────────────────
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ── Full PCA to examine variance ──────────────────────────────
pca_full = PCA()   # All components
pca_full.fit(X_scaled)

# ── Scree Plot ────────────────────────────────────────────────
explained_var = pca_full.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.bar(range(1, 21), explained_var[:20], color='#d4af37', alpha=0.8)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot — Individual Variance')

plt.subplot(1, 2, 2)
plt.plot(range(1, len(cumulative_var)+1), cumulative_var, 'b-o', markersize=3)
plt.axhline(y=0.95, color='red', linestyle='--', label='95% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance')
plt.legend()
plt.tight_layout()
plt.show()

# Find n_components for 95% variance
n_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components for 95% variance: {n_95} (out of 64)")
# Typically ~29 components needed for digits dataset

# ── Apply PCA for compression ─────────────────────────────────
pca = PCA(n_components=n_95)
X_pca = pca.fit_transform(X_scaled)
print(f"Original shape: {X_scaled.shape}")    # (1797, 64)
print(f"Reduced shape:  {X_pca.shape}")       # (1797, 29)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.3f}")

# ── 2D Visualisation ─────────────────────────────────────────
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=10, alpha=0.7)
plt.colorbar(scatter, label='Digit Class')
plt.title('Digits Dataset — PCA 2D Projection')
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)')
plt.tight_layout()
plt.show()

# ── PCA for feature preprocessing ────────────────────────────
# Use PCA-compressed features in downstream ML
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

lr = LogisticRegression(max_iter=2000)
scores_raw = cross_val_score(lr, X_scaled, y, cv=5, scoring='accuracy')
scores_pca = cross_val_score(lr, X_pca, y, cv=5, scoring='accuracy')
print(f"Accuracy (64 features): {scores_raw.mean():.4f}")
print(f"Accuracy (PCA {n_95}): {scores_pca.mean():.4f}")
# PCA often achieves similar accuracy with far fewer features
⚠️
PCA is for Linear Relationships Only

PCA finds linear combinations of features. If your data has non-linear structure, PCA may miss important patterns. Use Kernel PCA (sklearn.decomposition.KernelPCA) or t-SNE/UMAP for non-linear dimensionality reduction.

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain pca and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 66 — t-SNE

t-SNE — t-Distributed Stochastic Neighbour Embedding

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

What is t-SNE?

t-SNE is a non-linear dimensionality reduction technique designed exclusively for 2D or 3D visualisation. It preserves local neighbourhood structure — points that are close in high-dimensional space remain close in the 2D embedding.

How it Works (Intuition)

  1. High-D similarities: For each pair of points, compute the probability $p_{ij}$ that point $j$ would be picked as a neighbour of $i$ using a Gaussian distribution centered at $i$.
  2. Low-D similarities: Place all points randomly in 2D. Compute $q_{ij}$ — the probability of $j$ being picked as a neighbour of $i$ in 2D, using a t-distribution (heavier tails than Gaussian to avoid crowding).
  3. Minimise KL-divergence: Move points in 2D to minimise $KL(P||Q) = \sum_{ij} p_{ij} \log\frac{p_{ij}}{q_{ij}}$. Similar points are pulled together; dissimilar points are pushed apart.

The Perplexity Parameter

Perplexity ≈ the effective number of neighbours each point considers. Rule of thumb: perplexity = 5–50 (default 30). Larger datasets can use larger perplexity. Try multiple values and pick the most meaningful visualisation.

Code Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target

# ── Run t-SNE ────────────────────────────────────────────────
# Step 1: Reduce to ~50D with PCA first (speeds up t-SNE significantly)
from sklearn.decomposition import PCA
X_pca50 = PCA(n_components=50).fit_transform(X)

tsne = TSNE(
    n_components=2,    # ALWAYS 2 or 3 — not for ML features!
    perplexity=30,     # Effective neighbourhood size (5–50)
    learning_rate=200, # Step size for gradient descent (auto in newer sklearn)
    n_iter=1000,       # Number of optimisation iterations (≥250)
    random_state=42,
    n_jobs=-1          # Use all CPU cores
)
X_tsne = tsne.fit_transform(X_pca50)

print(f"t-SNE KL divergence: {tsne.kl_divergence_:.4f}")  # Lower = better fit

# ── Visualise ────────────────────────────────────────────────
plt.figure(figsize=(9, 7))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=15, alpha=0.8)
plt.colorbar(scatter, label='Digit Class')
plt.title('Digits Dataset — t-SNE 2D (perplexity=30)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.tight_layout()
plt.show()

# ── Compare different perplexities ───────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, perp in zip(axes, [5, 30, 100]):
    tsne_p = TSNE(n_components=2, perplexity=perp, random_state=42)
    X_t = tsne_p.fit_transform(X_pca50)
    ax.scatter(X_t[:, 0], X_t[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
    ax.set_title(f'Perplexity = {perp}')
    ax.set_xticks([]); ax.set_yticks([])
plt.suptitle('t-SNE with Different Perplexity Values')
plt.tight_layout()
plt.show()
⚠️
Critical t-SNE Rules
  • NEVER use t-SNE components as ML features. The distances are not meaningful across different runs or different perplexity values.
  • Cluster sizes in t-SNE are meaningless. Dense clusters look the same as sparse ones.
  • Distance between clusters is not meaningful. Only neighbourhood structure is preserved.
  • Always run multiple times. t-SNE is non-deterministic — set random_state for reproducibility.

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain t-sne and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 67 — UMAP

UMAP — Uniform Manifold Approximation and Projection

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

UMAP vs t-SNE

UMAP (McInnes et al., 2018) has largely replaced t-SNE in many workflows due to its superior speed and better preservation of global structure.

Propertyt-SNEUMAP
SpeedSlow (hours on large data)Fast (minutes on same data)
Global structurePoor — only local structureBetter — preserves global relationships
Scalability~50,000 points maxMillions of points
Output dims2D or 3D onlyAny dimension (can use for ML features!)
StabilityVariable across runsMore stable
Theory basisKL divergence minimisationRiemannian geometry + algebraic topology

Key Parameters

  • n_neighbors: Controls the balance between local and global structure. Low (2–5) = very local; High (50–200) = more global. Default: 15.
  • min_dist: Minimum distance between points in the embedding (0–1). Low = tighter clusters; High = more spread out. Default: 0.1.
  • n_components: Number of output dimensions. Use 2 for visualisation, higher for feature compression before ML.
  • metric: Distance metric — 'euclidean', 'cosine', 'manhattan', etc.
Code Example
# pip install umap-learn
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target

# ── Basic UMAP visualisation ──────────────────────────────────
reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,     # Local neighborhood size
    min_dist=0.1,       # Minimum distance in embedding
    metric='euclidean',
    random_state=42,
    n_jobs=-1
)
X_umap = reducer.fit_transform(X)

plt.figure(figsize=(9, 7))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=12, alpha=0.85)
plt.colorbar(scatter, label='Digit')
plt.title('Digits — UMAP 2D (n_neighbors=15, min_dist=0.1)')
plt.tight_layout()
plt.show()

# ── UMAP for feature engineering (unlike t-SNE this is valid!) ──
reducer_50d = umap.UMAP(n_components=10, n_neighbors=30, random_state=42)
X_umap_10d = reducer_50d.fit_transform(X)
print(f"UMAP 10D features shape: {X_umap_10d.shape}")  # (1797, 10)

# These 10D features CAN be used as input to a classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                         X_umap_10d, y, cv=5, scoring='accuracy')
print(f"Random Forest on UMAP 10D: {scores.mean():.4f} ± {scores.std():.4f}")

# ── UMAP with transform() — apply to new data ─────────────────
# Unlike t-SNE, UMAP supports transform() for new data!
new_data = np.random.randn(5, 64)  # 5 new samples
new_scaled = StandardScaler().fit(digits.data).transform(new_data)
new_umap = reducer.transform(new_scaled)  # Project into learned 2D space
print("New data projected:", new_umap.shape)  # (5, 2)
💡
UMAP for Feature Engineering

Unlike t-SNE, UMAP's output is a valid coordinate system — you can use UMAP-reduced features as input to downstream ML models. This is particularly useful for high-dimensional data like TF-IDF text vectors or image embeddings. Use UMAP → XGBoost pipelines for Kaggle competitions with tabular embeddings.

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain umap and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 68 — Anomaly Detection

Anomaly Detection

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

What is Anomaly Detection?

Anomaly detection (outlier detection) finds data points that deviate significantly from the majority pattern. Applications: fraud detection, network intrusion, manufacturing defects, medical diagnosis, system monitoring.

Isolation Forest

Isolation Forest works by randomly isolating observations. Anomalies are isolated earlier (fewer splits needed) than normal points, because they are few and different. The anomaly score is the average path length to isolate a point across many trees — shorter = more anomalous.

Code Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs

# ── Generate data with outliers ───────────────────────────────
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.5, random_state=42)
X_outliers = np.random.uniform(low=-6, high=6, size=(20, 2))
X = np.vstack([X_normal, X_outliers])

# ── Fit Isolation Forest ─────────────────────────────────────
iso_forest = IsolationForest(
    n_estimators=100,       # Number of trees
    max_samples='auto',     # Subsample size per tree (default: min(256, n))
    contamination=0.06,     # Expected proportion of outliers (0.06 = 6%)
    random_state=42,
    n_jobs=-1
)
labels = iso_forest.fit_predict(X)
# Returns: +1 for inliers, -1 for outliers

# ── Anomaly scores ────────────────────────────────────────────
scores = iso_forest.decision_function(X)
# More negative = more anomalous; 0 = boundary
print(f"Inliers:  {(labels == 1).sum()}")
print(f"Outliers: {(labels == -1).sum()}")

# ── Visualise ────────────────────────────────────────────────
plt.figure(figsize=(8, 6))
plt.scatter(X[labels==1, 0], X[labels==1, 1], c='steelblue', s=20, label='Normal', alpha=0.7)
plt.scatter(X[labels==-1, 0], X[labels==-1, 1], c='red', s=80, marker='x', label='Anomaly', linewidths=2)
plt.title('Isolation Forest — Anomaly Detection')
plt.legend()
plt.tight_layout()
plt.show()

Local Outlier Factor (LOF)

LOF measures the local density deviation of a point relative to its neighbours. A point with much lower density than its neighbours gets a high LOF score (anomalous). It is effective for detecting anomalies in datasets with varying densities.

Code Example
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(
    n_neighbors=20,        # Size of local neighbourhood
    contamination=0.06,    # Expected proportion of outliers
    novelty=False          # Set True for predict() on new data
)
lof_labels = lof.fit_predict(X)
# +1 = normal, -1 = outlier

lof_scores = -lof.negative_outlier_factor_  # Higher score = more anomalous
print(f"LOF outliers: {(lof_labels == -1).sum()}")

One-Class SVM

One-Class SVM learns a boundary around the "normal" training data. Points outside this boundary are classified as anomalies. Useful when you only have normal data during training (semi-supervised anomaly detection).

Code Example
from sklearn.svm import OneClassSVM

oc_svm = OneClassSVM(
    kernel='rbf',    # RBF kernel for non-linear boundary
    gamma='auto',    # Kernel coefficient
    nu=0.05          # Upper bound on fraction of outliers (≈ contamination)
)
oc_svm.fit(X_normal)           # Train ONLY on normal data
svm_labels = oc_svm.predict(X) # +1 = normal, -1 = outlier
print(f"One-Class SVM outliers: {(svm_labels == -1).sum()}")
MethodBest ForSpeedNote
Isolation ForestHigh-dimensional data, large datasetsFastUsually the best default choice
LOFLocal anomalies in varying-density dataMediumDoesn't support predict() by default
One-Class SVMWhen only normal data available at trainingSlowDoesn't scale well to large datasets

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain anomaly detection and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 69 — Association Rules

Association Rules — Apriori Algorithm

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

Core Concepts

Association rule mining finds interesting relationships between variables in large datasets. Classic application: market basket analysis (which products are frequently bought together).

An association rule has the form: IF {A} THEN {B} — written as $A \Rightarrow B$

Three Key Metrics

MetricFormulaInterpretation
Support $\text{supp}(A \cup B) = \frac{|A \cup B|}{N}$ Fraction of transactions containing both A and B. Measures frequency/popularity.
Confidence $\text{conf}(A \Rightarrow B) = \frac{\text{supp}(A \cup B)}{\text{supp}(A)}$ Probability of B given A. "60% of customers who bought A also bought B."
Lift $\text{lift}(A \Rightarrow B) = \frac{\text{conf}(A \Rightarrow B)}{\text{supp}(B)}$ How much more likely B is given A vs. random. Lift > 1 = positive correlation.
Lift Interpretation: Lift = 1 → A and B are independent. Lift > 1 → A promotes B. Lift < 1 → A discourages B. In practice, aim for lift > 2.0 for meaningful rules.

Apriori Algorithm

The Apriori algorithm efficiently mines frequent itemsets using the anti-monotone property: if an itemset is infrequent, all its supersets are also infrequent (allowing pruning of the search space).

Code Example
# pip install mlxtend
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# ── Sample transaction data ───────────────────────────────────
transactions = [
    ['milk', 'bread', 'butter'],
    ['beer', 'bread', 'diaper'],
    ['milk', 'diaper', 'beer', 'cola'],
    ['bread', 'milk', 'diaper', 'beer'],
    ['bread', 'milk', 'butter', 'cola'],
    ['milk', 'bread', 'butter', 'cola'],
    ['bread', 'butter'],
    ['milk', 'bread'],
    ['beer', 'cola', 'diaper'],
    ['milk', 'butter', 'cola'],
]

# ── Encode transactions to binary matrix ─────────────────────
te = TransactionEncoder()
te_array = te.fit_transform(transactions)
df = pd.DataFrame(te_array, columns=te.columns_)
print(df.head())
#    beer  bread  butter  cola  diaper  milk
# 0 False   True    True False   False  True
# ...

# ── Find frequent itemsets ────────────────────────────────────
freq_itemsets = apriori(
    df,
    min_support=0.3,       # Minimum support threshold (30%)
    use_colnames=True,     # Use column names instead of indices
    max_len=4              # Maximum itemset length
)
freq_itemsets = freq_itemsets.sort_values('support', ascending=False)
print(f"
Frequent itemsets ({len(freq_itemsets)} found):")
print(freq_itemsets.head(10))

# ── Generate association rules ────────────────────────────────
rules = association_rules(
    freq_itemsets,
    metric='lift',          # Filter by 'support', 'confidence', or 'lift'
    min_threshold=1.0       # Minimum lift value
)
rules = rules.sort_values('lift', ascending=False)
print(f"
Association rules ({len(rules)} found):")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10))

# ── Filter high-quality rules ─────────────────────────────────
strong_rules = rules[
    (rules['confidence'] >= 0.6) &   # 60%+ confidence
    (rules['lift'] >= 1.5)           # 1.5x more likely than random
]
print(f"
Strong rules: {len(strong_rules)}")
print(strong_rules[['antecedents', 'consequents', 'confidence', 'lift']])
📌
FP-Growth — Faster than Apriori

For large datasets, Apriori is slow because it generates candidate itemsets. The FP-Growth algorithm uses a compressed FP-tree structure — no candidate generation needed, typically 10–100× faster. Use from mlxtend.frequent_patterns import fpgrowth as a drop-in replacement for apriori().

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain association rules and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Day 70 — Unsupervised Projects

Customer Segmentation — K-Means Case Study

Why this matters

This topic connects directly to model quality, debugging, and interviews — master it before moving to the next day.

End-to-end customer segmentation using the classic RFM (Recency, Frequency, Monetary) framework with K-Means. This is one of the most common unsupervised learning applications in e-commerce and retail.

RFM Features

  • Recency (R): Days since the customer's last purchase (lower = better)
  • Frequency (F): Total number of purchases (higher = better)
  • Monetary (M): Total amount spent (higher = better)
Code Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')

# ── Simulate retail transaction data ─────────────────────────
np.random.seed(42)
n_customers = 500

data = pd.DataFrame({
    'CustomerID': range(1, n_customers + 1),
    'Recency':    np.random.exponential(scale=30, size=n_customers).astype(int) + 1,
    'Frequency':  np.random.poisson(lam=5, size=n_customers) + 1,
    'Monetary':   np.random.lognormal(mean=4.5, sigma=1.2, size=n_customers).round(2)
})

print("RFM Summary:")
print(data[['Recency', 'Frequency', 'Monetary']].describe())

# ── Feature Engineering & Scaling ────────────────────────────
rfm = data[['Recency', 'Frequency', 'Monetary']].copy()

# Log-transform Monetary (right-skewed distribution)
rfm['Monetary_log'] = np.log1p(rfm['Monetary'])
rfm_features = rfm[['Recency', 'Frequency', 'Monetary_log']]

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_features)

# ── Find Optimal K ────────────────────────────────────────────
inertias, sil_scores = [], []
K_range = range(2, 9)

for k in K_range:
    km = KMeans(n_clusters=k, n_init=15, random_state=42)
    labels = km.fit_predict(rfm_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(rfm_scaled, labels))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(K_range, inertias, 'bo-')
axes[0].set_title('Elbow Method'); axes[0].set_xlabel('k'); axes[0].set_ylabel('Inertia')
axes[1].plot(K_range, sil_scores, 'ro-')
axes[1].set_title('Silhouette Score'); axes[1].set_xlabel('k'); axes[1].set_ylabel('Score')
plt.tight_layout(); plt.show()

# ── Apply K-Means with k=4 ────────────────────────────────────
k_optimal = 4
kmeans = KMeans(n_clusters=k_optimal, n_init=15, random_state=42)
data['Segment'] = kmeans.fit_predict(rfm_scaled)

# ── Interpret Segments ────────────────────────────────────────
segment_analysis = data.groupby('Segment').agg({
    'Recency':   ['mean', 'median'],
    'Frequency': ['mean', 'median'],
    'Monetary':  ['mean', 'median'],
    'CustomerID': 'count'
}).round(1)
segment_analysis.columns = ['_'.join(c) for c in segment_analysis.columns]
segment_analysis.rename(columns={'CustomerID_count': 'Count'}, inplace=True)
print("
Segment Analysis:")
print(segment_analysis)

# ── Name Segments Based on RFM Profile ───────────────────────
segment_names = {
    0: 'Champions',      # Low recency, high freq, high monetary
    1: 'At Risk',        # High recency, medium freq/monetary
    2: 'Loyal Customers',# Medium recency, high freq
    3: 'New Customers'   # Low recency, low freq (recently acquired)
}
# NOTE: Mapping depends on actual centroid values — always inspect!
data['Segment_Name'] = data['Segment'].map(segment_names)

# ── 3D Visualisation ─────────────────────────────────────────
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
colors = ['#d4af37', '#e74c3c', '#3498db', '#2ecc71']

for seg, color in zip(range(k_optimal), colors):
    mask = data['Segment'] == seg
    ax.scatter(data.loc[mask, 'Recency'],
               data.loc[mask, 'Frequency'],
               data.loc[mask, 'Monetary'],
               c=color, s=20, alpha=0.6,
               label=segment_names.get(seg, f'Segment {seg}'))

ax.set_xlabel('Recency (days)'); ax.set_ylabel('Frequency'); ax.set_zlabel('Monetary ($)')
ax.set_title('Customer Segments — RFM Analysis')
ax.legend()
plt.tight_layout(); plt.show()

# ── Business Recommendations per Segment ─────────────────────
print("
=== Business Actions by Segment ===")
print("Champions:       Send loyalty rewards, ask for reviews, referral program")
print("Loyal Customers: Offer membership, upsell premium products")
print("At Risk:         Send win-back emails, special discount offers")
print("New Customers:   Welcome series, onboarding, first-purchase discount")
💡
Module 5 Summary — When to Use Which Algorithm
  • K-Means: Fast, scalable, interpretable — best for spherical clusters, large datasets
  • Hierarchical: When you need a dendrogram or don't know k; small-medium data
  • DBSCAN: Arbitrary cluster shapes, automatic outlier detection
  • PCA: Linear compression, preprocessing for ML, multicollinearity removal
  • t-SNE: 2D/3D visualisation only — no other use
  • UMAP: Fast visualisation + feature engineering; better than t-SNE
  • Isolation Forest: Best general-purpose anomaly detector
  • Apriori/FP-Growth: Market basket analysis, recommendation systems

Common mistakes

  • Applying the technique without understanding its assumptions.
  • Copying defaults from tutorials without validating on your data.
  • Skipping validation — always measure impact with a proper holdout or CV.

Interview checkpoints

  • Q: When would you use this vs a simpler baseline? A: When measurable lift on the right metric justifies complexity and maintenance cost.
  • Q: Biggest failure mode? A: Wrong data split or leakage inflating offline scores.

Practice

  1. Basic: Explain the concept in plain language with one real-world example.
  2. Intermediate: Implement on a sklearn toy dataset and interpret outputs.
  3. Advanced: Compare two approaches on the same split and document tradeoffs.

Recap

  • You can explain customer segmentation and when it applies.
  • You know the main pitfalls and how to detect them in practice.
  • You can connect this topic to the next step in the ML workflow.

Next: Continue to the next day in this module.

Unsupervised Learning: K-Means Clustering Centroids
Centroid 1 Centroid 2
Supervised Learning → Evaluation & Tuning →