Module 1: Foundations of Machine Learning & Deep Learning

Master Math Prerequisites, Classical ML, Neural Networks, and Deep Learning Architectures for Generative AI.

⏱ 25 Min Read • Author: GenAIWallah Team • Updated: May 2026

1.1 Math Prerequisites

Before diving into modern generative models such as Large Language Models (LLMs) or diffusion models, you need a solid grasp of a few core mathematical ideas. Neural networks are, at their core, compositions of linear algebra operations that are trained with calculus and interpreted through probability.

💡

Why Math Matters for GenAI

You can call APIs without knowing this math. However, if you want to understand how a Transformer computes token embeddings, how self-attention weights are computed, or how a diffusion model adds and removes Gaussian noise, these equations are non-negotiable.

A. Linear Algebra

In ML and Deep Learning, data (images, text tokens, audio clips) is represented as numerical vectors. Models process these inputs in parallel batches using matrices.

A vector is an ordered array of numbers representing a point in multi-dimensional space. In language models, each token is mapped to a high-dimensional vector (e.g., 4096 dimensions in LLaMA 7B) known as an embedding.

The dot product (or scalar product) of two vectors measures their directional alignment. It is computed as:

u \cdot v = \sum_{i=1}^{n} u_i v_i = u^T v

A large positive dot product means the vectors point in similar directions, although the value also grows with their magnitudes. This is the key building block of the self-attention mechanism, which takes dot products of Query and Key vectors to score how strongly tokens relate to each other.

Python (NumPy Vector Operations)

import numpy as np

# Define two token embeddings
u = np.array([0.1, 0.8, -0.2])
v = np.array([0.3, 0.6, 0.1])

# Compute Dot Product
dot_product = np.dot(u, v)
print(f"Dot Product: {dot_product:.4f}")

# Compute a vector-matrix product (simulating the weights of one linear layer)
W = np.random.randn(3, 4)  # maps 3 input features to 4 output features
output = u @ W
print(f"Layer Outputs: {output}")
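
Because the raw dot product also grows with vector length, a common normalized variant is cosine similarity. The sketch below is illustrative only: `cosine_similarity` is a helper name introduced here, and the vectors are the same toy embeddings as above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b after scaling both to unit length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([0.1, 0.8, -0.2])
v = np.array([0.3, 0.6, 0.1])

# Scaling v by 10 multiplies the dot product by 10 ...
print(np.dot(u, v), np.dot(u, 10 * v))

# ... but leaves the cosine similarity unchanged, since only direction matters.
print(cosine_similarity(u, v), cosine_similarity(u, 10 * v))
```

Cosine similarity always lies between -1 and 1, which makes it easier to compare across embeddings of different lengths.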

B. Eigenvalues & Eigenvectors

For a square matrix $A$, a nonzero vector $v$ is an eigenvector with corresponding eigenvalue $\lambda$ if:

A v = \lambda v

This means that when the linear transformation represented by $A$ is applied to $v$, the vector is only scaled by the factor $\lambda$: it stays on the same line through the origin (and flips if $\lambda$ is negative). Eigenvalues are central to Principal Component Analysis (PCA), which is used to reduce the dimensionality of embeddings, and they also appear when analyzing how optimization converges.
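
As a quick numerical check of the eigenvalue equation, the sketch below decomposes a small symmetric matrix with NumPy and verifies it for each eigenpair. The matrix is an arbitrary example chosen for illustration, not one taken from a real model.

```python
import numpy as np

# A small symmetric matrix chosen purely for illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is meant for symmetric matrices; eigenvalues are returned in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)

for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    v = eigenvectors[:, i]  # eigenvectors are stored as columns
    # Applying A to v should give the same result as scaling v by lam
    print(lam, np.allclose(A @ v, lam * v))
```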

C. Calculus (Gradients & Optimization)

Calculus is used to train neural networks. It enables optimization routines to adjust model parameters (weights) to minimize error.

For functions with multiple variables, a partial derivative computes the rate of change along a single parameter axis while holding other values constant. The gradient ($\nabla f$) is a vector containing all partial derivatives:

\nabla f(x_1, x_2, \dots, x_n) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]

The gradient points in the direction of steepest ascent. During training, we use backpropagation to calculate the gradient of the loss function and then step in the opposite direction (gradient descent) to reduce the loss.
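
A minimal sketch of gradient descent on a two-variable function follows. The function `f`, its minimum at (3, -1), the learning rate, and the step count are all assumptions chosen so the behavior is easy to verify by hand.

```python
import numpy as np

def f(x):
    # Illustrative bowl-shaped function with its minimum at (3, -1)
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

def grad_f(x):
    # Vector of partial derivatives [df/dx_1, df/dx_2]
    return np.array([2.0 * (x[0] - 3.0), 2.0 * (x[1] + 1.0)])

x = np.array([0.0, 0.0])
learning_rate = 0.1

for step in range(100):
    # Step against the gradient, i.e. in the direction of steepest descent
    x = x - learning_rate * grad_f(x)

print(x, f(x))
```

Each step shrinks the distance to the minimum by a constant factor here; on real loss surfaces the progress is far less regular.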

D. The Chain Rule

Since deep networks are stacked layers (composite functions such as $f(g(x))$), the chain rule lets us calculate how changes in early weights affect the final loss. For $y = f(u)$ with $u = g(x)$:

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
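
The chain rule can be checked numerically. The sketch below uses two arbitrary functions, an inner `g` and an outer `f` (both illustrative choices), and compares the analytic product of derivatives with a finite-difference estimate.

```python
import math

def g(x):
    # Inner function u = g(x)
    return 3.0 * x + 1.0

def f(u):
    # Outer function y = f(u)
    return math.sin(u)

def dy_dx(x):
    u = g(x)
    dy_du = math.cos(u)  # derivative of the outer function, evaluated at u
    du_dx = 3.0          # derivative of the inner function
    return dy_du * du_dx

x0 = 0.5
h = 1e-6
# Central finite difference of the composite function f(g(x))
numeric = (f(g(x0 + h)) - f(g(x0 - h))) / (2 * h)

print(dy_dx(x0), numeric)
```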

⚠️

Vanishing Gradients

In very deep architectures or long recurrent chains, multiplying many partial derivatives with magnitude below 1.0 causes the gradients reaching early layers to shrink toward zero. Those layers then effectively stop learning. Residual connections (skip connections), introduced in ResNets and later adopted by Transformers, mitigate this by giving gradients a direct path around each layer.
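
To get a feel for the scale of the problem, the sketch below multiplies the local derivative of a sigmoid across many layers. It is a deliberate simplification: weights are ignored and every layer is assumed to sit at the same pre-activation value `z`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25, reached at z = 0

def gradient_scale(depth, z=0.0):
    # Product of `depth` identical local derivatives (weights ignored for simplicity)
    return sigmoid_grad(z) ** depth

for depth in (1, 5, 20, 50):
    print(depth, gradient_scale(depth))
```

Even in this best case for the sigmoid, the factor reaching the first layer falls off exponentially with depth.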

E. Probability & Information Theory

Generative AI is fundamentally probabilistic. Language models predict a probability distribution over the next token, while image generators map random noise back to plausible pixels.

Bayes' Theorem: Bayes' theorem expresses a conditional probability (the probability of event A occurring given that event B has occurred) in terms of the reverse conditional probability:

P(A|B) = \frac{P(B|A) P(A)}{P(B)}

In language generation, we continuously evaluate the conditional probability of the next token given all preceding context tokens: $P(w_t | w_1, \dots, w_{t-1})$.
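
A small worked example of Bayes' theorem is sketched below. All of the probabilities are hypothetical numbers chosen for illustration, and `bayes_posterior` is a helper name introduced here; $P(B)$ is expanded with the law of total probability.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B), expanding P(B) over the cases A and not-A."""
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Hypothetical numbers: A = "message is spam", B = "message contains the word 'free'"
p_spam = 0.2
p_free_given_spam = 0.6
p_free_given_ham = 0.05

posterior = bayes_posterior(p_spam, p_free_given_spam, p_free_given_ham)
print(f"P(spam | 'free') = {posterior:.3f}")
```

Observing the word raises the probability of spam well above the prior, because the word is much more likely under spam than under non-spam.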

Entropy & Cross-Entropy Loss: Entropy measures the unpredictability of a probability distribution. In classification and token generation, we use Cross-Entropy Loss to compare the model's predicted probability distribution $q$ against the ground-truth target distribution $p$:

H(p, q) = -\sum_{x} p(x) \log q(x)

Minimizing cross-entropy loss pushes the model to assign higher probability to the tokens that actually appear in the training text.
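
The sketch below evaluates the cross-entropy formula for a one-hot target over a hypothetical 4-token vocabulary. The two predicted distributions are invented for illustration, and the small `eps` guards against taking the logarithm of zero.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x), using the natural logarithm."""
    return -sum(p_x * math.log(q_x + eps) for p_x, q_x in zip(p, q) if p_x > 0)

# One-hot target: the true next token is index 2
p = [0.0, 0.0, 1.0, 0.0]

confident = [0.05, 0.05, 0.85, 0.05]  # most mass on the correct token
unsure = [0.25, 0.25, 0.25, 0.25]     # uniform guess

print(cross_entropy(p, confident))
print(cross_entropy(p, unsure))
```

With a one-hot target the loss reduces to the negative log-probability the model assigns to the correct token, so the confident prediction scores lower.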

1.2 Classical ML Refresher

Before diving straight into neural networks, it is vital to understand basic ML taxonomy:

Supervised Learning: Training the model on labeled data (e.g. mapping inputs $X$ to targets $Y$).
Unsupervised Learning: Finding hidden patterns in unlabeled data (e.g. clustering or density estimation).
Self-Supervised Learning: Deriving the training labels from the data itself, by hiding part of the input and training the model to predict it (e.g. next-token prediction in LLMs).

A. Loss Functions & Optimizers

To train models, we define a loss function (like Mean Squared Error for regression, or Cross-Entropy for text tokens) and minimize it using optimization techniques.

Stochastic Gradient Descent (SGD): Estimates the gradient of the loss on a mini-batch of training examples and moves the weights in the opposite direction, scaled by a learning rate $\eta$:

W \leftarrow W - \eta \nabla_W L(W)

RMSprop: Scales each parameter's learning rate by dividing by the square root of a running average of squared gradients, so parameters with consistently large gradients take smaller steps.

Adam (Adaptive Moment Estimation): Combines momentum (a running average of gradients) with RMSprop-style scaling (a running average of squared gradients). Adam and its variant AdamW are the standard optimizers for training LLMs.
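
The update rules above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production optimizer: `sgd_step`, `adam_step`, the toy quadratic loss, and the hyperparameter values are assumptions made for the example, and the full gradient is used instead of a mini-batch estimate.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

target = np.array([1.0, -2.0])

def grad_loss(w):
    # Gradient of the toy loss L(w) = ||w - target||^2
    return 2.0 * (w - target)

w_sgd = np.zeros(2)
w_adam = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)

for t in range(1, 501):
    w_sgd = sgd_step(w_sgd, grad_loss(w_sgd))
    w_adam, m, v = adam_step(w_adam, grad_loss(w_adam), m, v, t)

print(w_sgd, w_adam)
```

Both optimizers reach the target on this easy problem; the practical differences show up on poorly scaled or noisy losses.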

B. Regularization & Tradeoffs

Bias-Variance Tradeoff: Models with high bias underfit (too simple), whereas models with high variance overfit (memorize the training noise instead of general patterns).

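Regularization: Techniques such as L1 (Lasso) and L2 (Ridge, or weight decay) penalize large weights, pushing parameters toward zero to encourage generalization. L1 penalizes absolute values and tends to produce sparse weights; L2 penalizes squared values, giving the total loss: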

L_{total} = L_{data} + \lambda \sum_{j} w_j^2
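
The effect of the L2 penalty can be seen on a small linear regression. The sketch below is illustrative: the synthetic data, the helper names, and the values of `lam`, `lr`, and `steps` are all assumptions, and mean squared error stands in for $L_{data}$.

```python
import numpy as np

def l2_regularized_loss(w, X, y, lam):
    """Mean squared error plus lam * sum_j w_j^2."""
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)

def l2_regularized_grad(w, X, y, lam):
    residual = X @ w - y
    # Gradient of the data term plus gradient of the penalty term
    return 2.0 * X.T @ residual / len(y) + 2.0 * lam * w

# Synthetic regression data with known true weights
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

def fit(lam, steps=500, lr=0.05):
    w = np.zeros(3)
    for _ in range(steps):
        w = w - lr * l2_regularized_grad(w, X, y, lam)
    return w

print(fit(lam=0.0))  # unregularized weights
print(fit(lam=1.0))  # weights shrunk toward zero by the penalty
```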

1.3 Neural Network Fundamentals

A neural network is a network of artificial neurons organized in layers.

The Perceptron: The fundamental unit of neural networks. It computes a weighted sum of inputs plus a bias and passes it through an activation function:

y = \sigma\left( \sum_{i} w_i x_i + b \right)
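
The formula translates directly into code. In the sketch below the weights, bias, and inputs are hypothetical values, and a sigmoid is assumed for the activation $\sigma$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, 2.0, -1.0])   # inputs
w = np.array([0.5, -0.25, 0.1])  # hypothetical weights
b = 0.1                          # hypothetical bias

print(neuron(x, w, b))
```

A full layer applies the same computation with a weight matrix in place of the single weight vector.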

A. Activation Functions

Without activation functions, a stack of linear layers collapses into a single linear transformation, no matter how many layers it has. Activation functions introduce non-linearities, allowing neural networks to approximate complex, non-linear functions.

Sigmoid: Maps inputs to the open interval (0, 1). Historically popular, but suffers from vanishing gradients for inputs of large magnitude.
ReLU (Rectified Linear Unit): $f(x) = \max(0, x)$. Cheap to compute and avoids vanishing gradients for positive inputs, but outputs are unbounded above and the gradient is exactly zero for negative inputs.
GELU (Gaussian Error Linear Unit): $f(x) = x \, \Phi(x)$, where $\Phi$ is the cumulative distribution function of the standard Gaussian. It is used in Transformers such as BERT and GPT-2; LLaMA uses the related SwiGLU activation instead.
Softmax: Converts a vector of real-valued scores (logits) into a probability distribution summing to 1.0 (used at the output layer of LLMs).
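
The four functions listed above can be written compactly in NumPy. This is a minimal sketch: the GELU uses the exact form based on the error function, and the sample `logits` are arbitrary.

```python
import math
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # Exact form x * Phi(x), with the standard normal CDF written via erf
    return np.array([0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

def softmax(x):
    shifted = x - np.max(x)  # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([-2.0, 0.0, 1.0, 3.0])

print("sigmoid:", sigmoid(logits))
print("relu:   ", relu(logits))
print("gelu:   ", gelu(logits))
print("softmax:", softmax(logits))
```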

B. Training & Normalization

Backpropagation: The backward pass through a neural network. It applies the chain rule recursively to compute gradients of the loss with respect to all parameters; an optimizer then uses those gradients to update the weights.

Batch Normalization vs Layer Normalization: Batch Norm normalizes each feature across the batch dimension. Because sequence lengths and batch sizes vary in NLP, Transformers use Layer Normalization, which normalizes across the features of each token vector independently of the rest of the batch.
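
The difference between the two comes down to which axis the statistics are computed over. The sketch below is a simplified illustration that omits the learnable scale and shift parameters used in practice; the input shape and `eps` value are assumptions.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Statistics per feature, computed across the batch (axis 0)
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Statistics per example, computed across its features (last axis)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(4, 8))  # 4 examples, 8 features

print(batch_norm(x).mean(axis=0).round(6))   # each feature is centered
print(layer_norm(x).mean(axis=-1).round(6))  # each example is centered
```

Because `layer_norm` never looks across examples, its output for one input is the same whatever else is in the batch.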

1.4 Deep Learning Architectures

Different data structures require different architectures:

Convolutional Neural Networks (CNNs): Use spatial filters (convolutions) to process grid-like structures (images). Efficient due to weight sharing.
Recurrent Neural Networks (RNNs) & LSTMs: Use feedback loops to process sequential data. However, each step depends on the previous hidden state, so a sequence must be processed one step at a time, which prevents parallelism across time steps on modern GPUs.
Seq2Seq with Attention: Introduced in machine translation (encoder-decoder architectures). Instead of compressing a sentence into a single static bottleneck vector, attention allows the decoder to look back at specific encoder states dynamically, paving the way for the Transformer.
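
The "look back" step of attention can be sketched as a weighted average over encoder states. The example below assumes a plain dot-product score for simplicity, which is only one of several scoring functions used in seq2seq models; the function `attend` and the random states are illustrative.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

def attend(decoder_state, encoder_states):
    """Weight each encoder state by its dot-product similarity to the decoder state."""
    scores = encoder_states @ decoder_state  # one score per source position
    weights = softmax(scores)                # normalized to sum to 1
    context = weights @ encoder_states       # weighted average of encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))  # 5 source tokens, hidden size 4
decoder_state = rng.normal(size=4)        # current decoder hidden state

context, weights = attend(decoder_state, encoder_states)
print(weights.round(3))
print(context.round(3))
```

The context vector is recomputed at every decoding step, so the decoder is not limited to a single fixed summary of the source sentence.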

Figure: AI vs. Machine Learning vs. Deep Learning hierarchy

💡

Next Steps

Now that you have completed the foundations, proceed to Module 2: The Transformer Architecture to learn how self-attention is built from scratch.