Module 4: Dataset & DataLoader
Design custom Dataset subclasses. Handle index-based lookup, batch loading, multi-process worker loading, and shuffling.
Day 7
Creating a Custom Dataset Class
Why this matters
Dataset: A Dataset maps indices to samples; it is your hook for loading CSV rows or image paths.
A subclass of torch.utils.data.Dataset implements __len__ and __getitem__, returning one sample (a tensor or a tuple) per index.
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __len__(self):
        # Number of samples the dataset exposes
        return 1000

    def __getitem__(self, idx):
        # One (image, label) pair: a random 28x28 tensor and a label in 0..9
        return torch.randn(28, 28), idx % 10

- Apply transforms (resize, normalize) inside __getitem__.
- Keep I/O light: cache paths, decode images on the fly.
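As a minimal sketch of applying a transform inside __getitem__, the example below stores in-memory tensors and normalizes each sample on access. The class name TransformedToyDataset, the normalize helper, and the mean/std values of 0.5 are hypothetical choices for illustration, not part of the lesson's code; a real dataset would typically store file paths and decode each image here instead.

```python
import torch
from torch.utils.data import Dataset


class TransformedToyDataset(Dataset):
    """Hypothetical dataset: in-memory images plus an optional per-sample transform."""

    def __init__(self, num_samples=100, transform=None):
        # Seeded generator so the placeholder data is reproducible
        generator = torch.Generator().manual_seed(0)
        # Stand-in for real data: values in [0, 1), like scaled pixel intensities
        self.images = torch.rand(num_samples, 28, 28, generator=generator)
        self.labels = torch.arange(num_samples) % 10
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        # The transform runs lazily, once per requested sample
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[idx]


def normalize(image, mean=0.5, std=0.5):
    # Assumed statistics; maps [0, 1) values to roughly [-1, 1)
    return (image - mean) / std


dataset = TransformedToyDataset(transform=normalize)
image, label = dataset[3]
print(len(dataset), image.shape, label.item())  # 100 torch.Size([28, 28]) 3
```

Passing the transform in through the constructor keeps the dataset reusable: the same class can serve training (with augmentation) and validation (with normalization only).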
Common mistakes
- Forgetting optimizer.zero_grad() so gradients accumulate across batches.
- Tensor shape mismatches (especially batch/channel dimensions for CNNs).
- Training on GPU but leaving tensors on CPU (or vice versa).
Interview checkpoints
- Q: Explain Dataset in PyTorch. A: One-sentence definition + shape/device note.
- Q: Common bug? A: Gradients, shapes, or device mismatch.
Practice
- Basic: Define Dataset and sketch a minimal code snippet.
- Intermediate: Run a notebook cell demonstrating Dataset.
- Advanced: Intentionally break Dataset and interpret the error.
Recap
- You can explain Dataset clearly.
- You know one mistake to avoid.
- You see how this connects to the next lesson.
Next: DataLoader
Day 8
DataLoader Batch Loading and Shuffling
Why this matters
DataLoader: DataLoader batches, shuffles, and parallelizes loading — essential for GPU throughput.
DataLoader wraps a Dataset to produce mini-batches.
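import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data, labels):
        # Samples and labels are assumed to have the same length
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# Placeholder tensors for illustration
dataset = MyDataset(torch.randn(1000, 28, 28), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

- shuffle=True for training; False for validation/test.
- num_workers > 0 prefetches batches in subprocesses.
- pin_memory=True speeds CPU→GPU transfer when using CUDA.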
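The effect of these options can be checked with a small sketch. It assumes placeholder data wrapped in the built-in TensorDataset, with each label set equal to its index so that ordering is easy to inspect. The sizes (100 samples, batch size 32) are arbitrary, and num_workers=0 is used only to keep the example single-process and portable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: label == index, so batch ordering is visible in the labels
features = torch.randn(100, 28, 28)
labels = torch.arange(100)
dataset = TensorDataset(features, labels)

# Pinned memory only helps when batches are later copied to a CUDA device
use_cuda = torch.cuda.is_available()

train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,  # new random order every epoch
    num_workers=0,  # raise above 0 to load in worker subprocesses
    pin_memory=use_cuda,
    generator=torch.Generator().manual_seed(0),  # reproducible shuffling
)
val_loader = DataLoader(dataset, batch_size=32, shuffle=False, pin_memory=use_cuda)

val_batches = [batch_labels for _, batch_labels in val_loader]
train_labels = torch.cat([batch_labels for _, batch_labels in train_loader])

print(len(val_loader))  # 4 batches: ceil(100 / 32)
print([len(b) for b in val_batches])  # [32, 32, 32, 4]
print(val_batches[0][:5].tolist())  # [0, 1, 2, 3, 4] (original order kept)
# Shuffling reorders samples but still visits each one exactly once per epoch
print(torch.equal(train_labels.sort().values, labels))  # True
```

The last batch is smaller because 100 is not a multiple of 32; passing drop_last=True would discard it instead.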
Common mistakes
- Forgetting optimizer.zero_grad() so gradients accumulate across batches.
- Tensor shape mismatches (especially batch/channel dimensions for CNNs).
- Training on GPU but leaving tensors on CPU (or vice versa).
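The three mistakes above can be seen together in one short training loop over a DataLoader. This is a sketch under stated assumptions rather than a recommended setup: the tiny convolutional model, the SGD learning rate, and the random placeholder data are all hypothetical and exist only to make the loop runnable.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder data: 64 grayscale 28x28 images with labels in 0..9
images = torch.randn(64, 28, 28)
labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

# Hypothetical model; Conv2d expects input shaped (batch, channels, height, width)
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(4 * 28 * 28, 10),
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

num_steps = 0
for batch_images, batch_labels in loader:
    # Shape: add the channel dimension, (N, 28, 28) -> (N, 1, 28, 28)
    # Device: move both inputs and targets to the model's device
    batch_images = batch_images.unsqueeze(1).to(device)
    batch_labels = batch_labels.to(device)

    # Gradients: clear them so they do not accumulate across batches
    optimizer.zero_grad()
    logits = model(batch_images)
    loss = loss_fn(logits, batch_labels)
    loss.backward()
    optimizer.step()
    num_steps += 1

print(num_steps, tuple(logits.shape))  # 4 (16, 10)
```

Removing the unsqueeze(1) call or either .to(device) call is a quick way to reproduce the shape and device errors and read their messages.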
Interview checkpoints
- Q: Explain DataLoader in PyTorch. A: One-sentence definition + shape/device note.
- Q: Common bug? A: Gradients, shapes, or device mismatch.
Practice
- Basic: Define DataLoader and sketch a minimal code snippet.
- Intermediate: Run a notebook cell demonstrating DataLoader.
- Advanced: Intentionally break DataLoader and interpret the error.
Recap
- You can explain DataLoader clearly.
- You know one mistake to avoid.
- You see how this connects to the next lesson.
Next: CPU vs GPU
