Module 4 · PyTorch Deep Learning

Module 4: Dataset & DataLoader

Design custom Dataset subclasses and manage index lookup, batch loading, multi-process worker loading, and shuffling.

⏱ 20 Min Read Author: GenAIWallah Team Updated: May 2026
Day 7

Creating a Custom Dataset Class

Why this matters

Dataset: Dataset maps indices to samples — your hook for loading CSV rows or image paths.

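A subclass of torch.utils.data.Dataset implements __len__ (the number of samples) and __getitem__ (the sample at a given index). Return one sample per index, either a tensor or a tuple such as (features, label).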

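import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __len__(self):
        # Number of samples the dataset exposes
        return 1000
    def __getitem__(self, idx):
        # Random 28x28 "image" and a label in 0..9 derived from the index
        return torch.randn(28, 28), idx % 10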
  • Apply transforms inside __getitem__ (resize, normalize).
  • Keep I/O light: store file paths in __init__ and decode each image on demand in __getitem__.
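
The two points above can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name TransformedDataset, the normalize helper, and its mean/std values are hypothetical, and in-memory tensors stand in for files that would normally be decoded from disk.

```python
import torch
from torch.utils.data import Dataset


class TransformedDataset(Dataset):
    """Hypothetical example: stored samples plus an optional transform."""

    def __init__(self, images, labels, transform=None):
        # Store only references here; per-sample work happens in __getitem__.
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        if self.transform is not None:
            # Applied lazily, one sample at a time.
            image = self.transform(image)
        return image, self.labels[idx]


def normalize(image, mean=0.5, std=0.5):
    # Simple per-sample normalization; mean and std are placeholder values.
    return (image - mean) / std


images = torch.rand(100, 28, 28)   # stand-in for decoded images
labels = torch.arange(100) % 10    # stand-in for class labels
dataset = TransformedDataset(images, labels, transform=normalize)

sample, label = dataset[3]
print(len(dataset), sample.shape, int(label))
```

Because the transform runs inside __getitem__, only the samples that are actually requested pay its cost.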

Common mistakes

  • Forgetting optimizer.zero_grad() so gradients accumulate across batches.
  • Tensor shape mismatches (especially batch/channel dimensions for CNNs).
  • Training on GPU but leaving tensors on CPU (or vice versa).

Interview checkpoints

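  • Q: Explain Dataset in PyTorch. A: A Dataset maps an index to one sample through __getitem__ and reports its size through __len__; mention the shape of each sample and the device it lives on.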
  • Q: Common bug? A: Gradients, shapes, or device mismatch.

Practice

  1. Basic: Define Dataset and sketch a minimal code snippet.
  2. Intermediate: Run a notebook cell demonstrating Dataset.
  3. Advanced: Intentionally break Dataset and interpret the error.
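
For the advanced exercise, one possible way to break a Dataset is to leave out both required methods and see which errors appear. The sketch below assumes a hypothetical BrokenDataset class; the error types come from the base class raising NotImplementedError in __getitem__ and not defining __len__ at all.

```python
from torch.utils.data import Dataset


class BrokenDataset(Dataset):
    """Hypothetical subclass that forgets __getitem__ and __len__."""


broken = BrokenDataset()
errors = {}

try:
    broken[0]        # falls through to the base-class __getitem__
except NotImplementedError as exc:
    errors["getitem"] = type(exc).__name__

try:
    len(broken)      # the base class does not define __len__
except TypeError as exc:
    errors["len"] = type(exc).__name__

print(errors)  # {'getitem': 'NotImplementedError', 'len': 'TypeError'}
```

Reading the two messages side by side shows which method each operation depends on.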

Recap

  • You can explain Dataset clearly.
  • You know one mistake to avoid.
  • You see how this connects to the next lesson.

Next: DataLoader

Day 8

DataLoader Batch Loading and Shuffling

Why this matters

DataLoader: DataLoader batches, shuffles, and parallelizes loading — essential for GPU throughput.

DataLoader wraps a Dataset to produce mini-batches.

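import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data, labels):
        # Keep features and labels aligned by index
        self.data = data
        self.labels = labels
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# Placeholder tensors: 1000 samples of 28x28, labels in 0..9
dataset = MyDataset(torch.randn(1000, 28, 28), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)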
  • shuffle=True for training; False for validation/test.
  • num_workers > 0 loads batches in worker subprocesses, so data preparation can overlap with training.
  • pin_memory=True speeds CPU→GPU transfer when using CUDA.
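
The options above can be seen together in a short sketch. It uses TensorDataset, a built-in wrapper not mentioned in this lesson, with placeholder tensors; the variable names are illustrative. num_workers is left at 0 so the snippet runs anywhere, and pin_memory is enabled only when CUDA is available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(100, 28, 28)   # placeholder inputs
targets = torch.arange(100) % 10      # placeholder labels, in index order
dataset = TensorDataset(features, targets)

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# num_workers=0 loads in the main process; raise it once the pipeline works.
train_loader = DataLoader(dataset, batch_size=32, shuffle=True,
                          num_workers=0, pin_memory=use_cuda)
val_loader = DataLoader(dataset, batch_size=32, shuffle=False, num_workers=0)

batch_sizes = []
for x, y in train_loader:
    # non_blocking=True only helps when the source tensor is in pinned memory.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    batch_sizes.append(x.shape[0])

# Without shuffling, the first batch follows the dataset order.
first_val_targets = next(iter(val_loader))[1]
print(len(train_loader), batch_sizes, first_val_targets[:5].tolist())
# prints: 4 [32, 32, 32, 4] [0, 1, 2, 3, 4]
```

The last batch holds only 4 samples because 100 is not a multiple of 32; passing drop_last=True would discard it.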

Common mistakes

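  • Forgetting optimizer.zero_grad() inside the batch loop, so gradients accumulate across batches.
  • Tensor shape mismatches (especially batch/channel dimensions for CNNs); the default batching stacks samples, so every sample in a batch needs the same shape.
  • Training on GPU but leaving the batches from the DataLoader on CPU (or vice versa).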
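
A minimal training loop that avoids the three mistakes above might look like the sketch below. The model, learning rate, and random data are placeholders chosen only to keep the example small.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder data: 64 single-channel 28x28 images with labels in 0..9.
dataset = TensorDataset(torch.randn(64, 1, 28, 28), torch.randint(0, 10, (64,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

steps = 0
for x, y in loader:
    x, y = x.to(device), y.to(device)   # keep data and model on one device
    assert x.shape[1:] == (1, 28, 28)   # (channels, height, width) per sample
    optimizer.zero_grad()               # clear gradients from the last batch
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    steps += 1

print(steps, loss.item())
```

The shape assertion is optional, but it turns a confusing error deep inside a layer into an immediate, readable failure.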

Interview checkpoints

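  • Q: Explain DataLoader in PyTorch. A: A DataLoader wraps a Dataset to produce mini-batches, with optional shuffling and worker subprocesses; mention the added batch dimension and the device the batches are on.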
  • Q: Common bug? A: Gradients, shapes, or device mismatch.

Practice

  1. Basic: Define DataLoader and sketch a minimal code snippet.
  2. Intermediate: Run a notebook cell demonstrating DataLoader.
  3. Advanced: Intentionally break DataLoader and interpret the error.
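
For the advanced exercise, one possible way to break a DataLoader is to feed it samples of different shapes. The sketch below uses a hypothetical RaggedDataset; with the default batching, stacking unequal tensors is expected to fail with a RuntimeError.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RaggedDataset(Dataset):
    """Hypothetical dataset whose samples have different lengths."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return torch.zeros(idx + 1)   # lengths 1, 2, 3, 4


loader = DataLoader(RaggedDataset(), batch_size=2, shuffle=False)

try:
    next(iter(loader))
    outcome = "no error"
except RuntimeError as exc:
    # Default collation stacks samples, which requires equal shapes.
    outcome = "RuntimeError"
    print(exc)

print(outcome)
```

One common fix is to pad or crop inside __getitem__ so that every sample has the same shape before batching.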

Recap

  • You can explain DataLoader clearly.
  • You know one mistake to avoid.
  • You see how this connects to the next lesson.

Next: CPU vs GPU
