PyTorch DataLoader: Load and Batch Data Efficiently

I was working on a deep learning project that required me to efficiently load and batch large datasets for training a neural network. Manually managing data batching, shuffling, and parallel loading can be very tedious and prone to errors. This is where PyTorch’s DataLoader becomes extremely helpful.

In this article, I will cover everything you need to know about PyTorch’s DataLoader, from basic usage to advanced configurations.

So let’s get started!

PyTorch DataLoader

PyTorch DataLoader is a utility class that helps you load data in batches, shuffle it, and even load it in parallel using multiprocessing workers. It’s one of the most fundamental tools in the PyTorch ecosystem for efficiently feeding data to your models.

The DataLoader wraps a Dataset object and provides an iterator over the dataset, handling all the complexity of batching, shuffling, and parallel data loading for you.

Basic Usage of PyTorch DataLoader

Now, I will explain the basic usage of PyTorch DataLoader.

Create a Simple DataLoader

Let’s start with a basic example of how to create and use a DataLoader:

import torch
from torch.utils.data import Dataset, DataLoader

# Create a simple dataset
class NumbersDataset(Dataset):
    def __init__(self, start, end):
        self.data = list(range(start, end))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.tensor([self.data[idx]], dtype=torch.float32)

# Create dataset and dataloader
dataset = NumbersDataset(0, 100)
dataloader = DataLoader(dataset, batch_size=10, shuffle=True)

# Iterate through the dataloader
for batch in dataloader:
    print(batch.shape, batch)
    break  # Just print the first batch

In this example, we create a simple dataset containing numbers from 0 to 99, then use a DataLoader to load this data in batches of 10 while shuffling the order.

Key Parameters of DataLoader

The DataLoader class has several important parameters that control its behavior:

Batch Size

The batch_size parameter determines how many samples are loaded in each batch:

# Create a dataloader with batch size of 32
dataloader = DataLoader(dataset, batch_size=32)

Shuffling the Data

The shuffle parameter controls whether the data is randomly shuffled before batching:

# Create a dataloader with shuffling enabled
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

This is particularly important for training neural networks, as it helps prevent the model from learning the order of samples.
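If you also need the shuffled order to be reproducible (for debugging or comparing runs), you can pass a seeded generator to the DataLoader. Here is a minimal sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(20))

# Two loaders seeded identically produce the same shuffle order
loader_a = DataLoader(dataset, batch_size=5, shuffle=True,
                      generator=torch.Generator().manual_seed(0))
loader_b = DataLoader(dataset, batch_size=5, shuffle=True,
                      generator=torch.Generator().manual_seed(0))

first_a = next(iter(loader_a))[0]
first_b = next(iter(loader_b))[0]
print(torch.equal(first_a, first_b))  # True
```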

Number of Workers

The num_workers parameter allows you to load data in parallel using multiple processes:

# Create a dataloader with 4 worker processes
dataloader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=4)

This can significantly speed up data loading, especially when working with images or other data that requires preprocessing.

Pin Memory

For GPU training, the pin_memory parameter can improve transfer speed:

# Create a dataloader with pinned memory for faster GPU transfer
dataloader = DataLoader(dataset, batch_size=16, pin_memory=True)

This allocates the memory in a way that makes CPU-to-GPU data transfer faster.
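As a quick sketch of how this fits into a training loop: pinned memory pairs with `non_blocking=True` on the `.to()` call, which lets the CPU-to-GPU copy overlap with computation. The toy TensorDataset below is just a stand-in, and pinning is enabled only when a GPU is actually present:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data standing in for a real dataset
dataset = TensorDataset(torch.randn(100, 8), torch.randn(100, 1))

# Pinning host memory only pays off (and only applies) when a GPU is present
loader = DataLoader(dataset, batch_size=16,
                    pin_memory=torch.cuda.is_available())

for inputs, targets in loader:
    # non_blocking=True lets the host-to-device copy overlap with computation
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    break

print(inputs.shape)  # torch.Size([16, 8])
```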

Create Custom Datasets for DataLoader

To use DataLoader effectively, you need to create a custom Dataset class. Let’s look at a more realistic example using a dataset of US stock prices:

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class StockDataset(Dataset):
    def __init__(self, csv_file, window_size=10):
        self.data = pd.read_csv(csv_file)
        self.window_size = window_size

    def __len__(self):
        return len(self.data) - self.window_size

    def __getitem__(self, idx):
        window = self.data.iloc[idx:idx+self.window_size]['close'].values
        target = self.data.iloc[idx+self.window_size]['close']
        x = torch.tensor(window, dtype=torch.float32)
        y = torch.tensor([target], dtype=torch.float32)
        return x, y

if __name__ == "__main__":
    # Load the dataset
    stock_dataset = StockDataset('sp500_prices.csv')
    
    # Use num_workers=0 for Windows compatibility
    stock_dataloader = DataLoader(
        stock_dataset,
        batch_size=32,
        shuffle=True,
        num_workers=0
    )

    for inputs, targets in stock_dataloader:
        print(f"Batch inputs shape: {inputs.shape}, targets shape: {targets.shape}")
        break

This example creates a dataset for time series prediction on S&P 500 stock prices, using a sliding window approach.
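In practice, you will usually want to hold out part of such a dataset for validation. Here is a minimal sketch using torch.utils.data.random_split, with a toy TensorDataset standing in for the CSV data (which isn't assumed to be available here):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Toy stand-in for the stock dataset: 500 windows of 10 prices each
full_dataset = TensorDataset(torch.randn(500, 10), torch.randn(500, 1))

# 80/20 split; the seeded generator makes the split reproducible
train_set, val_set = random_split(
    full_dataset, [400, 100],
    generator=torch.Generator().manual_seed(42)
)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)

print(len(train_set), len(val_set))  # 400 100
```

Note that shuffling is typically disabled for the validation loader, since evaluation doesn't benefit from a random order.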

Use DataLoader with Built-in PyTorch Datasets

PyTorch also provides several built-in datasets that work seamlessly with DataLoader:

import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader

# Define transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load MNIST dataset
train_dataset = torchvision.datasets.MNIST(
    root='./data', 
    train=True, 
    download=True, 
    transform=transform
)

# Create dataloader
train_loader = DataLoader(
    train_dataset, 
    batch_size=64, 
    shuffle=True,
    num_workers=2
)

# Examine a batch
for images, labels in train_loader:
    print(f"Batch shape: {images.shape}, Labels shape: {labels.shape}")
    print(f"Sample labels: {labels[:5]}")
    break

Advanced DataLoader Techniques

Let me walk you through some advanced DataLoader techniques.

Custom Sampling with Sampler

Sometimes you need more control over how samples are drawn from your dataset:

from torch.utils.data import WeightedRandomSampler

# Create weights for each sample (e.g., to balance classes)
class_counts = [5000, 1000]  # Example: 5000 samples of class 0, 1000 of class 1
weights = [1.0/class_counts[label] for label in train_dataset.targets]
sampler = WeightedRandomSampler(weights, len(weights))

# Use the sampler in your DataLoader
balanced_loader = DataLoader(
    train_dataset, 
    batch_size=64, 
    sampler=sampler,
    num_workers=2
)

This example creates a weighted sampler that gives higher probability to samples from underrepresented classes.
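To see the balancing effect without relying on a real dataset, here is a self-contained sketch on a synthetic imbalanced dataset (the 90/10 class split and all names below are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Synthetic imbalanced dataset: 90 samples of class 0, 10 of class 1
labels = torch.cat([torch.zeros(90, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])
dataset = TensorDataset(torch.randn(100, 4), labels)

# Per-sample weight = 1 / (count of that sample's class)
class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(weights, num_samples=len(weights))

loader = DataLoader(dataset, batch_size=20, sampler=sampler)
_, batch_labels = next(iter(loader))

# Class 1 now shows up far more often than its 10% base rate
print(batch_labels.float().mean().item())
```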

Collate Functions for Custom Batching

The collate_fn parameter lets you control how individual samples are combined into batches:

def custom_collate(batch):
    # Sort batch by sequence length (descending)
    batch.sort(key=lambda x: len(x[0]), reverse=True)

    # Separate inputs and targets
    sequences, targets = zip(*batch)

    # Get lengths for packing
    lengths = [len(seq) for seq in sequences]

    # Pad sequences to same length
    padded_seqs = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True)

    return padded_seqs, torch.tensor(targets), lengths

# Use with variable-length sequence data
seq_loader = DataLoader(
    sequence_dataset,
    batch_size=32,
    collate_fn=custom_collate
)

This example shows a collate function for handling variable-length sequences, which is common in NLP tasks.
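To see the collate function in action without building a full dataset, you can call it directly on a hand-built batch of variable-length sequences:

```python
import torch

def custom_collate(batch):
    # Sort batch by sequence length (descending)
    batch.sort(key=lambda x: len(x[0]), reverse=True)
    sequences, targets = zip(*batch)
    lengths = [len(seq) for seq in sequences]
    padded_seqs = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True)
    return padded_seqs, torch.tensor(targets), lengths

# Hand-built batch: three variable-length sequences with scalar targets
batch = [
    (torch.tensor([1.0, 2.0]), 0),
    (torch.tensor([1.0, 2.0, 3.0, 4.0]), 1),
    (torch.tensor([5.0]), 0),
]
padded, targets, lengths = custom_collate(batch)
print(padded.shape)  # torch.Size([3, 4]), padded to the longest sequence
print(lengths)       # [4, 2, 1]
```

The shorter sequences are zero-padded on the right, and the recorded lengths can later be passed to pack_padded_sequence for RNN processing.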

Optimize DataLoader Performance

Now, let’s look at how to optimize DataLoader performance.

Prefetch with prefetch_factor

When using multiple workers, you can control how many samples each worker prefetches:

dataloader = DataLoader(
    dataset, 
    batch_size=32, 
    num_workers=4,
    prefetch_factor=2  # Each worker will prefetch 2 batches
)

Use Persistent Workers

Keep worker processes alive between data loading iterations:

dataloader = DataLoader(
    dataset, 
    batch_size=32, 
    num_workers=4,
    persistent_workers=True  # Keep workers alive between epochs
)

This can improve performance when training for multiple epochs by avoiding the overhead of starting and stopping workers.

DataLoader for Distributed Training

When training across multiple GPUs or machines, DataLoader can be configured for distributed training:

from torch.utils.data.distributed import DistributedSampler

# Create a sampler for distributed training
sampler = DistributedSampler(
    dataset,
    num_replicas=world_size,  # Total number of processes
    rank=rank  # Process rank
)

# Create dataloader with the distributed sampler
distributed_loader = DataLoader(
    dataset,
    batch_size=32,
    sampler=sampler,
    num_workers=4
)

This ensures that each process gets a different subset of the data, preventing duplicate processing.
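One detail worth knowing: when shuffling with a DistributedSampler, call sampler.set_epoch(epoch) at the start of each epoch so that every epoch gets a different shuffle. The sketch below simulates rank 0 of a two-process job; in real distributed training, the rank and world size come from the launcher (e.g., torchrun) after init_process_group:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100).float().unsqueeze(1))

# Simulate rank 0 of a 2-process job; passing num_replicas and rank
# explicitly avoids needing an initialized process group for this demo
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

for epoch in range(2):
    # Without set_epoch, every epoch would reuse the same shuffle order
    sampler.set_epoch(epoch)
    for (batch,) in loader:
        pass

print(len(sampler))  # 50, each replica sees half the dataset
```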

PyTorch’s DataLoader is a powerful tool that simplifies the data loading pipeline for deep learning models. It handles all the complexity of batching, shuffling, and parallel loading, allowing you to focus on building and training your models.

Whether you’re working with images, text, or time series data, DataLoader provides the flexibility and performance you need. I’ve found it to be an indispensable part of my PyTorch workflow, making data handling much more efficient and less error-prone.
