Load and Preprocess Datasets with TensorFlow

Data is the foundation of every machine learning project. Before any model can learn, we must carefully prepare the dataset: loading it efficiently, cleaning it, scaling it, and sometimes augmenting it to improve model generalization.

In TensorFlow, managing data pipelines is not just a side task; it directly affects training speed, memory usage, and final accuracy.

This tutorial walks through loading and preprocessing datasets with TensorFlow. We will cover built-in datasets, custom dataset handling, the tf.data API, and preprocessing techniques for images, text, and structured tabular data.

Data Handling in TensorFlow

TensorFlow provides robust tools to handle datasets efficiently. The key concept is the tf.data.Dataset API, which represents a sequence of elements (tensors). Instead of manually feeding data with NumPy arrays, you can build pipelines that load files, normalize values, apply augmentations, shuffle, batch, and prefetch data across CPU and GPU/TPU devices.

Why use TensorFlow’s data pipelines?

  • They speed up training through parallelism and asynchronous prefetching.
  • They handle large datasets that do not fit into memory.
  • They integrate seamlessly with distributed training and GPUs.

Load Datasets with TensorFlow

Let's start with the different ways to load datasets in TensorFlow.

Built-in Keras Datasets

TensorFlow’s tf.keras.datasets module includes popular datasets like MNIST, CIFAR-10, and IMDB. These are great for quick experiments.

Example: Loading MNIST handwritten digits:

import tensorflow as tf
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

Running this prints the array shapes:

(60000, 28, 28) (60000,) (10000, 28, 28) (10000,)

load_data() returns the dataset already split into a training set of 60,000 examples and a test set of 10,000 examples.

TensorFlow Datasets (TFDS)

For more complex datasets, use TensorFlow Datasets (TFDS). It provides a wide collection of ready-to-use datasets, both small and large.

import tensorflow_datasets as tfds

dataset, info = tfds.load("cifar10", with_info=True, as_supervised=True)
train_ds, test_ds = dataset["train"], dataset["test"]

TFDS datasets include labels, splits, and metadata for easier handling.

Load Custom Image Datasets

For your own image folder, use image_dataset_from_directory:

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",
    image_size=(128, 128),
    batch_size=32
)

The directory should be organized like:

data/
   train/
      dogs/
      cats/
   test/
      dogs/
      cats/

Load Text Datasets

Text data can be read line by line using:

text_ds = tf.data.TextLineDataset("dataset.txt")
for line in text_ds.take(3):
    print(line.numpy())

Each element is a scalar string tensor, so .numpy() returns the raw bytes of each line.

For CSV files, the tf.data.experimental.make_csv_dataset utility builds a batched, labeled dataset in one call:

csv_ds = tf.data.experimental.make_csv_dataset(
    "data.csv",
    batch_size=32,
    label_name="label"
)

Work with the tf.data API

The tf.data API lets you build pipelines with transformations applied step by step.

A Simple Pipeline

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shuffle(10000).batch(32).prefetch(tf.data.AUTOTUNE)

This dataset can now be passed directly to model.fit().

Common Operations

  • map(): Apply a function to each element.
  • shuffle(): Randomize order.
  • batch(): Group examples into batches.
  • repeat(): Iterate indefinitely.
  • prefetch(): Fetch items in the background to avoid GPU starvation.
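To build intuition for what shuffle() and batch() do, here is a plain-Python sketch of their semantics. This is a simplification: the real tf.data implementation streams tensors asynchronously, and buffer_shuffle and batch below are illustrative names, not TensorFlow APIs.

```python
import random

def buffer_shuffle(items, buffer_size, seed=None):
    """Approximate tf.data's shuffle: keep a buffer of `buffer_size`
    elements and emit a randomly chosen one as each new item arrives."""
    rng = random.Random(seed)
    buffer, out = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            out.append(buffer.pop(rng.randrange(len(buffer))))
    while buffer:  # drain the remaining buffered elements
        out.append(buffer.pop(rng.randrange(len(buffer))))
    return out

def batch(items, batch_size):
    """Group consecutive elements, keeping a smaller final batch
    (tf.data's default drop_remainder=False behavior)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

data = list(range(10))
shuffled = buffer_shuffle(data, buffer_size=4, seed=0)
print(batch(shuffled, batch_size=3))  # four batches: 3 + 3 + 3 + 1 elements
```

Note that a small shuffle buffer only randomizes locally, which is why the buffer_size argument matters for datasets stored in sorted order.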

Performance Tips

  • Use num_parallel_calls=tf.data.AUTOTUNE inside map() to parallelize transformations.
  • cache() small datasets after the first pass so later epochs read from memory.
  • For many large files, use interleave() to read them in parallel.

Preprocess Images

Image preprocessing is essential since deep learning models expect a consistent input format.

Scaling and Normalization

Most models work best when input pixel values are between 0 and 1:

normalization_layer = tf.keras.layers.Rescaling(1./255)
train_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))

Resize Images

Resize images to a consistent shape:

train_ds = train_ds.map(
    lambda x, y: (tf.image.resize(x, [128, 128]), y)
)

Data Augmentation

Augmentation creates variations, improving generalization:

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

train_ds = train_ds.map(lambda x, y: (data_augmentation(x, training=True), y))

Example Pipeline

# batch_size=None yields individual images, since we batch later in the pipeline
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(128, 128), batch_size=None
)
normalization_layer = tf.keras.layers.Rescaling(1./255)
train_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
train_ds = train_ds.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)

Preprocess Text Data

Natural Language Processing (NLP) requires converting text into a numeric format.

Tokenization

Use TextVectorization:

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10000,
    output_mode="int",
    output_sequence_length=200
)

text_ds = tf.data.Dataset.from_tensor_slices(["This is an example", "TensorFlow is powerful"])
vectorizer.adapt(text_ds)
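Under the hood, TextVectorization standardizes the text (by default lowercasing it and stripping punctuation), splits it on whitespace, maps each token to an integer index from the adapted vocabulary, and pads or truncates to output_sequence_length. Here is a plain-Python sketch of those steps; simple_vectorize and the tiny vocabulary are illustrative, not TensorFlow APIs.

```python
import re

def simple_vectorize(text, vocab, seq_len):
    """Mimic TextVectorization: lowercase, strip punctuation, split on
    whitespace, map tokens to indices, then pad/truncate to seq_len."""
    tokens = re.sub(r"[^\w\s]", "", text.lower()).split()
    # Index 0 is reserved for padding, 1 for unknown tokens ("[UNK]"),
    # matching TextVectorization's defaults.
    ids = [vocab.get(tok, 1) for tok in tokens]
    ids = ids[:seq_len]                      # truncate long sequences
    return ids + [0] * (seq_len - len(ids))  # pad short sequences

vocab = {"tensorflow": 2, "is": 3, "powerful": 4, "this": 5, "an": 6, "example": 7}
print(simple_vectorize("TensorFlow is powerful!", vocab, seq_len=6))
# → [2, 3, 4, 0, 0, 0]
```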

Padding Sequences

When sequences have different lengths, padding makes them uniform arrays. TextVectorization handles this automatically: with output_sequence_length set, shorter sequences are padded with zeros and longer ones are truncated.
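For variable-length data built outside TextVectorization, tf.data also offers padded_batch, which pads every sequence in a batch to the length of the longest one. A plain-Python sketch of that per-batch behavior (pad_batch is an illustrative name, not a TensorFlow API):

```python
def pad_batch(sequences, pad_value=0):
    """Pad each sequence in the batch to the batch's maximum length,
    mirroring what tf.data.Dataset.padded_batch does per batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

print(pad_batch([[4, 7], [1, 2, 3, 9], [5]]))
# → [[4, 7, 0, 0], [1, 2, 3, 9], [5, 0, 0, 0]]
```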

Embed Layer

Instead of raw integers, embeddings map tokens into dense vectors:

embedding = tf.keras.layers.Embedding(input_dim=10000, output_dim=128)

These pieces typically fit together in a sentiment model: vectorize the raw reviews, embed the token IDs, then classify the pooled embeddings. The IMDB dataset in tf.keras.datasets is a convenient place to try this end to end.

Preprocess Structured Data

Many machine learning projects rely on tabular data.

Load CSV Data

dataset = tf.data.experimental.make_csv_dataset(
    "titanic.csv", batch_size=32, label_name="Survived"
)

Normalization

Numeric columns should be standardized to zero mean and unit variance:

normalizer = tf.keras.layers.Normalization()
# numeric_values: a NumPy array (or dataset) of the numeric feature columns
normalizer.adapt(numeric_values)
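adapt() computes the mean and variance of the data once, and the layer then applies (x - mean) / sqrt(variance) to every input. A plain-Python sketch of that computation (standardize is an illustrative name, not a TensorFlow API):

```python
import math

def standardize(values):
    """Zero-mean, unit-variance scaling, as Normalization applies
    after adapt() has computed the statistics."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance) for v in values]

scaled = standardize([10.0, 20.0, 30.0, 40.0])
print(scaled)  # values symmetric around 0
```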

Handle Categorical Features

Convert string categories into indices using:

lookup = tf.keras.layers.StringLookup(output_mode="int")

Then one-hot encode:

encoder = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size())
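In plain Python, StringLookup followed by CategoryEncoding amounts to: build a vocabulary, map each string to an index, then mark that index in a zero vector. A sketch of that combination (the function and vocabulary here are illustrative, not TensorFlow APIs):

```python
def one_hot_encode(categories, vocab):
    """Map strings to indices, reserving index 0 for out-of-vocabulary
    values (StringLookup's default), then one-hot encode each index."""
    num_tokens = len(vocab) + 1  # +1 for the OOV bucket
    index = {cat: i + 1 for i, cat in enumerate(vocab)}
    vectors = []
    for cat in categories:
        vec = [0] * num_tokens
        vec[index.get(cat, 0)] = 1  # unknown categories fall into slot 0
        vectors.append(vec)
    return vectors

print(one_hot_encode(["cat", "dog", "bird"], vocab=["cat", "dog"]))
# → [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```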

Combine Features

Combine everything:

combined = tf.keras.layers.Concatenate()([normalized_numeric, encoded_categorical])

Integrate Preprocessing into Model Training

There are two ways to integrate preprocessing:

  1. Inside the Model using Keras preprocessing layers. This is portable; you save the preprocessing with the model.
  2. Outside the Model using tf.data. This may improve training speed for big datasets.

Example with preprocessing layers inside the model:

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1./255, input_shape=(128, 128, 3)),
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.Conv2D(32, (3,3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10)
])

Case Study: Image Classification with CIFAR-10

Let’s create a complete workflow:

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()

train_ds = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
train_ds = train_ds.map(lambda x,y: (tf.image.resize(x, [128,128])/255.0, y))
train_ds = train_ds.shuffle(1000).batch(64).prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3,3), activation="relu", input_shape=(128,128,3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, (3,3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10)
])

model.compile(
    optimizer="adam",
    # The final Dense layer outputs raw logits (no softmax), so tell the loss
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(train_ds, epochs=5)

This showcases a full preprocessing and training pipeline.

Debug Dataset Pipelines

When working with datasets, common issues include:

  • Shape mismatches between images and labels.
  • Data type errors when mapping.
  • Infinite loops caused by .repeat() with no limit.

Debugging tips:

  • Print shapes by iterating once: for images, labels in dataset.take(1): print(images.shape, labels.shape).
  • Visualize samples with Matplotlib:
import matplotlib.pyplot as plt
for images, labels in train_ds.take(1):
    plt.imshow(images[0].numpy().astype("uint8"))
    plt.show()

Best Practices for Large-Scale Training

  • Cache small datasets to memory after the first pass.
  • Use prefetching with AUTOTUNE.
  • Store big datasets as TFRecord files, which are more efficient for distributed training.
  • For multi-GPU or TPU environments, shard datasets across different devices.
  • Enable mixed precision for faster training without loss in accuracy.

Conclusion

Efficient dataset pipelines are essential for building real-world machine learning models. With TensorFlow, you have the tools to load images, text, and structured data easily. By combining tf.data, preprocessing layers, and best practices like caching and prefetching, you can create scalable and fast training pipelines.

To recap:

  • Use built-in datasets for quick prototyping.
  • Create pipelines with tf.data.Dataset.
  • Preprocess images with rescaling, resizing, and augmentation.
  • Preprocess text with tokenization, padding, and embeddings.
  • Handle structured data with normalization and categorical encoding.
  • Integrate preprocessing either inside or outside the model, depending on requirements.

By mastering dataset loading and preprocessing in TensorFlow, you unlock the full potential of your training pipeline and set the foundation for higher accuracy and faster training.
