Data is the foundation of every machine learning project. Before any model can learn, we must carefully prepare the dataset: loading it efficiently, cleaning it, scaling it, and sometimes augmenting it to improve model generalization.
In TensorFlow, managing data pipelines is not just a side task; it directly affects training speed, memory usage, and final accuracy.
This tutorial will guide you through the process of loading and preprocessing datasets with TensorFlow. We will explore built-in datasets, custom dataset handling, the tf.data API, and preprocessing techniques for images, text, and structured tabular data.
Data Handling in Python TensorFlow
TensorFlow provides robust tools to handle datasets efficiently. The key concept is the tf.data.Dataset API, which represents a sequence of elements (tensors). Instead of manually feeding data with NumPy arrays, you can build pipelines that load files, normalize values, apply augmentations, shuffle, batch, and prefetch data across CPU and GPU/TPU devices.
Why use TensorFlow’s data pipelines?
- They speed up training through parallelism and asynchronous prefetching.
- They handle large datasets that do not fit into memory.
- They integrate seamlessly with distributed training and GPUs.
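As a minimal sketch of these ideas, here is a toy pipeline over in-memory tensors (the data is random and purely illustrative):

```python
import tensorflow as tf

# Toy dataset: six feature vectors with labels (illustrative values only).
features = tf.random.uniform((6, 4))
labels = tf.constant([0, 1, 0, 1, 0, 1])

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=6)        # randomize element order
    .batch(2)                      # group into batches of 2
    .prefetch(tf.data.AUTOTUNE)    # overlap preprocessing with training
)

for x, y in dataset:
    print(x.shape, y.shape)        # each batch: (2, 4) features, (2,) labels
```

The same chained pattern (shuffle, batch, prefetch) scales unchanged from toy tensors to file-backed datasets.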
Load Datasets with TensorFlow
Let's walk through the main ways to load datasets with TensorFlow.
Built-in Keras Datasets
TensorFlow’s tf.keras.datasets module includes popular datasets like MNIST, CIFAR-10, and IMDB. These are great for quick experiments.
Example: Loading MNIST handwritten digits:
import tensorflow as tf
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

This prints (60000, 28, 28) (60000,) (10000, 28, 28) (10000,): the data is automatically split into training and testing sets.
TensorFlow Datasets (TFDS)
For more complex datasets, use TensorFlow Datasets (TFDS). It provides a wide collection of ready-to-use datasets, both small and large.
import tensorflow_datasets as tfds
dataset, info = tfds.load("cifar10", with_info=True, as_supervised=True)
train_ds, test_ds = dataset["train"], dataset["test"]

TFDS datasets include labels, splits, and metadata for easier handling.
Load Custom Image Datasets
For your own image folder, use image_dataset_from_directory:
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",
    image_size=(128, 128),
    batch_size=32
)

The directory should be organized like:

data/
  train/
    dogs/
    cats/
  test/
    dogs/
    cats/

Load Text Datasets
Text data can be read line by line using:
text_ds = tf.data.TextLineDataset("dataset.txt")
for line in text_ds.take(3):
    print(line.numpy())

This prints the first three lines of the file as byte strings.

For large CSV files, there’s a convenient function:
csv_ds = tf.data.experimental.make_csv_dataset(
    "data.csv",
    batch_size=32,
    label_name="label"
)

Work with the tf.data API
The tf.data API lets you build pipelines with transformations applied step by step.
A Simple Pipeline
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shuffle(10000).batch(32).prefetch(tf.data.AUTOTUNE)

This dataset can now be passed directly to model.fit().
Common Operations
- map(): apply a function to each element.
- shuffle(): randomize the element order.
- batch(): group examples into batches.
- repeat(): iterate indefinitely.
- prefetch(): fetch items in the background to avoid GPU starvation.
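The following toy sketch (illustrative values only) shows map(), repeat(), and batch() in action:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(4)              # elements: 0, 1, 2, 3

# map(): transform each element (here, squaring it).
squared = ds.map(lambda x: x * x)
print(list(squared.as_numpy_iterator()))   # [0, 1, 4, 9]

# repeat(2): two passes over the data; batch(4): group into batches of 4.
repeated = ds.repeat(2).batch(4)
for batch in repeated:
    print(batch.numpy())                   # [0 1 2 3], printed twice
```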
Performance Tips
- Use num_parallel_calls=tf.data.AUTOTUNE inside map().
- Cache datasets after the first pass for smaller datasets.
- When reading from multiple large files, use interleave() to load them in parallel.
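As a sketch of the last two tips, the snippet below creates two tiny text shards itself (the file names and contents are made up for illustration), then interleaves and caches them:

```python
import pathlib
import tempfile
import tensorflow as tf

# Create two small text shards so the example is self-contained.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "shard-0.txt").write_text("a\nb\n")
(tmp / "shard-1.txt").write_text("c\nd\n")

files = tf.data.Dataset.list_files(str(tmp / "shard-*.txt"), shuffle=False)

# interleave(): pull lines from several files concurrently.
lines = files.interleave(
    tf.data.TextLineDataset,
    cycle_length=2,                        # read 2 files at a time
    num_parallel_calls=tf.data.AUTOTUNE,
)

# cache() keeps decoded elements in memory after the first pass (small data only).
lines = lines.cache().prefetch(tf.data.AUTOTUNE)

print(sorted(line.numpy().decode() for line in lines))  # ['a', 'b', 'c', 'd']
```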
Preprocess Images
Image preprocessing is essential since deep learning models expect a consistent input format.
Scaling and Normalization
Most models work best when input pixel values are between 0 and 1:
normalization_layer = tf.keras.layers.Rescaling(1./255)
train_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))

Resize Images
Resize images to a consistent shape:
train_ds = train_ds.map(
    lambda x, y: (tf.image.resize(x, [128, 128]), y)
)

Data Augmentation
Augmentation creates variations, improving generalization:
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])
train_ds = train_ds.map(lambda x, y: (data_augmentation(x, training=True), y))

Example Pipeline
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",
    image_size=(128, 128),
    batch_size=None  # leave unbatched here; batching happens later in the pipeline
)
train_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
train_ds = train_ds.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)

Note: image_dataset_from_directory batches with batch_size=32 by default, so pass batch_size=None when you batch later in the pipeline; otherwise .batch(32) would batch the data twice.

Preprocess Text Data
Natural Language Processing (NLP) requires converting text into a numeric format.
Tokenization
Use TextVectorization:
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10000,
    output_mode="int",
    output_sequence_length=200
)
text_ds = tf.data.Dataset.from_tensor_slices(["This is an example", "TensorFlow is powerful"])
vectorizer.adapt(text_ds)

Padding Sequences
When dealing with sequences of different lengths, padding ensures uniform arrays. The TextVectorization layer pads (and truncates) automatically when output_sequence_length is set.
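To see the padding directly, here is a small sketch (vocabulary size and sequence length are illustrative) that vectorizes two strings of different lengths:

```python
import tensorflow as tf

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=100,
    output_mode="int",
    output_sequence_length=6,      # shorter texts are padded with 0s
)
vectorizer.adapt(["This is an example", "TensorFlow is powerful"])

# Both outputs have length 6 regardless of input length; 0 is the pad token.
batch = vectorizer(["TensorFlow is powerful", "example"])
print(batch.shape)                 # (2, 6)
```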
Embedding Layer
Instead of raw integers, embeddings map tokens into dense vectors:
embedding = tf.keras.layers.Embedding(input_dim=10000, output_dim=128)

Example: IMDB Sentiment Analysis
- Load IMDB dataset via keras.datasets.
- Apply text vectorization.
- Feed into an Embedding layer + LSTM/GRU.
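A minimal sketch of these steps might look as follows. Note that Keras's IMDB data arrives already tokenized as integer indices, so padding stands in for the vectorization step; the dummy arrays below replace the real download so the sketch runs offline, and all layer sizes are illustrative:

```python
import numpy as np
import tensorflow as tf

# Real data (downloads on first use):
# (x_train, y_train), _ = tf.keras.datasets.imdb.load_data(num_words=10000)
# Dummy tokenized reviews stand in here so the sketch runs offline.
x_train = np.random.randint(1, 10000, size=(8, 300))
y_train = np.random.randint(0, 2, size=(8,))

# Pad/truncate every review to a fixed length of 200 tokens.
x_train = tf.keras.utils.pad_sequences(x_train, maxlen=200)

# Embedding + LSTM classifier (illustrative sizes).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=128),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
print(model(x_train[:2]).shape)    # (2, 1): one sentiment score per review
# model.fit(x_train, y_train, epochs=3, batch_size=32)
```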
Preprocess Structured Data
Many machine learning projects rely on tabular data.
Load CSV Data
dataset = tf.data.experimental.make_csv_dataset(
    "titanic.csv", batch_size=32, label_name="Survived"
)

Normalization
Numeric features should be standardized so they have zero mean and unit variance.
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(numeric_values)
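Since numeric_values above is a placeholder, here is a self-contained sketch with made-up values showing what adapt() learns:

```python
import numpy as np
import tensorflow as tf

# Illustrative numeric column (e.g. passenger ages); any (n, 1) array works.
numeric_values = np.array([[22.0], [38.0], [26.0], [35.0]], dtype="float32")

normalizer = tf.keras.layers.Normalization()
normalizer.adapt(numeric_values)           # learns the column's mean and variance

# After adaptation, the output is standardized: mean ~0, std ~1.
standardized = normalizer(numeric_values)
print(float(tf.reduce_mean(standardized)))
```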
Handle Categorical Features
Convert string categories into indices using:
lookup = tf.keras.layers.StringLookup(output_mode="int")

Then one-hot encode:

encoder = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size())

Combine Features
Combine everything:
combined = tf.keras.layers.Concatenate()([normalized_numeric, encoded_categorical])

Integrate Preprocessing into Model Training
There are two ways to integrate preprocessing:
- Inside the Model using Keras preprocessing layers. This is portable; you save the preprocessing with the model.
- Outside the Model using tf.data. This may improve training speed for big datasets.
Example with preprocessing layers inside the model:
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1./255, input_shape=(128, 128, 3)),
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10)
])

Case Study: Image Classification with CIFAR-10
Let’s create a complete workflow:
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()
train_ds = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
train_ds = train_ds.map(lambda x, y: (tf.image.resize(x, [128, 128]) / 255.0, y))
train_ds = train_ds.shuffle(1000).batch(64).prefetch(tf.data.AUTOTUNE)
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer="adam",
    # from_logits=True because the final Dense layer has no softmax
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)
model.fit(train_ds, epochs=5)

This showcases a full preprocessing and training pipeline.
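A natural follow-up is to evaluate on the test split, built with the same pipeline but without shuffling. The sketch below uses a small stand-in model and random data so it runs offline; in practice you would reuse the model, test_images, and test_labels from the case study:

```python
import numpy as np
import tensorflow as tf

# Stand-ins for the case study's model and CIFAR-10 test split (offline dummy data).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 128, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
test_images = np.random.randint(0, 256, size=(8, 32, 32, 3)).astype("float32")
test_labels = np.random.randint(0, 10, size=(8, 1))

# Build the test pipeline the same way as train_ds, but without shuffling.
test_ds = tf.data.Dataset.from_tensor_slices((test_images, test_labels))
test_ds = test_ds.map(lambda x, y: (tf.image.resize(x, [128, 128]) / 255.0, y))
test_ds = test_ds.batch(64).prefetch(tf.data.AUTOTUNE)

loss, accuracy = model.evaluate(test_ds, verbose=0)
print(f"Test accuracy: {accuracy:.3f}")
```

Skipping shuffle on the test set keeps evaluation deterministic; the preprocessing (resize, rescale) must match training exactly.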
Debug Dataset Pipelines
When working with datasets, common issues include:
- Shape mismatches between images and labels.
- Data type errors when mapping.
- Infinite loops caused by .repeat() with no limit.
Debugging tips:
- Print shapes with a quick loop: for images, labels in dataset.take(1): print(images.shape, labels.shape).
- Visualize samples with Matplotlib:
import matplotlib.pyplot as plt
for images, labels in train_ds.take(1):
    plt.imshow(images[0].numpy().astype("uint8"))
    plt.show()

Best Practices for Large-Scale Training
- Cache small datasets to memory after the first pass.
- Use prefetching with tf.data.AUTOTUNE.
- Store big datasets as TFRecord files, which are more efficient for distributed training.
- For multi-GPU or TPU environments, shard datasets across different devices.
- Enable mixed precision for faster training, usually with little or no loss in accuracy.
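As a minimal sketch of the TFRecord tip, the snippet below writes three integer records and reads them back (the file path and the "value" feature name are illustrative):

```python
import os
import tempfile
import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "data.tfrecord")

# Write: serialize each record as a tf.train.Example protocol buffer.
with tf.io.TFRecordWriter(path) as writer:
    for value in [1, 2, 3]:
        example = tf.train.Example(features=tf.train.Features(feature={
            "value": tf.train.Feature(int64_list=tf.train.Int64List(value=[value])),
        }))
        writer.write(example.SerializeToString())

# Read: parse each serialized record back with a feature description.
feature_spec = {"value": tf.io.FixedLenFeature([], tf.int64)}
ds = tf.data.TFRecordDataset(path).map(
    lambda record: tf.io.parse_single_example(record, feature_spec))

print([int(r["value"]) for r in ds])   # [1, 2, 3]
```

Real pipelines typically store images or full feature rows per Example and shard the data across many TFRecord files.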
Conclusion
Efficient dataset pipelines are essential for building real-world machine learning models. With TensorFlow, you have the tools to load images, text, and structured data easily. By combining tf.data, preprocessing layers, and best practices like caching and prefetching, you can create scalable and fast training pipelines.
To recap:
- Use built-in datasets for quick prototyping.
- Create pipelines with tf.data.Dataset.
- Preprocess images with rescaling, resizing, and augmentation.
- Preprocess text with tokenization, padding, and embeddings.
- Handle structured data with normalization and categorical encoding.
- Integrate preprocessing either inside or outside the model, depending on requirements.
By mastering dataset loading and preprocessing in TensorFlow, you unlock the full potential of your training pipeline and set the foundation for higher accuracy and faster training.
You may read:
- Build an Artificial Neural Network in Tensorflow
- Training a Neural Network in TensorFlow
- Tensorflow Gradient Descent in Neural Network
- Tensorflow Activation Functions

I am Bijay Kumar, a Microsoft MVP in SharePoint. Besides SharePoint, I have been working with Python, machine learning, and artificial intelligence for the last 5 years. During this time, I have gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, Scikit-Learn, etc., for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, and more. Check out my profile.