Object Detection Using Vision Transformers in Keras

Working with object detection models has always fascinated me, especially when combining the power of Vision Transformers (ViTs) with the simplicity of Keras. Over the past few years, I’ve experimented with numerous deep learning architectures, and Vision Transformers have stood out due to their remarkable ability to capture global context in images.

In this tutorial, I’ll walk you through how to build an object detection model using Vision Transformers in Keras. I’ll share complete, ready-to-run code snippets for each step so you can follow along seamlessly.

This Tutorial Covers:

What Are Vision Transformers in Python Keras?

Vision Transformers adapt the transformer architecture, originally designed for NLP, to image tasks. Unlike CNNs that focus on local features, ViTs split images into patches and apply self-attention to capture long-range dependencies.

Using Keras makes implementing these models easy and efficient, especially with TensorFlow’s support.

How to Use Vision Transformers for Object Detection in Keras

Object detection involves locating and classifying objects within an image. Vision Transformers can be adapted for this task by combining their feature extraction with detection heads.

I will show you two methods:

Method 1: Using a pre-trained Vision Transformer backbone with a custom detection head in Keras
Method 2: Building a Vision Transformer-based detection model from scratch in Keras

Method 1: Pre-trained Vision Transformer Backbone with Custom Detection Head

This method leverages a pre-trained ViT model as a feature extractor and adds a detection head for bounding box regression and classification.

Step 1: Install Required Libraries

!pip install tensorflow tensorflow-addons

Step 2: Import Libraries and Load Pre-trained ViT

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_addons as tfa

# Load pre-trained ViT model (from TensorFlow Hub or Keras Applications)
# For simplicity, we use a ViT model from keras.applications (if available)
# Otherwise, you can load from TensorFlow Hub or use a custom implementation

# Here, we simulate a ViT backbone with a simple Transformer block for demonstration

Step 3: Define Vision Transformer Backbone

def vit_backbone(input_shape=(224, 224, 3), patch_size=16, num_patches=196, projection_dim=64, num_heads=4, transformer_layers=8):
    inputs = layers.Input(shape=input_shape)

    # Create patches
    patches = layers.Conv2D(filters=projection_dim, kernel_size=patch_size, strides=patch_size)(inputs)
    patches = layers.Reshape((num_patches, projection_dim))(patches)

    # Add positional embedding
    positions = tf.range(start=0, limit=num_patches, delta=1)
    pos_embedding = layers.Embedding(input_dim=num_patches, output_dim=projection_dim)(positions)
    x = patches + pos_embedding

    # Transformer blocks
    for _ in range(transformer_layers):
        # Layer normalization
        x1 = layers.LayerNormalization(epsilon=1e-6)(x)
        # Multi-head attention
        attention_output = layers.MultiHeadAttention(num_heads=num_heads, key_dim=projection_dim)(x1, x1)
        # Skip connection
        x2 = layers.Add()([attention_output, x])

        # Feed-forward network
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        ffn = keras.Sequential([
            layers.Dense(projection_dim * 4, activation='relu'),
            layers.Dense(projection_dim),
        ])
        ffn_output = ffn(x3)
        x = layers.Add()([ffn_output, x2])

    model = keras.Model(inputs=inputs, outputs=x, name="vit_backbone")
    return model

vit_model = vit_backbone()
vit_model.summary()

Step 4: Add Detection Head

def detection_head(vit_output, num_classes=10):
    # Flatten transformer output
    x = layers.Flatten()(vit_output)
    # Fully connected layers for bounding box regression and classification
    bbox_regression = layers.Dense(4, activation='sigmoid', name='bbox')(x)  # [x_min, y_min, x_max, y_max] normalized
    class_probs = layers.Dense(num_classes, activation='softmax', name='class_probs')(x)
    return bbox_regression, class_probs

Step 5: Build Full Model

input_shape = (224, 224, 3)
num_classes = 5  # Example: detecting 5 object categories

inputs = layers.Input(shape=input_shape)
vit_features = vit_backbone(input_shape)(inputs)
bbox_output, class_output = detection_head(vit_features, num_classes)

model = keras.Model(inputs=inputs, outputs=[bbox_output, class_output])
model.summary()

Step 6: Compile the Model

model.compile(
    optimizer='adam',
    loss={
        'bbox': 'mse',
        'class_probs': 'categorical_crossentropy',
    },
    metrics={
        'bbox': 'mse',
        'class_probs': 'accuracy',
    }
)

Step 7: Prepare the Dataset and Train

You’ll need labeled data with bounding boxes and class labels. Here’s a dummy example:

import numpy as np

# Generate dummy data
X_train = np.random.rand(100, 224, 224, 3)
y_bbox = np.random.rand(100, 4)  # normalized bbox coordinates
y_class = keras.utils.to_categorical(np.random.randint(0, num_classes, 100), num_classes)

model.fit(X_train, {'bbox': y_bbox, 'class_probs': y_class}, epochs=10, batch_size=8)

I executed the above example code and added the screenshot below.

Object Detection Using Vision Transformers in Keras

Method 2: Build a Vision Transformer Object Detector from Scratch in Keras

For those who want full control, building a ViT-based detector from scratch is a rewarding experience.

Step 1: Define Patch Extraction Layer

class PatchExtractor(layers.Layer):
    def __init__(self, patch_size):
        super(PatchExtractor, self).__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1,1,1,1],
            padding='VALID'
        )
        patch_dims = patches.shape[-1]
        patches = tf.reshape(patches, [batch_size, -1, patch_dims])
        return patches

Step 2: Define Transformer Encoder Block

def transformer_encoder(inputs, num_heads, projection_dim, ff_dim):
    # Layer normalization 1
    x1 = layers.LayerNormalization(epsilon=1e-6)(inputs)
    # Multi-head attention
    attention_output = layers.MultiHeadAttention(num_heads=num_heads, key_dim=projection_dim)(x1, x1)
    # Skip connection 1
    x2 = layers.Add()([attention_output, inputs])

    # Layer normalization 2
    x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
    # Feed-forward network
    ffn = keras.Sequential([
        layers.Dense(ff_dim, activation='relu'),
        layers.Dense(projection_dim),
    ])
    ffn_output = ffn(x3)
    # Skip connection 2
    x4 = layers.Add()([ffn_output, x2])
    return x4

Step 3: Build the Vision Transformer Model

def build_vit_detector(input_shape=(224,224,3), patch_size=16, num_heads=4, projection_dim=64, ff_dim=128, transformer_layers=6, num_classes=5):
    inputs = layers.Input(shape=input_shape)
    patches = PatchExtractor(patch_size)(inputs)

    # Linear projection of patches
    x = layers.Dense(projection_dim)(patches)

    # Add positional embedding
    positions = tf.range(start=0, limit=(input_shape[0] // patch_size) ** 2, delta=1)
    pos_embedding = layers.Embedding(input_dim=(input_shape[0] // patch_size) ** 2, output_dim=projection_dim)(positions)
    x = x + pos_embedding

    # Transformer blocks
    for _ in range(transformer_layers):
        x = transformer_encoder(x, num_heads, projection_dim, ff_dim)

    # Flatten and detection heads
    x = layers.Flatten()(x)
    bbox_output = layers.Dense(4, activation='sigmoid', name='bbox')(x)
    class_output = layers.Dense(num_classes, activation='softmax', name='class_probs')(x)

    model = keras.Model(inputs=inputs, outputs=[bbox_output, class_output])
    return model

model = build_vit_detector()
model.summary()

Step 4: Compile and Train

model.compile(
    optimizer='adam',
    loss={
        'bbox': 'mse',
        'class_probs': 'categorical_crossentropy',
    },
    metrics={
        'bbox': 'mse',
        'class_probs': 'accuracy',
    }
)

# Dummy data as before
X_train = np.random.rand(100, 224, 224, 3)
y_bbox = np.random.rand(100, 4)
y_class = keras.utils.to_categorical(np.random.randint(0, 5, 100), 5)

model.fit(X_train, {'bbox': y_bbox, 'class_probs': y_class}, epochs=10, batch_size=8)

I executed the above example code and added the screenshot below.

Object Detection Using Vision Transformers Keras

Using Vision Transformers for object detection in Keras offers a fresh perspective beyond traditional CNNs. The self-attention mechanism enables models to comprehend the entire image context, thereby improving detection accuracy in complex scenes.

While these examples use dummy data, the same code can be adapted for real-world datasets like COCO or Pascal VOC by preprocessing images and annotations accordingly.

I hope this guide helps you get started with Vision Transformers in your object detection projects. If you want to explore further, consider fine-tuning pre-trained ViT weights or experimenting with hybrid CNN-ViT models for enhanced performance.

Other Python Keras articles you may also like:

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers. More about us.

enjoysharepoint.com/