Image Classification with Global Context Vision Transformer

Image classification is a fundamental task in computer vision, and with the rise of transformer architectures, the Global Context Vision Transformer (GC ViT) has emerged as a powerful alternative to traditional CNNs.

Drawing from my 4+ years of experience with Python Keras, I will guide you step-by-step on how to implement image classification using GC ViT. This tutorial is designed to be straightforward and practical, so you can apply it directly to your projects.

Table of Contents

What is Global Context Vision Transformer?

The Global Context Vision Transformer enhances the standard Vision Transformer by integrating global contextual information into the attention mechanism. This helps the model capture long-range dependencies more effectively, improving classification accuracy on complex datasets.

Set Up Your Environment for Python Keras

Before diving into the code, ensure you have the necessary libraries installed. You can use pip to install TensorFlow and other dependencies:

!pip install tensorflow numpy matplotlib

This setup is essential to run Keras models smoothly.

Prepare the Dataset for Image Classification

For this tutorial, I’ll use the CIFAR-10 dataset, a popular benchmark for image classification. It contains 60,000 32×32 color images across 10 classes.

import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical

# Load CIFAR-10 data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Normalize pixel values
x_train, x_test = x_train / 255.0, x_test / 255.0

# One-hot encode labels
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

Normalising the images and encoding labels prepares the data for training.

Build the Global Context Vision Transformer Model in Keras

Here’s a complete implementation of a simplified GC ViT model. I’ll explain each part as we go.

import tensorflow as tf
from tensorflow.keras import layers, models

def global_context_attention(x, reduction=16):
    # Compute channel-wise attention with global context
    input_shape = x.shape
    channel = input_shape[-1]

    # Global average pooling
    context = layers.GlobalAveragePooling2D()(x)
    context = layers.Reshape((1,1,channel))(context)

    # Bottleneck layers for channel attention
    context = layers.Conv2D(channel // reduction, kernel_size=1, activation='relu')(context)
    context = layers.Conv2D(channel, kernel_size=1, activation='sigmoid')(context)

    # Scale input features
    return layers.Multiply()([x, context])

def patch_embedding(input_shape, patch_size=4, embed_dim=64):
    inputs = layers.Input(shape=input_shape)
    # Extract patches
    patches = layers.Conv2D(embed_dim, kernel_size=patch_size, strides=patch_size)(inputs)
    # Flatten patches
    patches = layers.Reshape((-1, embed_dim))(patches)
    return inputs, patches

def transformer_block(x, embed_dim, num_heads, mlp_dim, dropout=0.1):
    # Multi-head self-attention
    attention_output = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(x, x)
    attention_output = layers.Dropout(dropout)(attention_output)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attention_output)

    # MLP block
    mlp_output = layers.Dense(mlp_dim, activation='relu')(x)
    mlp_output = layers.Dense(embed_dim)(mlp_output)
    mlp_output = layers.Dropout(dropout)(mlp_output)
    x = layers.LayerNormalization(epsilon=1e-6)(x + mlp_output)
    return x

def build_gc_vit(input_shape=(32,32,3), patch_size=4, embed_dim=64, num_heads=4, mlp_dim=128, num_layers=6, num_classes=10):
    inputs, patches = patch_embedding(input_shape, patch_size, embed_dim)

    x = patches
    for _ in range(num_layers):
        x = transformer_block(x, embed_dim, num_heads, mlp_dim)

    # Reshape back to spatial dimensions for global context attention
    spatial_dim = (input_shape[0] // patch_size, input_shape[1] // patch_size)
    x = layers.Reshape((spatial_dim[0], spatial_dim[1], embed_dim))(x)

    # Apply global context attention
    x = global_context_attention(x)

    # Classification head
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    model = models.Model(inputs=inputs, outputs=outputs)
    return model

# Instantiate the model
model = build_gc_vit()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

This model extracts image patches, applies transformer blocks with multi-head attention, incorporates global context attention, and outputs class probabilities.

Train the GC ViT Model with Python Keras

Training is straightforward with Keras’s .fit() method. Here’s how I train the model on CIFAR-10 data:

history = model.fit(
    x_train, y_train,
    validation_split=0.2,
    epochs=20,
    batch_size=64
)

This trains the model for 20 epochs with a batch size of 64, while validating on 20% of the training data.

Evaluate Model Performance

After training, evaluate the model on the test set to see its accuracy:

test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")

I executed the above example code and added the screenshot below.

Image Classification with Global Context Vision Transformer

This gives a clear metric of how well the model generalises to unseen data.

Tips for Improving Your GC ViT Model

Increase model depth or embed dimension: More transformer layers or larger embeddings can improve accuracy, but require more resources.
Use data augmentation: Techniques like random flips and rotations help the model generalise better.
Experiment with learning rates: Adjusting the optimiser’s learning rate can speed up convergence.

Implementing image classification with Global Context Vision Transformer in Python Keras is both exciting and practical. The model’s ability to capture global dependencies makes it a powerful choice for complex image tasks.

The code I shared is a solid foundation you can build upon. Feel free to tweak parameters and add enhancements like data augmentation or advanced optimisers.

You may also read:

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers. More about us.

enjoysharepoint.com/