Vision Transformers (ViTs) have revolutionized computer vision by leveraging the power of transformers. Traditionally, ViTs rely heavily on the attention mechanism to capture relationships within image patches. But what if we could simplify this architecture by removing the attention component altogether?
In this article, I’ll share my firsthand experience building a Vision Transformer without attention using Python Keras. This approach reduces complexity while still maintaining competitive performance on image classification tasks. I’ll walk you through the entire process with clear explanations and full code examples.
What Is a Vision Transformer Without Attention?
A Vision Transformer without attention replaces the multi-head self-attention blocks with simpler operations like convolutions or shifts to capture spatial information. This reduces computational cost and can be easier to train on smaller datasets.
From my experience, this approach is great when you want the benefits of transformer-like architectures but need faster training or have limited hardware resources.
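To build intuition for the shift idea before implementing it, here is a minimal sketch of my own (using NumPy rather than the Keras layers we write later) showing what rolling a sequence of patch features by one position does:

```python
import numpy as np

# A toy "sequence" of 4 patch features
patches = np.array([10, 20, 30, 40])

# Rolling by one position moves each feature into its neighbor's slot,
# so adding the rolled copy mixes adjacent patches without any attention
shifted = np.roll(patches, shift=1)
mixed = patches + shifted

print(shifted)  # [40 10 20 30]
print(mixed)    # [50 30 50 70]
```

This neighbor-mixing is the entire job that attention is being replaced with in this architecture.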
Step 1: Prepare the Dataset
For this tutorial, I’ll use the CIFAR-10 dataset, a popular image classification benchmark. It contains 60,000 32×32 color images in 10 classes.
Here’s how to load and preprocess the data using Python Keras:
import tensorflow as tf
from tensorflow.keras.utils import to_categorical

# Load the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Normalize pixel values to [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0

# One-hot encode the labels
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

This simple preprocessing normalizes the images and prepares the labels for classification.
Step 2: Create the Patch Embedding Layer in Python Keras
Vision Transformers work by splitting images into patches. Instead of slicing and flattening patches manually, we use a convolution layer whose stride equals the patch size to create the patch embeddings efficiently.
from tensorflow.keras.layers import Conv2D, Layer

class PatchEmbedding(Layer):
    def __init__(self, patch_size, embed_dim):
        super(PatchEmbedding, self).__init__()
        self.patch_size = patch_size
        self.embed_dim = embed_dim
        # A convolution with stride == kernel size extracts non-overlapping patches
        self.proj = Conv2D(filters=embed_dim, kernel_size=patch_size, strides=patch_size)

    def call(self, x):
        x = self.proj(x)  # Shape: (batch, height/patch_size, width/patch_size, embed_dim)
        x = tf.reshape(x, [tf.shape(x)[0], -1, self.embed_dim])  # Flatten patches into a sequence
        return x

# Example usage
patch_embed = PatchEmbedding(patch_size=4, embed_dim=64)

This layer converts the image into a sequence of patch embeddings, a critical step for transformer-style architectures.
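As a quick sanity check on the shapes (a small helper of my own, not part of the tutorial's model code): a 32×32 image split into 4×4 patches gives an 8×8 grid, i.e. a sequence of 64 tokens:

```python
def patch_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches a square image is split into."""
    grid = image_size // patch_size  # patches per side
    return grid * grid

# CIFAR-10 images with the settings above
print(patch_sequence_length(32, 4))  # 64
```

So with these settings, the patch embedding layer outputs tensors of shape (batch, 64, 64): 64 patches, each a 64-dimensional embedding.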
Step 3: Implement the Shift Operation Instead of Attention
Instead of using attention, we apply a shift operation to capture local spatial dependencies. This is a lightweight alternative inspired by ShiftViT.
from tensorflow.keras.layers import LayerNormalization, Dense, Dropout

class ShiftMLP(Layer):
    def __init__(self, embed_dim, mlp_dim, dropout_rate=0.1):
        super(ShiftMLP, self).__init__()
        self.norm1 = LayerNormalization()
        self.mlp1 = Dense(mlp_dim, activation='gelu')
        self.dropout1 = Dropout(dropout_rate)
        self.mlp2 = Dense(embed_dim)
        self.dropout2 = Dropout(dropout_rate)

    def call(self, x):
        # Shift operation: roll the tensor along the sequence (patch) dimension
        x_shifted = tf.roll(x, shift=1, axis=1)
        x = x + x_shifted  # Mix each patch with its shifted neighbor
        x_norm = self.norm1(x)
        x_mlp = self.mlp1(x_norm)
        x_mlp = self.dropout1(x_mlp)
        x_mlp = self.mlp2(x_mlp)
        x_mlp = self.dropout2(x_mlp)
        return x + x_mlp  # Residual connection

# Example usage
shift_mlp = ShiftMLP(embed_dim=64, mlp_dim=128)

This block captures relationships between patches by shifting features along the sequence dimension, replacing the attention mechanism.
Step 4: Build the Vision Transformer Without Attention Model
Now, let’s put everything together to build the full model.
from tensorflow.keras.layers import Input, GlobalAveragePooling1D
from tensorflow.keras.models import Model

def build_shift_vit(image_size=32, patch_size=4, embed_dim=64, mlp_dim=128, num_classes=10, num_blocks=4):
    inputs = Input(shape=(image_size, image_size, 3))

    # Patch embedding
    x = PatchEmbedding(patch_size, embed_dim)(inputs)

    # Stack several ShiftMLP blocks
    for _ in range(num_blocks):
        x = ShiftMLP(embed_dim, mlp_dim)(x)

    # Pooling and classification head
    x = GlobalAveragePooling1D()(x)
    outputs = Dense(num_classes, activation='softmax')(x)

    model = Model(inputs, outputs)
    return model

# Instantiate the model
model = build_shift_vit()
model.summary()

The model consists of a patch embedding layer, multiple shift-based MLP blocks, and a classification head.
Step 5: Compile and Train the Model Using Python Keras
Compile the model with an optimizer and loss function suitable for multi-class classification:
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Train the model
history = model.fit(
    x_train, y_train,
    validation_split=0.1,
    epochs=20,
    batch_size=64
)

Training will take a few minutes, depending on your hardware. You can monitor accuracy improvements with each epoch.
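To find the best epoch after training, you can inspect `history.history`, which Keras returns as a plain dict of per-epoch metric lists. Here is a small helper of my own, demonstrated on a mocked history dict since real values vary by hardware and random seed:

```python
def best_epoch(history_dict, metric='val_accuracy'):
    """Return (1-based epoch index, value) of the best validation metric."""
    values = history_dict[metric]
    idx = max(range(len(values)), key=lambda i: values[i])
    return idx + 1, values[idx]

# Example with a mocked history dict (real values come from history.history)
mock = {'val_accuracy': [0.41, 0.55, 0.62, 0.60]}
print(best_epoch(mock))  # (3, 0.62)
```

In practice you would call `best_epoch(history.history)` after `model.fit` finishes.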
Step 6: Evaluate Model Performance
After training, evaluate the model on the test set:
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test accuracy: {test_acc:.4f}')

This prints the final loss and accuracy on the held-out test set.
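Beyond raw accuracy, you will usually want human-readable predictions. CIFAR-10's class order is fixed, so mapping softmax outputs to names is a simple argmax; the sketch below (my own addition) uses a mocked probability array, where in practice you would pass the output of `model.predict(x_test)`:

```python
import numpy as np

CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']

def decode_predictions(probs):
    """Map rows of softmax probabilities to CIFAR-10 class names."""
    return [CIFAR10_CLASSES[i] for i in np.argmax(probs, axis=1)]

# Mocked softmax output for two images (replace with model.predict(x_test))
fake_probs = np.array([
    [0.05, 0.02, 0.03, 0.70, 0.05, 0.05, 0.03, 0.03, 0.02, 0.02],
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.90, 0.02],
])
print(decode_predictions(fake_probs))  # ['cat', 'ship']
```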

From my experience, this simpler architecture reaches competitive accuracy with noticeably faster training than a traditional ViT, which makes it especially useful when you have limited computational resources or smaller datasets.
The shift operation still captures spatial dependencies between neighboring patches effectively, so it remains a practical alternative for many image classification tasks using Python Keras.
Other Python Keras articles you may also like:
- Implement Few-Shot Learning with Reptile in Keras
- Semi-Supervised Image Classification with Contrastive Pretraining Using SimCLR in Keras
- Image Classification with Swin Transformers in Keras
- Train a Vision Transformer on Small Datasets Using Keras

I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working on Python, machine learning, and artificial intelligence for the last 5 years. During this time I have gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, Scikit-Learn, and more, working with clients in the United States, Canada, the United Kingdom, Australia, New Zealand, and elsewhere. Check out my profile.