Vision Transformers (ViTs) have revolutionized computer vision by leveraging the power of transformers. Traditionally, ViTs rely heavily on the attention mechanism to capture relationships within image patches. But what if we could simplify this architecture by removing the attention component altogether?
In this article, I’ll share my firsthand experience building a Vision Transformer without attention using Python Keras. This approach reduces complexity while still maintaining competitive performance on image classification tasks. I’ll walk you through the entire process with clear explanations and full code examples.
What Is a Vision Transformer Without Attention?
A Vision Transformer without attention replaces the multi-head self-attention blocks with simpler operations like convolutions or shifts to capture spatial information. This reduces computational cost and can be easier to train on smaller datasets.
From my experience, this approach is great when you want the benefits of transformer-like architectures but need faster training or have limited hardware resources.
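To build intuition for the shift idea before implementing it, here is a minimal sketch of my own (using NumPy rather than the Keras layers we write later) showing what rolling a sequence of patch features by one position does:

```python
import numpy as np

# A toy "sequence" of 4 patch features
patches = np.array([10, 20, 30, 40])

# Rolling by one position moves each feature into its neighbor's slot,
# so adding the rolled copy mixes adjacent patches without any attention
shifted = np.roll(patches, shift=1)
mixed = patches + shifted

print(shifted)  # [40 10 20 30]
print(mixed)    # [50 30 50 70]
```

This neighbor-mixing is the entire job that attention is being replaced with in this architecture.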
Step 1: Prepare the Dataset
For this tutorial, I’ll use the CIFAR-10 dataset, a popular image classification benchmark. It contains 60,000 32×32 color images in 10 classes.
Here’s how to load and preprocess the data using Python Keras:
import tensorflow as tf
from tensorflow.keras.utils import to_categorical

# Load the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Normalize pixel values to [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0

# One-hot encode the labels
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

This simple preprocessing normalizes the images and prepares the labels for classification.
Step 2: Create the Patch Embedding Layer in Python Keras
Vision Transformers work by splitting images into patches. Instead of slicing and flattening patches manually, we use a convolution layer whose stride equals the patch size to create the patch embeddings efficiently.
from tensorflow.keras.layers import Conv2D, Layer

class PatchEmbedding(Layer):
    def __init__(self, patch_size, embed_dim):
        super(PatchEmbedding, self).__init__()
        self.patch_size = patch_size
        self.embed_dim = embed_dim
        # A convolution with stride == kernel size extracts non-overlapping patches
        self.proj = Conv2D(filters=embed_dim, kernel_size=patch_size, strides=patch_size)

    def call(self, x):
        x = self.proj(x)  # Shape: (batch, height/patch_size, width/patch_size, embed_dim)
        x = tf.reshape(x, [tf.shape(x)[0], -1, self.embed_dim])  # Flatten patches into a sequence
        return x

# Example usage
patch_embed = PatchEmbedding(patch_size=4, embed_dim=64)

This layer converts the image into a sequence of patch embeddings, a critical step for transformer-style architectures.
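As a quick sanity check on the shapes (a small helper of my own, not part of the tutorial's model code): a 32×32 image split into 4×4 patches gives an 8×8 grid, i.e. a sequence of 64 tokens:

```python
def patch_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches a square image is split into."""
    grid = image_size // patch_size  # patches per side
    return grid * grid

# CIFAR-10 images with the settings above
print(patch_sequence_length(32, 4))  # 64
```

So with these settings, the patch embedding layer outputs tensors of shape (batch, 64, 64): 64 patches, each a 64-dimensional embedding.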
Step 3: Implement the Shift Operation Instead of Attention
Instead of using attention, we apply a shift operation to capture local spatial dependencies. This is a lightweight alternative inspired by ShiftViT.
from tensorflow.keras.layers import LayerNormalization, Dense, Dropout

class ShiftMLP(Layer):
    def __init__(self, embed_dim, mlp_dim, dropout_rate=0.1):
        super(ShiftMLP, self).__init__()
        self.norm1 = LayerNormalization()
        self.mlp1 = Dense(mlp_dim, activation='gelu')
        self.dropout1 = Dropout(dropout_rate)
        self.mlp2 = Dense(embed_dim)
        self.dropout2 = Dropout(dropout_rate)

    def call(self, x):
        # Shift operation: roll the tensor along the sequence (patch) dimension
        x_shifted = tf.roll(x, shift=1, axis=1)
        x = x + x_shifted  # Mix each patch with its shifted neighbor
        x_norm = self.norm1(x)
        x_mlp = self.mlp1(x_norm)
        x_mlp = self.dropout1(x_mlp)
        x_mlp = self.mlp2(x_mlp)
        x_mlp = self.dropout2(x_mlp)
        return x + x_mlp  # Residual connection

# Example usage
shift_mlp = ShiftMLP(embed_dim=64, mlp_dim=128)

This block captures relationships between patches by shifting features along the sequence dimension, replacing the attention mechanism.
Step 4: Build the Vision Transformer Without Attention Model
Now, let’s put everything together to build the full model.
from tensorflow.keras.layers import Input, GlobalAveragePooling1D
from tensorflow.keras.models import Model

def build_shift_vit(image_size=32, patch_size=4, embed_dim=64, mlp_dim=128, num_classes=10, num_blocks=4):
    inputs = Input(shape=(image_size, image_size, 3))

    # Patch embedding
    x = PatchEmbedding(patch_size, embed_dim)(inputs)

    # Stack several ShiftMLP blocks
    for _ in range(num_blocks):
        x = ShiftMLP(embed_dim, mlp_dim)(x)

    # Pooling and classification head
    x = GlobalAveragePooling1D()(x)
    outputs = Dense(num_classes, activation='softmax')(x)

    model = Model(inputs, outputs)
    return model

# Instantiate the model
model = build_shift_vit()
model.summary()

The model consists of a patch embedding layer, multiple shift-based MLP blocks, and a classification head.
Step 5: Compile and Train the Model Using Python Keras
Compile the model with an optimizer and loss function suitable for multi-class classification:
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Train the model
history = model.fit(
    x_train, y_train,
    validation_split=0.1,
    epochs=20,
    batch_size=64
)

Training will take a few minutes, depending on your hardware. You can monitor accuracy improvements with each epoch.
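To find the best epoch after training, you can inspect `history.history`, which Keras returns as a plain dict of per-epoch metric lists. Here is a small helper of my own, demonstrated on a mocked history dict since real values vary by hardware and random seed:

```python
def best_epoch(history_dict, metric='val_accuracy'):
    """Return (1-based epoch index, value) of the best validation metric."""
    values = history_dict[metric]
    idx = max(range(len(values)), key=lambda i: values[i])
    return idx + 1, values[idx]

# Example with a mocked history dict (real values come from history.history)
mock = {'val_accuracy': [0.41, 0.55, 0.62, 0.60]}
print(best_epoch(mock))  # (3, 0.62)
```

In practice you would call `best_epoch(history.history)` after `model.fit` finishes.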
Step 6: Evaluate Model Performance
After training, evaluate the model on the test set:
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test accuracy: {test_acc:.4f}')

This prints the final loss and accuracy on the held-out test set.
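Beyond raw accuracy, you will usually want human-readable predictions. CIFAR-10's class order is fixed, so mapping softmax outputs to names is a simple argmax; the sketch below (my own addition) uses a mocked probability array, where in practice you would pass the output of `model.predict(x_test)`:

```python
import numpy as np

CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']

def decode_predictions(probs):
    """Map rows of softmax probabilities to CIFAR-10 class names."""
    return [CIFAR10_CLASSES[i] for i in np.argmax(probs, axis=1)]

# Mocked softmax output for two images (replace with model.predict(x_test))
fake_probs = np.array([
    [0.05, 0.02, 0.03, 0.70, 0.05, 0.05, 0.03, 0.03, 0.02, 0.02],
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.90, 0.02],
])
print(decode_predictions(fake_probs))  # ['cat', 'ship']
```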

From my experience, this simpler architecture reaches competitive accuracy with noticeably faster training than a traditional ViT, which makes it especially useful when you have limited computational resources or smaller datasets.
The shift operation still captures spatial dependencies between neighboring patches effectively, so it remains a practical alternative for many image classification tasks using Python Keras.
Other Python Keras articles you may also like:
- Implement Few-Shot Learning with Reptile in Keras
- Semi-Supervised Image Classification with Contrastive Pretraining Using SimCLR in Keras
- Image Classification with Swin Transformers in Keras
- Train a Vision Transformer on Small Datasets Using Keras

I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working on Python, machine learning, and artificial intelligence for the last 5 years. During this time I have gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, Scikit-Learn, and more, working with clients in the United States, Canada, the United Kingdom, Australia, New Zealand, and elsewhere. Check out my profile.