Over the years, I have managed massive datasets of product images for e-commerce platforms.
One of the biggest headaches I faced was dealing with nearly identical images, photos taken from slightly different angles or with different lighting.
In this tutorial, I will show you exactly how I solved this using Python Keras to efficiently identify these near-duplicates.
The Problem with Near-Duplicate Images
When dealing with thousands of images, simple file-name checks or file-size comparisons are insufficient.
A near-duplicate is an image that looks almost exactly like another but might have been resized, cropped, or slightly color-corrected.
Method 1: Feature Extraction using Pre-trained VGG16 in Python Keras
I prefer using the VGG16 model because it is incredibly reliable for extracting “embeddings” or numerical representations of images.
By converting an image into a vector of numbers, we can mathematically calculate how similar one photo is to another.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model
# Load the VGG16 model pre-trained on ImageNet
base_model = VGG16(weights='imagenet')
# We remove the final classification layer to get the feature vector
model = Model(inputs=base_model.input, outputs=base_model.get_layer('fc1').output)
def extract_features(img_path):
# Loading image and resizing to 224x224 for VGG16
img = image.load_img(img_path, target_size=(224, 224))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)
img_data = preprocess_input(img_data)
# Extracting the 4096-dimensional feature vector
feature = model.predict(img_data)
return feature.flatten()
# Example: Extracting features from a photo of a Mustang in a dealership
features_1 = extract_features('ford_mustang_front.jpg')
print(features_1.shape)I executed the above example code and added the screenshot below.

Method 2: Calculate Cosine Similarity for Python Keras Embeddings
Once I have the feature vectors, I need a way to compare them to see how “close” they are in a multi-dimensional space.
I use Cosine Similarity because it measures the angle between two vectors, making it perfect for identifying images with the same content.
from sklearn.metrics.pairwise import cosine_similarity
def get_similarity(feat1, feat2):
# Reshaping for sklearn compatibility
feat1 = feat1.reshape(1, -1)
feat2 = feat2.reshape(1, -1)
# Returning a score between 0 and 1
return cosine_similarity(feat1, feat2)[0][0]
# Comparing a front view and a slightly angled view of a California beach house
img1_features = extract_features('beach_house_1.jpg')
img2_features = extract_features('beach_house_2.jpg')
similarity_score = get_similarity(img1_features, img2_features)
print(f"Similarity Score: {similarity_score}")
if similarity_score > 0.95:
print("These are near-duplicate images.")I executed the above example code and added the screenshot below.

Method 3: Build a Bulk Near-Duplicate Detector with Python Keras
In my experience, you usually aren’t just comparing two images, but rather cleaning up a whole directory of photos.
This script loops through a folder, extracts features for every image, and flags duplicates based on a specific threshold.
import os
def find_all_duplicates(image_folder, threshold=0.98):
image_files = [os.path.join(image_folder, f) for f in os.listdir(image_folder) if f.endswith('.jpg')]
features_list = []
duplicates = []
# Extract features for all images first
for img_path in image_files:
features_list.append((img_path, extract_features(img_path)))
# Compare each image with every other image
for i in range(len(features_list)):
for j in range(i + 1, len(features_list)):
score = get_similarity(features_list[i][1], features_list[j][1])
if score > threshold:
duplicates.append((features_list[i][0], features_list[j][0], score))
return duplicates
# Running this on a local dataset of Grand Canyon tourist photos
found_duplicates = find_all_duplicates('./my_travel_photos/')
for d in found_duplicates:
print(f"Duplicate found: {d[0]} and {d[1]} with score {d[2]}")Method 4: Use Global Average Pooling for Faster Python Keras Processing
If you are working with a massive dataset (like 50,000+ images), the previous method might be a bit slow.
I use Global Average Pooling (GAP) layers instead of Fully Connected layers to reduce the vector size, which speeds up comparisons significantly.
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D
# Using ResNet50 for more modern and efficient feature extraction
base_resnet = ResNet50(weights='imagenet', include_top=False)
gap_layer = GlobalAveragePooling2D()(base_resnet.output)
fast_model = Model(inputs=base_resnet.input, outputs=gap_layer)
def extract_features_fast(img_path):
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
# This returns a smaller 2048-length vector
return fast_model.predict(x).flatten()
# Faster extraction for a batch of images from a Seattle tech conference
fast_feat = extract_features_fast('conference_room.jpg')
print(fast_feat.shape)I executed the above example code and added the screenshot below.

Method 5: Visualize Results of Python Keras Near-Duplicate Search
When I build these tools for clients, they often want to see the duplicates side-by-side to verify the AI is working.
Using Matplotlib, we can create a simple visualizer that displays the pair of images flagged as duplicates.
import matplotlib.pyplot as plt
def plot_duplicates(img_path1, img_path2, score):
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
img1 = image.load_img(img_path1)
img2 = image.load_img(img_path2)
axes[0].imshow(img1)
axes[0].set_title("Original Image")
axes[1].imshow(img2)
axes[1].set_title(f"Duplicate (Score: {score:.4f})")
plt.show()
# Visualize a detected duplicate pair from a real estate listing in Texas
plot_duplicates('house_front.jpg', 'house_front_filtered.jpg', 0.992)In this tutorial, I showed you how to use Python Keras to build a robust near-duplicate image search system. I covered everything from feature extraction with VGG16 to high-speed processing with ResNet50.
You may also read:
- Ways to Visualize Convolutional Neural Network Filters in Keras
- Keras Model Predictions with Integrated Gradients
- Explore Vision Transformer (ViT) Representations in Keras
- Keras Grad-CAM Class Activation Maps

I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I started working on Python, Machine learning, and artificial intelligence for the last 5 years. During this time I got expertise in various Python libraries also like Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, Tensorflow, Scipy, Scikit-Learn, etc… for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, etc. Check out my profile.