Near-Duplicate Image Search in Python Keras

Over the years, I have managed massive datasets of product images for e-commerce platforms.

One of the biggest headaches I faced was dealing with nearly identical images, photos taken from slightly different angles or with different lighting.

In this tutorial, I will show you exactly how I solved this using Python Keras to efficiently identify these near-duplicates.

Table of Contents

The Problem with Near-Duplicate Images

When dealing with thousands of images, simple file-name checks or file-size comparisons are insufficient.

A near-duplicate is an image that looks almost exactly like another but might have been resized, cropped, or slightly color-corrected.

Method 1: Feature Extraction using Pre-trained VGG16 in Python Keras

I prefer using the VGG16 model because it is incredibly reliable for extracting “embeddings” or numerical representations of images.

By converting an image into a vector of numbers, we can mathematically calculate how similar one photo is to another.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

# Load the VGG16 model pre-trained on ImageNet
base_model = VGG16(weights='imagenet')
# We remove the final classification layer to get the feature vector
model = Model(inputs=base_model.input, outputs=base_model.get_layer('fc1').output)

def extract_features(img_path):
    # Loading image and resizing to 224x224 for VGG16
    img = image.load_img(img_path, target_size=(224, 224))
    img_data = image.img_to_array(img)
    img_data = np.expand_dims(img_data, axis=0)
    img_data = preprocess_input(img_data)
    
    # Extracting the 4096-dimensional feature vector
    feature = model.predict(img_data)
    return feature.flatten()

# Example: Extracting features from a photo of a Mustang in a dealership
features_1 = extract_features('ford_mustang_front.jpg')
print(features_1.shape)

I executed the above example code and added the screenshot below.

Method 2: Calculate Cosine Similarity for Python Keras Embeddings

Once I have the feature vectors, I need a way to compare them to see how “close” they are in a multi-dimensional space.

I use Cosine Similarity because it measures the angle between two vectors, making it perfect for identifying images with the same content.

from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(feat1, feat2):
    # Reshaping for sklearn compatibility
    feat1 = feat1.reshape(1, -1)
    feat2 = feat2.reshape(1, -1)
    
    # Returning a score between 0 and 1
    return cosine_similarity(feat1, feat2)[0][0]

# Comparing a front view and a slightly angled view of a California beach house
img1_features = extract_features('beach_house_1.jpg')
img2_features = extract_features('beach_house_2.jpg')

similarity_score = get_similarity(img1_features, img2_features)
print(f"Similarity Score: {similarity_score}")

if similarity_score > 0.95:
    print("These are near-duplicate images.")

I executed the above example code and added the screenshot below.

Method 3: Build a Bulk Near-Duplicate Detector with Python Keras

In my experience, you usually aren’t just comparing two images, but rather cleaning up a whole directory of photos.

This script loops through a folder, extracts features for every image, and flags duplicates based on a specific threshold.

import os

def find_all_duplicates(image_folder, threshold=0.98):
    image_files = [os.path.join(image_folder, f) for f in os.listdir(image_folder) if f.endswith('.jpg')]
    features_list = []
    duplicates = []

    # Extract features for all images first
    for img_path in image_files:
        features_list.append((img_path, extract_features(img_path)))

    # Compare each image with every other image
    for i in range(len(features_list)):
        for j in range(i + 1, len(features_list)):
            score = get_similarity(features_list[i][1], features_list[j][1])
            if score > threshold:
                duplicates.append((features_list[i][0], features_list[j][0], score))
                
    return duplicates

# Running this on a local dataset of Grand Canyon tourist photos
found_duplicates = find_all_duplicates('./my_travel_photos/')
for d in found_duplicates:
    print(f"Duplicate found: {d[0]} and {d[1]} with score {d[2]}")

Method 4: Use Global Average Pooling for Faster Python Keras Processing

If you are working with a massive dataset (like 50,000+ images), the previous method might be a bit slow.

I use Global Average Pooling (GAP) layers instead of Fully Connected layers to reduce the vector size, which speeds up comparisons significantly.

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D

# Using ResNet50 for more modern and efficient feature extraction
base_resnet = ResNet50(weights='imagenet', include_top=False)
gap_layer = GlobalAveragePooling2D()(base_resnet.output)
fast_model = Model(inputs=base_resnet.input, outputs=gap_layer)

def extract_features_fast(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    
    # This returns a smaller 2048-length vector
    return fast_model.predict(x).flatten()

# Faster extraction for a batch of images from a Seattle tech conference
fast_feat = extract_features_fast('conference_room.jpg')
print(fast_feat.shape)

I executed the above example code and added the screenshot below.

Near-Duplicate Image Search Python Keras

Method 5: Visualize Results of Python Keras Near-Duplicate Search

When I build these tools for clients, they often want to see the duplicates side-by-side to verify the AI is working.

Using Matplotlib, we can create a simple visualizer that displays the pair of images flagged as duplicates.

import matplotlib.pyplot as plt

def plot_duplicates(img_path1, img_path2, score):
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    
    img1 = image.load_img(img_path1)
    img2 = image.load_img(img_path2)
    
    axes[0].imshow(img1)
    axes[0].set_title("Original Image")
    
    axes[1].imshow(img2)
    axes[1].set_title(f"Duplicate (Score: {score:.4f})")
    
    plt.show()

# Visualize a detected duplicate pair from a real estate listing in Texas
plot_duplicates('house_front.jpg', 'house_front_filtered.jpg', 0.992)

In this tutorial, I showed you how to use Python Keras to build a robust near-duplicate image search system. I covered everything from feature extraction with VGG16 to high-speed processing with ResNet50.

You may also read:

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers. More about us.

enjoysharepoint.com/