Active Learning for Text Classification with Python Keras

Have you ever built a sentiment analysis model, only to realize you have thousands of reviews but zero labels? It is a common headache I have faced many times while working on large-scale Python Keras projects for retail clients.

Manually labeling 10,000 reviews is not just boring; it is a massive waste of your technical expertise and time.

This is where Active Learning comes in to save the day by picking the most “confusing” data points for you to label.

In this guide, I will show you how I use Active Learning to train a high-performing Keras model while labeling only a fraction of the data.

Set Up the Python Keras Environment for Review Classification

First, I always ensure my environment is stocked with the necessary libraries like TensorFlow and Scikit-Learn.

I prefer using the modAL framework alongside Keras because it simplifies the query logic significantly.

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from scikeras.wrappers import KerasClassifier
from modAL.models import ActiveLearner

# Setting seeds for reproducibility in my Python Keras workflow
tf.random.set_seed(42)
np.random.seed(42)

Prepare the USA Customer Review Dataset

For this example, I am using a simulated dataset of customer reviews for a popular coffee chain in the USA.

I like to split the data into a small “seed” set to start the model and a large “unlabeled” pool.

# Simulated reviews for a USA coffee shop
reviews = [
    "The latte was amazing and the staff was friendly",
    "Terrible service and my cold brew was bitter",
    "Best breakfast sandwich I have had in Seattle",
    "Wait times are ridiculous even for a simple drip coffee",
    "Loved the seasonal pumpkin spice flavors",
    "The cafe was dirty and the tables were sticky"
]
# 1 for Positive, 0 for Negative
labels = np.array([1, 0, 1, 0, 1, 0])

# Tokenizing the text for our Python Keras model
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(reviews)
X = tokenizer.texts_to_sequences(reviews)
X = pad_sequences(X, maxlen=10)

# Initial seed data (the first 2 samples)
X_initial, y_initial = X[:2], labels[:2]
# The rest is our unlabeled pool
X_pool, y_pool = X[2:], labels[2:]
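
When the seed set is drawn at random instead of being the first two rows, it helps to guarantee that every class appears in it and to remember where each pool row came from. Here is a minimal sketch in plain Python of one way to do that. The helper `split_seed_and_pool` and its parameters are hypothetical names I made up for illustration; they are not part of modAL or Keras.

```python
import random

def split_seed_and_pool(labels, per_class=1, seed=42):
    """Return (seed_indices, pool_indices) with `per_class` rows of every label in the seed."""
    rng = random.Random(seed)
    # Group the row positions by their label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    seed_indices = []
    for label in sorted(by_class):
        candidates = by_class[label]
        if len(candidates) < per_class:
            raise ValueError(f"class {label!r} has fewer than {per_class} examples")
        seed_indices.extend(rng.sample(candidates, per_class))
    # Everything that is not in the seed set becomes the unlabeled pool
    chosen = set(seed_indices)
    pool_indices = [i for i in range(len(labels)) if i not in chosen]
    return sorted(seed_indices), pool_indices

seed_idx, pool_idx = split_seed_and_pool([1, 0, 1, 0, 1, 0])
print("seed rows:", seed_idx)
print("pool rows:", pool_idx)
```

The two index lists can then select rows from the padded matrix (for example `X[seed_idx]` and `X[pool_idx]`), and `pool_idx[k]` maps the k-th pool row back to its original review text.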

Build the Keras Architecture for Classification

I usually build a simple MLP (Multi-Layer Perceptron) for text tasks when data is scarce.

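This Python Keras function defines the structure of the neural network we will use as our learner. To keep the example short, it feeds the padded token IDs straight into a Dense layer; on real review data, an Embedding layer in front of the classifier is the more common choice.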

def create_keras_model():
    model = Sequential([
        Dense(16, activation='relu', input_shape=(10,)),
        Dropout(0.2),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Wrapping the model for Scikit-Learn compatibility
classifier = KerasClassifier(model=create_keras_model, epochs=10, batch_size=2, verbose=0)

Initialize the Active Learner with Python Keras

Now, I initialize the ActiveLearner using the wrapper we just created and our small seed set.

This step establishes the baseline intelligence of our model before it starts asking for more data.

learner = ActiveLearner(
    estimator=classifier,
    X_initial=X_initial, y_initial=y_initial
)

# Checking initial performance
print(f"Initial accuracy: {learner.score(X, labels)}")

Method 1: Uncertainty Sampling Query Strategy

In my experience, Uncertainty Sampling is one of the most effective ways to find reviews that the model is unsure about. It is also the query strategy that modAL's ActiveLearner uses when you do not specify one.

# We query the learner for the most uncertain record in the pool
query_idx, query_inst = learner.query(X_pool)

# query_idx is an array of pool positions; the pool starts at review 2
pool_pos = int(query_idx[0])
print(f"The model is confused by: {reviews[pool_pos + 2]}")

# Simulate a human (me) labeling this specific review
new_label = y_pool[query_idx]  # in a real project, I type the label here

# Teaching the Python Keras model the new information
learner.teach(query_inst, new_label)

# Remove the labeled instance from the pool (features and labels together)
X_pool = np.delete(X_pool, query_idx, axis=0)
y_pool = np.delete(y_pool, query_idx, axis=0)
print("Active learning step complete.")

The model looks at the unlabeled pool and picks the review where the probability of being positive is closest to 0.5.
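
To make that selection rule concrete, here is a minimal sketch in plain Python. The function `most_uncertain` is a hypothetical helper of mine, not modAL's implementation, and the probabilities are made-up inputs rather than output from the model above.

```python
def most_uncertain(positive_probs):
    """Return the index of the probability that is closest to 0.5."""
    if not positive_probs:
        raise ValueError("the pool is empty")
    # Uncertainty is 1 minus the confidence in the predicted class
    uncertainty = [1.0 - max(p, 1.0 - p) for p in positive_probs]
    return max(range(len(uncertainty)), key=uncertainty.__getitem__)

# Made-up P(positive) values for four pool reviews
pool_probs = [0.95, 0.55, 0.10, 0.30]
print(most_uncertain(pool_probs))  # prints 1, because 0.55 is closest to 0.5
```

For a binary classifier, ranking by "1 minus the top class probability" and ranking by "distance from 0.5" pick the same review.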

Method 2: Batch Querying for Faster Training

Sometimes I don’t want to retrain the model after every single label; instead, I query a batch of five or ten.

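# Function to perform batch updates in Python Keras
def run_batch_active_learning(learner, X_pool, iterations=2, n_instances=2):
    for i in range(iterations):
        if len(X_pool) == 0:
            break  # nothing left to label
        # Ask for several uncertain reviews at once, then retrain only once
        n = min(n_instances, len(X_pool))
        query_idx, query_inst = learner.query(X_pool, n_instances=n)
        # Pool positions shift after every deletion, so match each queried row
        # against the full matrix X to recover its original position
        orig_idx = [int(np.where((X == row).all(axis=1))[0][0]) for row in query_inst]
        for j in orig_idx:
            print(f"Iteration {i+1}: The model is confused by: {reviews[j]}")
        # Providing labels for the queried instances
        learner.teach(query_inst, labels[orig_idx])
        # Update pool
        X_pool = np.delete(X_pool, query_idx, axis=0)
        print(f"Iteration {i+1} complete. Model updated.")
    return X_pool

# Two reviews per batch here because the demo pool is tiny
X_pool = run_batch_active_learning(learner, X_pool)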

This approach is much more efficient when you are working on real-world USA business datasets with tight deadlines.
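
The idea behind a batch query can be sketched in plain Python as "take the k most uncertain reviews instead of one". The helper `top_k_uncertain` is a hypothetical name of mine, and the probabilities are made-up inputs for illustration.

```python
def top_k_uncertain(positive_probs, k):
    """Return the indices of the k probabilities closest to 0.5, most uncertain first."""
    k = min(k, len(positive_probs))
    ranked = sorted(range(len(positive_probs)),
                    key=lambda i: abs(positive_probs[i] - 0.5))
    return ranked[:k]

# Made-up P(positive) values for five pool reviews
pool_probs = [0.95, 0.55, 0.10, 0.30, 0.48]
batch = top_k_uncertain(pool_probs, k=2)
print(batch)  # prints [4, 1]

# Drop the whole batch from the pool positions in one pass
chosen = set(batch)
remaining = [i for i in range(len(pool_probs)) if i not in chosen]
print(remaining)  # prints [0, 2, 3]
```

One trade-off to keep in mind: a plain top-k pick can select several near-identical reviews in the same batch, so larger batches save retraining time at the cost of some redundancy in what gets labeled.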

Evaluate the Model Performance Improvement

After a few rounds of active learning, I always compare the accuracy against the initial baseline.

It is always satisfying to see the accuracy jump significantly after only labeling a handful of difficult reviews.

final_accuracy = learner.score(X, labels)
print(f"Final accuracy after Active Learning: {final_accuracy}")
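
The comparison is easier to read when the score is recorded after every round. Below is a minimal sketch in plain Python; `accuracy` and `improvement` are hypothetical helpers of mine, and the predictions are invented inputs for illustration, not results from the model above. Note also that scoring on `X` includes the reviews the model was trained on, so a held-out test set gives a more honest number on real data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def improvement(history):
    """Difference between the latest score and the first (baseline) score."""
    return history[-1] - history[0]

# Invented held-out labels and the predictions after each labeling round
y_true = [1, 0, 1, 0]
predictions_per_round = [
    [1, 1, 1, 1],  # baseline, before any query
    [1, 0, 1, 1],  # after round 1
    [1, 0, 1, 0],  # after round 2
]
history = [accuracy(y_true, pred) for pred in predictions_per_round]
print(history)               # prints [0.5, 0.75, 1.0]
print(improvement(history))  # prints 0.5
```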

Handle Overfitting in Python Keras Active Learning

One thing I have noticed is that active learning can lead to overfitting on specific “hard” examples, because the queried reviews are not a representative sample of the data.

To counter this, I add a stronger Dropout layer and use Early Stopping within the Keras model configuration.

# Updated model with regularization
def create_robust_model():
    model = Sequential([
        Dense(32, activation='relu', input_shape=(10,)),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Stop training once the training loss stops improving (there is no validation split here)
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

# Re-wrapping the updated Python Keras model
robust_classifier = KerasClassifier(model=create_robust_model, epochs=20, verbose=0, callbacks=[early_stop])

Save the Active Learner for Production Use

Once I am happy with the performance, I save the underlying Keras model to a file for deployment.

# Accessing the Keras model from the learner
final_model = learner.estimator.model_

# Saving the model in H5 format
final_model.save("active_learning_review_model.h5")
print("Model saved successfully!")

This allows me to use the “smart” model in a production API to classify incoming USA customer feedback in real time.

In this tutorial, I showed you how to implement Review Classification using Active Learning in Python Keras.

This approach is a total game-changer when you have limited time and a massive amount of unlabeled text data.

I have found that focusing on the most uncertain samples helps build a robust model much faster than random sampling.

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers.
