Genetic Algorithms with Scikit-Learn in Python

I’ve been working with Python for over a decade, and throughout my journey, I’ve explored numerous optimization techniques. One approach that has fascinated me is the genetic algorithm, a powerful method inspired by natural selection. When combined with Scikit-Learn, it offers a unique way to optimize machine learning models beyond traditional methods.

If you’re like me and want to explore genetic algorithms in Python, this tutorial will walk you through everything you need to know. We’ll keep it simple, practical, and focused on real-world applications.

Let’s begin.

What Is a Genetic Algorithm?

At its core, a genetic algorithm (GA) mimics biological evolution. Think of it as natural selection in code form, where potential solutions to a problem evolve. Instead of manually tuning parameters or relying solely on gradient-based methods, GAs explore a population of solutions, selecting the fittest, combining them, and introducing mutations to find optimal or near-optimal results.

I’ve found GAs especially useful when the search space is complex or non-differentiable, such as feature selection or hyperparameter tuning in machine learning.
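Before wiring a GA into Scikit-Learn, it helps to see the bare loop in isolation. Here's a minimal, library-free sketch (all names are mine, purely illustrative) that evolves bit strings toward the classic "OneMax" optimum of all ones, using the same select, combine, and mutate cycle described above:

```python
import random

random.seed(42)

GENOME_LEN = 20
POP_SIZE = 30
GENERATIONS = 40

def fitness(genome):
    # OneMax: count the ones; the optimum is a genome of all ones
    return sum(genome)

def tournament(pop, k=3):
    # Selection: the fittest of k randomly sampled individuals wins
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):
    # Single-point crossover: splice the two parents at a random cut
    point = random.randint(1, GENOME_LEN - 1)
    return a[:point] + b[point:]

def mutate(genome, rate=0.05):
    # Mutation: flip each bit with a small probability
    return [1 - g if random.random() < rate else g for g in genome]

# Random starting population
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]

# Evolve: select two parents, combine them, mutate the child
for _ in range(GENERATIONS):
    population = [
        mutate(crossover(tournament(population), tournament(population)))
        for _ in range(POP_SIZE)
    ]

best = max(population, key=fitness)
print("Best fitness:", fitness(best), "out of", GENOME_LEN)
```

Swap `fitness` for any scoring function you like (a cross-validated model score, for instance) and the same loop optimizes it.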

Why Use Genetic Algorithms with Scikit-Learn?

Scikit-Learn is my go-to Python library for machine learning because of its clean API and extensive functionality. However, it doesn’t natively support genetic algorithms for optimization. Integrating GAs lets you:

  • Optimize hyperparameters more creatively than grid or random search.
  • Perform feature selection automatically.
  • Solve complex constrained optimization problems.

This combination can significantly improve model performance, especially in business scenarios like predicting customer churn in telecom or optimizing marketing campaigns.

Get Started: Installing Required Libraries

Before diving in, ensure you have Python installed (I recommend Python 3.8+). Then, install Scikit-Learn and a genetic algorithm library such as DEAP, or sklearn-genetic-opt, which integrates well with Scikit-Learn:

pip install scikit-learn deap

Or, for a more Scikit-Learn-friendly genetic algorithm wrapper (note: the package is named sklearn-genetic-opt on PyPI, but it is imported as sklearn_genetic):

pip install sklearn-genetic-opt

I prefer sklearn-genetic-opt for its seamless integration.


Method 1: Use sklearn-genetic-opt for Feature Selection

This method is easy and feels native to Scikit-Learn users.

Step 1: Import Libraries and Prepare Data

Let’s say we want a Random Forest classifier for a binary prediction task such as customer churn. For a reproducible example, we’ll use the Adult income dataset from OpenML; you can swap in your own churn CSV.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn_genetic import GAFeatureSelectionCV
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")  # Optional: to suppress warnings
# Load dataset (replace with your own CSV if needed)
data = fetch_openml(name='adult', version=2, as_frame=True)

# Select only numeric columns
X = data.data.select_dtypes(include=['float64', 'int64'])

# Set target column (income classification)
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Step 2: Set Up Genetic Algorithm Feature Selection

selector = GAFeatureSelectionCV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    cv=5,
    scoring='accuracy',
    population_size=10,     # You can set 50, but for faster testing, use 10
    generations=5,          # You can set 20, but for faster testing, use 5
    n_jobs=-1,
    verbose=True,
    keep_top_k=5
)

Step 3: Fit and Evaluate

# Fit the selector on training data
selector.fit(X_train, y_train)

# Get the names of selected features (recent versions of
# sklearn-genetic-opt expose the boolean mask as best_features_)
selected_features = X.columns[selector.best_features_]
print("✅ Selected Features:", list(selected_features))

# Make predictions using only the selected features
y_pred = selector.predict(X_test[selected_features])

# Evaluate the model
print("✅ Test Accuracy:", accuracy_score(y_test, y_pred))


This approach doesn’t tune hyperparameters; it evolves the subset of features itself, which is a big win for interpretability and often for performance. (If you want GA-driven hyperparameter search, the same library also provides a GASearchCV class.)
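Under the hood, the result boils down to a boolean mask over the columns. Here's a tiny standalone illustration (toy data, hypothetical column names) of how such a mask picks out features from a DataFrame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for X: three candidate features
X_demo = pd.DataFrame({
    "age": [25, 40, 31],
    "hours_per_week": [40, 50, 35],
    "fnlwgt": [100, 200, 150],
})

# The kind of boolean mask a GA feature selector evolves (True = keep)
mask = np.array([True, False, True])

selected = X_demo.columns[mask]
print(list(selected))          # ['age', 'fnlwgt']
print(X_demo[selected].shape)  # (3, 2)
```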


Method 2: Custom Genetic Algorithm Using DEAP

For those who want more control, I’ve used the DEAP library to build custom GAs.

Step 1: Define the Problem

Suppose you want to optimize hyperparameters like n_estimators and max_depth of a Random Forest.

Step 2: Set Up DEAP Environment

import random
from deap import base, creator, tools
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Evaluation function: DEAP expects fitness as a tuple, hence the
# trailing comma in the return value. X_train and y_train come from
# the train/test split in Method 1.
def eval_rf(individual):
    n_estimators, max_depth = individual
    clf = RandomForestClassifier(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        random_state=42
    )
    score = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy').mean()
    return score,

Step 3: Configure GA Components

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_n_estimators", random.randint, 10, 200)
toolbox.register("attr_max_depth", random.randint, 1, 30)
toolbox.register("individual", tools.initCycle, creator.Individual,
                 (toolbox.attr_n_estimators, toolbox.attr_max_depth), n=1)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", eval_rf)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutUniformInt, low=[10,1], up=[200,30], indpb=0.2)
toolbox.register("select", tools.selTournament, tournsize=3)
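If you're wondering what cxTwoPoint and mutUniformInt actually do to an individual, here's a rough, library-free sketch of the same two ideas. These are my own simplified versions for illustration, not DEAP's exact internals:

```python
import random

random.seed(0)

def two_point_crossover(ind1, ind2):
    # Swap the slice between two random cut points, in the spirit
    # of DEAP's cxTwoPoint
    a, b = sorted(random.sample(range(len(ind1)), 2))
    ind1[a:b], ind2[a:b] = ind2[a:b], ind1[a:b]
    return ind1, ind2

def uniform_int_mutation(ind, low, up, indpb):
    # Redraw each gene uniformly in [low[i], up[i]] with probability
    # indpb, in the spirit of DEAP's mutUniformInt
    for i in range(len(ind)):
        if random.random() < indpb:
            ind[i] = random.randint(low[i], up[i])
    return ind

parent1 = [50, 10]    # [n_estimators, max_depth]
parent2 = [150, 25]
child1, child2 = two_point_crossover(parent1[:], parent2[:])
mutant = uniform_int_mutation(child1[:], low=[10, 1], up=[200, 30], indpb=0.2)
print(child1, child2, mutant)
```

With only two genes, the crossover simply swaps the first gene between the parents; on longer individuals it exchanges a whole middle segment.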

Step 4: Run the Genetic Algorithm

population = toolbox.population(n=50)

# Evaluate the initial population first; tournament selection
# needs valid fitness values before the loop starts
fitnesses = map(toolbox.evaluate, population)
for ind, fit in zip(population, fitnesses):
    ind.fitness.values = fit

NGEN = 20
for gen in range(NGEN):
    offspring = toolbox.select(population, len(population))
    offspring = list(map(toolbox.clone, offspring))
    
    for child1, child2 in zip(offspring[::2], offspring[1::2]):
        if random.random() < 0.5:
            toolbox.mate(child1, child2)
            del child1.fitness.values
            del child2.fitness.values
    
    for mutant in offspring:
        if random.random() < 0.2:
            toolbox.mutate(mutant)
            del mutant.fitness.values
    
    invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
    fitnesses = map(toolbox.evaluate, invalid_ind)
    for ind, fit in zip(invalid_ind, fitnesses):
        ind.fitness.values = fit
    
    population[:] = offspring
    
best_ind = tools.selBest(population, 1)[0]
print(f"Best Parameters: n_estimators={best_ind[0]}, max_depth={best_ind[1]}")

This method requires more setup but gives you full flexibility to tailor the GA to your problem.

Tips From My Experience

  • Always start with a smaller population and fewer generations to test your GA setup.
  • Use parallel processing (n_jobs=-1 in Scikit-Learn or multiprocessing in DEAP) to speed up evaluation.
  • Monitor convergence; if the fitness stops improving, it might be time to stop early.
  • For business problems like predicting loan defaults or customer retention, GAs can uncover feature combinations that traditional methods might miss.
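To make the convergence tip concrete, here's one simple way I'd sketch an early-stopping check over a per-generation history of best fitness values (illustrative names and thresholds; tune them to your problem):

```python
def should_stop(history, patience=5, min_delta=1e-4):
    # Stop when the best fitness hasn't improved by at least
    # min_delta over the last `patience` generations
    if len(history) <= patience:
        return False
    recent_best = max(history[-patience:])
    earlier_best = max(history[:-patience])
    return recent_best - earlier_best < min_delta

# Example: fitness plateaus at 0.83, so the last 5 generations add nothing
history = [0.70, 0.78, 0.81, 0.83, 0.83, 0.83, 0.83, 0.83, 0.83]
print(should_stop(history, patience=5))  # True
```

You could call a check like this at the end of each generation in the DEAP loop and break out early when it returns True.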

Genetic algorithms are a fantastic addition to your Python toolkit, especially when paired with Scikit-Learn. Whether you choose a ready-made library like sklearn-genetic or build your own with DEAP, you’ll gain a powerful method to optimize complex models.
