PyTorch Model Eval: Evaluate Your Models

Recently, I was working on a deep learning project where I needed to evaluate a PyTorch model’s performance on a test dataset. I realized that many beginners don’t fully understand the importance of putting a model in evaluation mode before testing it.

In this article, I will guide you through everything you need to know about evaluating models in PyTorch, including common issues and best practices I’ve learned over more than 10 years as a Python developer.

Let’s get started!

PyTorch Model Eval

When you’re training deep learning models with PyTorch, you switch between two modes: training and evaluation. The model.eval() method is how you tell PyTorch that you’re ready to evaluate your model rather than train it.

This might seem like a small detail, but it dramatically affects how your model behaves and the results you’ll get.
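
Dropout is the clearest illustration of the difference. In training mode it randomly zeroes activations; in evaluation mode it is the identity function. Here is a minimal sketch (the tensor sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(2, 4)

drop.train()    # training mode: roughly half the values are zeroed,
print(drop(x))  # and the survivors are scaled by 1/(1-p) = 2.0

drop.eval()     # evaluation mode: dropout becomes a no-op
print(drop(x))  # the input passes through unchanged
```

BatchNorm layers change behavior the same way: in eval mode they use the running statistics accumulated during training instead of the current batch statistics.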

How to Use model.eval() in PyTorch

Let’s look at the basic pattern for properly evaluating a model:

import torch
import torch.nn as nn

# Define a simple model
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 10)
)

# Put the model in evaluation mode
model.eval()

# Disable gradient computation for inference
with torch.no_grad():
    # Your evaluation code here
    predictions = model(test_data)

This pattern ensures that:

  1. Your model’s layers behave correctly for evaluation
  2. PyTorch doesn’t waste memory tracking gradients

Method 1: Use model.eval() with torch.no_grad()

The most efficient way to evaluate a PyTorch model is to combine model.eval() with the torch.no_grad() context manager:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Dummy test dataset (100 samples, 10 features, 3 classes)
X_test = torch.randn(100, 10)
y_test = torch.randint(0, 3, (100,))
test_dataset = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=16)

# Dummy model for classification
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 3)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Initialize model and set to evaluation mode
model = SimpleModel()
model.eval()

# Evaluate the model without tracking gradients
total_correct = 0
total_samples = 0

with torch.no_grad():
    for images, labels in test_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total_samples += labels.size(0)
        total_correct += (predicted == labels).sum().item()

accuracy = 100 * total_correct / total_samples
print(f'Test Accuracy: {accuracy:.2f}%')

Output:

Test Accuracy: 29.00%

I use this pattern in all my production code because it’s both memory-efficient and prevents accidental model updates.

Method 2: Use a Model Evaluation Function

For more complex evaluations, I like to create a dedicated evaluation function:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Dummy test dataset
X_test = torch.randn(100, 10)
y_test = torch.randint(0, 3, (100,))
test_dataset = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=16)

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 3)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Define evaluation function
def evaluate_model(model, data_loader, device):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, targets in data_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            total += targets.size(0)
            correct += (predicted == targets).sum().item()

    accuracy = 100 * correct / total
    return accuracy

# Prepare model and device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleModel().to(device)

# Run evaluation
accuracy = evaluate_model(model, test_loader, device)
print(f"Model accuracy: {accuracy:.2f}%")

Output:

Model accuracy: 31.00%


This approach keeps your evaluation code organized and reusable across different models.

Method 3: Evaluate with Different Metrics

Different applications require different evaluation metrics. Here’s how to evaluate a model for a classification task with multiple metrics:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Dummy test dataset
X_test = torch.randn(100, 10)
y_test = torch.randint(0, 3, (100,))
test_dataset = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=16)

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 3)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Evaluation function with multiple metrics
def evaluate_with_metrics(model, data_loader, device):
    model.eval()
    all_targets = []
    all_predictions = []

    with torch.no_grad():
        for inputs, targets in data_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            _, predictions = torch.max(outputs, 1)
            all_targets.extend(targets.cpu().numpy())
            all_predictions.extend(predictions.cpu().numpy())

    # Calculate metrics
    accuracy = accuracy_score(all_targets, all_predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        all_targets, all_predictions, average='weighted'
    )

    return {
        'accuracy': accuracy * 100,
        'precision': precision * 100,
        'recall': recall * 100,
        'f1': f1 * 100
    }

# Setup model and device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleModel().to(device)

# Evaluate the model
metrics = evaluate_with_metrics(model, test_loader, device)
print("Evaluation Metrics:")
for key, value in metrics.items():
    print(f"{key.capitalize()}: {value:.2f}%")

Output:

Evaluation Metrics:
Accuracy: 38.00%
Precision: 38.96%
Recall: 38.00%
F1: 32.69%


I’ve used this approach for many customer churn prediction models for US telecom companies, where having multiple metrics gives better insight into model performance.
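
When aggregate metrics aren’t enough, a confusion matrix shows exactly which classes the model mixes up. Below is a minimal sketch that builds one with plain PyTorch; the random labels here merely stand in for real targets and predictions, so the counts are illustrative only:

```python
import torch

torch.manual_seed(0)
num_classes = 3
y_true = torch.randint(0, num_classes, (100,))  # stand-in for real targets
y_pred = torch.randint(0, num_classes, (100,))  # stand-in for model predictions

# Rows are true classes, columns are predicted classes
cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)  # the diagonal holds the correctly classified counts
```

You could equally use sklearn’s confusion_matrix on the collected lists from Method 3; the torch version just avoids the extra round trip.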

Common Issues When Using model.eval()

Through my years of developing PyTorch models, I’ve encountered several common mistakes:

1. Forgetting to switch back to training mode

After evaluation, if you continue training, you need to call model.train() to switch back to training mode:

# Evaluate model
model.eval()
with torch.no_grad():
    ...  # your evaluation code here

# Switch back to training mode
model.train()
# Continue training

2. Not using torch.no_grad()

Even with model.eval(), PyTorch still tracks operations and builds the autograd graph by default. Always use torch.no_grad() during evaluation to save memory and compute:

model.eval()
# Wrong: Not using torch.no_grad()
predictions = model(test_data)  # Still builds the autograd graph!

# Correct:
with torch.no_grad():
    predictions = model(test_data)  # No gradients computed
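
A quick way to confirm the difference is to inspect the output tensor’s grad_fn: inside torch.no_grad() it is None, so no graph is kept alive. A minimal sketch with an arbitrary linear layer:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model.eval()
x = torch.randn(1, 4)

out = model(x)
print(out.grad_fn)   # an autograd node -- the graph is being built

with torch.no_grad():
    out = model(x)
print(out.grad_fn)   # None -- nothing is tracked
```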

3. Applying model.eval() to only part of the model

If you’re using a complex model with multiple components, make sure to call eval() on the entire model:

# Wrong: Only setting part of the model to eval mode
model.feature_extractor.eval()

# Correct: Set the entire model to eval mode
model.eval()  # This recursively sets all submodules to eval mode
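
You can verify the recursive behavior by checking the training flag on every submodule (a minimal sketch with arbitrary layers, including a nested Sequential):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8, 8),
    nn.Dropout(0.5),
    nn.Sequential(nn.Linear(8, 2), nn.ReLU()),  # nested submodule
)

model.eval()
print(all(not m.training for m in model.modules()))  # True

model.train()
print(all(m.training for m in model.modules()))      # True
```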

Practical Example: Evaluate a CNN on CIFAR-10

Let’s put everything together with a real-world example. Here’s how I recently evaluated a CNN model on the CIFAR-10 dataset:

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# Load CIFAR-10 test dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

testset = torchvision.datasets.CIFAR10(root='./data', train=False, 
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=100, 
                                         shuffle=False, num_workers=2)

# Load your trained model
class CNN(nn.Module):
    # (Your CNN model definition here)
    pass

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = CNN()
model.load_state_dict(torch.load('cifar_cnn.pth', map_location=device))
model = model.to(device)

# Evaluate the model
model.eval()
class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))

with torch.no_grad():
    for data in testloader:
        images, labels = data[0].to(device), data[1].to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        c = (predicted == labels).squeeze()
        for i in range(labels.size(0)):
            label = labels[i]
            class_correct[label] += c[i].item()
            class_total[label] += 1

# Print per-class accuracy
classes = ('plane', 'car', 'bird', 'cat', 'deer', 
           'dog', 'frog', 'horse', 'ship', 'truck')
for i in range(10):
    print(f'Accuracy of {classes[i]}: {100 * class_correct[i] / class_total[i]:.2f}%')

This evaluation not only gives us the overall accuracy but also provides insights into how well our model performs on each class, which is critical for real-world applications.

Evaluate Models in Production

When deploying models in production environments, consistent evaluation becomes even more critical. Here’s the approach I use for production systems:

def production_inference(model, input_data, device):
    # Always ensure model is in eval mode
    model.eval()

    # Convert input to appropriate format
    if not isinstance(input_data, torch.Tensor):
        input_data = torch.tensor(input_data, dtype=torch.float32)

    # Move to the right device
    input_data = input_data.to(device)

    # Perform inference without gradient calculation
    with torch.no_grad():
        output = model(input_data)

    # Post-process the raw output (process_output is your own
    # application-specific helper, e.g. softmax + argmax)
    processed_output = process_output(output)

    return processed_output

This pattern ensures consistent behavior and avoids common issues like forgetting to set evaluation mode or unnecessary gradient computation.
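
On recent PyTorch releases (1.9 and later), torch.inference_mode() is a drop-in alternative to torch.no_grad() that disables a bit more autograd bookkeeping and can be slightly faster for pure inference; the pattern is otherwise identical. A minimal sketch with an arbitrary linear model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)
model.eval()
x = torch.randn(5, 10)

with torch.inference_mode():
    output = model(x)

print(output.requires_grad)  # False -- no autograd state is recorded
```

Note that tensors created under inference_mode cannot later be used in autograd, so stick with torch.no_grad() if you need the outputs in a training graph.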

I hope you found this guide helpful for properly evaluating your PyTorch models. Remember, the key is to always set your model to evaluation mode with model.eval() and use torch.no_grad() before running inference. These simple steps can make a big difference in both the accuracy and efficiency of your model evaluation.
