Throughout my more than ten years as a Python developer, I have worked with various optimization algorithms for deep learning models. Among these, Adam has consistently been one of my preferred choices due to its efficiency and reliability.
Adam (short for Adaptive Moment Estimation) combines the best aspects of other popular optimizers like AdaGrad and RMSProp. It’s particularly well-suited for problems with noisy or sparse gradients.
In this tutorial, I will show you how to use the Adam optimizer in PyTorch with practical examples. You’ll learn when to use it, how to configure its parameters, and see real-world applications.
Adam Optimizer
Adam is an adaptive learning rate optimization algorithm designed specifically for training deep neural networks. It was introduced in 2014 by Diederik Kingma and Jimmy Ba in their paper “Adam: A Method for Stochastic Optimization.”
The algorithm calculates adaptive learning rates for each parameter by storing both:
- First moment (mean) of past gradients
- Second moment (uncentered variance) of past gradients
This approach offers several advantages:
- Requires minimal memory
- Works well with large datasets and parameters
- Appropriate for non-stationary objectives
- Handles noisy gradients effectively
- Suitable for problems with sparse gradients
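The two moment estimates drive a simple update rule. As an illustration (this is a plain-Python sketch of the textbook equations, not PyTorch's optimized implementation), one Adam step for a single scalar parameter looks like this:

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative only)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero-initialized moments
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return param, m, v

# Minimize f(x) = x**2 starting from x = 1.0; the gradient is 2x
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    p, m, v = adam_step(p, 2 * p, m, v, t, lr=0.1)
# p ends up close to the minimum at 0
```

Dividing by the square root of the second moment is what gives each parameter its own effective step size: parameters with consistently large gradients take smaller steps, and vice versa.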
Implement Adam Optimizer in PyTorch
PyTorch makes it incredibly simple to use Adam. Let’s start with the basic implementation:
import torch
import torch.nn as nn
import torch.optim as optim
# Define a simple model
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 1)
)
# Create Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Define loss function (e.g., Mean Squared Error for regression)
loss_function = nn.MSELoss()
# Dummy input and target data
# Let's assume 32 samples (batch size), each with 10 features
input_data = torch.randn(32, 10) # shape: [32, 10]
target = torch.randn(32, 1) # shape: [32, 1]
# Training loop (simplified)
for epoch in range(100):
    # Forward pass
    output = model(input_data)
    loss = loss_function(output, target)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print loss every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/100], Loss: {loss.item():.4f}")

This basic implementation uses Adam with default parameters, which work well for many tasks.
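One practical detail worth knowing: Adam keeps per-parameter running averages as internal state, so when you checkpoint a model mid-training you should save the optimizer's state_dict alongside the model's. A minimal sketch (the tiny model and the in-memory buffer are stand-ins; in practice you would pass a file path to torch.save):

```python
import io
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# One training step so Adam has per-parameter state (exp_avg, exp_avg_sq)
loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()

# Save model and optimizer together (a buffer stands in for a checkpoint file)
buffer = io.BytesIO()
torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, buffer)

# Restore into fresh instances so training can resume exactly where it left off
buffer.seek(0)
checkpoint = torch.load(buffer)
model2 = nn.Linear(10, 1)
optimizer2 = optim.Adam(model2.parameters(), lr=0.001)
model2.load_state_dict(checkpoint["model"])
optimizer2.load_state_dict(checkpoint["optimizer"])
```

If you restore only the model weights, Adam restarts with zeroed moment estimates, which can cause a visible loss spike when training resumes.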
Understand Adam’s Parameters
Adam has several parameters that can be adjusted for different scenarios:
optimizer = optim.Adam(
    model.parameters(),
    lr=0.001,            # Learning rate
    betas=(0.9, 0.999),  # Coefficients for running averages of the gradient and its square
    eps=1e-8,            # Term added to the denominator for numerical stability
    weight_decay=0,      # L2 penalty (regularization)
    amsgrad=False        # Whether to use the AMSGrad variant
)
Learning Rate (lr)
The learning rate determines the step size at each iteration. I typically start with:
- 0.001 for most problems
- 0.0001 for more complex models or sensitive problems
# For standard problems
optimizer = optim.Adam(model.parameters(), lr=0.001)
# For complex problems
optimizer = optim.Adam(model.parameters(), lr=0.0001)
Beta Parameters (betas)
The beta parameters control the exponential moving averages:
- First beta: Controls the exponential decay rate for the first moment estimates
- Second beta: Controls the exponential decay rate for the second moment estimates
The default values (0.9, 0.999) work well for most cases.
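All of these hyperparameters can also be set per parameter group, which is handy when different parts of a network need different settings (for example, a lower learning rate for pretrained layers). The two-part model and its layer names below are made up for the illustration:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical model with two distinct parts, purely for demonstration
class TwoPartNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(10, 5)
        self.classifier = nn.Linear(5, 1)

    def forward(self, x):
        return self.classifier(self.features(x))

model = TwoPartNet()

# Each dict is a parameter group; keys set here override the defaults below
optimizer = optim.Adam(
    [
        {"params": model.features.parameters(), "lr": 1e-4},
        {"params": model.classifier.parameters()},  # inherits lr=1e-3
    ],
    lr=1e-3,
    betas=(0.9, 0.999),
)
```

Anything not specified in a group (here, betas and the classifier's lr) falls back to the keyword defaults passed to the constructor.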
Epsilon (eps)
This small constant prevents division by zero. The default value of 1e-8 is suitable for most applications.
Weight Decay
Weight decay implements L2 regularization to prevent overfitting:
# Adding weight decay for regularization
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
I often use values between 1e-6 and 1e-4, depending on the dataset size and model complexity.
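Note that Adam's weight_decay folds the L2 penalty into the gradient before the adaptive scaling. PyTorch also ships optim.AdamW, which decouples the decay from the adaptive update and is often the preferred way to regularize Adam-style optimizers; a minimal sketch:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)

# AdamW shrinks the weights directly each step instead of adding
# the penalty to the gradient, so the decay is not rescaled by the
# per-parameter adaptive step size
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-2)

loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()
```

With decoupled decay, larger weight_decay values (around 1e-2) are commonly used than with Adam's coupled L2 penalty.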
Real-World Example: Image Classification with CIFAR-10
Let’s implement a convolutional neural network for classifying CIFAR-10 images using Adam:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
# Load and normalize CIFAR10
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
# Define a CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(32 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 32 * 8 * 8)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x
# Initialize the model, loss function, and optimizer
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
# Training loop
for epoch in range(5):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 200 == 199:
            print(f'[{epoch + 1}, {i + 1}] loss: {running_loss / 200:.3f}')
            running_loss = 0.0

print('Finished Training')
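After training you will usually want to measure accuracy on held-out data. The sketch below shows the evaluation pattern; a stand-in model and random tensors replace SimpleCNN and a real CIFAR-10 test loader so it runs without downloading anything, and you would swap those in for a real measurement:

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-in model with the same input/output shapes as the article's SimpleCNN
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)

model.eval()                      # switch layers like dropout/batchnorm to eval mode
correct, total = 0, 0
with torch.no_grad():             # no gradient bookkeeping needed during evaluation
    for _ in range(3):            # stands in for iterating a test DataLoader
        images = torch.randn(8, 3, 32, 32).to(device)
        labels = torch.randint(0, 10, (8,)).to(device)
        predicted = model(images).argmax(dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Accuracy: {accuracy:.1f}%")
```

Remember to call model.train() again before resuming training.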

Compare Adam with Other Optimizers
In my experience, Adam often outperforms other optimizers for deep learning tasks. Let’s compare Adam with SGD on the same problem:
# Define models and criteria
model_adam = SimpleCNN().to(device)
model_sgd = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
# Define optimizers
optimizer_adam = optim.Adam(model_adam.parameters(), lr=0.001)
optimizer_sgd = optim.SGD(model_sgd.parameters(), lr=0.01, momentum=0.9)
# Training functions (simplified for demonstration)
def train_model(model, optimizer, epochs=5):
    losses = []
    for epoch in range(epochs):
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if i % 200 == 199:
                losses.append(running_loss / 200)
                running_loss = 0.0
    return losses

# Train both models
adam_losses = train_model(model_adam, optimizer_adam)
sgd_losses = train_model(model_sgd, optimizer_sgd)

# Now you can compare the loss curves
# Adam typically converges faster and reaches a lower loss in fewer epochs
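If you want a quick, self-contained way to run the same comparison without downloading CIFAR-10, a toy regression problem works too. Everything below (the synthetic data, the final_loss helper, and the learning rates) is made up for the demonstration:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Synthetic linear-regression data: y = X @ true_w + noise
torch.manual_seed(0)
X = torch.randn(256, 10)
true_w = torch.randn(10, 1)
y = X @ true_w + 0.1 * torch.randn(256, 1)

def final_loss(optimizer_cls, steps=500, **kwargs):
    torch.manual_seed(0)               # identical initialization for a fair comparison
    model = nn.Linear(10, 1)
    opt = optimizer_cls(model.parameters(), **kwargs)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

adam_loss = final_loss(optim.Adam, lr=0.01)
sgd_loss = final_loss(optim.SGD, lr=0.01, momentum=0.9)
```

On a simple convex problem like this both optimizers reach a low loss; the differences show up more on noisy, high-dimensional deep learning objectives.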
Learning Rate Scheduling with Adam
For complex tasks, you might want to adjust the learning rate during training:
# Define the model and optimizer
model = SimpleCNN().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Define a learning rate scheduler
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5
)
# Training loop with learning rate scheduling
for epoch in range(20):
    train_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    # Calculate average loss for the epoch
    avg_train_loss = train_loss / len(trainloader)

    # Step the scheduler with the metric it monitors
    scheduler.step(avg_train_loss)
    print(f'Epoch {epoch+1}, Loss: {avg_train_loss:.4f}, LR: {optimizer.param_groups[0]["lr"]:.6f}')
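You can check the scheduler's behavior in isolation by feeding it a metric that stops improving: once the metric fails to improve for more than patience epochs, the learning rate is multiplied by factor. A minimal sketch with a dummy model and a deliberately stuck metric:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)                     # dummy model just to own the parameters
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=2
)

# Pretend the validation loss is stuck at 1.0 for six epochs;
# after patience (2) non-improving epochs, the lr drops by factor (0.1)
for epoch in range(6):
    scheduler.step(1.0)

current_lr = optimizer.param_groups[0]["lr"]
```

The same pattern works with any metric you track, such as validation accuracy with mode='max'.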
When to Use Adam Optimizer
Over my years of experience, I’ve found Adam performs exceptionally well in these scenarios:
- Training deep neural networks with many parameters
- Working with sparse data or sparse gradients
- Dealing with noisy gradients
- Training natural language processing models
- Computer vision tasks
- Generative models like GANs
For some problems, like reinforcement learning or highly sensitive systems, you might need to use a smaller learning rate or consider other optimizers.
In my experience, Adam is an excellent default choice for most deep learning applications. It requires less tuning than SGD and generally converges faster, making it ideal for both beginners and experts.
If you’re working on deep learning projects in PyTorch, I highly recommend starting with the Adam optimizer. Its adaptive learning rate properties make it forgiving of poor initialization and hyperparameter choices, while still delivering excellent performance across a wide range of tasks.
For even better results, remember that combining Adam with techniques like learning rate scheduling, proper weight initialization, and batch normalization can further improve your model’s performance and convergence speed.

I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working with Python, machine learning, and artificial intelligence for the last five years. During this time I have gained expertise in various Python libraries, such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, and Scikit-Learn, for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, and other countries. Check out my profile.