Throughout my more than ten years as a Python developer, I have worked with various optimization algorithms for deep learning models. Among these, Adam has consistently been one of my preferred choices due to its efficiency and reliability.
Adam (short for Adaptive Moment Estimation) combines the best aspects of other popular optimizers like AdaGrad and RMSProp. It’s particularly well-suited for problems with noisy or sparse gradients.
In this tutorial, I will show you how to use the Adam optimizer in PyTorch with practical examples. You’ll learn when to use it, how to configure its parameters, and see real-world applications.
Adam Optimizer
Adam is an adaptive learning rate optimization algorithm designed specifically for training deep neural networks. It was introduced in 2014 by Diederik Kingma and Jimmy Ba in their paper “Adam: A Method for Stochastic Optimization.”
The algorithm calculates adaptive learning rates for each parameter by storing both:
- First moment (mean) of past gradients
- Second moment (uncentered variance) of past gradients
This approach offers several advantages:
- Requires minimal memory
- Works well with large datasets and parameters
- Appropriate for non-stationary objectives
- Handles noisy gradients effectively
- Suitable for problems with sparse gradients
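The two moment estimates drive a simple update rule. As an illustration (this is a plain-Python sketch of the textbook equations, not PyTorch's optimized implementation), one Adam step for a single scalar parameter looks like this:

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative only)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero-initialized moments
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return param, m, v

# Minimize f(x) = x**2 starting from x = 1.0; the gradient is 2x
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    p, m, v = adam_step(p, 2 * p, m, v, t, lr=0.1)
# p ends up close to the minimum at 0
```

Dividing by the square root of the second moment is what gives each parameter its own effective step size: parameters with consistently large gradients take smaller steps, and vice versa.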
Implement Adam Optimizer in PyTorch
PyTorch makes it incredibly simple to use Adam. Let’s start with the basic implementation:
import torch
import torch.nn as nn
import torch.optim as optim
# Define a simple model
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 1)
)
# Create Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Define loss function (e.g., Mean Squared Error for regression)
loss_function = nn.MSELoss()
# Dummy input and target data
# Let's assume 32 samples (batch size), each with 10 features
input_data = torch.randn(32, 10) # shape: [32, 10]
target = torch.randn(32, 1) # shape: [32, 1]
# Training loop (simplified)
for epoch in range(100):
    # Forward pass
    output = model(input_data)
    loss = loss_function(output, target)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print loss every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/100], Loss: {loss.item():.4f}")

This basic implementation uses Adam with default parameters, which work well for many tasks.
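One practical detail worth knowing: Adam keeps per-parameter running averages as internal state, so when you checkpoint a model mid-training you should save the optimizer's state_dict alongside the model's. A minimal sketch (the tiny model and the in-memory buffer are stand-ins; in practice you would pass a file path to torch.save):

```python
import io
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# One training step so Adam has per-parameter state (exp_avg, exp_avg_sq)
loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()

# Save model and optimizer together (a buffer stands in for a checkpoint file)
buffer = io.BytesIO()
torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, buffer)

# Restore into fresh instances so training can resume exactly where it left off
buffer.seek(0)
checkpoint = torch.load(buffer)
model2 = nn.Linear(10, 1)
optimizer2 = optim.Adam(model2.parameters(), lr=0.001)
model2.load_state_dict(checkpoint["model"])
optimizer2.load_state_dict(checkpoint["optimizer"])
```

If you restore only the model weights, Adam restarts with zeroed moment estimates, which can cause a visible loss spike when training resumes.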
Understand Adam’s Parameters
Adam has several parameters that can be adjusted for different scenarios:
optimizer = optim.Adam(
    model.parameters(),
    lr=0.001,            # Learning rate
    betas=(0.9, 0.999),  # Coefficients for running averages of the gradient and its square
    eps=1e-8,            # Term added to the denominator for numerical stability
    weight_decay=0,      # L2 penalty (regularization)
    amsgrad=False        # Whether to use the AMSGrad variant
)
Learning Rate (lr)
The learning rate determines the step size at each iteration. I typically start with:
- 0.001 for most problems
- 0.0001 for more complex models or sensitive problems
# For standard problems
optimizer = optim.Adam(model.parameters(), lr=0.001)
# For complex problems
optimizer = optim.Adam(model.parameters(), lr=0.0001)
Beta Parameters (betas)
The beta parameters control the exponential moving averages:
- First beta: Controls the exponential decay rate for the first moment estimates
- Second beta: Controls the exponential decay rate for the second moment estimates
The default values (0.9, 0.999) work well for most cases.
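All of these hyperparameters can also be set per parameter group, which is handy when different parts of a network need different settings (for example, a lower learning rate for pretrained layers). The two-part model and its layer names below are made up for the illustration:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical model with two distinct parts, purely for demonstration
class TwoPartNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(10, 5)
        self.classifier = nn.Linear(5, 1)

    def forward(self, x):
        return self.classifier(self.features(x))

model = TwoPartNet()

# Each dict is a parameter group; keys set here override the defaults below
optimizer = optim.Adam(
    [
        {"params": model.features.parameters(), "lr": 1e-4},
        {"params": model.classifier.parameters()},  # inherits lr=1e-3
    ],
    lr=1e-3,
    betas=(0.9, 0.999),
)
```

Anything not specified in a group (here, betas and the classifier's lr) falls back to the keyword defaults passed to the constructor.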
Epsilon (eps)
This small constant prevents division by zero. The default value of 1e-8 is suitable for most applications.
Weight Decay
Weight decay implements L2 regularization to prevent overfitting:
# Adding weight decay for regularization
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
I often use values between 1e-6 and 1e-4, depending on the dataset size and model complexity.
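Note that Adam's weight_decay folds the L2 penalty into the gradient before the adaptive scaling. PyTorch also ships optim.AdamW, which decouples the decay from the adaptive update and is often the preferred way to regularize Adam-style optimizers; a minimal sketch:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)

# AdamW shrinks the weights directly each step instead of adding
# the penalty to the gradient, so the decay is not rescaled by the
# per-parameter adaptive step size
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-2)

loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()
```

With decoupled decay, larger weight_decay values (around 1e-2) are commonly used than with Adam's coupled L2 penalty.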
Real-World Example: Image Classification with CIFAR-10
Let’s implement a convolutional neural network for classifying CIFAR-10 images using Adam:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
# Load and normalize CIFAR10
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
# Define a CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(32 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 32 * 8 * 8)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x
# Initialize the model, loss function, and optimizer
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
# Training loop
for epoch in range(5):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 200 == 199:
            print(f'[{epoch + 1}, {i + 1}] loss: {running_loss / 200:.3f}')
            running_loss = 0.0

print('Finished Training')
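After training you will usually want to measure accuracy on held-out data. The sketch below shows the evaluation pattern; a stand-in model and random tensors replace SimpleCNN and a real CIFAR-10 test loader so it runs without downloading anything, and you would swap those in for a real measurement:

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-in model with the same input/output shapes as the article's SimpleCNN
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)

model.eval()                      # switch layers like dropout/batchnorm to eval mode
correct, total = 0, 0
with torch.no_grad():             # no gradient bookkeeping needed during evaluation
    for _ in range(3):            # stands in for iterating a test DataLoader
        images = torch.randn(8, 3, 32, 32).to(device)
        labels = torch.randint(0, 10, (8,)).to(device)
        predicted = model(images).argmax(dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Accuracy: {accuracy:.1f}%")
```

Remember to call model.train() again before resuming training.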

Compare Adam with Other Optimizers
In my experience, Adam often outperforms other optimizers for deep learning tasks. Let’s compare Adam with SGD on the same problem:
# Define models and criteria
model_adam = SimpleCNN().to(device)
model_sgd = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
# Define optimizers
optimizer_adam = optim.Adam(model_adam.parameters(), lr=0.001)
optimizer_sgd = optim.SGD(model_sgd.parameters(), lr=0.01, momentum=0.9)
# Training functions (simplified for demonstration)
def train_model(model, optimizer, epochs=5):
    losses = []
    for epoch in range(epochs):
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if i % 200 == 199:
                losses.append(running_loss / 200)
                running_loss = 0.0
    return losses

# Train both models
adam_losses = train_model(model_adam, optimizer_adam)
sgd_losses = train_model(model_sgd, optimizer_sgd)

# Now you can compare the loss curves
# Adam typically converges faster and reaches a lower loss in fewer epochs
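If you want a quick, self-contained way to run the same comparison without downloading CIFAR-10, a toy regression problem works too. Everything below (the synthetic data, the final_loss helper, and the learning rates) is made up for the demonstration:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Synthetic linear-regression data: y = X @ true_w + noise
torch.manual_seed(0)
X = torch.randn(256, 10)
true_w = torch.randn(10, 1)
y = X @ true_w + 0.1 * torch.randn(256, 1)

def final_loss(optimizer_cls, steps=500, **kwargs):
    torch.manual_seed(0)               # identical initialization for a fair comparison
    model = nn.Linear(10, 1)
    opt = optimizer_cls(model.parameters(), **kwargs)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

adam_loss = final_loss(optim.Adam, lr=0.01)
sgd_loss = final_loss(optim.SGD, lr=0.01, momentum=0.9)
```

On a simple convex problem like this both optimizers reach a low loss; the differences show up more on noisy, high-dimensional deep learning objectives.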
Learning Rate Scheduling with Adam
For complex tasks, you might want to adjust the learning rate during training:
# Define the model and optimizer
model = SimpleCNN().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Define a learning rate scheduler
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5
)
# Training loop with learning rate scheduling
for epoch in range(20):
    train_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    # Calculate average loss for the epoch
    avg_train_loss = train_loss / len(trainloader)

    # Step the scheduler with the metric it monitors
    scheduler.step(avg_train_loss)
    print(f'Epoch {epoch+1}, Loss: {avg_train_loss:.4f}, LR: {optimizer.param_groups[0]["lr"]:.6f}')
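You can check the scheduler's behavior in isolation by feeding it a metric that stops improving: once the metric fails to improve for more than patience epochs, the learning rate is multiplied by factor. A minimal sketch with a dummy model and a deliberately stuck metric:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)                     # dummy model just to own the parameters
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=2
)

# Pretend the validation loss is stuck at 1.0 for six epochs;
# after patience (2) non-improving epochs, the lr drops by factor (0.1)
for epoch in range(6):
    scheduler.step(1.0)

current_lr = optimizer.param_groups[0]["lr"]
```

The same pattern works with any metric you track, such as validation accuracy with mode='max'.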
When to Use Adam Optimizer
Over my years of experience, I’ve found Adam performs exceptionally well in these scenarios:
- Training deep neural networks with many parameters
- Working with sparse data or sparse gradients
- Dealing with noisy gradients
- Training natural language processing models
- Computer vision tasks
- Generative models like GANs
For some problems, like reinforcement learning or highly sensitive systems, you might need to use a smaller learning rate or consider other optimizers.
In my experience, Adam is an excellent default choice for most deep learning applications. It requires less tuning than SGD and generally converges faster, making it ideal for both beginners and experts.
If you’re working on deep learning projects in PyTorch, I highly recommend starting with the Adam optimizer. Its adaptive learning rate properties make it forgiving of poor initialization and hyperparameter choices, while still delivering excellent performance across a wide range of tasks.
For even better results, remember that combining Adam with techniques like learning rate scheduling, proper weight initialization, and batch normalization can further improve your model’s performance and convergence speed.

I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working with Python, machine learning, and artificial intelligence for the last five years. During this time I have gained expertise in various Python libraries, such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, and Scikit-Learn, for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, and other countries. Check out my profile.