When I was building a neural network to predict housing prices in California, I ran into a common issue: my model wasn’t learning effectively. The culprit? I hadn’t properly configured my gradient descent optimizer.
Gradient descent is the backbone of neural network training, yet many developers struggle to implement it correctly in TensorFlow. In this tutorial, I will share practical approaches to gradient descent that have helped me build more accurate models.
Whether you’re new to neural networks or looking to optimize your existing TensorFlow models, this guide will walk you through everything you need to know about gradient descent.
Gradient Descent in Neural Networks
Gradient descent in Neural Networks is an optimization algorithm that minimizes the loss function by iteratively adjusting the model’s parameters in the direction of the steepest descent.
Think of it like hiking down a mountain in foggy conditions. You can’t see the entire path, but you can feel which way is steepest downward from your current position. You take a step in that direction, reassess, and repeat until you reach the bottom (or minimum).
In neural networks, gradient descent works by:
- Calculating the gradient (slope) of the loss function concerning each weight
- Updating the weights in the opposite direction of the gradient
- Repeating this process until the loss function is minimized
Read Tensorflow Activation Functions
Types of Gradient Descent in TensorFlow
When implementing gradient descent in TensorFlow, I’ve found these three variants to be the most useful:
1. Batch Gradient Descent
In batch gradient descent, we use the entire dataset to compute the gradient for each step.
import tensorflow as tf
# Create a simple model
model = tf.keras.Sequential([
tf.keras.layers.Dense(10, activation='relu', input_shape=(4,)),
tf.keras.layers.Dense(1)
])
# Configure with batch gradient descent
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mse')
# Train the model with full batch (entire dataset)
model.fit(X_train, y_train, epochs=10, batch_size=len(X_train))This method provides the most accurate gradient estimation but can be computationally expensive for large datasets.
2. Stochastic Gradient Descent (SGD)
With SGD, we update the weights using just one training example at a time.
# Configure with stochastic gradient descent
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mse')
# Train with batch_size=1 for true SGD
model.fit(X_train, y_train, epochs=10, batch_size=1)SGD introduces more noise in the gradient updates, which can help escape local minima, but convergence can be slower and less stable.
3. Mini-Batch Gradient Descent
This is the sweet spot I typically use in practice. We update weights using a small batch of examples (usually 32, 64, or 128).
# Configure with mini-batch gradient descent
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mse')
# Train with mini-batches
model.fit(X_train, y_train, epochs=10, batch_size=32)Mini-batch gradient descent combines the benefits of both previous methods: faster computation than batch gradient descent and more stable convergence than SGD.
Check out Use TensorFlow’s get_shape Function
Learning Rate in Gradient Descent
The learning rate is the most critical hyperparameter to configure properly. It determines how large a step we take in the direction of the negative gradient.
# Too small: slow convergence
slow_optimizer = tf.keras.optimizers.SGD(learning_rate=0.0001)
# Too large: may overshoot or diverge
fast_optimizer = tf.keras.optimizers.SGD(learning_rate=1.0)
# "Just right" - depends on your specific problem
good_optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)I’ve found that starting with 0.01 and adjusting based on training performance works well for many problems.
Advanced Gradient Descent Techniques in TensorFlow
Over my years of experience, I’ve found these advanced techniques to significantly improve model training:
Read Iterate Over Tensor In TensorFlow
1. Learning Rate Scheduling
Reducing the learning rate over time helps fine-tune the model as it approaches the minimum.
# Exponential decay learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
initial_learning_rate=0.1,
decay_steps=10000,
decay_rate=0.9)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)2. Momentum
Momentum helps accelerate gradient descent by adding a fraction of the previous update vector to the current update.
# SGD with momentum
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=optimizer, loss='mse')This helps the optimizer push through plateaus and shallow local minima.
Check out Convert Tensor to Numpy in TensorFlow
3. Adam Optimizer
In practice, I often skip directly to Adam, which combines momentum and adaptive learning rates.
# Adam combines momentum with adaptive learning rates
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='mse')Adam is often my go-to optimizer for most deep learning problems because it requires less tuning than SGD.
Custom Gradient Descent Implementation
Sometimes I need more control over the optimization process. TensorFlow’s GradientTape allows for custom gradient calculation and application:
# Custom gradient descent loop
def train_step(model, inputs, targets, optimizer):
with tf.GradientTape() as tape:
predictions = model(inputs, training=True)
loss = loss_function(targets, predictions)
# Calculate gradients
gradients = tape.gradient(loss, model.trainable_variables)
# Apply gradients
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
return loss
# Use in a training loop
for epoch in range(epochs):
for step, (x_batch, y_batch) in enumerate(dataset):
loss = train_step(model, x_batch, y_batch, optimizer)
if step % 100 == 0:
print(f"Epoch {epoch}, Step {step}, Loss: {loss}")You can see the output in the screenshot below.

This approach gives you complete control over how and when gradients are calculated and applied.
Read TensorFlow One_Hot Encoding
Practical Example: Stock Price Prediction
Let’s tie everything together with a practical example. We’ll create a simple LSTM model to predict stock prices for a major US tech company:
import tensorflow as tf
import numpy as np
import pandas as pd
import yfinance as yf
# Download historical data for AAPL
data = yf.download('AAPL', start='2018-01-01', end='2023-01-01')
prices = data['Close'].values.reshape(-1, 1)
# Normalize the data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
prices_scaled = scaler.fit_transform(prices)
# Create sequences
def create_sequences(data, seq_length):
xs, ys = [], []
for i in range(len(data) - seq_length):
x = data[i:i+seq_length]
y = data[i+seq_length]
xs.append(x)
ys.append(y)
return np.array(xs), np.array(ys)
# Create training data with 60-day sequences
seq_length = 60
X, y = create_sequences(prices_scaled, seq_length)
# Split into training and testing sets
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Build LSTM model
model = tf.keras.Sequential([
tf.keras.layers.LSTM(50, activation='relu', input_shape=(seq_length, 1)),
tf.keras.layers.Dense(1)
])
# Use Adam optimizer - a more advanced form of gradient descent
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='mse')
# Train the model
history = model.fit(
X_train, y_train,
epochs=20,
batch_size=32,
validation_data=(X_test, y_test),
verbose=1
)
# Make predictions
predictions = model.predict(X_test)
predictions = scaler.inverse_transform(predictions)You can see the output in the screenshot below.

In this example, I’m using the Adam optimizer, which implements an advanced form of gradient descent. For financial time series like stock prices, I’ve found that mini-batch sizes between 32 and 64 with Adam optimizer provide a good balance of training speed and accuracy.
Troubleshoot Gradient Descent Issues
After working with TensorFlow for years, I’ve encountered several common gradient descent issues:
Vanish Gradients
When gradients become too small, learning essentially stops. Solutions include:
# Use ReLU activation
model = tf.keras.Sequential([
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(1)
])
# Or use batch normalization
model = tf.keras.Sequential([
tf.keras.layers.Dense(128),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Activation('relu'),
tf.keras.layers.Dense(1)
])Explode Gradients
When gradients become too large, try gradient clipping:
optimizer = tf.keras.optimizers.SGD(clipnorm=1.0) # Clip norm of gradients
# Or
optimizer = tf.keras.optimizers.SGD(clipvalue=0.5) # Clip values of gradientsGet Stuck in Local Minima
Use momentum or Adam optimizer, as shown earlier, or try different initializations:
model = tf.keras.Sequential([
tf.keras.layers.Dense(128, activation='relu',
kernel_initializer='he_normal'),
tf.keras.layers.Dense(1)
])Check out Training a Neural Network in TensorFlow
When to Use Which Gradient Descent Variant
From my experience, here’s when to use each type:
- Batch Gradient Descent: For small datasets where computational efficiency isn’t a concern
- Stochastic Gradient Descent: When memory is limited or when dealing with very large datasets with redundancy
- Mini-Batch Gradient Descent: For most practical applications of deep learning (this is what you’ll use 90% of the time)
- Adam and other adaptive methods: When you want faster convergence and don’t want to tune the learning rate extensively
I hope you found this guide to TensorFlow gradient descent helpful. Understanding gradient descent is crucial for building effective neural networks, and implementing it correctly in TensorFlow can significantly improve your model’s performance.
If you have any questions or suggestions, please leave them in the comments below. Happy coding!
Other TensorFlow articles you may also like:
- Tensor in TensorFlow
- Compile Neural Network in Tensorflow
- Build an Artificial Neural Network in Tensorflow

I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I started working on Python, Machine learning, and artificial intelligence for the last 5 years. During this time I got expertise in various Python libraries also like Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, Tensorflow, Scipy, Scikit-Learn, etc… for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, etc. Check out my profile.