TensorFlow Gradient Descent In Neural Network

When I was building a neural network to predict housing prices in California, I ran into a common issue: my model wasn’t learning effectively. The culprit? I hadn’t properly configured my gradient descent optimizer.

Gradient descent is the backbone of neural network training, yet many developers struggle to implement it correctly in TensorFlow. In this tutorial, I will share practical approaches to gradient descent that have helped me build more accurate models.

Whether you’re new to neural networks or looking to optimize your existing TensorFlow models, this guide will walk you through everything you need to know about gradient descent.

This Tutorial Covers:

Gradient Descent in Neural Networks

Gradient descent in Neural Networks is an optimization algorithm that minimizes the loss function by iteratively adjusting the model’s parameters in the direction of the steepest descent.

Think of it like hiking down a mountain in foggy conditions. You can’t see the entire path, but you can feel which way is steepest downward from your current position. You take a step in that direction, reassess, and repeat until you reach the bottom (or minimum).

In neural networks, gradient descent works by:

Calculating the gradient (slope) of the loss function concerning each weight
Updating the weights in the opposite direction of the gradient
Repeating this process until the loss function is minimized

Read Tensorflow Activation Functions

Types of Gradient Descent in TensorFlow

When implementing gradient descent in TensorFlow, I’ve found these three variants to be the most useful:

1. Batch Gradient Descent

In batch gradient descent, we use the entire dataset to compute the gradient for each step.

import tensorflow as tf

# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(1)
])

# Configure with batch gradient descent
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mse')

# Train the model with full batch (entire dataset)
model.fit(X_train, y_train, epochs=10, batch_size=len(X_train))

This method provides the most accurate gradient estimation but can be computationally expensive for large datasets.

2. Stochastic Gradient Descent (SGD)

With SGD, we update the weights using just one training example at a time.

# Configure with stochastic gradient descent
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mse')

# Train with batch_size=1 for true SGD
model.fit(X_train, y_train, epochs=10, batch_size=1)

SGD introduces more noise in the gradient updates, which can help escape local minima, but convergence can be slower and less stable.

3. Mini-Batch Gradient Descent

This is the sweet spot I typically use in practice. We update weights using a small batch of examples (usually 32, 64, or 128).

# Configure with mini-batch gradient descent
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mse')

# Train with mini-batches
model.fit(X_train, y_train, epochs=10, batch_size=32)

Mini-batch gradient descent combines the benefits of both previous methods: faster computation than batch gradient descent and more stable convergence than SGD.

Check out Use TensorFlow’s get_shape Function

Learning Rate in Gradient Descent

The learning rate is the most critical hyperparameter to configure properly. It determines how large a step we take in the direction of the negative gradient.

# Too small: slow convergence
slow_optimizer = tf.keras.optimizers.SGD(learning_rate=0.0001)

# Too large: may overshoot or diverge
fast_optimizer = tf.keras.optimizers.SGD(learning_rate=1.0)

# "Just right" - depends on your specific problem
good_optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

I’ve found that starting with 0.01 and adjusting based on training performance works well for many problems.

Advanced Gradient Descent Techniques in TensorFlow

Over my years of experience, I’ve found these advanced techniques to significantly improve model training:

Read Iterate Over Tensor In TensorFlow

1. Learning Rate Scheduling

Reducing the learning rate over time helps fine-tune the model as it approaches the minimum.

# Exponential decay learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=10000,
    decay_rate=0.9)

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)

2. Momentum

Momentum helps accelerate gradient descent by adding a fraction of the previous update vector to the current update.

# SGD with momentum
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=optimizer, loss='mse')

This helps the optimizer push through plateaus and shallow local minima.

Check out Convert Tensor to Numpy in TensorFlow

3. Adam Optimizer

In practice, I often skip directly to Adam, which combines momentum and adaptive learning rates.

# Adam combines momentum with adaptive learning rates
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='mse')

Adam is often my go-to optimizer for most deep learning problems because it requires less tuning than SGD.

Custom Gradient Descent Implementation

Sometimes I need more control over the optimization process. TensorFlow’s GradientTape allows for custom gradient calculation and application:

# Custom gradient descent loop
def train_step(model, inputs, targets, optimizer):
    with tf.GradientTape() as tape:
        predictions = model(inputs, training=True)
        loss = loss_function(targets, predictions)

    # Calculate gradients
    gradients = tape.gradient(loss, model.trainable_variables)

    # Apply gradients
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    return loss

# Use in a training loop
for epoch in range(epochs):
    for step, (x_batch, y_batch) in enumerate(dataset):
        loss = train_step(model, x_batch, y_batch, optimizer)
        if step % 100 == 0:
            print(f"Epoch {epoch}, Step {step}, Loss: {loss}")

You can see the output in the screenshot below.

This approach gives you complete control over how and when gradients are calculated and applied.

Read TensorFlow One_Hot Encoding

Practical Example: Stock Price Prediction

Let’s tie everything together with a practical example. We’ll create a simple LSTM model to predict stock prices for a major US tech company:

import tensorflow as tf
import numpy as np
import pandas as pd
import yfinance as yf

# Download historical data for AAPL
data = yf.download('AAPL', start='2018-01-01', end='2023-01-01')
prices = data['Close'].values.reshape(-1, 1)

# Normalize the data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
prices_scaled = scaler.fit_transform(prices)

# Create sequences
def create_sequences(data, seq_length):
    xs, ys = [], []
    for i in range(len(data) - seq_length):
        x = data[i:i+seq_length]
        y = data[i+seq_length]
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

# Create training data with 60-day sequences
seq_length = 60
X, y = create_sequences(prices_scaled, seq_length)

# Split into training and testing sets
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Build LSTM model
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(50, activation='relu', input_shape=(seq_length, 1)),
    tf.keras.layers.Dense(1)
])

# Use Adam optimizer - a more advanced form of gradient descent
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='mse')

# Train the model
history = model.fit(
    X_train, y_train,
    epochs=20,
    batch_size=32,
    validation_data=(X_test, y_test),
    verbose=1
)

# Make predictions
predictions = model.predict(X_test)
predictions = scaler.inverse_transform(predictions)

You can see the output in the screenshot below.

In this example, I’m using the Adam optimizer, which implements an advanced form of gradient descent. For financial time series like stock prices, I’ve found that mini-batch sizes between 32 and 64 with Adam optimizer provide a good balance of training speed and accuracy.

Troubleshoot Gradient Descent Issues

After working with TensorFlow for years, I’ve encountered several common gradient descent issues:

Vanish Gradients

When gradients become too small, learning essentially stops. Solutions include:

# Use ReLU activation
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Or use batch normalization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(1)
])

Explode Gradients

When gradients become too large, try gradient clipping:

optimizer = tf.keras.optimizers.SGD(clipnorm=1.0)  # Clip norm of gradients
# Or
optimizer = tf.keras.optimizers.SGD(clipvalue=0.5)  # Clip values of gradients

Get Stuck in Local Minima

Use momentum or Adam optimizer, as shown earlier, or try different initializations:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', 
                         kernel_initializer='he_normal'),
    tf.keras.layers.Dense(1)
])

Check out Training a Neural Network in TensorFlow

When to Use Which Gradient Descent Variant

From my experience, here’s when to use each type:

Batch Gradient Descent: For small datasets where computational efficiency isn’t a concern
Stochastic Gradient Descent: When memory is limited or when dealing with very large datasets with redundancy
Mini-Batch Gradient Descent: For most practical applications of deep learning (this is what you’ll use 90% of the time)
Adam and other adaptive methods: When you want faster convergence and don’t want to tune the learning rate extensively

I hope you found this guide to TensorFlow gradient descent helpful. Understanding gradient descent is crucial for building effective neural networks, and implementing it correctly in TensorFlow can significantly improve your model’s performance.

If you have any questions or suggestions, please leave them in the comments below. Happy coding!

Other TensorFlow articles you may also like:

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers. More about us.

enjoysharepoint.com/

TensorFlow Gradient Descent in Neural Network