TensorFlow Gradient Descent in Neural Networks

As a TensorFlow beginner, you must understand how gradient descent works in neural networks.

In this TensorFlow tutorial, I will explain how the gradient descent algorithm works with a simple example. Then, step by step, I will build your understanding of gradient descent by explaining how it helps improve the prediction performance of neural networks and machine learning models.

After that, you will learn how to create a gradient descent algorithm using Python, where you will learn about the model’s parameters and how they are adjusted.

Afterwards, I will show how to implement the gradient descent algorithm using the TensorFlow framework. Finally, you will use the optimizer to build a linear regression model and optimize the model parameters.

What is Gradient Descent in Neural Networks?

Gradient descent in neural networks is an algorithm that minimizes the loss function. In simple words, when you train a neural network model, you use the loss function to measure how far off its predictions are from the actual target values.

That measure is called the error and is computed by the loss function; so although we generally say we minimize the loss function, what is actually being minimized is this error.

While training a neural network model, the error is computed using a loss function. This error indicates how well the model is learning: the larger the error, the less the model has learned, and the smaller the error, the better the model has learned.

This means the goal is to minimize the error, and the gradient descent algorithm is the tool used to do exactly that.

For example, suppose you are standing at the top of a hill overlooking a valley and need to get down to the bottom. To do so, you take many small steps in the direction that seems steepest downhill and hope to eventually reach the bottom.

In other words, you are searching for the lowest point of the valley.

In the same way, minimizing the error or loss function means finding its lowest point, but here calculus is used to compute the slope at the current position (or point).

These small steps are computed in the direction that decreases the loss function or the error.

In other words, the algorithm uses calculus to compute the slope and then steps in the direction that reduces the model's error. But what is the primary purpose of gradient descent? Its main purpose is to adjust the parameters of the neural network model, such as the weights and biases, to minimize the loss function, which makes the model's predictions more accurate.

You feed data to the model and train it, expecting the model to make correct predictions or decisions with high accuracy. The loss function acts as a feedback mechanism: it quantifies the error and helps the model refine its learning to make accurate predictions.

It tells how well your neural network model is learning the task it was designed for by measuring the difference between the model's predictions and the actual outcomes, quantifying the error or loss.
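
For example, a common loss function is the mean squared error (MSE). Here is a minimal sketch, with made-up numbers, of how it quantifies the gap between predictions and actual values:

import numpy as np

# Hypothetical predictions and actual target values (made-up numbers)
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 8.0])

# Mean squared error: the average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.5 -> the smaller this value, the better the predictions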

Why a loss function? Because the error it produces is used to improve the neural network model, and the goal is to minimize this error. This is where gradient descent comes into play as an iterative improvement technique.

Again, this algorithm computes the gradient (the slope and its direction) of the loss function at any given point and then moves step by step in the direction that lowers the loss.

This process repeats iteratively, adjusting the model’s parameters, such as the weights and biases, to minimize the loss function.

Together, the loss function and gradient descent systematically improve the model’s predictions: each iteration adjusts the parameters based on the error reported by the loss function, enhancing the model’s accuracy.
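
To make this concrete, here is a minimal sketch (plain Python, with made-up numbers) of a single gradient descent update for the weight and bias of a tiny linear model with a squared-error loss:

# One gradient descent step for a tiny linear model y = w * x + b
# (all values are made up for illustration)
w, b = 0.5, 0.0           # current parameters
x, y_true = 2.0, 5.0      # one training example
learning_rate = 0.1

y_pred = w * x + b        # model prediction
error = y_pred - y_true   # how far off the prediction is
loss = error ** 2         # squared-error loss

# Gradients of the loss with respect to w and b
grad_w = 2 * error * x
grad_b = 2 * error

# Move each parameter a small step against its gradient
w = w - learning_rate * grad_w
b = b - learning_rate * grad_b
print(w, b, (w * x + b - y_true) ** 2)  # the loss after the update is smaller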


Building Gradient Descent Algorithm using Python

Here, you must understand the difference between the gradient (slope) and the steps taken along it. I will show basic examples of both concepts; knowing this sets a solid foundation before moving into a more complex example with TensorFlow.

You know that calculus is used to compute the gradient, so here I will show how to find the derivative of a quadratic function in Python.

Create a simple quadratic function f(x) = x² with a global minimum at x = 0. Here, I will show how to find that minimum using gradient descent.

If you are familiar with differentiation in mathematics, you need to find the derivative of f(x) = x²; computing the derivative of that quadratic function gives f'(x) = 2x, which is the slope at any x.

Let’s create a function and its derivative in Python.

def f(x):
    return x**2

def df(x):
    return 2 * x

As you know, the gradient descent algorithm works iteratively, so we will start with an initial guess for x and repeatedly apply the gradient descent update rule:

x_new = x_old - a * f'(x_old), where a is the learning rate.

Define the gradient descent function as shown below

def gradient_descent(start_x, learning_rate, n_iterations):
  x = start_x
  for i in range(n_iterations):
    grad = df(x)
    x = x - learning_rate * grad
    print(f"Iteration {i+1}: x = {x}, f(x) = {f(x)}")
  return x

Initialise the parameters as illustrated below.

start_x = 2
learning_rate = 0.1
n_iterations = 10

Use the gradient descent as shown below.

x_final = gradient_descent(start_x, learning_rate, n_iterations)
print(f"Final x: {x_final}")

The above code runs ten iterations and adjusts x towards the minimum of the function f(x) = x². After ten iterations, x reaches approximately 0.2147 (the true minimum is at x = 0, and with more iterations x would keep approaching it). But what is the relation between the above code and the gradient descent we discussed?

You have just created a gradient descent algorithm that finds the minimum of the function f(x) = x². This is how gradient descent minimizes the error or loss; here, minimizing simply means finding the minimum value.

Let’s understand more deeply how the gradient descent algorithm works. The function gradient_descent(start_x, learning_rate, n_iterations) accepts three input parameters.

  • start_x: It represents the initial guess for the value of x. This is where the algorithm begins searching for the minimum value.
  • learning_rate: This controls how big a step is taken on each iteration. A smaller learning rate takes smaller steps, leading to more precise results but slower convergence; a larger learning rate speeds up convergence but risks overshooting the minimum (see the comparison sketch just after this list).
  • n_iterations: This specifies how many times the algorithm will update the value of x.
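
For instance, here is a quick comparison, reusing the f(x) = x² setup and the df() function defined above, of how different learning rates behave over the same 10 iterations:

def run(learning_rate, start_x=2.0, n_iterations=10):
    # Repeatedly apply the gradient descent update with the given learning rate
    x = start_x
    for _ in range(n_iterations):
        x = x - learning_rate * df(x)
    return x

print(run(0.01))  # tiny steps: x is still far from 0 after 10 iterations
print(run(0.1))   # moderate steps: x is about 0.2147 and steadily approaching 0
print(run(1.1))   # step too large: x overshoots and moves away from 0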

Within the gradient_descent() function, the initial guess is first assigned with x = start_x; this is the starting point in the search for the minimum of the function.

Then a loop runs n_iterations times (10 in this example). In each iteration, two actions are performed:

Firstly, it computes the derivative of f(x) by calling the df(x) function at the current value of x, which gives the function's slope at that point. In the context of minimizing the error, this slope represents the direction in which the function increases most steeply; knowing this, we can move in the opposite direction to find the minimum.

Secondly, it adjusts the current value of x by moving it in the direction that decreases f(x) using x = x - learning_rate * grad, which reduces the error. The step size is determined by the learning_rate and the magnitude of the gradient (grad). Moving against the gradient is the essence of descending towards the minimum.

After completing all the iterations, the function returns the final value of x (approximately 0.2147), the algorithm's best guess for the minimum.

Let me walk through the computation once more. You have the quadratic function f(x) = x²; when you differentiate it, you get f'(x) = 2*x.

So, in the algorithm, we initialize x with an initial guess such as x = 2.

Then, within the loop, the statement grad = df(x) evaluates the derivative of f(x) = x²; think of this statement as computing 2*x.


The next statement, x = x - learning_rate * grad, multiplies the gradient by the learning rate and subtracts the result from x. This is the step taken in the opposite direction of the computed gradient to reduce the error; this is how the error is minimized.

The error can’t be minimized in one iteration; to find the minimum value of the function, we need to repeat the same steps.

Now, suppose the starting value of x is 2, the derivative (gradient) of f(x) = x² is 2*x, the learning rate is 0.1, and the loop runs 10 times as specified by n_iterations.

Now, look at how the function appears with these values substituted in.

def gradient_descent(start_x, learning_rate, n_iterations):
  x = 2
  for i in range(10):
    grad = 2 * x
    x = x - 0.1 * grad
    print(f"Iteration {i+1}: x = {x}, f(x) = {f(x)}")
  return x

Let's trace the loop and track the value of the variable x, which represents the current estimate of the minimum.

  • First iteration: the current value of x is 2; after the update x = 2 - 0.1 * 2 * 2, it becomes x = 1.6, so the new value of x is 1.6.
  • Second iteration: the current value of x is 1.6; after the update x = 1.6 - 0.1 * 2 * 1.6, it becomes x = 1.28, so the new value of x is 1.28. Notice the pattern of how the value of x is adjusted.
  • Third iteration: the current value of x is 1.28; after the update x = 1.28 - 0.1 * 2 * 1.28, it becomes x = 1.024, so the new value of x is 1.024.
  • Fourth iteration: the current value of x is 1.024; after the update x = 1.024 - 0.1 * 2 * 1.024, it becomes x = 0.8192, so the new value of x is 0.8192.

Perform the same steps 10 times and you get x ≈ 0.21474836, which means the error has been reduced close to its minimum.
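
You can verify this number directly: each iteration multiplies x by (1 - learning_rate * 2) = 0.8, so after 10 iterations x equals 2 * 0.8**10.

# Closed-form check of the loop's result
print(2 * (1 - 0.1 * 2) ** 10)  # ≈ 0.21474836, matching the value from the loop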

This is how the gradient descent algorithm minimizes the loss function or the error made by the model.

But remember, a more complex model or neural network uses far more complex loss functions than this quadratic; this is just an example of how gradient descent works.

One thing to note is that here the loss function is f(x) = x²; we minimized it by adjusting the value of x, and the way we adjusted x was with the gradient descent algorithm.

Now that you are prepared, let me show you how to use gradient descent in TensorFlow.

Implementing TensorFlow Gradient Descent in Neural Networks

First, ensure you have installed TensorFlow on your system; if you haven’t, follow this tutorial: How to Install TensorFlow.

Here, I will show you how to optimize a variable to minimize the same quadratic function f(x) = x².

First, import the tensorflow library.

import tensorflow as tf

Define the variable x and initialize it to 5.

x = tf.Variable(5.0)

Define the loss function we want to minimize; here, the loss function is f(x) = x².

loss = lambda: x**2

Define the optimizer, which uses gradient descent internally, as shown below.

optimizer = tf.optimizers.SGD(learning_rate=0.1)

In the above code, the optimizer is the mechanism that minimizes the loss function; it uses a variant of the gradient descent algorithm called SGD (Stochastic Gradient Descent).

Also, a learning_rate of 0.1 is passed to SGD; you already know how important the learning rate is.

Remember, there are variants of gradient descent algorithms other than SGD, but the foundation of each is what you learned in the section Building Gradient Descent Algorithm using Python.

If you are confused about what optimizers are, I recommend reading this tutorial, How to Compile Neural Network in Tensorflow.

After defining the optimizer, let's run the optimization, as shown in the code below.

n_iterations = 10

for i in range(n_iterations):
    optimizer.minimize(loss, var_list=[x])
    print(f"Iteration {i+1}: x = {x.numpy()}, f(x) = {loss().numpy()}")

From the output, after 10 iterations the value of x is about 0.536 and the value of f(x) is about 0.2882. So x ≈ 0.536 is the optimizer's best value so far, reducing the loss to roughly 0.2882; with more iterations x would keep approaching the true minimum at x = 0.

Look at each iteration; the value of f(x) decreases from one iteration to the next.
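
As in the Python example, you can check these values with the closed-form computation: starting at x = 5, each step multiplies x by 0.8, so after 10 iterations x equals 5 * 0.8**10.

print(5 * (1 - 0.1 * 2) ** 10)         # ≈ 0.5369, the final value of x
print((5 * (1 - 0.1 * 2) ** 10) ** 2)  # ≈ 0.2882, the final value of f(x)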

In the above code, the loop calls optimizer.minimize(loss, var_list=[x]); the optimizer object is an instance of SGD. Remember, optimizers are responsible for applying the optimization algorithm to adjust the model parameters.


The .minimize(loss, var_list=[x]) method instructs the optimizer to minimize the given loss function, which is x², with respect to the variables listed in var_list. Simply put, it tells the optimizer to adjust the variables (parameters) in var_list to reduce the value of the loss function.
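
Roughly speaking, each call to minimize() computes the gradient of the loss with respect to the listed variables and then applies the same update rule you built by hand earlier. Here is a rough manual equivalent using tf.GradientTape, shown only for illustration; a fresh variable y is used so the x above is not disturbed:

y = tf.Variable(5.0)  # a separate variable so the x used above stays untouched

for i in range(10):
    with tf.GradientTape() as tape:
        loss_value = y ** 2              # the same quadratic loss
    grad = tape.gradient(loss_value, y)  # d(loss)/dy = 2 * y
    y.assign_sub(0.1 * grad)             # y = y - learning_rate * grad

print(y.numpy())  # close to the value that optimizer.minimize produced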

Remember, when any machine learning or AI model is trained, training means finding the best parameter values; the most common parameters are the weights and biases, because they decide how accurate the model can be.

Here, only the parameter x in var_list is adjusted, but it can be any trainable parameter of the model you train.

In TensorFlow, within var_list in the minimize() function, you specify all the parameters you want to adjust or update during training to minimize the loss function.

So, any neural network or machine learning model is trained based on several parameters. To improve the model’s prediction performance, optimizer algorithms are used to adjust or update parameters to minimize the loss function.
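
For instance, here is a small sketch (with a made-up loss) that passes two trainable parameters, a weight w and a bias b, in var_list so the optimizer adjusts both of them at once:

w = tf.Variable(0.0)
b = tf.Variable(0.0)

# A made-up loss: how far w * 3.0 + b is from a target value of 7.0
loss_wb = lambda: (w * 3.0 + b - 7.0) ** 2

opt = tf.optimizers.SGD(learning_rate=0.05)
for _ in range(100):
    opt.minimize(loss_wb, var_list=[w, b])

print(w.numpy(), b.numpy())  # values of w and b that make 3*w + b close to 7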

Create a linear regression model and train it using a gradient descent optimizer

Import the required library as shown below.

import tensorflow as tf
import numpy as np

Define the input layer using the code below.

X = tf.keras.Input(shape=(1,), name='X')

The input layer is defined with the shape of input data.

Define the linear regression model as shown below.

y_predicted = tf.keras.layers.Dense(1, name='y_predicted')(X)

model = tf.keras.Model(X, y_predicted)

The Dense layer here learns a linear relationship between the input and the output (y = W*x + b, with a weight W and a bias b). The complete model is created by connecting the input layer to this Dense layer.

Compile the model using the code below.

model.compile(optimizer='sgd', loss='mean_squared_error')

Compiling the model means preparing the model for training; in this step, our optimizer and loss function are specified.

When building a complete neural network or machine learning model, this is how the optimizer and loss function are specified in TensorFlow. The optimizer is ‘sgd’, and the loss function is mean_squared_error.

So, the optimizer will minimize the mean_squared_error loss function using the Stochastic Gradient Descent algorithm, and the weight and bias parameters are adjusted or updated accordingly.
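
Before training, you need the training data X_train and y_train. These are not shown above, so as an assumption for this example, let's generate synthetic data that roughly follows the line y = 2x + 1 with a little noise (any one-dimensional data would work):

# Hypothetical training data: y is roughly 2 * x + 1 plus some noise
np.random.seed(0)
X_train = np.random.rand(100, 1).astype(np.float32)
y_train = 2 * X_train + 1 + 0.05 * np.random.randn(100, 1).astype(np.float32)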

Train or fit the model.

model.fit(X_train, y_train, epochs=1000)

Here, the fit() method is called on the model to train the model on X_train and y_train data; the model is trained for 1000 epochs.

The final values of the weight and bias parameters, which were adjusted by the gradient descent algorithm, can be obtained from the model.

W_final, b_final = model.layers[1].get_weights()

Let’s print the final values.

print("Weight (W):", W_final)
print("Bias (b):", b_final)

When you run the above code, you get the linear regression model’s final weight and bias values, as shown below.


The final values of the weight and bias parameters are 1.9885721 and 1.0247945, respectively. These are the basics of the gradient descent algorithm in TensorFlow.

Well, I hope that you understand how gradient descent algorithms work.

Conclusion

This tutorial covered the TensorFlow gradient descent algorithm in neural networks, where you learned how to use the gradient descent algorithm and what its purpose is. You learned how important this algorithm is and how it helps the model make accurate predictions.

In particular, you learned how to minimize the model’s loss function by adjusting the model parameters. You also learned how to build a gradient descent algorithm using Python.

Then, you learned how to implement the gradient descent algorithm using TensorFlow and how to create a linear regression model. You used the gradient descent variant called SGD to minimize the mean squared error loss function.
