TensorFlow Learning Rate Scheduler

This Python tutorial focuses on using learning rate schedules for machine learning models with TensorFlow. We will also look at several examples of how to apply learning rate schedules in TensorFlow. And we will cover these topics.

  • TensorFlow Learning rate Scheduler
  • TensorFlow learning rate scheduler adam
  • TensorFlow learning rate scheduler cosine
  • TensorFlow get learning rate
  • TensorFlow adaptive learning rate

TensorFlow Learning Rate Scheduler

  • In the Keras API, one of the built-in callbacks is LearningRateScheduler. Callbacks are services that are invoked at specific points during training, depending on the individual callback.
  • These callbacks run every time we train our neural networks and carry out their respective duties. In our case, the LearningRateScheduler callback calls the schedule function that we define before training, passes it the current epoch and the current learning rate, and applies the returned learning rate to the optimizer.
  • The goal of learning rate schedules is to reduce the learning rate according to a predefined schedule over the course of training. Learning rate schedules come in four common types (a minimal sketch of wiring one of them into the LearningRateScheduler callback follows this list):
    • Constant learning rate: The default learning rate schedule for the SGD optimizer in Keras is a constant learning rate, with momentum and decay both set to zero. Choosing the proper learning rate is difficult; lr = 0.1 can be used as a starting point while testing different learning rate strategies.
    • Time-based decay: The formula for time-based decay is lr = lr0 / (1 + k * t), where lr0 (the initial learning rate) and k (the decay) are hyperparameters and t is the iteration number. When the decay is zero, the learning rate is unaffected; when the decay is set, the learning rate from the previous epoch is reduced by the given amount.
    • Step decay: A popular learning rate schedule that drops the learning rate by a fixed factor at set intervals during training: lr = initial_lr * drop_rate^floor(epoch / epochs_drop), where epoch is the current epoch number, initial_lr is the initial learning rate (for example 0.01), drop_rate is the factor applied to the learning rate each time it changes, and epochs_drop is how frequently to change the learning rate (for example every 10 epochs).
    • Exponential decay: It has the mathematical form lr = lr0 * e^(−k*t), where lr0 and k are hyperparameters and t is the iteration number. We can easily implement it by defining an exponential decay function and passing it to the LearningRateScheduler.
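
As a quick illustration, here is a minimal sketch of a step-decay schedule wired into the LearningRateScheduler callback described above. The function name and the constants are ours and are chosen only for illustration.

import math
import tensorflow as tf

def step_decay(epoch):
    # illustrative values: start at 0.01 and halve the rate every 10 epochs
    initial_lr = 0.01
    drop_rate = 0.5
    epochs_drop = 10
    return initial_lr * math.pow(drop_rate, math.floor(epoch / epochs_drop))

# the callback asks step_decay for a learning rate at the start of every epoch
lr_callback = tf.keras.callbacks.LearningRateScheduler(step_decay, verbose=1)

# pass it to training, e.g. model.fit(X_train, y_train, epochs=30, callbacks=[lr_callback])
for epoch in (0, 9, 10, 20, 29):
    print(epoch, step_decay(epoch))

The same pattern works for the time-based and exponential decay formulas above; only the body of the schedule function changes.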

Example:

import tensorflow as tf

# load the MNIST dataset and scale the pixel values to the 0-1 range
mnist = tf.keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0

# simple fully connected network with one dropout layer
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')])

# SGD optimizer with a constant learning rate of 0.01
sgd = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=sgd,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train,
          epochs=10,
          validation_split=0.2)
model.summary()

In this example, I use the MNIST dataset from tf.keras.datasets and load the training and test data as (X_train, y_train) and (X_test, y_test). Since the input features are between 0 and 255, I normalize them by dividing by 255.

After that, I create a new sequential model with a single dropout layer using model = tf.keras.models.Sequential. The first layer is a Flatten layer that takes input images of shape (28, 28). The second layer is a Dense layer with 512 neurons and the relu activation function, followed by a Dropout layer with a dropout rate of 0.2, and the final output layer is a Dense layer with 10 neurons and the softmax activation function. The model is then compiled with the SGD optimizer created above, which uses a learning rate of 0.01.

Now we display the model summary using model.summary().


This is how we can use a learning rate with the SGD optimizer in TensorFlow.


TensorFlow learning rate scheduler adam

  • In this example, we will use the ‘adam’ optimizer while compiling the model.
  • Adam is a different optimization algorithm that can be used to train deep learning models instead of stochastic gradient descent.
  • Adam creates an optimization technique that can handle sparse gradients in noisy situations by combining the best features of the AdaGrad and RMSProp algorithms.
  • Adam is rather simple to configure, and the default configuration settings work well for the majority of issues.
  • The name Adam comes from adaptive moment estimation. This optimization approach is a further extension of stochastic gradient descent that is used to adjust network weights during training.
  • The Adam optimizer adapts the learning rate for each network weight separately, unlike SGD training, which maintains a single learning rate for all weights. The designers of the Adam optimization algorithm were aware of the advantages of AdaGrad and RMSProp, two other extensions of stochastic gradient descent.
  • As a result, the Adam optimizer inherits features of both the AdaGrad and RMSProp algorithms. Adam adjusts the learning rate using both the first moment (the mean) of the gradients and the second moment (the uncentered variance), whereas RMSProp uses only the second moment. A simplified sketch of a single Adam update step follows this list.
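
To make the moment estimates concrete, here is a simplified, illustrative sketch of one Adam update step written in plain NumPy; the constants are the usual defaults, and this is not the exact TensorFlow implementation.

import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # first moment: running mean of the gradients
    m = beta1 * m + (1 - beta1) * grad
    # second moment: running mean of the squared gradients
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias correction because m and v start at zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # the effective step size adapts per parameter through v_hat
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2 * w  # gradient of the toy loss w ** 2
    w, m, v = adam_step(w, grad, m, v, t)
    print(t, w)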

Example:

In this example, we will compile the model with the ‘adam’ optimizer and then train it.


import tensorflow as tf

# load the MNIST dataset and scale the pixel values to the 0-1 range
mnist = tf.keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')])

# compile with the Adam optimizer (default learning rate 0.001)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train,
          epochs=10,
          validation_split=0.2)
model.summary()


This is how we can compile and train the model with the ‘adam’ optimizer in TensorFlow.
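
The example above uses Adam with its default, fixed learning rate. If you also want Adam's learning rate to follow a schedule, a schedule object can be passed directly as its learning_rate argument; the values below are illustrative and are not taken from the example above.

import tensorflow as tf

# decay the learning rate by a factor of 0.9 every 1000 training steps
# (smoothly, since staircase defaults to False)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=1000,
    decay_rate=0.9)

adam = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
# model.compile(optimizer=adam, loss='sparse_categorical_crossentropy', metrics=['accuracy'])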


TensorFlow learning rate scheduler cosine

  • Here we will use the cosine decay schedule in the learning rate scheduler by using TensorFlow.
  • It is a form of learning rate schedule in which the learning rate starts at its initial value and decreases along a cosine curve towards a minimum value (a fraction alpha of the initial rate); variants with warm restarts periodically raise the learning rate back up again.

Syntax:

Here is the syntax of the tf.keras.optimizers.schedules.CosineDecay() function; the example below uses the equivalent tf.compat.v1.train.cosine_decay() function.

tf.keras.optimizers.schedules.CosineDecay(
                                          initial_learning_rate,
                                          decay_steps, 
                                          alpha=0.0,
                                          name=None
                                         )
  • It consists of a few parameters
    • initial_learning_rate: A scalar float32 or float64 Tensor or a Python number; it defines the initial learning rate.
    • decay_steps: A scalar int32 or int64 Tensor or a Python number; it specifies the number of steps to decay over.
    • alpha: The minimum learning rate value as a fraction of initial_learning_rate. By default, it takes a 0.0 value.
    • name: The optional name of the operation. By default, the value is None.
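
Before the tf.compat.v1 plotting example below, here is a minimal sketch of using the Keras CosineDecay schedule directly; the values are illustrative.

import tensorflow as tf

# decay from 0.1 towards alpha * 0.1 = 0.0 over 1000 steps
cosine_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    alpha=0.0)

# the schedule can be handed straight to an optimizer ...
optimizer = tf.keras.optimizers.SGD(learning_rate=cosine_schedule)

# ... or called directly to inspect the decayed value at a given step
print(float(cosine_schedule(0)), float(cosine_schedule(500)), float(cosine_schedule(1000)))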

Example:

import matplotlib.pyplot as plt
import tensorflow as tf

# the tf.compat.v1 decay functions build graph ops, so disable eager execution
tf.compat.v1.disable_eager_execution()

empty_val = []
second_empty_val = []
number_of_iteration = 100

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    for global_step in range(number_of_iteration):
        # cosine decay from 0.1 towards alpha * 0.1 over 150 steps
        new_rate_1 = tf.compat.v1.train.cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=150,
            alpha=0.0)
        new_learning_rate_2 = tf.compat.v1.train.cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=150,
            alpha=0.3)
        lr1 = sess.run([new_rate_1])
        lr2 = sess.run([new_learning_rate_2])

        empty_val.append(lr1[0])
        second_empty_val.append(lr2[0])

x = range(number_of_iteration)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, second_empty_val, 'r-', linewidth=2)  # alpha = 0.3
plt.plot(x, empty_val, 'g-', linewidth=2)         # alpha = 0.0
plt.title('cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.show()

In the given code, we import TensorFlow and matplotlib and then call the tf.compat.v1.train.cosine_decay() function, passing it the learning rate and the number of decay steps. The green curve uses alpha=0.0, so it decays towards zero, while the red curve uses alpha=0.3, so it levels off at 30 percent of the initial learning rate.


This is how we can use the cosine decay learning rate schedule in TensorFlow.


TensorFlow get learning rate

  • In this section, we will learn how to get the learning rate by using TensorFlow.
  • To perform this particular task, we are going to use the tf.keras.optimizers.Adam() function.
  • Within this function, we will set the learning rate to 0.1 and then run a single optimization step on a variable.

Syntax:

Let’s have a look at the syntax and understand the working of the tf.keras.optimizers.Adam() function.

tf.keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False,
    name='Adam',
    **kwargs
)
  • It consists of a few parameters.
    • learning_rate: A float value, a tf.keras.optimizers.schedules.LearningRateSchedule instance, or a callable that takes no arguments and returns the actual value to use. By default, it takes the value 0.001.
    • beta_1: A float value, a constant float tensor, or a callable that takes no arguments and returns the actual value to use. It is the exponential decay rate for the first moment estimates. By default, 0.9.
    • beta_2: A float value, a constant float tensor, or a callable that takes no arguments and returns the actual value to use. It is the exponential decay rate for the second moment estimates. By default, 0.999.
    • epsilon: A small constant for numerical stability. By default, it takes 1e-07.
    • amsgrad: A boolean that specifies whether to apply the AMSGrad variant of Adam from the paper “On the Convergence of Adam and Beyond”. By default, False.

Example:

import tensorflow as tf

# Adam optimizer created with a learning rate of 0.1
new_optimize_val = tf.keras.optimizers.Adam(0.1)
new_var_val = tf.Variable(20.0)
# simple loss as a callable, as required by minimize() in eager mode
new_loss_val = lambda: (new_var_val ** 2) / 2.0
# run a single optimization step on the variable
step_count = new_optimize_val.minimize(new_loss_val, [new_var_val])

print(step_count)


As you can see, this runs a single Adam optimization step on the variable, and printing step_count shows the optimizer’s iteration counter.
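
If you want to read the learning rate back, either directly from an optimizer or at the end of every epoch while model.fit() is running, here is a small sketch. The PrintLR callback name is ours, and the sketch assumes the optimizer was built with a plain float learning rate.

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

# read the configured learning rate straight from the optimizer object
print(tf.keras.backend.get_value(optimizer.learning_rate))

# a tiny custom callback that prints the learning rate after every epoch
class PrintLR(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = tf.keras.backend.get_value(self.model.optimizer.learning_rate)
        print(f'epoch {epoch}: learning rate = {lr}')

# pass it to training, e.g. model.fit(X_train, y_train, epochs=10, callbacks=[PrintLR()])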


TensorFlow adaptive learning rate

  • Adagrad, Adadelta, RMSprop, and Adam are examples of adaptive gradient descent algorithms that offer an alternative to traditional SGD. These per-parameter learning rate approaches offer a heuristic approach without the need for time-consuming manual hyperparameter tuning.
  • Additionally, Keras has several basic stochastic gradient descent extensions that enable variable learning rates. Little configuration is usually needed because each technique adapts the learning rate itself, often with one learning rate per model weight.
  • It also allows the training algorithm to keep track of the model’s performance and automatically change the learning rate for optimum performance, for example through the ReduceLROnPlateau callback sketched below.
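
As an illustration of the last point, here is a minimal sketch of the ReduceLROnPlateau callback; the monitor, factor, and patience values are illustrative.

import tensorflow as tf

# halve the learning rate when the validation loss has not improved for 2 epochs
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=2,
    min_lr=1e-5,
    verbose=1)

# pass it to training, e.g.
# model.fit(X_train, y_train, epochs=10, validation_split=0.2, callbacks=[reduce_lr])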

Example:

Let’s take an example and understand how an adaptive learning rate works.

import tensorflow as tf

# load the MNIST dataset and scale the pixel values to the 0-1 range
mnist = tf.keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')])

# RMSprop is an adaptive optimizer: it scales the step for every weight
# by a running average of that weight's squared gradients
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
model.compile(optimizer=rmsprop,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train,
          epochs=10,
          validation_split=0.2)
model.summary()



In this Python tutorial, we have focused on using learning rate schedules for machine learning models with TensorFlow. We also looked at some examples of how to use learning rate schedules in TensorFlow. And we have covered these topics.

  • TensorFlow Learning rate Scheduler
  • TensorFlow learning rate scheduler adam
  • TensorFlow learning rate scheduler cosine
  • TensorFlow get learning rate
  • TensorFlow adaptive learning rate