In this TensorFlow tutorial, **I will explain the TensorFlow Convolution Neural Network (CNN), and you will learn how to create CNNs from scratch.**

Computer vision is a field of artificial intelligence that allows computers to interpret and understand visual data.

As a TensorFlow developer, you must know how to build a neural network that allows machines to see and understand the things around us as humans do.

So here, I will teach you about convolution neural networks and where they can be used. You will also understand how they work from theoretical and mathematical perspectives.

After this tutorial, you can build and train your own CNNs with different architectures. I will also introduce other well-known CNN architectures, which you can explore and use in your projects.

In this tutorial, I will be explaining the following things:

- What is a TensorFlow Convolution Neural Network?
- How Does a Convolution Neural Network Work?
- Understanding Mathematics behind Convolution Neural Network
- Building TensorFlow Convolution Neural Network

## What is a TensorFlow Convolution Neural Network?

**A Convolution Neural Network (CNN) is a type of neural network that allows machines to interpret and understand visual data (images). Generally, it is used for tasks such as object detection, image classification, and facial recognition.**

A convolution neural network consists of the following types of layers:

- Convolution Layers
- Pooling Layers
- Fully Connected Layers

It can automatically learn a hierarchical representation of features from the raw pixel data.

In simple words, CNNs can learn patterns and features from the data without any manual programming, and this is the fundamental characteristic of neural networks, where the model adjusts its internal parameters during training to improve its performance on a given task.

CNNs organize the learned features into hierarchical layers. For example, in image classification, the lower layers detect simple patterns like edges and textures, while the higher layers combine them into more complex shapes. This hierarchical organization allows the network to understand increasingly abstract concepts as data moves through the layers.

CNNs directly process raw pixel values from images as input. They analyze these pixel values through successive layers, extracting meaningful features contributing to the network’s understanding of the image content.

### How Does a Convolution Neural Network Work?

As you know, this neural network consists of convolution, pooling and fully connected layers.

CNNs work by feature extraction; in the initial layers (convolutional layers), low-level features, such as edges and textures, are extracted.

To detect a pattern in the provided image, each convolutional filter scans the input image and computes dot products between its weights and the pixels it covers.

As the network progresses through the subsequent layers, it learns to combine low-level features into higher-level representations. This process enables CNNs to detect more complex structures within images.

In general:

- In Convolutional Neural Networks (CNNs), the filters are like small windows that slide across the entire input image. Although they’re small in width and height, they cover the entire depth of the input image. Each filter is designed to detect a specific type of feature in the input image.
- In the convolution layer, we move the filter or kernel to every possible position on the input matrix. At each position, we multiply the values in the input matrix that are covered by the filter with the corresponding values in the filter itself. Afterwards, we sum up all these multiplied values.
- As the filter moves across every possible spot on the input image, it’s like exploring to see if a certain feature appears anywhere in the image.
- The outcome of this process is what we call the “feature map.”
- In Convolutional Neural Networks, a layer can learn many filters at the same time. Each filter produces its own feature map, and these feature maps are stacked along the depth dimension to produce the layer’s output.
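The points above can be sketched in plain Python. This is only an illustration of the idea, not how TensorFlow implements it: the function name `convolve`, the tiny 3×3 image with 2 channels, and the two filters are all made up for this example. Each filter spans the full depth of the image, and the results of the two filters are stacked along the depth of the output.

```python
def convolve(image, filters):
    """Slide each filter over `image` (no padding, stride 1).

    image:   height x width x depth nested lists
    filters: list of filters, each k x k x depth
    returns: out_height x out_width x len(filters) nested lists
    """
    height, width, depth = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0])
    output = []
    for i in range(height - k + 1):
        row = []
        for j in range(width - k + 1):
            values = []
            for f in filters:
                # Multiply the covered values with the filter and sum them up
                total = 0
                for di in range(k):
                    for dj in range(k):
                        for c in range(depth):
                            total += image[i + di][j + dj][c] * f[di][dj][c]
                values.append(total)
            row.append(values)  # one value per filter = output depth
        output.append(row)
    return output


# Hypothetical 3x3 image with 2 channels
image = [
    [[1, 0], [2, 1], [0, 1]],
    [[0, 2], [1, 1], [3, 0]],
    [[2, 1], [0, 0], [1, 2]],
]
# Two hypothetical 2x2x2 filters
sum_filter = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]     # adds up the whole window
first_channel = [[[1, 0], [0, 0]], [[0, 0], [0, 0]]]  # keeps the top-left pixel, channel 0

feature_maps = convolve(image, [sum_filter, first_channel])
print(feature_maps)  # [[[8, 1], [9, 2]], [[7, 0], [8, 1]]]
```

The output is 2×2 in width and height, and its depth is 2 because two filters were applied.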

Next, the pooling layers reduce the spatial dimensions of the feature maps, keeping important information while discarding redundant or unwanted details. This downsampling helps the network build a spatial hierarchy of features and gives it a degree of translation invariance, allowing CNNs to recognize objects even when their position in the image shifts.

A pooling layer has two main settings:

- The window size, which is like the size of the area we’re considering.
- The stride, which is like how big of a step we take as we move the window across the image.

In each window, we make a choice: either we pick the largest number we find, or we average all the numbers we find, depending on whether we’re doing max pooling or average pooling.

This pooling process works independently on each feature map (each depth slice of the input), shrinking its width and height, and then stacks the results back together.
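Here is a minimal plain-Python sketch of pooling on a single feature map, assuming a square window and no padding. The function name `pool2d` and the 4×4 values are made up for illustration.

```python
def pool2d(feature_map, window, stride, mode="max"):
    """Pool one 2D feature map with a square window and the given stride."""
    pooled = []
    for i in range(0, len(feature_map) - window + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - window + 1, stride):
            # Collect every value covered by the window at this position
            values = [feature_map[i + di][j + dj]
                      for di in range(window) for dj in range(window)]
            if mode == "max":
                row.append(max(values))
            else:
                row.append(sum(values) / len(values))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map
fmap = [
    [1, 3, 2, 0],
    [5, 6, 1, 2],
    [0, 2, 4, 4],
    [3, 1, 8, 0],
]
max_pooled = pool2d(fmap, window=2, stride=2, mode="max")
avg_pooled = pool2d(fmap, window=2, stride=2, mode="average")
print(max_pooled)  # [[6, 2], [3, 8]]
print(avg_pooled)  # [[3.75, 1.25], [1.5, 4.0]]
```

With a 2×2 window and a stride of 2, the 4×4 feature map shrinks to 2×2 in both cases.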

Next, there is one more type of layer, the normalization layer:

- As their name implies, normalization layers adjust the output from the previous layers to make it more uniform. They’re placed between the convolution and pooling layers, giving each network layer more freedom to learn independently and preventing the model from becoming too fixated on specific details.
- However, older normalization schemes of this kind, such as local response normalization, are rarely used in more recent architectures because they were found to add little to the training process. Batch normalization, a different technique, remains common.

The final layers are the fully connected layers, which interpret the learned features and classify the input image into different categories.

- The Convolutional Layer, working hand in hand with the Pooling Layer, makes up a key building block in the Convolutional Neural Network. Depending on how complex the task is, we might add more of these blocks to capture even smaller details. However, keep in mind that adding more layers also means needing more computational power.
- Once we’ve gathered all the important features, we’re going to flatten them out into a single list. Then, we’ll hand this list over to a regular fully-connected neural network to do the final job of classifying the image.

Here is how an image passes through a CNN, step by step:

- **Convolutional Layer:** Here, multiple filters extract features such as edges from the image.
- **ReLU Layer:** Applies the ReLU activation function to introduce non-linearity, which allows the network to learn more complex patterns.
- **Pooling Layer:** The result from the convolution is a feature map containing the image’s features, so the pooling layer decides which features to keep and which to discard. It reduces the spatial dimensions of the feature map and keeps only the most significant features while neglecting the less important ones.
- **Fully Connected Layers:** After numerous convolutional and pooling operations, the network uses one or more fully connected layers to categorise the image based on the features extracted and learned through the previous layers.

These are the steps from the raw pixel data to an interpretable output (such as a class label), which show how CNNs learn to recognize patterns, objects, and more from visual input data.
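Two of these steps, the ReLU activation and the flattening that happens before the fully connected layers, are simple enough to sketch in a few lines of plain Python. The function names and the sample values are made up for illustration.

```python
def relu(feature_map):
    """Replace every negative value with 0 and keep the rest unchanged."""
    return [[max(0, value) for value in row] for row in feature_map]


def flatten(feature_map):
    """Turn a 2D feature map into a single flat list."""
    return [value for row in feature_map for value in row]


# Hypothetical 2x2 feature map coming out of a convolution
fmap = [[-3, 5], [2, -1]]
activated = relu(fmap)
vector = flatten(activated)
print(activated)  # [[0, 5], [2, 0]]
print(vector)     # [0, 5, 2, 0]
```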

### Understanding Mathematics behind Convolution Neural Network

Now that you know how convolution neural networks work from a theory perspective, here I will explain the mathematical concepts behind each layer of a convolution neural network, which is about how CNNs process the data.

Let’s begin,

When the image is input to a convolution neural network, the first convolution layer applies the filter (or kernel) to the input image. It performs a mathematical operation which is called convolution.

Each filter detects specific patterns within the image.

So, input images are provided as pixel values. Here, the convolution operation involves element-wise multiplication of the filter weights with the corresponding pixel values in the input image, followed by summing the results to produce a single value for each position in the output feature map.

For example, let’s say you have a 3×3 input image and a 2×2 filter.

Below are the pixel values of the 3×3 input image.

```
[2, 4, 5]
[1, 4, 2]
[7, 6, 8]
```

The 2×2 filter is shown below. The filter, also called the kernel, is a small matrix that produces effects such as edge detection. In a CNN, the filter values are learned through the training process.

```
[1, 0]
[0, 1]
```

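The filter is placed at each of the four possible positions on the image, and at each position the overlapping values are multiplied and summed. For example, the top-left position gives 2×1 + 4×0 + 1×0 + 4×1 = 6. The convolution operation returns the feature map shown below.

```
[6, 6]
[7, 12]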
```
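You can check this arithmetic with a few lines of plain Python. This is a minimal sketch that assumes a square image, a square kernel, no padding and a stride of 1; the function name `convolve2d` is made up for illustration. Like the convolution layers in deep learning libraries, it slides the kernel as it is, without flipping it.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and sum the element-wise products."""
    k = len(kernel)
    size = len(image) - k + 1  # number of positions along each axis
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(size)
        ]
        for i in range(size)
    ]


image = [
    [2, 4, 5],
    [1, 4, 2],
    [7, 6, 8],
]
kernel = [
    [1, 0],
    [0, 1],
]
feature_map = convolve2d(image, kernel)
print(feature_map)  # [[6, 6], [7, 12]]
```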

Next, pooling layers downsample the feature map generated by the convolutional layers, reducing spatial dimension while keeping important information.

A pooling operation, such as max pooling or average pooling, typically involves sliding a window over the feature map and computing a summary statistic (maximum or average) within each window.

In simple words, after convolutional operation, the pooling operation is applied to reduce the size of the feature map generated using convolutional operation.

Max pooling operation takes the maximum value within a sliding window of a specified size.

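Let’s apply max pooling with a 2×2 window and a stride of 2. Convolving the 3×3 image with the 2×2 filter gives a feature map that is only 2×2 (values 6, 6, 7 and 12), so a single window covers all of it, and the result is one value, the maximum.

```
[12]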
```

The stride is how far the window (or filter) moves at each step across the feature map. A stride of 1 moves it one pixel at a time, while a stride of 2 moves it two pixels, and so on.

The pooling operation doesn’t involve any learnable parameters. Instead, it operates on each feature map independently, resizing it spatially.
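The window size and the stride together decide how large the output is. A minimal sketch, assuming no padding (the helper name `output_size` is made up for illustration):

```python
def output_size(input_size, window, stride):
    """Number of positions a window can take along one axis, without padding."""
    return (input_size - window) // stride + 1


print(output_size(2, window=2, stride=2))   # 1  (2x2 feature map, 2x2 pooling)
print(output_size(3, window=2, stride=1))   # 2  (3x3 image, 2x2 filter)
print(output_size(28, window=3, stride=1))  # 26 (28x28 image, 3x3 filter)
```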

After several convolutional and pooling layers, the high-level reasoning in the network is done through fully connected layers. In these fully connected layers, neurons have full connections to all activations in the previous layer.

This stage flattens the previous layer’s output into a vector and uses it as input to generate a final prediction.

The above is a very simple example showing how the raw image input is processed through a convolution neural network.

## Building TensorFlow Convolution Neural Network

Now you know how CNNs work, both in theory and mathematically. Next, I will show you how to create a convolution neural network in TensorFlow. You will use the MNIST dataset provided by TensorFlow and train the CNN model.

First, import TensorFlow into your environment using the code below.

```
import tensorflow as tf
from tensorflow.keras import layers, models
```

The next step is to define the convolution neural network architecture; as you know, it consists of convolution, pooling and dense layers. TensorFlow provides a layer for each of these, which you can create as shown in the code below.

```
def cnn():
    model = models.Sequential()
    # Convolutional layers, with max pooling after the first two
    model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    # Flatten the output of the convolutional layers into a vector
    model.add(layers.Flatten())
    # Fully connected layers
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(10, activation='softmax'))  # Output layer with 10 classes
    return model
```

The **cnn()** function creates a convolution neural network model.

Here, to create convolution layers, TensorFlow provides the **Conv2D()** layer. The first layer accepts the input (raw pixels), so it is defined as **Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)).**

**This layer creates convolutional kernels and accepts input images of 28×28 pixels with 1 channel, which means the images are grayscale rather than RGB.**

Here, the activation function is **'relu'**. To learn more about activation functions, visit this tutorial, Tensorflow Activation Functions.

The 32 filters, each with a 3×3 kernel, are applied to the 28×28 image and output 32 channels (feature maps).

Next, the max pooling operation is performed using **MaxPooling2D((2, 2))**, which means a 2×2 window is applied to the output of the convolution layer to retain only the most important features.

In the same way, the output goes through the next convolution and max pooling layers, followed by a third convolution layer.

After that, the output from the last convolution layer is flattened, or converted into a vector, using **layers.Flatten().**

Finally, the flattened output is passed to fully connected layers, called Dense layers. Here, **layers.Dense(64, activation='relu')** represents a fully connected layer, and **layers.Dense(10, activation='softmax')** is the output layer with 10 classes.
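To follow how the image size changes through the model, you can work out the shapes and parameter counts by hand. The sketch below does this in plain Python, assuming the Keras defaults used in **cnn()**: no padding and a stride of 1 for **Conv2D**, and a stride equal to the window size for **MaxPooling2D**. The helper names are made up for illustration.

```python
def conv_out(size, kernel):
    """Output width/height of a convolution with no padding and stride 1."""
    return size - kernel + 1


def pool_out(size, window):
    """Output width/height of pooling with stride equal to the window size."""
    return size // window


def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


size = 28
size = conv_out(size, 3)  # 26 after Conv2D(32, (3, 3))
size = pool_out(size, 2)  # 13 after MaxPooling2D((2, 2))
size = conv_out(size, 3)  # 11 after Conv2D(64, (3, 3))
size = pool_out(size, 2)  # 5 after MaxPooling2D((2, 2))
size = conv_out(size, 3)  # 3 after Conv2D(64, (3, 3))
flat = size * size * 64   # 576 values after Flatten()

total_params = (
    conv_params(3, 1, 32)
    + conv_params(3, 32, 64)
    + conv_params(3, 64, 64)
    + dense_params(flat, 64)
    + dense_params(64, 10)
)
print(flat, total_params)  # 576 93322
```

You can compare these numbers with the output of `model.summary()` after creating the model.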

After defining the **cnn()** function, the next step is to load the MNIST dataset that will be fed to the CNN; you can use the code below.

```
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)) / 255.0
test_images = test_images.reshape((10000, 28, 28, 1)) / 255.0
```

The MNIST dataset is loaded into **(train_images, train_labels), (test_images, test_labels)** using **mnist.load_data()**.

Here, the **train_images** and **train_labels** datasets will be **used for training the CNN**, and the **test_images** and **test_labels** will be **used for testing the CNN.**

Next, the train and test images are reshaped to add a channel dimension, and the pixel values are scaled to the range [0, 1] by dividing them by 255.0.

Now, create a convolution neural network model by calling the **cnn()** function, as shown below.

`model = cnn()`

Compile the model using the below code.

```
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

The **optimizer is 'adam'**, the **loss function is sparse_categorical_crossentropy** (which suits MNIST because the labels are plain integers from 0 to 9), and the **metric is 'accuracy'**; this configuration is used during the training of the CNN.
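To get a feel for what this loss measures, here is a minimal plain-Python sketch for a single sample: the loss is the negative logarithm of the probability that the model assigned to the correct class. The function below is a simplified illustration, not the Keras implementation, and the probabilities are made up (4 classes instead of 10 to keep it short).

```python
import math


def sparse_categorical_crossentropy(label, probabilities):
    """Loss for one sample: -log of the probability given to the true class."""
    return -math.log(probabilities[label])


# Hypothetical model output for one image
probabilities = [0.05, 0.05, 0.8, 0.1]

loss_if_label_2 = sparse_categorical_crossentropy(2, probabilities)
loss_if_label_3 = sparse_categorical_crossentropy(3, probabilities)
print(round(loss_if_label_2, 3))  # 0.223, confident and correct: low loss
print(round(loss_if_label_3, 3))  # 2.303, the true class got only 0.1: high loss
```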

If you have any doubt about compiling the model, follow this tutorial How to Compile Neural Network in Tensorflow.

Train the CNN model on the loaded MNIST dataset using the code below.

`model.fit(train_images, train_labels, epochs=5, batch_size=64, validation_split=0.1)`

Here, the **train_images** and **train_labels** are passed to the fit() method with **epochs of 5**, **batch_size = 64** and a **validation_split of 0.1**. If you are not familiar with these parameters, then follow this tutorial Training Neural Network in TensorFlow.
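These numbers determine how much data the model sees in each step. The sketch below works it out in plain Python, assuming the 60,000 MNIST training images and that Keras sets aside the last fraction of the data for validation, as its documentation for `validation_split` describes. The variable names are made up for illustration.

```python
import math

num_samples = 60000       # MNIST training images
validation_split = 0.1
batch_size = 64

# Keras holds out the last 10% of the samples for validation
num_train = math.floor(num_samples * (1 - validation_split))
num_val = num_samples - num_train

# Each epoch processes the training samples in batches of 64
steps_per_epoch = math.ceil(num_train / batch_size)

print(num_train, num_val, steps_per_epoch)  # 54000 6000 844
```

So each of the 5 epochs runs 844 batches on 54,000 images and then checks the model on the remaining 6,000.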

While training the convolution neural network model on the MNIST dataset, it reports the loss and accuracy for each epoch.

In this run, the **accuracy is 0.9931**, which means the **model classifies about 99% of the training images correctly**, and the **val_accuracy is 0.98**, which means the **model has also performed well on the validation dataset, with an accuracy of around 98%.**

Now, you can use this model to make predictions on new data. To evaluate the model on the test set, use the code below.

```
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)
```

When the **test_images** and their corresponding **test_labels** are provided to the model, it evaluates them and returns a test accuracy of **0.98879**, which is about **98.9%** accuracy.
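Accuracy is computed by comparing each label with the class that received the highest probability from the softmax output layer. Here is a minimal plain-Python sketch of that last step, using made-up scores for 3 classes instead of 10; the function name `softmax` and the values are for illustration only.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]  # shift for stability
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw scores for one image
scores = [1.0, 3.0, 0.5]
probabilities = softmax(scores)

# The predicted class is the index with the highest probability
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted_class)  # 1
```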

You can fine-tune this convolution neural network to make more accurate predictions.

The above is a common, fundamental CNN architecture, but there are different types of CNN architectures, such as **LeNet, AlexNet, VGGNet, GoogLeNet (Inception), ResNet, MobileNet and EfficientNet**.

- **LeNet:** LeNet was one of the first convolutional neural networks. It was made to recognize handwritten digits on bank checks from small, black-and-white pictures.
- **AlexNet:** AlexNet was trained to understand big, colourful pictures. It introduced new tricks, like ReLU activation and overlapping pooling, to make its job easier.
- **VGGNet:** VGGNet was all about keeping things simple. It organized layers into neat blocks and used them over and over again. This helped it learn features effectively.
- **GoogLeNet:** GoogLeNet was smart about saving space. Instead of simply stacking more layers, it used special blocks called Inception blocks. These blocks made it smaller but still powerful.
- **ResNet:** ResNet was built to handle really deep networks. It added shortcuts that helped keep the gradients flowing smoothly during training so that it could learn even better.
- **DenseNet:** DenseNet was like a large network where every layer communicates directly with the others. This made information flow through the network more easily and helped it learn faster.
- **ZFNet:** ZFNet was a tweak of AlexNet. It used slightly smaller filters to avoid losing important details. This made it more efficient at its job.

I hope that you understand how CNNs work and where to use them.

## Conclusion

You learned how to build a TensorFlow convolution neural network from scratch, and you found that CNNs are made up of three types of layers: convolution, pooling and fully connected layers.

Then, you learned the mathematical concepts behind these layers and how they process the raw pixel values.

After that, you learned how to create a convolution neural network using the Conv2D(), MaxPooling2D() and Dense() layers.

Overall, you learned how to create, train and evaluate a TensorFlow convolutional neural network model.

You may like to read:

- Tensorflow Gradient Descent in Neural Network
- Tensorflow Activation Functions
- Training Neural Network in TensorFlow

I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working on Python, machine learning, and artificial intelligence for the last 5 years. During this time, I have gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, Tensorflow, Scipy, Scikit-Learn, etc. for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, etc. Check out my profile.