From Layers to Architectures
In the previous tutorial, we learned about the basic building blocks of CNNs: the convolutional and pooling layers. But how do we arrange these blocks to create a powerful image recognition model? This arrangement is called an architecture.
For years, researchers competed in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a competition that became the proving ground for new computer vision ideas. Two of the most influential architectures to emerge from this era were VGG and ResNet, each introducing a critical concept that pushed the boundaries of what was possible.
VGG: The Power of Depth and Simplicity
The VGG network, developed by the Visual Geometry Group at Oxford, was built on a simple yet profound hypothesis: depth is the key to performance. Before VGG, many architectures used a variety of different filter sizes. The VGG team decided to test the effect of making the network deeper by exclusively using stacks of very small 3×3 convolutional filters.
The architecture consists of several "VGG blocks." Each block contains a sequence of two or three convolutional layers (all with 3×3 filters) followed by a single max-pooling layer to reduce the dimensions. This pattern is repeated, and as you go deeper into the network, the number of filters in the conv layers doubles (e.g., 64 -> 128 -> 256).
Why are small 3×3 filters so effective?
You might think a larger filter, say 5×5, would be better at seeing larger patterns. However, a stack of two 3×3 conv layers has the same effective receptive field as a single 5×5 conv layer. But the stacked approach has two major advantages:
- More Non-Linearity: Each conv layer is followed by a non-linear activation function (like ReLU). Using two 3×3 layers means you get to apply two activation functions instead of one, making the network's decision function more discriminative and powerful.
- Fewer Parameters: For a given number of channels, a single 5×5 filter has 25 weights, while two stacked 3×3 filters have 2×(3×3)=18, a reduction of about 28%. This makes the network more efficient and less prone to overfitting.
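The parameter comparison above scales with the number of channels. A quick sketch of the arithmetic (the `conv_params` helper and the choice of 64 channels are illustrative, not from any library):

```python
def conv_params(kernel_size, c_in, c_out):
    """Weights in one conv layer: kernel_size^2 * c_in * c_out (biases ignored)."""
    return kernel_size * kernel_size * c_in * c_out

c = 64  # assume 64 input and 64 output channels, as in an early VGG stage

single_5x5 = conv_params(5, c, c)       # 25 * 64 * 64 = 102400
stacked_3x3 = 2 * conv_params(3, c, c)  # 2 * 9 * 64 * 64 = 73728

print(single_5x5, stacked_3x3)  # 102400 73728
```

The same 25-vs-18 ratio holds regardless of channel count, since both layouts multiply by the same c_in × c_out factor.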
VGG's main drawback is its sheer size. The popular VGG16 model has about 138 million parameters, making it very computationally expensive to train and deploy.
Here's how you could define a single VGG-style block in Keras:
```python
import tensorflow as tf
from tensorflow.keras import layers, models

def vgg_block(num_convs, num_filters):
    """Creates a VGG-style block."""
    block = models.Sequential()
    for _ in range(num_convs):
        # Add a convolutional layer with a 3x3 filter
        block.add(layers.Conv2D(num_filters, (3, 3), padding='same', activation='relu'))
    # Add a max pooling layer after the convolutions
    block.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))
    return block

# Example of creating a block with 2 conv layers and 64 filters
example_block = vgg_block(num_convs=2, num_filters=64)
```
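To see the filter-doubling pattern in action, here is a sketch of stacking three such stages with the functional API. The stage list `[(2, 64), (2, 128), (3, 256)]` is a simplified illustration; the real VGG16 uses five stages and large fully connected layers on top:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(224, 224, 3))
x = inputs
# Each stage: a few 3x3 convs, then a 2x2 max pool that halves the spatial size
for num_convs, num_filters in [(2, 64), (2, 128), (3, 256)]:
    for _ in range(num_convs):
        x = layers.Conv2D(num_filters, (3, 3), padding='same', activation='relu')(x)
    x = layers.MaxPooling2D((2, 2), strides=(2, 2))(x)

print(x.shape)  # (None, 28, 28, 256): 224 halved three times, filters doubled twice
```

Each pooling layer halves the height and width while the filter count doubles, so the representation trades spatial resolution for channel depth as it gets deeper.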
ResNet: Conquering the Vanishing Gradient Problem
After VGG proved that depth was crucial, a logical next step was to build even deeper networks. However, researchers quickly hit a wall. As they added more and more layers, the models became incredibly difficult to train, and their performance actually got worse, even on the training data, so this was not simply overfitting. This is known as the degradation problem.
The root cause is often the vanishing gradient problem. Think of training a network like a game of telephone. The error (or "loss") calculated at the end of the network has to send a "correction" signal backward through all the layers to tell them how to adjust their weights. In a very deep network, this signal gets repeatedly multiplied by small numbers at each layer, causing it to shrink exponentially. By the time it reaches the early layers, the signal is so weak (it has "vanished") that those layers learn extremely slowly or not at all.
The creators of ResNet (Residual Network) came up with an ingenious solution: the residual block.
This block introduced the "skip connection" (or shortcut). The input to the block, x, is not only passed through the block's convolutional layers (which produce a result F(x)), but it's also added directly to the output. The final output is therefore H(x)=F(x)+x.
Why is this so revolutionary? 💡
The skip connection creates a direct path, an "express lane," for the gradient to flow backward. The gradient doesn't have to pass through every intermediate layer's transformations, so it no longer shrinks exponentially with depth. This largely eliminates the vanishing gradient problem in practice.
Furthermore, it reframes what the layers are learning. Instead of learning the entire desired output function H(x), the layers are now only learning the residual, F(x)=H(x)−x. It's much easier for a stack of layers to learn to output zero (if no change is needed) than it is to learn the identity function (H(x)=x). The skip connection provides the identity function for free.
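A residual block can be sketched in Keras with the functional API. This is a simplified version: it omits the batch normalization layers used in the actual ResNet, and it assumes the input and output shapes match so the identity shortcut can be added directly (real ResNets use a 1×1 convolution on the shortcut when shapes differ):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Basic residual block: output is F(x) + x via an identity skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)  # F(x), no activation yet
    y = layers.Add()([y, shortcut])      # H(x) = F(x) + x
    return layers.Activation('relu')(y)  # activation applied after the addition

# Shapes must match for the addition: filters must equal the input's channel count
inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, filters=64)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)  # (None, 32, 32, 64)
```

Note that the final ReLU comes after the addition, so the shortcut itself stays a pure identity path and the convolutions only have to model the residual.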
This simple but powerful idea allowed ResNet to successfully train networks over 150 layers deep, shattering performance records and becoming one of the most important and widely used architectures in deep learning.