Beyond Bounding Boxes
We've seen how classification assigns one label to an entire image and how object detection draws boxes around individual objects. Semantic segmentation goes one step further. Its goal is to create a pixel-perfect mask for everything in the image, assigning every pixel to a specific class.
The output of a segmentation model is not a label or a box, but a new image, called a segmentation map, with the same width and height as the original input. In this map, each pixel's value (often visualized as a color) corresponds to the class it belongs to. For example, in an urban driving scene, all pixels belonging to the road class might be colored gray, all "car" pixels blue, and all "pedestrian" pixels red.
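To make this concrete, here is a minimal sketch of what a segmentation map looks like in code: a tiny array of class indices, plus a color palette for visualization. The map size, class indices, and colors are illustrative choices, not outputs of any real model.

```python
import numpy as np

# A hypothetical 4x4 segmentation map for a tiny urban scene.
# Each value is a class index: 0 = road, 1 = car, 2 = pedestrian.
seg_map = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 0, 0],
    [0, 2, 0, 0],
])

# A palette maps each class index to an RGB color for display:
# road -> gray, car -> blue, pedestrian -> red.
palette = np.array([
    [128, 128, 128],  # class 0: road
    [0, 0, 255],      # class 1: car
    [255, 0, 0],      # class 2: pedestrian
])

# NumPy fancy indexing turns the HxW index map into an HxWx3 color image.
color_image = palette[seg_map]
print(color_image.shape)  # (4, 4, 3)
```

Real models output one score per class per pixel; taking an argmax over the class dimension yields exactly this kind of index map.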
This highly detailed understanding is crucial for applications like:
- Autonomous Driving: To understand the exact boundaries of the road, lanes, and sidewalks.
- Medical Imaging: To precisely outline tumors or organs in medical scans (e.g., MRIs, CT scans).
- Satellite Imagery: To map out land cover, identifying forests, water bodies, and urban areas pixel by pixel.
The U-Net Architecture
One of the most famous and effective architectures for semantic segmentation is U-Net. It was originally designed for biomedical image segmentation, but its principles have been applied broadly. Its architecture, when visualized, has a distinct U-shape, which gives it its name.
The U-Net model consists of two main paths:
1. The Encoder (Contracting Path)
The left side of the "U" is the encoder. This part acts like a typical classification network. It consists of a series of convolutional and max-pooling layers. As an image passes through the encoder, it is progressively downsampled—its spatial dimensions (width and height) get smaller, while its depth (number of feature channels) increases. The purpose of the encoder is to capture the context of the image. By the end of this path, the network understands what is in the image, but it has lost precise information about where it is.
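One encoder stage can be sketched in a few lines of PyTorch: convolutions grow the channel depth, and pooling halves the spatial resolution. The channel counts (3 → 64) echo the original U-Net paper, but the exact sizes here are an illustrative assumption.

```python
import torch
import torch.nn as nn

def encoder_stage(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, a typical U-Net building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

stage = encoder_stage(3, 64)
pool = nn.MaxPool2d(2)   # 2x2 max pooling halves width and height

x = torch.randn(1, 3, 128, 128)   # a batch of one 128x128 RGB image
features = stage(x)               # (1, 64, 128, 128): deeper, same size
downsampled = pool(features)      # (1, 64, 64, 64): half the resolution
print(features.shape, downsampled.shape)
```

Stacking several such stages produces the contracting path: each one trades spatial resolution for channel depth, which is exactly the "context for location" exchange described above.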
2. The Decoder (Expansive Path)
The right side of the "U" is the decoder. Its job is to take the compressed, high-level feature map from the encoder and progressively upsample it back to the original image size. This path uses a special type of layer called a transposed convolution (or up-convolution) to increase the width and height. The purpose of the decoder is to localize the features, reconstructing the segmentation map and pinpointing where each feature belongs.
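The upsampling step can be sketched with PyTorch's `nn.ConvTranspose2d`. With a kernel size and stride of 2, it doubles the spatial resolution; the channel counts here (128 → 64) are an illustrative choice mirroring a pooling step.

```python
import torch
import torch.nn as nn

# A transposed convolution with stride 2 doubles width and height,
# while (in this sketch) halving the number of channels.
up = nn.ConvTranspose2d(in_channels=128, out_channels=64,
                        kernel_size=2, stride=2)

x = torch.randn(1, 128, 16, 16)   # compressed feature map from the encoder
y = up(x)
print(y.shape)  # (1, 64, 32, 32)
```

Repeating this block walks the feature map back up to the input resolution, at which point a final 1x1 convolution maps the channels to per-class scores.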
The Secret Ingredient: Skip Connections
With only an encoder and a decoder, the final segmentation map would be blurry and imprecise: much of the fine-grained spatial detail is lost during downsampling in the encoder path and cannot be recovered from the compressed features alone.
The true genius of U-Net lies in its skip connections. These are shortcuts that copy the feature maps from each stage of the encoder and concatenate them with the corresponding stage in the decoder.
This simple addition is incredibly powerful. It gives the decoder direct access to the rich, high-resolution features from the early stages of the network. This allows it to combine the general context from the deep layers (knowing "there is a car") with the precise localization information from the shallow layers (knowing "this specific pixel is part of the car's edge"). This fusion of information is what enables U-Net to produce sharp, highly accurate segmentation masks.
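In code, a skip connection is just a channel-wise concatenation of two feature maps at the same resolution. This sketch assumes PyTorch tensors; the shapes are illustrative.

```python
import torch

# A skip connection: the encoder's feature map, saved before pooling,
# is concatenated with the decoder's upsampled map at the same resolution.
encoder_features = torch.randn(1, 64, 32, 32)  # shallow, high-detail features
decoder_features = torch.randn(1, 64, 32, 32)  # deep, high-context features

# Concatenate along the channel dimension (dim=1 in NCHW layout):
# resolution is preserved, channel counts add up.
merged = torch.cat([encoder_features, decoder_features], dim=1)
print(merged.shape)  # (1, 128, 32, 32)
```

Subsequent convolutions in the decoder then operate on this merged tensor, letting the network blend "what" from the deep features with "where" from the shallow ones.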