Road to ML Engineer #48 - Semantic Segmentation

Last Edited: 3/19/2025

This blog post introduces semantic segmentation in computer vision.


So far, we have covered image classification and generation for computer vision tasks. However, there are many more computer vision tasks with varying levels of difficulty that have significant real-world use cases. In this article, we will discuss semantic segmentation, which is one of the simpler computer vision tasks.

Semantic Segmentation

The conventional image classification task produces only a single category per image, even though an image may contain many objects of different classes. Semantic segmentation instead applies pixel-wise classification, assigning each pixel a class such as dog, chair, or background, to capture the entire semantics of the image. Since we are classifying every pixel, the output for a single input image with dimensions (H, W, C), where H, W, and C are height, width, and channel (often RGB = 3), will have dimensions (H, W, K), where K is the number of categories.
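As a minimal sketch of this input/output contract (assuming PyTorch, which uses channels-first (N, C, H, W) tensors; the tensor names and sizes here are illustrative, not from any particular model):

```python
import torch

# A hypothetical model output for a batch of 2 RGB images. PyTorch is
# channels-first, so shapes are (N, C, H, W) rather than (H, W, C).
N, K, H, W = 2, 21, 128, 128          # e.g. 21 classes, as in PASCAL VOC
logits = torch.randn(N, K, H, W)      # per-pixel class scores from a model

# Softmax over the class dimension gives a distribution per pixel,
# and argmax yields the predicted class map of shape (N, H, W).
probs = logits.softmax(dim=1)
pred = probs.argmax(dim=1)
print(pred.shape)                     # torch.Size([2, 128, 128])
```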

FCNs

Here, we can notice that we do not need to flatten the image as we do for a simple image classification task; we can build a neural network solely with convolutional layers to transform the images. In fact, such a model is called a Fully Convolutional Network (FCN) and has been used in semantic segmentation tasks. The above is an example of a common FCN architecture performing semantic segmentation.

We can see that the architecture reduces the spatial dimensions (downsamples) while increasing the channel dimension, then re-expands (upsamples) the spatial dimensions to fit the final output shape, almost like an autoencoder, even though the task does not require us to create compressed latent dimensions. This is because we want the receptive fields of the kernels to grow (by compressing the spatial dimensions), so that patterns from granular to broad are captured and carried over to subsequent layers.
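To make the downsample-then-upsample idea concrete, here is a minimal FCN-style sketch in PyTorch; the layer sizes are illustrative assumptions, not the architecture from the FCN paper.

```python
import torch
from torch import nn

class TinyFCN(nn.Module):
    """Minimal FCN sketch: downsample with strided convs, upsample back."""
    def __init__(self, in_ch=3, num_classes=21):
        super().__init__()
        self.down = nn.Sequential(            # (N, 3, H, W) -> (N, 128, H/4, W/4)
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.up = nn.Sequential(              # back up to (N, K, H, W)
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, num_classes, 2, stride=2),
        )

    def forward(self, x):
        return self.up(self.down(x))          # per-pixel class logits

x = torch.randn(1, 3, 64, 64)
print(TinyFCN()(x).shape)                     # torch.Size([1, 21, 64, 64])
```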

U-Net

Although the expanding channel dimension can, to some extent, carry recognized patterns of varying granularity over to subsequent layers, there is a limit to how much information it can retain. More importantly, this structure is prone to unstable gradients that greatly hinder learning. To overcome these problems, U-Net uses a symmetrical architecture with skip connections that pass latents with varying receptive-field sizes directly to the corresponding decoder layers and allow gradients to flow back to the earliest layers.
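Below is a toy two-level sketch of the skip-connection idea in PyTorch; again, the layer sizes are illustrative assumptions rather than the original U-Net configuration.

```python
import torch
from torch import nn

class TinyUNet(nn.Module):
    """Toy U-Net: the encoder feature map is concatenated into the decoder."""
    def __init__(self, in_ch=3, num_classes=21):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(64, 128, 3, stride=2, padding=1)   # H -> H/2
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)       # H/2 -> H
        # After concatenating the skip, channels double: 64 + 64 = 128.
        self.dec = nn.Conv2d(128, num_classes, 3, padding=1)

    def forward(self, x):
        skip = self.enc(x)                    # high-resolution features
        z = torch.relu(self.down(skip))       # downsampled bottleneck
        z = torch.relu(self.up(z))            # upsample back to input size
        z = torch.cat([z, skip], dim=1)       # skip connection
        return self.dec(z)                    # per-pixel class logits

x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)                    # torch.Size([1, 21, 64, 64])
```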


Despite its simple architecture, U-Net demonstrates strong capability in semantic segmentation, as well as in other tasks that expect an output of the same size as the input and do not aim to produce smaller latent representations for data compression or generative purposes. (Skip connections prevent the model from compressing all features into a latent space and reconstructing images solely from sampled latents.) Such tasks include diffusion-based generation, which I will cover in a future article.

Mean IoU

Intersection over union (IoU) is a metric used across various segmentation tasks: it divides the intersection between the predicted segmentation mask and the ground-truth mask by their union. For semantic segmentation, we use mean IoU (mIoU), which averages IoU over classes to account for class imbalance. The equation for mIoU is shown below.

\text{mIoU} = \frac{1}{k}\sum_{i=1}^k \frac{N_{ii}}{\sum_{j=1}^k N_{ij} + \sum_{j=1}^k N_{ji} - N_{ii}}

Here, N_{ij} is the set of pixels with true class i predicted as class j, and k is the total number of classes. We can interpret N_{ii} as true positives, \sum_{j=1}^k N_{ij} as false negatives plus true positives, and \sum_{j=1}^k N_{ji} as false positives plus true positives, so the per-class IoU reads as \frac{\text{TP}}{\text{FN} + \text{FP} + \text{TP}}. Unlike accuracy, it contains \text{TN} in neither the numerator nor the denominator, which lets us sensibly take a mean over classes to tackle class imbalance.
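As a minimal sketch, mIoU can be computed from a confusion matrix in NumPy (the function and example values here are hypothetical; production implementations typically also ignore unlabeled pixels):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Compute mIoU from flat arrays of predicted and true class indices."""
    # Confusion matrix n, where n[i, j] counts pixels of true class i
    # predicted as class j (the N_ij from the equation above).
    n = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(n, (target, pred), 1)

    tp = np.diag(n)                      # N_ii: true positives per class
    fn_tp = n.sum(axis=1)                # sum_j N_ij: row sum over true class i
    fp_tp = n.sum(axis=0)                # sum_j N_ji: column sum over predicted i
    union = fn_tp + fp_tp - tp
    iou = tp / np.maximum(union, 1)      # guard against division by zero
    return iou.mean()

# Toy example: 3 classes over 6 pixels -> per-class IoU [1/3, 2/3, 1/2].
target = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, target, 3))         # 0.5
```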

Conclusion

In this article, we discussed the problem definition of semantic segmentation, the major model architectures for solving it, and the metric used to evaluate it. As we saw, semantic segmentation is pixel-wise classification that is relatively simple to perform, but the resulting models cannot distinguish between multiple object instances of the same class, which can be problematic for some tasks (and even the best models still only achieve around 0.7 mIoU). From the next article, we will delve into more complex and challenging computer vision tasks.
