Road to ML Engineer #50 - YOLOv11

Last Edited: 4/1/2025

This blog post introduces YOLOv11 (and its predecessor YOLOv8) in computer vision.


In the previous article, we discussed the problem definition of object detection (and instance segmentation), its metrics, and some two-stage object detectors, briefly mentioning single-stage detectors. Although it is not the only single-stage approach, the YOLO (You Only Look Once) family of models has been achieving state-of-the-art performance in object detection with various tricks. Thus, in this article, we will discuss the second most recently published YOLO model (released in September 2024), YOLOv11, and the ideas behind it.

YOLOv8

The architecture and approach of YOLOv11 are similar to those used in YOLOv8, so we can start by understanding YOLOv8 first. YOLOv8 takes an anchor-free approach: the bounding box is inferred directly instead of by adjusting predefined anchor boxes. To ensure boxes of different sizes and aspect ratios can still be generated, multiple heads are set up, each dedicated to predicting boxes at a different scale from features at a different level (more on this later). This approach results in simpler and more robust box prediction, capable of handling objects with extreme sizes and aspect ratios. The following describes the basic building blocks used in YOLOv8.

YOLOv8 Blocks

The Conv block is the most common block, consisting of a convolutional layer, a batch normalization layer, and SiLU (Sigmoid Linear Unit) activation. SiLU is computed as x × sigmoid(x) and provides a smooth, continuous, and differentiable activation with more expressive gradients than ReLU, which suffers from the "dying ReLU" problem (neurons get stuck outputting zero). Its gradients are stable around zero, which works well with batch normalization, making it a popular choice of activation function for object detection (despite its higher computational cost). The Bottleneck block has a skip connection, similar to ResNet blocks, for training deep models, and the skip connection can be turned on or off.
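As a minimal sketch (assuming PyTorch; the names and defaults here are illustrative, not Ultralytics' exact implementation), the two blocks might look like this:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution -> BatchNorm -> SiLU, the basic YOLOv8 unit."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()  # x * sigmoid(x)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two ConvBlocks with an optional ResNet-style skip connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBlock(c, c, k=3)
        self.cv2 = ConvBlock(c, c, k=3)
        self.add = shortcut  # the skip connection can be turned on or off

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y
```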

(Figure: YOLOv8 architecture)

Using these basic blocks, YOLOv8 builds a C2f block (a faster Cross Stage Partial bottleneck with two convolutions), where the output of a Conv block is split in half along the channel dimension, one half is sent through a series of Bottleneck blocks to analyze various features of the feature map, and all intermediate outputs are concatenated before passing through a final Conv block. It also utilizes an SPPF (Spatial Pyramid Pooling - Fast) block, which chains several identical max pooling layers (emulating progressively larger kernels, the "fast" reformulation of spatial pyramid pooling) and concatenates their results to combine features across different receptive fields. The architecture, built from these blocks, consists of a backbone, neck, and head.
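Here is a minimal sketch of C2f- and SPPF-style blocks, reusing the ConvBlock and Bottleneck sketches above; the channel splits and the number of bottlenecks n are simplifying assumptions, not Ultralytics' exact hyperparameters:

```python
class C2f(nn.Module):
    """Split channels, run half through n chained bottlenecks, concat all."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.cv1 = ConvBlock(c_in, c_out, k=1)
        self.m = nn.ModuleList(Bottleneck(c_out // 2) for _ in range(n))
        # (2 + n) halves of c_out reach the final 1x1 conv
        self.cv2 = ConvBlock((2 + n) * (c_out // 2), c_out, k=1)

    def forward(self, x):
        a, b = self.cv1(x).chunk(2, dim=1)   # split channels in half
        ys = [a, b]
        for m in self.m:
            ys.append(m(ys[-1]))             # each bottleneck feeds the next
        return self.cv2(torch.cat(ys, dim=1))

class SPPF(nn.Module):
    """Three chained 5x5 max pools approximate pooling at multiple scales."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.cv1 = ConvBlock(c_in, c_in // 2, k=1)
        self.pool = nn.MaxPool2d(5, stride=1, padding=2)
        self.cv2 = ConvBlock(c_in // 2 * 4, c_out, k=1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)  # stacked pools emulate larger kernel sizes
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```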

The backbone extracts features across multiple scales, producing feature maps at three different resolutions while remaining efficient. These feature maps are then further processed and concatenated in the neck. The rich features at three scales (often referred to as P3, P4, and P5) are passed to three heads, which directly predict the class and size of the bounding boxes. To further enhance training, the method employs techniques like annealed mosaic augmentation: multiple images are initially combined into a single image to create a richer training signal, facilitating early learning, and the reliance on this mosaic augmentation is then gradually reduced over time, allowing for finer adjustments and better alignment with realistic image data.
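As a concrete illustration, here is a toy sketch of mosaic augmentation with a simple annealing schedule, assuming four same-sized CHW image tensors; the mosaic and mosaic_prob helpers are hypothetical names, and real pipelines also sample a random mosaic center, rescale the images, and remap the box labels:

```python
import torch

def mosaic(imgs):
    """Tile four CHW images into one 2x2 mosaic (labels omitted for brevity)."""
    assert len(imgs) == 4
    top = torch.cat([imgs[0], imgs[1]], dim=2)     # concat along width
    bottom = torch.cat([imgs[2], imgs[3]], dim=2)
    return torch.cat([top, bottom], dim=1)         # concat along height

def mosaic_prob(epoch, total_epochs, close_frac=0.1):
    """Anneal: disable mosaic for the final fraction of training."""
    return 0.0 if epoch >= total_epochs * (1 - close_frac) else 1.0

# Usage: apply mosaic only while the annealing schedule allows it, e.g.
# if torch.rand(1) < mosaic_prob(epoch, 300): img = mosaic(batch_of_four)
```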

YOLOv11

The architecture and approach of YOLOv11 are quite similar to those of YOLOv8, with a few modifications. First, the C2f block is replaced by a C3k2 block, which contains two C3k blocks. The C3k block is quite similar to the C2f block, except that it does not split the features and uses smaller (3x3) kernels for efficiency. YOLOv11 also introduces a C2PSA (Cross Stage Partial with Spatial Attention) block, which applies QKV (query-key-value) attention for spatial attention over the small feature map produced after the SPPF block. The following details the blocks and architecture of YOLOv11.

(Figure: YOLOv11 blocks and architecture)
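Below is a rough sketch of the QKV spatial attention at the heart of a C2PSA-style block, assuming single-head attention and simplified 1x1 projections; this is not Ultralytics' exact implementation:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """QKV self-attention over the spatial positions of a feature map."""
    def __init__(self, c):
        super().__init__()
        self.qkv = nn.Conv2d(c, 3 * c, kernel_size=1)
        self.proj = nn.Conv2d(c, c, kernel_size=1)
        self.scale = c ** -0.5

    def forward(self, x):
        b, c, h, w = x.shape
        # Project to queries, keys, values, each flattened to (B, C, H*W)
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)
        # (B, N, N) attention map over the N = H*W spatial positions
        attn = torch.softmax(q.transpose(1, 2) @ k * self.scale, dim=-1)
        y = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.proj(y)  # residual keeps the block easy to train
```

Because attention cost grows quadratically with H*W, placing this after the SPPF block, where the feature map is smallest, keeps it affordable.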

The introduction of new blocks, including attention mechanisms, made YOLOv11 superior to its predecessors, achieving approximately 0.55 mAP@0.5:0.95 (averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05) for the largest model at a relatively low latency of roughly 12 ms per image. (For detailed comparisons, we recommend checking the official Ultralytics website, cited at the bottom of the article.) For practice, we recommend implementing the blocks and the entire model in PyTorch, as demonstrated by Rao, N. S. (2024).
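If you just want to run the released model rather than reimplement it, the ultralytics package exposes a compact API; the snippet below is a minimal sketch (the yolo11n.pt nano checkpoint and sample image URL are the ones commonly used in their examples, so check their docs for current variants):

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # nano variant; weights download on first use
results = model("https://ultralytics.com/images/bus.jpg")
results[0].show()           # visualize the predicted boxes
```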

Non-Maximum Suppression

The object detection models we've covered so far produce a large, fixed number of bounding box predictions with varying degrees of confidence, which can result in low-quality, duplicate predictions for the same objects. To eliminate these redundant boxes, we use non-maximum suppression (NMS). NMS determines clusters of boxes that predict the same object class, using a chosen IoU threshold (e.g., 0.5), and retains only the box with the highest confidence score within each cluster.
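Here is a minimal single-class NMS sketch in PyTorch; the nms helper and its thresholds are illustrative, and in practice torchvision.ops.nms provides an optimized implementation, with multi-class NMS running it per class (or offsetting boxes by class):

```python
import torch

def nms(boxes, scores, iou_thresh=0.5, score_thresh=0.25):
    """Greedy NMS on (N, 4) xyxy boxes; returns kept indices (post-filter)."""
    keep_mask = scores > score_thresh          # cheap confidence pre-filter
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        # IoU of the highest-scoring box against the remaining boxes
        rest = boxes[order[1:]]
        tl = torch.maximum(boxes[i, :2], rest[:, :2])
        br = torch.minimum(boxes[i, 2:], rest[:, 2:])
        inter = (br - tl).clamp(min=0).prod(dim=1)
        area_i = (boxes[i, 2:] - boxes[i, :2]).prod()
        area_r = (rest[:, 2:] - rest[:, :2]).prod(dim=1)
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]   # drop overlapping duplicates
    return torch.tensor(keep)
```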

Therefore, even YOLO models are not truly end-to-end at inference time and are affected by NMS, especially when dealing with many classes. We can reduce the computation NMS requires to some extent by first removing boxes with low confidence scores, regardless of their class or position (the score threshold pre-filter in the sketch above). However, recent research explores removing NMS from object detection models entirely, which I will cover in the next article.

Conclusion

In this article, we introduced YOLOv8 and YOLOv11, both single-stage detectors with an anchor-free approach that uses multiple detection heads for different scales instead of relying on predefined anchor boxes. However, YOLOv11 is already no longer the state-of-the-art model, owing to tremendous research efforts and the extremely rapid advancement of the field, and we will cover some of the more advanced models in future articles.

Resources