# YOLO series Object Detection
> **Object Detection Series (Deep Learning) by Aladdin Persson**
> https://www.youtube.com/playlist?list=PLhhyoLH6Ijfw0TpCTVTNk42NN08H6UvNq
> **AI + IoT 佈建邊緣運算 - 電腦視覺業界專案原理及實作**
> https://www.tenlong.com.tw/products/9786267383032?list_name=lv
*This post is mainly based on the two sources of tutorials above. I will provide examples in both Python and C++. For the C++ part, I will only provide the inference functions within an AI framework built with CUDA C.*
*In this note, I will provide some code snippets to help further explain the code and implementation details. Please refer to the author's GitHub for the full implementation. https://github.com/aladdinpersson/Machine-Learning-Collection/tree/master/ML/Pytorch/object_detection Additionally, I will provide a C++ version, and the full implementation will be in an AI framework with CUDA C \[TODO\] : add URL to github.*
## Intro
In the object detection task, the model must decide what objects are in a given image and output a bounding box marking each object's location. Additionally, object detection requires the model to localize multiple objects in a given image.
**Two Types of BBOX Formats \:**
1. (x1, y1) for the upper-left corner point and (x2, y2) for the bottom-right corner point.
2. Two values for a reference point \(typically the box midpoint\) and the other two values for width and height.
**YOLO \(You Only Look Once\)**
Divide the input image into an S x S grid and, for each grid cell, predict bounding boxes for objects. There will be multiple bounding boxes for one object, so Non-Max Suppression is applied to clean them up.
## Intersection over Union
>https://www.youtube.com/watch?v=XXYG5ZWtjj0&list=PLhhyoLH6Ijfw0TpCTVTNk42NN08H6UvNq&index=2
IoU is a way to quantify how good a predicted bounding box is for an object.
***IoU = Area of Intersection / Area of Union***
***Image Orientation***

In the above image, the intersection region is marked with a yellow rectangle. To calculate its coordinates, the top-left point takes the max of the x & y values of box1 & box2, and the bottom-right point takes the min of the x & y values of box1 & box2.
:::info
:bulb: **Python Programming Technique**
In the Python code in the author's repository, a range index (slice) of length one is used instead of a plain index.
```python
box1_x1 = boxes_preds[..., 0:1]
```
This ***keeps the data dimension*** rather than flattening the data structure (a full Python IoU sketch in this style follows this note).
:::
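As a companion to the C++ version below, here is a minimal Python sketch of a batched IoU in this style (corner format assumed; the function name matches the author's, but the body is my own simplified version):
```python
import torch

def intersection_over_union(boxes_preds, boxes_labels):
    """IoU for corner-format boxes stored in the last dimension as (x1, y1, x2, y2)."""
    # Length-one slices keep the last dimension instead of dropping it
    box1_x1, box1_y1 = boxes_preds[..., 0:1], boxes_preds[..., 1:2]
    box1_x2, box1_y2 = boxes_preds[..., 2:3], boxes_preds[..., 3:4]
    box2_x1, box2_y1 = boxes_labels[..., 0:1], boxes_labels[..., 1:2]
    box2_x2, box2_y2 = boxes_labels[..., 2:3], boxes_labels[..., 3:4]

    # Intersection corners: max of the top-left points, min of the bottom-right points
    x1, y1 = torch.max(box1_x1, box2_x1), torch.max(box1_y1, box2_y1)
    x2, y2 = torch.min(box1_x2, box2_x2), torch.min(box1_y2, box2_y2)

    # clamp(0) handles boxes that do not overlap at all
    intersection = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    box1_area = (box1_x2 - box1_x1).abs() * (box1_y2 - box1_y1).abs()
    box2_area = (box2_x2 - box2_x1).abs() * (box2_y2 - box2_y1).abs()
    return intersection / (box1_area + box2_area - intersection + 1e-6)
```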
*Unlike the author's version, this C++ code does not calculate in batches.*
Since there are two bounding box formats, a conversion function `midpoint2corners` is provided.
```cpp
using bbox_t = struct bbox
{
    uint objClass;
    float confidence;
    float x_tl;
    float y_tl;
    float x_br;
    float y_br;
};
...
// Convert (midpoint x, midpoint y, width, height) into corner format
bbox_t midpoint2corners(float x, float y, float w, float h)
{
    bbox_t box;
    box.confidence = 0.0f;
    box.objClass = 0;
    box.x_tl = x - w / 2;
    box.y_tl = y - h / 2;
    box.x_br = x + w / 2;
    box.y_br = y + h / 2;
    return box;
}

float IoU(const bbox_t& bbox_1, const bbox_t& bbox_2)
{
    // Intersection box: max of the top-left corners, min of the bottom-right corners
    bbox_t bbox_and{ 0, 0.0f, std::max(bbox_1.x_tl, bbox_2.x_tl), std::max(bbox_1.y_tl, bbox_2.y_tl), std::min(bbox_1.x_br, bbox_2.x_br), std::min(bbox_1.y_br, bbox_2.y_br) };
    float bbox_1_area = std::abs(bbox_1.x_tl - bbox_1.x_br) * std::abs(bbox_1.y_tl - bbox_1.y_br);
    float bbox_2_area = std::abs(bbox_2.x_tl - bbox_2.x_br) * std::abs(bbox_2.y_tl - bbox_2.y_br);
    // Clamp to 0 so non-overlapping boxes yield zero intersection area
    float bbox_and_area = std::max(bbox_and.x_br - bbox_and.x_tl, 0.0f) * std::max(bbox_and.y_br - bbox_and.y_tl, 0.0f);
    float iou = bbox_and_area / (bbox_1_area + bbox_2_area - bbox_and_area + 1e-6f);
#ifdef __DEBUG__
    printf("%f\n", bbox_1_area);
    printf("%f\n", bbox_2_area);
    printf("%f\n", bbox_and_area);
    printf("%f\n", iou);
#endif
    return iou;
}
```
:::info
:bulb: **Transform between Midpoint and Corners Bounding Box Format**
To transform from Corner to Midpoint format, we can multiply the coordinates by a matrix, and likewise from Midpoint back to Corner format (a sketch of these matrices follows this note).
*From Corner to Midpoint*

*From Midpoint to Corner*

:::
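As a sketch (assuming corner format $(x_1, y_1, x_2, y_2)$ and midpoint format $(c_x, c_y, w, h)$, which may be ordered differently in the images above), the two linear maps can be written as:
$$
\begin{bmatrix} c_x \\ c_y \\ w \\ h \end{bmatrix}
=
\begin{bmatrix}
0.5 & 0 & 0.5 & 0 \\
0 & 0.5 & 0 & 0.5 \\
-1 & 0 & 1 & 0 \\
0 & -1 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x_1 \\ y_1 \\ x_2 \\ y_2 \end{bmatrix},
\qquad
\begin{bmatrix} x_1 \\ y_1 \\ x_2 \\ y_2 \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & -0.5 & 0 \\
0 & 1 & 0 & -0.5 \\
1 & 0 & 0.5 & 0 \\
0 & 1 & 0 & 0.5
\end{bmatrix}
\begin{bmatrix} c_x \\ c_y \\ w \\ h \end{bmatrix}
$$
The two matrices are inverses of each other, which is why the same trick works in both directions.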
## Non-Max Suppression
NMS is used to clean up bounding boxes. From a list of bounding boxes, **for each class**, we take the bounding box with the highest confidence and calculate its IoU with the other bounding boxes. Any lower-confidence bounding box whose IoU with it exceeds a certain threshold is deleted or ignored.
:::info
:bulb: **In author's Python Implementation**
The `key` parameter of the `sorted` function requires a callable (often a `lambda`) that selects the part of each element used for sorting (a small example follows this note).
https://docs.python.org/zh-tw/3/howto/sorting.html
:::
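For example, assuming each prediction is a list like `[class, confidence, x, y, w, h]` (a hypothetical layout for illustration), sorting by confidence in descending order looks like:
```python
bboxes = [[0, 0.6, 0.5, 0.5, 0.2, 0.3], [1, 0.9, 0.4, 0.4, 0.1, 0.1]]
# key picks out the confidence field of each box; reverse=True gives descending order
bboxes = sorted(bboxes, key=lambda box: box[1], reverse=True)
print([box[1] for box in bboxes])  # [0.9, 0.6]
```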
```cpp
void framework::NMS(std::vector<bbox_t> &bboxes, const float iouThreshold, const float probThreshold)
{
    // Reduce over probability
    bboxes.erase(std::remove_if(bboxes.begin(), bboxes.end(), [&](const bbox_t& a) { return a.confidence < probThreshold; }), bboxes.end());
    // Sort over confidence in descending order
    std::sort(bboxes.begin(), bboxes.end(), [&](const bbox_t& a, const bbox_t& b) { return a.confidence > b.confidence; });
    // Reduce over iouThreshold
    float iou = 0.0f;
    for (int i = 0; i < bboxes.size(); ++i)
    {
        for (int j = i + 1; j < bboxes.size(); ++j)
        {
            // Only suppress boxes of the same class
            if (bboxes[i].objClass != bboxes[j].objClass)
                continue;
            iou = IoU(bboxes[i], bboxes[j]);
#ifdef __DEBUG__
            printf("%d, %d, %f", bboxes[i].objClass, bboxes[j].objClass, iou);
#endif
            if (iou > iouThreshold) {
#ifdef __DEBUG__
                printf(" -> delete\n");
#endif
                // Remove the lower-confidence overlapping box and stay at the same index
                bboxes.erase(bboxes.begin() + j);
                j--;
            }
#ifdef __DEBUG__
            else
                printf("\n");
#endif
        }
    }
}
```
## Mean Average Precision
> Mean Average Precision (mAP) Explained and PyTorch Implementation
> https://www.youtube.com/watch?v=FppOzcDvaDI&list=PLhhyoLH6Ijfw0TpCTVTNk42NN08H6UvNq&index=5
> 深度學習系列: 什麼是AP/mAP?
> https://chih-sheng-huang821.medium.com/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92%E7%B3%BB%E5%88%97-%E4%BB%80%E9%BA%BC%E6%98%AFap-map-aaf089920848
:::info
:bulb: **Precision & Recall**
***Precision** = true positives / (true positives + false positives)* : Of all positive predictions, what fraction was actually correct. \(e.g. face recognition security\)
***Recall** = true positives / (true positives + false negatives)* : Of all ground-truth targets, what fraction did we correctly detect. \(e.g. finding cancer cells\) \(a quick worked example follows this note\)
:::
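A quick worked example: suppose a detector makes 7 positive predictions on a dataset containing 8 ground-truth objects, and 5 of the predictions are correct \(TP = 5, FP = 2, FN = 3\). Then
$$
\text{Precision} = \frac{5}{5 + 2} \approx 0.71, \qquad \text{Recall} = \frac{5}{5 + 3} = 0.625
$$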
:::info
:bulb: **Counter**
https://myapollo.com.tw/blog/python-counter/
In the author's python code.
```python
amount_bboxes = Counter([gt[0] for gt in ground_truths])
for key, val in amount_bboxes.items():
    amount_bboxes[key] = torch.zeros(val)
```
The `Counter` creates a dict with *key = image idx & val = number of ground-truth bboxes in that image*, and the `for` loop replaces each val with a zero tensor of that size (a tiny illustration follows this note).
:::
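A tiny illustration of that behavior (the numbers here are made up):
```python
from collections import Counter
import torch

# Suppose image 0 has 3 ground-truth boxes and image 1 has 2; gt[0] is the image index
image_indices = [0, 0, 0, 1, 1]
amount_bboxes = Counter(image_indices)     # Counter({0: 3, 1: 2})
for key, val in amount_bboxes.items():
    amount_bboxes[key] = torch.zeros(val)  # {0: tensor([0., 0., 0.]), 1: tensor([0., 0.])}
# Later, an entry is flipped to 1 once that ground-truth box has been matched to a detection.
```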
## YOLOv1 Implementation with Pytorch
> Reference \:
> You Only Look Once: Unified, Real-Time Object Detection
> https://arxiv.org/abs/1506.02640
> YOLOv1 from Scratch
> https://www.youtube.com/watch?v=n9_XyCGr-MI&list=PLhhyoLH6Ijfw0TpCTVTNk42NN08H6UvNq&index=5
The YOLO algorithm splits the input image into *S x S* cells, and each cell is responsible for outputting a prediction with a corresponding bounding box. The object's midpoint indicates which cell is responsible for predicting that object.
:::warning
:warning: **Pitfall of Understanding "Responsible"**
In the paragraph above, many tutorials and articles use "responsible" to describe a cell outputting an object's bounding box. However, a cell is just a position in the output we ask for: the model output is "shaped" to line up with the loss function and the labels. **Therefore, when we say cell, the scope of what the cell can see is not limited to that cell but is the part of the output produced by the stack of convolutional layers**, which provide different scales of understanding of the input.
:::
:::info
:bulb: **Import `torch.nn` library**
There are two commonly seen ways to import the `torch.nn` library.
**\#1 Using Aliasing**
```python
import torch
import torch.nn as nn
```
**\#2 Import from**
```python
import torch
from torch import nn
```
Both of these let us use `nn` directly instead of `torch.nn`; which one to use is mostly a matter of style.
:::
### YOLOv1 Output Format
Each bounding box for each cell will have **\[midpoint-x, midpoint-y, width, height\]**. ***All the values are relative to the cell***. Therefore, the midpoint coordinates will be between 0 and 1; however, width and height can be greater than 1 if the object is wider or taller than the cell.

In this case, *C* is the number of object classes (20 in total). *P* is the probability (0 to 1) that there is an object in that cell. In YOLOv1, each cell predicts two bounding boxes, and we expect them to specialize in different shapes (one wider and the other taller). ***Notice that class prediction is done only once per cell, so only one object can be detected in a cell***.
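A quick sanity check of the output size under the usual YOLOv1 setting (S = 7, B = 2, C = 20):
```python
S, B, C = 7, 2, 20        # grid size, boxes per cell, number of classes
per_cell = C + B * 5      # 20 class scores + 2 x (P, x, y, w, h) = 30
total = S * S * per_cell  # 7 * 7 * 30 = 1470 values per image
print(per_cell, total)    # 30 1470
```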
### Model Architecture

> https://youtu.be/n9_XyCGr-MI?list=PLhhyoLH6Ijfw0TpCTVTNk42NN08H6UvNq&t=865
In the tutorial, the author uses a predefined architecture configuration Python `list` which denotes what we see in the image above. The model-building process in the tutorial is based on this config; however, in the *CUDA C AI Framework*, we will list out all the layers to enable easier debugging.
*All the convolution layers are `same` convolutions, i.e. they use a padding of* `kSize / 2`.
```python
# (kernel size, out channel size, stride, padding)
architecture_config = [
    (7, 64, 2, 3),
    "M",
    (3, 192, 1, 1),
    "M",
    (1, 128, 1, 0),
    (3, 256, 1, 1),
    (1, 256, 1, 0),
    (3, 512, 1, 1),
    "M",
    # List: tuples and then last integer represents number of repeats
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0),
    (3, 1024, 1, 1),
    "M",
    # List: tuples and then last integer represents number of repeats
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1),
    (3, 1024, 2, 1),
    (3, 1024, 1, 1),
    (3, 1024, 1, 1),
]
```
**Basic Convolution Block**
```python
class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(CNNBlock, self).__init__()
        self.conv2d = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.batchnorm = nn.BatchNorm2d(out_channels)
        self.leakyrelu = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.leakyrelu(self.batchnorm(self.conv2d(x)))
```
In this convolution block, the author includes an extra batch norm layer. This is not in the original paper because batch normalization had not been invented yet. The next step is to create the feature extractor \[*please refer to the author's GitHub repository for details*\] from the list and attach fully connected layers as in the following code snippet.
```python
def _create_fcs(self, split_size, num_boxes, num_classes):
    S, B, C = split_size, num_boxes, num_classes
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(1024 * S * S, 4096),
        nn.Dropout(0.5),
        nn.LeakyReLU(0.1),
        nn.Linear(4096, S * S * (C + B * 5)),
    )
```
*Please note that these are not the settings **Aladdin Persson** uses in his tutorial; they lean toward what the original paper describes.*
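For reference, here is a minimal sketch of how the feature extractor could be built from `architecture_config`, reusing the `CNNBlock` above (this mirrors the idea of the author's `_create_conv_layers`, but it is my own simplified version; the method would live inside the model class):
```python
def _create_conv_layers(self, architecture):
    layers = []
    in_channels = self.in_channels  # 3 for an RGB input

    for x in architecture:
        if isinstance(x, tuple):   # (kernel_size, out_channels, stride, padding)
            layers.append(CNNBlock(in_channels, x[1], kernel_size=x[0], stride=x[2], padding=x[3]))
            in_channels = x[1]
        elif x == "M":             # 2x2 max pooling halves the spatial resolution
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        elif isinstance(x, list):  # [conv1, conv2, num_repeats]
            conv1, conv2, repeats = x
            for _ in range(repeats):
                layers.append(CNNBlock(in_channels, conv1[1], kernel_size=conv1[0], stride=conv1[2], padding=conv1[3]))
                layers.append(CNNBlock(conv1[1], conv2[1], kernel_size=conv2[0], stride=conv2[2], padding=conv2[3]))
                in_channels = conv2[1]

    return nn.Sequential(*layers)
```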
:::info
:bulb: **Write test in the `model.py` file**
Reference \: https://www.youtube.com/watch?v=n9_XyCGr-MI&list=PLhhyoLH6Ijfw0TpCTVTNk42NN08H6UvNq&index=5
In the YouTube tutorial, the author uses a small test in the `model.py` file to check whether the model is built correctly. In my opinion, this is a good idea worth adopting in future work, since it makes executing `model.py` meaningful and keeps the testing functionality within the file itself rather than elsewhere (a sketch of such a test follows this note).
:::
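A sketch of such a test, assuming the model class is called `Yolov1` and takes `split_size`, `num_boxes`, and `num_classes` (names may differ slightly from the repo):
```python
if __name__ == "__main__":
    model = Yolov1(split_size=7, num_boxes=2, num_classes=20)
    x = torch.randn((2, 3, 448, 448))  # batch of 2 RGB images at 448x448
    print(model(x).shape)              # expect torch.Size([2, 1470]) = 2 x (7 * 7 * 30)
```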
### Loss Functions
> Reference \:
> You Only Look Once: Unified, Real-Time Object Detection
> https://arxiv.org/pdf/1506.02640.pdf
> YOLOv1 詳細解讀
> https://medium.com/@\_Xing_Chen_/yolov1-%E8%A9%B3%E7%B4%B0%E8%A7%A3%E8%AE%80-ff3da6ae6948
***The first loss term calculates the MSE of the midpoint coordinates of the bounding box.***


Notice the indicator function, which means the term is only computed when there is an object in cell *i* and bounding box *j* of that cell is responsible for predicting the ground-truth box of the object.
:::success
:key: **Quote from the paper**
> Note that the above loss function only penalizes classification error **if an object is present in the grid cell**. It also only **penalizes bounding box coordinate error if that predictor is \'\'responsible\'\' for the ground truth box** \(i.e. has the highest IOU of any predictor in that grid cell\).
:::
***The second loss term calculates the MSE of the square roots of the width and height of the bounding box.***

:::success
:key: **Square Root of W & H**
Please refer to the following article for detail.
> YOLOv1 詳細解讀
> https://medium.com/@\_Xing_Chen_/yolov1-%E8%A9%B3%E7%B4%B0%E8%A7%A3%E8%AE%80-ff3da6ae6948
:::
***The third and fourth terms are for the confidence of whether an object exists or not.***

In the third term, since there are two bounding boxes in one cell, *the loss is calculated only on the one with the higher IoU of the two*.

In the fourth term, since ideally both bounding boxes should report no object when the cell contains no object in the ground truth, *we take both bounding boxes into consideration and calculate the loss on both of them*.
***The final term is the classification loss.***

Notice that instead of using an entropy-based classification loss, the paper uses a regression-based (sum-squared error) loss.
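For reference, the complete loss from the paper (with $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$):
$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$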
### Loss Functions Implementations
*In this section, only code snippets are included for explanation.*
**Bounding Box Coordinate Extraction, IoU Calculation and Selection of the Responsible Box**
```python
predictions = predictions.reshape(-1, self.S, self.S, self.C + self.B * 5)
iou_b1 = intersection_over_union(predictions[..., 21:25], target[..., 21:25])
iou_b2 = intersection_over_union(predictions[..., 26:30], target[..., 26:30])
ious = torch.cat([iou_b1.unsqueeze(0), iou_b2.unsqueeze(0)], dim=0)
iou_max, bestbox = torch.max(ious, dim=0)
```
Notice that before `torch.max`, a new dimension is added to each of the two IoU tensors. This lets us apply `torch.max` across the two boxes and pick the larger IoU of the pair. ***To sum up, we want to compare the bounding boxes in `iou_b1` and `iou_b2` element-wise. By adding a new dimension and concatenating along it, we can apply max to choose between them.***
A sample output of `bestbox`:
```bash
tensor([[[[0],
[1],
[1],
[0],
[1],
[0],
[0]],
[[1],
[0],
[1],
[1],
[0],
[0],
[0]],
[[0],
[1],
[0],
[0],
[1],
[0],
[0]],
[[0],
...
[0],
[1],
[0],
[1],
[1]]]])
torch.Size([1, 7, 7, 1])
```
:::info
:information_source: **Code Explain**
* `.reshape(-1, ...etc)` \: reshaping with -1 means that the size of that dimension will be inferred from the sizes of the other dimensions. \[https://blog.csdn.net/weixin_42599499/article/details/105896894\]
* `x[..., 2]` \: In short, this keeps all the previous dimensions and takes the element at index 2 of the final dimension. Consider `...` as standing in for all the `:` of the previous dimensions.
\[https://blog.csdn.net/orangerfun/article/details/120680613\]
* `.unsqueeze` \: Simply adds a new dimension at the specified position.
* In line 46 of the `loss.py` \[https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/object_detection/YOLO/loss.py\], the `target[..., 20].unsqueeze(3)` is equal to `target[..., 20:21]`
:::
#### Box Coordinates Loss
**Select Box out of the two**
We already know the shape of `bestbox` from the explanation above. To extract the responsible box, we simply multiply `bestbox` with `predictions[..., 26:30]` (the second box) and `(1 - bestbox)` with `predictions[..., 21:25]` (the first box), then sum the two, as in the sketch below.
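Roughly, the selection looks like this, where `exists_box` is `target[..., 20:21]`, the object indicator of each cell (a sketch of the idea, not the exact repo code):
```python
box_predictions = exists_box * (
    bestbox * predictions[..., 26:30]          # the second box is responsible
    + (1 - bestbox) * predictions[..., 21:25]  # the first box is responsible
)
box_targets = exists_box * target[..., 21:25]
```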
**Calculate Square Root of W & H**
```python
box_predictions[..., 2:4] = torch.sign(box_predictions[..., 2:4]) * torch.sqrt(torch.abs(box_predictions[..., 2:4] + 1e-6))
box_targets[..., 2:4] = torch.sqrt(box_targets[..., 2:4])
```
When computing the model output, **W & H will not necessarily be positive, since the model knows nothing before training**. Therefore, because the square root can only be applied to non-negative numbers and the gradient has to point in the right direction, an absolute value is applied and **the sign is kept via `torch.sign`**. On the other hand, the target \(label\) is already well-formed, so no such check is needed there.
:::info
:information_source: **Code Explain**
* `torch.flatten(x, end_dim)` \: In the official document, we can see that \(`torch.flatten(input, start_dim=0, end_dim=-1)`\) the default dimension is from 0 to -1. \[https://pytorch.org/docs/stable/generated/torch.flatten.html\]
* `torch.nn.MSELoss` \: 【Pytorch基礎】torch.nn.MSELoss損失函數 \[https://blog.csdn.net/zfhsfdhdfajhsr/article/details/115637954\]
:::
#### Object Loss & No Object Loss
***TODO \: Test if these affect the result***
In the tutorial, these two use different flatten settings. However, my understanding is that because the MSE reduction produces a scalar, the flattened dimensions should not affect the result.
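A sketch of the two terms under that assumption, reusing the names from the coordinate-loss snippet above and assuming `self.mse = nn.MSELoss(reduction="sum")` (not necessarily the repo verbatim):
```python
# Confidence of whichever of the two boxes is responsible
pred_conf = bestbox * predictions[..., 25:26] + (1 - bestbox) * predictions[..., 20:21]

# Object loss: only computed where an object exists in the cell
object_loss = self.mse(
    torch.flatten(exists_box * pred_conf),
    torch.flatten(exists_box * target[..., 20:21]),
)

# No-object loss: both predicted boxes are pushed toward zero confidence;
# because the reduction is a sum producing a scalar, the flatten shape does not matter
no_object_loss = self.mse(
    torch.flatten((1 - exists_box) * predictions[..., 20:21], start_dim=1),
    torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1),
) + self.mse(
    torch.flatten((1 - exists_box) * predictions[..., 25:26], start_dim=1),
    torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1),
)
```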
#### Classification Loss
Similar to the above calculation.
#### Sum Up All the Loss
```python
loss = (
    self.lambda_coord * box_loss
    + object_loss
    + self.lambda_noobj * no_object_loss
    + class_loss
)
```
### Datasets
> Reference \:
> * How to build custom Datasets for Images in Pytorch
> https://www.youtube.com/watch?v=ZoZHd0Zm3RY&list=PLhhyoLH6IjfxeoooqP9rhU3HJIAVAJ3Vz&index=9
> * PascalVOC_YOLO by ALADDIN PERSSON
> https://www.youtube.com/redirect?event=video_description&redir_token=QUFFLUhqbllGd0NuTXVnaWQtM2ltanlZN2d4ZzVJS01ZZ3xBQ3Jtc0trOWQ2RDZTaWNhTHAzVkhmejBuLVRJR2szNk1vb0h3VkpoWDIweWVwMG9lMmJQVDNTTjdMNFNuaDYwY3NRTW5rNHlVNHk1UnM4ZWJPUExTMlFESzBMS2JBOExjWXd0MW9YTkp2WFpCNEoyWjNLUkJaMA&q=https%3A%2F%2Fwww.kaggle.com%2Fdataset%2F734b7bcb7ef13a045cbdd007a3c19874c2586ed0b02b4afc86126e89d00af8d2&v=n9_XyCGr-MI
In this implementation, we ensure when loading the dataset that only one object label exists per cell, which means **we enforce one object per cell on the ground-truth labels at the dataset-loading stage**.
:::info
:bulb: **Code Explain**
* In line 32 of `dataset.py`, the code `float(x) if float(x) != int(float(x)) else int(x)` basically checks whether the loaded string is an integer or a float. This check lets us store `class_label` as an integer.
* In lines 53 and 54 of `dataset.py`, the code transforms the label's bounding box midpoint coordinates from being relative to the entire image to being relative to a cell. Keep in mind that by scaling them by the grid size, the integer part tells us which cell and the fractional part gives the coordinates relative to that cell (see the sketch after this note).
:::
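A small sketch of that coordinate transform (S is the grid size; the numbers are made up):
```python
S = 7
x, y, w, h = 0.55, 0.30, 0.20, 0.40    # midpoint label, relative to the whole image

i, j = int(S * y), int(S * x)          # responsible cell: row i = 2, column j = 3
x_cell, y_cell = S * x - j, S * y - i  # midpoint relative to that cell: 0.85, ~0.10
w_cell, h_cell = S * w, S * h          # size relative to one cell: 1.4, 2.8 (may exceed 1)
```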
### Training
:::info
:bulb: **Code Explain**
* In line 53 of `train.py`, we can see that a resize is applied. This works because the labels are relative to the entire image, so resizing does not change an object's relative location.
:::
**Result on 8 examples with 100 epochs**
```bash
Train mAP: 0.9999989867210388
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.31it/s, loss=40.7]
Mean loss was 40.736507415771484
```
**Result on 100 examples with 200 epochs**
```bash
Train mAP: 0.9923983812332153
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.27s/it, loss=162]
Mean loss was 158.8941650390625
Train mAP: 0.9905207753181458
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.27s/it, loss=124]
Mean loss was 157.00500106811523
Train mAP: 0.9914115071296692
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.23s/it, loss=108]
Mean loss was 154.05665588378906
Train mAP: 0.9698892831802368
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.22s/it, loss=136]
Mean loss was 157.37435150146484
Train mAP: 0.9752365350723267
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.24s/it, loss=134]
Mean loss was 160.92015075683594
Train mAP: 0.9772406816482544
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.19s/it, loss=134]
Mean loss was 149.35044860839844
Train mAP: 0.9661636352539062
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.24s/it, loss=116]
Mean loss was 152.6690902709961
```
Notice that some minor changes were made to clean up the original code. \[***TODO : Fork a branch to include my version of things***\]
:::success
:bulb: **Pitfall of Training with WSL2**
> Reference \: https://docs.nvidia.com/cuda/wsl-user-guide/index.html#wsl-2-support-constraints
Refer to section *4.1 Known Limitations for Linux CUDA Applications*: **there are some memory-management constraints** in WSL2. `PIN_MEMORY` should not be set; otherwise an out-of-memory error will occur.
:::
**Result on all the data with 10 epochs \(Not Finished\)**
```python
# Hyperparameters
LEARNING_RATE = 2e-5
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 64
WEIGHT_DECAY = 0
EPOCHS = 10 # 1000
NUM_WORKERS = 32 # 2
PIN_MEMORY = False # In WSL2 pin memory cannot be used
LOAD_MODEL = False
```
```bash
Epoch #0
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 258/258 [03:50<00:00, 1.12it/s, loss=789]
Mean loss was 1137.7924542094386
Epoch #1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 258/258 [03:42<00:00, 1.16it/s, loss=684]
Mean loss was 757.7551872785701
Epoch #2
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 258/258 [03:45<00:00, 1.14it/s, loss=599]
Mean loss was 669.1634396102078
```
:::success
:bulb: **GPU temperature went up to 90 °C and the hotspot to 100 °C**
Please refer to *Home Server Setup* https://hackmd.io/EhDRrZWEQ4it2FMZZPVE4w for details of the improvements I made.
:::
### LAB : Replace Darknet with Reduced MobileNetV2
In the `OpenCL Programming` post, I built a heavily reduced version of MobileNetV2. This feature extractor is very small but still retains the basic structure of MobileNetV2. Please refer to that post for more detail.
> \[https://hackmd.io/fWapRSunQGS_sCvyMp-3Eg\]
```bash
Epoch #96
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 2068/2068 [02:11<00:00, 15.75it/s, loss=13]
Mean loss was 16.00029033595404
Epoch #97
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2068/2068 [02:11<00:00, 15.78it/s, loss=23.4]
Mean loss was 16.084995328573008
Epoch #98
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2068/2068 [02:11<00:00, 15.69it/s, loss=9.44]
Mean loss was 15.908252839654741
Epoch #99
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2068/2068 [02:10<00:00, 15.87it/s, loss=21.1]
Mean loss was 15.926505912895129
Train mAP: 0.8290584683418274
```
With 100 epochs of training, the mAP can reach about 0.8. This version of MobileNetV2 was built only for detecting whether a certain object is in the image ***as a binary classifier***. Therefore, this model would be better suited to being trained to detect whether a certain object is in the image and output its bounding box \(*only 1 class*\).
## YOLOv3 Implementation with Pytorch
Reference \:
> YOLOv3 from Scratch
> https://www.youtube.com/watch?v=Grir6TZbc1M&list=PLhhyoLH6Ijfw0TpCTVTNk42NN08H6UvNq&index=6
The main difference between YOLOv3 and v1 is that YOLOv1 only has a 7 x 7 grid for detecting objects, which makes smaller or overlapping objects hard to detect. \(Remember that a grid cell can only output one object.\) To solve this problem, YOLOv3 uses feature maps at different scales from different model layers, which lets the model work on both small and large bounding boxes.
Reference \:
> What’s new in YOLO v3?
> https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b
### Model Architecture

***:warning: Please note that this diagram has some errors, so following the code configuration might be better.***
Reference \:
> https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/object_detection/YOLOv3/model.py

```python
"""
Information about architecture config:
Tuple is structured by (filters, kernel_size, stride)
Every conv is a same convolution.
List is structured by "B" indicating a residual block followed by the number of repeats
"S" is for scale prediction block and computing the yolo loss
"U" is for upsampling the feature map and concatenating with a previous layer
"""
config = [
    (32, 3, 1),    # 416x416
    (64, 3, 2),    # 208x208
    ["B", 1],      # 208x208
    (128, 3, 2),   # 104x104
    ["B", 2],      # 104x104
    (256, 3, 2),   # 52x52
    ["B", 8],      # 52x52
    (512, 3, 2),   # 26x26
    ["B", 8],      # 26x26
    (1024, 3, 2),  # 13x13
    ["B", 4],      # 13x13 # To this point is Darknet-53
    (512, 1, 1),   # 13x13
    (1024, 3, 1),  # 13x13
    "S",           # 13x13
    (256, 1, 1),   # 13x13
    "U",           # 26x26
    (256, 1, 1),   # 26x26
    (512, 3, 1),   # 26x26
    "S",           # 26x26
    (128, 1, 1),   # 26x26
    "U",           # 52x52
    (128, 1, 1),   # 52x52
    (256, 3, 1),   # 52x52
    "S",           # 52x52
]
```
### Use of Anchor Boxes
For each scale, there are three different anchor boxes for each grid cell \(9 in total across the three scales\). These anchor boxes are prior knowledge given to the model, computed from the training dataset. The model is then asked to learn to \"adjust\" these anchor boxes to fit the ground truth. Additionally, this means ***every grid cell outputs three bounding boxes, each referencing one anchor box, so a grid cell can detect up to three objects***.

:::info
:bulb: **Code Explain**
Please follow `model.py`
* In line 162, `in_channels` is expanded to 3 times its size because after upsampling the feature map is concatenated with a route connection, so the channel count grows.
* In line 124, as noted above, the concatenation happens between the current layer output `x` and the last entry of `route_connections`, along the channel dimension \(`dim=1` for NCHW tensors\).
* In line 98, a reshape and a permute are used to give each anchor its own dimension and to move the bounding box values to the last dimension \(see the sketch after this note\).
:::
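A sketch of that reshape/permute, assuming 3 anchors per scale and an NCHW tensor coming out of the prediction conv (close in spirit to the repo, not a verbatim copy):
```python
# x has shape (N, 3 * (num_classes + 5), S, S) after the prediction conv
out = (
    x.reshape(x.shape[0], 3, num_classes + 5, x.shape[2], x.shape[3])
     .permute(0, 1, 3, 4, 2)  # -> (N, 3, S, S, num_classes + 5)
)
```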
:::success
:bulb: **Python Coding Compared to C++**
In Python programming, data with more than three dimensions is very common, and you usually need to get the dimensions right before passing the data to different functions. On the other hand, C++ rarely uses that many dimensions in one data structure, since dynamically allocating such a nested structure scatters memory around the system RAM.
:::
### Datasets
*As in YOLOv1, the datasets are provided by the author and can be downloaded from Kaggle.*
*I include the entire code snippet here since it is easier to explain as a whole.*
On the first line, we create `targets` with the same size as the model output, which means our model has to fit this target entirely. Notice that we initialize all values in the target to 0, because only some of them will be set according to the bounding boxes. Therefore, if the model outputs a non-zero value anywhere we did not set \(most of the values\), it will be punished, since that is background.
On line 6, `has_anchor = [False] * 3` because we want to assign one anchor per scale to each bounding box. In our case there are three scales, so a bounding box will have three anchors assigned to it, one from each scale.
From line 13, if `anchor_taken` is not 1 and that scale has not been assigned an anchor yet, we set the anchor \(remember we iterate in order of decreasing IoU\) by setting the confidence to 1 \(which also marks `anchor_taken` as 1, since they read the same entry\), setting the coordinates relative to the cell, and setting the class label.
On line 26, if an anchor that was not chosen still has a high IoU \(> 0.5\) with the ground-truth bounding box, we set its confidence to `-1`, which specifically marks it as not punished when we calculate the loss.
```python=
targets = [torch.zeros((self.num_anchors // 3, S, S, 6)) for S in self.S]  # [p_o, x, y, w, h, class]
for box in bboxes:
    iou_anchors = iou(torch.tensor(box[2:4]), self.anchors)  # IoU using only w, h (assuming the same midpoint)
    anchor_indices = iou_anchors.argsort(descending=True, dim=0)
    x, y, width, height, class_label = box
    has_anchor = [False] * 3  # for a bbox, each of the three scales should get one anchor
    for anchor_idx in anchor_indices:
        scale_idx = anchor_idx // self.num_anchors_per_scale
        anchor_on_scale = anchor_idx % self.num_anchors_per_scale
        S = self.S[scale_idx]
        i, j = int(S * y), int(S * x)  # which cell
        anchor_taken = targets[scale_idx][anchor_on_scale, i, j, 0]
        if not anchor_taken and not has_anchor[scale_idx]:
            targets[scale_idx][anchor_on_scale, i, j, 0] = 1
            x_cell, y_cell = S * x - j, S * y - i  # both between [0, 1]
            width_cell, height_cell = (
                width * S,
                height * S,
            )  # can be greater than 1 since it's relative to the cell
            box_coordinates = torch.tensor(
                [x_cell, y_cell, width_cell, height_cell]
            )
            targets[scale_idx][anchor_on_scale, i, j, 1:5] = box_coordinates
            targets[scale_idx][anchor_on_scale, i, j, 5] = int(class_label)
            has_anchor[scale_idx] = True
        elif not anchor_taken and iou_anchors[anchor_idx] > self.ignore_iou_thresh:
            targets[scale_idx][anchor_on_scale, i, j, 0] = -1  # ignore this prediction in the loss
return image, tuple(targets)
```
:::info
:bulb: **Code Explain**
Please follow `dataset.py`
* In line 20, `ImageFile.LOAD_TRUNCATED_IMAGES = True`. Please see the following link for reference: https://blog.csdn.net/weixin_43135178/article/details/117897962
:::
### Loss Functions
For the box coordinates loss, instead of applying `exp` to the `prediction`, a `log` is applied to the `target` for better gradient behavior.
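In code, this looks roughly like the following, where `obj` is a boolean mask selecting the anchors/cells that contain an object and `anchors` is reshaped to broadcast over the grid (a sketch of the idea, not an exact copy of `loss.py`):
```python
# Invert b_wh = p_wh * exp(t_wh): take the log of the ratio on the target side
# instead of applying exp on the prediction side
target[..., 3:5] = torch.log(1e-16 + target[..., 3:5] / anchors)
box_loss = self.mse(predictions[..., 1:5][obj], target[..., 1:5][obj])
```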
:::info
:bulb: **Reference**
Please refer to the tutorial for better understanding
* YOLOv3 from Scratch Loss Implementation
https://youtu.be/Grir6TZbc1M?list=PLhhyoLH6Ijfw0TpCTVTNk42NN08H6UvNq&t=4470
* 【論文理解】yolov3損失函數 https://blog.csdn.net/weixin_43384257/article/details/100986249
:::