# DETR: End-to-End Object Detection with Transformers

## Author and paper link

```latex
@misc{carion2020endtoend,
      title={End-to-End Object Detection with Transformers},
      author={Nicolas Carion and Francisco Massa and Gabriel Synnaeve and Nicolas Usunier and Alexander Kirillov and Sergey Zagoruyko},
      year={2020},
      eprint={2005.12872},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

[paper reference](https://arxiv.org/abs/2005.12872?context=cs.CV)

## Abstract

DEtection TRansformer (DETR) presents a new method that views object detection as a direct set prediction problem. The approach streamlines the detection pipeline, effectively removing the need for many hand-designed components, such as non-maximum suppression or anchor generation, that explicitly encode prior knowledge about the task. DETR combines a common CNN backbone with a transformer architecture.

During training, bipartite matching uniquely assigns predictions to ground-truth boxes. A prediction with no match should yield a **no object** ($\emptyset$) class prediction.

> DETR proposes a direct set prediction approach to bypass the surrogate tasks. This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection.

###### tags: `object queries` `set-based loss` `encoder-decoder`

![DETR predict method](https://hackmd.io/_uploads/S1MxvNmsp.png)
![transformer part](https://hackmd.io/_uploads/B1OkCJEi6.png)

# COCO experiment

| MODEL         | #params | AP   | AP50 | AP75 | APs  | APm  | APl  |
| ------------- | ------- | ---- | ---- | ---- | ---- | ---- | ---- |
| DETR          | 41M     | 42.0 | 62.4 | 44.2 | 20.5 | 45.8 | 61.1 |
| DETR-DC5      | 41M     | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1 |
| DETR-R101     | 60M     | 43.5 | 63.8 | 46.4 | 21.9 | 48.0 | 61.8 |
| DETR-DC5-R101 | 60M     | 44.9 | 64.7 | 47.7 | 23.7 | 49.5 | 62.3 |

# NEU-DET experiment

## train config

| ID  | MODEL         | config_cmd                                           |
| --- | ------------- | ---------------------------------------------------- |
| 1   | DETR-DC5-R101 | --epochs 200 --enc_layers 1 --dec_layers 5           |
| 2   | DETR-DC5-R101 | --epochs 100 --enc_layers 1 --dec_layers 5           |
| 3   | DETR-DC5-R101 | --epochs 100 --enc_layers 2 --dec_layers 2           |
| 4   | DETR-DC5-R101 | --epochs 100 --enc_layers 3 --dec_layers 1           |
| 5   | DETR-DC5-R101 | --epochs 100 --enc_layers 3 --dec_layers 3           |
| 6   | DETR-DC5-R101 | --epochs 100 --enc_layers 0 --dec_layers 6           |
| 7   | DETR-DC5-R50  | --epochs 500 --enc_layers 6 --dec_layers 6 --batch 4 |

## precision

| ID  | MODEL         | AP    | AP50  | AP75  | APs | APm   | APl   |
| --- | ------------- | ----- | ----- | ----- | --- | ----- | ----- |
| 1   | DETR-DC5-R101 | 0.001 | 0.003 | 0.000 | -1  | 0.000 | 0.002 |
| 2   | DETR-DC5-R101 | 0.000 | 0.002 | 0.000 | -1  | 0.000 | 0.000 |
| 3   | DETR-DC5-R101 | 0.001 | 0.004 | 0.001 | -1  | 0.000 | 0.001 |
| 4   | DETR-DC5-R101 | 0.000 | 0.000 | 0.000 | -1  | 0.000 | 0.000 |
| 5   | DETR-DC5-R101 | 0.003 | 0.009 | 0.000 | -1  | 0.000 | 0.003 |
| 6   | DETR-DC5-R101 | 0.003 | 0.011 | 0.000 | -1  | 0.000 | 0.000 |
| 7   | DETR-DC5-R50  | 0.000 | 0.000 | 0.000 | -1  | 0.000 | 0.000 |

## recall

| ID  | MODEL         | AR    | ARs | ARm   | ARl   |
| --- | ------------- | ----- | --- | ----- | ----- |
| 1   | DETR-DC5-R101 | 0.000 | -1  | 0.001 | 0.032 |
| 2   | DETR-DC5-R101 | 0.000 | -1  | 0.000 | 0.012 |
| 3   | DETR-DC5-R101 | 0.000 | -1  | 0.000 | 0.025 |
| 4   | DETR-DC5-R101 | 0.000 | -1  | 0.000 | 0.000 |
| 5   | DETR-DC5-R101 | 0.000 | -1  | 0.001 | 0.032 |
| 6   | DETR-DC5-R101 | 0.000 | -1  | 0.000 | 0.067 |
| 7   | DETR-DC5-R50  | 0.000 | -1  | 0.000 | 0.001 |
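The NEU-DET precision/recall numbers above follow the standard COCO evaluation protocol, which reports `-1` for any object-size range that contains no ground-truth instances (hence the APs/ARs columns). As a reference, here is a minimal evaluation sketch with `pycocotools`; the two JSON file names are placeholder assumptions, not paths from the experiments above.

```python
# Minimal COCO-style bbox evaluation with pycocotools; file names are
# hypothetical placeholders for the NEU-DET ground truth and DETR detections.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("neu_det_val_gt.json")               # COCO-format annotations
coco_dt = coco_gt.loadRes("detr_predictions.json")  # exported DETR detections

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, APs/APm/APl and the AR variants;
                       # -1 marks a size range with no ground-truth boxes
```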
### classes missed analysis

![missed analysis](https://hackmd.io/_uploads/HkrjnkEsp.png)

Analysis of the number of instances of various classes missed by DETR, depending on how many are present in the image, reporting the mean and the standard deviation. **As the number of instances gets close to 100, DETR starts to saturate and misses more and more objects.**

# Model Structure

![model structure](https://hackmd.io/_uploads/ryBzcVQi6.png)

- **Backbone**: DETR uses a conventional CNN backbone (typically ResNet-50) to learn a 2-D representation of an input image. The backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$; a $1 \times 1$ convolution then reduces the channel dimension from $C$ to a smaller dimension $d$, producing $z_0 \in \mathbb{R}^{d \times H \times W}$.
- **Transformer encoder**: The encoder expects a sequence as input, hence the spatial dimensions of $z_0$ are collapsed into one, resulting in a $d \times HW$ feature map. The model flattens it and supplements it with a positional encoding before passing it into the transformer encoder.
- **Transformer decoder**: The decoder follows the standard architecture of the transformer, transforming $N$ embeddings of size $d$ using multi-headed self-attention and encoder-decoder attention mechanisms. A small fixed number of learned positional embeddings, referred to as *object queries*, are taken as input, and the decoder additionally attends to the encoder output.
- **Auxiliary decoding losses**: The paper uses auxiliary losses in the decoder during training, adding **prediction FFNs** and the **Hungarian loss** after each decoder layer. All prediction FFNs share their parameters.
- **Prediction feed-forward network (FFN)**: The final prediction is computed by a 3-layer perceptron with ReLU activations and hidden dimension $d$, followed by a linear projection layer. Since the model predicts a fixed-size set of $N$ bounding boxes, where $N$ is usually much larger than the actual number of objects of interest in an image, an additional special class label $\emptyset$ is used to represent that no object is detected within a slot. The $\emptyset$ class plays a role similar to the background class in conventional detectors.

# Loss Function

The paper uses a linear combination of $\ell_1$ and GIoU losses for bounding-box regression, with weights $\lambda_{L1} = 5$ and $\lambda_{iou} = 2$ respectively. All models were trained with $N = 100$ decoder query slots.

## Hungarian Loss

DETR infers a fixed-size set of $N$ predictions in a single pass through the decoder, where $N$ is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is to score predicted objects (class, position, size) with respect to the ground truth.

- Target: first find an optimal bipartite matching between predicted and ground-truth objects, then optimize the object-specific (bounding box) losses.

Let us denote by $y$ the ground-truth set of objects, and $\hat{y} = \{\hat{y}_i\}^{N}_{i=1}$ the set of $N$ predictions. Assuming $N$ is larger than the number of objects in the image, we consider $y$ also as a set of size $N$ padded with $\emptyset$ (no object). To find a bipartite matching between these two sets we search for a permutation of $N$ elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:

$$\hat{\sigma} = \mathop{\arg\min}_{\sigma \in \mathfrak{S}_N} \sum^{N}_{i} \mathcal{L}_{match}(y_i,\hat{y}_{\sigma(i)})$$

where $\mathcal{L}_{match}$ is a pair-wise *matching cost* between ground truth $y_i$ and the prediction with index $\sigma(i)$. This optimal assignment is computed efficiently with the **Hungarian algorithm**.
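To make the matching step concrete, here is a condensed, single-image sketch in the spirit of DETR's `HungarianMatcher`, using SciPy's `linear_sum_assignment` (the Hungarian algorithm). The function and tensor names are our own assumptions; the cost terms and the weights $\lambda_{L1} = 5$, $\lambda_{iou} = 2$ follow the paper.

```python
# Sketch of DETR-style bipartite matching for one image. Names and shapes
# are illustrative assumptions, not the reference implementation.
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import box_convert, generalized_box_iou

def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes):
    """pred_logits: [N, num_classes + 1], pred_boxes: [N, 4] (normalized cxcywh),
    tgt_labels: [M], tgt_boxes: [M, 4]. Returns (pred_idx, tgt_idx)."""
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, tgt_labels]                  # -p_hat_{sigma(i)}(c_i), [N, M]
    cost_l1 = torch.cdist(pred_boxes, tgt_boxes, p=1)  # L1 box distance, [N, M]
    cost_giou = -generalized_box_iou(
        box_convert(pred_boxes, "cxcywh", "xyxy"),
        box_convert(tgt_boxes, "cxcywh", "xyxy"))      # GIoU similarity as a cost, [N, M]
    cost = cost_class + 5 * cost_l1 + 2 * cost_giou    # lambda_L1 = 5, lambda_iou = 2
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, tgt_idx  # the N - M unmatched predictions fall to the "no object" class
```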
The matching cost takes into account both the class prediction and the similarity of predicted and ground-truth boxes. Each element $i$ of the ground-truth set can be seen as $y_i = (c_i, b_i)$, where $c_i$ is the target class label (which may be $\emptyset$) and $b_i \in \left[0,1\right]^{4}$ is a vector that defines the ground-truth box center coordinates and its height and width relative to the image size. For the prediction with index $\sigma(i)$ we define the probability of class $c_i$ as $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box as $\hat{b}_{\sigma(i)}$. With these notations DETR defines

$$\mathcal{L}_{match}(y_i,\hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \emptyset\}}\hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \emptyset\}}\mathcal{L}_{box}(b_i,\hat{b}_{\sigma(i)})$$

The second step is to compute the Hungarian loss for all pairs matched in the previous step, defined similarly to the losses of common object detectors:

$$\mathcal{L}_{Hungarian}(y,\hat{y}) = \sum^{N}_{i=1}\left[ -\log{\hat{p}_{\hat{\sigma}(i)}(c_i)} + \mathbb{1}_{\{c_i \neq \emptyset\}}\mathcal{L}_{box}(b_i,\hat{b}_{\hat{\sigma}(i)}) \right]$$

![Hungarian image](https://hackmd.io/_uploads/S1n80lAia.png)

### Box loss

The paper uses a soft version of Intersection over Union (the generalized IoU) in the loss, together with an $\ell_1$ loss on the predicted boxes (a minimal PyTorch sketch of this loss follows the conclusion below):

$$\begin{aligned} \mathcal{L}_{box}(b_{\sigma(i)},\hat{b}_i) &= \lambda_{iou}\,\mathcal{L}_{iou}(b_{\sigma(i)},\hat{b}_i) + \lambda_{L1}\,||b_{\sigma(i)} - \hat{b}_i||_1 \\ \mathcal{L}_{iou}(b_{\sigma(i)},\hat{b}_i) &= 1 - \left(\frac{|b_{\sigma(i)} \cap \hat{b}_i|}{|b_{\sigma(i)} \cup \hat{b}_i|} - \frac{|B(b_{\sigma(i)},\hat{b}_i) \setminus (b_{\sigma(i)} \cup \hat{b}_i)|}{|B(b_{\sigma(i)},\hat{b}_i)|}\right) \end{aligned}$$

Here $|\cdot|$ denotes **area**, and the union and intersection of box coordinates are used as shorthand for the boxes themselves. The areas of unions and intersections are computed from min/max of linear functions of $b_{\sigma(i)}$ and $\hat{b}_i$, which makes the loss sufficiently well-behaved for stochastic gradients. $B(b_{\sigma(i)},\hat{b}_i)$ denotes the largest box containing both $b_{\sigma(i)}$ and $\hat{b}_i$ (the areas involving $B$ are also computed from min/max of linear functions of the box coordinates).

# Conclusion

1. DETR is a new design for object detection systems based on transformers and a bipartite matching loss for direct set prediction.
2. DETR is straightforward to implement and has a flexible architecture that is easily extensible to panoptic segmentation.
3. It achieves significantly better performance on large objects than Faster R-CNN.
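As promised in the box-loss section, here is a minimal PyTorch sketch of $\mathcal{L}_{box}$ using the paper's weights $\lambda_{iou} = 2$ and $\lambda_{L1} = 5$. The function name and the assumption that boxes arrive as normalized `(cx, cy, w, h)` pairs are ours; `generalized_box_iou` and `box_convert` are existing `torchvision.ops` utilities.

```python
# Sketch of the DETR box loss on already-matched box pairs:
# lambda_iou * GIoU loss + lambda_L1 * L1 loss, per pair.
import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou

def box_loss(pred_boxes, tgt_boxes, lambda_iou=2.0, lambda_l1=5.0):
    """pred_boxes, tgt_boxes: [M, 4] matched pairs in normalized cxcywh format."""
    l1 = F.l1_loss(pred_boxes, tgt_boxes, reduction="none").sum(-1)  # [M]
    giou = torch.diag(generalized_box_iou(
        box_convert(pred_boxes, "cxcywh", "xyxy"),
        box_convert(tgt_boxes, "cxcywh", "xyxy")))                   # matched-pair GIoU, [M]
    return lambda_iou * (1.0 - giou) + lambda_l1 * l1                # per-pair loss, [M]
```

In DETR's reference implementation these per-pair terms are additionally normalized by the number of target boxes; that detail is omitted here for brevity.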