# RTDETR

## Author and paper link

```Latex=
@misc{lv2023detrs,
      title={DETRs Beat YOLOs on Real-time Object Detection},
      author={Wenyu Lv and Yian Zhao and Shangliang Xu and Jinman Wei and Guanzhong Wang and Cheng Cui and Yuning Du and Qingqing Dang and Yi Liu},
      year={2023},
      eprint={2304.08069},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

[paper reference](https://arxiv.org/abs/2304.08069)

## Abstract

> Our RT-DETR-L achieves 53.0% AP on COCO val2017 and 114 FPS on a T4 GPU, while RT-DETR-X achieves 54.8% AP and 74 FPS, outperforming all YOLO detectors of the same scale in both speed and accuracy.

# NEU-DET experiment

| Metric | TWCC_RTDETR16 | TWCC_RTDETR24 |
| -------------------- | ------------- | ------------- |
| metrics/precision(B) | 0.67444 | 0.7223 |
| metrics/recall(B) | 0.67814 | 0.64334 |
| metrics/mAP50(B) | 0.677 | 0.67418 |
| metrics/mAP50-95(B) | 0.36799 | 0.36059 |

# Improving Weaknesses

Weaknesses of previous DETR-style detectors that RT-DETR targets:

1. slow training convergence
2. hard-to-optimize object queries
3. difficulty reaching real-time object detection, due to the high computational cost of the model itself

# Model Structure

RT-DETR consists of a backbone, an efficient hybrid encoder, IoU-aware query selection, and a transformer decoder with auxiliary prediction heads.

- The hybrid encoder transforms multi-scale image features into a sequence through **intra-scale interaction** and **cross-scale fusion**.
- IoU-aware query selection picks a fixed number of image features from the hybrid encoder output sequence to serve as the initial **object queries**.
- The transformer decoder generates boxes and confidence scores.

## main structure

![](https://hackmd.io/_uploads/BJe0o_Y03.png)

Overview of RT-DETR. It takes the features of the last three backbone stages {S3, S4, S5} as the input to the encoder. The efficient hybrid encoder transforms the multi-scale features into a sequence of image features through intra-scale feature interaction (AIFI) and the cross-scale feature-fusion module (CCFM). IoU-aware query selection is employed to **select a fixed number of image features** to serve as **initial object queries for the decoder**. Finally, the decoder with auxiliary prediction heads iteratively optimizes the object queries to generate boxes and confidence scores.

## cross-scale feature-fusion module (CCFM)

![](https://hackmd.io/_uploads/H1a2GttCh.png)

The role of the CCFM block is to fuse adjacent features into a new feature.

![](https://hackmd.io/_uploads/rkB4Mjt0h.png)

CCFM fuses the S3, S4, and F5 features into new features.

## intra-scale feature interaction (AIFI)

The original paper does not depict the structure of AIFI, so here are some guesses, based on the original paper's description and on the structure of [ICIF-Net](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9759285) shown below.

### origin paper's description

> AIFI further reduces computational redundancy based on variant D, which only performs **intra-scale interaction on S5**. We argue that applying the self-attention operation to high-level features with richer semantic concepts can capture the connection between conceptual entities in the image, which facilitates the detection and recognition of objects in the image by subsequent modules.

### ICIF-Net's intra-scale cross-interaction

![](https://hackmd.io/_uploads/ryx-j9t03.png)

* "cross" means enabling two-branch communication at the same resolution

### AIFI might be the attention mechanism from [Attention Is All You Need], applied to the S5 feature

![](https://hackmd.io/_uploads/BJ0RoctAn.png)
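Since the structure above is only a guess, here is a minimal PyTorch sketch of that interpretation: a single transformer encoder layer applied to the flattened S5 feature map. The class name `AIFI`, all hyperparameters, and the omission of positional encoding are assumptions for illustration, not taken from the paper or its released code.

```python=
import torch
import torch.nn as nn

class AIFI(nn.Module):
    """Guess at AIFI: one transformer encoder layer (multi-head
    self-attention + FFN) applied only to the flattened S5 feature map.
    Positional encoding is omitted to keep the sketch short."""

    def __init__(self, channels: int = 256, num_heads: int = 8, ffn_dim: int = 1024):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads,
            dim_feedforward=ffn_dim, batch_first=True)

    def forward(self, s5: torch.Tensor) -> torch.Tensor:
        # s5: [B, C, H, W] -> token sequence [B, H*W, C]
        b, c, h, w = s5.shape
        tokens = s5.flatten(2).permute(0, 2, 1)
        tokens = self.layer(tokens)  # intra-scale self-attention on S5 only
        # back to a feature map so CCFM can fuse it with S3 and S4
        return tokens.permute(0, 2, 1).reshape(b, c, h, w)

# usage sketch: S5 keeps its shape, it is just re-encoded into F5
f5 = AIFI()(torch.randn(2, 256, 20, 20))
print(f5.shape)  # torch.Size([2, 256, 20, 20])
```

Because only S5, the lowest-resolution map, goes through self-attention, the quadratic attention cost stays small compared with attending over all scales, which matches the paper's motivation for intra-scale interaction on S5 only.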
## IoU-aware query selection

The paper proposes IoU-aware query selection: during training the model is constrained to produce high classification scores for predictions with high IoU and low classification scores for predictions with low IoU. In the figure below, blue marks the proposed IoU-aware method; the ideal result would lie along a 45-degree line.

![center](https://hackmd.io/_uploads/HkplOit03.png)

The model is constrained with the following loss function:

$$\begin{aligned}
\mathcal{L}(\hat{y},y) &= \mathcal{L}_{box}(\hat{b},b) + \mathcal{L}_{cls}(\hat{c},\color{red}{\hat{b}},y,\color{red}{b}) \\
&= \mathcal{L}_{box}(\hat{b},b) + \mathcal{L}_{cls}(\hat{c},c,\color{red}{IoU}) \\
\text{where}\quad &\hat{y}=\{\hat{c},\hat{b}\} \text{ is the prediction},\quad y=\{c,b\} \text{ is the ground truth} \\
&c = \text{categories},\quad b = \text{bounding boxes}
\end{aligned}$$

### How to select the top-k queries from the hybrid encoder output?

Select the top-k classification scores from the hybrid encoder output sequence:

```python=
# decoder classification heads (one per decoder layer)
self.dec_score_head = nn.ModuleList([
    nn.Linear(hidden_dim, num_classes)
    for _ in range(num_decoder_layers)
])

# encoder classification head used for query selection
self.enc_score_head = nn.Linear(hidden_dim, num_classes)

# score every token of the encoder output sequence, then keep the
# indices of the num_queries tokens with the highest class score
enc_outputs_class = self.enc_score_head(output_memory)
_, topk_ind = torch.topk(enc_outputs_class.max(-1).values,
                         self.num_queries, dim=1)
```
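As a hedged illustration of what happens with those indices, the sketch below gathers the corresponding encoder features to serve as the initial object queries. The function name, tensor shapes, and the plain advanced indexing are assumptions made for this sketch; the released code additionally gathers the corresponding predicted boxes as initial reference points for the decoder.

```python=
import torch

def select_initial_queries(memory: torch.Tensor,
                           class_logits: torch.Tensor,
                           num_queries: int) -> torch.Tensor:
    """Sketch of turning top-k scored encoder tokens into initial queries.

    memory:       [B, N, C]           encoder output sequence
    class_logits: [B, N, num_classes] scores from the encoder score head
    returns:      [B, num_queries, C] features used as initial object queries
    """
    # score each token by its best class, keep the top-k indices
    scores = class_logits.max(-1).values                  # [B, N]
    _, topk_ind = torch.topk(scores, num_queries, dim=1)  # [B, K]

    # gather the corresponding feature vectors from the sequence
    batch_ind = torch.arange(memory.size(0)).unsqueeze(-1)  # [B, 1]
    return memory[batch_ind, topk_ind]                      # [B, K, C]

# usage sketch with made-up shapes
memory = torch.randn(2, 300, 256)
logits = torch.randn(2, 300, 80)
queries = select_initial_queries(memory, logits, num_queries=100)
print(queries.shape)  # torch.Size([2, 100, 256])
```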