# [Towards Robust Tampered Text Detection in Document Image: New dataset and New Solution](https://openaccess.thecvf.com/content/CVPR2023/papers/Qu_Towards_Robust_Tampered_Text_Detection_in_Document_Image_New_Dataset_CVPR_2023_paper.pdf)

The paper proposes DTD, a network that uses a multimodal transformer architecture for tampered text detection in document images. The authors also propose a large dataset for document text tampering.

### Proposed Method :
![](https://hackmd.io/_uploads/H1UJ9Yzt2.png)
- The proposed Document Tampering Detector (DTD) model has four modules :
    - *Visual Perception Head* to extract visual features from the images
    - *Frequency Perception Head* to convert the Discrete Cosine Transform (DCT) coefficients of the images into frequency-domain feature embeddings
    - A Multi-Modality Encoder
    - A Multi-view Iterative Decoder
- The visual perception head uses 7 convolution blocks to extract the image features.

**Frequency Perception Head:**
![](https://hackmd.io/_uploads/BJf2cFMth.png)
- The DCT coefficients can be used to detect discontinuities in the Block Artifact Grids (BAG) of tampered images.
- The image is converted to the YCrCb color space and the DCT coefficient map of its Y channel is computed. This map is fed into convolution layers to get the first feature Fp1.
- The second feature Fp2 is computed by expanding the Y-channel quantization table to the size of the DCT coefficient map and multiplying it with Fp1.
- These are concatenated and downsampled to 8x8 blocks, and position embeddings are applied by CoordConv to get Fp3.
- Then 3 MobileConv layers are applied to get the final frequency feature embedding. These help enlarge the receptive field and enhance the features.

**MultiModal Transformer:**
- The frequency-domain and visual-domain features are fused by a multi-modality transformer that uses an scSE module and is based on Swin Transformer.
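
As a rough illustration of the input to the Frequency Perception Head described above, the sketch below computes a block-wise 8x8 DCT coefficient map of the Y channel and tiles an 8x8 quantization table to the same size. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names (`rgb_to_y`, `dct2_block`, `y_dct_map`, `expand_qtable`) are hypothetical, the coefficients are recomputed from pixels with the JPEG luma formula and a level shift of 128, and the learned convolution layers that produce Fp1, Fp2 and Fp3 are left out.

```python
import math

BLOCK = 8  # JPEG block size


def rgb_to_y(r, g, b):
    """Luma (Y) of one RGB pixel, using the JPEG/JFIF full-range weights."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def dct2_block(block):
    """Orthonormal 2-D DCT-II of one 8x8 block (list of 8 lists of 8 floats)."""
    n = BLOCK
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        cu = math.sqrt(1 / n) if u == 0 else math.sqrt(2 / n)
        for v in range(n):
            cv = math.sqrt(1 / n) if v == 0 else math.sqrt(2 / n)
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    )
            out[u][v] = cu * cv * total
    return out


def y_dct_map(rgb):
    """Block-wise DCT coefficient map of the Y channel.

    rgb: H x W nested list of (r, g, b) tuples; H and W are assumed to be
    multiples of 8 (an assumption of this sketch, no padding is done).
    """
    h, w = len(rgb), len(rgb[0])
    # Level-shift by 128 as JPEG does before the DCT.
    y = [[rgb_to_y(*px) - 128.0 for px in row] for row in rgb]
    coeffs = [[0.0] * w for _ in range(h)]
    for top in range(0, h, BLOCK):
        for left in range(0, w, BLOCK):
            block = [row[left:left + BLOCK] for row in y[top:top + BLOCK]]
            transformed = dct2_block(block)
            for i in range(BLOCK):
                coeffs[top + i][left:left + BLOCK] = transformed[i]
    return coeffs


def expand_qtable(qtable, h, w):
    """Tile an 8x8 quantization table so it covers an H x W coefficient map."""
    return [[qtable[i % BLOCK][j % BLOCK] for j in range(w)] for i in range(h)]


if __name__ == "__main__":
    white = [[(255, 255, 255)] * 16 for _ in range(8)]
    dct_map = y_dct_map(white)
    print(round(dct_map[0][0], 3), round(dct_map[0][8], 3))
```

For a flat white 8x16 image, every pixel has a level-shifted Y value of 127, so each of the two blocks has a DC coefficient of about 1016 and AC coefficients near zero. In a real pipeline the quantized coefficients and the quantization table would more likely be read from the JPEG file itself rather than recomputed from pixels.
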

**Multi-view Iterative Decoder:**
![](https://hackmd.io/_uploads/BkW7pYMK2.png)
- This aims to mimic human perception by using four cascaded iteration operations, which concatenate smaller feature maps after each iteration.

**Training:**
- The model is trained with cross-entropy loss and Lovász loss.
- A curriculum learning strategy is applied, in which the model is gradually trained from easier to harder data.

### Experiments :
- The datasets used for training are DocTamper (the proposed dataset) and T-SROIE.
- The evaluation metrics are IoU, precision, recall and F-score.
- The DTD model outperforms other models in document image tampering detection and cross-domain generalisation, and achieves SOTA performance on the T-SROIE dataset.
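
The pixel-level metrics listed above can be computed from a predicted and a ground-truth binary tampering mask, as in the following sketch. The function name `tamper_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration. The paper's exact evaluation protocol (for example, thresholding, or per-image versus dataset-level averaging) is not covered here.

```python
def tamper_metrics(pred, target):
    """Return (iou, precision, recall, f_score) for the tampered class.

    pred, target: equally sized 2-D lists of 0/1 values (1 = tampered pixel).
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1  # tampered pixel correctly detected
            elif p:
                fp += 1  # clean pixel flagged as tampered
            elif t:
                fn += 1  # tampered pixel missed

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    iou = ratio(tp, tp + fp + fn)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f_score = ratio(2 * precision * recall, precision + recall)
    return iou, precision, recall, f_score


if __name__ == "__main__":
    pred = [[1, 1, 0], [0, 1, 0]]
    target = [[1, 0, 0], [0, 1, 1]]
    print(tamper_metrics(pred, target))
```

In the toy example there are 2 true positives, 1 false positive and 1 false negative, so the IoU is 2/4 and precision, recall and F-score are all 2/3.
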