# [Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution](https://openaccess.thecvf.com/content/CVPR2023/papers/Qu_Towards_Robust_Tampered_Text_Detection_in_Document_Image_New_Dataset_CVPR_2023_paper.pdf)
The paper proposes the Document Tampering Detector (DTD), a multimodal transformer network for detecting tampered text in document images.
The authors also introduce DocTamper, a large dataset for document text tampering.
### Proposed Method :

- The proposed Document Tampering Detector (DTD) model has four modules:
  - *Visual Perception Head*, which extracts visual features from the image
  - *Frequency Perception Head*, which converts the Discrete Cosine Transform (DCT) coefficients of the image into frequency-domain feature embeddings
  - *Multi-Modality Encoder*
  - *Multi-view Iterative Decoder*
- The visual perception head uses 7 convolution blocks to extract the image features.

**Frequency Perception Head:**

- DCT coefficients can reveal discontinuities in the Block Artifact Grids (BAG) of tampered images. JPEG compression works on 8x8 blocks and leaves a regular grid, which tampering tends to disturb.
- The image is converted to the YCrCb color space, and the DCT coefficient map of its Y channel is fed into convolution layers to obtain the first feature map F<sub>p1</sub>.
- The second feature map F<sub>p2</sub> is computed by expanding the Y-channel quantization table to the size of the DCT coefficient map and multiplying it with F<sub>p1</sub>.
- These two feature maps are concatenated and downsampled to 8x8 blocks, and position embeddings are applied with CoordConv to obtain F<sub>p3</sub>.
- Three MobileConv layers are then applied to obtain the final frequency feature embedding; they enlarge the receptive field and enhance the features.
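
The input preparation described in this list can be sketched in NumPy. This is a minimal illustration under several assumptions, not the authors' implementation: the DCT is recomputed from pixel values rather than read from the JPEG file, the example luminance table from the JPEG standard (Annex K) stands in for the file's own quantization table, and the learned convolution and MobileConv layers are omitted. All function names are hypothetical.

```python
import numpy as np

# Example luminance quantization table from the JPEG standard (Annex K).
# Assumption: a real pipeline would read the table stored in the JPEG file.
JPEG_LUMA_QTABLE = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
], dtype=np.float64)


def rgb_to_y(rgb):
    """Luma (Y) channel of an (H, W, 3) RGB array, using BT.601 weights."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]


def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix of shape (n, n)."""
    k = np.arange(n).reshape(-1, 1)  # frequency index (rows)
    i = np.arange(n).reshape(1, -1)  # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m


def blockwise_dct(y, n=8):
    """2-D DCT of every non-overlapping n x n block; output keeps y's shape."""
    h, w = y.shape
    assert h % n == 0 and w % n == 0, "height and width must be multiples of n"
    d = dct_matrix(n)
    # Split into blocks: (H/n, W/n, n, n), level-shifted by 128 as in JPEG.
    blocks = y.reshape(h // n, n, w // n, n).transpose(0, 2, 1, 3) - 128.0
    coeffs = d @ blocks @ d.T
    return coeffs.transpose(0, 2, 1, 3).reshape(h, w)


def expand_qtable(qtable, height, width):
    """Tile an n x n quantization table to the size of the coefficient map."""
    n = qtable.shape[0]
    return np.tile(qtable, (height // n, width // n))


def add_coord_channels(feat):
    """CoordConv-style step: append normalized row and column coordinates
    as two extra channels to a (C, H, W) array."""
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, h),
                         np.linspace(-1.0, 1.0, w), indexing="ij")
    return np.concatenate([feat, ys[None], xs[None]], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.integers(0, 256, size=(16, 24, 3)).astype(np.float64)
    coeffs = blockwise_dct(rgb_to_y(rgb))
    q_map = expand_qtable(JPEG_LUMA_QTABLE, *coeffs.shape)
    # Stand-in only: the model multiplies the table with the learned
    # feature F_p1, not with the raw coefficients used here.
    stacked = np.stack([coeffs, coeffs * q_map])
    print(add_coord_channels(stacked).shape)
```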

**MultiModal Transformer:**

- The frequency-domain and visual-domain features are fused by a multi-modality transformer, based on Swin Transformer, that uses an scSE (concurrent spatial and channel squeeze-and-excitation) module.
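
To make the scSE step concrete, here is a minimal NumPy sketch of a generic scSE recalibration applied to concatenated features. It rests on assumptions the summary does not confirm: the two feature maps are joined along the channel axis before recalibration, the channel and spatial branches are combined by addition (one of several published variants), and biases are omitted. The weights `w1`, `w2`, `ws` and the helper `fuse` are hypothetical; in the real model they would be learned inside the Swin Transformer based encoder.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def scse(x, w1, w2, ws):
    """Generic scSE recalibration of a (C, H, W) feature map.

    w1: (C // r, C) and w2: (C, C // r) are the two dense layers of the
    channel branch; ws: (C,) is the 1x1 convolution of the spatial branch.
    """
    # Channel branch (cSE): global average pool, bottleneck, channel gates.
    pooled = x.mean(axis=(1, 2))                           # (C,)
    channel_gate = sigmoid(w2 @ np.maximum(w1 @ pooled, 0.0))
    cse = x * channel_gate[:, None, None]

    # Spatial branch (sSE): 1x1 convolution to a single map, pixel gates.
    spatial_gate = sigmoid(np.tensordot(ws, x, axes=(0, 0)))  # (H, W)
    sse = x * spatial_gate[None]

    # Assumption: branches are combined by addition.
    return cse + sse


def fuse(visual, frequency, w1, w2, ws):
    """Hypothetical fusion: concatenate along channels, then recalibrate."""
    return scse(np.concatenate([visual, frequency], axis=0), w1, w2, ws)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(4, 8, 8))
    frequency = rng.normal(size=(4, 8, 8))
    c, r = 8, 2
    fused = fuse(visual, frequency,
                 rng.normal(size=(c // r, c)),
                 rng.normal(size=(c, c // r)),
                 rng.normal(size=c))
    print(fused.shape)
```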

**Multi-view Iterative Decoder:**

- The decoder aims to mimic human perception by using four cascaded iterations, concatenating smaller feature maps after each iteration.

**Training:**

- The model is trained with cross-entropy loss and Lovasz loss.
- A curriculum learning strategy is applied, in which the model is gradually trained from easier to harder data.
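
The training objective can be illustrated with a small NumPy sketch. It is written under stated assumptions rather than taken from the paper: the binary (hinge) form of the Lovasz loss is shown because the summary does not say which variant is used, the two losses are summed with a hypothetical `lovasz_weight`, and `curriculum_difficulty` is a hypothetical linear ramp that only illustrates moving from easier to harder data.

```python
import numpy as np


def bce_with_logits(logits, labels):
    """Numerically stable binary cross-entropy averaged over all pixels."""
    x = np.asarray(logits, dtype=np.float64).ravel()
    y = np.asarray(labels, dtype=np.float64).ravel()
    return float(np.mean(np.maximum(x, 0.0) - x * y
                         + np.log1p(np.exp(-np.abs(x)))))


def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss, given ground
    truth labels sorted by decreasing prediction error."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard


def lovasz_hinge(logits, labels):
    """Binary Lovasz hinge loss over flattened logits and {0, 1} labels."""
    x = np.asarray(logits, dtype=np.float64).ravel()
    y = np.asarray(labels, dtype=np.float64).ravel()
    errors = 1.0 - x * (2.0 * y - 1.0)           # hinge margin violations
    order = np.argsort(-errors, kind="stable")   # largest error first
    grad = lovasz_grad(y[order])
    return float(np.dot(np.maximum(errors[order], 0.0), grad))


def total_loss(logits, labels, lovasz_weight=1.0):
    """Assumption: plain weighted sum of the two losses."""
    return (bce_with_logits(logits, labels)
            + lovasz_weight * lovasz_hinge(logits, labels))


def curriculum_difficulty(step, total_steps):
    """Hypothetical schedule: difficulty grows linearly from 0 to 1."""
    return min(1.0, max(0.0, step / float(total_steps)))
```

In a training loop, `curriculum_difficulty` would control how hard the sampled examples are at each step; how difficulty is measured on the data is not specified in this summary.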
### Experiments :
- The datasets used for training are DocTamper (the proposed dataset) and T-SROIE.
- The evaluation metrics are IoU, precision, recall and F-score.
- The DTD model outperforms other models in document image tampering detection and cross-domain generalisation, and achieves state-of-the-art (SOTA) performance on the T-SROIE dataset.
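
For reference, the four metrics can be computed from a predicted and a ground-truth binary tampering mask as in the sketch below. It assumes pixel-level evaluation on a single image; whether the paper averages per image or accumulates counts over the whole dataset is not stated in this summary, and the function name is hypothetical.

```python
import numpy as np


def tamper_metrics(pred_mask, gt_mask, eps=1e-8):
    """Pixel-level IoU, precision, recall and F-score for binary masks.

    Nonzero entries mark tampered pixels. `eps` avoids division by zero
    when a mask is empty.
    """
    pred = np.asarray(pred_mask).astype(bool)
    gt = np.asarray(gt_mask).astype(bool)

    tp = np.logical_and(pred, gt).sum()    # tampered, predicted tampered
    fp = np.logical_and(pred, ~gt).sum()   # authentic, predicted tampered
    fn = np.logical_and(~pred, gt).sum()   # tampered, predicted authentic

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f_score = 2.0 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return {"iou": float(iou), "precision": float(precision),
            "recall": float(recall), "f_score": float(f_score)}
```

For example, `tamper_metrics([[1, 1], [0, 0]], [[1, 0], [1, 0]])` has one true positive, one false positive and one false negative, so precision, recall and F-score are each about 0.5 and IoU is about 1/3.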