# HTR Proposal
## Motivation
* In the field of Handwritten Text Recognition (HTR), labeled data remains scarce. Achieving strong performance with Transformer models therefore requires either a substantial amount of data or pre-trained models.
* HTR methodologies are not directly applicable to the Scene Text Recognition (STR) domain. The distinct challenges posed by scene texts, such as varying backgrounds, fonts, and orientations, necessitate tailored approaches.
* Conventional Connectionist Temporal Classification (CTC) loss enforces monotonic alignments, which suits word-level and line-level HTR tasks (see the minimal CTC sketch after this list). However, this restriction limits research opportunities for paragraph-level and page-level tasks. Furthermore, languages without explicit word delimiters, such as Chinese and Japanese, pose challenges for approaches that rely on segmenting words or characters.
* Transformer-based models often encounter efficiency challenges, including large parameter counts and prolonged training and inference times, as observed in models like TrOCR.
* While Transformer-based models other than TrOCR show potential, their performance on HTR tasks still falls short of expectations.
* Handwritten images contain valuable ink strokes surrounded by largely redundant background. Is it feasible to develop a masking strategy specialized for HTR that reduces this redundancy?
* Certain HTR models require the incorporation of a Language Model (LM). An intriguing avenue of exploration is whether we can design models that alleviate the dependence on such an auxiliary component.
* Could a single Transformer architecture potentially replace the need for traditional CNN backbones in HTR applications, leading to simplified architectures with promising performance?
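
As a concrete illustration of the monotonic-alignment constraint mentioned above, the following is a minimal PyTorch sketch of line-level CTC training; the vocabulary size, sequence lengths, and batch size are placeholder assumptions, not values from our setup.

```python
import torch
import torch.nn as nn

# CTC marginalizes over all monotonic alignments between the frame sequence
# and the label sequence, which suits word/line-level HTR but not 2D
# paragraph or page layouts. All shapes below are placeholders.
vocab_size = 80                      # assumed charset size, index 0 reserved for the blank
T, B, L = 120, 4, 25                 # frames, batch size, label length

log_probs = torch.randn(T, B, vocab_size, requires_grad=True).log_softmax(-1)  # encoder output
targets = torch.randint(1, vocab_size, (B, L), dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), L, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```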
## Related work
### Handwritten Text Recognition
**1.Rethinking Text Line Recognition Models**
https://arxiv.org/pdf/2104.07787.pdf



**2.Evaluating Sequence-to-Sequence Models for Handwritten Text Recognition**
* Encoder/Decoder: LSTM/LSTM w/ Attn
https://arxiv.org/pdf/1903.07377.pdf


**3.Gated Convolutional Recurrent Neural Networks for Multilingual Handwriting Recognition**
* Encoder/Decoder: GCRNN/CTC
http://www.tbluche.com/files/icdar17_gnn.pdf



**4.Pay Attention to What You Read: Non-recurrent Handwritten Text-Line Recognition**
* Transformer
https://arxiv.org/pdf/2005.13044.pdf




**5.Decoupled attention network for text recognition**
https://arxiv.org/pdf/1912.10205.pdf
* Encoder/Decoder: FCN / GRU



**6.TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models**
https://arxiv.org/pdf/2109.10282.pdf
* Transformer



**7.A Scalable Handwritten Text Recognition System**
* GRCL block
https://arxiv.org/pdf/1904.09150.pdf


**8.Recurrence-free unconstrained handwritten text recognition using gated fully convolutional network**
* GFCN block
https://arxiv.org/pdf/2012.04961.pdf



**9.DAN: a Segmentation-free Document Attention Network for Handwritten Document Recognition**
* Encoder/Decoder: FCN / Transformer decoder
https://arxiv.org/pdf/2203.12273.pdf


**10.End-to-end handwritten paragraph text recognition using a vertical attention network**
* Encoder/Decoder: FCN / LSTM
https://arxiv.org/pdf/2012.03868.pdf




**11.Are multidimensional recurrent layers really necessary for handwritten text recognition?**
* CNN + BLSTM
http://www.elvoldelhomeocell.net/pubs/jpuigcerver_icdar2017.pdf

**12.Accurate, Data-Efficient, Unconstrained Text Recognition with Convolutional Neural Networks**
* GFCN
https://arxiv.org/pdf/1812.11894.pdf



**13.OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold**
https://openaccess.thecvf.com/content_CVPR_2020/papers/Yousef_OrigamiNet_Weakly-Supervised_Segmentation-Free_One-Step_Full_Page_Text_Recognition_by_learning_CVPR_2020_paper.pdf

**14.Best practices for a handwritten text recognition system**
https://www.cse.uoi.gr/~sfikas/DAS2022-Retsinas-BestpracticesHTR.pdf


**15.CSSL-MHTR: Continual Self-Supervised Learning for Scalable Multi-script Handwritten Text Recognition**
https://arxiv.org/pdf/2303.09347.pdf


**16.Attention-based fully gated CNN-BGRU for Russian handwritten text**
https://arxiv.org/pdf/2008.05373.pdf
* CNN-BGRU



**17.Watch your strokes:Improving handwritten text recognition with deformable convolutions**
https://iris.unimore.it/bitstream/11380/1204119/2/2020_ICPR_HTR_CR.pdf


**18.Mask Guided Selective Context Decoding for Handwritten Chinese Text Recognition**


**19.Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention**
https://arxiv.org/pdf/1604.03286.pdf

**20.A Light Transformer-Based Architecture for Handwritten Text Recognition**
https://hal.science/hal-03685976/file/A_Light_Transformer_Based_Architecture_for_Handwritten_Text_Recognition.pdf


**21.SPAN: a Simple Predict & Align Network for Handwritten Paragraph Recognition**
https://arxiv.org/pdf/2102.08742.pdf


**22.An Efficient End-to-End Neural Model for Handwritten Text Recognition**
https://arxiv.org/pdf/1807.07965.pdf

**24.Handwriting Recognition in Low-resource Scripts using Adversarial Learning**
https://openaccess.thecvf.com/content_CVPR_2019/papers/Bhunia_Handwriting_Recognition_in_Low-Resource_Scripts_Using_Adversarial_Learning_CVPR_2019_paper.pdf

**25.StackMix and Blot Augmentations for Handwritten Text Recognition**
https://arxiv.org/pdf/2108.11667.pdf

**26.PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition**
https://arxiv.org/pdf/2207.14807.pdf

---
### Scene text recognition
**1.Self-supervised Implicit Glyph Attention for Text Recognition (CVPR 2023)**
https://openaccess.thecvf.com/content/CVPR2023/papers/Guan_Self-Supervised_Implicit_Glyph_Attention_for_Text_Recognition_CVPR_2023_paper.pdf


**2.Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition (recent TPAMI)**


**3.Levenshtein OCR**
https://arxiv.org/pdf/2209.03594.pdf

**4.Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition**
https://openaccess.thecvf.com/content/CVPR2021/papers/Fang_Read_Like_Humans_Autonomous_Bidirectional_and_Iterative_Language_Modeling_for_CVPR_2021_paper.pdf




**5.Multi-Granularity Prediction for Scene Text Recognition**
https://arxiv.org/pdf/2209.03592.pdf


**6.An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition**
https://arxiv.org/pdf/1507.05717.pdf

---
### Mask strategy
**1.Good helper is around you: Attention-driven masked image modeling**


**2.SemMAE: Semantic-guided masking for learning masked autoencoders**
https://proceedings.neurips.cc/paper_files/paper/2022/file/5c186016d0844767209dc36e9e61441b-Paper-Conference.pdf


**3.What to Hide from Your Students: Attention-Guided Masked Image Modeling**
https://arxiv.org/pdf/2203.12719.pdf


**4.Improving masked autoencoders by learning where to mask**
https://arxiv.org/pdf/2303.06583.pdf


**5.Hard Patches Mining for Masked Image Modeling**
https://openaccess.thecvf.com/content/CVPR2023/papers/Wang_Hard_Patches_Mining_for_Masked_Image_Modeling_CVPR_2023_paper.pdf


**6.Adversarial masking for self-supervised learning**
https://proceedings.mlr.press/v162/shi22d/shi22d.pdf

 
**7.Uniform masking: Enabling MAE pre-training for pyramid-based vision transformers with locality**
https://arxiv.org/pdf/2205.10063.pdf




**8.Evolved Part Masking for Self-Supervised Learning**
https://openaccess.thecvf.com/content/CVPR2023/papers/Feng_Evolved_Part_Masking_for_Self-Supervised_Learning_CVPR_2023_paper.pdf


**9.MaskGIT: Masked Generative Image Transformer**
https://arxiv.org/pdf/2202.04200.pdf

### Datasets (Todo)
* IAM
* READ2016
* LAM
* HKR
* Synthetic data:
  * MJSynth (MJ): https://www.robots.ox.ac.uk/~vgg/data/text/
  * SynthText (ST): https://www.robots.ox.ac.uk/~vgg/data/scenetext/
* Internal data
* Scene text recognition datasets
* CASIA
* ICDAR2013 Competition Dataset
* SCUT-HCCDoc
* MTHv2
## Contributions
* Unified Recognition of Chinese and Non-Chinese Handwritten Text:
We achieve a unified approach for recognizing both Chinese and non-Chinese handwritten text, addressing the challenges posed by different script systems.
* Novel Mask Strategy for Handwritten Text Recognition (HTR):
We introduce a masking strategy tailored to the HTR task, enhancing the model's ability to focus on relevant textual features (a rough sketch follows this list).
* Robust Performance on Large and Small-Scale Datasets:
Our proposed model demonstrates robust performance across a wide range of dataset scales, ensuring consistent accuracy on both large and small-scale data.
* Efficiency-Enhancing Attributes:
Our approach retains an efficient profile while delivering state-of-the-art performance, catering to the demands of real-time and resource-constrained applications.
* Unified Excellence in HTR and Scene Text Recognition (STR) Tasks:
We showcase the versatility of our model by achieving strong performance on both Handwritten Text Recognition (HTR) and Scene Text Recognition (STR) tasks, addressing both domains with a single model.
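
The masking strategy itself is still under design. As a rough, hypothetical sketch of the direction referenced in the second contribution, patch masking could be biased toward high-ink regions so that the redundant background is masked less often than the strokes; the patch size, mask ratio, and sampling scheme below are placeholder assumptions rather than the final method.

```python
import torch
import torch.nn.functional as F

def ink_guided_mask(images, patch_size=16, mask_ratio=0.5):
    """Hypothetical ink-density-guided patch masking (sketch only).

    images: (B, 1, H, W) grayscale line images with dark ink on a light background.
    Returns a boolean mask of shape (B, num_patches), where True = masked patch.
    Patches containing more ink are more likely to be masked.
    """
    B = images.size(0)
    ink = 1.0 - images                                   # darkness as a proxy for ink
    density = F.avg_pool2d(ink, patch_size).flatten(1)   # mean ink per patch: (B, num_patches)

    num_patches = density.size(1)
    num_masked = int(mask_ratio * num_patches)

    # Sample patches without replacement, weighted by ink density; the small
    # epsilon lets empty background patches occasionally be chosen as well.
    idx = torch.multinomial(density + 1e-6, num_masked, replacement=False)

    mask = torch.zeros(B, num_patches, dtype=torch.bool)
    mask[torch.arange(B).unsqueeze(1), idx] = True
    return mask

# Usage sketch: mask half of the 16x16 patches of a 64x512 line image.
mask = ink_guided_mask(torch.rand(2, 1, 64, 512))
```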
## Experiments
### Experimental approach
1. Initial Architecture Exploration:
We commence our study by building on a foundational encoder-decoder architecture (a minimal skeleton of such a baseline is sketched after this list). This versatile architecture forms the basis for our investigation, applying to both Chinese and Latin script datasets and accommodating Scene Text Recognition (STR) tasks.
To assess the model's effectiveness, we first perform an in-depth comparison on the IAM dataset, which serves as a benchmark for handwritten text recognition. We then consolidate results from the remaining three handwritten script datasets in a comprehensive table.
2. Refinement through Masking Strategies and Network Adjustments:
In our pursuit of enhanced performance, we explore various masking strategies and subtle network architecture adjustments. The introduction of novel masking strategies and fine-tuned network modifications aims to capitalize on the contextual information within the input data, thereby improving recognition accuracy.
3. Ablation Study and Parameter Analysis:
To gain deeper insights into the contributions of individual components, we conduct ablation experiments. These experiments encompass an in-depth analysis of the impact of the chosen mask strategies and model hyperparameters on the overall performance.
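
As a starting point for step 1, the kind of baseline we build on can be summarized by the following minimal, hypothetical encoder-decoder skeleton; the convolutional backbone, hidden size, head count, and the omission of positional encodings are placeholder simplifications, not the final design.

```python
import torch
import torch.nn as nn

class BaselineHTR(nn.Module):
    """Hypothetical baseline: small CNN encoder + Transformer decoder (sketch)."""

    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        # Convolutional encoder that collapses the height dimension into a
        # left-to-right feature sequence.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),            # (B, d_model, 1, W')
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        memory = self.encoder(images).squeeze(2).permute(0, 2, 1)   # (B, W', d_model)
        tgt = self.embed(tgt_tokens)                                # (B, L, d_model)
        L = tgt.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=tgt.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)                                       # (B, L, vocab_size)

# Usage sketch with placeholder shapes: 64x512 line images, 20 target tokens.
model = BaselineHTR(vocab_size=80)
logits = model(torch.rand(2, 1, 64, 512), torch.randint(0, 80, (2, 20)))
```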
### Experimental details
1. Recording Model Efficiency:
Throughout our experimental process, we meticulously document the efficiency metrics of our models. This includes quantifying factors such as inference time and computational resources utilized during training and inference. By consistently recording these metrics, we ensure a comprehensive understanding of the trade-offs between model performance and computational requirements.
2. Incorporating Statistical Analysis:
For each set of experimental results, we employ a robust statistical methodology: we run each experiment multiple times and report the standard deviation to account for variability. This practice supports the reliability of our reported metrics by presenting the average performance along with its variation (a small measurement sketch follows this list).
3. Visualization and Interpretability:
Visualizations play a pivotal role in presenting our experimental findings. We ensure that each experiment's results are complemented by visual representations that aid in understanding and interpretation.
4. Fair Comparisons on the IAM Dataset:
When evaluating our model's performance on the IAM dataset, we conduct a series of fair comparisons. Specifically, we run experiments under different conditions:
   * With Language Model (LM): We compare the model's performance with and without an integrated language model, showing the impact of contextual language information on recognition accuracy.
   * With Synthetic Data: We include synthetic training data to enable a fair comparison with TrOCR and other models.
   * With Pretrained Model: We examine the benefits of initializing from a pretrained model.
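
As a small sketch of how these measurements could be recorded and aggregated, the snippet below times an evaluation pass and reports the mean ± standard deviation over repeated runs; the number of runs and the dummy values are placeholders, not actual results.

```python
import time
import statistics

def timed_eval(eval_fn, batches):
    """Run one evaluation pass and return (mean CER over batches, elapsed seconds).

    `eval_fn` is a placeholder that maps a batch to its character error rate.
    """
    start = time.perf_counter()
    cers = [eval_fn(batch) for batch in batches]
    elapsed = time.perf_counter() - start
    return sum(cers) / len(cers), elapsed

def mean_std(values):
    """Format repeated-run results as 'mean ± std'."""
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return f"{mean:.2f} ± {std:.2f}"

# Usage sketch with dummy per-run CERs from, e.g., five seeds.
print(mean_std([4.91, 4.85, 5.02, 4.88, 4.95]))   # -> "4.92 ± 0.07"
```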
### Experimental tables
**IAM Results**
| Method | VAL CER (%) | VAL WER (%) | TEST CER (%) | TEST WER (%) | Params | Training Time | Inference Time | Memory |
| ------ | ----------- | ----------- | ------------ | ------------ | ------ | ------------- | -------------- | ------ |
| 1D-LSTM (No LM) \cite{11} | 3.8 [3.4-4.3] | 5.8 [5.3-6.3] | 13.5 [12.1-14.9] | 18.4 [17.4-19.5] | 9.3M | 3.8 min/epoch | | 10.5G (BS 16) |
| 1D-LSTM (word LM) \cite{11} | 2.9 [2.5-3.3] | 4.4 [3.9-4.8] | 9.2 [8.1-10.2] | 12.2 [11.4-13.2] | 9.3M | 3.8 min/epoch | | 10.5G (BS 16) |
| LSTM/LSTM w/ Attn \cite{2} | - | - | 4.87 | - | | | | |
| GCRNN (7-gram LM) \cite{3} | - | - | 3.2 | 10.5 | 725K | | | |
| Transformer \cite{4} | - | - | 7.62 | 24.54 | 100M | 202.5 s/epoch | | |
| Transformer (Synth) \cite{4} | - | - | 4.67 | 15.45 | 100M | 202.5 s/epoch | | |
| FCN / GRU \cite{5} | - | - | 6.4 | 19.6 | | | | |
| TrOCR_small (Synth) \cite{6} | - | - | 4.22 | - | 62M | | 348.4 s (8.37 sentences/s, 89.22 tokens/s) | |
| TrOCR_base (Synth) \cite{6} | - | - | 3.42 | - | 334M | | 633.7 s (4.60 sentences/s, 50.43 tokens/s) | |
| TrOCR_large (Synth) \cite{6} | - | - | 2.89 | - | 558M | | 666.8 s (4.37 sentences/s, 47.94 tokens/s) | |
| GRCL (Internal) \cite{7} | - | - | 4.0 | 10.8 | 10.6M | | | |
| GFCN \cite{8} | - | - | 7.99 | 28.61 | 1.4M | 13.75 min/epoch | 74 ms/sample | |
| FCN+LSTM \cite{10} | - | - | 4.97 | 14.31 | 2.7M | | | |
| GFCN \cite{12} | - | - | 4.9 | - | 3.4M | | | |
| OrigamiNet \cite{13} | - | - | 4.8 | - | 115.3M | | | |
| CNN/BLSTM + CTC shortcut \cite{14} | - | - | 4.62 | 15.89 | | | | |
| CSSL-MHTR \cite{15} | - | - | 4.9 | - | 10.2M | | | |
| CNN-BGRU \cite{16} | - | - | 5.79 | 15.85 | 885K | | | |
| Deform-CNN \cite{17} | - | - | 4.6 | 19.3 | | | | |
| Light Transformer \cite{20} | - | - | 5.7 | 18.86 | 6.9M | | | |
| Light Transformer (Synth) \cite{20} | - | - | 4.76 | 16.31 | 6.9M | | | |
| SPAN \cite{21} | - | - | 4.82 | 18.17 | 19.2M | | | |
| Ours | - | - | - | - | - | | | |
| Ours with Synth | - | - | - | - | - | | | |
| Ours with LM | - | - | - | - | - | | | |
| Ours with pretrained | - | - | - | - | - | | | |
**READ2016 Results**
| Method | VAL CER (%) | VAL WER (%) | TEST CER (%) | TEST WER (%) | Params | Training Time | Inference Time | Memory |
| ------ | ----------- | ----------- | ------------ | ------------ | ------ | ------------- | -------------- | ------ |
| CNN+BLSTM \cite{2} | - | - | 4.66 | - | | | | |
| CNN+RNN \cite{ICFHR2016} | - | - | 5.1 | 21.1 | | | | |
| FCN+LSTM \cite{10} | - | - | 4.1 | 16.29 | | | | |
| FCN / Transformer decoder \cite{9} | - | - | 4.1 | 17.64 | 7.6M | | | |
| SPAN \cite{21} | - | - | 4.56 | 21.07 | 19.2M | | | |
| Ours | | | | | | | | |
**LAM Results**
| Method | VAL CER (%) | VAL WER (%) | TEST CER (%) | TEST WER (%) | Params | Training Time | Inference Time | Memory |
| ------ | ----------- | ----------- | ------------ | ------------ | ------ | ------------- | -------------- | ------ |
| 1D-LSTM (No LM) \cite{11} | - | - | 3.7 | 12.3 | 9.3M | 3.8 min/epoch | | 10.5G (BS 16) |
| 1D-LSTM (w/ DefConv) \cite{17} | - | - | 3.5 | 11.6 | 9.6M | | | |
| CRNN (w/ DefConv) \cite{17} | - | - | 3.3 | 11.3 | 18.5M | | | |
| GFCN \cite{8} | - | - | 5.2 | 18.5 | 1.4M | | | |
| OrigamiNet \cite{13} | - | - | 3.0 | 11.0 | 115.3M | | | |
| CSSL-MHTR \cite{15} | - | - | 5.1 | - | 10.2M | | | |
| Transformer \cite{4} | - | - | 10.2 | 22.0 | 54.7M | | | |
| TrOCR \cite{6} | - | - | 3.6 | 11.6 | 385.0M | | | |
| Ours | | | | | | | | |
**HKR Results**
| Method | VAL CER (%) | VAL WER (%) | TEST CER (%) | TEST WER (%) | Params | Training Time | Inference Time | Memory |
| ------ | ----------- | ----------- | ------------ | ------------ | ------ | ------------- | -------------- | ------ |
| 1D-LSTM (No LM) \cite{11} | - | - | Test1: 43.4 / Test2: 54.7 | Test1: 76.8 / Test2: 82.9 | 9.6M | 3.8 min/epoch | | 10.5G (BS 16) |
| GCRNN \cite{3} | - | - | Test1: 16.1 / Test2: 10.1 | Test1: 59.6 / Test2: 37.4 | 728K | | | |
| G-CNN-BGRU \cite{16} | - | - | Test1: 4.13 / Test2: 6.31 | Test1: 18.91 / Test2: 23.69 | 885K | | | |
| StackMix \cite{25} | - | - | 3.49 ± 0.08 | 13.0 ± 0.3 | | | | |
| CSSL-MHTR \cite{15} | - | - | 2.9 | - | 10.2M | | | |
| Ours | - | - | | | | | | |
### Visualization results (Todo)