# TESSERACT
[TOC]
## Introduction
[Tesseract](https://en.wikipedia.org/wiki/Tesseract_(software)) is an optical character recognition engine for various operating systems. It is free software, released under the Apache License. Originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005 and development has been sponsored by Google since 2006
### Timeline
![](https://i.imgur.com/HadWUkk.png)
1. Tesseract is an open-source OCR engine which began as a PhD research project at Hewlett Packard labs in Bristol between 1984 and 1994 and gained momentum as a possible software and/or hardware add-on for HP’s line of flatbed scanners
2. After a joint project between HP Labs Bristol, and HP’s scanner division in Colorado, Tesseract had a significant lead in accuracy over the commercial engines, but did not become a product.
3. In late 2005, HP released Tesseract as open source. Tesseract development has been sponsored by Google since 2006
## Legacy Engines
Tesseract versions below 4 are considered legacy engines. These engines use a static character classifier that runs after a series of step-by-step image-processing stages. Tesseract versions 4 and later use a built-in LSTM pipeline as the default classifier.
### Architecture
**Processing + Classification**
Processing follows a traditional step-by-step pipeline:
* **Adaptive thresholding:**
Tesseract expects a binary image, optionally with polygonal text regions defined. It uses adaptive thresholding for image binarisation; a rough OpenCV sketch of this kind of binarisation follows the figure below
![](https://i.imgur.com/sRRGQfZ.png)
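As an illustration of this step (not Tesseract's internal implementation), a minimal adaptive-thresholding sketch using OpenCV might look like the following; the file name and the window/offset parameters are placeholders:
```
import cv2

# Illustrative adaptive thresholding (not Tesseract's internal code).
# "page.png", the block size and the offset are placeholder values.
gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
binary = cv2.adaptiveThreshold(
    gray,                        # single-channel input image
    255,                         # value assigned to pixels above the threshold
    cv2.ADAPTIVE_THRESH_MEAN_C,  # threshold = local mean minus an offset
    cv2.THRESH_BINARY,           # produce a binary output image
    31,                          # blockSize: size of the local neighbourhood (odd)
    15,                          # C: offset subtracted from the local mean
)
cv2.imwrite("page_binary.png", binary)
```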
* **Page Layout Analysis:**
In order to distinguish text regions, non-text regions, images, tables, etc., Tesseract has a hybrid page layout analysis pipeline.\
Top-down + Bottom-up approach\
Top-down approach: Starts with a pretrained layout and attempts to cut the image into the required parts using recursive horizontal and vertical cuts, trying to find rectangles of whitespace [background] or text [foreground]
Bottom-up approach: Analyzes groups of pixels or connected components to classify regions as text/image/graphic/blank, etc., and boxes these regions up into rectangles (a rough connected-component sketch is shown after this bullet)
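A rough sketch of the bottom-up idea using OpenCV connected components (purely illustrative; the size heuristics are made-up placeholders, not Tesseract's actual rules):
```
import cv2

# Illustrative bottom-up pass: group foreground pixels into connected
# components and keep the ones whose size looks text-like. Assumes text
# pixels are nonzero (white on black); invert the image first if needed.
binary = cv2.imread("page_binary.png", cv2.IMREAD_GRAYSCALE)
n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)

text_like_boxes = []
for i in range(1, n):  # label 0 is the background
    x = stats[i, cv2.CC_STAT_LEFT]
    y = stats[i, cv2.CC_STAT_TOP]
    w = stats[i, cv2.CC_STAT_WIDTH]
    h = stats[i, cv2.CC_STAT_HEIGHT]
    if 5 < h < 100 and w < 10 * h:  # crude, made-up "looks like a character" filter
        text_like_boxes.append((x, y, w, h))
```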
* **Word Recognizer:**
1. **Connected Component Analysis**: This step stores the outlines of the components. At this stage, outlines are gathered together, purely by nesting, into Blobs
2. Blobs are filtered using percentile height; this removes drop-caps, punctuation, isolated noise, etc. The filtered blobs are organized into lines and text regions using the line finding algorithm
3. **Line finding algorithm**: The line finding algorithm is designed so that a skewed page can be recognized without having to de-skew it
- Filtered blobs are likely to follow the model of non-overlapping, parallel but sloping lines
- Sorting and processing the blobs by x-coordinate makes it possible to assign blobs to a unique text line, while tracking the slope across the page
- Once the filtered blobs have been assigned to lines, a [least median of squares fit](http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/LeastMedianOfSquares.pdf) is used to estimate the baselines, and the filtered-out blobs are fitted back into the appropriate lines
- The final step of the line creation process merges blobs that overlap by at least half horizontally
4. **Baseline Fitting**: Once the text lines have been found, the baselines are fitted more precisely using a quadratic spline. A quadratic spline helps fit curved baselines, which can occur due to scanning distortion, book bindings, etc. (a toy quadratic fit is sketched below)
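A toy illustration of fitting a curved baseline with NumPy (a single quadratic fit over made-up blob coordinates; Tesseract itself fits a piecewise quadratic spline):
```
import numpy as np

# Made-up data: x-centres of the blobs on one text line and the y-coordinates
# of their bottom edges. A single quadratic stands in for Tesseract's spline.
blob_x = np.array([12, 40, 71, 105, 138, 170, 206, 240], dtype=float)
blob_bottom_y = np.array([200, 199, 197, 196, 196, 197, 199, 202], dtype=float)

a, b, c = np.polyfit(blob_x, blob_bottom_y, deg=2)  # y ~ a*x^2 + b*x + c

def baseline_y(x):
    """Estimated baseline height at horizontal position x."""
    return a * x ** 2 + b * x + c
```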
5. **Fixed pitch detection and chopping**: Tesseract tests the text lines to determine whether they are fixed pitch. Where it finds fixed pitch text, Tesseract chops the words into characters using the pitch, and disables the chopper and associator on these words for the word recognition
![](https://i.imgur.com/KCDHabU.png)
6. **Proportional Word Finding**: For text with non-fixed (proportional) pitch, Tesseract measures gaps in a limited vertical range between the baseline and mean line to locate word spaces
![](https://i.imgur.com/GVF5hOO.png)
7. **Word Recognition**: Words are segmented into characters before they can be recognized. The initial segmentation obtained from fixed-pitch chopping is recognised first; the steps below are applied to non-fixed-pitch text
- **Chopping Joined Characters**: Blobs with the worst confidence are chopped first; candidate chop points are found from concave vertices of a polygonal approximation of the outline
- **Associating Broken Characters**: If the word is still not good enough after chop points are exhausted, it is given to the *associator*. The associator performs a best-first search of the segmentation graph of possible combinations of the maximally chopped blobs into candidate characters. This fully-chop-then-associate approach lets Tesseract recognise broken characters
8. **Static Character Classifier**: An early version of Tesseract used topological features; these are nicely independent of font and size but are not robust to the problems found in real-life images
![](https://i.imgur.com/hd0SqG5.png)
![](https://i.imgur.com/cohE8QE.png)
\
An intermediate idea involved the use of segments of the polygonal approximation as features, but this approach is also not robust to damaged characters\
![](https://i.imgur.com/wX7RCmr.png)
**The breakthrough idea**: During training, the segments of a polygonal approximation are used as features (the *prototypes*), but in recognition, features of a small, fixed length (in normalized units) are extracted from the outline (the *unknown*) and matched many-to-one against the clustered prototype features of the training data
![](https://i.imgur.com/JNNHjYt.png)
\
\
In the above figure, the short, thick lines are the features extracted from the unknown, and the thin, longer lines are the clustered segments of the polygonal approximation that are used as prototypes
* The problem of broken characters was resolved, but the computational cost of computing the distance between an unknown and a prototype is very high
- **Features extracted from Unknown: 3D**
- Multiple features extracted from one unknown
- Each feature is a short, fixed-length, directed line segment with an (x, y) position and angle $\theta$, making it a 3D feature vector (x, y, $\theta$) with components in the integer range [0, 255]
\
![](https://i.imgur.com/bOfRFPV.png)
- **Features extracted from training data: 4D**
- Elements of polygon approximation, clustered within a character/font combination
- x, y, direction and length
![](https://i.imgur.com/Sc4SyvT.png)
- **Distance Function: Single Feature to Single Proto**
The features extracted from the unknown are 3-dimensional, (x, y, $\theta$), with typically 50-100 features in a character, and the prototype features are 4-dimensional (x, y, angle, length), with typically 10-20 features in a prototype configuration
This makes distance calculation computationally expensive
![](https://i.imgur.com/346ScnJ.png)
\
Feature distance $d_{fp} = d^2 + a^2$\
Feature Evidence $e_{fp} = \frac{1}{1 + kd_{fp}^2}$
----
**Note:** A prototype belongs to a character of a particular font and a particular config [a config is one of the 4 attributes: normal, bold, italic, bold italic]
____
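In code form, the feature-to-prototype evidence above might look like this small sketch (`d` is the distance of the unknown feature from the prototype, `a` the angle difference, and `k` a placeholder constant):
```
def feature_evidence(d, a, k=1.0):
    """Evidence that one unknown feature matches one prototype.

    d: distance of the feature from the prototype segment
    a: angle difference between feature and prototype
    k: arbitrary scaling constant (placeholder value)
    """
    d_fp = d ** 2 + a ** 2               # feature distance d_fp
    return 1.0 / (1.0 + k * d_fp ** 2)   # feature evidence e_fp
```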
- **Feature evidence and proto evidence (for a single font config of a single character class)**
Feature Evidence $e_f = \max_{p \,\in\, \{\text{bold, italic, normal, bold italic}\}} e_{fp}$\
Proto Evidence $e_p = \sum_{\text{top } l_p} e_{fp}$
- **Character Normalization Feature**
Single 4D feature for each unknown [y position relative to baseline, outline length, 2nd x-moment, 2nd y-moment]
- **The Distance Function: Unknown char to Prototype**\
$d = 1 - \frac{\sum_f e_f + \sum_p e_p}{N_f + \sum_p l_p}$
\
$d' = \frac{d\,l_o + kc}{l_o + k}$
\
$l_o$ = length of the outline of the unknown\
$c$ = character normalization feature\
$k$ = arbitrary constant, e.g. 10
- **Rating** = $d' l_o$
- **Certainty** = $-20d'$
Greater (less negative) certainty is better: the farther the value is from zero, the greater the distance (a numeric sketch of these quantities appears after the classification step below)
- **Classification:** Proceeds as a two-step process
    - A *class pruner* creates a shortlist of character classes that the unknown might match. The feature space is quantized from $256^3$ to $24^3$, i.e. into bit-vectors
    - The similarity of each feature from the unknown is calculated against the features of all prototypes of a given class (the prototypes are also quantized as bit-vectors)
The distance function keeps a record of evidence for all configurations of a character class
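A numeric sketch of the unknown-to-prototype distance, rating and certainty formulas above (a hypothetical helper with made-up inputs; the per-configuration aggregation of $e_f$ and $e_p$ is simplified away):
```
import numpy as np

def classify_distance(e_f, e_p, l_p, l_o, c, k=10.0):
    """Combine feature and proto evidence into distance, rating and certainty.

    e_f: feature evidence values, one per feature of the unknown
    e_p: proto evidence values, one per prototype of the class
    l_p: length of each prototype
    l_o: outline length of the unknown
    c:   character normalization feature
    k:   arbitrary constant (the text suggests e.g. 10)
    """
    e_f, e_p, l_p = map(np.asarray, (e_f, e_p, l_p))
    n_f = e_f.size

    d = 1.0 - (e_f.sum() + e_p.sum()) / (n_f + l_p.sum())
    d_prime = (d * l_o + k * c) / (l_o + k)  # character-normalized distance
    rating = d_prime * l_o
    certainty = -20.0 * d_prime              # <= 0; closer to zero is better
    return d, d_prime, rating, certainty
```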
9. **Training Data**: The classifier was trained on a mere 20 samples of 94 characters from 8 fonts in a single size, but with 4 attributes (normal, bold, italic, bold italic), making a total of 60,160 training samples
10. **Adaptive Classifier**:
Limitation of the static classifier: because it must generalize to any font, its ability to discriminate between different characters, or between characters and non-characters, is weakened
A second, font-sensitive classifier is therefore trained on the output of the static classifier. It uses baseline/x-height normalization, which helps distinguish upper case from lower case characters and improves immunity against noise specks
![](https://i.imgur.com/kVMaUhY.png)
## LSTM based upgrade
Tesseract version >= 4.0.0 includes a new neural network subsystem configured as a textline recognizer. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single textline
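For example, the LSTM engine can be selected with the `--oem 1` option, shown here through the pytesseract wrapper (assumes Tesseract >= 4 and pytesseract are installed; the file name is a placeholder):
```
import pytesseract
from PIL import Image

# --oem 1 selects the LSTM-only engine; --psm 7 treats the image as a
# single text line. "line.png" is a placeholder file name.
text = pytesseract.image_to_string(
    Image.open("line.png"),
    lang="eng",
    config="--oem 1 --psm 7",
)
print(text)
```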
- **VGSL Specs:** Variable-size Graph Specification Language (VGSL) enables the specification of a neural network, composed of convolutions and LSTMs, that can process variable-sized images, from a very short definition string
- **LSTM engine in version 4.0.0**
The neural net (for eng) takes an image 36 px high, of variable width, with a single channel:
```
1,36,0,1: Input layer of batch=1, height=36, width=variable, channels=1
Ct3,3,16: 3x3 Conv with 16 outputs and tanh non-linearity
Mp3,3: 3x3 max-pool
Lfys64: forward-only LSTM, collapsing y-dimension; 64 outputs
Lfx96: forward-only LSTM, operating on the x-dimension, 96 outputs
Lrx96: reverse LSTM, x dimension, 96 outputs
Lfx512: forward LSTM, 512 outputs
O1c1: Output layer; outputs a 1-d sequence of vector values; softmax with CTC (Connectionist Temporal Classification); final "1" is ignored (actual number of classes is taken from unicharset).
```
An example of the LSTM engine:
![](https://i.imgur.com/ZSIsDC3.png)
- **Connectionist Temporal Classification Loss**
A method for calculating the loss between a sequence-type input and output when the exact alignment between the input and the ground-truth output is not known, e.g. speech-to-text.
![](https://i.imgur.com/Cpr2PnK.png) ![](https://i.imgur.com/ggAQgf1.png)
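A minimal sketch of computing a CTC loss with PyTorch's `nn.CTCLoss` (illustrative shapes only, not Tesseract's training code):
```
import torch
import torch.nn as nn

# Toy CTC loss: 50 time steps, batch of 2, 105 classes (class 0 = blank).
T, N, C = 50, 2, 105
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(2)                          # (time, batch, classes)
targets = torch.randint(1, C, (N, 20), dtype=torch.long)   # label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients that would drive training of the textline recognizer
```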
## Appendix
1. **VGSL Specs**:
**Model string input and output**
A neural network model is described by a string that describes the input spec, the output spec and the layers spec in between. Example:
```
[1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]
```
The first 4 numbers specify the size and type of the input, and follow the TensorFlow convention for an image tensor: [batch, height, width, depth]. Batch is currently ignored, but eventually may be used to indicate a training mini-batch size. Height and/or width may be zero, allowing them to be variable. A non-zero value for height and/or width means that all input images are expected to be of that size, and will be bent to fit if needed. Depth needs to be 1 for greyscale and 3 for color. As a special case, a different value of depth combined with a height of 1 causes the image to be treated from input as a sequence of vertical pixel strips.
NOTE THAT THROUGHOUT, x and y are REVERSED from conventional mathematics, to use the same convention as TensorFlow. The reason TF adopts this convention is to eliminate the need to transpose images on input, since adjacent memory locations in images increase x and then y, while adjacent memory locations in tensors in TF, and NetworkIO in Tesseract, increase the rightmost index first, then the next-left and so on, like C arrays.
The last “word” is the output specification and takes the form:
```
O(2|1|0)(l|s|c)n output layer with n classes.
2 (heatmap) Output is a 2-d vector map of the input (possibly at
different scale). (Not yet supported.)
1 (sequence) Output is a 1-d sequence of vector values.
0 (category) Output is a 0-d single vector value.
l uses a logistic non-linearity on the output, allowing multiple
hot elements in any output vector value. (Not yet supported.)
s uses a softmax non-linearity, with one-hot output in each value.
c uses a softmax with CTC. Can only be used with 1 (sequence).
NOTE Only O1s and O1c are currently supported.
```
The number of classes is ignored (only there for compatibility with TensorFlow) as the actual number is taken from the unicharset.
**Syntax of the Layers in between**
NOTE that all ops input and output the standard TF convention of a 4-d tensor: [batch, height, width, depth] regardless of any collapsing of dimensions. This greatly simplifies things, and allows the VGSLSpecs class to track changes to the values of widths and heights, so they can be correctly passed in to LSTM operations, and used by any downstream CTC operation.
NOTE: in the descriptions below, <d> is a numeric value, and literals are described using regular expression syntax.
NOTE: Whitespace is allowed between ops.
**Functional ops**
```
C(s|t|r|l|m)<y>,<x>,<d> Convolves using a y,x window, with no shrinkage,
random infill, d outputs, with s|t|r|l|m non-linear layer.
F(s|t|r|l|m)<d> Fully-connected with s|t|r|l|m non-linearity and d outputs.
Reduces height, width to 1. Connects to every y,x,depth position of the input,
reducing height, width to 1, producing a single <d> vector as the output.
Input height and width *must* be constant.
For a sliding-window linear or non-linear map that connects just to the
input depth, and leaves the input image size as-is, use a 1x1 convolution
eg. Cr1,1,64 instead of Fr64.
L(f|r|b)(x|y)[s]<n> LSTM cell with n outputs.
The LSTM must have one of:
f runs the LSTM forward only.
r runs the LSTM reversed only.
b runs the LSTM bidirectionally.
It will operate on either the x- or y-dimension, treating the other dimension
independently (as if part of the batch).
s (optional) summarizes the output in the requested dimension, outputting
only the final step, collapsing the dimension to a single element.
LS<n> Forward-only LSTM cell in the x-direction, with built-in Softmax.
LE<n> Forward-only LSTM cell in the x-direction, with built-in softmax,
with binary Encoding.
```
In the above, (s|t|r|l|m) specifies the type of the non-linearity:
```
s = sigmoid
t = tanh
r = relu
l = linear (i.e., No non-linearity)
m = softmax
```
Examples:
```
Cr5,5,32
```
Runs a 5x5 Relu convolution with 32 depth/number of filters.
```
Lfx128
```
runs a forward-only LSTM, in the x-dimension with 128 outputs, treating the y dimension independently.
```
Lfys64
```
runs a forward-only LSTM in the y-dimension with 64 outputs, treating the x-dimension independently and collapses the y-dimension to 1 element.
**Plumbing ops**
The plumbing ops allow the construction of arbitrarily complex graphs. Something currently missing is the ability to define macros for generating say an inception unit in multiple places.
```
[...] Execute ... networks in series (layers).
(...) Execute ... networks in parallel, with their output concatenated in depth.
S<y>,<x> Rescale 2-D input by shrink factor y,x, rearranging the data by
increasing the depth of the input by factor xy.
**NOTE** that the TF implementation of VGSLSpecs has a different S that is
not yet implemented in Tesseract.
Mp<y>,<x> Maxpool the input, reducing each (y,x) rectangle to a single value.
```
Full Example: A 1-D LSTM capable of high quality OCR
```
[1,1,0,48 Lbx256 O1c105]
```
As layer descriptions: (Input layer is at the bottom, output at the top.)
```
O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
outputting 105 classes.
Lbx256: Bi-directional LSTM in x with 256 outputs
1,1,0,48: Input is a batch of 1 image of height 48 pixels in greyscale, treated
as a 1-dimensional sequence of vertical pixel strips.
[]: The network is always expressed as a series of layers.
```
This network works well for OCR, as long as the input image is carefully normalized in the vertical direction, with the baseline and meanline in constant places.
**Full Example: A multi-layer LSTM capable of high quality OCR**
```
[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]
```
As layer descriptions: (Input layer is at the bottom, output at the top.)
```
O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
outputting 105 classes.
Lfx256: Forward-only LSTM in x with 256 outputs
Lrx128: Reverse-only LSTM in x with 128 outputs
Lfx128: Forward-only LSTM in x with 128 outputs
Lfys64: Dimension-summarizing LSTM, summarizing the y-dimension with 64 outputs
Mp3,3: 3 x 3 Maxpool
Ct5,5,16: 5 x 5 Convolution with 16 outputs and tanh non-linearity
1,0,0,1: Input is a batch of 1 image of variable size in greyscale
[]: The network is always expressed as a series of layers.
```
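To make the spec concrete, below is a loose PyTorch sketch of the same stack, written for illustration only; it approximates the Lfys/Lfx/Lrx semantics and ignores details such as Tesseract's exact padding, weight initialization and CTC decoding:
```
import torch
import torch.nn as nn

class VGSLSketch(nn.Module):
    """Loose interpretation of
    [1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]."""

    def __init__(self, n_classes=105):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=5, padding=2)  # Ct5,5,16
        self.pool = nn.MaxPool2d(3, stride=3)                   # Mp3,3
        self.lstm_y = nn.LSTM(16, 64, batch_first=True)         # Lfys64 (per column)
        self.lstm_x1 = nn.LSTM(64, 128, batch_first=True)       # Lfx128
        self.lstm_x2 = nn.LSTM(128, 128, batch_first=True)      # Lrx128 (on reversed seq)
        self.lstm_x3 = nn.LSTM(128, 256, batch_first=True)      # Lfx256
        self.out = nn.Linear(256, n_classes)                    # O1c105 (CTC applied outside)

    def forward(self, x):                    # x: (N, 1, H, W) greyscale image
        x = torch.tanh(self.conv(x))
        x = self.pool(x)                     # (N, 16, H', W')
        n, c, h, w = x.shape
        # Lfys64: run an LSTM down each column and keep only the final step,
        # collapsing the y-dimension (the "s" = summarize in VGSL).
        cols = x.permute(0, 3, 2, 1).reshape(n * w, h, c)  # (N*W', H', 16)
        _, (h_n, _) = self.lstm_y(cols)
        x = h_n[-1].reshape(n, w, 64)        # (N, W', 64): one vector per column
        x, _ = self.lstm_x1(x)               # forward LSTM over x
        x = torch.flip(x, dims=[1])
        x, _ = self.lstm_x2(x)               # reverse LSTM over x (via flipping)
        x = torch.flip(x, dims=[1])
        x, _ = self.lstm_x3(x)
        return self.out(x).log_softmax(-1)   # (N, W', 105), ready for a CTC loss

# e.g. log_probs = VGSLSketch()(torch.randn(1, 1, 48, 200))
```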
2. **CTC Loss**:
![](https://i.imgur.com/Pw9fDAI.png)
![](https://i.imgur.com/zoovRgG.png)
## References
1. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf
2. https://dl.acm.org/doi/pdf/10.1145/1577802.1577804
3. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.568.5833&rep=rep1&type=pdf
4. https://github.com/tesseract-ocr/docs/tree/master/das_tutorial2016
5. https://pdfs.semanticscholar.org/99c9/90ff841da4923771c5d2b6692c3a4656d447.pdf
6. https://tesseract-ocr.github.io/tessdoc/Planning.html#features-from-30x-which-are-missing-for-lstm
7. https://tesseract-ocr.github.io/tessdoc/NeuralNetsInTesseract4.00
8. https://tesseract-ocr.github.io/tessdoc/VGSLSpecs.html
9. https://distill.pub/2017/ctc/