---
tags: OCR, OCR-D
---
# RFC: Preloading OCR-D Processors
## The Current Model
Currently, the [OCR-D API](https://github.com/OCR-D/core) stipulates the following **run-time model** for [OCR-D processors](https://ocr-d.de/en/spec/cli):
1. Workflow engine calls processor on a workspace with a number of parameters.
2. The processor gets started, parses the CLI arguments, and loads all its config files and models into memory.
3. It reads/deserialises the METS.
4. It `chdir`s into the workspace directory and loops over all pages, consuming files from the input fileGrp(s) and producing files in the output fileGrp(s). This involves physical I/O for PAGE and image files.
5. It writes/serialises the METS.
6. The processor exits and wakes up the workflow engine again.
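To make this concrete, the following sketch shows roughly what a processor looks like under this model, using today's core Python API; the class name and the `load_model`/`apply_model` helpers are purely illustrative.

```python
from ocrd import Processor
from ocrd_modelfactory import page_from_file
from ocrd_models.ocrd_page import to_xml
from ocrd_utils import MIMETYPE_PAGE

class MyOcrProcessor(Processor):
    """Illustrative processor following the current run-time model."""

    def process(self):
        # step 2 (continued): model loading happens on every invocation, i.e. once per document
        model = load_model(self.parameter['model'])   # hypothetical helper loading a neural model
        # step 4: the page loop is entirely under the processor's control
        for input_file in self.input_files:
            pcgts = page_from_file(self.workspace.download_file(input_file))
            apply_model(model, pcgts)                 # hypothetical per-page computation
            file_id = '%s_%s' % (self.output_file_grp, input_file.pageId)
            self.workspace.add_file(self.output_file_grp,
                                    ID=file_id,
                                    pageId=input_file.pageId,
                                    mimetype=MIMETYPE_PAGE,
                                    local_filename='%s/%s.xml' % (self.output_file_grp, file_id),
                                    content=to_xml(pcgts))
        # steps 1-3 (spawning, CLI parsing, METS deserialisation) and 5-6 (METS serialisation,
        # exit) happen in the CLI wrapper around this method, once per document
```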
However, this model has a number of **drawbacks**:
- The only places for __parallelisation__ are:
  - _within_ the processor: Since the page loop is fully under the processor's control, it can use multiple threads or multiple processes for either
    - pages (custom page-level parallelism, e.g. a segmenter with `multiprocessing.Pool`; see the sketch after this list):
      PAGE and image I/O can run in parallel, but the METS changes (adding references to these physical files) need to be synchronised.
    - segments or activities within the page (custom sub-page parallelism, e.g. OCR with multithreading):
      No I/O or synchronisation is required, but this is not applicable to all kinds of tasks.
  - _across_ workspaces (book-level parallelism): Entirely under the workflow engine's control, processors can be spawned independently of each other.
    There are no dependencies between different METS, but there might be resource conflicts between processor instances (main memory, GPU memory, I/O bandwidth, network bandwidth), which are hard for workflow engines to anticipate. (For example, the [makefileisation](https://github.com/bertsky/workflow-configuration) uses a [semaphore](https://bertsky.github.io/workflow-configuration/#gpu-vs-cpu-parallelism) to synchronise between GPU-enabled processors.)

  Note that the book-parallel option only helps for the use case of _maximum-throughput batch processing_. We still need page-level parallelism for the use case of _minimum-latency on-demand processing_, especially for single (large) workspaces.
- Any meaningful __error recovery__ must be done in the processor, too. Once the processor exits with failure, the workflow engine must assume the complete workflow failed. Only the processor implementation has the context necessary to [handle errors](https://github.com/OCR-D/core/issues/579) (e.g. from exceptions):
  - On the page level, fall back to the input annotation, or at least skip to the next page.
  - Below the page level, omit the affected segment.
- Steps 2-3 and 5-6 can create a significant __overhead__ relative to the actual computation in step 4:
  - The _METS de/serialisation_ in steps 3 and 5 is computationally expensive for large workspaces (due to the redundant and inefficient mutually recursive structure of METS with its fileGrps vs structMap), and it also costs I/O bandwidth. It could be _skipped_ entirely if step 2 were not a CLI invocation but used the processor's Python API (in case it is a Python module at all, and not bashlib-based).
  - Processors which read large, input-independent files into memory, like dictionaries and language models (especially if they need to be pre-computed or transferred to GPU memory, like _neural models_), could save a lot of CPU time when running on multiple workspaces if steps 2 and 6 were not part of the loop.
    They could have a `setup` routine which _preloads_ everything, independent of the processing.
    This is a concern for both throughput and latency.
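As an aside, here is a rough sketch of the custom page-level parallelism mentioned above, with the processor driving its own `multiprocessing.Pool`; `heavy_page_computation` is a hypothetical per-page function, and all METS updates are funnelled through the parent process so they stay synchronised.

```python
import multiprocessing as mp
from ocrd import Processor
from ocrd_utils import MIMETYPE_PAGE

def process_one_page(task):
    # runs in a worker process: pure computation, no access to the shared METS
    page_id, image_path = task
    return page_id, heavy_page_computation(image_path)   # hypothetical, returns PAGE-XML as string

class MyParallelProcessor(Processor):

    def process(self):
        tasks = [(f.pageId, self.workspace.download_file(f).local_filename)
                 for f in self.input_files]
        with mp.Pool(processes=4) as pool:
            for page_id, page_xml in pool.imap_unordered(process_one_page, tasks):
                # METS changes happen only here, in the parent process,
                # so the shared METS object needs no extra locking
                file_id = '%s_%s' % (self.output_file_grp, page_id)
                self.workspace.add_file(self.output_file_grp,
                                        ID=file_id,
                                        pageId=page_id,
                                        mimetype=MIMETYPE_PAGE,
                                        local_filename='%s/%s.xml' % (self.output_file_grp, file_id),
                                        content=page_xml)
```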
Notice that we don't have any cheap, wholesale _page-level parallelism_ or _page-level recovery_ in core yet. Everything is up to the processor implementation (where it just does not happen). These two are already good reasons in themselves to change the processor API. However, this document focusses on the third point, _workflow-level overhead_, which has not received much attention so far.
## The Preloading Model(s)
Let's assume we adopt the [proposed](https://github.com/OCR-D/core/issues/322) new API allowing page-level parallelism and recovery in core: Processors can opt-in by changing their implementation (as discussed [here](https://github.com/OCR-D/core/issues/322#issuecomment-700991484)) as follows:
- refactor initialisation (currently spilled over the constructor or `process`) into `setup`,
- refactor `process` into `process_page` (leaving the page loop itself to the `Processor` superclass or workflow engine in core).
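For illustration, a refactored processor could then look roughly like this; `setup` and `process_page` are the names from the proposal, but the exact `process_page` signature (here: a parsed PAGE object plus its page id) and the helper functions are assumptions.

```python
from ocrd import Processor

class MyOcrProcessor(Processor):

    def setup(self):
        # one-time initialisation, independent of any workspace:
        # read dictionaries and language models, build/transfer neural models to the GPU
        self.model = load_model(self.parameter['model'])   # hypothetical helper

    def process_page(self, pcgts, page_id):
        # per-page processing only: no CLI parsing, no METS access, no page loop;
        # those are left to the Processor superclass or the workflow engine in core
        apply_model(self.model, pcgts)                      # hypothetical per-page computation
        return pcgts
```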
Now the door would be open for two alternative (or even complementary) approaches:
1. __Workflow Server__. The workflow engine becomes a stateful server (for some concrete workflow). It instantiates all (Python) processors in a workflow once, calling their respective `setup` routines, and waits for requests with concrete workspaces. On request, it deserialises the METS, then iterates fileGrps and pages for all subsequent processors, calling their `process_page` routine (passing data in memory), and finally serialises the METS (see the first sketch below).
1. __Processor Server__. A processor can be implemented as a stateful server (for some concrete workflow). It spawns a daemon once which calls the `setup` routine and waits for requests needing granular access to the large memory-resident objects (e.g. for neural prediction). When called on concrete workspaces (whether invoked via CLI or Python API), the processor's `process_page` routine contains a network client delegating most of the work to the daemon (passing data via network sockets; see the second sketch below).
> Note A: Of course, either of these options only applies to processors that already adopted the new API. That is, under the workflow server, old processors and bashlib processors would still need to be instantiated and terminated document by document. And under the processor server, old processors obviously cannot become servers.
> Note B: No further changes to the OCR-D CLI or core API are required here. Under the workflow server, the de-facto standard core API is simply lifted to an integration interface. Under the processor server, the new internal behaviour is completely encapsulated.
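To make approach 1 more tangible, here is a minimal sketch of the workflow server's core, leaving out the actual serving layer (HTTP, message queue, etc.); the function names and the shape of the workflow configuration are assumptions, only `setup` and `process_page` come from the proposal above.

```python
from ocrd import Resolver
from ocrd_modelfactory import page_from_file
from ocrd_models.ocrd_page import to_xml
from ocrd_utils import MIMETYPE_PAGE

def setup_workflow(workflow):
    """Called once at server start-up. workflow: list of (Processor subclass, parameter dict)."""
    processors = []
    for proc_class, parameter in workflow:
        proc = proc_class(None, parameter=parameter)
        proc.setup()                       # preload models, dictionaries, GPU state
        processors.append(proc)
    return processors

def run_workflow(processors, mets_url, input_file_grp, output_file_grps):
    """Called per request: run the preloaded processors on one workspace."""
    workspace = Resolver().workspace_from_url(mets_url)      # step 3: deserialise the METS once
    for input_file in workspace.mets.find_files(fileGrp=input_file_grp):
        pcgts = page_from_file(workspace.download_file(input_file))
        for proc, file_grp in zip(processors, output_file_grps):
            pcgts = proc.process_page(pcgts, input_file.pageId)   # data stays in memory
            # persist each intermediate result for provenance, but never re-parse it
            file_id = '%s_%s' % (file_grp, input_file.pageId)
            workspace.add_file(file_grp,
                               ID=file_id,
                               pageId=input_file.pageId,
                               mimetype=MIMETYPE_PAGE,
                               local_filename='%s/%s.xml' % (file_grp, file_id),
                               content=to_xml(pcgts))
    workspace.save_mets()                                     # step 5: serialise the METS once
```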
Both approaches also bring the advantage of enabling _fail-over_ capacity.
The processor server additionally brings the advantage of enabling _queueing_ for better local resource allocation (e.g. constant GPU utilisation).
For a practical guide on how to implement such a processor server for neural models in general (i.e. not OCR-D-specifically), see:
- [Keras REST Tutorial](https://blog.keras.io/building-a-simple-keras-deep-learning-rest-api.html)
- [TensorFlow Serving Tutorial](https://www.tensorflow.org/tfx/tutorials/serving/rest_simple)
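Translated to our setting, the same pattern could look roughly like this; Flask/`requests`, the `/predict` route and the `load_model`/`model.predict` interface are illustrative assumptions (not part of any OCR-D spec), and in practice the daemon and client would live in separate modules.

```python
import io
import flask
import requests
from PIL import Image

# --- daemon side: started once, keeps the neural model resident in (GPU) memory ---
app = flask.Flask(__name__)
model = None

def setup():
    global model
    model = load_model('/path/to/model')       # hypothetical helper, e.g. a Keras/TF loader

@app.route('/predict', methods=['POST'])
def predict():
    # granular access: one line or region image per request
    image = Image.open(io.BytesIO(flask.request.data))
    text, confidence = model.predict(image)    # assumed model interface
    return flask.jsonify(text=text, confidence=confidence)

# --- client side: used inside the processor's process_page, delegating the heavy lifting ---
def recognize(image_bytes, url='http://127.0.0.1:5000/predict'):
    response = requests.post(url, data=image_bytes)
    response.raise_for_status()
    return response.json()['text']

if __name__ == '__main__':
    setup()
    app.run(host='127.0.0.1', port=5000)
```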
---
Examples:
- [proof of concept for a workflow server](https://github.com/OCR-D/core/pull/652)
-
Other aspects:
- oversubscription, "rogue" multiscalar processors like TF; controlling GPU/CPU resource allocation via environment variables, discussed [here](https://github.com/OCR-D/core/issues/322#issuecomment-687400296)
- page-level parallelisation for bashlib processors, discussed [here](https://github.com/OCR-D/core/issues/322#issuecomment-687370354)
- minimise the work needed to adapt and manage all processors; prototyping
- [ocrd_all](https://github.com/OCR-D/ocrd_all)'s sub-venvs are isolated from each other; venv vs. Docker encapsulation