---
tags: OCR, OCR-D
---
# RFC: Preloading OCR-D Processors
## The Current Model
Currently, the [OCR-D API](https://github.com/OCR-D/core) stipulates the following **run-time model** for [OCR-D processors](https://ocr-d.de/en/spec/cli):
1. Workflow engine calls processor on a workspace with a number of parameters.
2. The processor gets started, parses the CLI arguments, and loads all its config files and models into memory.
3. It reads/deserialises the METS.
4. It `chdir`s into the workspace directory and loops over all pages, consuming files from the input fileGrp(s) and producing files in the output fileGrp(s). This involves physical I/O for PAGE and image files.
5. It writes/serialises the METS.
6. The processor exits and wakes up the workflow engine again.
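To make this concrete, the following sketch shows roughly what a processor looks like under this model, using today's core Python API; the class name and the `load_model`/`apply_model` helpers are purely illustrative.

```python
from ocrd import Processor
from ocrd_modelfactory import page_from_file
from ocrd_models.ocrd_page import to_xml
from ocrd_utils import MIMETYPE_PAGE

class MyOcrProcessor(Processor):
    """Illustrative processor following the current run-time model."""

    def process(self):
        # step 2 (continued): model loading happens on every invocation, i.e. once per document
        model = load_model(self.parameter['model'])   # hypothetical helper loading a neural model
        # step 4: the page loop is entirely under the processor's control
        for input_file in self.input_files:
            pcgts = page_from_file(self.workspace.download_file(input_file))
            apply_model(model, pcgts)                 # hypothetical per-page computation
            file_id = '%s_%s' % (self.output_file_grp, input_file.pageId)
            self.workspace.add_file(self.output_file_grp,
                                    ID=file_id,
                                    pageId=input_file.pageId,
                                    mimetype=MIMETYPE_PAGE,
                                    local_filename='%s/%s.xml' % (self.output_file_grp, file_id),
                                    content=to_xml(pcgts))
        # steps 1-3 (spawning, CLI parsing, METS deserialisation) and 5-6 (METS serialisation,
        # exit) happen in the CLI wrapper around this method, once per document
```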
However, this model has a number of **drawbacks**:
- The only places for __parallelisation__ are:
  - _within_ the processor: Since the page loop is fully under the processor's control, it can use multiple threads or multiple processes for either
    - pages (custom page-level parallelism, e.g. a segmenter with `multiprocessing.Pool`; see the sketch after this list):
      PAGE and image I/O can run in parallel, but the METS changes (adding references to these physical files) need to be synchronised.
    - segments or activities within the page (custom sub-page parallelism, e.g. OCR with multithreading):
      No I/O or synchronisation is required, but this is not applicable to all kinds of tasks.
  - _across_ workspaces (book-level parallelism): Entirely under the workflow engine's control, processors can be spawned independently of each other.
    There are no dependencies between different METS, but there might be resource conflicts between processor instances (main memory, GPU memory, I/O bandwidth, network bandwidth), which are hard for workflow engines to anticipate. (For example, the [makefileisation](https://github.com/bertsky/workflow-configuration) uses a [semaphore](https://bertsky.github.io/workflow-configuration/#gpu-vs-cpu-parallelism) to synchronise between GPU-enabled processors.)

  Note that the book-parallel option only helps for the use case of _maximum-throughput batch processing_. We still need page-level parallelism for the use case of _minimum-latency on-demand processing_, especially for single (large) workspaces.
- Any meaningful __error recovery__ must be done in the processor, too. Once the processor exits with failure, the workflow engine must assume the complete workflow failed. Only the processor implementation has the context necessary to [handle errors](https://github.com/OCR-D/core/issues/579) (e.g. from exceptions):
  - On the page level, fall back to the input annotation, or at least skip to the next page.
  - Below the page level, omit the affected segment.
- Steps 2-3 and 5-6 can create a significant __overhead__ relative to the actual computation in step 4:
  - The _METS de/serialisation_ in steps 3 and 5 is computationally expensive for large workspaces (due to the redundant and inefficient mutually recursive structure of METS with its fileGrps vs structMap), and it also costs I/O bandwidth. It could be _skipped_ entirely if step 2 were not a CLI invocation but used the processor's Python API (in case it is a Python module at all, and not bashlib-based).
  - Processors which read large, input-independent files into memory, like dictionaries and language models (especially if they need to be pre-computed or transferred to GPU memory, like _neural models_), could save a lot of CPU time when running on multiple workspaces if steps 2 and 6 were not part of the loop.
    They could have a `setup` routine which _preloads_ everything, independent of the processing.
    This is a concern for both throughput and latency.
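As an aside, here is a rough sketch of the custom page-level parallelism mentioned above, with the processor driving its own `multiprocessing.Pool`; `heavy_page_computation` is a hypothetical per-page function, and all METS updates are funnelled through the parent process so they stay synchronised.

```python
import multiprocessing as mp
from ocrd import Processor
from ocrd_utils import MIMETYPE_PAGE

def process_one_page(task):
    # runs in a worker process: pure computation, no access to the shared METS
    page_id, image_path = task
    return page_id, heavy_page_computation(image_path)   # hypothetical, returns PAGE-XML as string

class MyParallelProcessor(Processor):

    def process(self):
        tasks = [(f.pageId, self.workspace.download_file(f).local_filename)
                 for f in self.input_files]
        with mp.Pool(processes=4) as pool:
            for page_id, page_xml in pool.imap_unordered(process_one_page, tasks):
                # METS changes happen only here, in the parent process,
                # so the shared METS object needs no extra locking
                file_id = '%s_%s' % (self.output_file_grp, page_id)
                self.workspace.add_file(self.output_file_grp,
                                        ID=file_id,
                                        pageId=page_id,
                                        mimetype=MIMETYPE_PAGE,
                                        local_filename='%s/%s.xml' % (self.output_file_grp, file_id),
                                        content=page_xml)
```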
Notice that we don't have any cheap, wholesale _page-level parallelism_ or _page-level recovery_ in core yet. Everything is up to the processor implementation (where it just does not happen). These two are already good reasons in themselves to change the processor API. However, this document focusses on the third point, _workflow-level overhead_, which has not received much attention so far.
## The Preloading Model(s)
Let's assume we adopt the [proposed](https://github.com/OCR-D/core/issues/322) new API allowing page-level parallelism and recovery in core: Processors can opt-in by changing their implementation (as discussed [here](https://github.com/OCR-D/core/issues/322#issuecomment-700991484)) as follows:
- refactor initialisation (currently spilled over the constructor or `process`) into `setup`,
- refactor `process` into `process_page` (leaving the page loop itself to the `Processor` superclass or workflow engine in core).
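For illustration, a refactored processor could then look roughly like this; `setup` and `process_page` are the names from the proposal, but the exact `process_page` signature (here: a parsed PAGE object plus its page id) and the helper functions are assumptions.

```python
from ocrd import Processor

class MyOcrProcessor(Processor):

    def setup(self):
        # one-time initialisation, independent of any workspace:
        # read dictionaries and language models, build/transfer neural models to the GPU
        self.model = load_model(self.parameter['model'])   # hypothetical helper

    def process_page(self, pcgts, page_id):
        # per-page processing only: no CLI parsing, no METS access, no page loop;
        # those are left to the Processor superclass or the workflow engine in core
        apply_model(self.model, pcgts)                      # hypothetical per-page computation
        return pcgts
```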
Now the door would be open for two alternative (or even complementary) approaches:
1. __Workflow Server__. The workflow engine becomes a stateful server (for some concrete workflow). It instantiates all (Python) processors in a workflow once, calling their respective `setup` routines, and waits for requests with concrete workspaces. On request, it deserialises the METS, then iterates fileGrps and pages for all subsequent processors, calling their `process_page` routine (passing data in memory), and finally serialises the METS (see the first sketch below).
1. __Processor Server__. A processor can be implemented as a stateful server (for some concrete workflow). It spawns a daemon once which calls the `setup` routine and waits for requests needing granular access to the large memory-resident objects (e.g. for neural prediction). When called on concrete workspaces (whether invoked via CLI or Python API), the processor's `process_page` routine contains a network client delegating most of the work to the daemon (passing data via network sockets; see the second sketch below).
> Note A: Of course, either of these options only applies to processors that already adopted the new API. That is, under the workflow server, old processors and bashlib processors would still need to be instantiated and terminated document by document. And under the processor server, old processors obviously cannot become servers.
> Note B: No further changes to the OCR-D CLI or core API are required here. Under the workflow server, the de-facto standard core API is simply lifted to an integration interface. Under the processor server, the new internal behaviour is completely encapsulated.
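To make approach 1 more tangible, here is a minimal sketch of the workflow server's core, leaving out the actual serving layer (HTTP, message queue, etc.); the function names and the shape of the workflow configuration are assumptions, only `setup` and `process_page` come from the proposal above.

```python
from ocrd import Resolver
from ocrd_modelfactory import page_from_file
from ocrd_models.ocrd_page import to_xml
from ocrd_utils import MIMETYPE_PAGE

def setup_workflow(workflow):
    """Called once at server start-up. workflow: list of (Processor subclass, parameter dict)."""
    processors = []
    for proc_class, parameter in workflow:
        proc = proc_class(None, parameter=parameter)
        proc.setup()                       # preload models, dictionaries, GPU state
        processors.append(proc)
    return processors

def run_workflow(processors, mets_url, input_file_grp, output_file_grps):
    """Called per request: run the preloaded processors on one workspace."""
    workspace = Resolver().workspace_from_url(mets_url)      # step 3: deserialise the METS once
    for input_file in workspace.mets.find_files(fileGrp=input_file_grp):
        pcgts = page_from_file(workspace.download_file(input_file))
        for proc, file_grp in zip(processors, output_file_grps):
            pcgts = proc.process_page(pcgts, input_file.pageId)   # data stays in memory
            # persist each intermediate result for provenance, but never re-parse it
            file_id = '%s_%s' % (file_grp, input_file.pageId)
            workspace.add_file(file_grp,
                               ID=file_id,
                               pageId=input_file.pageId,
                               mimetype=MIMETYPE_PAGE,
                               local_filename='%s/%s.xml' % (file_grp, file_id),
                               content=to_xml(pcgts))
    workspace.save_mets()                                     # step 5: serialise the METS once
```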
Both approaches also bring the advantage of enabling _fail-over_ capacity.
The processor server additionally brings the advantage of enabling _queueing_ for better local resource allocation (e.g. constant GPU utilisation).
For a practical guide on how to implement such a processor server for neural models in general (i.e. not OCR-D-specifically), see:
- [Keras REST Tutorial](https://blog.keras.io/building-a-simple-keras-deep-learning-rest-api.html)
- [TensorFlow Serving Tutorial](https://www.tensorflow.org/tfx/tutorials/serving/rest_simple)
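Translated to our setting, the same pattern could look roughly like this; Flask/`requests`, the `/predict` route and the `load_model`/`model.predict` interface are illustrative assumptions (not part of any OCR-D spec), and in practice the daemon and client would live in separate modules.

```python
import io
import flask
import requests
from PIL import Image

# --- daemon side: started once, keeps the neural model resident in (GPU) memory ---
app = flask.Flask(__name__)
model = None

def setup():
    global model
    model = load_model('/path/to/model')       # hypothetical helper, e.g. a Keras/TF loader

@app.route('/predict', methods=['POST'])
def predict():
    # granular access: one line or region image per request
    image = Image.open(io.BytesIO(flask.request.data))
    text, confidence = model.predict(image)    # assumed model interface
    return flask.jsonify(text=text, confidence=confidence)

# --- client side: used inside the processor's process_page, delegating the heavy lifting ---
def recognize(image_bytes, url='http://127.0.0.1:5000/predict'):
    response = requests.post(url, data=image_bytes)
    response.raise_for_status()
    return response.json()['text']

if __name__ == '__main__':
    setup()
    app.run(host='127.0.0.1', port=5000)
```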
---
Examples:
- [proof of concept for a workflow server](https://github.com/OCR-D/core/pull/652)
-
Other aspects:
- oversubscription, "rogue" multiscalar processors like TF; controlling GPU/CPU resource allocation via environment variables, discussed [here](https://github.com/OCR-D/core/issues/322#issuecomment-687400296)
- page-level parallelisation for bashlib processors, discussed [here](https://github.com/OCR-D/core/issues/322#issuecomment-687370354)
- minimise the work needed to adapt and manage all processors; prototyping
- [ocrd_all](https://github.com/OCR-D/ocrd_all)'s sub-venvs are isolated from each other; venv vs. Docker encapsulation