Currently, the OCR-D API stipulates the following run-time model for OCR-D processors:
The processor is instantiated per document: it `chdir`s into the workspace directory and loops over all pages, consuming files in the input fileGrp(s) and producing files in the output fileGrp(s), respectively. This involves physical I/O for PAGE and image files.
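To make that concrete, here is a minimal, schematic sketch of the current model. It deliberately does not use the real `ocrd.Processor` API; the class layout, the workspace methods, and the helpers (`load_model`, `parse_page`, `segment_page`, `write_page`) are hypothetical stand-ins.

```python
# Schematic sketch of the current run-time model (hypothetical helpers,
# not the actual ocrd API): everything happens per document, inside the processor.

class CurrentStyleProcessor:
    def __init__(self, workspace, input_file_grp, output_file_grp, parameter):
        self.workspace = workspace
        self.input_file_grp = input_file_grp
        self.output_file_grp = output_file_grp
        # expensive one-time initialisation, repeated for every document
        self.model = load_model(parameter['model'])

    def process(self):
        # the page loop and all physical I/O live inside the implementation
        for input_file in self.workspace.find_files(self.input_file_grp):
            page = parse_page(input_file)            # read PAGE + image from disk
            result = segment_page(self.model, page)  # the actual computation
            write_page(self.workspace, self.output_file_grp, result)  # write to disk
        self.workspace.save_mets()                   # serialise the METS
```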
However, this model has a number of drawbacks:

1. The only places for parallelisation are across documents ("book-parallel", i.e. running multiple workspaces or complete workflows concurrently) or within the processor implementation itself (e.g. via `multiprocessing.Pool`). Note the book-parallel option only helps under the use-case of maximum-throughput batch processing. We still need page-level parallelism for the use-case of minimum-latency on-demand processing, especially for single (large) workspaces.
2. Any meaningful error recovery must be done in the processor, too. Once the processor exits with failure, the workflow engine must assume the complete workflow failed. Only the processor implementation has the context necessary to handle errors (e.g. from exceptions). (A sketch of what points 1 and 2 would require inside a processor follows after this list.)
3. Steps 2-3 and 5-6 (the per-document start-up and tear-down) can create a significant overhead relative to the actual computation in step 4: much of that overhead (e.g. loading large models) could be avoided by a dedicated `setup` routine which preloads everything, independent of the processing. This is a concern for both throughput and latency.
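The following sketch (extending the one above, with the same hypothetical helpers) illustrates points 1 and 2: under the current API, page-level parallelism and per-page recovery can only be hand-coded inside each processor's `process`.

```python
import logging
from multiprocessing import Pool

class ParallelCurrentStyleProcessor(CurrentStyleProcessor):
    def process(self):
        input_files = list(self.workspace.find_files(self.input_file_grp))
        # page-level parallelism, hand-rolled inside the processor
        # (assumes pages and processor state are picklable; GPU models complicate this)
        with Pool(processes=4) as pool:
            results = pool.map(self.process_one, input_files)
        for result in results:
            if result is not None:                   # per-page recovery: skip failed pages
                write_page(self.workspace, self.output_file_grp, result)
        self.workspace.save_mets()

    def process_one(self, input_file):
        try:
            page = parse_page(input_file)
            return segment_page(self.model, page)
        except Exception as err:
            logging.error('page %s failed: %s', input_file, err)
            return None
```

None of this is the processor's actual job, which is why it typically does not get implemented.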
Notice that we don't have any cheap, wholesale page-level parallelism or page-level recovery in core yet. Everything is up to the processor implementation (where it just does not happen). These two are already good reasons in themselves to change the processor API. However, this document focusses on the third point, workflow-level overhead, which has not received much attention so far.
Let's assume we adopt the proposed new API allowing page-level parallelism and recovery in core. Processors can opt in by changing their implementation (as discussed here) as follows: splitting (the initialisation part of) `process` into `setup`, and (the per-page part of) `process` into `process_page` (leaving the page loop itself to the `Processor` superclass or workflow engine in core), roughly as sketched below.
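A minimal sketch of such an opted-in processor; the exact signatures of the hooks and the helpers (`load_model`, `segment_page`) are assumptions, and the base class is omitted for brevity:

```python
class NewStyleProcessor:            # would subclass ocrd's Processor in practice
    def setup(self):
        # one-time initialisation, independent of any particular workspace;
        # self.parameter is assumed to be provided by the framework
        self.model = load_model(self.parameter['model'])

    def process_page(self, page):
        # pure per-page computation: no chdir, no page loop, no physical I/O;
        # `page` is an in-memory PAGE (and image) object handed over by core
        return segment_page(self.model, page)
```

The page loop, the physical I/O, and any parallelism or per-page recovery would then be generic code in core.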
Now the door would be open for two alternative (or even complementary) approaches:
1. A workflow server, which runs each processor's `setup` routine once and waits for requests with concrete workspaces. On request, it deserialises the METS, then iterates fileGrps and pages for all subsequent processors, calling their `process_page` routines (passing data in memory), and finally serialises the METS (see the sketch after the notes below).

2. A processor server (daemon), which runs the `setup` routine once and waits for requests with granular access to the large memory-resident objects, like neural prediction. When called on concrete workspaces (whether invoked via CLI or Python API), the processor's `process_page` routine contains a network client delegating most of the work to the daemon, passing data via network sockets (a sketch follows further below).

Note A: Of course, either of these options only applies to processors that have already adopted the new API. That is, under the workflow server, old processors and bashlib processors would still need to be instantiated and terminated document by document. And under the processor server, old processors obviously cannot become servers.
Note B: No further changes to the OCR-D CLI or core API are required here. Under the workflow server, the de-facto standard core API is simply lifted to an integration interface. Under the processor server, the new internal behaviour is completely encapsulated.
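To illustrate the workflow-server option, here is a rough sketch under those assumptions (all names hypothetical; the bookkeeping of intermediate results per fileGrp is heavily simplified):

```python
class WorkflowServer:
    """Keeps all set-up processors of one workflow resident and serves whole workspaces."""

    def __init__(self, processors):
        self.processors = processors
        for processor in self.processors:
            processor.setup()                 # models stay memory-resident between requests

    def handle_request(self, workspace, file_grps):
        mets = workspace.load_mets()          # deserialise the METS once per request
        for input_file in workspace.find_files(file_grps[0]):
            page = parse_page(input_file)     # physical read once per page
            # run all subsequent processors on the in-memory page data
            for processor, output_grp in zip(self.processors, file_grps[1:]):
                page = processor.process_page(page)
                record_result(mets, output_grp, page)   # keep intermediate result
        workspace.save_mets()                 # serialise the METS once per request
```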
Both approaches also bring the advantage of enabling fail-over capacity.
The processor server additionally brings the advantage of enabling queueing for better local resource allocation (e.g. constant GPU utilisation).
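For illustration, a minimal sketch of the processor-server pattern, using `multiprocessing.connection` as the transport; the protocol, address, and helpers (`load_model`, `predict`, `extract_image`, `apply_prediction`) are assumptions, not part of any OCR-D spec:

```python
from multiprocessing.connection import Client, Listener

ADDRESS, AUTHKEY = ('localhost', 6789), b'ocrd-demo'

def serve(parameter):
    """Daemon: run the expensive setup once, then answer granular requests."""
    model = load_model(parameter['model'])              # memory-resident, loaded once
    with Listener(ADDRESS, authkey=AUTHKEY) as listener:
        while True:
            with listener.accept() as conn:
                page_image = conn.recv()                # granular request, e.g. one page image
                conn.send(predict(model, page_image))   # e.g. neural prediction

class ClientSideProcessor:                              # would subclass ocrd's Processor in practice
    def process_page(self, page):
        # the CLI/Python side no longer holds the model: it delegates the heavy
        # lifting to the daemon and only does light pre-/post-processing itself
        with Client(ADDRESS, authkey=AUTHKEY) as conn:
            conn.send(extract_image(page))
            prediction = conn.recv()
        return apply_prediction(page, prediction)
```

Because all requests arrive at a single daemon, they can also be queued there, which is what enables the constant GPU utilisation mentioned above.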
For a practical guide on how to implement a processor server for neural models in general (not OCR-D-specifically), see:
Examples:
Other aspects: