# ELAN e.V. High Availability Work for Opencast in 2022

This year, we are investing in the high availability of the open-source projects we support. Read about what we are doing, what we find out, and how we address problems.

## Index rebuild performance

To begin with, we want to address an issue which came up several times when talking with the community about their expectations of a highly available Opencast: the Elasticsearch index rebuild may take a long time, depending on the number of events, series and workflows to be processed. As part of our project, we want to optimize the rebuild process to reduce the downtime it causes for a running Opencast instance.

To achieve this, we will first build an index with a large amount of sample data within a constant environment, so that we have a comparable basis for testing during further development. The aim is to be able to make reliable statements about the progress of improving the index rebuild process. In a second step, the most time-consuming parts of the index rebuild will be optimized by reducing them to what is necessary and by applying approaches like parallelization and bulk inserts.

### Sample data set to test index rebuilds

While cleaning up and improving the index rebuild, we need a default data set to compare different code versions and make quantitative statements about our efforts. This data set has to be big enough to represent an active Opencast instance containing a large number of series, (scheduled) events and workflows. It has to be deployed in a testing environment with constant conditions and should be scalable to meet different testing requirements.

For this purpose, a Python script has been set up which ingests an arbitrary number of series and events using the External API of Opencast. The events are randomly generated from a small media pool containing a couple of videos. A random number of no-op and failing workflows, within a defined interval, is run against each event of the data set.

Finally, the test data set has been deployed on a virtual machine on our Proxmox server running a clean Opencast installation. This gives us a consistent testing environment which can easily be backed up, reset and reused for testing different code modifications.
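To give an idea of how such a script can drive the External API, here is a minimal sketch of the ingest part. The endpoint paths, form fields, credentials, workflow definition and response handling are assumptions based on the External API and do not reproduce the exact script used for our measurements.

```python
# Hypothetical sketch of the test-data ingest script. Endpoints, form fields
# and credentials follow the Opencast External API but are assumptions here.
import json
import random
import requests

OPENCAST = "https://opencast.example.org"      # placeholder host
AUTH = ("admin", "opencast")                   # placeholder credentials
MEDIA_POOL = ["clip1.mp4", "clip2.mp4"]        # small pool of sample videos

def create_series(title):
    metadata = [{"flavor": "dublincore/series",
                 "fields": [{"id": "title", "value": title}]}]
    r = requests.post(f"{OPENCAST}/api/series", auth=AUTH,
                      data={"metadata": json.dumps(metadata), "acl": "[]"})
    r.raise_for_status()
    return r.json().get("identifier")          # assuming the new ID is returned

def create_event(series_id, title):
    metadata = [{"flavor": "dublincore/episode",
                 "fields": [{"id": "title", "value": title},
                            {"id": "isPartOf", "value": series_id}]}]
    processing = {"workflow": "fast", "configuration": {}}   # assumed workflow
    with open(random.choice(MEDIA_POOL), "rb") as video:
        r = requests.post(f"{OPENCAST}/api/events", auth=AUTH,
                          data={"metadata": json.dumps(metadata),
                                "acl": "[]",
                                "processing": json.dumps(processing)},
                          files={"presentation": video})
    r.raise_for_status()
    return r.json().get("identifier")

if __name__ == "__main__":
    for s in range(200):                       # scale to the desired data set size
        sid = create_series(f"Test series {s}")
        for e in range(5):
            create_event(sid, f"Test event {s}-{e}")
```

The number of series, events per series and workflows run against them can be scaled to produce data sets of different sizes.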
The currently used data set contains the following elements:

| Index objects    | Amount |
| ---------------- | -----: |
| Series           |    200 |
| Events           |   1100 |
| Scheduled events |    500 |
| Total workflows  |  27386 |

<br/>

The average times used as a baseline for the further development have been measured as follows:

| Index rebuild section     |    Rebuild time |         Share |
| ------------------------- | ---------------:| -------------:|
| Series                    |          16.8 s |          0.6% |
| Scheduler                 |          41.2 s |          1.6% |
| Workflow                  |        2442.1 s |         91.9% |
| Asset manager             |         156.8 s |          5.9% |
| <b>Total rebuild time</b> | <b>2656.9 s</b> | <b>100.0%</b> |

### Optimizing the workflow index rebuild

Amongst other topics, we analyze and modify the most time-consuming step of the index rebuild process: the re-indexing of the workflows. Irrelevant and redundant information shall be removed, and the step of writing the actual workflow information to the Elasticsearch index shall be optimized.

The first step of optimizing the index rebuild was to change which workflows the rebuild process has to handle at all. The previous approach was to index all workflows that had ever been run against an event until the latest workflow of each event was indexed. This leads to a large overhead if the ratio of workflows to events within a cluster is significantly higher than one. To filter the workflows which shall be indexed, we need access to information about the workflows, which in previous versions of Opencast had been stored as single XML objects within the job tables of the database. The crowdfunded pull request which also removed the Solr index from the workflow service addressed this problem and created a new database table in which all the information necessary for further processing can be accessed. Time measurements also showed a slightly better performance after this patch.

| Index rebuild section     |    Rebuild time |         Share |
| ------------------------- | ---------------:| -------------:|
| Series                    |          16.1 s |          0.6% |
| Scheduler                 |          43.8 s |          1.8% |
| Workflow                  |        2226.0 s |         90.3% |
| Asset manager             |         179.7 s |          7.2% |
| <b>Total rebuild time</b> | <b>2465.6 s</b> | <b>100.0%</b> |

<br/>

After the workflow data had been moved to its own table, it was an easy task to select only the latest workflow for each event by sorting the workflows by creation date and grouping them by their media package ID.
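To make the filtering explicit, here is a small sketch of this "latest workflow per event" selection. The field names are made up for illustration; in Opencast the equivalent filtering is done against the new workflow table in the database.

```python
# Sketch of the "latest workflow per event" selection. Field names are
# hypothetical; Opencast performs the equivalent filtering in the database.
from collections import namedtuple
from datetime import datetime

Workflow = namedtuple("Workflow", ["id", "mediapackage_id", "date_created", "state"])

def latest_workflow_per_event(workflows):
    """Keep only the most recently created workflow for each media package."""
    latest = {}
    for wf in sorted(workflows, key=lambda wf: wf.date_created):
        latest[wf.mediapackage_id] = wf   # later workflows overwrite earlier ones
    return list(latest.values())

workflows = [
    Workflow(1, "mp-1", datetime(2022, 1, 10), "SUCCEEDED"),
    Workflow(2, "mp-1", datetime(2022, 3, 2), "FAILED"),
    Workflow(3, "mp-2", datetime(2022, 2, 14), "SUCCEEDED"),
]
print(latest_workflow_per_event(workflows))   # one workflow per media package
```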
As the time measurement below shows, a lot of time can be saved when rebuilding the index of an Opencast cluster containing far more workflows than events.

| Index rebuild section     |   Rebuild time |         Share |
| ------------------------- | --------------:| -------------:|
| Series                    |         14.1 s |          5.2% |
| Scheduler                 |         38.4 s |         14.1% |
| Workflow                  |         91.2 s |         33.5% |
| Asset manager             |        128.4 s |         47.2% |
| <b>Total rebuild time</b> | <b>272.1 s</b> | <b>100.0%</b> |

<br/>

After removing the irrelevant workflows from the index rebuild, the performance can be further improved by reducing the data used when updating a workflow within the index. So far, when rebuilding the index, all available information about the workflow and the corresponding media package was read from the database or catalog and written to the index. From the workflow side, however, only the state of the latest workflow is used by most search operations. The media package data is written to the index again within the rebuild function of the asset manager, so it makes no sense to add the media package metadata to the index at this point. There can also be data changes in between, so that the index data written by the workflow service would differ from that written by the asset manager. For these reasons, the data used when rebuilding the workflow index has been reduced to the state and the necessary IDs of the workflow, the corresponding media package and the organization. The event metadata added at this point has been removed completely, and the query fetching the values from the database has been simplified to reduce access time, with the latest-workflow filtering now done within the actual rebuild function. Primarily clusters with a workflow-to-event ratio of about one will benefit from these changes. As before, a time measurement has been taken to compare with the previous states.

| Index rebuild section     |   Rebuild time |         Share |
| ------------------------- | --------------:| -------------:|
| Series                    |         14.0 s |          5.6% |
| Scheduler                 |         36.7 s |         14.7% |
| Workflow                  |         67.6 s |         27.1% |
| Asset manager             |        130.9 s |         52.5% |
| <b>Total rebuild time</b> | <b>249.2 s</b> | <b>100.0%</b> |

<br/>

We also tried to reduce the workflow index rebuild time by writing the workflow index entries from a configurable number of threads, sending a single request in each thread. As the table below shows, this would make it possible to save more than 50% of the processing time, but the approach has been discarded in favour of the much more efficient bulk requests of Elasticsearch, described in the next section.

| Number of threads used |   Rebuild time |
| ---------------------- | --------------:|
| Workflow - 1 thread    |         51.6 s |
| Workflow - 4 threads   |         41.7 s |
| Workflow - 16 threads  |         35.1 s |
| Workflow - 64 threads  |         27.5 s |

<br/>

Finally, the order of the index rebuild itself has been changed, putting the workflow index rebuild in the last position, so that Opencast can theoretically already be used while the workflow index data is still being updated. As not all information within the admin UI will be displayed correctly as long as the rebuild is still running, an additional indicator for a running rebuild has been added. It shows the currently active rebuild step and the status of the previously completed steps.

### Using the bulk inserts option of Elasticsearch

The previous procedure only used single requests for updating Elasticsearch documents. As mentioned in the Elasticsearch documentation, a lot of processing time can be saved by collecting multiple documents and sending them within a single bulk request. So we started by updating the series index rebuild service. At first, the original procedure was kept and only modified by collecting multiple update functions in a list and applying them within the existing abstraction layers, before sending the newly created entries to the Elasticsearch server with a single bulk request. As shown below, this seemed to be a very promising approach: we were able to reduce the rebuild time of the series index by more than a factor of 10, depending on the number of documents per bulk insert. So we decided to use bulk inserts for all rebuild services.

| Number of docs per bulk |   Rebuild time |
| ----------------------- | --------------:|
| Series - 1 doc          |         14.0 s |
| Series - 4 docs         |          5.2 s |
| Series - 8 docs         |          3.6 s |
| Series - 16 docs        |          2.5 s |
| Series - 32 docs        |          1.8 s |
| Series - 64 docs        |          1.1 s |
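To illustrate the idea of bulk requests, here is a minimal sketch using the Python Elasticsearch client. Opencast's actual implementation uses the Java client, and the index name, document fields and chunk size shown here are placeholders.

```python
# Minimal illustration of Elasticsearch bulk indexing. Index and field names
# are placeholders; Opencast itself talks to Elasticsearch from Java.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def index_series_bulk(series_list, chunk_size=16):
    """Send documents in bulk requests instead of one request per document."""
    actions = (
        {
            "_index": "opencast_series",                      # placeholder index
            "_id": series["identifier"],
            "_source": {"title": series["title"],
                        "organization": series["organization"]},
        }
        for series in series_list
    )
    # helpers.bulk groups the actions into chunks and sends one request per chunk
    bulk(es, actions, chunk_size=chunk_size)
```

The chunk size corresponds to the number of documents per bulk request in the measurements above and below.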
The procedure of how the index data is updated has also been changed. Previously, we generated update functions for each index entry and sent them through multiple abstraction layers together with the current data objects, before applying the function to the obsolete or possibly still empty data. Now the repopulate function of each service collects a batch of already updated data objects and passes them straight to the index service to apply a bulk update. Two abstraction layers could be removed, which makes the whole update procedure more straightforward. As the time measurements below show, all services benefit from using bulk inserts, reducing the rebuild time by up to more than a factor of 10. To get a first impression of how the number of documents per bulk request influences the processing time, measurements have been taken with 16 and with 64 documents.

| Index rebuild section (16 docs) |  Rebuild time |         Share |
| ------------------------------- | -------------:| -------------:|
| Series                          |         2.5 s |          3.6% |
| Scheduler                       |         7.9 s |         11.3% |
| Workflow                        |        28.9 s |         43.3% |
| Asset manager                   |        30.4 s |         43.6% |
| <b>Total rebuild time</b>       | <b>69.7 s</b> | <b>100.0%</b> |

| Index rebuild section (64 docs) |  Rebuild time |         Share |
| ------------------------------- | -------------:| -------------:|
| Series                          |         1.2 s |          2.2% |
| Scheduler                       |         6.1 s |         11.4% |
| Workflow                        |        23.1 s |         43.3% |
| Asset manager                   |        23.0 s |         43.1% |
| <b>Total rebuild time</b>       | <b>53.4 s</b> | <b>100.0%</b> |

### Joining the scheduler index rebuild into the asset manager

In a further step, the scheduler index rebuild has been merged into the asset manager rebuild, so the scheduler no longer needs a separate rebuild pass:

| Index rebuild section     |  Rebuild time |         Share |
| ------------------------- | -------------:| -------------:|
| Series                    |         1.2 s |          2.5% |
| Scheduler (moved to AM)   |         0.0 s |          0.0% |
| Workflow                  |        22.7 s |         47.4% |
| Asset manager             |        24.0 s |         50.1% |
| <b>Total rebuild time</b> | <b>47.9 s</b> | <b>100.0%</b> |

TO BE CONTINUED...