<!-- markdownlint-disable-file MD024 -->
<!-- markdownlint-disable-file MD025 -->

# High-Availability Projects

## Major Goals

- Minimize downtime during updates
- Learn about high-availability concepts. What should developers look out for?

# Document the current state

## Summary

To decide on future work, we should document the current state of high-availability problems in Opencast:

- What are single points of failure?
- Which services cannot be run redundantly?
- Why can there be only one admin node?
- Can there be only one Asset Manager?

## Goals

Write documentation in the form of a high-availability report or blog article addressing current problems.
This should help us with future development decisions for this project and potentially the next years.

## Non Goals

- Do not fix any problems
- Do not assess external services like Elasticsearch
- Do not assess underlying infrastructure (VMs, storage, …)

## Potential Risks

- We may not find all issues, leaving us with an incomplete picture
- We must word this carefully to not hurt Opencast's reputation by talking about “all the bad things”

## Success Criteria

Having an article outlining potential issues.

## Benefits/Rewards

- Helps us make better decisions
- Establishes us as experts in this area

## Resources/Budget

- 2 weeks of Alex's time
- Potentially spread over a longer span of time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Build a test data set to test index rebuilds

## Summary

To test improvements and evaluate the effects of the changes we make, we need a test data set.

## Goals

Build a script to ingest and process test data (a sketch follows at the end of this section):

- Add 100 series
- Ingest 500 events using the `fast` workflow, assigning 5 events per series
- Run a random number of 0-50 additional no-operation workflows on each event
- Run an additional failing workflow on 50 events
- Schedule 200 events

Save this environment (database, index, files) so it can easily be spun up again later:

- Tarball with all data
- Potentially a container to spin up on Proxmox
- Potentially a Docker/Podman container

Run test re-indexing to establish a performance baseline (speed and memory usage).

## Non Goals

- No modifying Opencast
- We just need enough data; we do not need to replicate a large production setup

## Potential Risks

- Future index changes may mean we need to recreate the tarballs and containers

## Success Criteria

We have a tarball we can extract to get a simple environment in which to test the effects of our changes.

## Benefits/Rewards

- We can easily test if what we do is the right thing
- Tests will be reproducible

## Resources/Budget

- 1 week of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars
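
What follows is a minimal sketch of what the seeding part of such a script could look like, written against Opencast's ingest REST endpoint using Java's built-in HTTP client. The endpoint path (`/ingest/addMediaPackage/fast`), the form fields, and the credentials are assumptions that need to be checked against the ingest REST documentation of the target Opencast version; the no-operation workflows, the failing workflows, and the scheduled events are left out for brevity.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Sketch: seed a test system with 100 series times 5 events via the ingest REST API. */
public class TestDataSeeder {

  private static final String BASE = "http://localhost:8080";
  private static final String AUTH = "Basic "
      + Base64.getEncoder().encodeToString("admin:opencast".getBytes(StandardCharsets.UTF_8));
  private static final HttpClient CLIENT = HttpClient.newHttpClient();

  public static void main(String[] args) throws Exception {
    for (int series = 1; series <= 100; series++) {
      // 5 events per series results in the 500 events we want overall
      for (int event = 1; event <= 5; event++) {
        ingest("test-series-" + series, "test-event-" + series + "-" + event);
      }
    }
  }

  /** Ingest one event, assuming POST /ingest/addMediaPackage/{workflowId} accepts
   *  URL-encoded Dublin Core fields plus a media URI (check against the real API). */
  private static void ingest(String series, String title) throws Exception {
    String form = "title=" + encode(title)
        + "&isPartOf=" + encode(series)
        + "&flavor=" + encode("presentation/source")
        + "&mediaUri=" + encode(BASE + "/static/test.mp4");
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(BASE + "/ingest/addMediaPackage/fast"))
        .header("Authorization", AUTH)
        .header("Content-Type", "application/x-www-form-urlencoded")
        .POST(HttpRequest.BodyPublishers.ofString(form))
        .build();
    HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(title + " -> HTTP " + response.statusCode());
  }

  private static String encode(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }
}
```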
# Strip down workflow index rebuilds

## Summary

- Part of making the index rebuild perform better.
- Depends on [having test data](#build-a-test-data-set-to-test-index-rebuilds)

One of the largest indexes we rebuild is the workflow index. But despite us processing the most data for this index, very little of that data actually ends up being used in the end. We should strip the rebuild down to the data actually being used.

## Goals

- Remove data we do not need from index writes. Is indexing the workflow state enough? If so, we should only read that from the database and write it to the index.
- All other data should come from the asset manager in the end anyway, right?
- If several workflows ran on a given event, indexing only the latest workflow should suffice
- Test against the [test data set](#build-a-test-data-set-to-test-index-rebuilds)

## Non Goals

- No running things in parallel
- No looking at other indexes
- No bulk ingests

## Potential Risks

- We could need more data after all

## Success Criteria

- We speed up workflow index rebuilds by 50% on our test data set.

## Benefits/Rewards

- Faster updates, less downtime.

## Resources/Budget

- 3 weeks of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Build workflow index last while Opencast is already operational

## Summary

If we were able to [reduce the workflow index rebuild to just the workflow status](#strip-down-workflow-index-rebuilds), we may be able to rebuild this index last, while Opencast is already operational.

## Goals

- Run the workflow index rebuild last
- Only index non-active workflows
- Active workflows may change and will add themselves to the index anyway
- Test against the [test data set](#build-a-test-data-set-to-test-index-rebuilds)

## Non Goals

- Only looking at the workflow index

## Potential Risks

- We may need the original order after all?
- We may need the state for something important?

## Success Criteria

The workflow index rebuild is no longer blocking operations at all.

## Benefits/Rewards

- Faster update times and less downtime

## Resources/Budget

- 2 weeks of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Parallelize workflow index rebuild

## Summary

Right now, all rebuilds run sequentially: we get a number of items from the database, add them to the index, and repeat the process. Maybe we could add several items to Elasticsearch in parallel, speeding up the process (a sketch follows at the end of this section).

## Goals

- Run a configurable number of rebuilding loops in parallel.
- Measure the speed and compare it to our baseline.
- Check and compare the memory usage.

## Non Goals

- Only looking at the workflow index

## Potential Risks

- We may end up using too much memory

## Success Criteria

- The workflow index rebuild is significantly faster
- The memory consumption is manageable

## Benefits/Rewards

- Faster update times and less downtime

## Resources/Budget

- 1 week of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars
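
A minimal sketch of what a configurable number of parallel rebuild loops could look like, using a plain fixed-size thread pool and a partitioned ID range. The `loadBatch` and `indexBatch` methods are placeholders for the existing database read and Elasticsearch write code paths and do not correspond to actual Opencast classes.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch: run several workflow index rebuild loops in parallel. */
public class ParallelRebuild {

  private static final int BATCH_SIZE = 100;

  /** Rebuild the index using the given number of parallel workers. */
  public static void rebuild(int parallelism, long maxWorkflowId) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    long chunk = maxWorkflowId / parallelism + 1;
    for (int worker = 0; worker < parallelism; worker++) {
      final long from = worker * chunk;
      final long to = Math.min(from + chunk, maxWorkflowId + 1);
      pool.submit(() -> rebuildRange(from, to));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }

  /** One rebuild loop: page through a database ID range and write each batch to the index. */
  private static void rebuildRange(long from, long to) {
    for (long offset = from; offset < to; offset += BATCH_SIZE) {
      List<String> batch = loadBatch(offset, Math.min(offset + BATCH_SIZE, to)); // placeholder DB read
      indexBatch(batch);                                                         // placeholder index write
    }
  }

  // Placeholders for the existing database and Elasticsearch code paths.
  private static List<String> loadBatch(long from, long to) {
    return List.of();
  }

  private static void indexBatch(List<String> batch) {
  }
}
```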
# Get Latest Asset Versions From Database

## Summary

We mostly use a set of abstraction layers to get data from the asset manager database. We have seen for other index rebuilds that going through such layers can be very slow when accessing a large amount of data. We should look into requesting a range of the latest snapshots from the asset manager database directly.

## Goals

- Request a range of snapshots from the database directly
- Only get the newest version of each snapshot
- Feed index rebuilds from this data
- Compare the speed to our baseline benchmark

## Non Goals

- Not dealing with other indexes
- Not trying to evaluate which data we actually need
- No parallel updates
- Not trying to do online rebuilds while being operational

## Potential Risks

- We may end up not being faster at all
- We are non-transactional and run the risk of overwriting in-between updates
- The risk already exists but might be a bit lower now

## Success Criteria

- We sped up the process by at least 10%
- We need a similar amount of memory

## Benefits/Rewards

- Faster updates, less downtime

## Resources/Budget

- 3 weeks of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Investigate Asset Manager Rebuild

## Summary

After working on the workflow index rebuild, we should investigate whether any of the techniques we learned about that had a positive performance impact could be applied to the asset manager rebuild as well.

## Goals

- Investigate the current state of the asset manager index rebuild
- Define some new projects based on what we have learned so far and what seem to be sensible projects

## Non Goals

- Not writing any code

## Potential Risks

- Investigation shows us that there is nothing left to do
- We may miss potential improvements

## Success Criteria

- We defined two new projects

## Benefits/Rewards

- We have more work
- We have a proper evaluation of the current state

## Resources/Budget

- 1 week of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Use Elasticsearch Bulk Inserts for Series Re-Indexing

## Summary

Instead of inserting each document into Elasticsearch separately, we [could use bulk insert operations](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html#add-multiple-documents) to ingest several documents at once. This will hopefully increase speed and lower CPU and memory usage (a sketch follows at the end of this section).

## Goals

- Switch to bulk inserts for series, getting N series from the database and inserting them into Elasticsearch all at once.
- Compare the performance against our benchmark.

## Non Goals

- Not dealing with any other indexes
- Not trying to optimize data

## Potential Risks

- Could be hard(ish) to introduce due to the abstraction layer
- Could be that other parts are slower and we do not gain much

## Success Criteria

- We increased the series index rebuild speed by 20%

## Benefits/Rewards

- Faster updates and less downtime

## Resources/Budget

- 3 weeks of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars
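
A minimal sketch of a bulk insert using the Elasticsearch high-level REST client, assuming the series documents are already available as maps and assuming an index name of `opencast_series`; the actual index name and how this fits into Opencast's index abstraction layer still need to be worked out.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

/** Sketch: index a whole batch of series documents with one bulk request
 *  instead of sending one index request per document. */
public class SeriesBulkIndexer {

  /** Each map is one series document; the "identifier" field is used as document ID. */
  public static void indexSeriesBatch(RestHighLevelClient client,
      List<Map<String, Object>> seriesDocuments) throws IOException {
    BulkRequest bulk = new BulkRequest();
    for (Map<String, Object> series : seriesDocuments) {
      bulk.add(new IndexRequest("opencast_series")
          .id((String) series.get("identifier"))
          .source(series));
    }
    BulkResponse response = client.bulk(bulk, RequestOptions.DEFAULT);
    if (response.hasFailures()) {
      throw new IOException("Bulk indexing failed: " + response.buildFailureMessage());
    }
  }
}
```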
# Investigate if we need to index scheduled events at all

## Summary

The scheduler stores most of its information in the asset manager. This potentially means that all scheduled events are indexed twice, and we may not actually need the first round of (scheduler) re-indexing at all.

## Goals

- Investigate what happens if we skip the scheduler.
  - Is any information missing?
  - Can we get the missing information during the asset manager rebuild?
- Define a new project if we can drop the index.

## Potential Risks

- We may still need this for something

## Success Criteria

- We could drop the scheduler rebuild

## Benefits/Rewards

- Less work, faster updates, less potential for errors

## Resources/Budget

- 2 weeks of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Move External API

## Summary

Moving the External API to a different node will help us minimize the requirements for running Opencast. Ideally, we can later put it on several nodes and load-balance them. It will also help if we want to put Opencast into a read-only mode for updates.

As a first step, we want to move the External API to the presentation node. This would theoretically allow us to run Opencast with nothing but a presentation node present. We should keep in mind that we do not want to strongly entangle the API with the presentation node components, so that we do not end up being bound to just another node.

## Goals

- Move the External API to the presentation node

## Non Goals

- Not improving the External API

## Potential Risks

- The entanglement with the asset manager may cause problems
- The index service abstraction layer may cause problems

## Success Criteria

- We have a fully functional External API on the presentation node

## Benefits/Rewards

- We are way more flexible when it comes to turning off nodes
- We have the potential of running multiple API backends.

## Resources/Budget

- 8 weeks of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Use keepalived to Configure Failover

## Summary

We will never be able to prevent machines or services from failing. But we can run many of them redundantly so that if one fails, another will take over. For that, we can take a look [at using keepalived to configure an IP-based failover](https://www.linode.com/docs/guides/ip-failover-keepalived/). A configuration sketch follows at the end of this section.

## Goals

- Demo HA set-up using keepalived
- Make Prometheus highly available
- Webinar to show others how this works

## Non Goals

- No load-balancing
- No production systems

## Potential Risks

- None

## Success Criteria

- We can kill one Prometheus and still get data

## Benefits/Rewards

- Learn how to build failover systems
- Increase service availability

## Resources/Budget

- 1 week of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars
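
A minimal keepalived configuration sketch for the Prometheus demo: two hosts share a virtual IP, and the backup host takes the address over once the master stops sending VRRP advertisements. The interface name, router ID, password, and virtual IP are placeholders to be adapted to the test environment.

```
# /etc/keepalived/keepalived.conf on the primary host
# (the secondary host uses "state BACKUP" and a lower priority, e.g. 90)
vrrp_instance prometheus {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        192.0.2.10/24
    }
}
```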
# Investigate Failover Concepts

## Summary

Are there systems or ways to configure failover apart from keepalived? Can services do that on their own? How do they do that? Can (or should) we adapt some of these concepts to the software we write?

## Goals

- Research failover models.
- How does other software do that?
- Write an advisory and offer a webinar.

## Non Goals

- Not actually setting up any production nodes in HA mode

## Potential Risks

- Not finding anything new

## Success Criteria

- Found an alternative to keepalived

## Benefits/Rewards

- Get to know concepts
- Share knowledge across Devs and DevOps

## Resources/Budget

- 1 week of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Christian
- Lars

# Let Opencast Nodes dispatch their own jobs

## Summary

Right now, we have one central big loop running on the admin node which dispatches all jobs in the cluster. It would be great if each node could simply select the jobs it wants to work on instead.

## Goals

- Write a node-based job dispatcher
- Ensure job claims are transactional, preventing two nodes from claiming the same job (a sketch follows at the end of this section)

## Non Goals

- No limits, loads or special rules

## Potential Risks

- Hard to estimate the required time

## Success Criteria

- A three-node setup can run through the `fast` workflow
- The admin node only schedules jobs on the admin node

## Benefits/Rewards

- Drastically improved scalability
- Per-node redundancy

## Resources/Budget

- 6 weeks of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars
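
A minimal sketch of how a node could claim a job transactionally: a conditional `UPDATE` only succeeds for one node because the database serializes the competing writes, and the affected row count tells us whether we won the claim. The table and column names (`oc_job`, `status`, `processing_host`) are invented for illustration and do not reflect Opencast's actual schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Sketch: let a node claim a queued job atomically via a conditional UPDATE. */
public class JobClaimer {

  /**
   * Try to claim the given job for this host; returns true if we got it.
   * The WHERE clause ensures only one node can flip the job from QUEUED to RUNNING,
   * even if several nodes try at the same time.
   */
  public static boolean tryClaim(Connection db, long jobId, String host) throws SQLException {
    String sql = "UPDATE oc_job SET status = 'RUNNING', processing_host = ? "
        + "WHERE id = ? AND status = 'QUEUED'";
    try (PreparedStatement update = db.prepareStatement(sql)) {
      update.setString(1, host);
      update.setLong(2, jobId);
      return update.executeUpdate() == 1; // exactly one updated row means we won the claim
    }
  }
}
```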
# Research how other projects handle high availability

## Summary

Other projects have tried solving high-availability problems before, and we should try to learn what we can from them. As one example, we can set up Proxmox in a high-availability cluster, see how it handles failing nodes, and research how it does that.

## Goals

- Set up a Proxmox high-availability cluster
- Research the technology and concepts behind this
- Tell others what you learned

## Non Goals

- Implement any of this in Opencast

## Potential Risks

- Low risk of breaking our test servers
- Results may not be applicable to our projects

## Success Criteria

- We can take down a Proxmox server without losing the services

## Benefits/Rewards

- Accumulate knowledge
- Improve development decisions
- Improve DevOps knowledge

## Resources/Budget

- 1 week of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Work through interactive Kubernetes tutorial

## Summary

Kubernetes comes with a lot of high-availability concepts, and learning about it will help us understand these problems even if we do not use Kubernetes. That is why we should take a look at how it works.

## Goals

- Run through the [interactive Kubernetes tutorial](https://kubernetes.io/docs/tutorials/kubernetes-basics/create-cluster/cluster-interactive/) to learn about Kubernetes basics.

## Non Goals

- Run an actual Kubernetes cluster
- Deploy Opencast in Kubernetes

## Potential Risks

- Unsure when we will first encounter Kubernetes at one of our customers' data centers

## Success Criteria

- Knowing Kubernetes basics

## Benefits/Rewards

- Know what modern systems can potentially do

## Resources/Budget

- 1 week of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Identify stateless Opencast components

## Summary

Replication, failover and other high-availability concepts work exceptionally well if components are stateless, since no data synchronization is required for them. We could try identifying stateless components of Opencast and evaluate whether we could and should extract them so they can run stand-alone with replication.

## Goals

- Identify stateless components of Opencast
- Evaluate if we would benefit from running them stand-alone
- What would we need to run them stand-alone?
- Define extraction projects if it seems sensible

## Non Goals

- Actually extract components

## Potential Risks

- None

## Success Criteria

- Identified at least four stateless components

## Benefits/Rewards

- Low-hanging fruit for replication and high availability

## Resources/Budget

- 1 week of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Document the best ways of upgrading Opencast

## Summary

You can use several tips and tricks to make Opencast updates run smoothly. Not all of these are documented properly. We should talk to people to gather a few ideas and maybe try one or two upgrade tricks to evaluate if they work. Finally, we should write down what we have found and potentially add it to Opencast's documentation.

## Goals

- Talk to adopters about updates
- Evaluate best practices
- Write a best-practice guide

## Non Goals

- Update any production system
- Look at upgrades from pre-10 versions

## Potential Risks

- We may not get anything better than Opencast's current upgrade guide.

## Success Criteria

- Two tips not to be found in the current upgrade guides

## Benefits/Rewards

- Helps us with future upgrades

## Resources/Budget

- 2 weeks of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Migrate Elasticsearch Indexes

## Summary

Elasticsearch allows migrations instead of rebuilds for certain operations. Being able to use migrations would prevent lengthy index rebuilds. We should investigate what we could use migrations for and how we could make them happen.

## Goals

- Investigate the conditions for migrations
- Create a demo migration for Opencast (e.g. add a random field)
- Document the findings and provide a webinar for developers to demonstrate them

## Non Goals

- Modify any old migrations

## Potential Risks

- May not help with our migrations
- Abstraction layers could be unhelpful

## Success Criteria

- A successful demo migration

## Benefits/Rewards

- Prevents lengthy index rebuilds in the first place
- Faster updates, less downtime

## Resources/Budget

- 2 weeks of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars

# Evaluate Liquibase for online database migrations

## Summary

We could try using something like Liquibase to safely evolve our database schema while Opencast is running. This could prevent manual steps and speed up database migrations. A sketch of what a changelog entry could look like follows at the end of this section.

## Goals

- Take a look at Liquibase
- Could that work for us?
- Are there Liquibase competitors?
- Evaluate what a migration using Liquibase could look like
- Define a follow-up project if it seems sensible

## Non Goals

- Implement a database migration

## Potential Risks

- Liquibase may potentially not be suitable for us

## Success Criteria

- We know if we can/should use Liquibase or similar projects

## Benefits/Rewards

- Faster migrations
- Less manual work

## Resources/Budget

- 2 weeks of Alex's time

## Initial Funding

- Part of AVVP high-availability

## Proposed By

- Lars
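
To make the evaluation more concrete, this is roughly what a single Liquibase changelog entry for an online-safe schema change could look like; the table and column names are invented, and the exact changelog format should be double-checked against the Liquibase documentation.

```xml
<databaseChangeLog
    xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
        http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-latest.xsd">

  <!-- Adding a nullable column is one of the changes that is usually safe
       to apply while the application keeps running. -->
  <changeSet id="add-demo-column" author="alex">
    <addColumn tableName="oc_demo_table">
      <column name="demo_field" type="VARCHAR(255)"/>
    </addColumn>
    <rollback>
      <dropColumn tableName="oc_demo_table" columnName="demo_field"/>
    </rollback>
  </changeSet>

</databaseChangeLog>
```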