# OC Architecture Meeting

* Main point: reducing complexity
* Well-defined data for items
  * "Duration of a video might be a string array" -> Stop it. Get help.
  * Opencast guarantees almost nothing about the (meta)data of videos/series/...
  * No ground truth!
    * There is the DC catalog. But several processes add and change data. Details depend on which API you use.
    * There are several places where this data is stored: DB, XML file on disk, search index.
    * And several methods in Java to retrieve different data, which load from different sources.
    * And all these sources can disagree!
  * User-provided data (e.g. inside catalogs) is often not properly checked and can completely break Opencast by setting an important value to nonsense.
  * This leads to:
    * Broken events, which cannot be processed/deleted/harvested by Tobira/...
    * Lots of "defensive code" that tries to read from different sources, in different formats, with some fallback.
    * Lots of added complexity and bugs in case of unexpected data. Not just in OC, but also in external apps talking to OC.
  * Solution:
    * OC needs to have one declared ground truth of data. Differences in other sources are bugs.
    * OC needs to define a core set of attributes per item with well-defined types/formats.
      * E.g. duration: define it to be a number representing milliseconds (for example). And make sure that this value is always extracted from the video and cannot be overridden by users/admins.
      * Code/devs/apps need to be able to trust OC on something.
    * Store all core attributes in the DB and add checks for their validity in a few central places.
  * Additional suggestions:
    * Store all additional metadata in the DB, never reading XML files from disk for metadata. That's super slow and error-prone.
    * "But we need DC catalogs as interface": sure, but generate these XML files on the fly from DB data! You can still format the data in any way you like in APIs. Doesn't mean we need to store it as such.
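A minimal sketch of the "core attributes with well-defined types, checked in one central place" idea, assuming a recent Java version with records. The names `EventCoreAttributes`, `fromInspectedMedia` and `CoreAttributesDemo` are hypothetical and not part of Opencast; the point is only that duration is a plain number of milliseconds and that nonsense values are rejected before they can reach the DB.

```java
import java.time.Duration;
import java.util.Objects;

/**
 * Hypothetical core attributes of an event (illustrative names, not Opencast's real model).
 * The compact constructor is the single central place where validity is checked.
 */
record EventCoreAttributes(String id, String title, long durationMs) {
    EventCoreAttributes {
        Objects.requireNonNull(id, "id");
        Objects.requireNonNull(title, "title");
        if (id.isBlank()) {
            throw new IllegalArgumentException("id must not be blank");
        }
        if (durationMs < 0) {
            throw new IllegalArgumentException("durationMs must be >= 0, got " + durationMs);
        }
    }

    /** Builds the attributes from a duration measured on the media file, never from user input. */
    static EventCoreAttributes fromInspectedMedia(String id, String title, Duration measured) {
        return new EventCoreAttributes(id, title, measured.toMillis());
    }
}

class CoreAttributesDemo {
    public static void main(String[] args) {
        EventCoreAttributes ok =
            EventCoreAttributes.fromInspectedMedia("event-1", "Lecture 1", Duration.ofMinutes(90));
        System.out.println(ok.durationMs()); // prints 5400000

        try {
            // A nonsense value is rejected at construction time.
            new EventCoreAttributes("event-2", "Broken", -5);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a type like this, code reading an event never has to guess whether the duration is a string, an array or missing; anything that got past the constructor is valid by definition.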
* Multi-node problems
  * Most things are only tested single-node, causing lots of multi-node bugs to be released. Generally, our current way of doing multi-node seems fragile.
  * Let's step back for a sec: why do we want multi-node?
    * Distributing work to multiple servers. This is most important for "workers", i.e. costly video operations.
    * HA: keeping Opencast up when updating and when one node crashes for some reason. (AFAIK this does not really work for Opencast yet)
  * Shrink the gap between "typical dev environment" and "big prod environment": less complexity (fewer different setups) and catching bugs earlier
  * Remote impls are bad:
    * tedious to write
    * add lots of "plumbing" code to Opencast which duplicates logic and can easily become outdated
    * add APIs that are visible in the API browser (opening them up for external use), but these APIs contain implementation details
  * If we stay with our current Java stack and multi-node system: use the automatic remote impls and replace all manually written ones.
  * If we completely redo the system: reconsider everything!
    * If all nodes have access to DB and search index: why even communicate?
    * Communicate via some different means?
* Multi-tenancy adds lots of complexity (causing bugs, making code more difficult to understand, ...)
  * Why do we want it anyway?
    * Having many universities in one OC instance so that they can share workers
  * Can we make worker sharing happen differently?
    * Have very "thin" workers that can receive jobs from multiple Opencast instances?
      * Would be a good candidate to write from scratch, possibly in a different language, if deemed useful.
    * But: People don't want to maintain multiple systems when they don't use Ansible
* Rethink workflows
  * Very flexible, but:
    * Run slowly and sequentially
    * Having everything defined in Turing-complete "scripts" makes it way harder for OC to be clever, to understand progress, etc.
  * Examples of being clever:
    * first create a fast low resolution h264 encode to be able to already publish the video
    * create higher resolution encodes later
    * create h265 encodes overnight when the server is idling
    * Deciding codec depending on estimated video views
    * Creating thumbnails, subtitles, preview images in parallel very quickly
    * -> All these things are technically possible with workflows but incredibly tricky to code in there
  * Yes: allowing extensibility, letting admins hook into arbitrary points and run arbitrary code
  * No: having the core functionality of Opencast defined in editable workflow files
* Well-typed API responses
  * Currently, for most APIs, it is very unclear what is returned. What does the JSON object look like? (Example: search sometimes returns lists and sometimes the object directly)
  * People usually look at one example response and continue from there, but this easily leads to bugs:
    * Some fields might be optional
    * Some fields might have different types (string, number) for different requests
    * ...
  * We should properly type our API responses. And that should ideally be automatic somehow, as manual documentation easily gets outdated.
  * This can be done via GraphQL or tRPC, but we could also stay with REST, as long as there is a reasonable way to automatically type responses.
* No more stupid doc comments!!
  * `/** Add a user reference */ void addUserReference()` -> the doc comment adds _zero_ information
  * Either actually add information or don't bother writing a doc comment at all
* Modernize static file serving
  * Make it possible to serve via an external tool by adjusting URL paths (without breaking Static File Protection)
  * Auth via JWT
* Tech stack
  * We use lots of old dependencies (e.g. Spring Security, OSGi). Updating often seems to take a lot of work.
  * Using some very big, heavy, powerful (and in my opinion hopelessly overengineered) frameworks like Spring Security
    * The big size and complexity make updating harder
    * Almost no one fully understands these frameworks (to be clear: this very explicitly includes me)
      * Leading to non-idiomatic usage, which makes things harder in the long run
  * Goal: get rid of these huge frameworks and simplify
  * But also some people complained about Rust for Tobira
  * Given all of that: full rewrite, partial rewrite or incremental improvement?
    * I know, rewrite is a meme, but if ever, now is the time to actually consider it. And weigh pros and cons against the other options.
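To make the "well-typed API responses" point from earlier in these notes concrete, here is a minimal sketch, assuming we stay with Java and REST and use records as the response types. `SearchHit`, `SearchResponse` and `TypedResponseDemo` are hypothetical names, not the real Opencast search API. The idea is that the shape lives in code, so documentation or a schema could be derived from the types instead of being written by hand.

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;

/** Hypothetical search hit; optional fields are explicit and numbers are always numbers. */
record SearchHit(String id, String title, Optional<String> seriesId, long durationMs) {
    SearchHit {
        Objects.requireNonNull(id, "id");
        Objects.requireNonNull(title, "title");
        Objects.requireNonNull(seriesId, "seriesId");
    }
}

/** Hypothetical search response; `results` is always a list, for zero, one or many hits. */
record SearchResponse(long total, List<SearchHit> results) {
    SearchResponse {
        // Defensive, immutable copy; also rejects a null list or null elements.
        results = List.copyOf(results);
    }

    static SearchResponse of(SearchHit... hits) {
        return new SearchResponse(hits.length, List.of(hits));
    }
}

class TypedResponseDemo {
    public static void main(String[] args) {
        SearchResponse one = SearchResponse.of(
            new SearchHit("e1", "Lecture 1", Optional.empty(), 5_400_000L));
        SearchResponse none = SearchResponse.of();

        // Same shape in both cases: clients never have to special-case a single result.
        System.out.println(one.results().size());  // prints 1
        System.out.println(none.results().size()); // prints 0
    }
}
```

Whether the types are then exposed through GraphQL, tRPC or an automatically generated REST schema is a separate decision; the sketch only shows the response shape being fixed in one place.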
