D015 - Tracking the video status/processing - the Project Proposal

# D015 - Tracking the video status/processing - the Project Proposal

*draft 0.2*

Origin: https://github.com/mprinc/terra-zontik/issues/37

We start the document with **Aspects/Strategic Ideas/Approaches** and later suggest concrete **Tasks** according to those aspects.

# Aspects

## Development Paradigms

We need to support healthy development paradigms:

1. ***`Unit-testing`*** - (i) to be sure that regular and some important boundary cases are covered, (ii) to avoid regressions in future development, and (iii) to provide confidence in future development and have documented examples of code usage
2. ***`E2E-testing`*** - a technique for testing the entire software product from beginning to end from the users’ perspective, so we know that all components fit together properly
3. **`Separation of concerns`** - as we achieved with the `backend-mockup`, where the frontend can be developed without the backend. Similarly, we want to support backend development without *"going through"* the frontend, i.e. to provide a `frontend-mockup`.
4. ***`Modular developments`*** - each functional section of code is articulated as a ***`task`*** that can be better understood, tested, and manipulated inside the Terra business logic, and that maps nicely and beneficially into a visual workflow.
5. ***`Scalability & Load balancing`*** - the possibility that a request can be handled by more than one server or service instance. This provides resilience, scalability, and a better user experience.

## Monitoring

**NOTE**: The following acronyms are not well-established ones

### M-C-A: Monitoring customer activities

M-C-A would help us track each unique customer activity so that we can (i) identify specific issues with services, (ii) improve user satisfaction, and (iii) learn about customer behavior.

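As a rough illustration of what M-C-A could record, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ActivityEvent` schema, the `ActivityLog` class, the action names, and the in-memory storage are hypothetical and not part of the existing Terra code. It only shows how one stream of per-customer events could answer the three questions above: which services fail, which customers are affected, and how customers use the system.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ActivityEvent:
    """One customer action, e.g. a video upload (hypothetical schema)."""
    customer_id: str
    action: str          # e.g. "upload", "process" (illustrative names)
    succeeded: bool
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ActivityLog:
    """In-memory stand-in for wherever activity events would really be stored."""

    def __init__(self) -> None:
        self._events: list[ActivityEvent] = []

    def record(self, customer_id: str, action: str, succeeded: bool) -> None:
        self._events.append(ActivityEvent(customer_id, action, succeeded))

    def failures_by_action(self) -> Counter:
        """Which actions/services fail most often."""
        return Counter(e.action for e in self._events if not e.succeeded)

    def failures_by_customer(self) -> Counter:
        """Which customers are hit by failures and may need follow-up."""
        return Counter(e.customer_id for e in self._events if not e.succeeded)

    def actions_by_customer(self, customer_id: str) -> Counter:
        """What a given customer actually does with the service."""
        return Counter(e.action for e in self._events if e.customer_id == customer_id)


# Usage: record a few events, then query the log.
log = ActivityLog()
log.record("alice", "upload", True)
log.record("alice", "process", False)
log.record("bob", "upload", True)
log.record("bob", "process", False)
log.record("bob", "process", True)
```

In a real deployment the events would presumably be persisted (DB or log pipeline) rather than kept in memory; the sketch keeps them in a list only to stay self-contained.
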
### M-S-R: Monitor Server Resources

We need to track all the critical resources of our servers so that we do not end up in either (i) a **non-responsive** scenario or (ii) a scenario where responses are so slow that **usability** becomes unacceptable for users.

### M-S-A: Monitor Services Availability

We need to track all the Terra services to understand their availability, and either

1. send notifications on their unavailability or
2. recover/restart the service if possible

### M-S-F: Monitor Services Functionality

This is *"deeper"* monitoring and investigation than the former M-S-A aspect. Here we challenge the **service correctness** for all the tasks it commits to provide.

### E2E-M: End-to-End Monitoring

Uploading a small video every few minutes and checking its status in the DB, or even checking whether the email is received. This solution **deeply monitors every aspect** of user functionality.

### M-S-L: Monitoring-Server Load

Monitoring whether too many tasks/videos are in the processing queue, or whether their processing gets too slow, allows us to:

**(i)** **increase resources** dynamically, based on needs and/or

**(ii)** **inform** users upfront to try later instead

This leads us to

## Measure requests/resources ratio

Measure processing speed and understand the ratio between the number of users (requests) and the necessary resources. (It should also take into account how heavy a load each user brings: **how big the videos are** and how many of them there are.)

# Notifying

This is a category of work (tasks) where we provide more information to the end user or administrator about the system (activity) status.

## Notifying Users

### Upfront, prior to users’ actions

+ Set of messages on the production server that are displayed during the **maintenance period**. Possible notifications:
    + our website/service is being updated with new features, be aware of possible malfunctions/glitches during this process (in the period of XX till YY)
    + ... click here to be informed when it is UP
+ if the monitor detects that the number/frequency of issues (videos not being successfully processed) is above some threshold, we might even **AUTOMATICALLY** add a warning on the website (in addition to warning admins) about the temporary possibility of glitches.
+ if we experience/schedule longer or heavier maintenance/upgrading blackouts, or ones that **disable the whole website** (giving no option for putting up these messages on the website), we could even go with **informing users by email upfront**.

### Upon user actions' failures

+ **RETRYING**: informing the user that s/he should retry
+ **POSTPONING**: informing the user that the service is down, and that s/he will be informed when it is up again to resend
+ **AUTOMATIC**: informing users that their video is saved and will be automatically retried when the service is up, and that they will get a success or permanent-failure email.
+ we are already detecting some errors (like UNDEFINED, TIMEOUT). In these cases, users and admins might be informed of the failure. We could choose to retry later instead of sending a failure email after 30 mins (as it is now)

## Notifying Admins

+ Warning of individual/temporary processing failure
+ EMERGENCY ERROR: permanently down, reset required
+ Warning: service/server temporarily down, with later UP notification

# Recovery

We can try to recover (retry) all the video processing requests that are recoverable

+ assuming the uploaded video has been successfully saved at the beginning of the processing workflow, if the processing crashes, we can still automatically redo the same processing workflow

# Resilience

+ In the future, we should have mirroring servers, where the 2nd one is up while the main one is under reconstruction/upgrading, crashed, or under a hacker attack

# Tasks

## Existing Tasks

TASK: **F016 - Storage management, delete old videos**

+ This task addresses a "fragile" balance between caching video material and its subproducts long enough to be available for all the processing tasks, while still performing all the garbage collection necessary to avoid disk space issues

TASK: **D002 - Monitor servers availability**

Support monitoring for:

+ SUBTASK: **resources**
    + currently we have installed server services that observe server resources and implemented a dashboard to visualize some of them
    + we need to operationalize it to send alarms when the resources get out of the preset boundaries
+ SUBTASK: **single components**
    + video-processing async tasks
    + high-level async tasks
    + 3rd party services (mail, DB, key-store, broker, ...)
+ SUBTASK: **high-level (e2e) - testing the system by performing "user-like" requests**
    + calling some short FFMPEG task to check if it finishes successfully
    + checking for email confirmations

TASK: **D003 - Install Video Processing Servers - non-elastic but manually scalable**

This task will help make the infrastructure scalable, responsive, and resilient, as we would have multiple servers separate from the main backend server.

1. SUBTASK: **install and provide the procedure for installing additional video processing servers**
2. SUBTASK: **install workflow support** - this will provide a *networking* infrastructure for handshaking tasks and results between the backend and video processing servers

TASK: **D007 - beta.welcometerra.com**

+ Implement the beta server, as currently we have just the staging and production servers
+ Having all 3 of them set up properly, we can have safer Terra scenarios:
    1. `production` - final server, not used for testing except monitoring
    2. `beta` - the next code release, ready for heavy testing inside the Terra organization together with beta users
    3. `staging` - (i) demonstrates new features, (ii) testing new features in a "real world scenario"

TASK: **D008 - Backup of code, data and database**

+ We need to back up `code` versions to be able to quickly roll back if we notice an issue on the production server
+ We need to back up `data` (like original videos) so that users can trust Terra as video management storage
+ We need to back up the `database` to not risk loss of users' data (accounts, video info, ...)

TASK: **D009 - Implementing Testing Infrastructure with some basic/crucial tests**

+ This is an old task that should be reorganized to address:
    1. SUBTASK: **providing unit-test framework and guidance**
    1. SUBTASK: **providing e2e framework and guidance**
    1. SUBTASK: **cover some critical terra infrastructure with the initial unit-tests and e2e tests**
    1. SUBTASK: **cover some critical terra infrastructure with the initial e2e tests**

TASK: **D012 - Fix deployment scripts**

TODO: Rename into "**D012 - Automatized DevOps scripts**"

They will automatize the handling of:

1. SUBTASK: **resources (instances, volumes, backups)**
1. SUBTASK: **installation (servers)**
1. SUBTASK: **configuration (services, scaling)**
1. SUBTASK: **building (frontend, services)**
1. SUBTASK: **deployment (frontend, backend, services)**

TASK: **D014 - Mock-up servers**

1. SUBTASK: **Mock-up frontend** - this will significantly speed up the backend development and testing cycle and demonstrate the API usage
1. SUBTASK: **Mock-up services** - this will speed up backend development

TASK: **D015 - Tracking the video status/processing**

TODO: extend with descriptions

1. SUBTASK: **Identify Failure Cases that are recoverable - document**
1. SUBTASK: **Create event checkpoints to track video failures - Backend implementation**
    + this utilizes ColaboFlow passively to audit the workflow progress
1. SUBTASK: **Set up an alert to ourselves when the server is down or a processing error occurs**
1. SUBTASK: **Migrate away from Celery to ColaboFlow**
    + this utilizes ColaboFlow actively to control the workflow execution
1. SUBTASK: **If the system is down, alert S/S to restart ASAP (within 30 mins)**
1. SUBTASK: **If the system is fine, but certain videos are not processed, track the status and email the user to try again**
10. SUBTASK: **Identify and document the project proposal**
    + the outcome of this subtask is the current document

## New Tasks

TASK: **E2E-M**

automatic video uploading and testing success

TASK: **PING-ing FFMPEG**

calling some quick FFMPEG task to check if it finishes successfully (live test of async messaging and FFMPEG failures)

TASK: **Proper error handling**

+ We should provide a systematic approach for reporting, handling and presenting errors in the system, as the current platform very often doesn't even **react** to an error, leaving the user "***deaf***". We have handled some of the most critical cases like *"network errors"*, *"access errors"*, etc.

TASK: **Notifying errors**

1. SUBTASK: **Notifying Users Upfront**
1. SUBTASK: **Notifying Users Upon user actions’ failures**
1. SUBTASK: **Notifying Admins**

TASK: **Modularizing code**

We need to transform sections of the current code into modular tasks to both audit and control it more properly through the Terra workflow

TASK: **Recovery**

We recover (retry) all the video processing requests that are recoverable

TASK: **Resilience**

+ Provide mirroring servers that are immediately available or possible to install and boot on a server failure
+ provide autodetection and autoinitialization of such a scenario

TASK: **Scalability and load balancing**

We need to provide scalability to our platform to

1. properly handle changes in user demand
2. reduce service costs
3. support system resilience
4. support system recovery

### Finished Tasks (need an extension)

TASK: **D001 - multitasking/multiusers monitoring infrastructure**

+ This task needs to be rewritten to support the new workflow framework
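
The event checkpoints (D015) and the Recovery idea described in this document can be illustrated with a minimal Python sketch. It is only a sketch under stated assumptions: the `Status` values, `VideoJob`, `RecoverableError`, and `run_with_recovery` are hypothetical names invented for this example, and nothing here uses the Celery or ColaboFlow APIs. It assumes the uploaded video is already saved, so the whole processing workflow can be re-run from that point when a recoverable failure (such as a timeout) occurs.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Status(Enum):
    SAVED = "saved"            # upload stored; processing can be (re)started from here
    PROCESSING = "processing"
    DONE = "done"
    FAILED = "failed"          # permanent failure; the user would get a failure email


class RecoverableError(Exception):
    """Failure worth retrying, e.g. a timeout (hypothetical classification)."""


@dataclass
class VideoJob:
    video_id: str
    status: Status = Status.SAVED
    attempts: int = 0
    checkpoints: list = field(default_factory=list)  # audit trail of (status, note)

    def checkpoint(self, status: Status, note: str = "") -> None:
        """Record a status change so the workflow progress can be audited later."""
        self.status = status
        self.checkpoints.append((status, note))


def run_with_recovery(job: VideoJob, process: Callable[[str], None],
                      max_attempts: int = 3) -> Status:
    """Re-run the workflow from the saved upload until it succeeds or attempts run out."""
    while job.attempts < max_attempts:
        job.attempts += 1
        job.checkpoint(Status.PROCESSING, f"attempt {job.attempts}")
        try:
            process(job.video_id)
        except RecoverableError as err:
            job.checkpoint(Status.SAVED, f"recoverable: {err}")  # back to the retry point
            continue
        except Exception as err:
            job.checkpoint(Status.FAILED, f"unrecoverable: {err}")
            return job.status
        job.checkpoint(Status.DONE)
        return job.status
    job.checkpoint(Status.FAILED, "retries exhausted")
    return job.status


# Demo: a stand-in processor that times out twice and then succeeds.
calls = {"n": 0}


def flaky_process(video_id: str) -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RecoverableError("TIMEOUT")


job = VideoJob("video-1")
final_status = run_with_recovery(job, flaky_process)
```

The `checkpoints` list plays the "passive audit" role, while `run_with_recovery` plays the "active control" role; in the real system the retries would more likely be scheduled with a delay and the notifications (user email, admin alert) would hang off the `FAILED` checkpoint.
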
