# CS184 Website Backup
### Motivation
As of now, the CS184 website runs on a single VM without any redundancy, and the previously proposed GitHub backup mechanism is still a work in progress. If the machine fails, or if we accidentally break the website while updating it, we need to be able to restore the CS184 website without causing too much disruption to students.
### Design Goals
* Automate CS184 website backups
* Automate the deletion of old backups
* Make backups easy to restore
* Make the year-to-year transfer of administrative privileges between TAs simple
### Out-of-scope Goals (can be addressed later)
* Keeping a live backup server
* Keeping a live staging server
* Load-balancing requests between current and backup server
* Setting up a simple web server to host previous years' content
### Storage provider alternatives
#### Github
**Pros**
1. Account not tied to one particular person
2. Free
3. Supports versioning
**Cons**
1. Large files (>50MB) are not supported by default
2. Although no explicit cap seems to exist, GitHub *strongly recommends* keeping repositories [below 5GB](https://docs.github.com/en/github/managing-large-files/what-is-my-disk-quota). Currently, the entire `website` folder on the web server is already 3.7GB.
3. No explicit way to delete old backups (GitHub garbage-collects unreachable objects on its own schedule). Combined with (2), this means our repo could hit 5GB after only two backups.
4. GitHub itself [recommends against](https://docs.github.com/en/github/managing-large-files/what-is-my-disk-quota) using it as a backup tool.
#### Google Drive
**Pros**
1. Free unlimited storage (for now)
2. Supports large files
3. Account not tied to one particular person (cs184@berkeley.edu)
**Cons**
1. Google Drive will discontinue its unlimited storage policy for education starting July 2022, allocating 100TB of pooled storage per organization instead. While UC Berkeley can probably negotiate a higher cap, it isn't clear how expensive that will be. For more info, see https://support.google.com/a/answer/10403871?hl=en.
2. Have to build versioning & lifecycle control from scratch
3. ~~Google Drive does not offer group accounts, nor is it clear that we can request a single user account from the university to back up our data.~~ As Jade pointed out, we actually have a group account!
#### S3
**Pros**
1. Clearly defined ways to transfer account ownership when the instructor changes
2. Automatic versioning & deletion of backups
3. No need to worry about service being cancelled (as long as we are paying)
4. Supports large files
**Cons**
1. Some initial setup required from Prof. Ren
2. Account tied to the instructor
3. Monthly fee
Given the analysis above, I believe Google Drive is the best option because it is free and requires no further setup by the instructor. While we would need to implement versioning and automatic deletion ourselves, it isn't too much work. However, if we choose this option, we must bear in mind the risk that the school limits our usage after July 2022, forcing us to switch to another provider.
I think S3 is the second-best option because it is the only one that carries no risk of us being forced onto another platform. Furthermore, it has versioning and lifecycle control built in, saving us the need to develop these ourselves. However, it would be one more thing for us to manage, so I think it's marginally worse than Google Drive.
## Google Drive
### Initial Setup
1. Under cs184@berkeley.edu, create shared drives named `CS184 Daily Backups` and `CS184 Static Archive`.
2. Create a new API client secret for Google Drive (https://rclone.org/drive/#making-your-own-client-id).
### Transfer of Privileges
At the start of every semester, after TAs have been surveyed and assigned to work on the website, they should also be granted access to cs184@berkeley.edu.
### Backup Implementation
**Daily Backups**
Daily backups are triggered via crontab and automatically uploaded to Google Drive via `rclone`. This includes:
* MySQL database
* The entire website folder (excluding `.git/` and `node_modules/`)
We will need to write a script that performs the following (a sketch appears below):
1. Inside `CS184 Daily Backups`, create a folder named with the current date and time.
2. Upload both the `.sql` dump and a compressed copy of the website folder to this newly created folder.
3. After uploading, scan the root of the shared drive and delete the oldest folders if more than 30 are present.
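Here is a minimal sketch of such a script, assuming an rclone remote named `gdrive` has already been configured with the client ID from the initial setup, and that the website lives at `/var/www/website` (both are assumptions):

```python
#!/usr/bin/env python3
"""Daily backup sketch: dump MySQL, tar the website folder, upload via rclone,
and prune old backups. Remote name and paths are placeholders."""
import subprocess
from datetime import datetime
from pathlib import Path

REMOTE = "gdrive:CS184 Daily Backups"   # assumed rclone remote + shared drive
WEBSITE_DIR = Path("/var/www/website")  # assumed location of the website folder
WORK_DIR = Path("/tmp/cs184-backup")
MAX_BACKUPS = 30

def main():
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    WORK_DIR.mkdir(parents=True, exist_ok=True)

    # 1. Dump the MySQL database (assumes credentials in ~/.my.cnf).
    sql_path = WORK_DIR / f"{stamp}.sql"
    with open(sql_path, "w") as f:
        subprocess.run(["mysqldump", "--all-databases"], stdout=f, check=True)

    # 2. Compress the website folder, excluding .git/ and node_modules/.
    tar_path = WORK_DIR / f"{stamp}.tar.gz"
    subprocess.run([
        "tar", "--exclude=.git", "--exclude=node_modules",
        "-czf", str(tar_path), "-C", str(WEBSITE_DIR.parent), WEBSITE_DIR.name,
    ], check=True)

    # 3. Upload both files into a new date-stamped folder on the shared drive.
    dest = f"{REMOTE}/{stamp}"
    subprocess.run(["rclone", "copy", str(sql_path), dest], check=True)
    subprocess.run(["rclone", "copy", str(tar_path), dest], check=True)

    # 4. Prune: date-stamped names sort chronologically, so delete the oldest
    # folders beyond MAX_BACKUPS.
    out = subprocess.run(["rclone", "lsf", "--dirs-only", REMOTE],
                         capture_output=True, text=True, check=True)
    for old in sorted(out.stdout.split())[:-MAX_BACKUPS]:
        subprocess.run(["rclone", "purge", f"{REMOTE}/{old}"], check=True)

    # Clean up the local staging files.
    sql_path.unlink()
    tar_path.unlink()

if __name__ == "__main__":
    main()
```

A crontab entry along the lines of `0 3 * * * /usr/bin/python3 /opt/cs184/daily_backup.py` (hypothetical path) would then run it nightly.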
**Static Backups**
Static backups are performed via `httrack` at the end of each semester and uploaded to `CS184 Static Archive` via `rclone`. Initially, all previous semesters should be archived; in subsequent semesters, only the most recent semester needs to be archived. A script should be written so that this is simple to run every semester.
This ensures that we can use a simple web server to host all the previous content, without the need to maintain every year's dynamic websites.
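A sketch of that per-semester script, assuming the same `gdrive` rclone remote and a hypothetical site URL:

```python
#!/usr/bin/env python3
"""End-of-semester static archive sketch: mirror the live site with httrack,
then upload the mirror with rclone. URL and remote name are assumptions."""
import subprocess
import sys
from pathlib import Path

REMOTE = "gdrive:CS184 Static Archive"  # assumed rclone remote + shared drive

def archive(semester: str, url: str = "https://cs184.eecs.berkeley.edu/"):
    out_dir = Path("/tmp/cs184-static") / semester
    out_dir.mkdir(parents=True, exist_ok=True)
    # httrack follows links and rewrites them so the copy is browsable offline.
    subprocess.run(["httrack", url, "-O", str(out_dir)], check=True)
    # Upload the mirror into a per-semester folder on the shared drive.
    subprocess.run(["rclone", "copy", str(out_dir), f"{REMOTE}/{semester}"],
                   check=True)

if __name__ == "__main__":
    archive(sys.argv[1])  # e.g. `python3 static_archive.py sp22`
```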
## S3
### Initial Setup
This should only need to be done once by Prof. Ren.
1. Register a new AWS root account. I recommend not using an old account or one associated with retail amazon.com, because we may want to transfer this root account later.
2. Create an IAM group called 'cs184-backup' with the following policies:
* AmazonS3FullAccess
3. Create an IAM user called 'cs184-backup'
* Grant both **Programmatic access** and **AWS Management Console access**
* Add to the 'cs184-backup' group
4. Document the login link, username, and password, and hand them to the head TAs. This account info, along with login details to the CS184 web server, should be passed down from year to year. (A scripted sketch of steps 2-3 follows this list.)
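For reference, steps 2-3 could be scripted with boto3 roughly as follows (scripting this at all, and the password placeholder, are assumptions; doing it by hand in the console works just as well):

```python
#!/usr/bin/env python3
"""One-time IAM setup sketch mirroring steps 2-3 above. Assumes the root
account's credentials are configured locally (e.g. via `aws configure`)."""
import boto3

iam = boto3.client("iam")

# Step 2: create the group and attach the S3 policy.
iam.create_group(GroupName="cs184-backup")
iam.attach_group_policy(
    GroupName="cs184-backup",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)

# Step 3: create the user with both access types and add it to the group.
iam.create_user(UserName="cs184-backup")
keys = iam.create_access_key(UserName="cs184-backup")  # programmatic access
iam.create_login_profile(                              # console access
    UserName="cs184-backup",
    Password="<choose-a-strong-password>",             # placeholder
    PasswordResetRequired=True,
)
iam.add_user_to_group(GroupName="cs184-backup", UserName="cs184-backup")

# Step 4: record these and hand them to the head TAs.
print("Access key ID:", keys["AccessKey"]["AccessKeyId"])
print("Secret key:  ", keys["AccessKey"]["SecretAccessKey"])
```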
### Transfer of Privileges
If the instructor for CS184 changes, the current instructor should transfer the AWS root account to the new instructor. For the latest instructions, see https://aws.amazon.com/premiumsupport/knowledge-center/transfer-aws-account/
### Backup Implementation
**Daily Backups**
Daily backups are triggered via crontab and automatically uploaded to S3. This includes:
* MySQL database
* The entire website folder (excluding `.git/` and `node_modules/`)
We configure the bucket to enable versioning and to delete noncurrent object versions 30 days after they become noncurrent. This policy ensures that at least one version of the daily backup remains available even after we pause the backups for a long time (e.g. at the end of the semester).
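A sketch of that bucket configuration with boto3, assuming a hypothetical bucket name `cs184-daily-backups`:

```python
#!/usr/bin/env python3
"""Bucket configuration sketch: versioning plus a 30-day noncurrent-version
expiration rule. Bucket name is a placeholder."""
import boto3

BUCKET = "cs184-daily-backups"  # hypothetical bucket name
s3 = boto3.client("s3")

# Enable versioning so each daily upload to the same key becomes a new version.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Expire noncurrent versions 30 days after they are superseded; the current
# version is never expired, so at least one backup always survives.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-backups",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }]
    },
)
```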
**Static Backups**
The implementation of static backups on S3 is very similar to that on Google Drive.
### Cost Analysis
Assume the size of the current website is 4GB; then 30 daily backups take 4 × 30 = 120GB.
Assume static backups of all previous semesters take less than 50GB.
Our total data is then at most 170GB, which costs $3.91 per month according to https://calculator.aws/.
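As a sanity check, S3 Standard storage runs about $0.023 per GB-month (first 50TB, us-east-1, at the time of writing), which reproduces the calculator's figure:

```python
daily_backups = 4 * 30   # GB: 30 daily backups of a ~4GB website
static_archive = 50      # GB: assumed upper bound for all static archives
price_per_gb = 0.023     # USD per GB-month, S3 Standard (first 50TB, us-east-1)
total = (daily_backups + static_archive) * price_per_gb
print(f"${total:.2f} per month")  # -> $3.91 per month
```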