Phase 1
=======
### Resource Tagging And Budgets
1. Tag all of your instances with `Key: Project` and `Value: Phase1`.
2. In addition to the tag above, all instances in your HBase cluster should be tagged with `Key: teambackend` and `Value: hbase`, and all instances with MySQL installed should be tagged with `Key: teambackend` and `Value: mysql`.
3. No additional tags are required for instances used to do ETL jobs.
4. You can use **ANY** instances for ETL and debugging.
5. You can ONLY use instances in the **M family** which are **smaller than or equal to large** for your web service in submission (both your web-tier and storage-tier).
6. You can use one of these AMIs as your base: Ubuntu Server 18.04 LTS (**ami-0817d428a6fb68645** (64-bit x86) / **ami-0f2b111fdc1647918** (64-bit Arm)) or Amazon Linux AMI 2018.03.0 (HVM) (**ami-0c94855ba95c71c99** (64-bit x86) / **ami-0d29b48622869dfd9** (64-bit Arm)). You can also build your own AMIs based on these AMIs for this project, or write a script that automatically installs the packages needed to run your server on one of these clean AMIs. However, if you choose to use your own AMI, you may see high I/O latency the first time data is accessed. This is expected, and you are responsible for working out how to address it.
7. Your web service (each test submitted to the website) must have a maximum cost of **$0.85 per hour**. This includes on-demand EC2, EBS and ELB costs; ignore S3, EMR, network and disk I/O costs. **Even if you use spot pricing, the constraints that apply to your web service pertain to on-demand pricing.**
8. You will have a budget of **$55/team** for all the tasks in this phase (Phase 1).
9. For each query, we will take the highest score among your 10-minute submissions as your score for that query.
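
The tagging and instance-type rules above can be checked before launching anything. The following is a minimal sketch, not an official tool: the function names are hypothetical, the tag list uses the `Key`/`Value` dictionary shape that the EC2 API accepts, and the regular expression assumes that variants such as `m5a` or `m6g` count as part of the M family.

```python
import re

# Sizes that are "smaller than or equal to large" in EC2 naming.
ALLOWED_SIZES = {"nano", "micro", "small", "medium", "large"}


def build_tags(backend=None):
    """Return the tags for one instance.

    backend is "hbase", "mysql", or None for instances that only need
    the project tag (for example ETL machines).
    """
    tags = [{"Key": "Project", "Value": "Phase1"}]
    if backend in ("hbase", "mysql"):
        tags.append({"Key": "teambackend", "Value": backend})
    elif backend is not None:
        raise ValueError(f"unknown backend: {backend!r}")
    return tags


def allowed_for_submission(instance_type):
    """True if the type looks like an M-family instance no larger than large.

    Assumption: any type of the form m<generation><suffix>.<size> is
    treated as M family.
    """
    match = re.fullmatch(r"m\d+[a-z-]*\.(\w+)", instance_type)
    return match is not None and match.group(1) in ALLOWED_SIZES
```

For example, `build_tags("hbase")` yields both the `Project` and `teambackend` tags, while `allowed_for_submission("m5.xlarge")` is false because `xlarge` is larger than `large`.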
### Team Project Grading
The team project is worth 20% of the course grade.
- Phase 1: 20%
- Phase 2: 30%
- Phase 3: 50%
As shown above, Phase 1 is worth 20% of the Team Project. You need to finish all the tasks in order to move on to the next phase, where a new query will be added. Your success in the later phases depends heavily on how much you have explored, experimented, learned and achieved in Phase 1. So, even though it only accounts for 20%, Phase 1 has a huge impact on the overall success of the project and should be taken very seriously. Think of it as an opportunity to learn and explore at a reduced cost.
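
To put the weights above in terms of the overall course grade, here is a small arithmetic sketch. It only multiplies the percentages stated in this section; the names are illustrative.

```python
from fractions import Fraction

# The team project's share of the course grade, as stated above.
TEAM_PROJECT_SHARE = Fraction(20, 100)

# Each phase's share of the team project.
PHASE_WEIGHTS = {
    "Phase 1": Fraction(20, 100),
    "Phase 2": Fraction(30, 100),
    "Phase 3": Fraction(50, 100),
}


def course_share(phase):
    """Fraction of the total course grade contributed by one phase."""
    return TEAM_PROJECT_SHARE * PHASE_WEIGHTS[phase]


for name in PHASE_WEIGHTS:
    print(name, f"{float(course_share(name)):.0%}")
```

Under these weights, Phase 1 corresponds to 4% of the course grade, Phase 2 to 6%, and Phase 3 to 10%.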
START-PANEL:"info"
### Phase 1 Grading
#### [5%] Checkpoint Report, due Sunday, October 18
The checkpoint report must conform to the given template, which is available [here](https://docs.google.com/document/d/18zfkfu1ZMEKIMhMrsZR-DKv-1Y8aDpixYg6y3bolVt8).
#### [5%] Q1 Checkpoint, due Sunday, October 18
To pass the checkpoint, you need at least one successful 10-minute submission, that is, a submission with a non-zero score.
#### [10%] Q1 Final, due Sunday, October 25
This grade will be calculated as `raw_score * 10%`.
The raw score is based on the *effective throughput* of your *best* 10-minute submission.
Please refer to the **Query Score Calculation** section for the definition of *effective throughput*.
#### [10%] Q2 Checkpoint, due Sunday, October 25
To pass the checkpoint, you need at least one successful 10-minute submission, for **both MySQL and HBase**. That is, the assigned grade for this checkpoint will be either 0% or 10%.
#### [50%] Q2 Final, due Sunday, November 1
Your MySQL and HBase submissions each account for 25% of the Phase 1 grade. The raw score for MySQL/HBase is based on the *effective throughput* of your *best* 10-minute submission.
**Note:** You are required to achieve at least 30% of the RPS target for **both MySQL and HBase** in order to get any credit.
#### [20%] Final Report + Project Code, due Tuesday, November 3
You need to submit a final report at the end of Phase 1. The report must conform to the template, available from [here](https://docs.google.com/document/d/1z5wSN3h2UrWhVWEwBHFGUprm6SpjchXhK7p_PVfFsro).
**Please work on the report as you are working on the queries. The report template has some hints that guide your development. Also, you need to collect data in order to finish the report. The best way to do the project is to continuously work on your report. It will guide your progress. Historically, teams that write the report at the end of each module tend to perform badly.**
END-PANEL
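
The Phase 1 grading breakdown above can be summarized as a short calculation. This is a sketch under stated assumptions, not the official grading script: the `Phase1Results` class and its field names are hypothetical, report and raw scores are assumed to be fractions between 0.0 and 1.0, the Q1 checkpoint is assumed to be all-or-nothing like the Q2 checkpoint, and the 30% RPS requirement is represented by a simple flag.

```python
from dataclasses import dataclass


@dataclass
class Phase1Results:
    checkpoint_report: float     # fraction of report credit earned (assumed 0.0-1.0)
    q1_checkpoint_passed: bool   # at least one non-zero 10-minute submission
    q1_raw_score: float          # raw score of the best Q1 submission (assumed 0.0-1.0)
    q2_checkpoint_passed: bool   # successful submissions for both MySQL and HBase
    q2_mysql_raw_score: float    # assumed 0.0-1.0
    q2_hbase_raw_score: float    # assumed 0.0-1.0
    q2_minimum_met: bool         # both backends reached 30% of the RPS target
    final_report: float          # fraction of report credit earned (assumed 0.0-1.0)


def phase1_grade(r):
    """Return the Phase 1 grade as a percentage between 0 and 100."""
    grade = 5 * r.checkpoint_report
    grade += 5 if r.q1_checkpoint_passed else 0
    grade += 10 * r.q1_raw_score
    grade += 10 if r.q2_checkpoint_passed else 0
    # Q2 Final earns no credit unless both backends meet the minimum.
    if r.q2_minimum_met:
        grade += 25 * r.q2_mysql_raw_score + 25 * r.q2_hbase_raw_score
    grade += 20 * r.final_report
    return grade
```

Note how much depends on the Q2 minimum: a team with otherwise perfect results that misses the 30% threshold on either backend would end at 50 under this model.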
START-PANEL:"warning"
- In the team project, you have larger budgets but also more tasks. You should always plan your tasks, calculate the potential cost before using any resource and pay careful attention to your budget. Decide on your design, experimentation, development, testing, deployment and drama budgets before you start working. Keep in mind that you are allowed to use AWS, Azure or GCP for the ETL process.
- You are NOT ALLOWED to use ANY existing caching applications (Redis, Memcached, etc.), third-party cache libraries or ANY existing databases except MySQL and HBase in the team project. (Amazon RDS is not allowed either.) However, you are allowed to write your own cache application yourself. ANY violation of this rule will result in penalties (-100% at least).
END-PANEL
START-PANEL:"info"
#### **Hints:**
**ETL**: You can use any instance types in the ETL phase, but be mindful of your expenditure and budget. If we take AWS as an example, other instance families in EMR, such as the C family (Compute Optimized) and R family (Memory Optimized), may be appropriate based on your algorithm. Don't restrict your choice only to the M family (General Purpose) machines for the ETL process.
**MySQL**: You can refer to the official MySQL 5.7 optimization documentation [here](https://dev.mysql.com/doc/refman/5.7/en/optimization.html). You may switch to other MySQL versions if needed.
**HBase**: Please review the writeups of the NoSQL primer, the HBase primer and P3.1, and recall what you did in P3.1. This experience will be very useful when you design your HBase schema. You can choose to set up an HBase cluster either using EMR or manually using EC2. This has to be done on AWS. Please weigh the potential performance benefit of deploying your own HBase cluster against the cost before making any decisions. Refer to [this](http://hbase.apache.org/book.html#standalone_dist) if you want to set up an HBase cluster manually. You may also need to install Hadoop and ZooKeeper before installing HBase. You can also try other Hadoop distributions such as [CDH](https://www.cloudera.com/downloads/cdh/5-14-0.html), developed by Cloudera. However, you are NOT allowed to use or install secondary index libraries, like Apache Phoenix.
END-PANEL
START-PANEL:"info"
#### **Reference Server:**
We have also provided two reference servers for you to check the correctness of your results (don't forget to put in the query URL). We highly recommend using these servers to check the correctness of your output files before loading them into your database.
- Full reference server [http://reference.theproject.zone](http://reference.theproject.zone)
- Mini reference server [http://reference.theproject.zone/mini](http://reference.theproject.zone/mini) which contains the first 50 files only (out of 1000)
You can also use these servers to figure out possible encoding problems that you may encounter in Q2.
END-PANEL
START-PANEL:"danger"
[table]
[thead]
[tr]
[th]Violation[/th]
[th]Penalty of the project grade[/th]
[/tr]
[/thead]
[tbody]
[tr]
[td]Using more than $55 to complete this phase[/td]
[td]-10%[/td]
[/tr]
[tr]
[td]Failing to tag all your resources for this project[/td]
[td]-10%[/td]
[/tr]
[tr]
[td]Using more than $0.85 per hour for submissions[/td]
[td]`-2*n%`, where n is the number of cents over budget (e.g. spending $0.920 per hour results in a 2*7% = 14% penalty)[/td]
[/tr]
[tr]
[td]Using more than $75 to complete this phase[/td]
[td]-100%[/td]
[/tr]
[tr]
[td]Submitting **ANY** kind of credentials in code[/td]
[td]-100%[/td]
[/tr]
[tr]
[td]Using instances not in the M family or larger than large for your web service in submission (both your web-tier and storage-tier).[/td]
[td]-100% at least[/td]
[/tr]
[tr]
[td]Publishing your code (e.g. GitHub)[/td]
[td]-200% at least[/td]
[/tr]
[tr]
[td]Copying code from the Internet, other teams or solutions from previous semesters[/td]
[td]-200% at least[/td]
[/tr]
[tr]
[td]Any kind of collaboration across teams[/td]
[td]-200% at least[/td]
[/tr]
[/tbody]
[/table]
END-PANEL
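
The two cost-related penalties in the table above can be worked through in code. This is a minimal sketch with hypothetical function names. It uses `Decimal` to avoid floating-point drift when subtracting dollar amounts, and it leaves fractional cents unrounded because the table does not say how they are treated.

```python
from decimal import Decimal

HOURLY_LIMIT = Decimal("0.85")   # maximum on-demand cost per hour for a submission
PHASE_BUDGET = Decimal("55")     # exceeding this costs 10%
HARD_LIMIT = Decimal("75")       # exceeding this costs 100%


def hourly_cost_penalty(hourly_cost):
    """Penalty in percent for a submission's on-demand hourly cost."""
    cents_over = (Decimal(str(hourly_cost)) - HOURLY_LIMIT) * 100
    if cents_over <= 0:
        return Decimal(0)
    return 2 * cents_over


def phase_budget_penalty(total_spend):
    """Penalty in percent for the total amount spent in this phase."""
    spend = Decimal(str(total_spend))
    if spend > HARD_LIMIT:
        return Decimal(100)
    if spend > PHASE_BUDGET:
        return Decimal(10)
    return Decimal(0)
```

With the table's own example, `hourly_cost_penalty("0.920")` gives 7 cents over the limit and therefore a 14% penalty.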