# Overview of Tasks

### Web Tier

This task requires you to build the web-tier of the web service. The web-tier should accept RESTful requests and send back responses.

#### Design constraints

* Execution model: you may use any web framework you want, but you **must** compare at least two web frameworks and summarize the comparison in your report for this phase.
* Spot instances are strongly recommended during the development period; otherwise it is very likely that you will exceed the overall budget for this phase and fail the project.

#### Recommendations

To test your web-tier, use the Heartbeat query (q1). It is wise to ensure that your system comes close to satisfying the minimum throughput requirement for heartbeat requests before you move forward. As you design the web-tier, however, keep the cost in mind.

Write an automated script, or build a custom AMI, to launch instances and configure your web-tier. This lets you reconstruct the web-tier quickly, which you may need to do several times during exploration and experimentation. Terminate your instances when they are not needed. These hints can save you time and help keep your spending under the budget.

#### Hints

Although we place no constraints on the web-tier, the performance of different web frameworks varies widely. Choosing a slow web framework may hurt the throughput of every query, so we strongly recommend that you investigate this topic before you start. You may find [TechEmpower](https://www.techempower.com/benchmarks/) helpful; it provides detailed benchmarks for mainstream frameworks. You can also compare the performance of different frameworks under our testing environment by running q1, since it is just a heartbeat message and has no interaction with the storage-tier. Also consider whether the web framework you choose has API support for MySQL and HBase.
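To make the heartbeat test concrete, here is a minimal sketch of a q1-style endpoint using only Python's standard library. The `/q1` path, the `alive` payload, and the port are assumptions for illustration; your actual framework, route, and response format will follow the project's query specification.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import urlopen


class HeartbeatHandler(BaseHTTPRequestHandler):
    """Answers q1-style heartbeat requests with a fixed payload."""

    def do_GET(self):
        if self.path == "/q1":  # hypothetical heartbeat route
            body = b"alive\n"   # hypothetical response payload
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging; it skews throughput measurements


def start_server(port):
    """Bind the heartbeat server and serve it from a background thread."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HeartbeatHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_server(8081)
reply = urlopen("http://127.0.0.1:8081/q1").read()  # -> b"alive\n"
server.shutdown()
```

Benchmarking a bare-bones baseline like this with a load generator (e.g., `wrk` or `ab`) gives you a reference point for what the machine itself can sustain before you compare full frameworks on the same instance type.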
In your report, you need to convince us that your team did enough exploration and experimentation to support a detailed comparison of different frameworks, which led your team to an informed decision. Your report should include evidence of this performance comparison.

### ETL

This task requires you to load the Twitter dataset into the database using the extract, transform, and load (ETL) process from data warehousing. In the extract step, you extract data from an outside source; for this phase, the outside source is a JSON Twitter dataset of about 200 million tweets stored on S3. The transform step applies a series of functions to the extracted data to produce the data needed in the target database, and therefore depends on the schema design of that database. The load step loads the data into the target database.

You will have to carefully design the ETL process using AWS, Azure, or GCP resources. Considerations include the programming model used for the ETL job and the type and number of instances needed to execute it. From these, you should be able to estimate the expected completion time and hence the expected overall cost of the ETL job.

Once this step is completed, you should back up your database to save cost. If you use EMR, you can back up HBase to S3 using the commands:

`hbase snapshot create -n snapshotName -t tableName`

`hbase snapshot export -snapshot snapshotName -copy-to s3://bucketName/folder -mappers 2`

The export command runs a Hadoop MapReduce job, which you can monitor from the Amazon EMR debugging tool or by accessing the JobTracker. To learn more about HBase S3 backups, refer to [this link](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase-snapshot.html).

#### Design constraints

* Programming model: you can use any programming model you see fit for the ETL job.
* AWS resources: you should use spot instances for this step; otherwise it is very likely that you will exceed the budget and fail the project.

#### Recommendations

Always test the correctness of your system using a tiny dataset (for example, 200MB). If your ETL job fails or produces wrong results on the larger dataset, you will burn through your budget. After the database is populated, it is wise to test the throughput of your storage-tier for the different query types, and to ensure that your system produces correct responses for all of them.

#### Hints

**Think** about the schema design of your database before attempting the ETL job. Doing ETL correctly and efficiently is a critical part of your success in this project. Note that ETL on the full dataset can take 10-30 hours for a single run, so repeating it is painful, although repetition may be inevitable since you will be refining your schema throughout development. Think about ways to reduce the time and cost of the ETL job, since you might have to run it several times. Your ETL job may be extremely time consuming because of the large dataset and/or a poor ETL design. Because many things can cause an ETL job to fail, start thinking about your database schema and your ETL job as early as possible.

Utilize parallelism as much as possible: loading the data with a single process or thread wastes time and computing resources. MapReduce may be your friend, though other methods work as long as they accelerate your ETL process.

### Storage Tier

For your system to provide responses for q2-q3, you need to store the required data in your storage-tier (database). Your web-tier connects to the storage-tier, queries it to obtain the response, and sends the response back to the requester. You will use both HBase and MySQL in this phase.
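The web-tier-to-storage-tier path can be sketched in a few lines. The example below uses Python's built-in `sqlite3` as a stand-in, since MySQL drivers such as `mysqlclient` and `PyMySQL` expose the same DB-API 2.0 interface (`connect`/`execute`/`fetchall`); only the connection call and the parameter placeholder style change. The `tweets` table and the user-id lookup are hypothetical — your schema must be derived from the actual query requirements.

```python
import sqlite3

# Stand-in for a MySQL connection; with PyMySQL this would be
# pymysql.connect(host=..., user=..., ...) and "%s" placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (tweet_id TEXT PRIMARY KEY, user_id TEXT, text TEXT)"
)
conn.execute("INSERT INTO tweets VALUES ('1', 'u42', 'hello')")


def handle_query(user_id):
    """What a web-tier request handler does: run a parameterized lookup
    against the storage-tier, then format the rows into a response body."""
    rows = conn.execute(
        "SELECT tweet_id, text FROM tweets WHERE user_id = ?", (user_id,)
    ).fetchall()
    return "\n".join(f"{tid}:{text}" for tid, text in rows)


print(handle_query("u42"))  # -> 1:hello
```

Parameterized queries (the `?` placeholder) matter here: request parameters arrive from untrusted clients, and string-concatenated SQL would be both an injection risk and a correctness hazard during the functionality test with dummy entries.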
We will provide you with some references that will accelerate your learning process; you are expected to read and learn about these database systems on your own in order to finish this task. You may only use HBase in cluster mode (e.g., using EMR in AWS), **not** in standalone mode; otherwise you would miss the purpose of using HBase.

This task requires you to construct the storage-tier. It should be able to store whatever data you need to satisfy the query requests. You should use spot instances for system development. You are not allowed to use Amazon database services like RDS, or Cassandra, for this project.

You need to consider the design of the table structure, since it significantly affects the performance of a database. In this task, you should also test that your web-tier connects to your storage-tier (database) and can get responses for queries.

In addition, since we are using AWS in this project, your storage IOPS performance may depend on throughput credits and burst performance. This matters for the Live Test, which will last several hours, so take the burst balance into consideration when designing your storage tier.

#### Recommendations

Test the **functionality** of the database with a few dummy entries before loading your entire dataset. The functionality test ensures that your database can correctly produce the responses to the queries.

#### References

1. MySQL [http://dev.mysql.com/doc/](http://dev.mysql.com/doc/)
2. HBase [https://hbase.apache.org/](https://hbase.apache.org/)
3. AWS Storage Performance [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html#EBSVolumeTypes_piops](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html#EBSVolumeTypes_piops)

### Additional Resources and References

#### Resources

1. [Benchmarks of web servers](https://www.techempower.com/benchmarks/)
2. [Schwartz, B., and P. Zaitsev. "A brief introduction to goal-driven performance optimization." White paper, Percona (2010).](http://www.percona.com/blog/2010/05/04/goal-driven-performance-optimization-white-paper-available/)
3. [Practical MySQL Performance Optimization](http://www.percona.com/resources/mysql-webinars/practical-mysql-performance-optimization)
4. [HBase Cheat Sheet](https://dzone.com/refcardz/hbase)

#### Additional References

These are interesting papers that deal with the theory and the core problems you will be solving in the Team Project. You may choose to read them if you want to understand more about how Internet-scale companies (Google, Facebook, Twitter) achieve performance at scale.

**Architecting web servers**

1. [Erb, Benjamin. "Concurrent programming for scalable web architectures." Informatiktage. 2012.](http://vts.uni-ulm.de/docs/2012/8082/vts_8082_11772.pdf)
2. [Pariag, David, et al. "Comparing the performance of web server architectures." ACM SIGOPS Operating Systems Review, Vol. 41, No. 3. ACM, 2007.](https://www.ece.cmu.edu/~ece845/docs/pariag-2007.pdf)
3. [Hu, James C., Irfan Pyarali, and Douglas C. Schmidt. "Measuring the impact of event dispatching and concurrency models on web server performance over high-speed networks." Global Telecommunications Conference (GLOBECOM '97), IEEE, Vol. 3. IEEE, 1997.](https://www.dre.vanderbilt.edu/~schmidt/PDF/globalinternet.pdf)

**Clustering web servers**

1. [Schroeder, Trevor, Steve Goddard, and Byrav Ramamurthy. "Scalable web server clustering technologies." IEEE Network 14.3 (2000): 38-45.](http://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1083&context=csearticles)
2. [Cardellini, Valeria, Michele Colajanni, and Philip S. Yu. "Dynamic load balancing on web-server systems." IEEE Internet Computing 3.3 (1999): 28-39.](http://www.ics.uci.edu/~cs230/reading/DLB.pdf)
3. [Paudyal, Umesh. "Scalable web application using node.js and couchdb." (2011).](http://uu.diva-portal.org/smash/get/diva2:443102/FULLTEXT01.pdf)

**Optimizing a Multi-tier System**

1. [Fitzpatrick, Brad. "Distributed caching with memcached." Linux Journal 2004.124 (2004): 5.](http://www.linuxjournal.com/article/7451)
2. [Graziano, Pablo. "Speed up your web site with Varnish." Linux Journal 2013.227 (2013): 4.](http://www.linuxjournal.com/content/speed-your-web-site-varnish)
3. [Reese, Will. "Nginx: the high-performance web server and reverse proxy." Linux Journal 2008.173 (2008): 2.](http://www.linuxjournal.com/magazine/nginx-high-performance-web-server-and-reverse-proxy)

**Scalable and Performant Data Stores**

1. [DeCandia, Giuseppe, et al. "Dynamo: Amazon's highly available key-value store." ACM SIGOPS Operating Systems Review, Vol. 41, No. 6. ACM, 2007.](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)
2. [Cattell, Rick. "Scalable SQL and NoSQL data stores." ACM SIGMOD Record 39.4 (2011): 12-27.](http://www.cattell.net/datastores/Datastores.pdf)

**Web Server Performance Measurement**

1. [Slothouber, Louis P. "A model of web server performance." Proceedings of the 5th International World Wide Web Conference. 1996.](http://www.oocities.org/webserverperformance/webmodel.pdf)
2. [Banga, Gaurav, and Peter Druschel. "Measuring the capacity of a web server." USENIX Symposium on Internet Technologies and Systems. 1997.](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.3268&rep=rep1&type=pdf)
3. [Nottingham, Mark. "On HTTP Load Testing."](https://www.mnot.net/blog/2011/05/18/http_benchmark_rules)