# Development flow

### Modularization

With two or three people on the team, you need to think about how to maximize parallelism among members. One common (yet inefficient) approach in past years was to make each person responsible for one query. But without a fast web-tier in Query 1, it is impossible to reach the target RPS in Query 2, and then Query 3 (in Phase 2). Given the components in the system, i.e., web-tier (web server & logic), storage-tier (database), and ETL, it might be wise to let each person work on one of them. Remember, it also takes some effort to explore and set up infrastructure (EC2/EMR, software, etc.).

Among components, it helps to agree on a set of interfaces. For example, between web-tier and storage-tier, there can be a `getQ2RawData(keyword, user_id)` function. The developer who works on the web-tier may assume that this function will always return raw data efficiently, and compose a response based on it; the developer who works on the storage-tier will construct the data query, build this interface between storage-tier and web-tier, and do their best to optimize the data query. The same interface can possibly even work for both MySQL and HBase. This also helps the testing process, as you will see below.

You are free to divide the work in any reasonable way, as long as everyone contributes roughly equally to the project. However, life might be easier if you follow this suggestion, especially as you move forward to later phases.

### Branch-based Git workflow

START-PANEL:"info"
After updating the `github-username` field in your [TPZ Profile](https://theproject.zone/profile), you will receive an invitation to the [CloudComputingTeamProject](https://github.com/CloudComputingTeamProject) GitHub organization. You must update the `github-username` field of your student profile, not the profile associated with your team.
END-PANEL

We have created a private repository on GitHub for each team.
Please commit all your code changes to this repository. It will help you collaborate with the team, and keep track of your progress and contributions. Finally, at the end of each phase, we will evaluate how you used Git!

#### Making changes

[Here](https://guides.github.com/introduction/flow/) is a quick introduction to the branch-based workflow. The idea is as follows:

- `master` is a permanent branch which always reflects a production-ready state. In our case, this can be seen as “checkpoints” in your development.
- Whenever you want to create a new feature or improve existing code, create a branch from the tip of `master`. You are free to experiment and make as many commits as necessary, because other team members still have a stable `master` to use. There are two options here:
    - Push the branch to remote (i.e., GitHub) regularly, so your teammates know what you are doing. Plus, you won’t lose code.
    - Just keep the branch locally. In larger companies, this is more likely to happen.

  You are free to choose either approach.
- When the feature is ready, create a [pull request](https://help.github.com/articles/about-pull-requests/). After review, your changes on this branch will be merged into `master`.

You may want to revisit the Git primer for basic usage of Git commands.

START-PANEL:"danger"
Please put all your code and your report (named `report.pdf`) in the private Git repo we prepared for you. We will check your repo at the deadlines. You will also have to upload your report using TheProject.Zone.
END-PANEL

#### Code review

It is always nice to have another pair of eyes when your changes can directly affect the correctness of the whole system. You will also be more likely to write clear, maintainable code, because another human being will read it. Instead of saying “Hey, can you look at my new code?” and sending messages back and forth, you can use the convenient web interface that GitHub provides for code reviews (the change log is clearly marked!).
[Here](https://github.com/features/code-review) is an overview of the process.

- When you submit a pull request, at least one of your teammates needs to review the code change.
- The `master` branch is [protected](https://help.github.com/articles/about-protected-branches/), so code review is mandatory.
- The reviewers may comment on any line or write a summary. They may also choose to approve the code change, so that it gets merged into `master`, or request changes from the author.

It might take some extra time to do the review, but in the end it will prove worthwhile: more bugs can be caught before the code is used in production. What’s more, when you go to interviews, you can confidently show off how your team’s development process works, along with your Git skills!

### Continuous testing

Say you have finished a new feature in the web-tier and want to see if it is correct. Instead of launching a bunch of EC2 instances and waiting an hour for data to load, you can test your component locally. Because the storage-tier piece on the `master` branch is stable and its interface should still work, there is no need to actually use any DB.

To test a small part of the system, people often do [unit testing](https://en.wikipedia.org/wiki/Unit_testing). In our scenario, the developer who works on the web-tier may have created a list of sample requests at the beginning. Instead of going to the storage-tier, she uses another class that implements the same storage-tier interface (e.g., `getQ2RawData(keyword, user_id)`) and reads from a local data file; that is, she uses a “mock” storage-tier for testing. Then she can check whether the web-tier returns the expected, hand-calculated responses.

It also might be desirable to test the entire system as a whole locally to save money ([integration testing](https://en.wikipedia.org/wiki/Integration_testing)).
It is possible to install MySQL and HBase locally (see [HBase Standalone mode](http://hbase.apache.org/0.94/book/standalone_dist.html) for more information), run a local “MapReduce” (covered in Project 1.2), import data, and test the web service on a small set of data. After this, you will know that the system is likely to function correctly on real, costly cloud resources. However, you are not allowed to use the HBase Standalone mode in a real submission.

Continuous testing lets developers be more confident about their code changes: the new feature is well tested, and no existing functionality is broken. Many large IT companies run complicated systems that manage source version control as well as automatic continuous testing: revisions are only accepted if they are reviewed and pass all tests.

For our project, we do not require you to spend too much time on setting up the test infrastructure (but it would be awesome to try some tools [here](https://github.com/marketplace/category/continuous-integration)!). You can use anything reasonable; for example, `junit`, or even some hand-written test code. For integration tests, some bash scripts might also be handy. In your reports for each phase, we will ask how you test your system, and look at your code.

Note that code review and tests are not there to make TAs happy; they will eventually save you time and prevent the drama that happens every semester (especially during live tests). Therefore, learn these real-world skills and use them wisely!
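
To make the interface and “mock” storage-tier ideas from the sections above more concrete, here is a minimal sketch in Java (9 or later, for `List.of` and `Map.of`). Only the name `getQ2RawData(keyword, user_id)` comes from this writeup; everything else is an assumption made for illustration, including the `StorageTier`, `MockStorageTier`, and `WebTier` names, the `List<String>` return type, the `keyword|userId` lookup key, and the newline-joined response. It is not the required Query 2 format. The mock here uses an in-memory map rather than a local data file to keep the example short.

```java
import java.util.List;
import java.util.Map;

public class MockStorageDemo {

    /** Hypothetical contract agreed on between web-tier and storage-tier. */
    public interface StorageTier {
        // Assumed signature: returns the raw rows for one (keyword, user) pair.
        List<String> getQ2RawData(String keyword, String userId);
    }

    /** Test double backed by an in-memory map instead of MySQL or HBase. */
    public static class MockStorageTier implements StorageTier {
        private final Map<String, List<String>> rows;

        public MockStorageTier(Map<String, List<String>> rows) {
            this.rows = rows;
        }

        @Override
        public List<String> getQ2RawData(String keyword, String userId) {
            // Unknown keys yield an empty list rather than null.
            return rows.getOrDefault(keyword + "|" + userId, List.of());
        }
    }

    /** Web-tier logic that depends only on the interface, not on a database. */
    public static class WebTier {
        private final StorageTier storage;

        public WebTier(StorageTier storage) {
            this.storage = storage;
        }

        // Assumed response format: one raw row per line.
        public String handleQ2(String keyword, String userId) {
            return String.join("\n", storage.getQ2RawData(keyword, userId));
        }
    }

    /** Tiny hand-written check helper; a JUnit assertion would work as well. */
    static void check(boolean ok, String name) {
        if (!ok) {
            throw new AssertionError("failed: " + name);
        }
    }

    public static void main(String[] args) {
        StorageTier mock = new MockStorageTier(
                Map.of("cloud|42", List.of("tweet-a", "tweet-b")));
        WebTier web = new WebTier(mock);

        // Compare against hand-calculated responses.
        check(web.handleQ2("cloud", "42").equals("tweet-a\ntweet-b"), "known key");
        check(web.handleQ2("cloud", "7").isEmpty(), "unknown key");
        System.out.println("all checks passed");
    }
}
```

Because `WebTier` only sees the `StorageTier` interface, the same web-tier code could later be handed a MySQL-backed or HBase-backed implementation without changes, while the checks in `main` keep running locally against the mock.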