# 50.043 Database Project Documentation
## Group 22
**Chow Jia Yi | 1003597
Nan Shing Kham Shing Alice | 1003747
Tiffany Goh Yi Lin
Xavier Tan De Jun**
## Contents
1. Tech Stack
2. How To Run Production Scripts (via Amazon CloudFormation)
3. How To Tear Down Production System
4. Application Features
5. Flask API
6. Scraper
7. Production System Databases
8. Analytics System
9. Analytics Tasks
10. Automation Scripts
## Tech Stack
This application consists of a React-Redux frontend and a Flask backend.
To build the frontend, change directory to `Book-Review-Web-Application/website` and run
```bash
npm run production
```
to compile and minify the React code into a bundle. The bundle files can be found in `bookReviewAPI/src/templates`.
When the Flask application is served, it reads the static files from that folder.
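For reference, here is a minimal sketch of how the Flask backend can serve the built bundle; the entry-point file and template name are assumptions, not the actual code in `bookReviewAPI/src`:
```python
# Minimal sketch (assumption): serving the built React bundle from Flask.
# The real app in bookReviewAPI/src may be organised differently.
from flask import Flask, render_template

app = Flask(__name__, template_folder="templates", static_folder="static")

@app.route("/")
def index():
    # index.html is assumed to be produced by `npm run production`
    return render_template("index.html")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```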
## Project Architecture on Github
- **automationScripts** contains the scripts to set up the production and analytics systems.
  - **hdfsScripts** contains the scripts to set up the Hadoop clusters.
  - **mongodbScripts** contains helper code to set up MongoDB.
  - **mysqlScripts** contains scripts to set up and load data into MySQL.
- **bookReviewAPI** contains the Flask code for the APIs.
  - **src** contains the Flask code.
    - **templates** contains the HTML pages.
    - **static** contains the minified React code served by the Flask backend. This directory is created when `npm run production` is run.
    - **config.py** contains the URIs of the databases (a sketch is given after this list).
    - **controllers.py** contains all of the API calls.
    - **error.py** contains the helper functions for error handling.
    - **models.py** contains the MySQL database schema.
  - **requirements.txt** lists all the dependencies needed for the Flask API. When setting up the production server, `pip3 install -r requirements.txt` is run to install all required backend dependencies.
- **website** contains the frontend React application, which is compiled into static files in production.
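As an illustration of the configuration described above, `config.py` might look roughly like the following; the variable names, credentials, and hosts are placeholders rather than the actual values:
```python
# Hypothetical sketch of config.py; the real variable names and values differ.
# Hosts and credentials below are placeholders.
MYSQL_URI = "mysql+pymysql://user:password@mysql-host:3306/bookreviews"
MONGO_URI = "mongodb://mongodb-host:27017/"
SQLALCHEMY_DATABASE_URI = MYSQL_URI
SQLALCHEMY_TRACK_MODIFICATIONS = False
```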
## How To Run Production Scripts (via Amazon CloudFormation)
Get the full file URI of `production_template.json` and pass it as `[file URI]` in the command below.
Run the following:
```bash
aws cloudformation create-stack --stack-name [stack name] --template-body [file URI] --parameters ParameterKey=InstanceType,ParameterValue=[instance type] ParameterKey=ImageId,ParameterValue=[AMI ID] ParameterKey=KeyName,ParameterValue=[name of private key]
```
Expect to wait for 5 to 10 minutes for the production system to be fully loaded.
To find the address of the webpage, run
```bash
aws cloudformation describe-stacks --stack-name [stack name]
```
Once the stack has been fully set up, the command outputs the IP addresses for the backend server (BEPublicIP), MongoDB, and MySQL. Copy the BEPublicIP into your browser to access the application.
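If you prefer to read the stack outputs programmatically, here is a small boto3 sketch, assuming the output key is named `BEPublicIP` as above:
```python
# Sketch: read the BEPublicIP output of the stack with boto3.
# Replace "my-stack" with your actual stack name.
import boto3

cf = boto3.client("cloudformation")
stack = cf.describe_stacks(StackName="my-stack")["Stacks"][0]
outputs = {o["OutputKey"]: o["OutputValue"] for o in stack["Outputs"]}
print("Backend server:", outputs["BEPublicIP"])
```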
## How To Tear Down Production System
1. Run the following:
```bash
aws cloudformation delete-stack --stack-name [stack name]
```
## Application Features
Website visitors can register an account to become a user.
1. **Home Page**
On the Home Page, visitors can search for a book in three ways:
1. By ++Genre section++
    - the visitor clicks on one of the genres provided
    - the visitor is redirected to the Genre Page
2. By ++Ratings++
    - the visitor clicks on a rating checkbox (e.g. `4 and higher`, which indicates books with an average rating of 4 stars and above)
    - results are displayed below on the same page
3. By ++Title++
    - the visitor enters a search keyword in the search bar and clicks the `Search` button
    - results are displayed below on the same page
On the Navigation Bar, the visitor can choose to register for an account or log in to an existing account.
2. **Login Page**
Users can log in to their account and view their details and bookshelf on the Profile Page. Logged-in users can also leave a book review on the Book Page of a particular book.
3. **Register Page**
Visitors can register for an account to become a user.
4. **Genre Page**
The Genre Page displays books of the selected genre. To improve efficiency, the first 1000 books are displayed first. The visitor can click on a book to view it; this redirects them to the Book Page.
5. **Book Page**
The Book Page displays the book's details and reviews. A user can leave a review in the ++My Review++ section (if they are logged in).
Logged-in users can also save books that they are interested in under 'Want to Read', and books that they have read under 'History'.
They can then view the saved books on the Profile Page.
6. **Profile Page**
The Profile Page displays the user's details and bookshelf. The bookshelf contains books the user is currently reading or has read (we did not manage to debug the Bookshelf section, so it is currently not working).
7. **Add Book Page**
Users can submit a book to the website admin if the book is not available on the site.
## Flask API
We used Flask, a Python web application framework based on the Werkzeug WSGI toolkit and the Jinja2 template engine, to build our application. The files are located in `/bookReviewAPI/src` in our Github repository. controllers.py (`/bookReviewAPI/src/controllers.py`) is the main file that connects to the backend; it contains the routes and the functions that fetch the database instance and collections.
The data is sent to the frontend by controllers.py, which also contains the main code that renders the HTML templates.
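As a rough illustration of this pattern, here is a hedged sketch of a route that fetches book metadata from MongoDB; the route path, hostname, and projection are assumptions, not the actual code in controllers.py:
```python
# Illustrative sketch only: the actual routes in controllers.py differ.
from flask import Blueprint, jsonify
from pymongo import MongoClient

api = Blueprint("api", __name__)
# Replace "mongodb-host" with the MongoDB EC2 address (placeholder here).
metadata = MongoClient("mongodb://mongodb-host:27017/")["book_metadata"]["metadata"]

@api.route("/api/books/<asin>", methods=["GET"])
def get_book(asin):
    # Fetch one book's metadata by its ASIN and return it as JSON.
    book = metadata.find_one({"asin": asin}, {"_id": 0})
    if book is None:
        return jsonify({"error": "book not found"}), 404
    return jsonify(book)
```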
## Scraper
The provided databases contain limited data, and many of the book titles are missing. We scraped the missing titles as shown in the `/web_scrapper_title.py` file.
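A minimal sketch of the scraping approach follows, assuming titles are fetched per book from a product page; the URL pattern and HTML selector are placeholders, and the actual logic lives in `/web_scrapper_title.py`:
```python
# Illustrative sketch of scraping a missing book title by its identifier.
# The real script (/web_scrapper_title.py) may target a different page/selector.
import requests
from bs4 import BeautifulSoup

def scrape_title(asin):
    url = f"https://www.example.com/dp/{asin}"  # hypothetical URL pattern
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title_tag = soup.find("span", id="productTitle")  # selector is an assumption
    return title_tag.get_text(strip=True) if title_tag else None
```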
## Production System Databases
### MongoDB
Our MongoDB server is hosted on a separate EC2 instance. It contains two databases, one named book_metadata, and the other named web_logs.
The book_metadata database stores the metadata of all books, including the title, related books, price, ASIN, and so on. All book metadata is stored in a collection named metadata.
The web_logs database records user activity. Whenever a user makes an API request to the backend through the website, information about the request is stored in the web_logs database, in the logs collection.
Each log record has the following information:
* Request method (GET, POST, etc)
* Request path
* Response status code (404, 200, 500, etc.)
* Duration
* Timestamp
* IP address
* Host
* Request parameters
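Below is a hedged sketch of how such a record could be captured, assuming a Flask `after_request` hook writes to the logs collection; the field names follow the list above, and the host is a placeholder:
```python
# Sketch (assumption): recording one log document per request with Flask hooks.
import time
from datetime import datetime, timezone
from flask import Flask, request, g
from pymongo import MongoClient

app = Flask(__name__)
logs = MongoClient("mongodb://mongodb-host:27017/")["web_logs"]["logs"]

@app.before_request
def start_timer():
    g.start = time.time()

@app.after_request
def log_request(response):
    logs.insert_one({
        "method": request.method,
        "path": request.path,
        "status": response.status_code,
        "duration": time.time() - g.get("start", time.time()),
        "timestamp": datetime.now(timezone.utc),
        "ip": request.remote_addr,
        "host": request.host,
        "params": request.args.to_dict(),
    })
    return response
```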
### SQL
Our MySQL server is hosted on a separate EC2 instance.
The MySQL database contains the following tables:
* **User** (`uid` varchar, not null, primary key; `password` text)
* **Review** (primary key (`uid`, `asin`); `helpful` varchar; `overall` int; `reviewText` varchar; `summary` varchar; `reviewTime` datetime; `unixReviewTime` bigint)
* **Book** (`asin` varchar, not null)
* **ReadHistory** (`uid` varchar; `asin` varchar)
* **Interest** (`uid` varchar; `asin` varchar)
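For illustration, the corresponding models in `models.py` could look roughly like the following with Flask-SQLAlchemy; the column lengths and the extra primary keys are assumptions needed to make the sketch self-contained, not the actual schema file:
```python
# Hypothetical sketch of models.py using Flask-SQLAlchemy; the real file differs.
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

class User(db.Model):
    uid = db.Column(db.String(255), primary_key=True)
    password = db.Column(db.Text)

class Review(db.Model):
    uid = db.Column(db.String(255), primary_key=True)   # composite key (uid, asin)
    asin = db.Column(db.String(255), primary_key=True)
    helpful = db.Column(db.String(255))
    overall = db.Column(db.Integer)
    reviewText = db.Column(db.String(5000))
    summary = db.Column(db.String(1000))
    reviewTime = db.Column(db.DateTime)
    unixReviewTime = db.Column(db.BigInteger)

# primary_key on the tables below is an assumption (SQLAlchemy requires one).
class Book(db.Model):
    asin = db.Column(db.String(255), primary_key=True)

class ReadHistory(db.Model):
    uid = db.Column(db.String(255), primary_key=True)
    asin = db.Column(db.String(255), primary_key=True)

class Interest(db.Model):
    uid = db.Column(db.String(255), primary_key=True)
    asin = db.Column(db.String(255), primary_key=True)
```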
## Analytics System
## Analytics Tasks
We did not manage to set up the ingestion tools that move data from the respective databases into our Hadoop cluster in time.
However, we have finalised the analytics task scripts, which can be run by following the commands in AnalyticsScript.txt on the namenode EC2 instance as the hadoop user.
### Pearson Correlation: Between price and average review length
The value is calculated using the formula:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^{2} - \left(\sum x\right)^{2}\right)\left(n\sum y^{2} - \left(\sum y\right)^{2}\right)}}$$

where:
* $r$ = Pearson correlation value
* $n$ = total number of values
* $x$ = values in the first dataset (the prices of the books)
* $y$ = values in the second dataset (the average review lengths of the books)
We have a class PearsonCorrelationCalculator defined in pearson_correlation.py. Five map-reduce tasks are defined in the `calculate_pearson_correlation` function. The correlation value is then calculated from the outputs of the map-reduce tasks using the formula above.
The calculated Pearson Correlation value is 0.0228853646.
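As a worked illustration of the final step, assuming the map-reduce jobs emit the aggregates $n$, $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, and $\sum y^2$ (the actual job outputs in pearson_correlation.py may be organised differently):
```python
import math

def pearson_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Combine the aggregated sums into the Pearson correlation r."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Toy example: three (price, average review length) pairs.
xs, ys = [10.0, 15.0, 20.0], [120.0, 100.0, 150.0]
r = pearson_from_sums(
    n=len(xs),
    sum_x=sum(xs),
    sum_y=sum(ys),
    sum_xy=sum(x * y for x, y in zip(xs, ys)),
    sum_x2=sum(x * x for x in xs),
    sum_y2=sum(y * y for y in ys),
)
print(round(r, 3))  # ≈ 0.596 for this toy data
```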
### TF-IDF: Computing the term frequency inverse document frequency metric on the review text
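For reference, the standard definition of the metric is shown below; the exact weighting used in our script may differ.

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$$

where $\mathrm{tf}(t, d)$ is the number of times term $t$ appears in review $d$, $N$ is the total number of reviews, and $\mathrm{df}(t)$ is the number of reviews containing $t$.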
## Automation Scripts
###### tags: `NodeJS` `React` `Redux` `Flask` `MongoDB` `SQL` `Documentation` `HDFS` `Spark` `Javascript` `Python`