Assignment 1: SQL

# Assignment 1: SQL ![Twitter meme about database](https://i.imgur.com/kJvLkbR.jpg) ## Overview You can click [here](https://classroom.github.com/a/vjIgZ3QN) to get the stencil code for Homework 1. Reference [this guide](https://docs.google.com/document/d/1v3IQrC_0pFxsRBXsvCEzKBDAmYjzuSJCvXhkg8ewDn0/edit) for more information about Github and Github Classroom. The data is located in the `/data` folder. To ensure compatibility with the autograder, you should not modify the stencil unless instructed otherwise. For this assignment, please write each of your queries to its corresponding SQL file. Failing to do so may break the autograder and result in a low grade. ## Part 0: Setup ### Python Virtual Environment - **Option 1: Department Machine** Execute Code: First, ssh into the department machine by running `ssh [cs login]@ssh.cs.brown.edu` and typing your password when prompted. Then, navigate to the assignment directory and activate the course virtual environment by running `source /ifs/CS/courses/cs1951a/venv/bin/activate`. You can now run your code for the assignment. To deactivate this virtual environment, simply type `deactivate`. - **Option 2: Your Own Device** **Python** This course uses Python `3.11.2`. We expect you to use the same version of Python locally, as our Gradescope autograder runs on this version too. Using any other version of Python (especially if more than 1 minor version apart) will likely result in issues installing dependencies for this course's assignments. There are a number of ways to install Python `3.11.2`. The easiest way is to go to [Python's official website](https://www.python.org/downloads/) to download and install the correct version. Scroll to "Looking for a specific release?" and click on the Download button next to Python `3.11.2`. The page should provide instructions for each major OS. Windows users should check the box to add Python to PATH. If you're on MacOS, an alternative is to use [Homebrew](https://brew.sh/) to install Python 3.11. Simply run `$ brew install python@3.11` If you don't have Homebrew already, install it with `$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"` To verify your installation on MacOS, run `python3.11` in your terminal, and you should see a REPL. On Windows, run `python --version` to confirm you have Python 3.11.2. **Virtual environment** In a terminal, `cd` into the cloned stencil and run ```bash ./create_venv.sh source ~/cs1951a_venv/bin/activate ``` to create the virtual environment. In the future, whenver you create a new terminal, run ```bash source ~/cs1951a_venv/bin/activate ``` to activate the virtual environment. If you're on a Windows machine, these instructions do not work on Command Prompt. You need to run these commands in Git Bash, which you can download [here](https://git-scm.com/downloads). Alternatively, you can either work on the department machine or follow [these instructions](https://docs.google.com/document/d/19h0q6sd4yhirPbZorttmX0uYC40NAEAnpkdLPxeGK3c/edit?usp=sharing) to create the environment in the Command Prompt. After activating your virtual environment, you are free to run your program in the command line. Remember that you'll need to activate the environment every time you launch a new terminal. ### SQLite If you are working locally, you can check if SQLite is installed already by running `sqlite3 -version` in your terminal. You can refer to [this guide](https://www.tutorialspoint.com/sqlite/sqlite_installation.htm) or [this guide](https://www.servermania.com/kb/articles/install-sqlite/) to install SQLite and learn how to use SQLite on your machine. SQLite is installed on all department machines and by default on Mac. It can be accessed from the command line using `sqlite3`. Running `sqlite3 somedb.db` from your terminal will launch an environment that will allow you to type your SQL queries directly into the terminal. You can exit this environment by pushing Ctrl+D or by typing `.exit` and pressing enter. As a more explicit example, to open a SQL environment where you can query the movies.db database, you can type: `$ sqlite3 movies.db` To execute a SQL statement that you have saved in a solution file, you can run the following command: `$ sqlite3 movies.db < sql_solutions.sql` For more information on using SQLite from the command line, see the [official documentation](http://www.sqlite.org/sqlite.html). Additionally, we have provided very helpful hints for most of the problems; you should be able to use these as a starting point if you get stuck before looking up additional information online. Alternatively, there are also really powerful SQLite tools out there that you can use, especially if you prefer a more graphical way of interacting with SQLite. Check out the section below for more details. ### Database The method we recommend the most is [SQLite Browser](https://sqlitebrowser.org). It will let you upload .db files and run your SQL Code Some other tools are [SQLite Viewer](https://inloop.github.io/sqlite-viewer/) (a web application) and SQLTools + SQLite packages (if you are developing on VS Code). Additionally, you can install and use SQLite extensions on your internet browser (e.g., [SQLite Manager](https://chrome.google.com/webstore/detail/sqlite-manager/njognipnngillknkhikjecpnbkefclfe?hl=en) for Chrome and Mozilla) ## Part 1: Starting Off (20 points) This part of the assignment builds off the exercises you completed in the [SQL lab](https://drive.google.com/drive/folders/1hBdd_k4HVgHvhsLTT_PSjGnnQMwnmAkL?usp=sharing). If you have not yet completed the lab, please do so before starting this assignment. There are some really useful hints and examples you can borrow from the lab for this assignment. The database and schema are described again below, but they are the same from the lab. For each problem, we've left useful keywords that you may want to use. If you have not seen a keyword before, you're expected to look up the relevant documentation on [this site](https://www.w3schools.com/sql/default.asp). We have provided a database named `people.db` with the ID, name, occupation, and age of some Brown students and alumni. Here is the schema: ``` people_main(ID INTEGER, name TEXT, occupation TEXT, age INTEGER) people_likes(ID1 INTEGER, ID2 INTEGER) people_friends(ID1 INTEGER, ID2 INTEGER) ``` In the `people_main` table, ID is a unique identifier for a particular student or alumni name, occupation and age correspond to the person's first name, occupation and age. In the `people_friends` table, each `(ID1, ID2)` pair indicates that the particular person with ID1 is friends with the person with ID2 (and vice versa). The friendship is mutual, and, if `(ID1, ID2)` is in the table, it is guaranteed that `(ID2, ID1)` exists in the table. In the `people_likes` table, each `(ID1, ID2)` pair indicates that the student or alumni with `ID1` likes the person with `ID2`. The `(ID1, ID2)` pair in the table does NOT guarantee that the `(ID2, ID1)` pair also exists in the table. Your job is to write SQL queries for the data being requested: 1. (4 points) Write a SQL statement that returns the name and number of people that like each person. Results should be ordered by count (descending), and then by name (A-Z). Save the query to `part1_problem1.sql`. **Keyword(s):** Use a `LEFT JOIN`!** **Hint:** [This page](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/) is quite useful for understanding how `JOIN` works! 2. (4 points) Write a SQL statement that returns the two occupations that have the lowest count of popular people. There should be two columns: name and count. Consider popular people as people who are liked by at least one other person. Return the two occupations and the counts. Results should be ordered by occupation (A-Z). Save the query to `part1_problem2.sql`. **Keyword(s):** The `LIMIT `statement will come in handy!** 3. (4 points) Write a SQL statement that returns the name and occupation of all people who have more than 3 friends. Results should be ordered by name (A-Z). Save the query to `part1_problem3.sql`. **Keyword(s):** You'll need to take a look at the `HAVING` function. 4. (4 points) Write a SQL statement that returns the distinct name and age of all people who are liked by anyone younger than them. Results should be ordered by name (A-Z). Save the query to `part1_problem4.sql`. **Keyword(s):** You'll need to use the `DISTINCT` and `JOIN` keywords. 5. (4 points) Write a SQL statement to find pairs (A, B) such that person A likes person B, but A is not friends with B. The query should return 4 columns: ID of person 1, name of person 1, ID of person 2 and name of person 2. Results should be ordered by ID1 (ascending), then ID2 (ascending). Save the query to `part1_problem5.sql`. **Keyword(s):** You'll need to use a `JOIN` statement.** Time to join stuff! ![Meme about SQL joins](https://i.imgur.com/jqnRL3T.jpg =275x) ## Part 2: Getting Harder (60 points) For this part of the assignment, you will be using the TMDB Movie Dataset, which has been exported to the `movies.db` database. The database schema is as follows: ``` movies(budget INTEGER, homepage TEXT, id INTEGER, original_language TEXT, original_title TEXT, overview TEXT, popularity REAL, release_date TEXT, revenue REAL, runtime INTEGER, status TEXT, tagline TEXT, title TEXT, vote_average REAL, vote_count INTEGER) scores(review TEXT, min_score INTEGER, max_score INTEGER) ``` We encourage you to use the `WITH` operator, which lets you divide your query into separate queries. This separate query is known as a CTE (common table expression) where the result is stored in a virtual table. This virtual table can then be accessed from the main query in the same way that you would access one of the existing databases, such as `movies` or `scores`. For more on this, check out this [link](https://www.geeksforgeeks.org/cte-in-sql/). As an example, we can define a subquery and use it in another query as follows (there is also an example in the lab): ``` WITH subquery AS ( SELECT original_title, vote_average FROM movies ) SELECT original_title FROM subquery; ``` 1. (10 points) Write a SQL query to find the `original_title`, `budget` and `release_date` of the movie "John Carter" and append to that the movie `original_title`, `budget`, and `release_date` of the movie that was released 9 days after "John Carter". You can add days to a particular date by using the `date` function. For example, in order to add 3 days to to '2012-07-16', you can use `date('2012-07-16', '+3 days')` **Keyword(s):** The `UNION` statement should come in handy. 2. (10 points) Write a SQL query to count the number of movies that start with "The", end with a "2," or contain the word "shark". Your query should be case insensitive and return one column with one entry (i.e. a single value). You should be using `original_title` instead of `title`. **Keyword(s):** You may want to look into `CASE` statements and the `LIKE` operator. (Lab!) 3. (10 points) Write a SQL query to select the `original_title` of all movies and a column that contains a 1 if there exists another movie that has the same vote average and the same runtime as that movie, and a 0 otherwise. Results should be ordered by `original_title` (A-Z). **Keyword(s):** You may want to look into the `EXISTS` operator. **Hint:** Think about possible edge cases. 4. (10 points) Write a SQL query that returns the `original_title`, `vote_average` and `review` of every movie. The reviews depends on the `vote_average` as described in the `scores` table. For example, movies with a vote average between 2 and 3.9 (inclusive) are reviewed as 'poor', whereas movies with a vote average between 9 and 10 (inclusive) are reviewed as 'excellent'. If a movie is reviewed as 'awful' or 'poor' then `original_title` should read 'do not watch'. Results should be ordered by `id` (ascending). For example, the output should have the following format: ``` 'Snow White' | 8.7 | 'great' 'Toy Story' | 9.3 | 'must see' 'do not watch' | 2.3 | 'poor' ``` **Keywords(s)**: Look into the `BETWEEN` statement and how it can be used in a `JOIN`. **Hint:** Do NOT modify the current database by using `UPDATE`. Take a look at the `CASE` operation example from the lab. 5. (10 points) Write a SQL query that finds the `original_title`, `release_date` and `revenue` of all the movies whose revenue exceeded the average revenue of all the movies released on the same day (including itself). Results should be ordered by `release_date` (ascending), and then `revenue` (descending). **Keywords(s):** You'll find the `GROUP BY`, `AVG`, and `JOIN` operations helpful! 6. (10 points) Write a SQL query that, for each `original_language` that has **more than 2 movies**, finds the number of movies that were reviewed as 'poor' and the number of movies that were reviewed as 'good'. Like in the 4th question, you will need to look at the `scores` table to see what the review categories are and use the `vote_average` field of a movie to determine which review category it falls under. Your query should return 3 columns (`original_language`, `num_poor` which is the number of 'poor' movies for that language, and `num_good` which is the number of 'good' movies for the language, in this order). Your results should be ordered by number of 'good' movies (descending) and then number of 'poor' movies (ascending). Remember to only include languages that have more than 2 movies! **Keywords(s):** The `JOIN` statement will come in handy! **Hint:** Refer to the examples from the lab! ## Part 3: Optimization (10 points) We have also provided you with the `athletes.db` database (although you will not need to query it). The schema is as follows: `school_athletes(ID INTEGER, name TEXT, school TEXT, performance_score INTEGER, division TEXT)` For the query below, explain why the given query might not be the most efficient way to accomplish the task. Write out an optimized version of the query in `writeup.md`. Explain what steps you took to optimize it and why your version would be more efficient. 1. (6 points) The SQL query to optimize is as follows: ``` SELECT ID, name FROM school_athletes AS athletes WHERE school = 'Brown' and performance_score > ( SELECT AVG(performance_score) FROM school_athletes WHERE division = athletes.division ); ``` 2. (4 points) Consider two tables. Table A is very long with 1 billion rows and 5 columns. Table B is very wide with 1000 rows and 10,000 columns. If we were to join the two tables and want to make sure the join is performant, how should we best filter the tables? Assume that we can select from each table and then join the results. Specifically, state the table in which we should use `WHERE` heavily and the table in which we should be careful about what values we use in our `SELECT`. ## Handing In After finishing the assignment (and any assignment in the future), run `python zip_assignment.py` in the command line from your assignment directory, and fix any issues brought up by the script. After the script has been run successfully, you should find the file `sql-submission-1951A.zip` in your assignment directory. Please submit this zip file on Gradescope under the respective assignment. ## Credits Made with ♥ by Jens and Esteban, updated for Spring 2022 by Aakansha Mathur, updated for Spring 2023 by Alex and Annabel. Movie Database from [Kaggle](https://www.kaggle.com/tmdb/tmdb-movie-metadata).