# [Learn SQL](https://www.codecademy.com/learn/learn-sql) > Lee Tsung-Tang > ###### tags: `SQL` `database` [TOC] {%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %} ## Manipulation ### Introduction to SQL The statements covered in this course use SQLite Relational Database Management System (RDBMS). You can also access a glossary of all the SQL commands taught in this course. ### Relational Databases ```SQL= SELECT * FROM celebs; ``` A `table` is a collection of data organized into rows and columns. Tables are sometimes referred to as relations. Here the table is ==celebs==. A `column` is a set of data values of a particular type. Here, ==id, name, and age== are the columns. A `row` is a single record in a table. The first row in the celebs table has: - An `id` of 1 - A `name` of Justin Bieber - An `age` of 22 All data stored in a relational database is of a certain data type. Some of the most common data types are: - `INTEGER`, a positive or negative whole number - `TEXT`, a text string - `DATE`, the date formatted as YYYY-MM-DD - `REAL`, a decimal value ### Statements The code below is a SQL statement. A statement is text that the database recognizes as a valid command. Statements always end in a semicolon `;`. ```sql= CREATE TABLE table_name ( column_1 data_type, column_2 data_type, column_3 data_type ); ``` 1. `CREATE TABLE` is a *clause*. Clauses perform specific tasks in SQL. By convention, clauses are written in capital letters. Clauses can also be referred to as commands. <!-- 2. `table_name` refers to the name of the table that the command is applied to. (column_1 data_type, column_2 data_type, column_3 data_type) is a parameter. A parameter is a list of columns, data types, or values that are passed to a clause as an argument. Here, the parameter is a list of column names and the associated data type. The structure of SQL statements vary. The number of lines used does not matter. A statement can be written all on one line, or split up across multiple lines if it makes it easier to read. In this course, you will become familiar with the structure of common statements. --> ... 中間待補 ### Update The `UPDATE` statement ==edits a row in a table==. You can use the `UPDATE` statement when you want to change existing records. `id` =4 的 `twitter_handle` 改為 `@taylorswift13` ```sql= UPDATE celebs SET twitter_handle = '@taylorswift13' WHERE id = 4; ``` 1. `UPDATE` is a clause that edits a row in the table. 2. `celebs` is the name of the table. 3. `SET` is a clause that indicates the column to edit. - `twitter_handle` is the name of the column that is going to be updated - `@taylorswift13` is the new value that is going to be inserted into the `twitter_handle` column. 4. `WHERE` is a clause that indicates ==which row(s)== to update with the new column value. Here the row with a 4 in the id column is the row that will have the `twitter_handle` updated to `@taylorswift13`. #### Instructions ![](https://i.imgur.com/ZlwmTiO.png) #### QA: How is `ALTER` different from `UPDATE`? **Answer** Although similar in the sense that both statements will modify a table, these statements are quite different. The `ALTER` statement is used to ==modify columns==. With ALTER, you can add columns, remove them, or even modify them. The `UPDATE` statement is used to ==modify rows==. However, `UPDATE` can only update a row, and ==cannot== remove or add rows. ### Delete The `DELETE FROM` statement deletes one or more rows from a table. ```sql= DELETE FROM celebs WHERE twitter_handle IS NULL; ``` 1. `DELETE FROM` is a clause that lets you delete rows from a table. 2. `celebs` is the name of the table we want to delete rows from. 3. `WHERE` is a clause that lets you select which rows you want to delete. Here we want to delete all of the rows where the `twitter_handle` column `IS NULL`. 4. `IS NULL` is a condition in SQL that returns true when the value is NULL and false otherwise. #### Instructions ![](https://i.imgur.com/wlddkU4.png) #### QA: What if we only want to delete a specific number of rows? Answer To delete only a specific number of rows, you can utilize the `LIMIT` statement. The value provided for `LIMIT` will be how many rows to affect. For example, this statement will only delete the first 5 rows that match the condition, ```sql= DELETE FROM table WHERE condition LIMIT 5; ``` ### Constraints *Constraints* that ==add information== about how a column can be used are invoked after specifying the data type for a column. They can be used to tell the database to **reject** inserted data that does not adhere to a certain restriction. The statement below sets constraints on the `celebs` table. ```sql= CREATE TABLE celebs ( id INTEGER PRIMARY KEY, name TEXT UNIQUE, date_of_birth TEXT NOT NULL, date_of_death TEXT DEFAULT 'Not Applicable' ); ``` 1. `PRIMARY KEY` columns can be used to ==uniquely identify the row==. Attempts to insert a row with an identical value to a row ==already== in the table will result in a *constraint violation* which will not allow you to insert the new row. 2. `UNIQUE` columns have a *different value for every row*. This is similar to `PRIMARY KEY` except a table can have many different `UNIQUE` columns. 3. `NOT NULL` columns must have a value. Attempts to insert a row without a value for a `NOT NULL` column will result in a constraint violation and the new row will not be inserted. 4. `DEFAULT` columns take an additional argument that will be the **assumed** value for an inserted row if the new row does not specify a value for that column. #### Instructions ![](https://i.imgur.com/aPSV3PL.png) #### QA: What are some reasons to apply constraints to a table? Answer 1. ==prevent invalid data== in the table. This is very important, because invalid data can cause issues and unexpected results from calculations. 2. Constraints can let us ==prevent missing data==, which is usually filled as `NULL` within the table. Instead of having missing values set to NULL, we can set constraints so that the missing values are given some default value instead, like `0`. 3. Another important reason to add a constraint is for ==uniqueness==, usually in the form of values like the `id`, or identifier column. By using a constraint like the `PRIMARY KEY`, we can ensure that every row has their own unique id value. ### Review SQL is a programming language designed to manipulate and manage data stored in relational databases. - A *relational* database is a database that organizes information into one or more tables. - A *table* is a collection of data organized into rows and columns. A *statement* is a string of characters that the database recognizes as a valid command. - `CREATE TABLE` creates a new table. - `INSERT INTO` adds a new row to a table. - `SELECT` queries data from a table. - `ALTER TABLE` changes an existing table. - `UPDATE` edits a row in a table. - `DELETE FROM` deletes rows from a table. *Constraints* add information about how a column can be used #### cheat sheet ![](https://i.imgur.com/k71aygH.png) ![](https://i.imgur.com/9yjVsXL.png) ![](https://i.imgur.com/Guvxfra.png) https://codecademy-content.s3.amazonaws.com/cheatsheets/learn-sql/learn-sql-manipulation.pdf https://www.codecademy.com/learn/learn-sql/modules/learn-sql-manipulation/reference #### QA: Are there any other commonly used SQL commands? Answer 1. One such command is `DROP TABLE`, which you can use to permanently *remove a table* from a database. 2. Other commands, such as `ANALYZE`, which is used to *obtain statistics* about a table. ## QUERIES ### Select We can select individual columns by their names (separated by a comma): ```sql= SELECT column1, column2 FROM table_name; ``` #### Instructions ![](https://i.imgur.com/MR3YLge.png) #### QA: Do SQL clauses always have to be entirely capitalized? Answer ==No==, SQLite, which Codecademy uses, is case-insensitive when it comes to clauses like SELECT and FROM which can be cased in any way. This is different from other programming languages such as Python where casing is quite important. Example ```sql= /* Both of the following queries will return the same result. */ SELECT * FROM table; select * from table; ``` ### As ```sql= SELECT name AS 'Titles' FROM movies; ``` `AS` is a keyword in SQL that allows you to ==rename a column or table== using an alias. The new name can be anything you want as long as you put it inside of single quotes. Here we renamed the `name` column as `Titles`. Some important things to note: - it’s best practice to surround your aliases with ==single quotes==. - When using `AS`, the columns are not being renamed in the table. The aliases only appear in the result. #### QA: Can we alias multiple columns in a single query? Answer ==Yes==, you can alias multiple columns at a time in the same query. Example ```sql= SELECT course_id AS "Course ID", exercise_id AS "Exercise ID" FROM bugs; ``` ### Distinct examining *distinct values* exist in a particular column `DISTINCT` is used to return ***unique values*** in the output. It filters out all duplicate values in the specified column(s). For instance, ```sql= SELECT tools FROM inventory; ``` ``` tools Hammer Nails Nails Nails ``` By adding `DISTINCT` before the column name, ```sql= SELECT DISTINCT tools FROM inventory; ``` ``` tools Hammer Nails ``` #### QA: Can we apply `DISTINCT` to a `SELECT` query with multiple columns? **Yes**, the `DISTINCT` clause can be applied to any valid `SELECT` query. It is important to note that `DISTINCT` will filter out all rows that are not unique in terms of all selected columns. Example Let’s assume that in the Codecademy database there is a table `bugs` which stores information about opened bug reports. It might have columns like `course_id`, `exercise_id`, `reported_by`, `reported_date`, `report_url`, etc. For the purpose of this example, let’s say that this is our table: ``` id course_id exercise_id reported_by 1 5 4 Tod 2 5 4 Alex 3 5 3 Roy 4 5 4 Roy 5 7 4 Alex 6 7 8 Tod 7 14 2 Alex 8 14 4 Tod 9 14 6 Tod 10 14 2 Roy ``` ```sql= SELECT DISTINCT reported_by FROM bugs; ``` ``` reported_by Alex Tod Roy ``` Our coworker would like to know in which exercises bugs have been reported. This gets trickier because now we have to query two columns: `course_id` and `exercise_id`. Let’s try to use the same approach as before: > SELECT DISTINCT course_id, exercise_id FROM bugs; ``` course_id exercise_id 14 2 5 4 14 4 14 6 5 3 7 4 7 8 ``` Is this the result we were hoping for? Yes. It is true that there are duplicated values in the `course_id` and `exercise_id`, but ==every row is unique== ### Where We can ***restrict*** our query results using the `WHERE` clause in order to obtain only the information we want. Following this format, the statement below filters the result set to only include *top rated movies* (IMDb ratings greater than 8): ```sql= SELECT * FROM movies WHERE imdb_rating > 8; ``` `WHERE` clause filters the result set to only include rows where the *following condition is true*. Comparison operators used with the WHERE clause are: ``` = equal to != not equal to > greater than < less than >= greater than or equal to <= less than or equal to ``` #### QA: Can we compare values of two columns in a `WHERE` clause? Answer **Yes**, within a `WHERE` clause you can compare the values of two columns. Example ```sql= /* This will return all rows where the value in the x column is greater than the y column value. */ SELECT x, y FROM coordinates WHERE x > y; ``` ### Like I `LIKE` can be a useful operator when you want to ==compare similar values==. The `movies` table contains two films with similar titles, ‘Se7en’ and ‘Seven’. How could we select all movies that **start** with ‘Se’ and **end** with ‘en’ and have exactly one character in the middle? ```sql= SELECT * FROM movies WHERE name LIKE 'Se_en'; ``` `LIKE` is a special operator ==used with the `WHERE` clause== to search for a specific pattern in a column. `Se_en` represents a pattern with a wildcard character. The `_` means you can ==substitute== any individual character here without breaking the pattern. The names `Seven` and `Se7en` both match this pattern. #### QA: Can we apply the LIKE operator to values other than TEXT? Answer **Yes**, you can apply the `LIKE` operator to numerical values as well. Whenever you use `LIKE` however, you must ***always wrap the pattern within a pair of quotations***, whether for matching a number or a string. Example ```sql= /* This will select movies where the id number starts with 2 and is followed by any two numbers. */ SELECT * FROM movies WHERE id LIKE '2__'; ``` ### Like II The percentage sign `%` is another wildcard character that can be used with `LIKE`. This statement below filters the result set to only include movies with names that ==begin== with the letter ‘A’: ```sql= SELECT * FROM movies WHERE name LIKE 'A%'; ``` `%` is a wildcard character that *matches zero or more missing letters* in the pattern. For example: - `A%` matches all movies with names that begin with letter ‘A’ - `%a` matches all movies that end with ‘a’ We can also use % both before and after a pattern: ```sql= SELECT * FROM movies WHERE name LIKE '%man%'; ``` Here, any movie that contains the word ‘man’ in its name will be returned in the result. `LIKE` is not case sensitive. ‘Batman’ and ‘Man of Steel’ will both appear in the result of the query above. ![](https://i.imgur.com/1vPhceq.png) #### QA: How do we search for patterns containing the actual characters “%” or “_”? Answer When searching for a pattern containing the specific characters `%` or `_`, we can utilize the *escape character `\`*, similarly to its use in Python. ```sql= Example /* In this pattern, we use an escape character before '%'. This will only match "%" and not be used like the wildcard character. This query will match any titles that end with ' 100%'. */ SELECT * FROM books WHERE title LIKE '% 100\%'; ``` ### Is Null Unknown values are indicated by `NULL`. It is ***not*** possible to test for `NULL` values with comparison operators, such as `=` and `!=`. Instead, we will have to use these operators: - `IS NULL` - `IS NOT NULL` To filter for all movies with an IMDb rating: ```sql= SELECT name FROM movies WHERE imdb_rating IS NOT NULL; ``` #### QA: When storing missing data, should I store them as NULL? Answer It can depend entirely on how you need the data to be stored and utilized. Let’s say that you have a table of **employee information**, which included their address. Say that we wanted to check all rows of this table and find where any addresses are missing. If we stored the addresses as `TEXT` values, we might choose to store all the missing values as either `''` or as `NULL`. If we stored the missing address values as an empty string `''` then these values are not `NULL`. Empty strings are seen as a ==string of length 0==. So, if we ran a query using ```sql= WHERE address IS NULL ``` it would not give us the rows with missing address values. We would have to check using ```sql= WHERE address = '' ``` With a table containing many different data types, it may be helpful and more *convenient to store any missing values in general* as just `NULL` so that we can utilize the `IS NULL` and `IS NOT NULL` operators. ### Between The `BETWEEN` operator is used in a `WHERE` clause to filter the result set within a certain range. It accepts two values that are either numbers, text or dates. For example, this statement filters the result set to only include movies with years from 1990 up to, and including 1999. ```sql= SELECT * FROM movies WHERE year BETWEEN 1990 AND 1999; ``` When the values are ==text==, `BETWEEN` filters the result set for <font color='red'>within the alphabetical range</font>. In this statement, `BETWEEN` filters the result set to only include movies with names that *begin with the letter ‘A’ up to, but ==not including== ones that begin with ‘J’*. ```sql= SELECT * FROM movies WHERE name BETWEEN 'A' AND 'J'; ``` <font color=red>However, if a movie has a name of simply ‘J’, it would actually match</font>. This is because `BETWEEN` goes up to the second value — up to ‘J’. So the movie named ‘J’ would be included in the result set but not ‘Jaws’. #### QA: When applying the BETWEEN operator on TEXT values, how are values compared? Answer In most programming languages, including SQLite and Python, TEXT or string values are compared based on their ==lexicographical ordering==, and when using the BETWEEN operator for a range of TEXT values in SQL, the values will be sorted in this way. If two words have different lengths, but match up to the last letter of the shorter word, the shorter word will appear first in the ordering. Example ``` A = "Alien" B = "Aliens" C = "Alike" /* Because A and B share the same sequence of characters up to the last character of A, which is shorter, A < B. Also, because "k" comes after "e" in the alphabet, C will come last in the ordering of these 3 words. A < B < C */ ``` ### And Sometimes we want to combine *multiple conditions* in a `WHERE` clause to make the result set more specific and useful. ```sql= SELECT * FROM movies WHERE year BETWEEN 1990 AND 1999 AND genre = 'romance'; ``` - `year BETWEEN 1990 AND 1999` is the 1st condition. - `genre = 'romance'` is the 2nd condition. - AND combines the two conditions. With AND, both conditions must be true for the row to be included in the result. #### QA: Is the AND used in BETWEEN the same as the AND operator used between multiple conditions? Answer No, although they may be assumed to be the same thing, the `AND` used with a `BETWEEN`, like `BETWEEN 1990 AND 1999` is not quite the same `AND` used when combining multiple conditions. When used in a `BETWEEN` statement, we are ==not== combining two separate conditions, but providing a <font color=red>range of values</font> to obtain the values within that range. However, we can easily rewrite a `BETWEEN` to one with two conditions, like these queries which would be identical. ```sql= SELECT * FROM movies WHERE year BETWEEN 1990 AND 1999; ``` ```sql= SELECT * FROM movies WHERE year >= 1990 AND year <= 1999; ``` ### Or Similar to `AND`, the `OR` operator can also be used to combine multiple conditions in `WHERE`, but there is a fundamental difference: `OR` operator displays a row if ==any== condition is true. ```sql= SELECT * FROM movies WHERE year > 2014 OR genre = 'action'; ``` - `year > 2014` is the 1st condition. - `genre = 'action'` is the 2nd condition. - `OR` combines the two conditions. With OR, if any of the conditions are true, then the row is added to the result. #### QA: Can we write conditions using both AND and OR? Answer Yes, queries can combine multiple conditions using `AND` and `OR` without a real limit to how many conditions you can combine. However, the more conditions you combine, the more specific the results will be and the more complex it can get. Example ```sql= /* This will select movies with id values from 10 to 20 inclusive, OR with id values from 50 to 60 inclusive. */ SELECT * FROM movies WHERE (id > 10 AND id < 20) OR (id > 50 AND id < 60); ``` ### Order By We can sort the results using `ORDER BY`, either ==alphabetically or numerically==. Sorting the results often makes the data more useful and easier to analyze. For example, if we want to sort everything by the movie’s title from A through Z: ```sql= SELECT * FROM movies ORDER BY name; ``` - `ORDER BY` is a clause that indicates you want to sort the result set by a particular column. - `name` is the specified column. Sometimes we want to sort things in a **decreasing order**. For example, if we want to select all of the well-received movies, sorted from highest to lowest by their year: ```sql= SELECT * FROM movies WHERE imdb_rating > 8 ORDER BY year DESC; ``` - `DESC` is a keyword used in `ORDER BY` to sort the results in descending order (high to low or Z-A). - `ASC` is a keyword used in `ORDER BY` to sort the results in ascending order (low to high or A-Z). The column that we `ORDER BY` doesn’t even have to be one of the columns that we’re displaying. Note: `ORDER BY` always goes after `WHERE` (if WHERE is present). #### QA: In SQL, can we apply ORDER BY with multiple columns? Yes, following the `ORDER BY`, you can list more than one column for which to order the data by. When ordering by more than one column, it will *first order the data on the first column*, then, keeping the previous column order, it will order on the next column, and so on. You can also specify ascending or descending order for each listed column. Example ```sql= /* This will order on the year, then order the names in reverse alphabetical order, preserving the order of the year column. */ SELECT year, name FROM movies ORDER BY year ASC, name DESC; ``` ### Limit For instance, imagine that we just want to see a few examples of records. ```sql= SELECT * FROM movies LIMIT 10; ``` `LIMIT` is a clause that lets you specify the maximum number of rows the result set will have. This saves space on our screen and makes our queries run faster. LIMIT always goes at the very end of the query. Also, it is <font color=red>not supported in all SQL databases</font>. #### Instructions Combining your knowledge of `LIMIT` and `ORDER BY`, write a query that returns the top 3 highest rated movies. Select all the columns. ![](https://i.imgur.com/lxNPe2O.png) #### QA: In SQL, what happens if we set a LIMIT value that is greater than the total number of rows? Answer If the number set in the `LIMIT` clause surpasses the number of rows available to select, then it will just <font color=red>exclude the remaining amount of rows</font> in the result set. ```sql= Example /* Say the table `name` has only 90 rows. Then, since 100 is greater than the number of rows, it will just return what rows are there. */ SELECT * FROM names LIMIT 100; ``` ### Case A `CASE` statement allows us to create different outputs (usually in the `SELECT` statement). It is SQL’s way of handling ==if-then logic==. Suppose we want to condense the ratings in `movies` to three levels: - If the rating is above 8, then it is Fantastic. - If the rating is above 6, then it is Poorly Received. - Else, Avoid at All Costs. ```sql= SELECT name, CASE WHEN imdb_rating > 8 THEN 'Fantastic' WHEN imdb_rating > 6 THEN 'Poorly Received' ELSE 'Avoid at All Costs' END FROM movies; ``` ![](https://i.imgur.com/BaWovH2.png) - Each `WHEN` tests a condition and the following `THEN` gives us the string if the condition is true. - The `ELSE` gives us the string if all the above conditions are false. - The `CASE` statement must end with `END`. In the result, you have to scroll right because the column name is very long. To shorten it, we can ==rename the column== to ‘Review’ using `AS`: ```sql= SELECT name, CASE WHEN imdb_rating > 8 THEN 'Fantastic' WHEN imdb_rating > 6 THEN 'Poorly Received' ELSE 'Avoid at All Costs' END AS 'Review' FROM movies; ``` ![](https://i.imgur.com/uI52rc8.png) #### Instructions Select the `name` column and use a `CASE` statement to create the second column that is: - ‘Chill’ if `genre = 'romance'` - ‘Chill’ if `genre = 'comedy'` - ‘Intense’ in all other cases Optional: Rename the whole `CASE` statement to ‘Mood’ using `AS`. ![](https://i.imgur.com/4WGJD2H.png) #### QA: For a CASE statement, do all values provided by THEN have to match a single data type? Answer No, for `CASE` statements, the `THEN` values do not have to return only a single type of value. In fact, you can have each `THEN` in a single `CASE` statement return different value types such as `TEXT`, `REAL`, and `INTEGER`. Example ```sql= SELECT CASE WHEN condition1 THEN "text" WHEN condition2 THEN 100 WHEN condition3 THEN 3.14 END AS 'example' FROM table; ``` ## AGGREGATE FUNCTIONS ### Introduction Calculations performed on multiple rows of a table are called **aggregates**. In this lesson, we have given you a table named `fake_apps` which is made up of fake mobile applications data. Here is a quick preview of some important aggregates that we will cover in the next five exercises: - `COUNT()`: count the number of rows - `SUM()`: the sum of the values in a column - `MAX()`/`MIN()`: the largest/smallest value - `AVG()`: the average of the values in a column - `ROUND()`: round the values in the column ### Count The fastest way to calculate how many rows are in a table is to use the `COUNT()` function. `COUNT()` is a function that takes the name of a column as an argument and counts the number of non-empty values in that column. ```sql= SELECT COUNT(*) FROM table_name; ``` Here, we want to count every row, so we pass `*` as an argument inside the parenthesis. #### Instructions 1. Let’s count how many apps are in the table. In the code editor, run: ```sql= SELECT COUNT(*) FROM fake_apps; ``` ![](https://i.imgur.com/kzIvhp0.png) 2. Add a `WHERE` clause in the previous query to count how many free apps are in the table. ![](https://i.imgur.com/ApRDxAN.png) #### QA: Does COUNT() include duplicate values of a column? Answer Yes, when using the `COUNT()` function on a column in SQL, it will include duplicate values by default. It essentially counts all rows for which there is a value in the column. If you wanted to count only the unique values in a column, then you can utilize the `DISTINCT` clause within the `COUNT()` function. Example ```sql= /* This will return 22, the number of distinct category values. */ SELECT COUNT(DISTINCT category) FROM fake_apps; ``` ### Sum SQL makes it easy to add all values in a particular column using `SUM()`. `SUM()` is a function that takes the name of a column as an argument and returns the sum of all the values in that column. What is the total number of downloads for all of the apps combined? ```sql= SELECT SUM(downloads) FROM fake_apps; ``` This adds all values in the `downloads` column. ### Max / Min The `MAX()` and `MIN()` functions return the highest and lowest values in a column, respectively. How many downloads does the most popular app have? ```sql= SELECT MAX(downloads) FROM fake_apps; ``` ![](https://i.imgur.com/feTPsFO.png) The most popular app has 31,090 downloads! `MAX()` takes the name of a column as an argument and returns the largest value in that column. Here, we returned the largest value in the `downloads` column. `MIN()` works the same way but it does the exact opposite; it returns the smallest value. #### Instruction Delete the previous query. Write a new query that returns the price of the most expensive app. ![](https://i.imgur.com/hyAqWdD.png) #### QA: If multiple rows have the minimum or maximum value, which one is returned when using MAX/MIN? Answer Typically, when you have more than one row that contains the minimum or maximum value in a column, the ==topmost row== containing that value will be returned in the result. For example, if the table contained multiple rows with the minimum price of 0.0, then the result of a query with `MIN(price) `will choose the topmost row from the table that had this price value. Example ```sql= /* This should return the siliconphase app, because it was the topmost row that had the minimum price value of the column. */ SELECT id, name, MIN(price) FROM fake_apps; ``` ### Average SQL uses the `AVG()` function to quickly calculate the average value of a particular column. The statement below returns the average number of downloads for an app in our database: ```sql= SELECT AVG(downloads) FROM fake_apps; ``` ![](https://i.imgur.com/vZFOWob.png) The `AVG()` function works by taking a column name as an argument and returns the average value for that column. #### QA: In SQL, how can we get the average of only the unique values of a column? Answer To run the `AVG()` function on a column such that it only averages the unique values in the column, we could use the `DISTINCT` clause right before the column name. Example ```sql= /* Returns 2.02365 */ SELECT AVG(price) FROM fake_apps; /* Returns 4.15833.. */ SELECT AVG(DISTINCT price) FROM fake_apps; ``` ### Round By default, SQL tries to be as precise as possible without rounding. We can make the result table easier to read using the `ROUND()` function. `ROUND()` function takes two arguments inside the parenthesis: 1. a column name 2. an integer It rounds the values in the column to the number of decimal places specified by the integer. ```sql= SELECT ROUND(price, 0) FROM fake_apps; ``` Here, we pass the column `price` and integer `0` as arguments. SQL rounds the values in the column to 0 decimal places in the output. #### Instructions 1. Let’s return the name column and a rounded price column. In the code editor, type: ```sql= SELECT name, ROUND(price, 0) FROM fake_apps; ``` ![](https://i.imgur.com/ZwoKj9N.png) 2. we were able to get the average price of an app ($2.02365) using this query: ```sql= SELECT AVG(price) FROM fake_apps; ``` Now, let’s edit this query so that it rounds this result to 2 decimal places. ![](https://i.imgur.com/8C1HxbF.png) #### QA: Does the ROUND() function round up? Answer When using the `ROUND()` function, you can provide a second argument, which is the precision, or number of decimal places to round the number on. In SQLite, rounding is done by rounding up if the next decimal value is 5, and rounds down if the value is less than 5. For example, ```sql= /* This will result in 4.0 */ SELECT ROUND(3.5, 0); /* This will result in 6.4 */ SELECT ROUND(6.42, 1); /* This will result in 6.0 */ SELECT ROUND(6.42, 0); ``` ### Group By I Oftentimes, we will want to calculate an aggregate for data with certain characteristics. For instance, we might want to know the mean IMDb ratings for all movies ==each year==. We could calculate each number by a series of queries with different WHERE statements, like so: ```sql= SELECT AVG(imdb_rating) FROM movies WHERE year = 1999; SELECT AVG(imdb_rating) FROM movies WHERE year = 2000; SELECT AVG(imdb_rating) FROM movies WHERE year = 2001; ``` and so on. We can use `GROUP BY` to do this in a single step: ```sql= SELECT year, AVG(imdb_rating) FROM movies GROUP BY year ORDER BY year; ``` `GROUP BY` is a clause in SQL that is used with aggregate functions. It is used in collaboration with the `SELECT` statement to arrange identical data into groups. The `GROUP BY` statement comes after any `WHERE` statements, but before `ORDER BY` or `LIMIT`. #### Instructions 1. In the code editor, type: ```sql= SELECT price, COUNT(*) FROM fake_apps GROUP BY price; ``` ![](https://i.imgur.com/oJq8bvF.png) Here, our aggregate function is `COUNT()` and we arranged `price` into groups. What do you expect the result to be? 2. In the previous query, add a `WHERE` clause to count the total number of apps that have been downloaded more than 20,000 times, at each price. ![](https://i.imgur.com/Imni8JM.png) 3. Write a new query that calculates the total number of downloads for each category. Select `category` and `SUM(downloads)`. ![](https://i.imgur.com/BqpreOq.png) #### QA: When using the GROUP BY clause, do we always have to group by one of the selected columns listed after SELECT? Answer No, you can `GROUP BY` a column that was ==not included in the SELECT statement==. For example, this query does not list the price column in the SELECT, but it does group the data by that column. ```sql= SELECT name, downloads FROM fake_apps GROUP BY price; ``` However, usually we do include the grouped by column in the `SELECT` for the sake of clarity, so that it’s easier to see what rows belong to which group. ### Group By II Sometimes, we want to `GROUP BY` a calculation done on a column. For instance, we might want to know how many movies have IMDb ratings that round to 1, 2, 3, 4, 5. We could do this using the following syntax: ```sql= SELECT ROUND(imdb_rating), COUNT(name) FROM movies GROUP BY ROUND(imdb_rating) ORDER BY ROUND(imdb_rating); ``` However, this query may be time-consuming to write and more prone to error. SQL lets us use <font color=red>column reference(s)</font> in our `GROUP BY` that will make our lives easier. - 1 is the first column selected - 2 is the second column selected - 3 is the third column selected and so on. The following query is equivalent to the one above: ```sql= SELECT ROUND(imdb_rating), COUNT(name) FROM movies GROUP BY 1 ORDER BY 1; ``` Here, the `1` refers to the first column in our `SELECT` statement, `ROUND(imdb_rating)`. #### Instructions 1. Suppose we have the query below: ```sql= SELECT category, price, AVG(downloads) FROM fake_apps GROUP BY category, price; ``` ![](https://i.imgur.com/yJ9V6xP.png) Write the exact query, but use column reference numbers instead of column names after `GROUP BY`. ```sql= SELECT category, price, AVG(downloads) FROM fake_apps GROUP BY 1, 2; ``` #### QA: Do column references have to follow the order the columns are listed in the SELECT? Answer No, once you list the columns after the `SELECT`, they can be referenced by the order they appeared, starting from 1 for the first listed column. You are not limited to referencing them in the exact order they were listed, like `GROUP BY 1, 2, 3` You can freely use the references in any order, like you would normally without using references. `GROUP BY 3, 1, 2` ### Having In addition to being able to group data using `GROUP BY`, SQL also allows you to ==filter which groups to include and which to exclude==. For instance, imagine that we want to see how many movies of different genres were produced each year, but we only care about years and genres with at least 10 movies. We can’t use `WHERE` here because we don’t want to filter the rows; we want to ==filter groups==. This is where `HAVING` comes in. `HAVING` is very similar to `WHERE`. In fact, all types of `WHERE` clauses you learned about thus far can be used with `HAVING`. We can use the following for the problem: ```sql= SELECT year, genre, COUNT(name) FROM movies GROUP BY 1, 2 HAVING COUNT(name) > 10; ``` - When we want to limit the results of a query based on values of the individual rows, use `WHERE`. - When we want to limit the results of a query based on an aggregate property, use `HAVING`. `HAVING` statement always comes after `GROUP BY`, but before `ORDER BY` and `LIMIT`. #### Instructions Suppose we have the query below: ```sql= SELECT price, ROUND(AVG(downloads)), COUNT(*) FROM fake_apps GROUP BY price; ``` It returns the average downloads (rounded) and the number of apps – at each price point. However, certain price points don’t have very many apps, so their average downloads are less meaningful. Add a `HAVING` clause to restrict the query to price points that have more than 10 apps. ```sql= SELECT price, ROUND(AVG(downloads)), COUNT(*) FROM fake_apps GROUP BY price HAVING COUNT(price) > 10; ``` ![](https://i.imgur.com/XZMXieH.png) #### QA: Can a WHERE clause be applied with a HAVING statement in the same query? Answer Yes, you can absolutely apply a `WHERE` clause in a query that also utilizes a `HAVING` statement. When you apply a `WHERE` clause in the same query, it must ==always be before any `GROUP BY`==, which in turn must be before any `HAVING`. As a result, the data is essentially filtered on the `WHERE` condition first. Then, from this filtered data, it is grouped by specified columns and then further filtered based on the `HAVING` condition. Example ```sql= /* This will first filter the movies with a box_office > 500000. Then, it will group those results by genre, and finally restrict the query to genres that have more than 5 movies. */ SELECT genre, ROUND(AVG(score)) FROM movies WHERE box_office > 500000 GROUP BY genre HAVING COUNT(*) > 5; ``` ## MULTIPLE TABLES ### Introduction In order to efficiently store data, we often spread related information across multiple tables. For instance, imagine that we’re running a magazine company where users can have different types of subscriptions to different products. Different subscriptions might have many different properties. Each customer would also have lots of associated information. So instead, we can split our data into three tables: 1. `orders` would contain just the information necessary to describe what was ordered: - `order_id`, `customer_id`, `subscription_id`, `purchase_date` 2. `subscriptions` would contain the information to describe each type of subscription: - `subscription_id`, `description`, `price_per_month`, `subscription_length` 3. `customers` would contain the information for each customer: - `customer_id`, `customer_name`, `address` In this lesson, we’ll learn the SQL commands that will help us work with data that is stored in multiple tables. ### Combining Tables with SQL Combining tables manually is time-consuming. Luckily, SQL gives us an easy sequence for this: it’s called a `JOIN`. If we want to combine `orders` and `customers`, we would type: ```sql= SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id; ``` Let’s break down this command: 1. The third line uses `JOIN` to say that we want to combine information from `orders` with customers. 2. The fourth line tells us ==how to combine the two tables==. We want to match `orders` table’s `customer_id` column with `customers` table’s `customer_id` column. Because column names are often repeated across multiple tables, we use the syntax `table_name.column_name` to be sure that our requests for columns are unambiguous. In our example, we use this syntax in the `ON` statement, but we will also use it in the `SELECT` or any other statement where we refer to column names. For example: Instead of selecting all the columns using `*`, if we only wanted to select orders table’s `order_id` column and customers table’s `customer_name` column, we could use the following query: ```sql= SELECT orders.order_id, customers.customer_name FROM orders JOIN customers ON orders.customer_id = customers.customer_id; ``` ![](https://i.imgur.com/0zxDRMC.png) #### Instructions Add a second query after your first one that only selects rows from the join where `description` is equal to ‘Fashion Magazine’. ```sql= SELECT * FROM orders JOIN subscriptions ON orders.subscription_id = subscriptions.subscription_id WHERE subscriptions.description = 'Fashion Magazine'; ``` ![](https://i.imgur.com/1drQ37k.png) #### QA: When doing an INNER JOIN, are columns matched on NULL values? Answer No, when you have ``NULL`` values in a column, these are never matched to other `NULL` values. This is because `NULL` signifies the absence of any value, and cannot be compared as they will never equal anything. Doing say `NULL = NULL` results in False. Let’s take for example the animation given in the exercise, which shows how `INNER JOIN` works. Let’s say that an additional row was added to each table, with `NULL` in the `C2` column, such that they become Left table: ```sql= C1, C2 A, B Q, W X, Y T, NULL ``` Right table: ```sql= C2, C3 B, C E, R Y, Z NULL, V ``` If we inner joined these tables the same way, we would end up with the same result, because `NULL` values are not matched. ```sql= C1, C2, C3 A, B, C X, Y, Z ``` ### Inner Joins Let’s revisit how we joined `orders` and `customers`. For every possible value of `customer_id` in orders, there was a corresponding row of customers with the same `customer_id`. What if that wasn’t true? ![](https://i.imgur.com/1UGYfys.png) ![](https://i.imgur.com/eaXIBVA.png) The first and last rows have matching values of `c2`. The middle rows do not match. The final result has all values from the first and last rows but does ==not== include the <font color=red>non-matching</font> middle row. #### Instruction Join `newspaper` table and `online` table on their `id` columns (the unique ID of the subscriber). How many rows are in this table? ```sql= SELECT COUNT(*) FROM newspaper; SELECT COUNT(*) FROM online; SELECT COUNT(*) FROM newspaper JOIN online ON newspaper.id = online.id; ``` ![](https://i.imgur.com/2oJfQ5Z.png) ### Left Joins What if we want to combine two tables and keep some of the un-matched rows? SQL lets us do this through a command called `LEFT JOIN`. A left join will keep all rows from the first table, regardless of whether there is a matching row in the second table. ![](https://i.imgur.com/jKjroIw.png) ![](https://i.imgur.com/fmBwckN.png) ```sql= SELECT * FROM table1 LEFT JOIN table2 ON table1.c2 = table2.c2; ``` 1. The first line selects all columns from both tables. 2. The second line selects `table1` (the “left” table). 3. The third line performs a `LEFT JOIN` on `table2` (the “right” table). 4. The fourth line tells SQL how to perform the join (by looking for matching values in column `c2`). #### Instruction In order to find which users do <font color=red>not subscribe</font> to the online edition, we need to add a `WHERE` clause. Add a second query after your first one that adds the following `WHERE` clause and condition: `WHERE online.id IS NULL` This will select rows where there was no corresponding row from the `online` table. ```sql= SELECT * FROM newspaper LEFT JOIN online ON newspaper.id = online.id WHERE online.id IS NULL; ``` ![](https://i.imgur.com/3bMtMUL.png) ### Primary Key vs Foreign Key Let’s return to our example of the magazine subscriptions. Recall that we had three tables: `orders`, `subscriptions`, and `customers`. Each of these tables has a column that <font color=red>uniquely identifies </font> each row of that table: - `order_id` for `orders` - `subscription_id` for `subscriptions` - `customer_id` for `customers` These special columns are called ==primary keys==. Primary keys have a few requirements: - None of the values can be `NULL`. - Each value must be ==unique== - A table <font color=red>==can not== have more than one primary key column</font>. Let’s reexamine the `orders` table: ![](https://i.imgur.com/t2HttdQ.png) Note that `customer_id` (the primary key for `customers`) and `subscription_id` (the primary key for `subscriptions`) both appear in this. When the primary key for one table appears in a different table, it is called a ==foreign key==. So `customer_id` is a primary key when it appears in `customers`, but a <font color=red>foreign key when it appears in `orders` </font>. In this example, our primary keys all had somewhat descriptive names. Generally, the primary key will just be called `id`. Foreign keys will have more descriptive names. Why is this important? The most common types of joins will be *joining a foreign key from one table with the primary key from another table*. For instance, when we join `orders` and `customers`, we join on `customer_id`, which is a foreign key in `orders` and the primary key in `customers`. #### Instruction Suppose Columbia University has two tables in their database: - The `classes` table contains information on the classes that the school offers. Its primary key is id. - The `students` table contains information on all students in the school. Its primary key is `id`. It contains the foreign key `class_id`, which corresponds to the primary key of `classes`. Perform an ==inner join== of classes and students using the primary and foreign keys described above, and select all the columns. ```sql= SELECT * FROM classes JOIN students ON classes.id = students.class_id; ``` ![](https://i.imgur.com/liWEGHx.png) #### QA: Is it possible for a table to have more than one unique identifier column, like an id column? Answer Yes, it is possible for a table to have more than one column which can uniquely identify a row of data. A column that can uniquely identify a record of data is known as a `"Candidate Key"`. Tables can have multiple `"Candidate Key"`s, each of which could potentially be the `"Primary Key"`, but there must only be one `"Primary Key"` per table. Usually, the column chosen as the `"Primary Key"` follows the naming convention like `customer_id` or `product_id`. For example, say that we had a table of ==employee records==, with the columns `employee_id` and `phone_number`. Every employee has a unique `employee_id` value, and a unique `phone_number` value. Both of these columns can be unique identifiers for a row, so they are `"Candidate keys"`, but the `"Primary Key"` would most likely be set to `employee_id`. ### Cross Join Sometimes, we just want to *combine all rows of one table with all rows of another table*. For instance, if we had a table of `shirts` and a table of `pants`, we might want to know all the possible combinations to create different outfits. Our code might look like this: ```sql= SELECT shirts.shirt_color, pants.pants_color FROM shirts CROSS JOIN pants; ``` - The first two lines select the columns `shirt_color` and `pants_color`. - The third line pulls data from the table `shirts`. - The fourth line performs a `CROSS JOIN` with pants. A more common usage of `CROSS JOIN` is when we need to compare each row of a table to a list of values. Let’s return to our `newspaper` subscriptions. This table contains two columns that we haven’t discussed yet: - `start_month`: the first month where the customer subscribed to the print newspaper (i.e., `2` for February) - `end_month`: the final month where the customer subscribed to the print newspaper Suppose we wanted to know how many users were subscribed during ==each month== of the year. For each month (`1`, `2`, `3`) we would need to know if a user was subscribed. Follow the steps below to see how we can use a `CROSS JOIN` to solve this proble #### Instruction ##### 1 Let’s start by counting the number of customers who were subscribed to the `newspaper` during March. Use `COUNT(*)` to count the number of rows and a `WHERE` clause to restrict to two conditions: - `start_month <= 3` - `end_month >= 3` ```sql= SELECT COUNT(*) FROM newspaper WHERE (start_month <=3 ) AND (end_month >=3); ``` ##### 2 Our database contains another table called `months` which contains the numbers between 1 and 12. Select all columns from the cross join of `newspaper` and `months`. ```sql= SELECT * FROM newspaper CROSS JOIN months; ``` ![](https://i.imgur.com/EYCu4Ew.png) ##### 3 Create a third query where you add a `WHERE` statement to your cross join to restrict to two conditions: - `start_month <= month` - `end_month >= month` This will select all months where a user was subscribed ```sql= SELECT * FROM newspaper CROSS JOIN months WHERE (start_month <= month) AND (end_month >= month ); ``` ![](https://i.imgur.com/XeQU4IO.png) ##### 4 Create a final query where you aggregate over each month to count the number of subscribers. Fill in the blanks in the following query: ```sql= SELECT month, COUNT(*) FROM ________ CROSS JOIN ________ WHERE ________ AND ________ GROUP BY ________; ``` ```sql= SELECT month, COUNT(*) FROM newspaper CROSS JOIN months WHERE (start_month <= month) AND (end_month >= month) GROUP BY month; ``` ![](https://i.imgur.com/OaFxNUI.png) ### Union Sometimes we just want to ==stack== one dataset on top of the other. Well, the `UNION` operator allows us to do that. Suppose we have two tables and they have the same columns. `table1`: ![](https://i.imgur.com/j5unpI7.png) `table2`: ![](https://i.imgur.com/WHce5Ns.png) If we combine these two with `UNION`: ```sql= SELECT * FROM table1 UNION SELECT * FROM table2; ``` The result would be: ![](https://i.imgur.com/10jzFU4.png) SQL has strict rules for appending data: - Tables must have the <font color=red>same number of columns</font>. - The columns must have the <font color=red>same data types in the same order</font> as the first table. #### Instructions Let’s return to our `newspaper` and `online` subscriptions. We’d like to create one big table with both sets of data. Use `UNION` to stack the `newspaper` table on top of the `online` table. ```sql= SELECT * FROM newspaper UNION SELECT * FROM online; ``` ![](https://i.imgur.com/dY7lAYM.png) #### QA: What happens if the tables we perform the UNION operator on have duplicate rows? Answer When you combine tables with `UNION`, <font color=red>duplicate rows</font> will be ==excluded==. If, however, you wanted to include duplicates, certain versions of SQL provides the `UNION ALL` operator. ### With Often times, we want to combine two tables, but one of the tables is the result of another calculation. Let’s return to our magazine order example. For instance, they might want to know *how many magazines* ==each customer== subscribes to. We can easily calculate this using our `orders` table: ```sql= SELECT customer_id, COUNT(subscription_id) AS 'subscriptions' FROM orders GROUP BY customer_id; ``` ![](https://i.imgur.com/lnZLawy.png) This query is good, but a `customer_id` isn’t terribly useful for our marketing department, they probably want to know the ==customer’s name==. We want to be able to join the results of this query with our `customers` table, which will tell us the name of each customer. We can do this by using a `WITH` clause. ```sql= WITH previous_results AS ( SELECT ... ... ... ... ) SELECT * FROM previous_results JOIN customers ON _____ = _____; ``` - The `WITH` statement allows us to perform a ==separate query== (such as aggregating customer’s subscriptions) - `previous_results` is the alias that we will use to reference any columns from the query inside of the `WITH` clause - We can then go on to do whatever we want with this <font color=red>temporary table</font> (such as join the temporary table with another table) Essentially, we are putting a ==whole first query inside the parentheses `()`== and giving it a name. After that, we can use this name as if it’s a table and write a new query using the first query. #### Instruction Place the whole query below into a `WITH` statement, inside parentheses `()`, and give it name `previous_query`: ```sql= SELECT customer_id, COUNT(subscription_id) AS 'subscriptions' FROM orders GROUP BY customer_id ``` Join the temporary table `previous_query` with `customers` table and select the following columns: - `customers.customer_name` - `previous_query.subscriptions` ```sql= WITH previous_query AS ( SELECT customer_id, COUNT(subscription_id) AS 'subscriptions' FROM orders GROUP BY customer_id ) SELECT customers.customer_name, previous_query.subscriptions FROM previous_query JOIN customers ON previous_query.customer_id = customers.customer_id; ``` ![](https://i.imgur.com/P2dfa6l.png) #### QA: Can we use WITH for more than one nested query in SQL? Answer Yes, you can use `WITH` for more than one nested query. You can do so by listing each query using commas after the `WITH`. For example, ```sql= WITH query1 AS (SELECT column1 FROM table1 WHERE condition1), query2 AS (SELECT column2 FROM table2 WHERE condition2) … ``` ### Multiple Tables cheatsheet ![](https://i.imgur.com/NvXaA6T.png) ![](https://i.imgur.com/PSxX5pY.png)