# [Learn SQL](https://www.codecademy.com/learn/learn-sql)
> Lee Tsung-Tang
> ###### tags: `SQL` `database`
[TOC]
{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}
## Manipulation
### Introduction to SQL
The statements covered in this course use SQLite Relational Database Management System (RDBMS). You can also access a glossary of all the SQL commands taught in this course.
### Relational Databases
```SQL=
SELECT * FROM celebs;
```
A `table` is a collection of data organized into rows and columns. Tables are sometimes referred to as relations. Here the table is ==celebs==.
A `column` is a set of data values of a particular type. Here, ==id, name, and age== are the columns.
A `row` is a single record in a table. The first row in the celebs table has:
- An `id` of 1
- A `name` of Justin Bieber
- An `age` of 22
All data stored in a relational database is of a certain data type. Some of the most common data types are:
- `INTEGER`, a positive or negative whole number
- `TEXT`, a text string
- `DATE`, the date formatted as YYYY-MM-DD
- `REAL`, a decimal value
### Statements
The code below is a SQL statement. A statement is text that the database recognizes as a valid command. Statements always end in a semicolon `;`.
```sql=
CREATE TABLE table_name (
column_1 data_type,
column_2 data_type,
column_3 data_type
);
```
1. `CREATE TABLE` is a *clause*. Clauses perform specific tasks in SQL. By convention, clauses are written in capital letters. Clauses can also be referred to as commands.
<!-- 2. `table_name` refers to the name of the table that the command is applied to.
(column_1 data_type, column_2 data_type, column_3 data_type) is a parameter. A parameter is a list of columns, data types, or values that are passed to a clause as an argument. Here, the parameter is a list of column names and the associated data type.
The structure of SQL statements vary. The number of lines used does not matter. A statement can be written all on one line, or split up across multiple lines if it makes it easier to read. In this course, you will become familiar with the structure of common statements. -->
... 中間待補
### Update
The `UPDATE` statement ==edits a row in a table==. You can use the `UPDATE` statement when you want to change existing records.
`id` =4 的 `twitter_handle` 改為 `@taylorswift13`
```sql=
UPDATE celebs
SET twitter_handle = '@taylorswift13'
WHERE id = 4;
```
1. `UPDATE` is a clause that edits a row in the table.
2. `celebs` is the name of the table.
3. `SET` is a clause that indicates the column to edit.
- `twitter_handle` is the name of the column that is going to be updated
- `@taylorswift13` is the new value that is going to be inserted into the `twitter_handle` column.
4. `WHERE` is a clause that indicates ==which row(s)== to update with the new column value. Here the row with a 4 in the id column is the row that will have the `twitter_handle` updated to `@taylorswift13`.
#### Instructions

#### QA: How is `ALTER` different from `UPDATE`?
**Answer**
Although similar in the sense that both statements will modify a table, these statements are quite different.
The `ALTER` statement is used to ==modify columns==. With ALTER, you can add columns, remove them, or even modify them.
The `UPDATE` statement is used to ==modify rows==. However, `UPDATE` can only update a row, and ==cannot== remove or add rows.
### Delete
The `DELETE FROM` statement deletes one or more rows from a table.
```sql=
DELETE FROM celebs
WHERE twitter_handle IS NULL;
```
1. `DELETE FROM` is a clause that lets you delete rows from a table.
2. `celebs` is the name of the table we want to delete rows from.
3. `WHERE` is a clause that lets you select which rows you want to delete. Here we want to delete all of the rows where the `twitter_handle` column `IS NULL`.
4. `IS NULL` is a condition in SQL that returns true when the value is NULL and false otherwise.
#### Instructions

#### QA: What if we only want to delete a specific number of rows?
Answer
To delete only a specific number of rows, you can utilize the `LIMIT` statement. The value provided for `LIMIT` will be how many rows to affect.
For example, this statement will only delete the first 5 rows that match the condition,
```sql=
DELETE FROM table
WHERE condition
LIMIT 5;
```
### Constraints
*Constraints* that ==add information== about how a column can be used are invoked after specifying the data type for a column. They can be used to tell the database to **reject** inserted data that does not adhere to a certain restriction. The statement below sets constraints on the `celebs` table.
```sql=
CREATE TABLE celebs (
id INTEGER PRIMARY KEY,
name TEXT UNIQUE,
date_of_birth TEXT NOT NULL,
date_of_death TEXT DEFAULT 'Not Applicable'
);
```
1. `PRIMARY KEY` columns can be used to ==uniquely identify the row==. Attempts to insert a row with an identical value to a row ==already== in the table will result in a *constraint violation* which will not allow you to insert the new row.
2. `UNIQUE` columns have a *different value for every row*. This is similar to `PRIMARY KEY` except a table can have many different `UNIQUE` columns.
3. `NOT NULL` columns must have a value. Attempts to insert a row without a value for a `NOT NULL` column will result in a constraint violation and the new row will not be inserted.
4. `DEFAULT` columns take an additional argument that will be the **assumed** value for an inserted row if the new row does not specify a value for that column.
#### Instructions

#### QA: What are some reasons to apply constraints to a table?
Answer
1. ==prevent invalid data== in the table. This is very important, because invalid data can cause issues and unexpected results from calculations.
2. Constraints can let us ==prevent missing data==, which is usually filled as `NULL` within the table. Instead of having missing values set to NULL, we can set constraints so that the missing values are given some default value instead, like `0`.
3. Another important reason to add a constraint is for ==uniqueness==, usually in the form of values like the `id`, or identifier column. By using a constraint like the `PRIMARY KEY`, we can ensure that every row has their own unique id value.
### Review
SQL is a programming language designed to manipulate and manage data stored in relational databases.
- A *relational* database is a database that organizes information into one or more tables.
- A *table* is a collection of data organized into rows and columns.
A *statement* is a string of characters that the database recognizes as a valid command.
- `CREATE TABLE` creates a new table.
- `INSERT INTO` adds a new row to a table.
- `SELECT` queries data from a table.
- `ALTER TABLE` changes an existing table.
- `UPDATE` edits a row in a table.
- `DELETE FROM` deletes rows from a table.
*Constraints* add information about how a column can be used
#### cheat sheet



https://codecademy-content.s3.amazonaws.com/cheatsheets/learn-sql/learn-sql-manipulation.pdf
https://www.codecademy.com/learn/learn-sql/modules/learn-sql-manipulation/reference
#### QA: Are there any other commonly used SQL commands?
Answer
1. One such command is `DROP TABLE`, which you can use to permanently *remove a table* from a database.
2. Other commands, such as `ANALYZE`, which is used to *obtain statistics* about a table.
## QUERIES
### Select
We can select individual columns by their names (separated by a comma):
```sql=
SELECT column1, column2
FROM table_name;
```
#### Instructions

#### QA: Do SQL clauses always have to be entirely capitalized?
Answer
==No==, SQLite, which Codecademy uses, is case-insensitive when it comes to clauses like SELECT and FROM which can be cased in any way. This is different from other programming languages such as Python where casing is quite important.
Example
```sql=
/* Both of the following queries will return the same result. */
SELECT * FROM table;
select * from table;
```
### As
```sql=
SELECT name AS 'Titles'
FROM movies;
```
`AS` is a keyword in SQL that allows you to ==rename a column or table== using an alias. The new name can be anything you want as long as you put it inside of single quotes. Here we renamed the `name` column as `Titles`.
Some important things to note:
- it’s best practice to surround your aliases with ==single quotes==.
- When using `AS`, the columns are not being renamed in the table. The aliases only appear in the result.
#### QA: Can we alias multiple columns in a single query?
Answer
==Yes==, you can alias multiple columns at a time in the same query.
Example
```sql=
SELECT course_id AS "Course ID", exercise_id AS "Exercise ID"
FROM bugs;
```
### Distinct
examining *distinct values* exist in a particular column
`DISTINCT` is used to return ***unique values*** in the output. It filters out all duplicate values in the specified column(s).
For instance,
```sql=
SELECT tools
FROM inventory;
```
```
tools
Hammer
Nails
Nails
Nails
```
By adding `DISTINCT` before the column name,
```sql=
SELECT DISTINCT tools
FROM inventory;
```
```
tools
Hammer
Nails
```
#### QA: Can we apply `DISTINCT` to a `SELECT` query with multiple columns?
**Yes**, the `DISTINCT` clause can be applied to any valid `SELECT` query. It is important to note that `DISTINCT` will filter out all rows that are not unique in terms of all selected columns.
Example
Let’s assume that in the Codecademy database there is a table `bugs` which stores information about opened bug reports. It might have columns like `course_id`, `exercise_id`, `reported_by`, `reported_date`, `report_url`, etc. For the purpose of this example, let’s say that this is our table:
```
id course_id exercise_id reported_by
1 5 4 Tod
2 5 4 Alex
3 5 3 Roy
4 5 4 Roy
5 7 4 Alex
6 7 8 Tod
7 14 2 Alex
8 14 4 Tod
9 14 6 Tod
10 14 2 Roy
```
```sql=
SELECT DISTINCT reported_by FROM bugs;
```
```
reported_by
Alex
Tod
Roy
```
Our coworker would like to know in which exercises bugs have been reported. This gets trickier because now we have to query two columns: `course_id` and `exercise_id`. Let’s try to use the same approach as before:
> SELECT DISTINCT course_id, exercise_id FROM bugs;
```
course_id exercise_id
14 2
5 4
14 4
14 6
5 3
7 4
7 8
```
Is this the result we were hoping for? Yes. It is true that there are duplicated values in the `course_id` and `exercise_id`, but ==every row is unique==
### Where
We can ***restrict*** our query results using the `WHERE` clause in order to obtain only the information we want.
Following this format, the statement below filters the result set to only include *top rated movies* (IMDb ratings greater than 8):
```sql=
SELECT *
FROM movies
WHERE imdb_rating > 8;
```
`WHERE` clause filters the result set to only include rows where the *following condition is true*.
Comparison operators used with the WHERE clause are:
```
= equal to
!= not equal to
> greater than
< less than
>= greater than or equal to
<= less than or equal to
```
#### QA: Can we compare values of two columns in a `WHERE` clause?
Answer
**Yes**, within a `WHERE` clause you can compare the values of two columns.
Example
```sql=
/*
This will return all rows where the value in the
x column is greater than the y column value.
*/
SELECT x, y
FROM coordinates
WHERE x > y;
```
### Like I
`LIKE` can be a useful operator when you want to ==compare similar values==.
The `movies` table contains two films with similar titles, ‘Se7en’ and ‘Seven’.
How could we select all movies that **start** with ‘Se’ and **end** with ‘en’ and have exactly one character in the middle?
```sql=
SELECT *
FROM movies
WHERE name LIKE 'Se_en';
```
`LIKE` is a special operator ==used with the `WHERE` clause== to search for a specific pattern in a column.
`Se_en` represents a pattern with a wildcard character.
The `_` means you can ==substitute== any individual character here without breaking the pattern. The names `Seven` and `Se7en` both match this pattern.
#### QA: Can we apply the LIKE operator to values other than TEXT?
Answer
**Yes**, you can apply the `LIKE` operator to numerical values as well.
Whenever you use `LIKE` however, you must ***always wrap the pattern within a pair of quotations***, whether for matching a number or a string.
Example
```sql=
/*
This will select movies where the id number
starts with 2 and is followed by any two numbers.
*/
SELECT *
FROM movies
WHERE id LIKE '2__';
```
### Like II
The percentage sign `%` is another wildcard character that can be used with `LIKE`.
This statement below filters the result set to only include movies with names that ==begin== with the letter ‘A’:
```sql=
SELECT *
FROM movies
WHERE name LIKE 'A%';
```
`%` is a wildcard character that *matches zero or more missing letters* in the pattern. For example:
- `A%` matches all movies with names that begin with letter ‘A’
- `%a` matches all movies that end with ‘a’
We can also use % both before and after a pattern:
```sql=
SELECT *
FROM movies
WHERE name LIKE '%man%';
```
Here, any movie that contains the word ‘man’ in its name will be returned in the result.
`LIKE` is not case sensitive. ‘Batman’ and ‘Man of Steel’ will both appear in the result of the query above.

#### QA: How do we search for patterns containing the actual characters “%” or “_”?
Answer
When searching for a pattern containing the specific characters `%` or `_`, we can utilize the *escape character `\`*, similarly to its use in Python.
```sql=
Example
/*
In this pattern, we use an escape character before '%'.
This will only match "%" and not be used like the
wildcard character.
This query will match any titles that end with
' 100%'.
*/
SELECT *
FROM books
WHERE title LIKE '% 100\%';
```
### Is Null
Unknown values are indicated by `NULL`.
It is ***not*** possible to test for `NULL` values with comparison operators, such as `=` and `!=`.
Instead, we will have to use these operators:
- `IS NULL`
- `IS NOT NULL`
To filter for all movies with an IMDb rating:
```sql=
SELECT name
FROM movies
WHERE imdb_rating IS NOT NULL;
```
#### QA: When storing missing data, should I store them as NULL?
Answer
It can depend entirely on how you need the data to be stored and utilized.
Let’s say that you have a table of **employee information**, which included their address. Say that we wanted to check all rows of this table and find where any addresses are missing. If we stored the addresses as `TEXT` values, we might choose to store all the missing values as either `''` or as `NULL`.
If we stored the missing address values as an empty string `''` then these values are not `NULL`. Empty strings are seen as a ==string of length 0==. So, if we ran a query using
```sql=
WHERE address IS NULL
```
it would not give us the rows with missing address values. We would have to check using
```sql=
WHERE address = ''
```
With a table containing many different data types, it may be helpful and more *convenient to store any missing values in general* as just `NULL` so that we can utilize the `IS NULL` and `IS NOT NULL` operators.
### Between
The `BETWEEN` operator is used in a `WHERE` clause to filter the result set within a certain range. It accepts two values that are either numbers, text or dates.
For example, this statement filters the result set to only include movies with years from 1990 up to, and including 1999.
```sql=
SELECT *
FROM movies
WHERE year BETWEEN 1990 AND 1999;
```
When the values are ==text==, `BETWEEN` filters the result set for <font color='red'>within the alphabetical range</font>.
In this statement, `BETWEEN` filters the result set to only include movies with names that *begin with the letter ‘A’ up to, but ==not including== ones that begin with ‘J’*.
```sql=
SELECT *
FROM movies
WHERE name BETWEEN 'A' AND 'J';
```
<font color=red>However, if a movie has a name of simply ‘J’, it would actually match</font>. This is because `BETWEEN` goes up to the second value — up to ‘J’. So the movie named ‘J’ would be included in the result set but not ‘Jaws’.
#### QA: When applying the BETWEEN operator on TEXT values, how are values compared?
Answer
In most programming languages, including SQLite and Python, TEXT or string values are compared based on their ==lexicographical ordering==, and when using the BETWEEN operator for a range of TEXT values in SQL, the values will be sorted in this way.
If two words have different lengths, but match up to the last letter of the shorter word, the shorter word will appear first in the ordering.
Example
```
A = "Alien"
B = "Aliens"
C = "Alike"
/*
Because A and B share the same sequence of characters
up to the last character of A, which is shorter, A < B.
Also, because "k" comes after "e" in the alphabet, C will
come last in the ordering of these 3 words.
A < B < C
*/
```
### And
Sometimes we want to combine *multiple conditions* in a `WHERE` clause to make the result set more specific and useful.
```sql=
SELECT *
FROM movies
WHERE year BETWEEN 1990 AND 1999
AND genre = 'romance';
```
- `year BETWEEN 1990 AND 1999` is the 1st condition.
- `genre = 'romance'` is the 2nd condition.
- AND combines the two conditions.
With AND, both conditions must be true for the row to be included in the result.
#### QA: Is the AND used in BETWEEN the same as the AND operator used between multiple conditions?
Answer
No, although they may be assumed to be the same thing, the `AND` used with a `BETWEEN`, like
`BETWEEN 1990 AND 1999`
is not quite the same `AND` used when combining multiple conditions. When used in a `BETWEEN` statement, we are ==not== combining two separate conditions, but providing a <font color=red>range of values</font> to obtain the values within that range.
However, we can easily rewrite a `BETWEEN` to one with two conditions, like these queries which would be identical.
```sql=
SELECT *
FROM movies
WHERE year BETWEEN 1990 AND 1999;
```
```sql=
SELECT *
FROM movies
WHERE year >= 1990
AND year <= 1999;
```
### Or
Similar to `AND`, the `OR` operator can also be used to combine multiple conditions in `WHERE`, but there is a fundamental difference:
`OR` operator displays a row if ==any== condition is true.
```sql=
SELECT *
FROM movies
WHERE year > 2014
OR genre = 'action';
```
- `year > 2014` is the 1st condition.
- `genre = 'action'` is the 2nd condition.
- `OR` combines the two conditions.
With OR, if any of the conditions are true, then the row is added to the result.
#### QA: Can we write conditions using both AND and OR?
Answer
Yes, queries can combine multiple conditions using `AND` and `OR` without a real limit to how many conditions you can combine. However, the more conditions you combine, the more specific the results will be and the more complex it can get.
Example
```sql=
/*
This will select movies with id values
from 10 to 20 inclusive, OR with id
values from 50 to 60 inclusive.
*/
SELECT *
FROM movies
WHERE
(id > 10 AND id < 20)
OR
(id > 50 AND id < 60);
```
### Order By
We can sort the results using `ORDER BY`, either ==alphabetically or numerically==. Sorting the results often makes the data more useful and easier to analyze.
For example, if we want to sort everything by the movie’s title from A through Z:
```sql=
SELECT *
FROM movies
ORDER BY name;
```
- `ORDER BY` is a clause that indicates you want to sort the result set by a particular column.
- `name` is the specified column.
Sometimes we want to sort things in a **decreasing order**. For example, if we want to select all of the well-received movies, sorted from highest to lowest by their year:
```sql=
SELECT *
FROM movies
WHERE imdb_rating > 8
ORDER BY year DESC;
```
- `DESC` is a keyword used in `ORDER BY` to sort the results in descending order (high to low or Z-A).
- `ASC` is a keyword used in `ORDER BY` to sort the results in ascending order (low to high or A-Z).
The column that we `ORDER BY` doesn’t even have to be one of the columns that we’re displaying.
Note: `ORDER BY` always goes after `WHERE` (if WHERE is present).
#### QA: In SQL, can we apply ORDER BY with multiple columns?
Yes, following the `ORDER BY`, you can list more than one column for which to order the data by.
When ordering by more than one column, it will *first order the data on the first column*, then, keeping the previous column order, it will order on the next column, and so on.
You can also specify ascending or descending order for each listed column.
Example
```sql=
/*
This will order on the year, then order the
names in reverse alphabetical
order, preserving the order
of the year column.
*/
SELECT year, name
FROM movies
ORDER BY year ASC, name DESC;
```
### Limit
For instance, imagine that we just want to see a few examples of records.
```sql=
SELECT *
FROM movies
LIMIT 10;
```
`LIMIT` is a clause that lets you specify the maximum number of rows the result set will have. This saves space on our screen and makes our queries run faster.
LIMIT always goes at the very end of the query. Also, it is <font color=red>not supported in all SQL databases</font>.
#### Instructions
Combining your knowledge of `LIMIT` and `ORDER BY`, write a query that returns the top 3 highest rated movies.
Select all the columns.

#### QA: In SQL, what happens if we set a LIMIT value that is greater than the total number of rows?
Answer
If the number set in the `LIMIT` clause surpasses the number of rows available to select, then it will just <font color=red>exclude the remaining amount of rows</font> in the result set.
```sql=
Example
/*
Say the table `name` has only 90 rows.
Then, since 100 is greater than the number of rows,
it will just return what rows are there.
*/
SELECT *
FROM names
LIMIT 100;
```
### Case
A `CASE` statement allows us to create different outputs (usually in the `SELECT` statement). It is SQL’s way of handling ==if-then logic==.
Suppose we want to condense the ratings in `movies` to three levels:
- If the rating is above 8, then it is Fantastic.
- If the rating is above 6, then it is Poorly Received.
- Else, Avoid at All Costs.
```sql=
SELECT name,
CASE
WHEN imdb_rating > 8 THEN 'Fantastic'
WHEN imdb_rating > 6 THEN 'Poorly Received'
ELSE 'Avoid at All Costs'
END
FROM movies;
```

- Each `WHEN` tests a condition and the following `THEN` gives us the string if the condition is true.
- The `ELSE` gives us the string if all the above conditions are false.
- The `CASE` statement must end with `END`.
In the result, you have to scroll right because the column name is very long. To shorten it, we can ==rename the column== to ‘Review’ using `AS`:
```sql=
SELECT name,
CASE
WHEN imdb_rating > 8 THEN 'Fantastic'
WHEN imdb_rating > 6 THEN 'Poorly Received'
ELSE 'Avoid at All Costs'
END AS 'Review'
FROM movies;
```

#### Instructions
Select the `name` column and use a `CASE` statement to create the second column that is:
- ‘Chill’ if `genre = 'romance'`
- ‘Chill’ if `genre = 'comedy'`
- ‘Intense’ in all other cases
Optional: Rename the whole `CASE` statement to ‘Mood’ using `AS`.

#### QA: For a CASE statement, do all values provided by THEN have to match a single data type?
Answer
No, for `CASE` statements, the `THEN` values do not have to return only a single type of value. In fact, you can have each `THEN` in a single `CASE` statement return different value types such as `TEXT`, `REAL`, and `INTEGER`.
Example
```sql=
SELECT
CASE
WHEN condition1 THEN "text"
WHEN condition2 THEN 100
WHEN condition3 THEN 3.14
END AS 'example'
FROM table;
```
## AGGREGATE FUNCTIONS
### Introduction
Calculations performed on multiple rows of a table are called **aggregates**.
In this lesson, we have given you a table named `fake_apps` which is made up of fake mobile applications data.
Here is a quick preview of some important aggregates that we will cover in the next five exercises:
- `COUNT()`: count the number of rows
- `SUM()`: the sum of the values in a column
- `MAX()`/`MIN()`: the largest/smallest value
- `AVG()`: the average of the values in a column
- `ROUND()`: round the values in the column
### Count
The fastest way to calculate how many rows are in a table is to use the `COUNT()` function.
`COUNT()` is a function that takes the name of a column as an argument and counts the number of non-empty values in that column.
```sql=
SELECT COUNT(*)
FROM table_name;
```
Here, we want to count every row, so we pass `*` as an argument inside the parenthesis.
#### Instructions
1. Let’s count how many apps are in the table.
In the code editor, run:
```sql=
SELECT COUNT(*)
FROM fake_apps;
```

2. Add a `WHERE` clause in the previous query to count how many free apps are in the table.

#### QA: Does COUNT() include duplicate values of a column?
Answer
Yes, when using the `COUNT()` function on a column in SQL, it will include duplicate values by default. It essentially counts all rows for which there is a value in the column.
If you wanted to count only the unique values in a column, then you can utilize the `DISTINCT` clause within the `COUNT()` function.
Example
```sql=
/* This will return 22, the number of distinct category values. */
SELECT COUNT(DISTINCT category)
FROM fake_apps;
```
### Sum
SQL makes it easy to add all values in a particular column using `SUM()`.
`SUM()` is a function that takes the name of a column as an argument and returns the sum of all the values in that column.
What is the total number of downloads for all of the apps combined?
```sql=
SELECT SUM(downloads)
FROM fake_apps;
```
This adds all values in the `downloads` column.
### Max / Min
The `MAX()` and `MIN()` functions return the highest and lowest values in a column, respectively.
How many downloads does the most popular app have?
```sql=
SELECT MAX(downloads)
FROM fake_apps;
```

The most popular app has 31,090 downloads!
`MAX()` takes the name of a column as an argument and returns the largest value in that column. Here, we returned the largest value in the `downloads` column.
`MIN()` works the same way but it does the exact opposite; it returns the smallest value.
#### Instruction
Delete the previous query.
Write a new query that returns the price of the most expensive app.

#### QA: If multiple rows have the minimum or maximum value, which one is returned when using MAX/MIN?
Answer
Typically, when you have more than one row that contains the minimum or maximum value in a column, the ==topmost row== containing that value will be returned in the result.
For example, if the table contained multiple rows with the minimum price of 0.0, then the result of a query with `MIN(price) `will choose the topmost row from the table that had this price value.
Example
```sql=
/*
This should return the siliconphase app, because
it was the topmost row that had the minimum price
value of the column.
*/
SELECT id, name, MIN(price)
FROM fake_apps;
```
### Average
SQL uses the `AVG()` function to quickly calculate the average value of a particular column.
The statement below returns the average number of downloads for an app in our database:
```sql=
SELECT AVG(downloads)
FROM fake_apps;
```

The `AVG()` function works by taking a column name as an argument and returns the average value for that column.
#### QA: In SQL, how can we get the average of only the unique values of a column?
Answer
To run the `AVG()` function on a column such that it only averages the unique values in the column, we could use the `DISTINCT` clause right before the column name.
Example
```sql=
/* Returns 2.02365 */
SELECT AVG(price)
FROM fake_apps;
/* Returns 4.15833.. */
SELECT AVG(DISTINCT price)
FROM fake_apps;
```
### Round
By default, SQL tries to be as precise as possible without rounding. We can make the result table easier to read using the `ROUND()` function.
`ROUND()` function takes two arguments inside the parenthesis:
1. a column name
2. an integer
It rounds the values in the column to the number of decimal places specified by the integer.
```sql=
SELECT ROUND(price, 0)
FROM fake_apps;
```
Here, we pass the column `price` and integer `0` as arguments. SQL rounds the values in the column to 0 decimal places in the output.
#### Instructions
1. Let’s return the name column and a rounded price column.
In the code editor, type:
```sql=
SELECT name, ROUND(price, 0)
FROM fake_apps;
```

2. we were able to get the average price of an app ($2.02365) using this query:
```sql=
SELECT AVG(price)
FROM fake_apps;
```
Now, let’s edit this query so that it rounds this result to 2 decimal places.

#### QA: Does the ROUND() function round up?
Answer
When using the `ROUND()` function, you can provide a second argument, which is the precision, or number of decimal places to round the number on.
In SQLite, rounding is done by rounding up if the next decimal value is 5, and rounds down if the value is less than 5.
For example,
```sql=
/* This will result in 4.0 */
SELECT ROUND(3.5, 0);
/* This will result in 6.4 */
SELECT ROUND(6.42, 1);
/* This will result in 6.0 */
SELECT ROUND(6.42, 0);
```
### Group By I
Oftentimes, we will want to calculate an aggregate for data with certain characteristics.
For instance, we might want to know the mean IMDb ratings for all movies ==each year==. We could calculate each number by a series of queries with different WHERE statements, like so:
```sql=
SELECT AVG(imdb_rating)
FROM movies
WHERE year = 1999;
SELECT AVG(imdb_rating)
FROM movies
WHERE year = 2000;
SELECT AVG(imdb_rating)
FROM movies
WHERE year = 2001;
```
and so on.
We can use `GROUP BY` to do this in a single step:
```sql=
SELECT year,
AVG(imdb_rating)
FROM movies
GROUP BY year
ORDER BY year;
```
`GROUP BY` is a clause in SQL that is used with aggregate functions. It is used in collaboration with the `SELECT` statement to arrange identical data into groups.
The `GROUP BY` statement comes after any `WHERE` statements, but before `ORDER BY` or `LIMIT`.
#### Instructions
1. In the code editor, type:
```sql=
SELECT price, COUNT(*)
FROM fake_apps
GROUP BY price;
```

Here, our aggregate function is `COUNT()` and we arranged `price` into groups.
What do you expect the result to be?
2. In the previous query, add a `WHERE` clause to count the total number of apps that have been downloaded more than 20,000 times, at each price.

3. Write a new query that calculates the total number of downloads for each category.
Select `category` and `SUM(downloads)`.

#### QA: When using the GROUP BY clause, do we always have to group by one of the selected columns listed after SELECT?
Answer
No, you can `GROUP BY` a column that was ==not included in the SELECT statement==.
For example, this query does not list the price column in the SELECT, but it does group the data by that column.
```sql=
SELECT name, downloads
FROM fake_apps
GROUP BY price;
```
However, usually we do include the grouped by column in the `SELECT` for the sake of clarity, so that it’s easier to see what rows belong to which group.
### Group By II
Sometimes, we want to `GROUP BY` a calculation done on a column.
For instance, we might want to know how many movies have IMDb ratings that round to 1, 2, 3, 4, 5. We could do this using the following syntax:
```sql=
SELECT ROUND(imdb_rating),
COUNT(name)
FROM movies
GROUP BY ROUND(imdb_rating)
ORDER BY ROUND(imdb_rating);
```
However, this query may be time-consuming to write and more prone to error.
SQL lets us use <font color=red>column reference(s)</font> in our `GROUP BY` that will make our lives easier.
- 1 is the first column selected
- 2 is the second column selected
- 3 is the third column selected
and so on.
The following query is equivalent to the one above:
```sql=
SELECT ROUND(imdb_rating),
COUNT(name)
FROM movies
GROUP BY 1
ORDER BY 1;
```
Here, the `1` refers to the first column in our `SELECT` statement, `ROUND(imdb_rating)`.
#### Instructions
1. Suppose we have the query below:
```sql=
SELECT category,
price,
AVG(downloads)
FROM fake_apps
GROUP BY category, price;
```

Write the exact query, but use column reference numbers instead of column names after `GROUP BY`.
```sql=
SELECT category,
price,
AVG(downloads)
FROM fake_apps
GROUP BY 1, 2;
```
#### QA: Do column references have to follow the order the columns are listed in the SELECT?
Answer
No, once you list the columns after the `SELECT`, they can be referenced by the order they appeared, starting from 1 for the first listed column.
You are not limited to referencing them in the exact order they were listed, like
`GROUP BY 1, 2, 3`
You can freely use the references in any order, like you would normally without using references.
`GROUP BY 3, 1, 2`
### Having
In addition to being able to group data using `GROUP BY`, SQL also allows you to ==filter which groups to include and which to exclude==.
For instance, imagine that we want to see how many movies of different genres were produced each year, but we only care about years and genres with at least 10 movies.
We can’t use `WHERE` here because we don’t want to filter the rows; we want to ==filter groups==.
This is where `HAVING` comes in.
`HAVING` is very similar to `WHERE`. In fact, all types of `WHERE` clauses you learned about thus far can be used with `HAVING`.
We can use the following for the problem:
```sql=
SELECT year,
genre,
COUNT(name)
FROM movies
GROUP BY 1, 2
HAVING COUNT(name) > 10;
```
- When we want to limit the results of a query based on values of the individual rows, use `WHERE`.
- When we want to limit the results of a query based on an aggregate property, use `HAVING`.
`HAVING` statement always comes after `GROUP BY`, but before `ORDER BY` and `LIMIT`.
#### Instructions
Suppose we have the query below:
```sql=
SELECT price,
ROUND(AVG(downloads)),
COUNT(*)
FROM fake_apps
GROUP BY price;
```
It returns the average downloads (rounded) and the number of apps – at each price point.
However, certain price points don’t have very many apps, so their average downloads are less meaningful.
Add a `HAVING` clause to restrict the query to price points that have more than 10 apps.
```sql=
SELECT price,
ROUND(AVG(downloads)),
COUNT(*)
FROM fake_apps
GROUP BY price
HAVING COUNT(price) > 10;
```

#### QA: Can a WHERE clause be applied with a HAVING statement in the same query?
Answer
Yes, you can absolutely apply a `WHERE` clause in a query that also utilizes a `HAVING` statement.
When you apply a `WHERE` clause in the same query, it must ==always be before any `GROUP BY`==, which in turn must be before any `HAVING`.
As a result, the data is essentially filtered on the `WHERE` condition first.
Then, from this filtered data, it is grouped by specified columns and then further filtered based on the `HAVING` condition.
Example
```sql=
/*
This will first filter the movies with a box_office > 500000.
Then, it will group those results by genre, and finally restrict
the query to genres that have more than 5 movies.
*/
SELECT genre, ROUND(AVG(score))
FROM movies
WHERE box_office > 500000
GROUP BY genre
HAVING COUNT(*) > 5;
```
## MULTIPLE TABLES
### Introduction
In order to efficiently store data, we often spread related information across multiple tables.
For instance, imagine that we’re running a magazine company where users can have different types of subscriptions to different products. Different subscriptions might have many different properties. Each customer would also have lots of associated information.
So instead, we can split our data into three tables:
1. `orders` would contain just the information necessary to describe what was ordered:
- `order_id`, `customer_id`, `subscription_id`, `purchase_date`
2. `subscriptions` would contain the information to describe each type of subscription:
- `subscription_id`, `description`, `price_per_month`, `subscription_length`
3. `customers` would contain the information for each customer:
- `customer_id`, `customer_name`, `address`
In this lesson, we’ll learn the SQL commands that will help us work with data that is stored in multiple tables.
### Combining Tables with SQL
Combining tables manually is time-consuming. Luckily, SQL gives us an easy sequence for this: it’s called a `JOIN`.
If we want to combine `orders` and `customers`, we would type:
```sql=
SELECT *
FROM orders
JOIN customers
ON orders.customer_id = customers.customer_id;
```
Let’s break down this command:
1. The third line uses `JOIN` to say that we want to combine information from `orders` with customers.
2. The fourth line tells us ==how to combine the two tables==. We want to match `orders` table’s `customer_id` column with `customers` table’s `customer_id` column.
Because column names are often repeated across multiple tables, we use the syntax `table_name.column_name` to be sure that our requests for columns are unambiguous. In our example, we use this syntax in the `ON` statement, but we will also use it in the `SELECT` or any other statement where we refer to column names.
For example: Instead of selecting all the columns using `*`, if we only wanted to select orders table’s `order_id` column and customers table’s `customer_name` column, we could use the following query:
```sql=
SELECT orders.order_id,
customers.customer_name
FROM orders
JOIN customers
ON orders.customer_id = customers.customer_id;
```

#### Instructions
Add a second query after your first one that only selects rows from the join where `description` is equal to ‘Fashion Magazine’.
```sql=
SELECT *
FROM orders
JOIN subscriptions
ON orders.subscription_id = subscriptions.subscription_id
WHERE subscriptions.description = 'Fashion Magazine';
```

#### QA: When doing an INNER JOIN, are columns matched on NULL values?
Answer
No, when you have ``NULL`` values in a column, these are never matched to other `NULL` values. This is because `NULL` signifies the absence of any value, and cannot be compared as they will never equal anything. Doing say `NULL = NULL` results in False.
Let’s take for example the animation given in the exercise, which shows how `INNER JOIN` works. Let’s say that an additional row was added to each table, with `NULL` in the `C2` column, such that they become
Left table:
```sql=
C1, C2
A, B
Q, W
X, Y
T, NULL
```
Right table:
```sql=
C2, C3
B, C
E, R
Y, Z
NULL, V
```
If we inner joined these tables the same way, we would end up with the same result, because `NULL` values are not matched.
```sql=
C1, C2, C3
A, B, C
X, Y, Z
```
### Inner Joins
Let’s revisit how we joined `orders` and `customers`. For every possible value of `customer_id` in orders, there was a corresponding row of customers with the same `customer_id`.
What if that wasn’t true?


The first and last rows have matching values of `c2`. The middle rows do not match. The final result has all values from the first and last rows but does ==not== include the <font color=red>non-matching</font> middle row.
#### Instruction
Join `newspaper` table and `online` table on their `id` columns (the unique ID of the subscriber).
How many rows are in this table?
```sql=
SELECT COUNT(*)
FROM newspaper;
SELECT COUNT(*)
FROM online;
SELECT COUNT(*)
FROM newspaper
JOIN online
ON newspaper.id = online.id;
```

### Left Joins
What if we want to combine two tables and keep some of the un-matched rows?
SQL lets us do this through a command called `LEFT JOIN`. A left join will keep all rows from the first table, regardless of whether there is a matching row in the second table.


```sql=
SELECT *
FROM table1
LEFT JOIN table2
ON table1.c2 = table2.c2;
```
1. The first line selects all columns from both tables.
2. The second line selects `table1` (the “left” table).
3. The third line performs a `LEFT JOIN` on `table2` (the “right” table).
4. The fourth line tells SQL how to perform the join (by looking for matching values in column `c2`).
#### Instruction
In order to find which users do <font color=red>not subscribe</font> to the online edition, we need to add a `WHERE` clause.
Add a second query after your first one that adds the following `WHERE` clause and condition:
`WHERE online.id IS NULL`
This will select rows where there was no corresponding row from the `online` table.
```sql=
SELECT *
FROM newspaper
LEFT JOIN online
ON newspaper.id = online.id
WHERE online.id IS NULL;
```

### Primary Key vs Foreign Key
Let’s return to our example of the magazine subscriptions. Recall that we had three tables: `orders`, `subscriptions`, and `customers`.
Each of these tables has a column that <font color=red>uniquely identifies </font> each row of that table:
- `order_id` for `orders`
- `subscription_id` for `subscriptions`
- `customer_id` for `customers`
These special columns are called ==primary keys==.
Primary keys have a few requirements:
- None of the values can be `NULL`.
- Each value must be ==unique==
- A table <font color=red>==can not== have more than one primary key column</font>.
Let’s reexamine the `orders` table:

Note that `customer_id` (the primary key for `customers`) and `subscription_id` (the primary key for `subscriptions`) both appear in this.
When the primary key for one table appears in a different table, it is called a ==foreign key==.
So `customer_id` is a primary key when it appears in `customers`, but a <font color=red>foreign key when it appears in `orders` </font>.
In this example, our primary keys all had somewhat descriptive names. Generally, the primary key will just be called `id`. Foreign keys will have more descriptive names.
Why is this important? The most common types of joins will be *joining a foreign key from one table with the primary key from another table*. For instance, when we join `orders` and `customers`, we join on `customer_id`, which is a foreign key in `orders` and the primary key in `customers`.
#### Instruction
Suppose Columbia University has two tables in their database:
- The `classes` table contains information on the classes that the school offers. Its primary key is id.
- The `students` table contains information on all students in the school. Its primary key is `id`. It contains the foreign key `class_id`, which corresponds to the primary key of `classes`.
Perform an ==inner join== of classes and students using the primary and foreign keys described above, and select all the columns.
```sql=
SELECT *
FROM classes
JOIN students
ON classes.id = students.class_id;
```

#### QA: Is it possible for a table to have more than one unique identifier column, like an id column?
Answer
Yes, it is possible for a table to have more than one column which can uniquely identify a row of data. A column that can uniquely identify a record of data is known as a `"Candidate Key"`. Tables can have multiple `"Candidate Key"`s, each of which could potentially be the `"Primary Key"`, but there must only be one `"Primary Key"` per table. Usually, the column chosen as the `"Primary Key"` follows the naming convention like `customer_id` or `product_id`.
For example, say that we had a table of ==employee records==, with the columns `employee_id` and `phone_number`. Every employee has a unique `employee_id` value, and a unique `phone_number` value. Both of these columns can be unique identifiers for a row, so they are `"Candidate keys"`, but the `"Primary Key"` would most likely be set to `employee_id`.
### Cross Join
Sometimes, we just want to *combine all rows of one table with all rows of another table*.
For instance, if we had a table of `shirts` and a table of `pants`, we might want to know all the possible combinations to create different outfits.
Our code might look like this:
```sql=
SELECT shirts.shirt_color,
pants.pants_color
FROM shirts
CROSS JOIN pants;
```
- The first two lines select the columns `shirt_color` and `pants_color`.
- The third line pulls data from the table `shirts`.
- The fourth line performs a `CROSS JOIN` with pants.
A more common usage of `CROSS JOIN` is when we need to compare each row of a table to a list of values.
Let’s return to our `newspaper` subscriptions. This table contains two columns that we haven’t discussed yet:
- `start_month`: the first month where the customer subscribed to the print newspaper (i.e., `2` for February)
- `end_month`: the final month where the customer subscribed to the print newspaper
Suppose we wanted to know how many users were subscribed during ==each month== of the year. For each month (`1`, `2`, `3`) we would need to know if a user was subscribed. Follow the steps below to see how we can use a `CROSS JOIN` to solve this proble
#### Instruction
##### 1
Let’s start by counting the number of customers who were subscribed to the `newspaper` during March.
Use `COUNT(*)` to count the number of rows and a `WHERE` clause to restrict to two conditions:
- `start_month <= 3`
- `end_month >= 3`
```sql=
SELECT COUNT(*)
FROM newspaper
WHERE (start_month <=3 ) AND (end_month >=3);
```
##### 2
Our database contains another table called `months` which contains the numbers between 1 and 12.
Select all columns from the cross join of `newspaper` and `months`.
```sql=
SELECT *
FROM newspaper
CROSS JOIN months;
```

##### 3
Create a third query where you add a `WHERE` statement to your cross join to restrict to two conditions:
- `start_month <= month`
- `end_month >= month`
This will select all months where a user was subscribed
```sql=
SELECT *
FROM newspaper
CROSS JOIN months
WHERE (start_month <= month) AND
(end_month >= month );
```

##### 4
Create a final query where you aggregate over each month to count the number of subscribers.
Fill in the blanks in the following query:
```sql=
SELECT month,
COUNT(*)
FROM ________
CROSS JOIN ________
WHERE ________ AND ________
GROUP BY ________;
```
```sql=
SELECT month, COUNT(*)
FROM newspaper
CROSS JOIN months
WHERE (start_month <= month) AND
(end_month >= month)
GROUP BY month;
```

### Union
Sometimes we just want to ==stack== one dataset on top of the other. Well, the `UNION` operator allows us to do that.
Suppose we have two tables and they have the same columns.
`table1`:

`table2`:

If we combine these two with `UNION`:
```sql=
SELECT *
FROM table1
UNION
SELECT *
FROM table2;
```
The result would be:

SQL has strict rules for appending data:
- Tables must have the <font color=red>same number of columns</font>.
- The columns must have the <font color=red>same data types in the same order</font> as the first table.
#### Instructions
Let’s return to our `newspaper` and `online` subscriptions. We’d like to create one big table with both sets of data.
Use `UNION` to stack the `newspaper` table on top of the `online` table.
```sql=
SELECT *
FROM newspaper
UNION
SELECT *
FROM online;
```

#### QA: What happens if the tables we perform the UNION operator on have duplicate rows?
Answer
When you combine tables with `UNION`, <font color=red>duplicate rows</font> will be ==excluded==.
If, however, you wanted to include duplicates, certain versions of SQL provides the `UNION ALL` operator.
### With
Often times, we want to combine two tables, but one of the tables is the result of another calculation.
Let’s return to our magazine order example. For instance, they might want to know *how many magazines* ==each customer== subscribes to. We can easily calculate this using our `orders` table:
```sql=
SELECT customer_id,
COUNT(subscription_id) AS 'subscriptions'
FROM orders
GROUP BY customer_id;
```

This query is good, but a `customer_id` isn’t terribly useful for our marketing department, they probably want to know the ==customer’s name==.
We want to be able to join the results of this query with our `customers` table, which will tell us the name of each customer. We can do this by using a `WITH` clause.
```sql=
WITH previous_results AS (
SELECT ...
...
...
...
)
SELECT *
FROM previous_results
JOIN customers
ON _____ = _____;
```
- The `WITH` statement allows us to perform a ==separate query== (such as aggregating customer’s subscriptions)
- `previous_results` is the alias that we will use to reference any columns from the query inside of the `WITH` clause
- We can then go on to do whatever we want with this <font color=red>temporary table</font> (such as join the temporary table with another table)
Essentially, we are putting a ==whole first query inside the parentheses `()`== and giving it a name. After that, we can use this name as if it’s a table and write a new query using the first query.
#### Instruction
Place the whole query below into a `WITH` statement, inside parentheses `()`, and give it name `previous_query`:
```sql=
SELECT customer_id,
COUNT(subscription_id) AS 'subscriptions'
FROM orders
GROUP BY customer_id
```
Join the temporary table `previous_query` with `customers` table and select the following columns:
- `customers.customer_name`
- `previous_query.subscriptions`
```sql=
WITH previous_query AS (
SELECT customer_id,
COUNT(subscription_id) AS 'subscriptions'
FROM orders
GROUP BY customer_id
)
SELECT customers.customer_name,
previous_query.subscriptions
FROM previous_query
JOIN customers
ON previous_query.customer_id = customers.customer_id;
```

#### QA: Can we use WITH for more than one nested query in SQL?
Answer
Yes, you can use `WITH` for more than one nested query. You can do so by listing each query using commas after the `WITH`.
For example,
```sql=
WITH
query1 AS (SELECT column1 FROM table1 WHERE condition1),
query2 AS (SELECT column2 FROM table2 WHERE condition2)
…
```
### Multiple Tables cheatsheet

