SQL - HackMD

--- title: SQL tags: learning notes, SQL --- # Learn SQL [TOC] ## Manipulation ### Relational Databases * A relational database is a database that organizes information into one or more tables. Here, the relational database contains one table. * **A table is a collection of data organized into rows and columns.** Tables are sometimes referred to as relations. Here the table is `celebs`. * **A column is a set of data values of a particular type.** Here, `id`, `name`, and `age` are the columns. * **A row is a single record in a table.** The first row in the celebs table has: - An `id` of `1` - A `name` of `Justin Bieber` - An `age` of `22` * **All data stored in a relational database is of a certain data type.** Some of the most common data types are: - `INTEGER`, a positive or negative whole number - `TEXT`, a text string - `DATE`, the date formatted as YYYY-MM-DD for the year, month, and day - `REAL`, a decimal value ### Statement A statement is text that the database recognizes as a valid command. Statements always end in a semicolon `;`. ``` CREATE TABLE table_name ( column_1 data_type, column_2 data_type, column_3 data_type ); ``` * `CREATE TABLE` is a clause. Clauses perform specific tasks in SQL. By convention, clauses are written in capital letters. Clauses can also be referred to as commands. * `table_name` refers to the name of the table that the command is applied to. * `(column_1 data_type, column_2 data_type, column_3 data_type)` is a parameter. A parameter is a list of columns, data types, or values that are passed to a clause as an argument. Here, the parameter is a list of column names and the associated data type. * A statement can be written all on one line, or split up across multiple lines if it makes it easier to read. ### Create `CREATE` statements allow us to create a new table in the database. You can use the `CREATE` statement anytime you want to create a new table from scratch. The statement below creates a new table named `celebs`. ``` CREATE TABLE celebs ( id INTEGER, name TEXT, age INTEGER ); ``` ### Insert The `INSERT` statement inserts a new row into a table. You can use the `INSERT` statement when you want to add new records. The statement below enters a record for `Justin Bieber` into the `celebs` table. ``` INSERT INTO celebs (id, name, age) VALUES (1, 'Justin Bieber', 22); ``` ### Select `SELECT` statements are used to fetch data from a database. In the statement below, `SELECT` returns all data in the name column of the `celebs` table. ``` SELECT name FROM celebs; ``` * - `SELECT` is a clause that indicates that the statement is a query. You will use `SELECT` every time you query data from a database. - `name` specifies the column to query data from. - `FROM celebs` specifies the name of the table to query data from. In this statement, data is queried from the `celebs` table. * `SELECT` statements always return a new table called the **result set**. ### Alter The `ALTER TABLE` statement adds a new column to a table. You can use this command when you want to add columns to a table. The statement below adds a new column `twitter_handle` to the `celebs` table. ``` ALTER TABLE celebs ADD COLUMN twitter_handle TEXT; ``` * `ALTER TABLE` is a clause that lets you make the specified changes. * `celebs` is the name of the table that is being changed. * `ADD COLUMN` is a clause that lets you add a new column to a table: - `twitter_handle` is the name of the new column being added - `TEXT` is the data type for the new column * `NULL` is a special value in SQL that represents missing or unknown data. Here, the rows that existed before the column was added have `NULL` values for `twitter_handle`. ### Update **The `UPDATE` statement edits a row in a table.** You can use the `UPDATE` statement when you want to change existing records. The statement below updates the record with an `id` value of `4` to have the `twitter_handle` `@taylorswift13`. ``` UPDATE celebs SET twitter_handle = '@taylorswift13' WHERE id = 4; ``` * `UPDATE` is a clause that edits a row in the table. * `celebs` is the name of the table. * `SET` is a clause that indicates the column to edit. - `twitter_handle` is the name of the column that is going to be updated - `@taylorswift13` is the new value that is going to be inserted into the `twitter_handle` column. * `WHERE` is a clause that indicates which row(s) to update with the new column value. Here the row with a `4` in the `id` column is the row that will have the `twitter_handle` updated to `@taylorswift13`. ### Delete **The `DELETE FROM` statement deletes one or more rows from a table.** You can use the statement when you want to delete existing records. The statement below deletes all records in the `celeb` table with no `twitter_handle`: ``` DELETE FROM celebs WHERE twitter_handle IS NULL; ``` * `DELETE FROM` is a clause that lets you delete rows from a table. * `celebs` is the name of the table we want to delete rows from. * `WHERE` is a clause that lets you select which rows you want to delete. Here we want to delete all of the rows where the `twitter_handle` column `IS` `NULL`. * `IS` `NULL` is a condition in SQL that returns true when the value is `NULL` and false otherwise. ### Constraints Constraints that add information about how a column can be used are invoked after specifying the data type for a column. They can be used to tell the database to reject inserted data that does not adhere to a certain restriction. The statement below sets constraints on the `celebs` table. ``` CREATE TABLE celebs ( id INTEGER PRIMARY KEY, name TEXT UNIQUE, date_of_birth TEXT NOT NULL, date_of_death TEXT DEFAULT 'Not Applicable' ); ``` * **`PRIMARY KEY` columns can be used to uniquely identify the row.** Attempts to insert a row with an identical value to a row already in the table will result in a constraint violation which will not allow you to insert the new row. * `UNIQUE` columns have a different value for every row. This is similar to `PRIMARY KEY` except **a table can have many different `UNIQUE` columns.** * **`NOT NULL` columns must have a value.** Attempts to insert a row without a value for a NOT NULL column will result in a constraint violation and the new row will not be inserted. * `DEFAULT` columns take an additional argument that will be the assumed value for an inserted row if the new row does not specify a value for that column. ### Generalizations - SQL is a programming language designed to manipulate and manage data stored in relational databases. - A relational database is a database that organizes information into one or more tables. - A table is a collection of data organized into rows and columns. ## Queries One of the core purposes of the SQL language is to retrieve information stored in a database. This is commonly referred to as querying. Queries allow us to communicate with the database by asking questions and having the result set return data relevant to the question. ### Select Suppose we are only interested in two of the columns. We can select individual columns by their names (separated by a comma): ``` SELECT column1, column2 FROM table_name; ``` ### As `AS` is a keyword in SQL that allows you to rename a column or table using an alias. The new name can be anything you want as long as you put it inside of single quotes. Here we renamed the `name` column as `Titles`. ``` SELECT name AS 'Titles' FROM movies; ``` * Some important things to note: - Although it’s not always necessary, it’s best practice to surround your aliases with single quotes. - When using `AS`, the columns are not being renamed in the table. The aliases only appear in the result. ### Distinct When we are examining data in a table, it can be helpful to know what distinct values exist in a particular column. `DISTINCT` is used to return unique values in the output. It filters out all duplicate values in the specified column(s). For instance, ``` SELECT tools FROM inventory; ``` might produce: |tools| |:--------:| |Hammer| |Nails| |Nails| |Nails| ----------- By adding `DISTINCT` before the column name, ``` SELECT DISTINCT tools FROM inventory; ``` the result would now be: |tools| |:--------:| |Hammer| |Nails| * Filtering the results of a query is an important skill in SQL. It is easier to see the different possible `genres` in the `movie` table after the data has been filtered than to scan every row in the table. ### Where **We can restrict our query results using the `WHERE` clause in order to obtain only the information we want.** Following this format, the statement below filters the result set to only include top rated movies (IMDb ratings greater than 8): ``` SELECT * FROM movies WHERE imdb_rating > 8; ``` * How does it work? - `WHERE` clause filters the result set to only include rows where the following condition is true. - `imdb_rating > 8` is the condition. Here, only rows with a value greater than 8 in the `imdb_rating` column will be returned. ### Like **LIKE can be a useful operator when you want to compare similar values.** The movies table contains two films with similar titles, ‘Se7en’ and ‘Seven’. How could we select all movies that start with ‘Se’ and end with ‘en’ and have exactly one character in the middle? ``` SELECT * FROM movies WHERE name LIKE 'Se_en'; ``` * - `LIKE` is a special operator used with the WHERE clause to **search for a specific pattern** in a column. - `name LIKE 'Se_en'` is a condition evaluating the `name` column for a specific pattern. - `Se_en` represents a pattern with a wildcard character. --- The percentage sign `%` is another wildcard character that can be used with `LIKE`. This statement below filters the result set to only include movies with names that begin with the letter ‘A’: ``` SELECT * FROM movies WHERE name LIKE 'A%'; ``` * `%` is a wildcard character that matches zero or more missing letters in the pattern. For example: - `A%` matches all movies with names that begin with letter ‘A’ - `%a` matches all movies that end with ‘a’ We can also use `%` both before and after a pattern: ``` SELECT * FROM movies WHERE name LIKE '%man%'; ``` Here, any movie that contains the word ‘man’ in its name will be returned in the result. **`LIKE` is not case sensitive.** ‘Batman’ and ‘Man of Steel’ will both appear in the result of the query above. ### Is Null You might have noticed that there are a few missing values in the `movies` table. More often than not, the data you encounter will have missing values. Unknown values are indicated by `NULL`. ``` SELECT name FROM movies WHERE imdb_rating IS NOT NULL; ``` ### Between **The `BETWEEN` operator can be used in a `WHERE` clause to filter the result set within a certain range.** The values can be numbers, text or dates. This statement filters the result set to only include movies with `name`s that begin with letters ‘A’ up to, but not including ‘J’. ``` SELECT * FROM movies WHERE name BETWEEN 'A' AND 'J'; ``` Here is another one: ``` SELECT * FROM movies WHERE year BETWEEN 1990 AND 1999; ``` In this statement, the `BETWEEN` operator is being used to filter the result set to only include movies with years between 1990 up to, and including 1999. --- * Really interesting point to emphasize again: - `BETWEEN` two letters is not inclusive of the 2nd letter. - `BETWEEN` two numbers is inclusive of the 2nd number. ### AND ``` SELECT * FROM movies WHERE year BETWEEN 1990 AND 1999 AND genre = 'romance'; ``` * With `AND`, both conditions must be true for the row to be included in the result. ### OR ``` SELECT * FROM movies WHERE year > 2014 OR genre = 'action'; ``` * With `OR`, if any of the conditions are true, then the row is added to the result. ### Order By We can sort the results using `ORDER BY`, either alphabetically or numerically. Sorting the results often makes the data more useful and easier to analyze. For example, if we want to sort everything by the movie’s title from A through Z: ``` SELECT * FROM movies ORDER BY name; ``` * - `ORDER BY` is a clause that indicates you want to sort the result set by a particular column. - `name` is the specified column. --- Sometimes we want to sort things in a decreasing order. For example, if we want to select all of the well-received movies, sorted from highest to lowest by their year: ``` SELECT * FROM movies WHERE imdb_rating > 8 ORDER BY year DESC; ``` * - `DESC` is a keyword used in `ORDER BY` to sort the results in descending order (high to low or Z-A). - `ASC` is a keyword used in `ORDER BY` to sort the results in ascending order (low to high or A-Z). * Note: `ORDER BY` always goes after `WHERE`(if `WHERE` is present). ### Limit We’ve been working with a fairly small table (fewer than 250 rows), but most SQL tables contain hundreds of thousands of records. **In those situations, it becomes important to cap the number of rows in the result.** For instance, imagine that we just want to see a few examples of records. ``` SELECT * FROM movies LIMIT 10; ``` `LIMIT` is a clause that lets you specify the maximum number of rows the result set will have. This saves space on our screen and makes our queries run faster. Here, we specify that the result set can’t have more than 10 rows. `LIMIT` always goes at the very end of the query. Also, it is not supported in all SQL databases. ### Case A `CASE` statement allows us to create different outputs (usually in the `SELECT` statement). It is SQL’s way of handling *if-then* logic. * Suppose we want to condense the ratings in movies to three levels: - *If the rating is above 8, then it is Fantastic.* - *If the rating is above 6, then it is Poorly Received.* - *Else, Avoid at All Costs.* ``` SELECT name, CASE WHEN imdb_rating > 8 THEN 'Fantastic' WHEN imdb_rating > 6 THEN 'Poorly Received' ELSE 'Avoid at All Costs' END FROM movies; ``` * - Each `WHEN` tests a condition and the following `THEN` gives us the string if the condition is true. - The `ELSE` gives us the string if all the above conditions are false. - The `CASE` statement must end with `END`. * In the result, you have to scroll right because the column name is very long. To shorten it, we can rename the column to ‘Review’ using `AS`: ``` SELECT name, CASE WHEN imdb_rating > 8 THEN 'Fantastic' WHEN imdb_rating > 6 THEN 'Poorly Received' ELSE 'Avoid at All Costs' END AS 'Review' FROM movies; ``` ## Aggregate Functions Now, we are going to learn how to perform calculations using SQL. Calculations performed on multiple rows of a table are called **aggregates**. ### Count The fastest way to calculate how many rows are in a table is to use the `COUNT()` function. `COUNT()` is a function that **takes the name of a column as an argument** and counts the number of non-empty values in that column. ``` SELECT COUNT(*) FROM table_name; ``` ### Sum `SUM()` is a function that **takes the name of a column as an argument** and returns the sum of all the values in that column. What is the total number of downloads for all of the apps combined? ``` SELECT SUM(downloads) FROM fake_apps; ``` ### Max / Min The `MAX()` and `MIN()` functions return the highest and lowest values in a column, respectively. How many downloads does the most popular app have? ``` SELECT MAX(downloads) FROM fake_apps; ``` `MAX()` takes the name of a column as an argument and returns the largest value in that column. Here, we returned the largest value in the `downloads` column. `MIN()` works the same way but it does the exact opposite; it returns the smallest value. ### Average SQL uses the `AVG()` function to quickly calculate the average value of a particular column. The statement below returns the average number of downloads for an app in our database: ``` SELECT AVG(downloads) FROM fake_apps; ``` The `AVG()` function works by taking a column name as an argument and returns the average value for that column. ### Round By default, SQL tries to be as precise as possible without rounding(四捨五入). We can make the result table easier to read using the `ROUND()` function. * `ROUND()` function takes two arguments inside the parenthesis: - a column name - an integer * **It rounds the values in the column to the number of decimal places specified by the integer.** ``` SELECT ROUND(price, 0) FROM fake_apps; ``` Here, we pass the column price and integer 0 as arguments. SQL rounds the values in the column to 0 decimal places in the output. ### Group By ``` SELECT year, AVG(imdb_rating) FROM movies GROUP BY year ORDER BY year; ``` * `GROUP BY` is a clause in SQL that is used with aggregate functions. It is used in collaboration with the `SELECT` statement to arrange identical data into groups. * **The `GROUP BY` statement comes after any `WHERE` statements, but before `ORDER BY` or `LIMIT`.** --- Sometimes, we want to `GROUP BY` a calculation done on a column. * SQL lets us use column reference(s) in our `GROUP BY` that will make our lives easier. - `1` is the first column selected - `2` is the second column selected - `3` is the third column selected ``` SELECT ROUND(imdb_rating), COUNT(name) FROM movies GROUP BY 1 ORDER BY 1; ``` Here, the `1` refers to the first column in our `SELECT` statement, `ROUND(imdb_rating)`. ### Having In addition to being able to group data using `GROUP BY`, SQL also allows you to filter which groups to include and which to exclude. For instance, imagine that we want to see how many movies of different genres were produced each year, but we only care about years and genres with at least 10 movies. We can’t use `WHERE` here because **we don’t want to filter the rows; we want to filter groups.** **`HAVING` is very similar to `WHERE`. In fact, all types of `WHERE` clauses you learned about thus far can be used with `HAVING`.** We can use the following for the problem: ``` SELECT year, genre, COUNT(name) FROM movies GROUP BY 1, 2 HAVING COUNT(name) > 10; ``` * **When we want to limit the results of a query based on values of the individual rows, use `WHERE`.** * **When we want to limit the results of a query based on an aggregate property, use `HAVING`.** `HAVING` statement always comes after `GROUP BY`, but before `ORDER BY` and `LIMIT`. ## Multiple Tables In order to efficiently store data, we often spread related information across multiple tables. However, a lot of this information would be repeated. If the same customer has multiple subscriptions, that customer’s name and address will be reported multiple times. If the same subscription type is ordered by multiple customers, then the subscription price and subscription description will be repeated. This will make our table big and unmanageable. So instead, we can split our data into many tables. ### Combining Tables with SQL Combining tables manually is time-consuming. Luckily, SQL gives us an easy sequence for this: it’s called a `JOIN`. If we want to combine orders and customers, we would type: ``` SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id; ``` * Let’s break down this command: - The first line selects all columns from our combined table. If we only want to select certain columns, we can specify which ones we want. - The second line specifies the first table that we want to look in, `orders` - The third line uses `JOIN` to say that we want to combine information from `orders` with `customers`. - The fourth line tells us how to combine the two tables. We want to match `orders` table’s `customer_id` column with `customers` table’s `customer_id column`. * Because column names are often repeated across multiple tables, we use the syntax `table_name.column_name` to be sure that our requests for columns are unambiguous. ### Inner Joins When we perform a simple `JOIN` (often called an inner join) our result only includes rows that match our `ON` condition. Consider the following animation, which illustrates an inner join of two tables on `table1.c2` = `table2.c2`: ![inner-join](https://s3.amazonaws.com/codecademy-content/courses/learn-sql/multiple-tables/inner-join.gif) The first and last rows have matching values of `c2`. The middle rows do not match. The final result has all values from the first and last rows but does not include the non-matching middle row. ### Left Joins What if we want to combine two tables and keep some of the un-matched rows? SQL lets us do this through a command called `LEFT JOIN`. A left join will keep all rows from the first table, regardless of whether there is a matching row in the second table. ![left-join](https://s3.amazonaws.com/codecademy-content/courses/learn-sql/multiple-tables/left-join.gif) The first and last rows have matching values of `c2`. The middle rows do not match. The final result will keep all rows of the first table but will omit the un-matched row from the second table. This animation represents a table operation produced by the following command: ``` SELECT * FROM table1 LEFT JOIN table2 ON table1.c2 = table2.c2; ``` * - The first line selects all columns from both tables. - The second line selects `table1` (the “left” table). - The third line performs a `LEFT JOIN` on `table2` (the “right” table). - The fourth line tells SQL how to perform the join (by looking for matching values in column `c2`). ### Primary Key vs Foreign Key * Primary keys have a few requirements: - None of the values can be `NULL`. - Each value must be unique (i.e., you can’t have two customers with the same `customer_id` in the `customers table`). - A table can not have more than one primary key column. Let’s reexamine the `orders` table: |order_id|customer_id|subscription_id |purchase_date| | :--------: | :--------: | :--------: | :--------: | | 1 | 2 | 3 |2017-01-01| | 2 | 2 | 2 |2017-01-01| | 3 | 3 | 1 |2017-01-01| Note that `customer_id` (the primary key for customers) and `subscription_id` (the primary key for subscriptions) both appear in this. When the primary key for one table appears in a different table, it is called a **foreign key**. *Why is this important?* **The most common types of joins will be joining a foreign key from one table with the primary key from another table.** For instance, when we join `orders` and `customers`, we join on `customer_id`, which is a foreign key in `orders` and the `primary key` in `customers`. ### Cross Join So far, we’ve focused on matching rows that have some information in common. **Sometimes, we just want to combine all rows of one table with all rows of another table.(交叉加入)** For instance, if we had a table of `shirts` and a table of `pants`, we might want to know all the possible combinations to create different outfits. Our code might look like this: ``` SELECT shirts.shirt_color, pants.pants_color FROM shirts CROSS JOIN pants; ``` * - The first two lines select the columns `shirt_color` and `pants_color`. - The third line pulls data from the table `shirts`. - The fourth line performs a `CROSS JOIN` with `pants`. * **Notice that cross joins don’t require an `ON` statement. You’re not really joining on any columns!** * A more common usage of `CROSS JOIN` is when we need to compare each row of a table to a list of values. * Let’s return to our `newspaper` subscriptions. This table contains two columns that we haven’t discussed yet: - `start_month`: the first month where the customer subscribed to the print newspaper (i.e., 2 for February) - `end_month`: the final month where the customer subscribed to the print newspaper * Suppose we wanted to know how many users were subscribed during each month of the year. For each month (`1`, `2`, `3`) we would need to know if a user was subscribed. ### Union Sometimes we just want to stack one dataset on top of the other. Well, the `UNION` operator allows us to do that. ``` SELECT * FROM table1 UNION SELECT * FROM table2; ``` * SQL has strict rules for appending data: - Tables must have the same number of columns. - The columns must have the same data types in the same order as the first table. ### With Often times, we want to combine two tables, but one of the tables is the result of another calculation. We want to be able to join the results of this query with our customers table, which will tell us the name of each customer. We can do this by using a `WITH` clause. ``` WITH previous_results AS ( SELECT ... ... ... ... ) SELECT * FROM previous_results JOIN customers ON _____ = _____; ``` * - The `WITH` statement allows us to perform a separate query (such as aggregating customer’s subscriptions) - `previous_results` is the alias that we will use to reference any columns from the query inside of the `WITH` clause - We can then go on to do whatever we want with this temporary table (such as join the temporary table with another table) * Essentially, we are putting a whole first query inside the parentheses `()` and giving it a name. After that, we can use this name as if it’s a table and write a new query using the first query. * Do not include `;` inside of the `()` of your `WITH` statement.