SQL Introduction

# SQL Introduction :::info [TOC] ::: ![SQL](https://hackmd.io/_uploads/S1iOZqJSZx.png) ## Preface SQL is a **relational query language** used in practice. - Implements ideas from **Relational Algebra, Relational Calculus, and Datalog**. > (without recursion, with slight modifications) - More **user-friendly** and closer to **natural language**. - Focuses on **core relational functionalities**. - Has many standards. (e.g., SQL99) ### Basic Form 1. `SELECT` returned attribute(s) → Specifies **which attributes (columns)** to return. 2. `FROM` relation(s) → Specifies **which relation(s) (tables)** to query. 3. `WHERE` condition(s) → Filters tuples based on **conditions on the tables**. >Execution Semantics: > - Apply the **`WHERE` conditions** to all relations in the **`FROM`** clause. > - Return the **values of attributes** listed in the **`SELECT`** clause ## Core Functionality - ✰ Example Schema: ```C++ Coffee(coffee_name, producer) CoffeeShop(shop_name, address, license) CoffeeDrinker(drinker_name, address, phone) Likes(drinker_name, coffee_name) Sells(shop_name, coffee_name, price) Frequents(drinker_name, shop_name) ``` ### 1. Single Relational Query > - _What brands are made by Baristas?_ ```SQL SELECT coffee_name FROM CoffeeTable WHERE producer = 'Baristas'; ``` - **Original Table** | coffee_name | producer | |-|-| | Caribou | Bristas | | Costa | Coava | | Kenya | Delta Coffee | - **Query Result** | coffee_name | |-| | Caribou | ### 2. Select All Attributes Query `(*)` > - What coffee brands are produced by Baristas? ```SQL SELECT * FROM CoffeeTable WHERE producer = 'Baristas'; ``` - **Original Table** | coffee_name | producer | |-|-| | Caribou | Bristas | | Costa | Coava | | Kenya | Delta Coffee | - **Query Result** | coffee_name | producer | |-|-| | Caribou | Baristas | ### 3. Query with `WHERE` clause > - Used to **filter tuples (rows)** based on conditions. > - May contain **complex conditions**. - Operators | Operator | Example | |-|-| | **Logical operators** | `AND`, `OR`, `NOT` | | **Comparison operators** | `<`, `>`, `=`, `<>` _(not equal to)_, `<=`, `>=`, etc. | | **Type-specific operators** | `LIKE`, etc. | > - What coffee brands are produced by Baristas or Coava? ```SQL SELECT coffee_name FROM CoffeeTable WHERE producer = 'Baristas' OR producer = 'Coava'; ``` - **Original Table** | coffee_name | producer | |-|-| | Caribou | Bristas | | Costa | Coava | | Kenya | Delta Coffee | - **Query Result** | coffee_name | |-| | Caribou | | Costa | ### 4. Multi Relation Query `(Join)` > - Find relations between different types of entities. > - What are the coffee brands liked by at least one person who frequents _Culture_? - **Solution Without Join** ```SQL SELECT DISTINCT coffee_name FROM Likes, Frequents WHERE Frequents.shop_name = ‘Culture’ AND Frequents.drinker_name = Likes.drinker_name; ``` --- - **Join Example** ```SQL SELECT * FROM CoffeeShop LEFT JOIN Frequents ON CoffeeShop.shop_name = Frequents.shop_name; ``` > By default, a `join` returns only tuples that satisfy the join condition. > **Left outer `join`** returns all tuples from the left relation. - **Original Tables** - _CoffeeShop_ Table | shop_name | address | license | |-|-|-| | Culture | King Blvd. | 201 | | Interzone | Monroe Ave. | 302 | - _Frequents_ Table | drinker_name | shop_name | |-|-| | Jonn | Culture | - **Query Result** | shop_name | address | license | drinker_name | shop_name | |-|-|-|-|-| | Culture | King Blvd. | 201 | John | Culture | | Interzone | Monroe Ave. | 302 | `null` | `null` | > Right (full) outer join retains all tuples from the right (all) relation. ## Nested Query 1. Queries that have another query in their `WHERE` clause 2. The queries that appear in the `WHERE` clause are called **subqueries**. - ✰ Example Schema: ```C++ Coffee(coffee_name, producer) CoffeeShop(shop_name, address, license) CoffeeDrinker(drinker_name, address, phone) Likes(drinker_name, coffee_name) Sells(shop_name, coffee_name, price) Frequents(drinker_name, shop_name) ``` > - **++Question 1++** > Using `Sells(shop_name, coffee_name, price)`, find the coffee shops that serve **Costa** for the ++same price++ that **Culture** charges for **Kenya**. - Step 1: Find **Culture**’s price for **Kenya** _(Culture-Kenya Price)_ ```SQL SELECT price FROM Sells WHERE shop_name = 'Culture' AND coffee_name = 'Kenya'; ``` - Step 2: Find shops that sell **Costa** at that price **(Nested Query)** ```SQL SELECT shop_name FROM Sells WHERE coffee_name = 'Costa' AND price = ( SELECT price FROM Sells WHERE shop_name = 'Culture' AND coffee_name = 'Kenya' ); ``` --- > - **++Question 2.a++** > Using `Sells(shop_name, coffee_name, price)`, find the coffee shops that serve **Kenya** for a _cheaper price_ than the price that ==all== coffee shops charge for **Costa**. - Step 1: Find the set of all prices for **Costa**. - Step 2: Find the shops that offer **Kenya** at a cheaper price than all values in ++Step 1++. ```SQL SELECT shop_name FROM Sells WHERE coffee_name = 'Kenya' AND price < ALL (SELECT price FROM Sells WHERE coffee_name = 'Costa'); ``` > - **++Question 2.b++** > Using `Sells(shop_name, coffee_name, price)`, find the coffee shops that serve **Kenya** for a cheaper price than the price that ==at least one== coffee shop charges for **Costa**. - Step 1: Find the set of all prices for **Costa**. - Step 2: Find the shops that offer **Kenya** at a _cheaper_ price than ++at least one value++ in ++Step 1++. ```SQL SELECT shop_name FROM Sells WHERE coffee_name = 'Kenya' AND price < ANY (SELECT price FROM Sells WHERE coffee_name = 'Costa'); ``` :::spoiler **📓 `ALL`, `ANY` clause** {state='close'} - They are used to **compare a value with a set of values**. - `ALL`: Condition is true only if the comparison holds for all values in the set. > Equivalent to logical **AND** - `ANY`: Condition is true if the comparison holds for **at least one value** in the set. > Equivalent to logical **OR** ::: --- > - **++Question 3++** > Using `Coffee(coffee_name, producer)` and `Likes(drinker_name, coffee_name)` find the producers of the coffee brands **John** likes. - Step 1: Find the set of coffee brands that **John likes**. - Step 2: Then find the producers of those coffee brands. ```SQL SELECT producer FROM Coffee WHERE coffee_name IN (SELECT coffee_name FROM Likes WHERE drinker_name = 'John'); ``` :::spoiler **📓 `IN`, `NOT IN` clause** {state='close'} - `IN`: Checks whether a value belongs to a **set returned by a subquery**. - `NOT IN`: To check if the result of a subquery ++**does not**++ contains a particular value. ::: --- > - **++Question 4++** > Using `Coffee(coffee_name, producer)`, find the coffee brands that are the only brand made by their producers. - Brand `c` is the only brand made by producer `p` ⇔ There does **NOT** exist another brand made by `p` with `name ≠ c` ```SQL SELECT coffee_name FROM Coffee c WHERE NOT EXISTS ( SELECT * FROM Coffee c2 WHERE c2.producer = c.producer AND c2.coffee_name <> c.coffee_name ); ``` :::spoiler **📓 `EXISTS`, `NOT EXISTS` clause** {state='close'} - `EXISTS`: Checks whether a subquery returns **at least one tuple**. - `NOT EXISTS`: Checks that a subquery returns **no tuples**. ::: ## Aggregation Function 1. Used to **compute summary values** over a set of tuples. 2. Operates on one or more attributes to produce a **single value** per group or for the whole table. ### Standard Functions | Function | Description | | -------- | --------------------------- | | `COUNT` | Number of tuples | | `SUM` | Sum of attribute values | | `AVG` | Average of attribute values | | `MIN` | Minimum value | | `MAX` | Maximum value | ### Advanced Functions | Function | Description | | ---------- | ---------------------------- | | `VARIANCE` | Variance of numeric values | | `STDDEV` | Standard deviation | | `REGR_*` | Linear regression statistics | :::warning - Aggregation ignores `NULLs` by default (except `COUNT(*)`). - Often used with `GROUP BY` to aggregate per group instead of the whole table. ::: - ✰ Example Schema: ```C++ Coffee(coffee_name, producer) CoffeeShop(shop_name, address, license) Sells(shop_name, coffee_name, price) ``` > - **++Question 1++** > Using `Coffee(coffee_name, producer)`, find the ++number++ of all coffee brands. ```SQL Select COUNT(coffee_name) FROM Coffee ``` > - **++Question 2++** > Using `Sell(shop_name, coffee_name, price)`, find the ++number++ of all coffee shops. - Sell might have multiple tuples with the same shop_name! (duplicates) - `DISTINCT shop_name` ensures each shop is counted once. ```SQL SELECT COUNT(DISTINCT shop_name) FROM Sell; ``` :::info `DISTINCT`：A modifier → used to remove duplicate values before aggregation. ::: > - **++Question 3++** > Using `Sells(shop_name, coffee_name, price)` find the ++minimum++ price of each coffee brand. - Step 1: Group tuples with the same brand. - Step 2: Compute `MIN` over the prices in each group. ```SQL SELECT coffee_name, MIN(price) AS min_price FROM Sells GROUP BY coffee_name; ``` > 1. `GROUP BY coffee_name` → groups tuples with the same coffee brand > 2. `MIN(price)` → computes the minimum price within each group > 3. `AS` → Rename the `MIN(price)` output attribute as '**min_price**' :::info **`GROUP BY`**：Groups tuples that have the same values on specified attributes, and applies aggregation functions **within each group**. 1. Can use **multiple attributes**. 2. In the `SELECT` clause, every attribute must be **either** - an aggregated value (e.g., `MIN`, `COUNT`), or - an attribute listed in the `GROUP BY` clause. - ❌ ==Wrong Example== > There is **no single value** of `shop_name` for the group. > Therefore, it must be added to `GROUP BY` or removed / aggregated. ```SQL SELECT coffee_name, MIN(price), shop_name FROM Sells GROUP BY coffee_name; ``` ::: > - **++Question 4++** > Using `Sells(shop_name, coffee_name, price)` find the ++minimum++ price of each coffee brand **sold in _Interzone_ or _Ava_**. ```SQL SELECT coffee_name, MIN(price) AS minprice FROM Sells WHERE shop_name = 'Interzone' OR shop_name = 'Ava' GROUP BY coffee_name; ``` > 1. `WHERE` filters tuples to only those sold by _Interzone_ or _Ava_ > 2. Remaining tuples are grouped by `coffee_name` > 3. `MIN(price)` is computed within each group > - **++Question 5++** > Using `Sells(shop_name, coffee_name, price)`, find the ++minimum++ price of each brand **whose ++maximum++ price is less than 11**. - **❌ Invalid** ```SQL SELECT coffee_name, MIN(price) FROM Sells WHERE MAX(price) < 11 -- invalid GROUP BY coffee_name; ``` - **✅ Valid** ```SQL SELECT coffee_name, MIN(price) AS minprice FROM Sells GROUP BY coffee_name HAVING MAX(price) < 11; ``` :::info **`HAVING`**：Filter Groups (After `GROUP BY`) 1. `WHERE` **cannot** use aggregation functions (e.g., `MAX`, `MIN`) 2. ++Aggregated conditions++ must be written in `HAVING` 3. `HAVING` filters **groups**, not individual tuples ::: > - **++Question 5++** > Using `Sells(shop_name, coffee_name, price)`, find the ++minimum++ price of each brand **served in more than three coffee shops**. ```SQL Select coffee_name, Min(price) As minprice From Sells Group By coffee_name Having Count(shop_name) > 3; ``` ## Set Operation 1. Set operations treat query results as **sets of tuples**. 2. Relations involved must be **union-compatible**. - (same number of attributes, compatible domains) ### Basic Operation | Operation | Meaning | |---------|--------| | `R UNION S` | Tuples that appear in **R or S** | | `R INTERSECT S` | Tuples that appear in **both R and S** | | `R EXCEPT S` | Tuples in **R but not in S** _(Not supported in some system)_ | ```SQL SELECT a FROM R WHERE a NOT IN ( SELECT a FROM S ); -- to simulate `EXCEPT` ``` ### Duplicates Operation > Set operations **remove duplicates by default** > Using `ALL` **keeps duplicates** | Operation | Default | With ALL | |---------|--------|----------| | `UNION` | removes duplicates | `UNION ALL` keeps duplicates | | `INTERSECT` | removes duplicates | `INTERSECT ALL` keeps duplicates | | `EXCEPT` | removes duplicates | `EXCEPT ALL` keeps duplicates | - ✰ Example Schema: ```C++ Pair(parent, child) ``` > - **++Question 1++ (Warmup)** > Using `Pair(parent, child)`, find **grandchildren of Joe**. ```SQL SELECT p2.child FROM Pair p1, Pair p2 WHERE (p1.parent = 'Joe') AND (p1.child = p2.parent); ``` > - **⭐ ++Question 2++ (Messedup)** > Using `Pair(parent, child)`, find **all descendants of Joe**. ```SQL SELECT p1.child FROM Pair p1 -- first generation WHERE (p1.parent == 'Joe'); UNION SELECT p2.child FROM Pair p1, Pair p2 -- second generation WHERE (p1.parent = 'JOE') AND (p1.child = p2.parent); UNION SELECT p3.child FROM Pair p1, Pair p2, Pair p3 -- third generation WHERE (p1.parent = 'JOE') AND (p1.child == p2.parent) AND (p2.child = p3.parent); UNION ... -- #sub-queries = #generations descending from Joe ``` > One sub-query per each generation, > but we don't know #generations descendinf from _Joe_. > Query must **==recursively== traverse** parentship relation to find all descendants. ## MySQL Common Table Expression (CTE) ### Theory of Limitations - **Basic SQL** does **not support recursive queries** - Recursive queries cannot be expressed in: - **Relational Algebra (RA)** - **Relational Calculus (RC)** - **Non-recursive Datalog** ### Beyond Basic SQL - More recent SQL standards introduce **limited recursion** - These features go **beyond RA and RC** - Support varies across systems: - MySQL: ==recursive `WITH` (CTE)== - Oracle: `CONNECT BY` --- ### Common Table Expression (CTE) 1. A **CTE** is a temporary relation (relation variable) 2. Exists **only within the scope of a single query** 3. Acts like a **virtual view** 4. Defined using `WITH` - ✰ Example ```sql WITH cte1 AS (SELECT a, b FROM table1) cte2 AS (SELECT a, b FROM table2) SELECT cte1.b, cte2.d FROM cte1 JOIN cte2 ON cte1.a = cte2.c; ``` > CTE can be referenced in the main query, but it disappears after the query ends. - ✰ Recursive Query Example ① ```sql WITH RECURSIVE NUM(n) AS ( SELECT 1 -- base UNION ALL SELECT n + 1 FROM NUM WHERE n < 5 -- recursion step ) SELECT * FROM NUM; ``` - ✰ Recursive Query Example ② > Using `Employ(id, manager_id)` produce management chain of every employee. > - Each employee ⇒ ids of his/her managers in the order of management ```sql WITH RECURSIVE CHAIN(id, path) AS ( -- Base case (find top managers) SELECT id, CAST(id AS CHAR(200)) FROM Employ WHERE manager_id IS NULL UNION ALL -- Recursive step (concat path) SELECT e.id, CONCAT(c.path, ' → ', e.id) FROM CHAIN c JOIN Employ e ON c.id = e.manager_id ) SELECT * FROM CHAIN; ``` - **Input Schema** | id | manager_id | | | --- | ---------- | ------------- | | 1 | NULL | ← top manager | | 5 | 1 | | | 6 | 1 | | | 101 | 5 | | | 102 | 5 | | | 103 | 6 | | - **Output Schema** | id | path | | --- | ------- | | 1 | 1 | | 5 | 1 → 5 | | 6 | 1 → 6 | | 101 | 1 → 5 → 101 | | 102 | 1 → 5 → 102 | | 103 | 1 → 6 → 103 | :::success - **Non-recursive CTE** → improves readability and modularity - **Recursive CTE** → enables queries beyond RA and RC ::: ### Recursion Support - **Most systems provide only limited recursion** - **MySQL**: - Only **one recursive reference** to the CTE - Recursive step can appear **only in FROM clause → linear recursion** - **Cannot use `GROUP BY` / Aggregation** in recursive step - **Other systems (e.g., SQL Server, Oracle)** - Allow more constructs - Can support more complex recursion | Type | Meaning | Example | |-|-|-| | Linear recursion | Recursive step references **one previous result** | Management chain | | Non-linear recursion | Recursive step references **multiple previous results** | Combinatorial functions like `n choose k` | ## Conclusion - 1️⃣ SQL is a **relational query language**, implementing RA, RC, and Datalog. - Basic query format: `SELECT … FROM … WHERE …` - `WHERE` filters tuples, `SELECT` chooses columns. - 2️⃣ Supports **single-table queries**, **multi-table joins**, **select all (`*`)**, and **condition**. - 3️⃣ Supports **subqueries** with `IN/NOT IN`, `EXISTS/NOT EXISTS`, `ALL/ANY`. - 4️⃣ Supports **aggregation functions**: `COUNT`, `SUM`, `AVG`, `MIN`, `MAX` - `DISTINCT` removes duplicates - `GROUP BY` groups tuples - `HAVING` filters groups **after aggregation** - 5️⃣ Supports **set operations**: `UNION`, `INTERSECT`, `EXCEPT` - Default removes duplicates; `ALL` keeps duplicates - `EXCEPT` can be simulated using nested queries with `NOT IN` - 6️⃣ **Common Table Expressions (CTE)**: temporary relations within a single query - Non-recursive: improves readability - Recursive: defines a relation in terms of itself - 7️⃣ **Recursive CTE**: base case + recursive step - Example: management chain or number sequences - `CAST` may be needed to convert numeric values for string concatenation - 8️⃣ **Limitations of basic SQL**: recursion is not supported - Advanced SQL standards provide **limited recursion** - MySQL: linear recursion, single recursive reference, no aggregation in recursive step - 9️⃣ **Recursion types**: - Linear recursion → one previous result produces the next - Non-linear recursion → multiple previous results produce the next - 🔟 Database systems vary in recursion support and may limit recursion complexity. :::spoiler Relevant Resource [SQL Tutorial](http://w3schools.com/sql/) [Ramakrishnan Database Management Systems](https://raw.githubusercontent.com/pforpallav/school/master/CPSC404/Ramakrishnan%20-%20Database%20Management%20Systems%203rd%20Edition.pdf) :::