07 窗口函數與子查詢

# 07 窗口函數與子查詢 ## 利用窗口函數 ROW_NUMBER 取得每個供應商最高價的產品依據前面所學，我們可以利用 MAX() + GROPU BY 取得每個供應商最高價的產品的價格 ```SQL= SELECT vendor_id, MAX(original_price) AS highest_price FROM vendor_inventory GROUP BY vendor_id ORDER BY vendor_id ``` ![image](https://hackmd.io/_uploads/BkLVHVV2p.png) 但是上述的查詢方式只能拿到各個供應商的 original_price 最高價格，但是無法得知最高價格對應到哪一個產品 (product_id) 解決方式 → 利用窗口函數步驟： 1. 將資料依據 vendor_id 做分區 - PARTITION BY vendor_id 2. 將分區內的資料依據 original_price 由大至小排序 - ORDER BY original_price DESC 3. 用窗口函數 ROW_NUMBER 將分區內已經排序好的資料賦予從 1 開始的序號，並將此序號放到 price_rank 欄位 ```sql= SELECT vendor_id, market_date, product_id, original_price, ROW_NUMBER() over ( PARTITION BY vendor_id ORDER BY original_price DESC ) AS price_rank FROM vendor_inventory ORDER BY vendor_id DESC, original_price DESC ``` 4. 將上面的查詢做為子查詢，利用 WHERE 篩選出 price_rank 為 1 的資料 ```sql= SELECT * FROM ( SELECT vendor_id, market_date, product_id, original_price, ROW_NUMBER() over ( PARTITION BY vendor_id ORDER BY original_price DESC ) AS price_rank FROM vendor_inventory ORDER BY vendor_id DESC, original_price DESC ) AS x WHERE x.price_rank = 1 ``` ![image](https://hackmd.io/_uploads/S12iL4Eha.png) ## SQL tips - 撰寫子查詢的技巧先把子查詢的 SQL 圈選起來，然後執行圈選起來的部分。這樣可以幫助你先查看 sub-query 結果是否如預期，以確保主查詢可以獲得正確的輸出 ![image](https://hackmd.io/_uploads/r1ULvEV36.png) ## 窗口函數的位置要正確 ![image](https://hackmd.io/_uploads/BJA3w4426.png) 不用子查詢，直接寫 `WHERE price_rank = 1` 可以嗎？答案是不行，因為 WHERE 的執行順序在 ROW_NUMBER() 之前，所以 WHERE 執行的時候 price_rank 根本還沒產生出來 ### SQL查詢執行順序 ![image](https://hackmd.io/_uploads/H1RM_VE36.png) https://medium.com/@cindy20303705/sql%E6%9F%A5%E8%A9%A2%E5%9F%B7%E8%A1%8C%E9%A0%86%E5%BA%8F-4d62584d372d ## 利用窗口函數 RNAK & DENSE_RANK 取得每個供應商最高價的產品 ```sql= SELECT vendor_id, market_date, product_id, original_price, RANK() over ( PARTITION BY vendor_id ORDER BY original_price DESC ) AS price_rank FROM vendor_inventory ORDER BY vendor_id DESC, original_price DESC ``` 使用 RANK：因為前面已經有 284 筆 18.00 都排第一，因此到了第二高價的 6.50 時排名就會從 285 開始 ![image](https://hackmd.io/_uploads/HJP7iEE3a.png) 若想要第二高價排名為 2，就可以改用 DENSE_RANK ![image](https://hackmd.io/_uploads/SyFrjV43a.png) ## 利用窗口函數 NTILE & 取得每個供應商前 10% 高價的產品 NTILE 功能：將列的資料分為 n 等分，譬如 NTILE(10) 就是將資料分為 10 等分。若列數不能剛好均分，某些等分就會比其他等分多一列資料譬如我們想知道原價排在前 10% 的產品 ```sql= SELECT * FROM ( SELECT vendor_id, market_date, product_id, original_price, NTILE(10) over (ORDER BY original_price DESC) AS price_ntile FROM vendor_inventory ORDER BY original_price DESC ) AS x WHERE x.price_ntile = 1 ``` ## 聚合窗口函數 ### 以 AVG 聚合函數計算個分區的平均單價供應商想知道哪些產品的單價高於市集日期 (market_date) 的平均單價，該怎麼做？步驟： 1. 使用 PARTITION BY 將資料以 market_date 做分區，並且依照 market_date 排序 2. 利用 AVG 函數計算各分區的平均值，並放入 avg_by_market_date 欄位 ```sql= SELECT vendor_id, market_date, product_id, original_price, AVG(original_price) over ( PARTITION BY market_date ORDER BY market_date ) AS avg_by_market_date FROM vendor_inventory ``` ![image](https://hackmd.io/_uploads/rysgaNVna.png) 3. 我們想讓分區平均單價顯示兩位小數，可以利用 ROUND 函數 4. 將經過聚合窗口得到的輸出視為內部查詢的結果集，並取表格別名為 x 5. 加上外部查詢；查詢某一個 vendor_id 且產品原價 (original_price) 大於分區平均單價 ```sql= SELECT * FROM ( SELECT vendor_id, market_date, product_id, original_price, ROUND(AVG(original_price) over ( PARTITION BY market_date ORDER BY market_date ), 2) AS avg_by_market_date FROM vendor_inventory ) AS x WHERE x.vendor_id = 8 AND x.original_price > avg_by_market_date ORDER BY x.market_date, x.original_price DESC ``` ![image](https://hackmd.io/_uploads/H1QL6VE3T.png) ### 既然我們只是要看供應商 8 的資料，那將 `WHERE vendor_id = 8` 放到子查詢結果會一樣嗎？ ![CleanShot 2024-02-21 at 16.01.13@2x](https://hackmd.io/_uploads/HJHh644hT.jpg) 答案是不一樣，因為當 `WHERE vendor_id = 8` 寫在子查詢內時，在計算平均的時候就已經將其他供應商篩掉了，所以平均單價就不是所有供應商的平均單價 ## 以 COUNT 聚合函數計算個分區的項目數取得每個供應商在每個市集日期 (market_date) 有多少品項 ```sql= SELECT vendor_id, market_date, product_id, original_price, COUNT(product_id) over ( PARTITION BY market_date, vendor_id ) AS vendor_product_count_per_market_date FROM vendor_inventory ORDER BY vendor_id, market_date, original_price DESC ``` ![image](https://hackmd.io/_uploads/rkuQyHEha.png) ## 用 SUM 聚合函數計算各分區的累計加總 1. 不使用 PARTITION BY 子句分區，也就是全部視為一區，最後的結果就是所有交易的加總消費額。 ```sql= SELECT customer_id, market_date, vendor_id, product_id, quantity * cost_to_customer_per_qty AS price, sum(quantity * cost_to_customer_per_qty) OVER ( ORDER BY market_date, transaction_time, customer_id, product_id ) AS running_total_purchases FROM customer_purchases ``` ![image](https://hackmd.io/_uploads/ByrJeS4np.png) 2. 使用 customer_id 欄位進行分區，用 SUM 函數累計每一位顧客的消費額 ```sql= SELECT customer_id, market_date, vendor_id, product_id, quantity * cost_to_customer_per_qty AS price, SUM(quantity * cost_to_customer_per_qty) OVER ( PARTITION BY customer_id ORDER BY market_date, transaction_time, product_id ) AS customer_spend_running_total FROM customer_purchases ``` ![image](https://hackmd.io/_uploads/SJ3lMr4na.png) 3. 如果只有 PARTITION BY 子句而沒有 ORDER BY 子句，結果會是什麼？ ```sql= SELECT customer_id, market_date, vendor_id, product_id, quantity * cost_to_customer_per_qty AS price, ROUND(SUM(quantity * cost_to_customer_per_qty) OVER ( PARTITION BY customer_id ), 2) AS customer_spend_total FROM customer_purchases ``` SUM 函數就不會對分區的資料逐筆累計，而是一次將各分區所有的資料加總 ![image](https://hackmd.io/_uploads/rJC4MSNhT.png) ## 窗口函數 LAG & LEAD 功能：將目前列與前後列進行比較和計算例如目前是在第 10 列資料，若使用 LAG(booth_number, 1) 就可以位移 1 列去取得第 9 列的 booth_number 欄位的值 ### 依照市集日期查看各供應商當次與前次的欄位分配 ```sql= SELECT market_date, vendor_id, booth_number, LAG(booth_number, 1) OVER ( PARTITION BY vendor_id ORDER BY market_date, vendor_id ) AS previous_booth_number FROM `vendor_booth_assignments` ORDER BY market_date, vendor_id, booth_number ``` ![CleanShot 2024-02-21 at 16.58.16@2x](https://hackmd.io/_uploads/HJ-TMHVn6.jpg) ### 找出攤位編號有異動的供應商步驟： 1. 查出供應商此次與上次的攤位編號，並取別名為 x 2. 從結果集 x 再篩選某個市集日期，查出攤位編號與上次分配不同的所有供應商 ```sql= SELECT * FROM ( SELECT market_date, vendor_id, booth_number, LAG(booth_number, 1) OVER ( PARTITION BY vendor_id ORDER BY market_date, vendor_id ) AS previous_booth_number FROM `vendor_booth_assignments` ORDER BY market_date, vendor_id, booth_number ) AS x WHERE x.market_date = '2019-04-10' AND (x.booth_number <> x.previous_booth_number OR previous_booth_number IS NULL) ``` 可以發現供應商 1 & 4 的攤位編號對調了 ![image](https://hackmd.io/_uploads/S1lfmSE3a.png) ## 比較本次與前次市集日期的銷售總額我們想知道在各市集日期的總銷售額是否高於或低於上一次的市集日期步驟 1. 使用 GROUP BY 和 SUM 函數來獲得每個市集日期的總銷售額 ```sql= SELECT market_date, sum(quantity * cost_to_customer_per_qty) AS market_date_total_sales FROM customer_purchases GROUP BY market_date ORDER BY market_date ``` ![image](https://hackmd.io/_uploads/SJP87BVhp.png) 2. 增加 LAG 函數，將前後次市集日期的銷售總額放在同一列資料中 ```sql= SELECT market_date, sum(quantity * cost_to_customer_per_qty) AS market_date_total_sales, LAG(sum(quantity * cost_to_customer_per_qty), 1) OVER ( ORDER BY market_date ) AS previous_market_date_toal_sales FROM customer_purchases GROUP BY market_date ORDER BY market_date ``` ![image](https://hackmd.io/_uploads/BJou7SV2T.png) 3. 計算當前市集日期與前一次的銷售差額 ```sql= SELECT market_date, SUM(quantity * cost_to_customer_per_qty) AS market_date_total_sales, LAG(sum(quantity * cost_to_customer_per_qty), 1) OVER ( ORDER BY market_date ) AS previous_market_date_toal_sales, SUM(quantity * cost_to_customer_per_qty) - LAG(sum(quantity * cost_to_customer_per_qty), 1) OVER (ORDER BY market_date) AS sales_growth FROM customer_purchases GROUP BY market_date ORDER BY market_date ``` ![image](https://hackmd.io/_uploads/rk_c7BV2p.png) 4. 提高查詢效率只計算一次銷售總額，從而提高查詢效率 ```sql= SELECT market_date, market_date_total_sales, LAG(market_date_total_sales, 1) OVER ( ORDER BY market_date ) AS previous_market_date_toal_sales, market_date_total_sales - LAG(market_date_total_sales, 1) OVER (ORDER BY market_date) AS sales_growth FROM ( SELECT market_date, SUM(quantity * cost_to_customer_per_qty) AS market_date_total_sales FROM customer_purchases GROUP BY market_date ORDER BY market_date ) AS sales ``` ## 由當前記錄往後位移列數的 LEAD 函數差別在 LAG 函數是往前位移列數，LEAD 是往後 ## 窗口命名 PostgreSQL & MySQL 8.0 有支援 ```sql= SELECT market_date, vendor_id, booth_number, LAG(booth_number,1) OVER w AS previous_booth_number FROM farmers_market.vendor_booth_assignments WINDOW w AS ( PARTITION BY vendor_id ORDER BY market_date, vendor_id ) ORDER BY market_date, vendor_id, booth_number ```