# 使用PostgreSQL做不定時段統計 這是延伸的問題,之前有固定間隔的,也介紹了date_bin(). [上一次的](https://ithelp.ithome.com.tw/questions/10214403) [這次的](https://ithelp.ithome.com.tw/questions/10214579) ## 苦工大師 ```sql= with t1 as ( select case when ts >= timestamp '2023-10-11 09:00:00' and ts < timestamp '2023-10-11 10:00:00' then timestamp '2023-10-11 09:00:00' when ts >= timestamp '2023-10-11 10:00:00' and ts < timestamp '2023-10-11 10:08:00' then timestamp '2023-10-11 10:00:00' when ts >= timestamp '2023-10-11 10:08:00' and ts < timestamp '2023-10-11 10:18:00' then timestamp '2023-10-11 10:08:00' when ts >= timestamp '2023-10-11 10:18:00' and ts < timestamp '2023-10-11 10:27:00' then timestamp '2023-10-11 10:18:00' when ts >= timestamp '2023-10-11 10:27:00' and ts < timestamp '2023-10-11 10:34:00' then timestamp '2023-10-11 10:27:00' when ts >= timestamp '2023-10-11 10:34:00' and ts < timestamp '2023-10-11 10:44:00' then timestamp '2023-10-11 10:34:00' when ts >= timestamp '2023-10-11 10:44:00' and ts < timestamp '2023-10-11 11:00:00' then timestamp '2023-10-11 10:44:00' when ts >= timestamp '2023-10-11 11:00:00' and ts < timestamp '2023-10-11 11:15:00' then timestamp '2023-10-11 11:00:00' when ts >= timestamp '2023-10-11 11:15:00' and ts < timestamp '2023-10-11 11:30:00' then timestamp '2023-10-11 11:15:00' else null end as bin , qty from booking ) select bin , count(*) , sum(qty) from t1 group by rollup(bin) order by bin nulls last; bin | count | sum ---------------------+-------+----- 2023-10-11 09:00:00 | 31 | 71 2023-10-11 10:00:00 | 11 | 28 2023-10-11 10:08:00 | 25 | 65 2023-10-11 10:18:00 | 18 | 46 2023-10-11 10:27:00 | 12 | 30 2023-10-11 10:34:00 | 13 | 33 2023-10-11 10:44:00 | 25 | 62 2023-10-11 11:00:00 | 13 | 36 2023-10-11 11:15:00 | 12 | 35 NULL | 160 | 406 (10 rows) ``` 麻煩,也容易出錯,而且當該時段沒資料時,不會顯示出來,這裡就不多做處理,因為不是主角. ## 輔助表的方式 ```sql= create temp table booking_aux ( sts timestamp , ets timestamp ); insert into booking_aux values (timestamp '2023-10-11 09:00:00', timestamp '2023-10-11 10:00:00'), (timestamp '2023-10-11 10:00:00', timestamp '2023-10-11 10:08:00'), (timestamp '2023-10-11 10:08:00', timestamp '2023-10-11 10:18:00'), (timestamp '2023-10-11 10:18:00', timestamp '2023-10-11 10:27:00'), (timestamp '2023-10-11 10:27:00', timestamp '2023-10-11 10:34:00'), (timestamp '2023-10-11 10:34:00', timestamp '2023-10-11 10:44:00'), (timestamp '2023-10-11 10:44:00', timestamp '2023-10-11 11:00:00'), (timestamp '2023-10-11 11:00:00', timestamp '2023-10-11 11:15:00'), (timestamp '2023-10-11 11:15:00', timestamp '2023-10-11 11:30:00'); select ba.sts as bin , count(*) , sum(qty) from booking b right join booking_aux ba on b.ts >= ba.sts and b.ts < ba.ets group by rollup(bin) order by bin nulls last; bin | count | sum ---------------------+-------+----- 2023-10-11 09:00:00 | 31 | 71 2023-10-11 10:00:00 | 11 | 28 2023-10-11 10:08:00 | 25 | 65 2023-10-11 10:18:00 | 18 | 46 2023-10-11 10:27:00 | 12 | 30 2023-10-11 10:34:00 | 13 | 33 2023-10-11 10:44:00 | 25 | 62 2023-10-11 11:00:00 | 13 | 36 2023-10-11 11:15:00 | 12 | 35 NULL | 160 | 406 (10 rows) ``` 方便一點了. ## 輔助表搭配PostgreSQL Range Data Type ```sql= create temp table booking_aux2 ( bin timestamp , timsrange tsrange ); insert into booking_aux2 select sts, tsrange(sts, ets) from booking_aux; select * from booking_aux2; bin | timsrange ---------------------+----------------------------------------------- 2023-10-11 09:00:00 | ["2023-10-11 09:00:00","2023-10-11 10:00:00") 2023-10-11 10:00:00 | ["2023-10-11 10:00:00","2023-10-11 10:08:00") 2023-10-11 10:08:00 | ["2023-10-11 10:08:00","2023-10-11 10:18:00") 2023-10-11 10:18:00 | ["2023-10-11 10:18:00","2023-10-11 10:27:00") 2023-10-11 10:27:00 | ["2023-10-11 10:27:00","2023-10-11 10:34:00") 2023-10-11 10:34:00 | ["2023-10-11 10:34:00","2023-10-11 10:44:00") 2023-10-11 10:44:00 | ["2023-10-11 10:44:00","2023-10-11 11:00:00") 2023-10-11 11:00:00 | ["2023-10-11 11:00:00","2023-10-11 11:15:00") 2023-10-11 11:15:00 | ["2023-10-11 11:15:00","2023-10-11 11:30:00") (9 rows) select ba.bin , count(*) , sum(qty) from booking b right join booking_aux2 ba on b.ts <@ ba.timsrange group by rollup(bin) order by bin nulls last; bin | count | sum ---------------------+-------+----- 2023-10-11 09:00:00 | 31 | 71 2023-10-11 10:00:00 | 11 | 28 2023-10-11 10:08:00 | 25 | 65 2023-10-11 10:18:00 | 18 | 46 2023-10-11 10:27:00 | 12 | 30 2023-10-11 10:34:00 | 13 | 33 2023-10-11 10:44:00 | 25 | 62 2023-10-11 11:00:00 | 13 | 36 2023-10-11 11:15:00 | 12 | 35 NULL | 160 | 406 (10 rows) ``` 這就PostgreSQL才有的資料型態,會比較方便,也會快一些,因為只要做一次比較運算 <@, 不用做兩種比較運算. ## 使用PostgreSQL 的Array 觀察上面的時段,頭尾銜接部分會出現重複輸入.之前我有其他分享,有介紹了Array.我們可以將輔助表轉為Array的形式. ```sql= with t1 (tsarray) as ( select '{"2023-10-11 09:00:00", "2023-10-11 10:00:00", "2023-10-11 10:08:00", "2023-10-11 10:18:00", "2023-10-11 10:27:00", "2023-10-11 10:34:00", "2023-10-11 10:44:00", "2023-10-11 11:00:00", "2023-10-11 11:15:00"}' ::timestamp[] ) select width_bucket(b.ts, t1.tsarray) as bin , count(*) as cnt , sum(qty) as sumqty from booking b , t1 group by rollup(bin); bin | cnt | sumqty ------+-----+-------- NULL | 160 | 406 4 | 18 | 46 6 | 13 | 33 2 | 11 | 28 9 | 12 | 35 7 | 25 | 62 3 | 25 | 65 1 | 31 | 71 5 | 12 | 30 8 | 13 | 36 (10 rows) -- 雖然有計算出來,但是透過 width_bucket()得到的分組是整數型態 -- 我們還可以透過 unnest() 再將 Array展開 with t1 (tsarray) as ( select '{"2023-10-11 09:00:00", "2023-10-11 10:00:00", "2023-10-11 10:08:00", "2023-10-11 10:18:00", "2023-10-11 10:27:00", "2023-10-11 10:34:00", "2023-10-11 10:44:00", "2023-10-11 11:00:00", "2023-10-11 11:15:00"}' ::timestamp[] ), bookingcount as ( select width_bucket(b.ts, t1.tsarray) as bin , count(*) as cnt , sum(qty) as sumqty from booking b , t1 group by rollup(bin) ) select ts_min , sn , coalesce(bc.cnt, 0) as cnt , coalesce(bc.sumqty, 0) as sumqty from t1 cross join unnest(t1.tsarray) with ordinality as c(ts_min, sn) full outer join bookingcount bc on c.sn = bc.bin order by c.sn nulls last; ts_min | sn | cnt | sumqty ---------------------+------+-----+-------- 2023-10-11 09:00:00 | 1 | 31 | 71 2023-10-11 10:00:00 | 2 | 11 | 28 2023-10-11 10:08:00 | 3 | 25 | 65 2023-10-11 10:18:00 | 4 | 18 | 46 2023-10-11 10:27:00 | 5 | 12 | 30 2023-10-11 10:34:00 | 6 | 13 | 33 2023-10-11 10:44:00 | 7 | 25 | 62 2023-10-11 11:00:00 | 8 | 13 | 36 2023-10-11 11:15:00 | 9 | 12 | 35 NULL | NULL | 160 | 406 (10 rows) ``` 這樣我們就透過Array,搭配width_bucket(), unnest() 來解決不定時段統計的問題.當然也不見得是時段,凡是有範圍的,都可以處理,來得到直方圖. 若是固定的,那就更好處理了.這樣就產生了. ```sql= with t1 as ( select array_agg(a) as tsarray from generate_series( timestamp '2023-10-11 09:00:00' , timestamp '2023-10-11 11:15:00' , interval '15 minutes') a ) ``` 這會與date_bin() 等效,但是沒有date_bin()方便. 不過width_bucket() 在PostgreSQL 很早就有了,date_bin()是14版之後才有. ## 結語 PostgreSQL的Array可以協助我們做反正規化,減少table,提供很多彈性. ## 感謝 感謝 海綿寶寶 (antijava) 大大.