時間序列資料管理 TimescaleDB (1) 功能體驗

# 時間序列資料管理 TimescaleDB (1) 功能體驗 * [官方文檔](https://docs.timescale.com) ## TimescaleDB 是什麼? 為了時間序列資料而生的 PostgreSQL Extension，提供基於時間的自動分區、資料壓縮等功能。 ### Hypertables TimescaleDB 實現分區的表稱為 [Hypertable](https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/)。一張 Hypertable 下包含多個稱為 Chunk 的分區，每個 Chunk 都被分配存放某一時間區間內的資料。除了以時間維度切分 Chunk，還可以**額外**以空間維度做近一步的切分，則一個 Chunk 下就會再依空間維度分成多個子集。 ### 自動新建分區當對 Hypertable `INSERT` 資料時，若該資料的時間尚無任何對應的 Chunk 可以存放，==TimescaleDB 會自動建立一個相應的 Chunk 存放該資料==。這點相較使用 Partitioned Tabled 便捷許多。 ### 查詢效能 Hypertable 提升查詢效能的原理是避免全表查詢，只查詢和 `WHERE` 條件的時間範圍相關的 Chunk，以提升命中資料的效率。預設啟用的 [Chunk Skipping](https://docs.timescale.com/use-timescale/latest/hypertables/improve-query-performance/#how-chunk-skipping-works) 功能可判斷跳過不需查詢的 Chunk，不一定需要將連續時間範圍內所有 Chunk 拉出來搜尋。此外，可利用 TimescaleDB 提供的 Hyperfunctions 來進行查詢。有非常多工具可利用，參閱 [Hyperfunctions 文檔](https://docs.timescale.com/api/latest/hyperfunctions/)。 ### 資料壓縮啟動[資料壓縮](https://docs.timescale.com/use-timescale/latest/compression/about-compression/)功能以節省資料儲存成本，號稱可將 Chunk 佔用容量縮小 90% 以上，資料經過壓縮還是可以查詢，甚至查詢效率更好。 ### 逾期資料清理設定資料保留策略，自動清除逾期的 Chunk。 ### 連續聚合 View 是一種 Materialized View，與 PostgreSQL 原生 Materialized View 的差異如下： 1. 排程定時刷新資料，只會用異動的資料做==增量運算==，不會每次刷新都要拿全部的 Raw Data 重算。 2. 雖然不是 Real-Time 刷新 View，但提供「即時聚合」功能，查詢時可以將連續聚合資料和最新原始資料結合提供最新結果。 ## 安裝在既有 PostgreSQL Container 上安裝 TimescaleDB 的配置可以參考[我的 GitHub](https://github.com/michelle0809/PostgresDockerCompose/tree/with-timescaledb)，這份配置是將 TimescaleDB 安裝在我既有的 Container 上。啟動 Container 後，先用 TimescaleDB 提供的參數調整工具 timescaledb-tune 直接調參。 ![image](https://hackmd.io/_uploads/HyJUIS7xgg.png) 記錄一下 timescaledb-tune 調整了些什麼。 ![image](https://hackmd.io/_uploads/rJtUvrmexl.png) 接下來連線到 PostgreSQL 即可安裝 timescaledb Extension。 <center class="half"> <img src="https://hackmd.io/_uploads/rJBmzPs1ee.png" width="300"/> </center> 也可以直接使用官方提供的[已安裝好 TimescaleDB 的 Image](https://hub.docker.com/r/timescale/timescaledb)。 ## 功能 ### Hypertable 1. 建立測試用的 Schema ```sql CREATE SCHEMA test; ``` 2. 建立 Normal Table ```sql CREATE TABLE test.test_timescaledb ( id uuid default gen_random_uuid(), type_id int not null, -- space scale key content varchar not null, created_at timestamp with time zone not null default now(), -- time scale key primary key (id, created_at) ); ``` 3. 塞入測試資料 ```sql INSERT INTO test.test_timescaledb (type_id, content, created_at) SELECT 1 + floor(random() * 3) as type_id, concat('content-', md5(random()::text)) as content, now() - (random() * interval '30 days') as created_at FROM generate_series(1, 1000000) as id; ``` 4. 將有資料的 Normal Table 轉為 Hypertable ```sql SELECT create_hypertable( relation := 'test.test_timescaledb'::regclass, -- 要轉換為 hypertable 的 table dimension := by_range('created_at', INTERVAL '1 day'), -- 維度 create_default_indexes := true, -- 建立預設索引 if_not_exists := true, -- 如果已經存在則不會建立 migrate_data := true -- 是否需遷移資料 ); ``` ![image](https://hackmd.io/_uploads/SJ5ytSXegl.png) 5. 查看 Hypertable 狀況查看資料表下共有 31 個 Chunk，且都放在 `_timescaldb_internal` Schema 下。 ```sql SELECT public.show_chunks('test.test_timescaledb') ``` ![image](https://hackmd.io/_uploads/H1yKGhXlgg.png) 每個 Chunk 大約 6000kb。 ```sql SELECT pg_size_pretty(total_bytes) as total_size, * FROM public.chunks_detailed_size('test.test_timescaledb'); ``` ![image](https://hackmd.io/_uploads/SkBeQ27lgx.png) 查看 Hypertable 資料表大小 ![image](https://hackmd.io/_uploads/H1BIm2Qxel.png) 6. 設定資料表壓縮 ```sql ALTER TABLE test.test_timescaledb SET ( timescaledb.compress = true, -- 啟用壓縮（必填） timescaledb.compress_orderby = 'created_at DESC', -- 壓縮排序（預設用 {time key} DESC 排序) ); ``` ![image](https://hackmd.io/_uploads/SJykCS7xeg.png) 7. 加入壓縮策略 ```sql SELECT add_compression_policy( hypertable := 'test.test_timescaledb'::regclass, -- 要加入壓縮策略的 hypertable compress_after := INTERVAL '5 days' -- 包含早於此時間資料的 chunk 會被壓縮 (與 compress_created_before 互斥) -- compress_created_before := INTERVAL '5 days' -- 創建時間早於此時間的 chunk 會被壓縮 (與 compress_after 互斥) ); ``` ![image](https://hackmd.io/_uploads/BJ-m0BQgle.png) 8. 查看設定結果用步驟 7 返回的 job id 可以到 `timescaledb_information` 下的 `jobs`、`job_stats` 等 View 表查看 job 資訊。如下圖，可以看到在步驟 7 加入壓縮策略時就執行了首次壓縮，下次執行時間為 12 小時後。 ```sql SELECT * FROM timescaledb_information.job_stats WHERE job_id = 1000; ``` ![image](https://hackmd.io/_uploads/SJDME3Xllg.png) 9. 確認壓縮結果可以看到資料表整體從 181MB -> 81MB，佔用容量下降約 55%。 ![image](https://hackmd.io/_uploads/r1IuVhXxeg.png) 10. 查看壓縮後 Chunk 清單壓縮後可以看到 `_timescaledb_internal` Schema 下多了很多 compress 開頭的 chunk，這些就是經過壓縮的 chunk。而這些 compress chunk 和 chunk 的關係，可在 `_timescaledb_catalog.chunk` 查到，如下圖，有 `compresses_chunk_id` 的就是已壓縮的 chunk，此時從 hypertable 查詢已壓縮時間的資料或直接查詢這個 chunk，資料就是從 compress chunk 中解壓縮取得的。 ![image](https://hackmd.io/_uploads/rkXtdp7lxg.png) 11. 查看已壓縮資料長什麼樣以上圖的 `_hyper_1_80_chunk` 和對應的 `compress_hyper_2_122_chunk` 為例，記錄一下兩個 chunk 分別長什麼樣。首先是 `_hyper_1_80_chunk`，這個 chunk 會返回從 `compress_hyper_2_122_chunk` 解壓的資料，資料規格就和前面建表定義的一樣。 ![image](https://hackmd.io/_uploads/r1LNRpQxlx.png) 而 `compress_hyper_2_122_chunk` 裡就只有 36 行資料，並且是看不懂的樣子XD ![image](https://hackmd.io/_uploads/rk7v0pmglg.png) 12. 測試塞入未來資料從 `chunks` View 表可以看到每個 Chunks 的時間點。 ![image](https://hackmd.io/_uploads/rkrennXele.png) 塞入未來的資料後，檢查 `chunks` View，可以看到確實自動新增了 Chunk。 ![image](https://hackmd.io/_uploads/HkI_2h7xgl.png) #### 以 biging 欄位為 time scale key 有時候資料庫規格可能不是用 timestamp，而是存為 bigint 型別的 unix timestamp。這樣也是可以設定為 Hypertable 的，但需要提供將時間轉換為數值的 Function 給 TimescaleDB。 :::spoiler 範例 1. 建立將時間轉換為 unixtimestamp 的 Function ```sql CREATE OR REPLACE FUNCTION public.to_unixtimestamp(timestamp at time zone) RETURNS BIGINT LANGUAGE plpgsql STABLE AS $$ BEGIN RETURN round(extract(epoch from $1) * 1000); END; $$; ``` 2. 建立 unix_now Function，供後續 Hypertable 設定使用 ```sql CREATE OR REPLACE FUNCTION public.unix_now() RETURNS BIGINT LANGUAGE plpgsql STABLE AS $$ BEGIN RETURN public.to_unixtimestamp(now()); END; $$; ``` 3. 建立 Normal Table ```sql CREATE TABLE test.test_timescaledb_bigint ( id uuid default gen_random_uuid(), type_id int not null, -- space scale key content varchar not null, created_at bigint not null default public.unix_now(), -- time scale key primary key (id, created_at) ); ``` 4. 存入測試資料 ```sql INSERT INTO test.test_timescaledb_bigint (type_id, content, created_at) SELECT 1 + floor(random() * 3) as type_id, concat('content-', md5(random()::text)) as content, public.unix_now() - (random() * 30 * 86400000) as created_at FROM generate_series(1, 1000000) as id; ``` ![image](https://hackmd.io/_uploads/rypALn7gxl.png) 5. 轉換為 Hypertable ```sql SELECT create_hypertable( relation := 'test.test_timescaledb_bigint'::regclass, -- 要轉換為 hypertable 的 table dimension := by_range('created_at', 86400000), -- 維度 create_default_indexes := true, -- 建立預設索引 if_not_exists := true, -- 如果已經存在則不會建立 migrate_data := true -- 是否需遷移資料 ); ``` ![image](https://hackmd.io/_uploads/Bkk-w3Xgxl.png) 6. 設定 hypertable 轉換當前時間的函數 ```sql SELECT set_integer_now_func( hypertable := 'test.test_timescaledb_bigint'::regclass, -- 要設定當前時間的函數的資料表 integer_now_func := 'public.unix_now'::regproc, -- 當前時間的函數 replace_if_exists := true -- 如果已經存在則覆蓋設定 ); ``` ![image](https://hackmd.io/_uploads/BkT_D2mxll.png) 7. 查看設定結果 ```sql SELECT * FROM timescaledb_information.dimensions; ``` ![image](https://hackmd.io/_uploads/BJrn_37geg.png) ::: ### Continuous Aggregation TimescaleDB 的 Continuous Aggregates Materialized View 不支援使用 CTE、Subquery、Window Functions、Set Returning Functions，目前看來是只能整理單純的統計資料為 Materialized View，再基於 Materialized View 進一步處理。 1. 建立 Continuous Aggregates Materialized View ```sql CREATE MATERIALIZED VIEW test.test_timescaledb_hourly_type_stats WITH (timescaledb.continuous) AS -- 設定為 Continuous Aggregates SELECT time_bucket('1 hour', created_at) as hour, type_id, count(*) as count FROM test.test_timescaledb GROUP BY hour, type_id ORDER BY hour DESC, type_id; ``` 2. 啟用 Real-Time Aggregates 功能 ```sql ALTER MATERIALIZED VIEW test.test_timescaledb_hourly_type_stats SET (timescaledb.materialized_only = false); ``` 3. 設定每小時刷新的 Continuous Aggregate 策略 ```sql SELECT add_continuous_aggregate_policy( continuous_aggregate := 'test.test_timescaledb_hourly_type_stats'::regclass, -- 要設定刷新策略的 Continuous Aggregate start_offset := INTERVAL '14 days', -- 應大於 end_offset，每次更新過去此時間內的資料 end_offset := INTERVAL '1 hour', -- 每次更新過去資料時，排除此時間內的資料 schedule_interval := INTERVAL '30 minutes' -- 刷新時間間隔 ); ``` ![image](https://hackmd.io/_uploads/r1QuOFEgex.png) 4. 查看刷新策略可以看到下次將於 30 分鐘後刷新。 ```sql SELECT * FROM timescaledb_information.jobs WHERE job_id = 1002; ``` ![image](https://hackmd.io/_uploads/ByILYYVeex.png) ![image](https://hackmd.io/_uploads/r1KHFtVlee.png) 5. 查看 Continuous Aggregate 的資料 ```sql SELECT * FROM test.test_timescaledb_hourly_type_stats; ``` <center class="half"> <img src="https://hackmd.io/_uploads/BkZjYYExgx.png" width="300"/> </center> 6. 基於 Continuous Aggregate 的資料做所需的資料利用 ![image](https://hackmd.io/_uploads/rJBVcKVelg.png) 7. 測試增刪改資料後做查詢 ```sql INSERT INTO test.test_timescaledb (type_id, content, created_at) VALUES (1, concat('content-', md5(random()::text)), now()); DELETE FROM test.test_timescaledb WHERE created_at >= '2025-05-03 21:00:00' AND created_at < '2025-05-03 22:00:00'; UPDATE test.test_timescaledb SET type_id = 4 WHERE created_at >= '2025-05-03 22:00:00' AND created_at < '2025-05-03 23:00:00' AND type_id = 1; ``` 查詢發現 ==INSERT 的資料有立即反應，但 UPDATE、DELETE 的結果沒有立即反應。== ![image](https://hackmd.io/_uploads/rkEbstVlxe.png) 8. 觸發刷新 Continuous Aggregate 後再次查詢 ```sql CALL refresh_continuous_aggregate( continuous_aggregate := 'test.test_timescaledb_hourly_type_stats'::regclass, -- 要刷新的 Continuous Aggregate window_start := now() - INTERVAL '1 day', -- 開始時間 window_end := now() -- 結束時間 ); ``` ![image](https://hackmd.io/_uploads/rkt12KVglx.png) 查詢後 UPDATE、DELETE 的結果也有反應到 View 了。 ![image](https://hackmd.io/_uploads/ByRf3Y4gxl.png) ## 階段性使用感想沒想到光是體驗功能就花了蠻多時間，還來不及比較 TimescaleDB 和 Partition 在效能、Lock 用量的差異，現階段先就功能面記錄心得XD TimescaleDB 提供許多功能，會吸引我從 Partition 轉換到 TimescaleDB 的部分有： 1. 一般資料表可直接轉換為 Hypertable，無論表內有無資料。 2. 自動新增分區，不需要預先建立或準備 DEFAULT 分區。 3. 資料壓縮的壓縮比比預想的還要好，且壓縮後仍可CRUD，可以作為一種冷資料保存策略。使用後覺得沒預期好的部分 1. HyperFunctions 的 `time_bucket_gapfill()` 使用限制挺多。 :::spoiler 案例用 `time_bucket()` 執行以下 SQL 沒有問題。 ```sql SELECT time_bucket('1 hour', created_at) as hour, type_id, count(*) as count FROM test.test_timescaledb WHERE created_at >= current_date AND created_at < now() GROUP BY 1, 2 ``` <center class="half"> <img src="https://hackmd.io/_uploads/BkBUJqEgxx.png" width="300"/> </center> 改成 `time_bucket_gapfill()` 就會報錯。 ```sql SELECT time_bucket_gapfill('1 hour', created_at) as hour, type_id, count(*) as count FROM test.test_timescaledb WHERE created_at >= current_date AND created_at < now() GROUP BY 1, 2 ``` <center class="half"> <img src="https://hackmd.io/_uploads/BkPiy5Eexx.png" width="300"/> </center> 官方文檔是建議説用 `WHERE` 提供時間範圍優於使用 `start`、`finish` 參數，但上圖的錯誤訊息看起來無法識別或無法解析 `WHERE` 條件的時間設定? ![image](https://hackmd.io/_uploads/Bye6ecEllx.png) 依文檔說的改成輸入 timestamptz 型別時間到 start、finish，但說只能用 simple expression? ![image](https://hackmd.io/_uploads/BymfQqNele.png) 最後改成寫死的時間，才可以運作。 ![image](https://hackmd.io/_uploads/SJ4HX94elx.png) ::: 2. Continuous Aggregates Materialized View 不支援使用 CTE、Subquery、Window Functions、Set Returning Functions，非常可惜。 3. Real-Time Aggregate 功能只能即時反應新增的資料，更新、刪除的資料不會即時反應，要等下次刷新到才會呈現。雖然有點可惜，但如果是 IoT 資料通常只會不斷新增而不會更新刪除，我想還是挺適合這類資料的。