PostgreSQL複製schema下table結構到另一schema的procedure及取樣的研討

# PostgreSQL複製schema下table結構到另一schema的procedure及取樣的研討在資料庫的日常使用,會有將schema下的table結構複製到另一schema下,供做測試,或是做移轉等需求. 為方便說明,將建立三個 schema 1. myfunc 用來放我們的 function , procedure 2. s0822 用來放source tables 3. t0822 用來放target tables ## 複製的方法 ```sql= create schema myfunc; create schema s0822; create schema t0822; ``` ### 建立 source tables ```sql= create table s0822.t001 ( id int generated always as identity primary key , txt text not null ); create table s0822.t002 ( id int generated always as identity primary key , t001_id int not null references s0822.t001(id) , txt text not null ); ``` ### 建立一個 function 產生指定 schema 下一般tables 的名稱 ```sql= create or replace function myfunc.f_get_schema_tables ( in in_schema text , out relname text ) returns setof text language sql as $code$ select c.relname::text from pg_class c join pg_namespace n on c.relnamespace = n.oid where n.nspname::text = in_schema and c.relkind = 'r' order by 1; $code$; ``` ### 建立複製結構的 procedure ```sql= create or replace procedure myfunc.p_clone_schmea_tables ( in source_schema text , in target_schema text ) language plpgsql as $code$ declare curs cursor for select relname from myfunc.f_get_schema_tables(source_schema); rel text; begin open curs; loop fetch curs into rel; exit when not found; execute 'create table if not exists ' || target_schema || '.' || rel || ' (like ' || source_schema || '.' || rel || ' including all)'; end loop; close curs; end; $code$; ``` ### 使用 procedure 來複製 ```sql= call myfunc.p_clone_schmea_tables('s0822', 't0822'); ``` ### 觀察結果 ```sql= \dt s0822.* List of relations Schema | Name | Type | Owner --------+------+-------+------- s0822 | t001 | table | nana s0822 | t002 | table | nana (2 rows) \dt t0822.* List of relations Schema | Name | Type | Owner --------+------+-------+------- t0822 | t001 | table | nana t0822 | t002 | table | nana (2 rows) \d s0822.t001 Table "s0822.t001" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+------------------------------ id | integer | | not null | generated always as identity txt | text | | not null | Indexes: "t001_pkey" PRIMARY KEY, btree (id) Referenced by: TABLE "s0822.t002" CONSTRAINT "t002_t001_id_fkey" FOREIGN KEY (t001_id) REFERENCES s0822.t001(id) \d t0822.t001 Table "t0822.t001" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+------------------------------ id | integer | | not null | generated always as identity txt | text | | not null | Indexes: "t001_pkey" PRIMARY KEY, btree (id) \d s0822.t002 Table "s0822.t002" Column | Type | Collation | Nullable | Default ---------+---------+-----------+----------+------------------------------ id | integer | | not null | generated always as identity t001_id | integer | | not null | txt | text | | not null | Indexes: "t002_pkey" PRIMARY KEY, btree (id) Foreign-key constraints: "t002_t001_id_fkey" FOREIGN KEY (t001_id) REFERENCES s0822.t001(id) \d t0822.t002 Table "t0822.t002" Column | Type | Collation | Nullable | Default ---------+---------+-----------+----------+------------------------------ id | integer | | not null | generated always as identity t001_id | integer | | not null | txt | text | | not null | Indexes: "t002_pkey" PRIMARY KEY, btree (id) ``` 注意到除了 **Foreign key constraints** 是都複製了. 因為我們是使用了 like INCLUDING ALL 細節可以參考官網文件 https://www.postgresql.org/docs/current/sql-createtable.html ## 取樣的方法 ### 先建立一個放亂數資料的table ```sql= create table s0822.t ( a integer , b text , c boolean ); select setseed(0.5); insert into s0822.t with r as ( select id , chr((32+random()*94)::integer) , random() < 0.01 from generate_series(1, 1e5) as s(id) ) select * from r order by random(); analyze s0822.t; select attname , correlation from pg_stats where schemaname = 's0822' and tablename = 't'; attname | correlation ---------+-------------- a | 0.0044361027 b | 0.0063023297 c | 0.9826269 (3 rows) ``` ### 建立四個 table 用來放取樣資料使用 like 方法, 但不使用 including all ```sql= create table t0822.t1 ( like s0822.t ); create table t0822.t2 ( like s0822.t ); create table t0822.t3 ( like s0822.t ); create table t0822.t4 ( like s0822.t ); ``` ### 取樣的方式在 select 的 tablesample 子句 https://www.postgresql.org/docs/current/sql-select.html TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ] ### 取樣10% 使用 system method ```sql= insert into t0822.t1 select * from s0822.t tablesample system(10); ``` ### 使用 bernoulli method ```sql= insert into t0822.t2 select * from s0822.t tablesample bernoulli(10); INSERT 0 9987 Time: 17.633 ms ``` ### 比較 ```sql= analyze t0822.t1; analyze t0822.t2; select attname , correlation from pg_stats where schemaname = 't0822' and tablename = 't1' order by 1; attname | correlation ---------+--------------- a | -0.0044052443 b | 0.0011416585 c | 0.9800325 (3 rows) select attname , correlation from pg_stats where schemaname = 't0822' and tablename = 't2' order by 1; attname | correlation ---------+--------------- a | 0.00066867325 b | 0.023340156 c | 0.98007065 (3 rows) ``` a 是 id,遞增, 取樣後相關性就差異很大,是正常現象. b, c 欄位的相關性, c 原本是 random() < 0.01 方式產生, b 原本是 chr((32+random()*94)::integer) 方式產生, 可以觀察到兩種取樣方式與原始table在統計值上的接近程度. 但是要知道 pg_stat 的相關性,也是抽樣的,並不是 100%. 以官方文件的說法, bernoulli method 會慢一些,但是效果會好一些.另外可以參考官網文件以及古博士的分享. https://island.postgresql.tw/2018/07/24/tablesample.html 另外還有 REPEATABLE, 這是讓我們可以設定取樣時的亂數種子,進而可以重現. ### 使用 bernoulli取樣 20%,亂數種子為 100 ```sql= insert into t0822.t3 select * from s0822.t tablesample bernoulli(20) REPEATABLE(100); INSERT 0 20075 Time: 23.151 ms insert into t0822.t4 select * from s0822.t tablesample bernoulli(20) REPEATABLE(100); INSERT 0 20075 Time: 22.981 ms --- analyze t0822.t3; analyze t0822.t4; select attname , correlation from pg_stats where schemaname = 't0822' and tablename = 't3' order by 1; attname | correlation ---------+-------------- a | 0.0013938936 b | 0.002332309 c | 0.98037195 (3 rows) select attname , correlation from pg_stats where schemaname = 't0822' and tablename = 't4' order by 1; attname | correlation ---------+-------------- a | 0.0013938936 b | 0.002332309 c | 0.98037195 (3 rows) ``` 可以觀察到,使用了同樣亂數種子,會取樣相同. ## 結論透過今天介紹的功能,對我們要做資料庫的資料移轉測試,驗證查詢方式的改善測試,結合運用,能夠方便及有效驗證.