###### tags: `小講堂` `COSCUP` `COSCUP2020` # PostgreSQL Anonymizer 介紹 - 張友謙 {%youtube 99yIhu6HtcA %} --- 關於資料保護的幾種方式 第一種就是清掉,壓掉! 例如使用 ```sql update tbl set name = NULL; update tbl set points = 0; update tbl set password = "這是假的" ``` 好處是 簡單粗暴. 但是雙面刃! 不利於 rdbms 的 integrity constraints (資料完整性) check / not null 被破壞. 不利於做功能測試 第二種進化了一下,貍貓換太子 使用亂數 md5 等來替代. ```sql update tbl set name = md5(random()::text); update tbl set points = floor(100 * random()); ``` 好處是 簡單快速 避免了 null對 not null 的破壞. 但還是不利於做功能測試 第三種 加料 增加雜訊 例如 ```sql update tbl set points = points * ((random() * 2 - 1) * 0.3 + 1); ``` 這樣會飄移 +/- 30% 缺點是 僅對 數值 或 日期 型態 可以運算的有效. 第四種 加密 PostgreSQL 有 pgcrypto extension 可以安裝使用. ```sql create extension pgcrypto; select crypt('小島南', gen_salt('md5')) as grl1 , crypt('初川南', gen_salt('md5')) as grl2; +------------------------------------+------------------------------------+ | grl1 | grl2 | +------------------------------------+------------------------------------+ | $1$uEJExxcX$e8dtb974Qr3TWfOyYeIsH0 | $1$HecNaMyL$GMxmIhC2IJafea0bO8Gbj1 | +------------------------------------+------------------------------------+ (1 row) Time: 16.205 ms ``` 優點是 關聯性或 unique 限制可以保持 缺點是 不利於功能測試 若key被偷,資料可能會被還原. 第五種 洗牌 ```sql with t1 as ( select row_number() over(order by random()) n , points as points1 from tbl ), t2 as ( select row_number() over(order by random()) n , id as id2 from tbl ) update tbl set points = points1 from t1 join t2 on t1.n = t2.n where id = t2.id2; ``` 優點是 資料是有意義的,不是亂碼型態. 對FK限制會比較方便. 缺點是 SQL指令較複雜. 若資料型態是 boolean 時,洗牌效果不佳. 第六種 建立產生假資料的函數 缺點是 不好開發 第七種 部分遮掩 ```sql select '******' || substring(phone, 6) as nphone from (values ('0900123456'), ('0900234567')) as t(phone); +-------------+ | nphone | +-------------+ | ******23456 | | ******34567 | +-------------+ (2 rows) ``` 優點是簡單好上手 缺點是 還是會被認出來 僅適合 char / text 型態 第八種 改變為概略描述 利用 Postgresql 特有的 range data type ```sql select age , int4range(age/10*10, (age/10+1)*10) as agerang from (values (28), (25), (36)) as t(age); +-----+---------+ | age | agerang | +-----+---------+ | 28 | [20,30) | | 25 | [20,30) | | 36 | [30,40) | +-----+---------+ (3 rows) ``` 優點是 有利於做資料分析 缺點是 改變了資料型態. PG Only. ------------------------------------ 接著來介紹主角, PostgreSQL Anonymizer 有提供許多方便好用的函數 亂數的 ```sql select anon.random_date_between('2001-08-23', current_date); +-------------------------------+ | random_date_between | +-------------------------------+ | 2016-12-31 17:30:30.330456+08 | +-------------------------------+ select anon.random_phone() , anon.random_phone('0900'); +--------------+---------------+ | random_phone | random_phone | +--------------+---------------+ | 0650844424 | 0900142016179 | +--------------+---------------+ (1 row) ``` 增加雜訊的 ```sql create table it0729 ( dt timestamp not null , val integer not null ); insert into it0729 values (timestamp '2010-03-25 13:45:27', 200); select anon.add_noise_on_datetime_column('it0729', 'dt', '5 years'); +------------------------------+ | add_noise_on_datetime_column | +------------------------------+ | t | +------------------------------+ 改變了 table 裡的內容. select * from it0729; +--------------------------+-----+ | dt | val | +--------------------------+-----+ | 2006-07-03 22:38:49.7328 | 200 | +--------------------------+-----+ select anon.add_noise_on_numeric_column('it0729', 'val', 0.33); +-----------------------------+ | add_noise_on_numeric_column | +-----------------------------+ | t | +-----------------------------+ select * from it0729; +--------------------------+-----+ | dt | val | +--------------------------+-----+ | 2006-07-03 22:38:49.7328 | 158 | +--------------------------+-----+ ``` 洗牌的 ```sql create table it0729b ( id integer generated always as identity , val integer not null ); insert into it0729b (val) values (234), (789), (123); format:anon.shuffle_column(shuffle_table, shuffle_column, primary_key) select anon.shuffle_column('it0729b', 'val', 'id'); +----------------+ | shuffle_column | +----------------+ | t | +----------------+ select * from it0729b; +----+-----+ | id | val | +----+-----+ | 1 | 123 | | 2 | 234 | | 3 | 789 | +----+-----+ 注意,這會改變table, 不要用在 PK 欄位,可以用在FK欄位. ``` 產生假(虛擬)資料的 ```sql select anon.fake_first_name() , anon.fake_last_name() , anon.fake_email(); +-----------------+----------------+-----------------------+ | fake_first_name | fake_last_name | fake_email | +-----------------+----------------+-----------------------+ | reuven | Gilcreast | tcoombe3v@arizona.edu | +-----------------+----------------+-----------------------+ select anon.fake_country(); +--------------+ | fake_country | +--------------+ | Brazil | +--------------+ select anon.fake_city(); +-----------+ | fake_city | +-----------+ | Andes | +-----------+ select anon.fake_city_in_country('Taiwan'); +----------------------+ | fake_city_in_country | +----------------------+ | Nantou | +----------------------+ select anon.fake_city_in_country('Japan'); +----------------------+ | fake_city_in_country | +----------------------+ | Tateyama | +----------------------+ select anon.fake_company() , anon.fake_company(); +------------------+--------------+ | fake_company | fake_company | +------------------+--------------+ | Beyond Cool Limo | Chute Crazy | +------------------+--------------+ ``` text 型態的 ```sql select anon.lorem_ipsum( words := 5); +----------------------------------+ | lorem_ipsum | +----------------------------------+ | Sed vel ullamcorper ante. Nullam | +----------------------------------+ select anon.lorem_ipsum( characters := 8 ); +-------------+ | lorem_ipsum | +-------------+ | Ut viver | +-------------+ ``` 這會產生 5段,3段,4段. 在此不列出 ```sql select anon.lorem_ipsum(); , anon.lorem_ipsum(3) , anon.lorem_ipsum( paragraphs := 4 ); ``` 資料是放在此路徑下, load 時 會讀取裡面的 csv 到套件相關的 table, 我們也可以自行補充. ``` [16:06:26] /usr/local/share/postgresql/extension/anon |=> ls anon--0.4.1.sql first_name.csv identifiers_fr_FR.csv anon--0.5.0.sql iban.csv last_name.csv anon--0.6.0.sql identifier_en_US.csv lorem_ipsum.csv city.csv identifier_fr_FR.csv siret.csv company.csv identifiers_category.csv email.csv identifiers_en_US.csv ----- ``` 部分遮掩 ```sql select anon.partial('abcdefgh',1,'xxxx',3); +----------+ | partial | +----------+ | axxxxfgh | +----------+ select anon.partial_email('daamien@gmail.com'); +-----------------------+ | partial_email | +-----------------------+ | da******@gm******.com | +-----------------------+ ``` 改變為概略描述 ```sql 數值型 generalize_int4range generalize_int8range generalize_numrange select val , anon.generalize_int4range(val, 150) from it0729; +-----+----------------------+ | val | generalize_int4range | +-----+----------------------+ | 158 | [150,300) | +-----+----------------------+ ``` 日期型 ```sql generalize_daterange generalize_tsrange generalize_tstzrange select dt , anon.generalize_tsrange(dt, 'year') , anon.generalize_tsrange(dt, 'decade') from it0729; +--------------------------+-----------------------------------------------+-----------------------------------------------+ | dt | generalize_tsrange | generalize_tsrange | +--------------------------+-----------------------------------------------+-----------------------------------------------+ | 2006-07-03 22:38:49.7328 | ["2006-01-01 00:00:00","2007-01-01 00:00:00") | ["2000-01-01 00:00:00","2010-01-01 00:00:00") | +--------------------------+-----------------------------------------------+-----------------------------------------------+ ``` -- 查詢現有安裝版本的函數 ```sql select p.proname -- , p.prosrc from pg_namespace n join pg_proc p on n.oid = p.pronamespace where n.nspname = 'anon' order by 1; +------------------------------+ | proname | +------------------------------+ | add_noise_on_datetime_column | | add_noise_on_numeric_column | ... | unload | | unmask_role | +------------------------------+ (86 rows) --------- ``` 來點比較不一樣的應用方式 前幾天介紹 cursor 的一個 table ```sql table it0721; -- 這是 psql 的方便的 meta command +----+------------+ | id | girl | +----+------------+ | 1 | 小島南 | | 2 | 初川南 | | 3 | 相澤南 | | 4 | 山岸逢花 | | 5 | 奥田咲 | | 6 | 夢乃あいか | | 7 | 凛音とうか | | 8 | 深田えいみ | | 9 | 永井マリア | | 10 | 紗々原ゆり | | 11 | 架乃ゆら | | 12 | 七沢みあ | | 13 | 小野六花 | | 14 | 八木奈々 | +----+------------+ (14 rows) ``` 使用以下的指令搭配 Anonymizer 的函數 ```sql SECURITY LABEL FOR anon ON COLUMN it0721.girl IS 'MASKED WITH FUNCTION anon.fake_last_name()'; 接著讓他生效,要這樣動手 SELECT anon.anonymize_database(); +--------------------+ | anonymize_database | +--------------------+ | t | +--------------------+ (1 row) Time: 197.381 ms select * from it0721; +----+-------------+ | id | girl | +----+-------------+ | 1 | Ottesen | | 2 | Farry | | 3 | Jannetti | | 4 | Challis | | 5 | Skees | | 6 | Bour | | 7 | Debrie | | 8 | Hobin | | 9 | Tesreau | | 10 | Breitenbach | | 11 | Bong | | 12 | Abrecht | | 13 | Syer | | 14 | Sawina | +----+-------------+ (14 rows) ``` 將 security label 清除. ```sql SECURITY LABEL FOR anon ON COLUMN it0721.girl IS NULL; ``` 注意! 這樣做 資料是變更了,回不去了! 這是屬於 In-Place Anonymization 同樣的方式還有上面有介紹過的 ``` anon.add_noise_on_numeric_column(table, column, ratio) anon.add_noise_on_datetime_column(table, column, interval) ```