###### tags: `小講堂` `COSCUP` `COSCUP2020`
# PostgreSQL Anonymizer 介紹 - 張友謙
{%youtube 99yIhu6HtcA %}
---
關於資料保護的幾種方式
第一種就是清掉,壓掉!
例如使用
```sql
update tbl
set name = NULL;
update tbl
set points = 0;
update tbl
set password = "這是假的"
```
好處是 簡單粗暴.
但是雙面刃!
不利於 rdbms 的 integrity constraints (資料完整性)
check / not null 被破壞.
不利於做功能測試
第二種進化了一下,貍貓換太子
使用亂數 md5 等來替代.
```sql
update tbl
set name = md5(random()::text);
update tbl
set points = floor(100 * random());
```
好處是 簡單快速
避免了 null對 not null 的破壞.
但還是不利於做功能測試
第三種 加料
增加雜訊
例如
```sql
update tbl
set points = points * ((random() * 2 - 1) * 0.3 + 1);
```
這樣會飄移 +/- 30%
缺點是
僅對 數值 或 日期 型態 可以運算的有效.
第四種 加密
PostgreSQL 有 pgcrypto extension 可以安裝使用.
```sql
create extension pgcrypto;
select crypt('小島南', gen_salt('md5')) as grl1
, crypt('初川南', gen_salt('md5')) as grl2;
+------------------------------------+------------------------------------+
| grl1 | grl2 |
+------------------------------------+------------------------------------+
| $1$uEJExxcX$e8dtb974Qr3TWfOyYeIsH0 | $1$HecNaMyL$GMxmIhC2IJafea0bO8Gbj1 |
+------------------------------------+------------------------------------+
(1 row)
Time: 16.205 ms
```
優點是
關聯性或 unique 限制可以保持
缺點是
不利於功能測試
若key被偷,資料可能會被還原.
第五種 洗牌
```sql
with
t1 as (
select row_number() over(order by random()) n
, points as points1
from tbl
),
t2 as (
select row_number() over(order by random()) n
, id as id2
from tbl
)
update tbl
set points = points1
from t1
join t2
on t1.n = t2.n
where id = t2.id2;
```
優點是
資料是有意義的,不是亂碼型態.
對FK限制會比較方便.
缺點是
SQL指令較複雜.
若資料型態是 boolean 時,洗牌效果不佳.
第六種 建立產生假資料的函數
缺點是
不好開發
第七種 部分遮掩
```sql
select '******' || substring(phone, 6) as nphone
from (values ('0900123456'), ('0900234567')) as t(phone);
+-------------+
| nphone |
+-------------+
| ******23456 |
| ******34567 |
+-------------+
(2 rows)
```
優點是簡單好上手
缺點是
還是會被認出來
僅適合 char / text 型態
第八種 改變為概略描述
利用 Postgresql 特有的 range data type
```sql
select age
, int4range(age/10*10, (age/10+1)*10) as agerang
from (values (28), (25), (36)) as t(age);
+-----+---------+
| age | agerang |
+-----+---------+
| 28 | [20,30) |
| 25 | [20,30) |
| 36 | [30,40) |
+-----+---------+
(3 rows)
```
優點是 有利於做資料分析
缺點是 改變了資料型態. PG Only.
------------------------------------
接著來介紹主角, PostgreSQL Anonymizer
有提供許多方便好用的函數
亂數的
```sql
select anon.random_date_between('2001-08-23', current_date);
+-------------------------------+
| random_date_between |
+-------------------------------+
| 2016-12-31 17:30:30.330456+08 |
+-------------------------------+
select anon.random_phone()
, anon.random_phone('0900');
+--------------+---------------+
| random_phone | random_phone |
+--------------+---------------+
| 0650844424 | 0900142016179 |
+--------------+---------------+
(1 row)
```
增加雜訊的
```sql
create table it0729 (
dt timestamp not null
, val integer not null
);
insert into it0729 values
(timestamp '2010-03-25 13:45:27', 200);
select anon.add_noise_on_datetime_column('it0729', 'dt', '5 years');
+------------------------------+
| add_noise_on_datetime_column |
+------------------------------+
| t |
+------------------------------+
改變了 table 裡的內容.
select * from it0729;
+--------------------------+-----+
| dt | val |
+--------------------------+-----+
| 2006-07-03 22:38:49.7328 | 200 |
+--------------------------+-----+
select anon.add_noise_on_numeric_column('it0729', 'val', 0.33);
+-----------------------------+
| add_noise_on_numeric_column |
+-----------------------------+
| t |
+-----------------------------+
select * from it0729;
+--------------------------+-----+
| dt | val |
+--------------------------+-----+
| 2006-07-03 22:38:49.7328 | 158 |
+--------------------------+-----+
```
洗牌的
```sql
create table it0729b (
id integer generated always as identity
, val integer not null
);
insert into it0729b (val) values
(234), (789), (123);
format:anon.shuffle_column(shuffle_table, shuffle_column, primary_key)
select anon.shuffle_column('it0729b', 'val', 'id');
+----------------+
| shuffle_column |
+----------------+
| t |
+----------------+
select *
from it0729b;
+----+-----+
| id | val |
+----+-----+
| 1 | 123 |
| 2 | 234 |
| 3 | 789 |
+----+-----+
注意,這會改變table, 不要用在 PK 欄位,可以用在FK欄位.
```
產生假(虛擬)資料的
```sql
select anon.fake_first_name()
, anon.fake_last_name()
, anon.fake_email();
+-----------------+----------------+-----------------------+
| fake_first_name | fake_last_name | fake_email |
+-----------------+----------------+-----------------------+
| reuven | Gilcreast | tcoombe3v@arizona.edu |
+-----------------+----------------+-----------------------+
select anon.fake_country();
+--------------+
| fake_country |
+--------------+
| Brazil |
+--------------+
select anon.fake_city();
+-----------+
| fake_city |
+-----------+
| Andes |
+-----------+
select anon.fake_city_in_country('Taiwan');
+----------------------+
| fake_city_in_country |
+----------------------+
| Nantou |
+----------------------+
select anon.fake_city_in_country('Japan');
+----------------------+
| fake_city_in_country |
+----------------------+
| Tateyama |
+----------------------+
select anon.fake_company()
, anon.fake_company();
+------------------+--------------+
| fake_company | fake_company |
+------------------+--------------+
| Beyond Cool Limo | Chute Crazy |
+------------------+--------------+
```
text 型態的
```sql
select anon.lorem_ipsum( words := 5);
+----------------------------------+
| lorem_ipsum |
+----------------------------------+
| Sed vel ullamcorper ante. Nullam |
+----------------------------------+
select anon.lorem_ipsum( characters := 8 );
+-------------+
| lorem_ipsum |
+-------------+
| Ut viver |
+-------------+
```
這會產生 5段,3段,4段. 在此不列出
```sql
select anon.lorem_ipsum();
, anon.lorem_ipsum(3)
, anon.lorem_ipsum( paragraphs := 4 );
```
資料是放在此路徑下, load 時 會讀取裡面的 csv 到套件相關的 table,
我們也可以自行補充.
```
[16:06:26] /usr/local/share/postgresql/extension/anon
|=> ls
anon--0.4.1.sql first_name.csv identifiers_fr_FR.csv
anon--0.5.0.sql iban.csv last_name.csv
anon--0.6.0.sql identifier_en_US.csv lorem_ipsum.csv
city.csv identifier_fr_FR.csv siret.csv
company.csv identifiers_category.csv
email.csv identifiers_en_US.csv
-----
```
部分遮掩
```sql
select anon.partial('abcdefgh',1,'xxxx',3);
+----------+
| partial |
+----------+
| axxxxfgh |
+----------+
select anon.partial_email('daamien@gmail.com');
+-----------------------+
| partial_email |
+-----------------------+
| da******@gm******.com |
+-----------------------+
```
改變為概略描述
```sql
數值型
generalize_int4range
generalize_int8range
generalize_numrange
select val
, anon.generalize_int4range(val, 150)
from it0729;
+-----+----------------------+
| val | generalize_int4range |
+-----+----------------------+
| 158 | [150,300) |
+-----+----------------------+
```
日期型
```sql
generalize_daterange
generalize_tsrange
generalize_tstzrange
select dt
, anon.generalize_tsrange(dt, 'year')
, anon.generalize_tsrange(dt, 'decade')
from it0729;
+--------------------------+-----------------------------------------------+-----------------------------------------------+
| dt | generalize_tsrange | generalize_tsrange |
+--------------------------+-----------------------------------------------+-----------------------------------------------+
| 2006-07-03 22:38:49.7328 | ["2006-01-01 00:00:00","2007-01-01 00:00:00") | ["2000-01-01 00:00:00","2010-01-01 00:00:00") |
+--------------------------+-----------------------------------------------+-----------------------------------------------+
```
-- 查詢現有安裝版本的函數
```sql
select p.proname
-- , p.prosrc
from pg_namespace n
join pg_proc p
on n.oid = p.pronamespace
where n.nspname = 'anon'
order by 1;
+------------------------------+
| proname |
+------------------------------+
| add_noise_on_datetime_column |
| add_noise_on_numeric_column |
...
| unload |
| unmask_role |
+------------------------------+
(86 rows)
---------
```
來點比較不一樣的應用方式
前幾天介紹 cursor 的一個 table
```sql
table it0721; -- 這是 psql 的方便的 meta command
+----+------------+
| id | girl |
+----+------------+
| 1 | 小島南 |
| 2 | 初川南 |
| 3 | 相澤南 |
| 4 | 山岸逢花 |
| 5 | 奥田咲 |
| 6 | 夢乃あいか |
| 7 | 凛音とうか |
| 8 | 深田えいみ |
| 9 | 永井マリア |
| 10 | 紗々原ゆり |
| 11 | 架乃ゆら |
| 12 | 七沢みあ |
| 13 | 小野六花 |
| 14 | 八木奈々 |
+----+------------+
(14 rows)
```
使用以下的指令搭配 Anonymizer 的函數
```sql
SECURITY LABEL FOR anon ON COLUMN it0721.girl
IS 'MASKED WITH FUNCTION anon.fake_last_name()';
接著讓他生效,要這樣動手
SELECT anon.anonymize_database();
+--------------------+
| anonymize_database |
+--------------------+
| t |
+--------------------+
(1 row)
Time: 197.381 ms
select * from it0721;
+----+-------------+
| id | girl |
+----+-------------+
| 1 | Ottesen |
| 2 | Farry |
| 3 | Jannetti |
| 4 | Challis |
| 5 | Skees |
| 6 | Bour |
| 7 | Debrie |
| 8 | Hobin |
| 9 | Tesreau |
| 10 | Breitenbach |
| 11 | Bong |
| 12 | Abrecht |
| 13 | Syer |
| 14 | Sawina |
+----+-------------+
(14 rows)
```
將 security label 清除.
```sql
SECURITY LABEL FOR anon ON COLUMN it0721.girl IS NULL;
```
注意! 這樣做 資料是變更了,回不去了!
這是屬於 In-Place Anonymization
同樣的方式還有上面有介紹過的
```
anon.add_noise_on_numeric_column(table, column, ratio)
anon.add_noise_on_datetime_column(table, column, interval)
```