# Text De-Identification

## 1. Abstract

## 2. Related Works

- [The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization](https://direct.mit.edu/coli/article/48/4/1053/112770/The-Text-Anonymization-Benchmark-TAB-A-Dedicated)

## 3. Datasets

3.1. [Privy](https://huggingface.co/datasets/beki/privy): a synthetic dataset, mostly structured data (e.g. SQL, JSON formats)

3.2. [Text Anonymization Benchmark (TAB)](https://github.com/NorskRegnesentral/text-anonymization-benchmark)

## 4. Preliminary Dataset with GPT4 and TAB

- Intel has an optimized Mistral model; let's take a look at it.

### Swedish

- https://huggingface.co/neph1/bellman-7b-mistral-instruct-v0.2
- sw3 models

### 4.1. Fine-Tuning Phi-2 w/ TAB

- [V1](https://huggingface.co/dcipheranalytics/phi-2-tab-v1)
- [V2](https://huggingface.co/dcipheranalytics/phi-2-tab-v2)

| model | epochs | precision | recall | f1    |
|-------|--------|-----------|--------|-------|
| v1    | 5      | 0.805     | 0.617  | 0.663 |
| v2    | 3      | 0.40      | 0.439  | 0.40  |

- Results are on the full GPT4 data.
- v2 is underfit with only 3 epochs.

### 4.2. Fine-Tuning Phi-2 w/ TAB + GPT4GenData

| data       | rep_pen | steps | precision | recall | f1    |
|------------|---------|-------|-----------|--------|-------|
| tab+gpt4/2 | none    | 330   | 0.754     | 0.723  | 0.73  |
| tab+gpt4/2 | 1.1     | 330   | 0.752     | 0.73   | 0.73  |
| tab+gpt4/2 | none    | 396   | 0.77      | 0.734  | 0.744 |
| tab+gpt4/2 | 1.1     | 396   | 0.76      | 0.727  | 0.736 |
| tab+gpt4/2 | 1.2     | 396   | 0.7269    | 0.697  | 0.698 |

- Results are on the gpt4/2-test data.
- `rep_pen` is the repetition penalty used at generation time ("none" = 1.0).

### 4.3. Fine-Tuning Phi-2 w/ TAB (cherry-picked) + GPT4GenData

- Fixed the `[maildomain]` regex separation problem in the GPT4 dataset.
- The TAB and GPT4-generated datasets have different text-length and label-count statistics, summarized below.

#### 4.3.1. Dataset

1. TAB: distributions of number of labels and text length (char)
2. GPT4 generated: distributions of number of labels and text length (char)

- We filtered the TAB dataset with both of the conditions below:
    - keep texts shorter than 4000 characters,
    - keep texts with fewer than 23 labels.

This left 199 training samples from the TAB dataset, which we merged with the GPT4-generated data to fine-tune the Phi-2 model.

Results:

| data                | rep_pen | steps | precision | recall | f1      |
|---------------------|---------|-------|-----------|--------|---------|
| tab-filtered+gpt4/2 | none    | 168   | 0.7942    | 0.7248 | 0.75048 |
| tab-filtered+gpt4/2 | 1.1     | 168   | 0.7942    | 0.7306 | 0.7513  |
| tab-filtered+gpt4/2 | 1.2     | 168   |           |        |         |

## 5. GPT4 Dataset Experiments

### 5.1. TAB + 100 Banking Topics, 4 examples each

Done in 4.3.

### 5.2. TAB + 100 Banking Topics, 8 examples each

```
precision 0.835063
recall    0.751692
f1        0.785094
```

### 5.3. 100 Banking Topics, 8 examples each

```
precision 0.810754
recall    0.781420
f1        0.788129
```

### 5.4. 100x8, 100x4 banking, 100x4 insurance

```
precision 0.836223
recall    0.781132
f1        0.801837
```

On TAB-test, rep_pen 1.0:

```
precision 0.506118
recall    0.350976
f1        0.391614
```
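Below are a few illustrative sketches of the steps described above; they are assumptions about the pipeline, not the exact scripts used. First, the length/label filtering from 4.3.1 (the `text` and `annotations` field names and the file name are placeholders; the real TAB JSON schema differs):

```python
import json

def filter_tab(records, max_chars=4000, max_labels=23):
    """Keep only records that satisfy BOTH conditions from 4.3.1:
    text shorter than max_chars and fewer than max_labels annotations."""
    return [
        rec for rec in records
        if len(rec["text"]) < max_chars and len(rec["annotations"]) < max_labels
    ]

# Usage (file name and schema are assumptions):
# with open("tab_train.json") as f:
#     tab_train = json.load(f)
# filtered = filter_tab(tab_train)  # the note reports 199 remaining samples
```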
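A sketch of running one of the fine-tuned checkpoints with a repetition penalty (the `rep_pen` column in 4.2 and 4.3); the prompt template here is an assumption, not the one used for training:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dcipheranalytics/phi-2-tab-v2"  # one of the checkpoints listed in 4.1
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Hypothetical instruction-style prompt; the actual training template may differ.
prompt = "Mark all personal identifiers in the following text.\n\n<document text>"
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
    repetition_penalty=1.1,  # "none" in the result tables corresponds to 1.0
)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```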
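The precision/recall/f1 numbers in the tables can be computed in several ways; the sketch below assumes micro-averaged exact span matching, which may differ from the evaluation actually used (the TAB paper, for instance, also defines recall weighted by annotator agreement):

```python
def micro_prf(pred_docs, gold_docs):
    """Micro-averaged precision/recall/F1 over exact (start, end, label) matches.

    pred_docs / gold_docs: one collection of span tuples per document.
    """
    tp = fp = fn = 0
    for pred, gold in zip(pred_docs, gold_docs):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)   # predicted spans that match a gold span exactly
        fp += len(pred - gold)   # predicted spans with no gold counterpart
        fn += len(gold - pred)   # gold spans that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```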