# FactChecking
## 2024/3/25
- Tested label accuracy with Gemini on the FEVER test data.
- Three settings were compared: using the whole wiki page predicted by document retrieval as evidence, concatenating all evidence predicted by evidence retrieval as the evidence, and providing nothing at all and asking directly whether the claim is correct. The results are as follows:
| | gemini with evidence | gemini with 1 whole document | gemini with nothing | roberta fine-tune |
| -------- | -------- | -------- | -------- |-------- |
| Label Accuracy | 0.6241 | 0.5483 |0.5419| 0.6992 |
- The prompts used are as follows:
- gemini with evidence
```
Please verify the claim based on the evidence provided.
If the claim is true, please respond with "SUPPORTS"
If it is not true, please respond with "REFUTES"
If unable to determine, please respond with "NOT ENOUGH INFO"
The claim is as follows:{claim}
The evidence is as follows:{evidences}
```
- gemini with 1 whole document
```
Please verify the claim based on the document provided.
If the claim is true, please respond with "SUPPORTS"
If it is not true, please respond with "REFUTES"
If unable to determine, please respond with "NOT ENOUGH INFO"
The claim is as follows:{claim}\nThe document is as follows:{document}
```
- gemini with nothing
```
Please verify the claim.
If the claim is true, please respond with "SUPPORTS"
If it is not true, please respond with "REFUTES"
If unable to determine, please respond with "NOT ENOUGH INFO"
The claim is as follows:{claim}
```
- This can serve as the baseline for label verification (a call sketch follows below).
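- A minimal sketch of how this baseline could be run, assuming the `google-generativeai` Python client and the `gemini-pro` model name (both assumptions, not recorded in this note); the reply text is normalized back onto the three FEVER labels:
```
import google.generativeai as genai  # assumption: google-generativeai client

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro")  # assumed model name

PROMPT = (
    "Please verify the claim based on the evidence provided.\n"
    'If the claim is true, please respond with "SUPPORTS"\n'
    'If it is not true, please respond with "REFUTES"\n'
    'If unable to determine, please respond with "NOT ENOUGH INFO"\n'
    "The claim is as follows:{claim}\n"
    "The evidence is as follows:{evidences}"
)

LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]

def verify(claim, evidences):
    """Ask Gemini for a verdict and normalize the reply to a FEVER label."""
    response = model.generate_content(PROMPT.format(claim=claim, evidences=evidences))
    text = response.text.upper()
    for label in LABELS:
        if label in text:
            return label
    return "NOT ENOUGH INFO"  # fall back when the reply contains no known label
```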
<!-- ## 2024/3/18
- Prompt learning was temporarily dropped from the paper -->
## 2023/12/26
- 季儒's document retrieval results (NEI excluded, so the total number of examples is 13,332)
| | Top10 | Top20 | Top20_rerank10 | Top100 | Top100_rerank10_large | Top100_rerank10_finetune_base | Top100_rerank10_finetune_large |
| --- | --- | --- | --- | --- | --- | --- | --- |
| dataset_size | 13332 | 13332 | 13332 | 13332 | 13332 | 13332 | 13332 |
| total_match | 7859 | 8799 | 8782 | 11054 | 10976 | 10852 | 10919 |
| partial_match | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| strict_recall | 58.9% | 66% | 65.9% | 83% | 82.3% | 81.4% | 82% |
- Results of sentence retrieval with SentenceBert on the Top 10 pages from document retrieval (NEI included, 19,998 examples in total)
- Because no label exists yet at this step, the fever scorer package cannot be used, so tp, fp, and fn are computed manually for each example and then averaged. Some examples have precision == 0 or recall == 0, which makes their f1 score impossible to compute (see the sketch below).

| | document Top10 |
| -------- | -------- |
| dataset_size | 19,998|
| avg_recall | 31.12% |
| avg_precision | 9.01% |
| avg_f1 (6,902) | 29.53% |
| avg_f1(all)| 13.97%|
- avg_f1 is inflated because the many examples where recall and precision are both 0 (f1 undefined) are excluded from the average; only 6,902 examples have a valid f1.
- If f1 is instead computed directly from avg_recall and avg_precision, the f1 score comes out lower.
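- A minimal sketch of the per-example computation described above; the names are illustrative, and f1 is left undefined when precision or recall is 0, per the rule stated above:
```
def prf1(predicted, gold):
    """Per-example tp/fp/fn and the derived metrics; predicted and gold are sets of evidences."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # f1 stays undefined when precision or recall is 0, matching the rule above.
    f1 = 2 * precision * recall / (precision + recall) if precision > 0 and recall > 0 else None
    return precision, recall, f1

def average_metrics(examples):
    """examples: iterable of (predicted_set, gold_set); averages P/R over all, F1 over valid ones."""
    precisions, recalls, f1s = [], [], []
    for predicted, gold in examples:
        p, r, f = prf1(predicted, gold)
        precisions.append(p)
        recalls.append(r)
        if f is not None:  # avg_f1 only counts examples with a defined f1 (the 6,902 valid ones)
            f1s.append(f)
    return (sum(precisions) / len(precisions),
            sum(recalls) / len(recalls),
            sum(f1s) / len(f1s))
```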
- For claim verification, if sentence retrieval returns nothing, the claim is labeled NOT ENOUGH INFO directly; otherwise a majority vote over the retrieved sentences decides the predicted label (see the sketch after the table below).
- Verification has not finished running yet, so the numbers below cover 1,000 examples.
| | document Top10 |
| -------- | -------- |
| dataset_size | 1,000|
|label_accuracy| 49.5%|
| strict_score | 27.4% |
| recall | 22.08% |
| precision | 20.95%|
| f1 | 21.50% |
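- A minimal sketch of the voting rule above, assuming each retrieved sentence has already been given an individual verdict (how that verdict is produced is not shown here):
```
from collections import Counter

def verify_claim(sentence_labels):
    """sentence_labels: list of per-sentence verdicts, e.g. ["SUPPORTS", "REFUTES", ...]."""
    if not sentence_labels:          # empty retrieval -> NOT ENOUGH INFO, as described above
        return "NOT ENOUGH INFO"
    label, _ = Counter(sentence_labels).most_common(1)[0]
    return label
```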
## 2023/11/14
### To-Do
1. Crawl user questions from the Q&A forum (蘇鈺璇)
2. Annotate keywords with ChatGPT (季儒)
3. Chinese query generation: dataset and outputting multiple queries (程偉)
4. Strengthen document retrieval by fetching content from multiple web pages; results can be sent by email (季儒)
5. Negative similarity values from the Chinese SentenceBert (恩柔)(程偉)
6. FEVER performance experiments (季儒)
## 2023/10/31
- The equery API is now connected to Google Translate, so both Chinese and English input work.
- For the cverify API, the checkpoint left by the Chinese model is temporarily unusable, so it has been completed via translation for now.
```
@app.get("/cverify/")
GET api_003(claim: str, url:str)
->{label: int,
evidence: [str],
state: str
}
```
- The senior student's Chinese model is being retrained with a new checkpoint; once training finishes, the cquery API will be updated.
- Recall using the query-generation approach:
| | Top5 |Top10 | Top20 | Top100 |
| -------- | -------- |-------- |-------- |-------- |
| Total (excluding NEI) | 13332 | 13332 | 13332 | 13332 |
| At least one gold page retrieved | 9631 | 9771 | 10172|10235|
| All gold pages retrieved | 8330 | 8438 |8802| 8862|
| strict-recall | 62.48% | 63.29% |66.02% |66.47%|
- With Elasticsearch, even though Top 100 is requested, most queries actually return fewer than 100 results.
## 2023/10/24
- The equery and everify APIs are both complete.
- equery API
- equery: str
- On error, equery = None
- state: str
- "1" 為正常狀態、異常則回傳Exception內容
```
@app.route('/equery/', methods=['GET'])
GET api_001(claim)
->{equery: str,
state: str
}
```
- everify API
- label: int
- "SUPPORTS": 0 , "REFUTES": 1, "NOT ENOUGH INFO": 2
- The label is decided by a majority vote over the evidences selected by SentenceBert (a Flask sketch follows after the spec below)
- evidence:[str]
- The evidences that agree with the final label after the majority vote
- state: str
```
@app.route('/everify/', methods=['GET'])
GET api_002(claim,url)
->{label: int,
evidence: [str],
state: str
}
```
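- A minimal Flask sketch of how the everify spec above could be implemented; `retrieve_sentences` and `classify` are hypothetical placeholders for the SentenceBert retrieval and the per-sentence classifier (neither name comes from the actual codebase):
```
from collections import Counter
from flask import Flask, jsonify, request

app = Flask(__name__)
LABEL_IDS = {"SUPPORTS": 0, "REFUTES": 1, "NOT ENOUGH INFO": 2}

def retrieve_sentences(claim, url):
    """Hypothetical: fetch the page at `url` and keep the sentences SentenceBert deems relevant."""
    raise NotImplementedError

def classify(claim, sentence):
    """Hypothetical: per-sentence verdict, one of SUPPORTS / REFUTES / NOT ENOUGH INFO."""
    raise NotImplementedError

@app.route('/everify/', methods=['GET'])
def everify():
    claim = request.args.get("claim", "")
    url = request.args.get("url", "")
    try:
        sentences = retrieve_sentences(claim, url)
        verdicts = [classify(claim, s) for s in sentences]
        if not verdicts:  # nothing retrieved -> NOT ENOUGH INFO
            return jsonify({"label": LABEL_IDS["NOT ENOUGH INFO"], "evidence": [], "state": "1"})
        final, _ = Counter(verdicts).most_common(1)[0]          # majority vote
        evidence = [s for s, v in zip(sentences, verdicts) if v == final]
        return jsonify({"label": LABEL_IDS[final], "evidence": evidence, "state": "1"})
    except Exception as exc:
        # Per the spec, the Exception content is returned in `state`.
        return jsonify({"label": LABEL_IDS["NOT ENOUGH INFO"], "evidence": [], "state": str(exc)})
```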
## 2023/10/16
- Built the API project with the Flask framework:
- Query generation: Chinese and English are split into two APIs; the frontend has to determine first whether the input is Chinese or English.
```
@app.route('/equery/', methods=['GET'])
GET api_001(claim)
->{equery: str, # on error, equery = None
state: int
}
```
~~@app.route('/cquery/', methods=['GET'])
GET api_002(claim)
->{equery: str,
state: int
}~~
- Verification: each request handles the content of a single website, so how many websites get verified is up to the frontend; to verify 5 websites, send 5 requests. If the label is 0 or 1 (SUPPORTS or REFUTES), the frontend can show that website's link in the detail view.
```
@app.route('/everify/', methods=['GET'])
GET api_002(claim,url)
->{label: int,
evidence: [str],
state: int
}
@app.route('/cverify/', methods=['GET'])
GET api_003(claim,url)
->{label: int,
evidence: [str],
state: int
}
```
- Current progress
- Integrating the sentence retrieval and claim verification code; the two parts were originally written separately, but the verification API needs them chained together.
- The titles provided by CHEF are not entities, so the training data for the Chinese query-generation model cannot produce the expected queries.
## 2023/10/03
- Query generation results with the T5 generation model:
| | Top5 |Top10 | Top20 |
| -------- | -------- |-------- |-------- |
| Total (excluding NEI) | 13332 | 13332 | 13332 |
| At least one gold page retrieved | 9631 | 9771 | 10172|
| All gold pages retrieved | 8330 | 8438 |8802|
| strict-recall | 62.48% | 63.29% |66.02% |
- Method comparison
| Top10 | Raw claim as query | With term-frequency stop words | With external stop-word list | With generation model |
| -------- | -------- | -------- | -------- | -------- |
| strict-recall | 11.62% | 11.72% | 13.22% |63.29% |
- Actual Top 5 results:

- query is the generation model's output
- evidence is the human-annotated title, which serves as the gold label here
- Top5 is the set of titles returned by elasticsearch
- Planned improvement: have the generation model produce multiple queries and use the Top 5 of each, to widen the search coverage.
## 2023/09/25
- Page retrieval results with elasticsearch on the validation set:
| | Top5 | Top10 |
| -------- | -------- | -------- |
| Total (excluding NEI) | 13332 | 13332 |
| At least one gold page retrieved | 1547 | 1780 |
| All gold pages retrieved | 1348 | 1549 |
| strict-recall | 10.11% | 11.62% |
- A large part of the cause is that elasticsearch has no way of knowing how important each word in the claim is. For example:
- claim: "Heartland is a canadian TV series."
- Top5:
- id : Is-a
- id : Heartland
- id : IS-IS
- id : .is
- id : IS
- label: "Heartland_-LRB-Canadian_TV_series-RRB-"
- claim: "Ryan Seacrest is a person."
- Top5:
- id : Is-a
- id : Person/a
- id : IS-IS
- id : RYAN
- id : Ryan
- label: "Ryan_Seacrest"
- Adding stop words derived from elasticsearch term frequencies gives only a marginal improvement.
| | Top10 |
| -------- | -------- |
| Total (excluding NEI) | 13332 |
| At least one gold page retrieved | 1798 |
| All gold pages retrieved | 1562 |
| strict-recall | 11.72% |
- Tested with an external stop-word list:
| | Top10 |
| -------- | -------- |
| Total (excluding NEI) | 13332 |
| At least one gold page retrieved | 2038 |
| All gold pages retrieved | 1762 |
| strict-recall | 13.22% |
- Most FEVER systems solve this by first extracting entities with an NER step and then querying with those entities.
- Our current approach instead trains a T5 generation model via prompting to generate a query, which is then looked up with elasticsearch (a sketch follows below).
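- A minimal sketch of that pipeline, assuming a Hugging Face T5 checkpoint fine-tuned for query generation and an Elasticsearch index of the wiki pages; the checkpoint path, prompt prefix, index name, and field used for matching are all illustrative assumptions:
```
from transformers import T5ForConditionalGeneration, T5Tokenizer
from elasticsearch import Elasticsearch

MODEL_DIR = "path/to/query-gen-t5"                 # hypothetical fine-tuned checkpoint
tokenizer = T5Tokenizer.from_pretrained(MODEL_DIR)
model = T5ForConditionalGeneration.from_pretrained(MODEL_DIR)
es = Elasticsearch("http://localhost:9200")

def generate_query(claim):
    """Turn a claim into a search query with the fine-tuned T5 model."""
    inputs = tokenizer("generate query: " + claim, return_tensors="pt")  # assumed prompt prefix
    output_ids = model.generate(**inputs, max_length=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def retrieve_pages(claim, k=10):
    """Search the wiki index with the generated query and return the top-k page ids."""
    query = generate_query(claim)
    hits = es.search(index="wiki", query={"match": {"id": query}}, size=k)
    return [hit["_source"]["id"] for hit in hits["hits"]["hits"]]
```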
## 2023/09/19
- Loaded all 5,416,537 wiki pages into elasticsearch
- mapping of wiki data in elasticsearch
```
mapping = {
    "properties": {
        "id": {
            "type": "integer"
        },
        "text": {
            "type": "text"
        },
        "lines": {
            "type": "text"
        },
    }
}
```
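- A minimal sketch of creating the index with this mapping and bulk-loading the pages, using the official `elasticsearch` Python client; the index name `wiki` and the JSONL dump file are illustrative assumptions:
```
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")          # assumed local cluster
mapping = {"properties": {"id": {"type": "integer"},
                          "text": {"type": "text"},
                          "lines": {"type": "text"}}}
es.indices.create(index="wiki", mappings=mapping)    # index name is an assumption

def actions(path):
    """Yield one bulk action per wiki page from a JSONL dump (assumed file format)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            page = json.loads(line)
            yield {"_index": "wiki",
                   "_source": {"id": page["id"], "text": page["text"], "lines": page["lines"]}}

helpers.bulk(es, actions("wiki_pages.jsonl"))        # hypothetical dump file name
```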
- train
```
{
"id": 62037,
"label": "SUPPORTS",
"claim": "Oliver Reed was a film actor.",
"evidence": [
[
[<annotation_id>, <evidence_id>, "Oliver_Reed", 0]
],
[
[<annotation_id>, <evidence_id>, "Oliver_Reed", 3],
[<annotation_id>, <evidence_id>, "Gladiator_-LRB-2000_film-RRB-", 0]
],
[
[<annotation_id>, <evidence_id>, "Oliver_Reed", 2],
[<annotation_id>, <evidence_id>, "Castaway_-LRB-film-RRB-", 0]
],
[
[<annotation_id>, <evidence_id>, "Oliver_Reed", 1]
],
[
[<annotation_id>, <evidence_id>, "Oliver_Reed", 6]
]
]
}
```
- "evidence": [Annotation ID, Evidence ID, Wikipedia URL, sentence ID]
- Training data source for SentenceBert (a pair-construction sketch follows at the end of this entry)
- Use elasticsearch to find the wiki entry whose id exactly matches the Wikipedia URL
- The sentence in its lines with the annotated sentence ID is labeled True
- Other sentence IDs in the same lines are labeled False
- test and evaluate
- Use an elasticsearch query to find the 5 entries whose id is most similar to the claim
- Then use the trained SentenceBert to predict the sentence IDs
- If a sentence ID is predicted True, output "predicted_evidence": [Wikipedia URL, sentence ID]
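- A minimal sketch of the True/False pair construction described above, assuming the FEVER wiki dump format where each row of `lines` is `<sentence_id>\t<sentence>\t...`:
```
def parse_lines(lines):
    """Split a FEVER `lines` field ("<sentence_id>\t<sentence>\t...") into {sentence_id: sentence}."""
    sentences = {}
    for row in lines.split("\n"):
        parts = row.split("\t")
        if len(parts) >= 2 and parts[0].isdigit() and parts[1]:
            sentences[int(parts[0])] = parts[1]
    return sentences

def build_pairs(claim, page_lines, gold_ids):
    """Yield (claim, sentence, label) pairs: annotated sentence IDs -> True, the rest -> False."""
    for sid, sentence in parse_lines(page_lines).items():
        yield claim, sentence, sid in gold_ids
```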
## 2023/09/12
- Wiki page prediction:
- Original FEVER dataset:
- | | train | Shared Task Development Dataset (Labelled) | Shared Task Blind Test Dataset (Unlabelled) |
| --------------- | ------ | ----------| ---- |
| Count | 145,449 | 19,998 | 19,998 |
- Since the test data is unlabelled, papers split the Development set into two halves of 9,999 examples each: the Paper Development Dataset and the Paper Test Dataset.
- The FEVER score requires predicting both the wiki page and the sentence. A classifier over wiki pages was the first idea, but FEVER has 5,416,537 wiki pages, which is far too many classes.
- Currently Semantic_Ranker computes the cosine similarity between the claim and each page title and keeps the top five pages (see the sketch below).
- Semantic_Ranker only runs on the test data, so the actual number of cosine-similarity computations is 9,999 × 5,416,537.
- Next: use Elasticsearch or LangChain instead.
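- A minimal sketch of the title-ranking step with `sentence-transformers`; the model name is an illustrative choice, and in practice the title embeddings would be encoded once and cached, since the 9,999 × 5,416,537 pairwise scores are the expensive part:
```
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative checkpoint, not the one actually used

def top5_pages(claim, titles):
    """Rank every page title by cosine similarity to the claim and return the five best titles."""
    claim_emb = model.encode(claim, convert_to_tensor=True)
    title_embs = model.encode(titles, convert_to_tensor=True, batch_size=512)
    scores = util.cos_sim(claim_emb, title_embs)[0]
    best = scores.topk(5).indices.tolist()
    return [titles[i] for i in best]
```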
## 2023/09/08
- Retrieval for the FEVER dataset:
- The correct Wikipedia page has to be retrieved from the full Wikipedia collection, and within it the golden-evidence line has to be retrieved as well.
- FEVER SCORE:
```
from fever.scorer import fever_score
instance1 = {"label": "REFUTES", "predicted_label": "REFUTES",
"predicted_evidence": [ #is not strictly correct - missing (page2,2)
["page1", 1] #page name, line number
],
"evidence":
[
[
[None, None, "page1", 1], #[(ignored) annotation job, (ignored) internal id, page name, line number]
[None, None, "page2", 2],
]
]
}
instance2 = {"label": "REFUTES", "predicted_label": "REFUTES",
"predicted_evidence": [
["page1", 1],
["page2", 2],
["page3", 3]
],
"evidence":
[
[
[None, None, "page1", 1],
[None, None, "page2", 2],
]
]
}
predictions = [instance1, instance2]
strict_score, label_accuracy, precision, recall, f1 = fever_score(predictions)
print(strict_score) #0.5
print(label_accuracy) #1.0
print(precision) #0.833 (first example scores 1, second example scores 2/3)
print(recall) #0.5 (first example scores 0, second example scores 1)
print(f1) #0.625
```
- Issues:
- NEI examples used to be included when computing retrieval performance, but the original FEVER provides no golden evidence for NEI.
- The golden-evidence dataset we use comes from [Team Papelo: Transformer Networks at FEVER](https://aclanthology.org/W18-5517/), which finds evidence for NEI claims with retrieval methods such as TF-IDF and named-entity retrieval.
## 2023/07/25
- | | train | validation | test |
| --------------- | ------ | ---------- | ---- |
| 1. Golden Evidence count | 228,277 | 15,935 |16,039 |
| 2. Duplicate claims | 92,732 | 6,192 |6,185 |
| 3. Duplicate claims with different labels | 595 | 62 |20 |
| 4. No original document | 167 | 20 |23 |
| After merging (1-2+3-4) | 135,973 | 9,749 |9,851 |
- Golden Evidence Data Set
- | | train | validation | test |
| --------------- | ------ | ---------- | ---- |
| Avg. number of evidences in the Golden Evidence Data Set |1.165 | 1.116 |1.107 |
| Avg. number of evidences in the merged dataset (excluding evidences from document splitting) | 1.698 | 1.674 |1.669 |
- P-Tuning v1 Verification:

## 2023/07/19
- P-Tuning v1 performance comparison for verification; the train, validation, and test sets are all retrieval results

- Retrieval dataset
| | train | validation | test |
| --------------- | ------ | ---------- | ---- |
| SUPPORTS | 73,919 | 3,253 |3,296 |
| REFUTES | 28,310 | 3,254 |3,280 |
| NOT ENOUGH INFO | 33,744 | 3,242 |3,275 |
| total | 135,973 | 9,749 |9,851 |
- Golden Evidence dataset
| | train | validation | test |
| --------------- | ------ | ---------- | ---- |
| SUPPORTS | 114,801 | 4638 | 4694 |
| REFUTES | 47,096 | 4887 | 4889 |
| NOT ENOUGH INFO | 66,380 | 6410 | 6456 |
| total | 228,277 | 15935 | 16039|
## 2023/07/12
- The test data and the golden data used for evaluation got out of order during preprocessing, so the earlier evaluation numbers were abnormally low. Corrected results:

- Results when examples with gold=0 or pred=0 are removed:

- RoBERTa: the Chinese dataset's base model uses RoBERTa, but after testing a RoBERTa-based SentenceBert on the English dataset, it performed no better than the original BERT-base.
- Verification: generation of the train data has not finished yet
- | | train | validation | test |
| --------------- | ------ | ---------- | ---- |
| SUPPORTS | 73,919 | 3,253 |3,296 |
| REFUTES | 28,310 | 3,254 |3,280 |
| NOT ENOUGH INFO | 33,744 | 3,242 |3,275 |
| total | 135,973 | 9,749 |9,851 |
## 2023/07/05
- PromptBert:
- Training data sources: claims, golden evidence, and evidence split from the original articles
- Total training sentences: 1,226,536
- SentenceBert
- Training data sources: golden evidence and evidence split from the original articles
- Total training examples: 817,587
- Both use the same punctuation split, `(?<=[.?!;])`: a split only happens when the punctuation mark is followed by one or more spaces.
- A sentence only counts if it is longer than 50 characters (5 characters for the Chinese dataset), because five English characters cannot express meaningful content and the English data is large enough to afford filtering.
- If the golden evidence has more than 5 sentences, only the first 5 are kept.
- If there are more than 5 original articles, only the first 5 are kept (**the rationale for this is unclear**).
- Therefore, the golden evidence used for evaluation is also capped at the first five sentences.
- test data: 9,851 examples (a splitting/filtering sketch follows)
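- A minimal sketch of the splitting and filtering rules listed above (regex split, the 50-character threshold, and the five-sentence cap); the function name is illustrative:
```
import re

MIN_CHARS = 50      # English threshold described above (5 for the Chinese dataset)
MAX_SENTENCES = 5   # keep at most the first five sentences

def split_and_filter(text):
    """Split on . ? ! ; only when followed by whitespace, drop short fragments, cap at five."""
    sentences = re.split(r"(?<=[.?!;])\s+", text)
    return [s for s in sentences if len(s) > MIN_CHARS][:MAX_SENTENCES]
```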
## 2023/06/28
- Re-splitting English sentences:
``text = re.split(r'(?<=[.?!;])\s+', text)``
This regular expression uses a positive lookbehind, `(?<=[.?!;])`, so that a split only happens when the punctuation mark is followed by one or more spaces.
- Why the Semantic Ranker results were surprisingly low:
- The English dataset has only 1.8 gold_evidences on average, so taking the Top 5 is not meaningful.
- max_length was originally set to 512, but English sentences are short, so what the model judged as similar was mostly padding.
- | max_length | Avg. evidences retrieved |Precision|Recall| F1 |
| -------- | ------ |------ |------ |------ |
|512 | 4.95 | 0.59%|1.47%|0.79%|
| 64 | 4.86 |0.61% |1.43%| 0.78%|
| 16 | 2.02 |0.83%| 1.01%|0.78%|
## 2023/06/21
- Original gold-evidence datasets used for claim verification
| | train | validation | test |
| --------------- | ------ | ---------- | ---- |
| SUPPORTS | 114,801 | 4,638 | 4,694|
| REFUTES | 47,096 | 4,887 | 4,889|
| NOT ENOUGH INFO | 66,380 | 6,410 | 6,456|
| total | 228,277 | 15,935 |16,039|
- Map every gold evidence to its original Wikipedia article, merge entries that share the same claim and label, and drop entries that have no original article (a merge sketch follows below).
- Wikipedia mapping data source: https://huggingface.co/datasets/mwong/fever-evidence-related
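- A minimal pandas sketch of that merge; the column names (`claim`, `label`, `evidence`, `document`) are assumptions about the intermediate dataframe, not the actual schema of the linked dataset:
```
import pandas as pd

def merge_gold_evidence(df):
    """Drop rows without an original article, then merge rows sharing the same (claim, label)."""
    df = df.dropna(subset=["document"])                         # rule: delete data with no original article
    merged = (df.groupby(["claim", "label"], as_index=False)
                .agg({"evidence": list, "document": "first"}))  # collect all evidences per claim+label
    return merged
```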
- New dataset after mapping
| | train | validation | test |
| --------------- | ------ | ---------- | ---- |
| SUPPORTS | 73,919 | 3,253 |3,296 |
| REFUTES | 28,310 | 3,254 |3,280 |
| NOT ENOUGH INFO | 33,744 | 3,242 |3,275 |
| total | 135,973 | 9,749 |9,851 |
- The PromptBert training data is likewise built from claims, gold evidence, and split evidence text; the split produces many duplicates, over 13M rows before deduplication and 2,247,303 after.
- There is currently an issue with normalizing the training data: the training loss becomes NaN.
## 2023/06/06
- Current progress:
## 2023/05/31
- Current progress:
- Reduced the maximum string length from 512 to 128
## 2023/05/24
- Current progress:

- The Chinese prompts were generated with APE; the English ones are currently written by hand. Should APE generation be used for English as well?

- APE will not be used for now.
## 2023/05/16
- [Current paper leaderboard](https://paperswithcode.com/sota/fact-verification-on-fever)


- Data augmentation experiment
- model:BERT-base
- Original Datasets
| | train | validation | test |
| --------------- | ------ | ---------- | ---- |
| SUPPORTS | 114,801 | 4638 | 4694 |
| REFUTES | 47,096 | 4887 | 4889 |
| NOT ENOUGH INFO | 66,380 | 6410 | 6456 |
| total | 228,277 | 15935 | 16039|
### This was done wrong~~

- Different Methods
- ERNIE and PromptCLUE are mainly Chinese-language resources; what should the English dataset use as a replacement to make a fair comparison?

### To-Do
- This dataset is large enough that data augmentation is unnecessary
- For the English dataset, replace ERNIE with BART or T5
- Replace PromptCLUE with InstructDial
## 2023/05/10

- ***Hard Prompt***
- ***Template:***```{"evidences"} Question: {"claims"}? Is it correct? {"mask"}.```
- ***ManualVerbalizer:***```["yes", "right"], ["no"], ["maybe"]```
- ***P-tuning v1***
- ***Template:***```{"evidences"} {"soft":"according to the above content, the following description:"} {"claims"} {"soft":"whether it is correct? please answer, yes or no or maybe."} {"mask"}```
- ***ManualVerbalizer:***```["yes", "right"], ["no"], ["maybe"]```
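- The template and verbalizer strings above follow OpenPrompt's syntax, so a minimal sketch of wiring up the P-tuning v1 setting is given below; assuming OpenPrompt and a BERT-base backbone is my own reading, not something recorded in this note:
```
from openprompt.plms import load_plm
from openprompt.prompts import MixedTemplate, ManualVerbalizer
from openprompt import PromptForClassification

# Load a backbone PLM (assumed BERT-base; the actual backbone is not recorded here).
plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-uncased")

# P-tuning v1 style template: evidence and claim as placeholders, soft tokens in between.
template = MixedTemplate(
    model=plm, tokenizer=tokenizer,
    text='{"placeholder":"text_a"} {"soft":"according to the above content, the following description:"} '
         '{"placeholder":"text_b"} {"soft":"whether it is correct? please answer, yes or no or maybe."} {"mask"}',
)

# Map label words onto the three classes, as in the verbalizer above.
verbalizer = ManualVerbalizer(tokenizer, num_classes=3,
                              label_words=[["yes", "right"], ["no"], ["maybe"]])

model = PromptForClassification(plm=plm, template=template, verbalizer=verbalizer)
```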
## 2023/05/03
### Fact Checking – English Data
- Data source: https://huggingface.co/datasets/copenlu/fever_gold_evidence
- training data: 228,277
- validation data: 15,935
- test data: 1,000

- Adjust the learning rate and re-run.