# TREC CAsT 2021
## Checklist
> [PowerPoint Slides](https://nccuedutw-my.sharepoint.com/:p:/g/personal/108356024_nccu_edu_tw/EWBZeKbJKkNJpnIRkkfcMmkBdylc2Oo5RZf-4WatRkMwQA?rtime=EwMRZLmT2Ug)
## Notebook Paper
> [Overleaf](https://www.overleaf.com/project/617214a22481d4dd95d1fc01)
---
## Experiments :ledger:
> [Google Sheets](https://docs.google.com/spreadsheets/d/1nruRW-G_wJGVYum-_wB7CpPwk7V5JOsQOt2si0UyU8w/edit?fbclid=IwAR0o1DOanIf88vv_owi3PKElvqpoW5p1_BuUWe-S19LMP436kYIUAbBAC94#gid=716024942)
> [New Link](https://docs.google.com/spreadsheets/d/1nruRW-G_wJGVYum-_wB7CpPwk7V5JOsQOt2si0UyU8w/edit#gid=1673221561)
---
## Repository :house:
> [Our private repo](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA)
---
## ✨ Slides
> [Meeting with CJ (2021/07/05)](https://docs.google.com/presentation/d/1GCjpwwHW4eN27lbtFd3WJneCUaj76DY5QoJnnQAKrIk/edit?usp=sharing)
> [Meeting with CJ (2021/07/19)](https://docs.google.com/presentation/d/1rNCDIh7LMnpUzJxUmbCkQhJxq4Np9FUBONlDv4eOvwU/edit?usp=sharing)
> [Meeting with CJ (2021/07/26)](https://docs.google.com/presentation/d/12t6ZVHFbynUHPZ-6pdQxUCfyLnE2vPUsceBGVJUAmKY/edit?usp=sharing)
> [Group Meeting (2021/08/01)](https://nccuedutw-my.sharepoint.com/:p:/g/personal/108356024_nccu_edu_tw/EepbP4qb1IxGt9qMwYt-GXMB5l_8BisGd_c5zZ77gHxJzw?rtime=sWrQyLpU2Ug)
> [Meeting with CJ (2021/08/10)](https://nccuedutw-my.sharepoint.com/:p:/g/personal/108356024_nccu_edu_tw/EepbP4qb1IxGt9qMwYt-GXMB5l_8BisGd_c5zZ77gHxJzw?rtime=NWyhDwNf2Ug)
> [Meeting with CJ (2021/08/17)](https://nccuedutw-my.sharepoint.com/:p:/g/personal/108356024_nccu_edu_tw/ETZsQh2-nw1FjLkl1n6Ys68B3uiH34O-ZwaSlz7aOQZ9oA?rtime=b770JuBg2Ug)
---
## :star2: Listed Ideas
> :new::fire: [**TREC ideas**](https://hackmd.io/@jacky18008/TREC_Idea)
---
## Timeline :calendar:
| [Test topics](https://github.com/daltonj/treccastweb/tree/master/2021) released | Runs due | Results released to participants |
| :------------------: | :---------------------: | :------------------------------: |
| July 3rd, 2021 | August 18th, 2021 (AOE) | November, 2021 |
---
## Information for Participants :information_source:
> **[TREC 2021 Active Participants](https://trec.nist.gov/act_part/act_part.html)**
Login: trec2021
Password: 30Something
>
> **`Team Name`**: **CFDA_CLIP**
> - [`Official guidelines`](https://docs.google.com/document/d/1Eo0IqQedYc_rfTw-YxbvUGTpoYSmejU0iDlUzQWj3_w/edit)
>
> - [`Trecweb GitHub`](https://github.com/daltonj/treccastweb)
---
## Public Datasets for Year3 :package:
>- [TREC Washington Post Corpus](https://ir.nist.gov/wapo/)
>
>>- Username: TRECWaPoSt
>>- Password: PdrG9U}>4W
>>- Download it on servers:
>>`wget --user TRECWaPoSt --ask-password https://ir.nist.gov/wapo/WashingtonPost.v4.tar.gz`
>>- WashingtonPost.v4.tar.gz
>>`cfda3:/tmp2/ctyeh/anserini/collections/washingtonpost-doc/WashingtonPost.v4.tar.gz`
>>- **Dir@Server:**
>>`/home/ctyeh/trec_2021/dataset/wapo_v4/WashingtonPost.v4/data/TREC_Washington_Post_collection.v4.jl`
>>- **Duplicate file**
>>`/home/ctyeh/trec_2021/dataset/wapo_v4/WashingtonPost.v4/data/wapo_duplicate_list_v1.0.txt`
>>
>- [KILT Wikipedia](https://github.com/facebookresearch/KILT/)
>> - **Dir@Server:** `/home/ctyeh/trec_2021/dataset/kilt/kilt_knowledgesource.json`
> - [MS MARCO (Documents)](https://github.com/microsoft/MSMARCO-Document-Ranking)
>> - `wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz`
>> - **Dir@Server:**
>> `/home/ctyeh/trec_2021/dataset/msmarco-doc/msmarco-docs.tsv`
---
## Data Preprocessing (CT & CW) :construction_worker: :computer:
### :dark_sunglasses: [`TREC-tools`](https://github.com/grill-lab/trec-cast-tools/tree/master/src/main/python) :eyes:
::: info
### 1. Directly chunk datasets into passages
:::
### :pick: Preprocess .py files
`cfda3:/home/cwlin/TREC2021/anserini/trec-cast-tools/src/main/python/` :eyes::eyes:
- `{marco,kilt,wapo}_trecweb.py` converts the data to .xml
- Helper functions were added to `python/src/helpers.py`
    - You can change the output data format here; three formats are currently supported:
        - `converts_to_json`
            - `{"id": " ", "contents": " "}`
        - `converts_to_json_title`
            - `{"id": " ", "title": " ", "contents": " "}`
        - `converts_to_tsv`
            - `id \t title \t contents`
- :fire: `{marco,kilt,wapo}_json.py` converts the data to JSON format
    - Takes two extra `sys.argv` arguments, `batch_size` and `batch_num`, to parallelize the data processing
    - Use `{marco,kilt,wapo}-batch-preprocess.sh` to run 10 batches at the same time (`batch_size=100`, `batch_num=0~9` for the small dataset); a sketch of the JSON output format follows below
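A minimal sketch of what the `converts_to_json` output looks like when written as an Anserini-style JSON collection; file names and document contents here are illustrative, not the actual helper code:

```python
import json

def write_jsonl(records, out_path):
    """Write (doc_id, text) pairs as one JSON object per line,
    matching the {"id": ..., "contents": ...} format listed above."""
    with open(out_path, "w", encoding="utf-8") as fout:
        for doc_id, text in records:
            fout.write(json.dumps({"id": doc_id, "contents": text}) + "\n")

# Hypothetical usage with a small batch of passages.
docs = [("MARCO_D1_p0", "The hot glowing surfaces of stars emit energy ..."),
        ("KILT_123_p2", "Throat cancer is a type of cancer that ...")]
write_jsonl(docs, "docs00.json")
```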
### :+1: Passages location
- ~~`.tsv`~~
    - ~~`/home/cwlin/TREC2021/datasets/all_collections/`~~
- `.json` in cfda3
- `/tmp2/cwlin/trec-2021/json/collections/collection_jsonl`
::: info
### 2. Chunk Retrieved Documents into passages
:::
### :pick: Preprocess .py files
:open_file_folder:`cfda3: /tmp2/ctyeh/trec-cast-tools/src/main/python/doc2psg.py`
>This script builds the monoT5 input files: given the corpus, the queries, and the retrieval run file for those queries, it writes one line per query-passage pair. Each line in the monoT5 input file follows the format:
>`Query: {query} Passage: {passage} Relevant:\n`
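A minimal sketch of this step, under the assumption that the queries are a two-column TSV, the run file is in standard TREC format (`qid Q0 docid rank score tag`), and the collection is the JSON-lines format above; this is an illustration, not the actual `doc2psg.py`:

```python
import json

def load_collection(jsonl_path):
    """Map docid -> passage text from a JSON-lines collection."""
    passages = {}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            obj = json.loads(line)
            passages[obj["id"]] = obj["contents"]
    return passages

def write_monot5_input(queries_tsv, run_file, collection_jsonl, out_path):
    """Emit one 'Query: ... Passage: ... Relevant:' line per (query, passage) pair."""
    queries = dict(line.rstrip("\n").split("\t") for line in open(queries_tsv, encoding="utf-8"))
    passages = load_collection(collection_jsonl)
    with open(run_file, encoding="utf-8") as frun, open(out_path, "w", encoding="utf-8") as fout:
        for line in frun:
            qid, _, docid, *_ = line.split()
            fout.write(f"Query: {queries[qid]} Passage: {passages[docid]} Relevant:\n")
```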
### :+1: Document location
- :open_file_folder:`/tmp2/cwlin/trec-2021/doc/collections/collection_jsonl`
### :+1: Passages location
- :open_file_folder:` /tmp2/ctyeh/trec_dev/t5-input/`
---
## Document Expansion :dango:
Location of expanded documents
- :open_file_folder: `/tmp2/cytsao/trec_cast/docTTTTTquery/collections/expanded_doc/`

Experiment result
- :shell: `/tmp2/ctyeh/anserini/doct5query_BM25_experiment.sh`

Final indexes
- :open_file_folder: `cfda3: /tmp2/ctyeh/anserini/indexes/spacy-process-doc/test/doc-expansion-MARCO+KILT+WAPO`
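A minimal sketch of the expansion step itself, assuming the docTTTTTquery checkpoint published on Hugging Face (`castorini/doc2query-t5-base-msmarco`); the sampled queries are simply appended to the document text before indexing:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Assumed checkpoint name; see the docTTTTTquery repo for the official weights.
model_name = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def expand(doc_text, num_queries=3):
    """Sample predicted queries for a document and append them to its text."""
    inputs = tokenizer(doc_text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=True,
        top_k=10,
        num_return_sequences=num_queries,
    )
    queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    return doc_text + " " + " ".join(queries)
```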
---
## 1st-Stage Retrieval :dart:
### :one: Passage Retrieval
<span style="color:Blue; font-size:rem">**BM25**(Anserini)</span>
* see `/home/cwlin/TREC2021/anserini/experiments_{1,2}.sh`
* query files under: `/tmp2/cwlin/trec-2021/json/collections`
* **k1=0.82, b=0.68**
Experiments | File Name | MAP | Recall@1000
--- | --- | --- | ---
Raw Queries (utterance) | `run.test-all-collections_utterances.dev.trec` | 0.0250 | 0.2092
Rewritten Queries (query) | `run.test-all-collections.dev.trec` | **0.0730** | **0.5439**
queries_pred_canard.dev.tsv | `run.test-all-collections_canard.dev.trec` | 0.0366 | 0.3849
History utterance pair | `run.test-all-collections_history.dev.trec` | 0.0180 | 0.2803
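A minimal Pyserini sketch of the BM25 passage retrieval above, assuming the index path listed in the `indexes` section below and a two-column query TSV; the runs in the table were produced with Anserini's own scripts, so treat this as an equivalent illustration only:

```python
from pyserini.search import SimpleSearcher

# Assumed index location; see the "indexes" section below.
searcher = SimpleSearcher("/tmp2/cwlin/trec-2021/json/indexes")
searcher.set_bm25(k1=0.82, b=0.68)

with open("queries.dev.tsv", encoding="utf-8") as fin, \
     open("run.bm25.dev.trec", "w", encoding="utf-8") as fout:
    for line in fin:
        qid, query = line.rstrip("\n").split("\t")
        hits = searcher.search(query, k=1000)
        for rank, hit in enumerate(hits, start=1):
            fout.write(f"{qid} Q0 {hit.docid} {rank} {hit.score:.4f} bm25\n")
```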
### 0. Collections (the 3 datasets combined)
- `.tsv`
- `/home/cwlin/TREC2021/datasets/all_collections/`
- `.json` in cfda3
- `/tmp2/cwlin/trec-2021/json/collections/collection_jsonl`
### 1. indexes
- `/tmp2/cwlin/trec-2021/json/indexes`
### 2. retrieval
- `/tmp2/cwlin/trec-2021/anserini/runs`
### 3. evaluation
### Small Dataset
Small datasets are available on
- `cfda3:/tmp2/cwlin/trec-2021`
- `/small`
- `/small_tsv`
- `/small_anserini`
- There are 3 folders (`kilt`, `wapo`, `marco`), each containing batches of JSON files.
- You can check the elapsed time for each batch in `logs`.
- Concatenate the JSON files from the 3 folders into `docs.json` under the `collection_jsonl` folder with `json_batch_to_collection.py` (a sketch of this step follows below).
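An illustrative version of that concatenation step (not the actual `json_batch_to_collection.py`; the folder layout is assumed from the notes above):

```python
import glob
import os

src_root = "/tmp2/cwlin/trec-2021/small"  # assumed root; adjust for small_tsv etc.
out_dir = os.path.join(src_root, "collection_jsonl")
os.makedirs(out_dir, exist_ok=True)

# Append every batch file from the three dataset folders into one docs.json.
with open(os.path.join(out_dir, "docs.json"), "w", encoding="utf-8") as fout:
    for folder in ("kilt", "wapo", "marco"):
        for batch_file in sorted(glob.glob(os.path.join(src_root, folder, "*.json"))):
            with open(batch_file, encoding="utf-8") as fin:
                for line in fin:  # each batch file is assumed to be JSON-lines
                    fout.write(line)
```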
### :two: Document Retrieval
### 1. indexes
- `/tmp2/cwlin/trec-2021/doc/indexes`
<span style="color:Blue; font-size:rem">**BM25**(Anserini)</span>
**Script for the experiment**:
`/tmp2/ctyeh/anserini/bm25_doc_retrieve.sh`
:telescope: **k1=4.46, b=0.82**
Experiments | File Name (under `/tmp2/ctyeh/anserini/runs/`) | MAP | Recall@1000
--- | --- | --- | ---
Raw Queries (utterance) | `run.test-all-collections_utterances.dev.trec` | 0.0136 | 0.2510
Rewritten Queries (query) | `run.test-all-collections.dev.trec` | **0.0568** | **0.5230**
queries_pred_canard.dev.tsv | `run.test-all-collections_canard.dev.trec` | 0.0237 | 0.3766
<span style="color:Blue; font-size:rem">**BM25+RM3**(Pyserini)</span>
:fire: **Script for the experiment**: `/tmp2/ctyeh/anserini/bm25+rm3_doc_retrieve.sh`
:telescope: **k1=4.46, b=0.82, rm3**
Experiments | File Name (under `/tmp2/ctyeh/anserini/runs/`) | MAP | Recall@1000
--- | --- | --- | ---
Raw Queries (utterance) | `run.test-all-collections_utterances.dev.trec` | 0.0136 | 0.2510
Rewritten Queries (query) | `run.test-all-collections.dev.trec` | **0.0568** | **0.5230**
queries_pred_canard.dev.tsv | `run.test-all-collections_canard.dev.trec` | 0.0237 | 0.3766
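A minimal Pyserini sketch of BM25+RM3 with the parameters above, assuming the document-level index path from the `indexes` entry; the RM3 feedback settings shown are Pyserini's defaults, not tuned values:

```python
from pyserini.search import SimpleSearcher

# Assumed document-level index location (see "indexes" above).
searcher = SimpleSearcher("/tmp2/cwlin/trec-2021/doc/indexes")
searcher.set_bm25(k1=4.46, b=0.82)
# Enable RM3 pseudo-relevance feedback with Pyserini's default settings.
searcher.set_rm3(fb_terms=10, fb_docs=10, original_query_weight=0.5)

hits = searcher.search("What is throat cancer?", k=1000)
for rank, hit in enumerate(hits[:5], start=1):
    print(rank, hit.docid, round(hit.score, 4))
```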
---
## Data and formatting :file_cabinet:
Description | Available | File Name | Format
--- | --- | --- | ---
Full collections | :white_check_mark: | [@server](#) | docid-pid, (title, body)
Raw Queries Dev| :heavy_check_mark: | [utterances.dev.tsv](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA/blob/main/dev/utterances.dev.tsv) | topic_turn id, utterance
Predicted Queries Dev| :heavy_check_mark: | [queries_pred_canard.dev.tsv](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA/blob/main/dev/queries_pred_canard.dev.tsv) | topic_turn id, rewritten query
Truth Queries Dev| :heavy_check_mark: | [queries.dev.tsv](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA/blob/main/dev/queries.dev.tsv) | topic_turn id, manual query
Urels Dev | :heavy_check_mark: | [urels.dev.tsv](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA/blob/main/dev/urels.dev.tsv) | topic_turn id, docid-pid
Runs Dev | :white_check_mark: | [run_bm25.dev.tsv](#) | topic_turn id, docid-pid, rank
Runs Dev <br>(w/ NTR) | :white_check_mark: | [run_ntr+bm25.dev.tsv](#) | topic_turn id, docid-pid, rank
TREC Submission | :white_check_mark: | [xxx.tsv](#) | **topic_turn id**, "Q0", **docid-pid**, **rank**, score, xxx
- Ref: [Collection Index format](https://www.treccast.ai/) | [Query Index format](https://docs.google.com/document/d/1Eo0IqQedYc_rfTw-YxbvUGTpoYSmejU0iDlUzQWj3_w/edit#)
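A short sketch of writing a run in the TREC submission format from the table above; the run tag `CFDA_CLIP_run1` and the docid-pids are placeholders:

```python
def write_trec_run(results, out_path, tag="CFDA_CLIP_run1"):
    """results: dict mapping topic_turn id -> list of (docid-pid, score),
    already sorted by descending score."""
    with open(out_path, "w", encoding="utf-8") as fout:
        for qid, ranked in results.items():
            for rank, (doc_pid, score) in enumerate(ranked, start=1):
                fout.write(f"{qid} Q0 {doc_pid} {rank} {score:.4f} {tag}\n")

# Hypothetical usage with one turn and two retrieved passages.
write_trec_run({"106_1": [("MARCO_D59219-2", 12.31), ("KILT_1234-5", 11.87)]},
               "submission.trec")
```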
## Y3 Test Topics :microscope:
> [**`Dev topics`**](https://github.com/daltonj/treccastweb/tree/master/2021)
> [**`Dev data`**](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA/tree/main/dev)
## Y3 Baseline
> [<span style="color:grey;">Automatic Runs</span>](https://github.com/daltonj/treccastweb/tree/master/2021/baselines)
> 
> Performance
>
>Experiments | MAP | Recall@1000
>--- | --- | ---
>Baseline | 0.1367 | 0.7992
>
> [Web UI](http://3.142.108.201:3500/)
## TODO:
- [x] Data Preprocess (`c & c`)
- [x] BM25 Baseline (`ctyeh` & `cwlin`):
- [x] JM's Keyword Extraction & QQP (hhchen):
- [X] T5 Re-ranker(MSMARCO) & T5-NTR(CANARD) (JHJu)
- [x] doc2query-T5
- [x] RM3
## Last Year Legacy Code
> - [Handover **(Last Section)**](https://hackmd.io/6wc7zBdRTCKAhdXfgd_6Aw?fbclid=IwAR2CKtfe3brTQmm1GKefQccujHxPfs9-mqH8SeZrP1oZNn-gHuKJjjkAYlM#TREC-CAsT-2020)
> - [TREC 2020 **(Last year's HackMD)**](https://hackmd.io/A6IPFBc3QP61CoXJedRI-w?both)
> - [06/03 **(Anserini BM25)**](https://hackmd.io/m8EQqy0VScWaEx2Aiv-uCw)
> - [06/19 **(What is indexing?)**](https://hackmd.io/2UKrJa06TKWNAMsHvo55XA?view)
> - [06/24 **(Anserini BM25)**](https://hackmd.io/@WBBxL3DCQo6KfLdG__cflw/H1qxdXkCU)
> - [07/17 **(BERT rerank)**](https://hackmd.io/xUErkqYMQ96lpoigYUfOQA)
## Important Topics ...
1st-Stage Retrieval
- [Anserini](https://github.com/castorini/anserini/blob/master/docs/regressions-msmarco-passage.md)
- [Pyserini](https://github.com/castorini/pyserini)
- [Pyserini DPR](https://github.com/castorini/pyserini/blob/master/docs/experiments-dpr.md)
- [TPU Quick Start](https://cloud.google.com/tpu/docs/quickstart)
Document Expansion
* [docT5query(improved version)](https://github.com/castorini/docTTTTTquery)
* [doc2query](https://github.com/nyu-dl/dl4ir-doc2query)
Query Rewriting
* [T5 CANARD](https://huggingface.co/castorini/t5-base-canard) (used to resolve coreference in conversational queries)
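A minimal rewriting sketch with that checkpoint; the " ||| " separator between history turns and the current utterance is an assumption based on common CANARD-style usage, so check the model card before relying on it:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "castorini/t5-base-canard"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Assumed input convention: previous turns and the current utterance joined with " ||| ".
history = ["What is throat cancer?", "Is it treatable?"]
utterance = "What are its symptoms?"
source = " ||| ".join(history + [utterance])

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```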
`References`
* [Harnessing Evolution of Multi-Turn Conversations for Effective Answer Retrieval](https://arxiv.org/pdf/1912.10554.pdf?fbclid=IwAR1tTyFxMMokpyCrsD4pN7l_rYMPI6AqDmooGUpyi2qRBuGEmzYseHVcN8g)
* [Query Resolution for Conversational Search with Limited Supervision](https://arxiv.org/pdf/2005.11723.pdf?fbclid=IwAR1P9yJXbfxfz-koMQRs_zX_D0znV45GArirv08x-t3CpZAkAxODbQzqVDE)
* [Question Rewriting for Conversational Question Answering](https://arxiv.org/pdf/2004.14652.pdf?fbclid=IwAR1UiyP-CBlFqZZD07Yy7kuCYlsiSsZ23-S6uU_ZOrIbUmahOrf34nNs_rU)
* [TREC CAsT 2020 Overview](https://trec.nist.gov/pubs/trec29/papers/OVERVIEW.C.pdf)
* [TREC CAsT 2019 Overview](https://arxiv.org/pdf/2003.13624.pdf)
* [AS_CFDA 2020](https://trec.nist.gov/pubs/trec29/papers/ASCFDA.C.pdf?fbclid=IwAR0SMVIBd9P0D0ljh375za1LY36KcMEIhvAPMZLV-LZ3GIN2u4XP9sCZ0OE)
* [Multi-Stage Retrieval with BERT(2019)](https://arxiv.org/pdf/1910.14424.pdf?fbclid=IwAR2R3tIVsybZ1bVgkL0mC8N5sGcLYpEQtzmI5L0qHA5Pt8-WflhPv7DZU_w)
## Anserini:
[BM25 Passage Ranking Example](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md)
## Jack & Matt's paper:
- Query Expansion:
    - Extract **topic** and **subtopic** words with BM25.
    - Measure the ambiguity of each utterance with BM25.
    - Concatenate the topic & subtopic words onto ambiguous utterances.
- Neural Transfer Reformulation:
    - Rewrite the current utterance into a standalone query.
- Ranking Fusion:
    - Fuse the rankings from BM25 and the reranking model (see the fusion sketch below).
[Conversational IR pipeline (Jack & Matt)](https://arxiv.org/pdf/2005.02230.pdf)
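A generic reciprocal-rank-fusion sketch for combining two rankings (e.g. BM25 and a reranker); this illustrates the idea of ranking fusion, not the exact fusion used in the paper:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-id lists (best first); returns a fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical BM25 and reranker outputs for one query.
bm25_run = ["d3", "d1", "d7", "d2"]
rerank_run = ["d1", "d3", "d2", "d9"]
print(reciprocal_rank_fusion([bm25_run, rerank_run]))
```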
## References:
- TREC 2020 summary
- https://trec.nist.gov/pubs/trec29/papers/OVERVIEW.C.pdf
- Matt & Jack
- https://trec.nist.gov/pubs/trec29/papers/h2oloo.C.pdf
- Possible approaches for 1st stage retrieval
- [Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval (DeepCT)](https://arxiv.org/abs/1910.10687)
- [Dense passage retrieval for open-domain question answering (DPR)](https://arxiv.org/abs/2004.04906)
- [A Replication Study of Dense Passage Retriever](https://arxiv.org/abs/2104.05740)
- [KILT: a Benchmark for Knowledge Intensive Language Tasks](https://aclanthology.org/2021.naacl-main.200.pdf)
- [Efficient Passage Retrieval with Hashing for Open-domain Question Answering](https://arxiv.org/pdf/2106.00882.pdf)
- [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/pdf/2104.08663.pdf) ([code](https://github.com/UKPLab/beir))
- [Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval](https://arxiv.org/pdf/2009.12756.pdf) ([code](https://github.com/facebookresearch/multihop_dense_retrieval))
- [Expando-Mono-Duo](https://arxiv.org/pdf/2101.05667.pdf)
2019 ans & collections:
- `/tmp2/yhtseng/trec/treccastweb/2019`
## :email: Email List:
- 致廷: justinyeh1995@gmail.com
- 程緯: chengweilin.1997@gmail.com
- 家輝: dylanjootw@gmail.com
- 俊恩: m072040013@g-mail.nsysu.edu.tw
- 韋辰: pigoodDD@gmail.com
- 佳穎: akuelvish7@gmail.com
- 昱君: nuko7055@g.ncu.edu.tw
- 武峰: tcfshcos8@gmail.com
- 先灝: jacky18008@gmail.com