# TREC CAsT 2021

## Checklist
> [PowerPoint Slides](https://nccuedutw-my.sharepoint.com/:p:/g/personal/108356024_nccu_edu_tw/EWBZeKbJKkNJpnIRkkfcMmkBdylc2Oo5RZf-4WatRkMwQA?rtime=EwMRZLmT2Ug)

## Notebook Paper
> [Overleaf](https://www.overleaf.com/project/617214a22481d4dd95d1fc01)

---

## Experiments :ledger:
> [Google Sheets](https://docs.google.com/spreadsheets/d/1nruRW-G_wJGVYum-_wB7CpPwk7V5JOsQOt2si0UyU8w/edit?fbclid=IwAR0o1DOanIf88vv_owi3PKElvqpoW5p1_BuUWe-S19LMP436kYIUAbBAC94#gid=716024942)
> [New Link](https://docs.google.com/spreadsheets/d/1nruRW-G_wJGVYum-_wB7CpPwk7V5JOsQOt2si0UyU8w/edit#gid=1673221561)

---

## Repository :house:
> [Our private repo](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA)

---

## ✨ Slides
> [Meeting with CJ (2021/07/05)](https://docs.google.com/presentation/d/1GCjpwwHW4eN27lbtFd3WJneCUaj76DY5QoJnnQAKrIk/edit?usp=sharing)
> [Meeting with CJ (2021/07/19)](https://docs.google.com/presentation/d/1rNCDIh7LMnpUzJxUmbCkQhJxq4Np9FUBONlDv4eOvwU/edit?usp=sharing)
> [Meeting with CJ (2021/07/26)](https://docs.google.com/presentation/d/12t6ZVHFbynUHPZ-6pdQxUCfyLnE2vPUsceBGVJUAmKY/edit?usp=sharing)
> [Group Meeting (2021/08/01)](https://nccuedutw-my.sharepoint.com/:p:/g/personal/108356024_nccu_edu_tw/EepbP4qb1IxGt9qMwYt-GXMB5l_8BisGd_c5zZ77gHxJzw?rtime=sWrQyLpU2Ug)
> [Meeting with CJ (2021/08/10)](https://nccuedutw-my.sharepoint.com/:p:/g/personal/108356024_nccu_edu_tw/EepbP4qb1IxGt9qMwYt-GXMB5l_8BisGd_c5zZ77gHxJzw?rtime=NWyhDwNf2Ug)
> [Meeting with CJ (2021/08/17)](https://nccuedutw-my.sharepoint.com/:p:/g/personal/108356024_nccu_edu_tw/ETZsQh2-nw1FjLkl1n6Ys68B3uiH34O-ZwaSlz7aOQZ9oA?rtime=b770JuBg2Ug)

---

## :star2: Listed Ideas
> :new::fire: [**TREC ideas**](https://hackmd.io/@jacky18008/TREC_Idea)

---

## Timeline :calendar:
| [Test topics](https://github.com/daltonj/treccastweb/tree/master/2021) released | Runs due | Results released to participants |
| :------------------: | :---------------------: | :------------------------------: |
| 3rd July, 2021 | August 18th, 2021 (AOE) | November, 2021 |

---

## Information for Participants :information_source:
> **[TREC 2021 Active Participants](https://trec.nist.gov/act_part/act_part.html)**
> Login: trec2021
> Password: 30Something
>
> **`Team Name`**: **CFDA_CLIP**
> - [`Official guidelines`](https://docs.google.com/document/d/1Eo0IqQedYc_rfTw-YxbvUGTpoYSmejU0iDlUzQWj3_w/edit)
>
> - [`Trecweb GitHub`](https://github.com/daltonj/treccastweb)
>
> ![](https://i.imgur.com/csagI7l.jpg)

---

## Public Datasets for Year 3 :package:
>- [TREC Washington Post Corpus](https://ir.nist.gov/wapo/)
>
>>- Username: TRECWaPoSt
>>- Password: PdrG9U}>4W
>>- Download it on servers:
>>`wget --user TRECWaPoSt --ask-password https://ir.nist.gov/wapo/WashingtonPost.v4.tar.gz`
>>- WashingtonPost.v4.tar.gz
>>`cfda3:/tmp2/ctyeh/anserini/collections/washingtonpost-doc/WashingtonPost.v4.tar.gz`
>>- **Dir@Server:**
>>`/home/ctyeh/trec_2021/dataset/wapo_v4/WashingtonPost.v4/data/TREC_Washington_Post_collection.v4.jl`
>>- **Duplicate file**
>>`/home/ctyeh/trec_2021/dataset/wapo_v4/WashingtonPost.v4/data/wapo_duplicate_list_v1.0.txt`
>
>- [KILT Wikipedia](https://github.com/facebookresearch/KILT/)
>>- **Dir@Server:** `/home/ctyeh/trec_2021/dataset/kilt/kilt_knowledgesource.json`
>
>- [MS MARCO (Documents)](https://github.com/microsoft/MSMARCO-Document-Ranking)
>>- `wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz`
>>- **Dir@Server:** `/home/ctyeh/trec_2021/dataset/msmarco-doc/msmarco-docs.tsv`

---

## Data Preprocessing (CT & CW) :construction_worker: :computer:

### :dark_sunglasses: [`TREC-tools`](https://github.com/grill-lab/trec-cast-tools/tree/master/src/main/python) :eyes:

::: info
### 1. Directly chunk datasets into passages
:::

### :pick: Preprocess .py files
`cfda3:/home/cwlin/TREC2021/anserini/trec-cast-tools/src/main/python/` :eyes::eyes:
- `{marco,kilt,wapo}_trecweb.py` converts data to .xml
- Functions added to `python/src/helpers.py`
    - You may change the data format here; 3 converters are currently added:
        - `converts_to_json`
            - `{"id": " ", "contents": " "}`
        - `converts_to_json_title`
            - `{"id": " ", "title": " ", "contents": " "}`
        - `converts_to_tsv`
            - `id \t title \t contents`
- :fire: `{marco,kilt,wapo}_json.py` converts data to JSON format
    - takes 2 more `sys.argv` arguments, `batch_size` and `batch_num`, to parallelize data processing
    - use `{marco,kilt,wapo}-batch-preprocess.sh` to run 10 batches at the same time (`batch_size=100, batch_num=0~9` for the small dataset)

### :+1: Passages location
~~- `.tsv` - `/home/cwlin/TREC2021/datasets/all_collections/`~~
- `.json` in cfda3
    - `/tmp2/cwlin/trec-2021/json/collections/collection_jsonl`

::: info
### 2. Chunk Retrieved Documents into passages
:::

### :pick: Preprocess .py files
:open_file_folder:`cfda3: /tmp2/ctyeh/trec-cast-tools/src/main/python/doc2psg.py`
> This script creates monoT5 input files from the corpus, the queries, and the retrieval run file for those queries.
> Each line in the monoT5 input file follows the format: `f'Query: {query} Passage: {passage} Relevant:\n'`

### :+1: Document location
- :open_file_folder:`/tmp2/cwlin/trec-2021/doc/collections/collection_jsonl`

### :+1: Passages location
- :open_file_folder:`/tmp2/ctyeh/trec_dev/t5-input/`

---

## Document Expansion :dango:
Location of expanded documents
- :open_file_folder: `/tmp2/cytsao/trec_cast/docTTTTTquery/collections/expanded_doc/`

Experiment result
* :shell: `/tmp2/ctyeh/anserini/doct5query_BM25_experiment.sh`

Final indexes
* :open_file_folder: `cfda3: /tmp2/ctyeh/anserini/indexes/spacy-process-doc/test/doc-expansion-MARCO+KILT+WAPO`

---

## 1st-Stage Retrieval :dart:

### :one: Passage Retrieval
**BM25** (Anserini)
* see `/home/cwlin/TREC2021/anserini/experiments_{1,2}.sh`
* query files under: `/tmp2/cwlin/trec-2021/json/collections`
* **k1=0.82, b=0.68**

Experiments | File Name | MAP | Recall@1000
--- | --- | --- | ---
Raw Queries (utterance) | `run.test-all-collections_utterances.dev.trec` | 0.0250 | 0.2092
Rewritten Queries (query) | `run.test-all-collections.dev.trec` | **0.0730** | **0.5439**
queries_pred_canard.dev.tsv | `run.test-all-collections_canard.dev.trec` | 0.0366 | 0.3849
History utterance pair | `run.test-all-collections_history.dev.trec` | 0.0180 | 0.2803

### 0. Collections (Combine 3 datasets)
- `.tsv`
    - `/home/cwlin/TREC2021/datasets/all_collections/`
- `.json` in cfda3
    - `/tmp2/cwlin/trec-2021/json/collections/collection_jsonl`

### 1. indexes
- `/tmp2/cwlin/trec-2021/json/indexes`

### 2. retrieval
- `/tmp2/cwlin/trec-2021/anserini/runs`

### 3. evaluation

### Small Dataset
Small datasets are available on
- `cfda3:/tmp2/cwlin/trec-2021`
    - `/small`
    - `/small_tsv`
    - `/small_anserini`
- There are 3 folders (`kilt`, `wapo`, `marco`), each containing batches of JSON files.
- You can check the elapsed time for each batch in `logs`.
- Concatenate the JSON files from the 3 folders into `docs.json` in the `collection_jsonl` folder with `json_batch_to_collection.py`

### :two: Document Retrieval

### 1. indexes
- `/tmp2/cwlin/trec-2021/doc/indexes`

**BM25** (Anserini)
**Script for the experiment**: `/tmp2/ctyeh/anserini/bm25_doc_retrieve.sh`
:telescope: **k1=4.46, b=0.82**

Experiments | File Name (under /tmp2/ctyeh/anserini/runs/) | MAP | Recall@1000
--- | --- | --- | ---
Raw Queries (utterance) | `run.test-all-collections_utterances.dev.trec` | 0.0136 | 0.2510
Rewritten Queries (query) | `run.test-all-collections.dev.trec` | **0.0568** | **0.5230**
queries_pred_canard.dev.tsv | `run.test-all-collections_canard.dev.trec` | 0.0237 | 0.3766

**BM25+RM3** (Pyserini) :fire:
**Script for the experiment**: `/tmp2/ctyeh/anserini/bm25+rm3_doc_retrieve.sh`
:telescope: **k1=4.46, b=0.82, rm3**

Experiments | File Name (under /tmp2/ctyeh/anserini/runs/) | MAP | Recall@1000
--- | --- | --- | ---
Raw Queries (utterance) | `run.test-all-collections_utterances.dev.trec` | 0.0136 | 0.2510
Rewritten Queries (query) | `run.test-all-collections.dev.trec` | **0.0568** | **0.5230**
queries_pred_canard.dev.tsv | `run.test-all-collections_canard.dev.trec` | 0.0237 | 0.3766

---

## Data and Formatting :file_cabinet:

Description | Available | File Name | Format
--- | --- | --- | ---
Full collections | :white_check_mark: | [@server](#) | docid-pid, (title, body)
Raw Queries Dev | :heavy_check_mark: | [utterances.dev.tsv](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA/blob/main/dev/utterances.dev.tsv) | topic_turn id, utterance
Predicted Queries Dev | :heavy_check_mark: | [queries_pred_canard.dev.tsv](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA/blob/main/dev/utterances.dev.tsv) | topic_turn id, rewritten query
Truth Queries Dev | :heavy_check_mark: | [queries.dev.tsv](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA/blob/main/dev/queries.dev.tsv) | topic_turn id, manual query
Urels Dev | :heavy_check_mark: | [urels.dev.tsv](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA/blob/main/dev/urels.dev.tsv) | topic_turn id, docid-pid
Runs Dev | :white_check_mark: | [run_bm25.dev.tsv](#) | topic_turn id, docid-pid, rank
Runs Dev (w/ NTR) | :white_check_mark: | [run_ntr+bm25.dev.tsv](#) | topic_turn id, docid-pid, rank
TREC Submission | :white_check_mark: | [xxx.tsv](#) | **topic_turn id**, "Q0", **docid-pid**, **rank**, score, xxx

- Ref: [Collection Index format](https://www.treccast.ai/) | [Query Index format](https://docs.google.com/document/d/1Eo0IqQedYc_rfTw-YxbvUGTpoYSmejU0iDlUzQWj3_w/edit#)

## Y3 Test Topics :microscope:
> [**`Dev topics`**](https://github.com/daltonj/treccastweb/tree/master/2021)
> [**`Dev data`**](https://github.com/jacky18008/Trec_CAsT_2021_ASCFDA/tree/main/dev)

## Y3 Baseline
> [Automatic Runs](https://github.com/daltonj/treccastweb/tree/master/2021/baselines)
> ![](https://i.imgur.com/Ht0EBSp.jpg)
> Performance
>
>Experiments | MAP | Recall@1000
>--- | --- | ---
>Baseline | 0.1367 | 0.7992
>
> [Web UI](http://3.142.108.201:3500/)

## TODO:
- [x] Data Preprocess (`c & c`)
- [x] BM25 Baseline (`ctyeh` & `cwlin`)
- [x] JM's Keyword Extraction & QQP (hhchen)
- [x] T5 Re-ranker (MSMARCO) & T5-NTR (CANARD) (JHJu)
- [x] doc2query-T5
- [x] RM3

## Last Year Legacy Code
> - [Handover **(Last Section)**](https://hackmd.io/6wc7zBdRTCKAhdXfgd_6Aw?fbclid=IwAR2CKtfe3brTQmm1GKefQccujHxPfs9-mqH8SeZrP1oZNn-gHuKJjjkAYlM#TREC-CAsT-2020)
> - [TREC 2020 **(Last year's HackMD)**](https://hackmd.io/A6IPFBc3QP61CoXJedRI-w?both)
> - [06/03 **(Anserini BM25)**](https://hackmd.io/m8EQqy0VScWaEx2Aiv-uCw)
> - [06/19 **(What is indexing?)**](https://hackmd.io/2UKrJa06TKWNAMsHvo55XA?view)
> - [06/24 **(Anserini BM25)**](https://hackmd.io/@WBBxL3DCQo6KfLdG__cflw/H1qxdXkCU)
> - [07/17 **(BERT rerank)**](https://hackmd.io/xUErkqYMQ96lpoigYUfOQA)

## Important Topics ...
1st-Stage Retrieval
- [Anserini](https://github.com/castorini/anserini/blob/master/docs/regressions-msmarco-passage.md)
- [Pyserini](https://github.com/castorini/pyserini)
- [Pyserini DPR](https://github.com/castorini/pyserini/blob/master/docs/experiments-dpr.md)
- [TPU Quick Start](https://cloud.google.com/tpu/docs/quickstart)

Document Expansion
* [docT5query (improved version)](https://github.com/castorini/docTTTTTquery)
* [doc2query](https://github.com/nyu-dl/dl4ir-doc2query)

Query Rewriting
* [T5 CANARD](https://huggingface.co/castorini/t5-base-canard) (used to resolve coreference)

`References`
* [Harnessing Evolution of Multi-Turn Conversations for Effective Answer Retrieval](https://arxiv.org/pdf/1912.10554.pdf?fbclid=IwAR1tTyFxMMokpyCrsD4pN7l_rYMPI6AqDmooGUpyi2qRBuGEmzYseHVcN8g)
* [Query Resolution for Conversational Search with Limited Supervision](https://arxiv.org/pdf/2005.11723.pdf?fbclid=IwAR1P9yJXbfxfz-koMQRs_zX_D0znV45GArirv08x-t3CpZAkAxODbQzqVDE)
* [Question Rewriting for Conversational Question Answering](https://arxiv.org/pdf/2004.14652.pdf?fbclid=IwAR1UiyP-CBlFqZZD07Yy7kuCYlsiSsZ23-S6uU_ZOrIbUmahOrf34nNs_rU)
* [TREC CAsT 2020 Overview](https://trec.nist.gov/pubs/trec29/papers/OVERVIEW.C.pdf)
* [TREC CAsT 2019 Overview](https://arxiv.org/pdf/2003.13624.pdf)
* [AS_CFDA 2020](https://trec.nist.gov/pubs/trec29/papers/ASCFDA.C.pdf?fbclid=IwAR0SMVIBd9P0D0ljh375za1LY36KcMEIhvAPMZLV-LZ3GIN2u4XP9sCZ0OE)
* [Multi-Stage Retrieval with BERT (2019)](https://arxiv.org/pdf/1910.14424.pdf?fbclid=IwAR2R3tIVsybZ1bVgkL0mC8N5sGcLYpEQtzmI5L0qHA5Pt8-WflhPv7DZU_w)

## Anserini: [BM25 Passage Ranking Example](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md)

## Jack & Matt's paper:
- Query Expansion:
    - Extract **topic** and **subtopic** words with BM25.
    - Measure the ambiguity of each utterance with BM25.
    - Concatenate the topic and subtopic words to ambiguous utterances.
- Neural Transfer Reformulation:
    - Rewrite the current utterance into a standalone query.
- Ranking Fusion:
    - Combine the rankings from BM25 and the rerank model.

[Conversational IR pipeline (Jack & Matt)](https://arxiv.org/pdf/2005.02230.pdf)

## References:
- TREC 2020 summary
    - https://trec.nist.gov/pubs/trec29/papers/OVERVIEW.C.pdf
- Matt & Jack
    - https://trec.nist.gov/pubs/trec29/papers/h2oloo.C.pdf
- Possible approaches for 1st-stage retrieval
    - [Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval (DeepCT)](https://arxiv.org/abs/1910.10687)
    - [Dense passage retrieval for open-domain question answering (DPR)](https://arxiv.org/abs/2004.04906)
    - [A Replication Study of Dense Passage Retriever](https://arxiv.org/abs/2104.05740)
    - [KILT: a Benchmark for Knowledge Intensive Language Tasks](https://aclanthology.org/2021.naacl-main.200.pdf)
    - [Efficient Passage Retrieval with Hashing for Open-domain Question Answering](https://arxiv.org/pdf/2106.00882.pdf)
    - [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/pdf/2104.08663.pdf) (https://github.com/UKPLab/beir)
    - [Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval](https://arxiv.org/pdf/2009.12756.pdf) (https://github.com/facebookresearch/multihop_dense_retrieval)
    - [Expando-Mono-Duo](https://arxiv.org/pdf/2101.05667.pdf)

2019 answers & collections: `/tmp2/yhtseng/trec/treccastweb/2019`

## :email: Email List:
- 致廷: justinyeh1995@gmail.com
- 程緯: chengweilin.1997@gmail.com
- 家輝: dylanjootw@gmail.com
- 俊恩: m072040013@g-mail.nsysu.edu.tw
- 韋辰: pigoodDD@gmail.com
- 佳穎: akuelvish7@gmail.com
- 昱君: nuko7055@g.ncu.edu.tw
- 武峰: tcfshcos8@gmail.com
- 先灝: jacky18008@gmail.com
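
As a footnote to the `TREC Submission` row in the data format table (topic_turn id, "Q0", docid-pid, rank, score, run tag), here is a minimal sketch of writing ranked results in that layout. It is not the team's actual submission script: the helper name `to_trec_run_lines`, the sample turn id `106_1`, the passage ids, and the run tag `demo_run` are all hypothetical, and the tab separator is only an assumption based on the `.tsv` file name.

```python
from typing import Dict, List, Tuple


def to_trec_run_lines(
    results: Dict[str, List[Tuple[str, float]]],
    run_tag: str,
    sep: str = "\t",  # assumption: tab-separated, following the .tsv naming
) -> List[str]:
    """Format scored passages per turn as TREC-style run lines.

    `results` maps a topic_turn id to (docid-pid, score) pairs.
    """
    lines = []
    for turn_id, scored in results.items():
        # Highest score first; ranks start at 1.
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        for rank, (docid_pid, score) in enumerate(ranked, start=1):
            fields = [turn_id, "Q0", docid_pid, str(rank), f"{score:.4f}", run_tag]
            lines.append(sep.join(fields))
    return lines


# Hypothetical ids and scores, for illustration only.
example = {"106_1": [("MARCO_123-2", 11.5), ("KILT_456-1", 12.25)]}
run_lines = to_trec_run_lines(example, run_tag="demo_run")
print("\n".join(run_lines))
```

Ranks are recomputed from the scores rather than taken from the input, so the rank and score columns cannot disagree. Pass `sep=" "` if a space-separated file is preferred.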