# Open-Domain Question-Answering System regarding COVID-19 FAQ

## Introduction

During the COVID-19 epidemic, many people are unsure what to do about severe respiratory health problems. It is therefore critical to retrieve trustworthy information in order to give a proper response. In this project, I will try to build a chatbot that uses a Question-Answering (QA) system in NLP to answer FAQs about COVID-19.

## Question-Answering System in NLP

- A QA system is a system that gives appropriate answers to questions expressed in natural human languages such as English.
- There are 2 major paradigms: knowledge-based and information retrieval-based (IR-based). In this particular project, I will use an **IR-based QA system to build a model that will be able to answer novel questions at test time and choose an answer from the set of answers it has seen during training**.

## Proposal

In this project, I will try to develop an open-domain QA system where an end user can ask a question related to COVID-19 healthcare and get a proper answer. The IR-based QA system has 3 stages:

1. Step 1: Question Processing
2. Step 2: Document and Passage Retrieval
3. Step 3: Answer Extraction

![](https://i.imgur.com/kIVlDBv.png)

### 1. Retriever Model

I will use classic IR: the non-learning-based TF-IDF, where every query and document is modelled as a bag-of-words vector and each term is weighted by Term Frequency x Inverse Document Frequency (TF-IDF). In particular, the retrieved text segments will be ranked by BM25, a classic TF-IDF-based retrieval scoring function.

### 2. Reader Model

The objective of the reader model is to extract an answer to a given question from a given context document. I will use BERT to solve this task. The question and the retrieved passage become the input to BERT, and BERT outputs the answer offsets with a score.

![](https://i.imgur.com/2wssCzb.png)

## Pipeline

1. Step 1: Question Processing
   The dataset includes 1690 COVID-related questions, which have been annotated into 15 broad categories (e.g. Transmission, Prevention) and a more specific class such that questions in the same class are all asking the same thing. Questions will be classified by BERT, using this dataset, to decide which response category to use.
2. Step 2: Document Retrieval
   Using the Google API to retrieve documents related to the input query
3. Step 3: Passage Retrieval
   Using BM25 to select passages that are similar to the question
4. Step 4: Answer Extraction using BERT

The expected output is the top 5 related answers, retrieved from Google Search.

## References

- [How to Build an Open-Domain Question Answering System?](https://lilianweng.github.io/lil-log/2020/10/29/open-domain-question-answering.html)
- [Building an application of question answering system from scratch](https://towardsdatascience.com/building-an-application-of-question-answering-system-from-scratch-2dfc53f760aa)
- [Vietnamese Question-Answering using BERT](https://github.com/mailong25/bert-vietnamese-question-answering)
- [COVID Questions Dataset](https://github.com/JerryWei03/COVID-Q)
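
## Appendix: BM25 Passage Ranking Sketch

The passage-retrieval step (ranking retrieved text segments with BM25) could look roughly like the sketch below. It is a minimal pure-Python illustration, not the project's actual implementation. The function names `tokenize`, `bm25_scores` and `top_passages` are hypothetical. The parameters `k1=1.5` and `b=0.75` are commonly used defaults and are assumed here, not taken from the project. The lowercase whitespace tokenizer is a simplification, and the IDF uses the smoothed, non-negative variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase whitespace tokenization,
    # with no stemming or punctuation handling.
    return text.lower().split()


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Return one BM25 score per passage for the given query.

    k1 and b are assumed defaults, not values from the project.
    """
    docs = [tokenize(p) for p in passages]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of passages containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = tokenize(query)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with passage-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def top_passages(query, passages, n=5):
    # Rank passages by descending BM25 score and keep the best n.
    scores = bm25_scores(query, passages)
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked[:n]]


if __name__ == "__main__":
    example_passages = [
        "COVID-19 spreads mainly through respiratory droplets",
        "Wash your hands often with soap and water",
        "The weather is sunny today",
    ]
    print(top_passages("How does COVID-19 spread", example_passages, n=1))
```

In the pipeline, the passages returned by `top_passages` would then be passed, together with the question, to the BERT reader for answer extraction. In practice an existing BM25 library and a proper tokenizer would likely replace this hand-written version.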