# Team 6: What do we know about diagnostics and surveillance?
## Team members:
- Abhinav Kumar Ray - 170010022
- Priyanshu Singh - 170010030
- Chetan Rajput - 170010021
- Mehul Saxena - 170010017
- Rahul Salunke - 170010002
- Anusha Devulapally - 1913101
- Varun Parekh - 1913108
## Abstract:
We aim to look at previous articles on diagnostics and surveillance. Basic models such as bag of words and TF-IDF have been implemented, along with Google's Universal Sentence Encoder (USE). The outputs of the models are compared.
## Introduction:
We start by analysing the problem statement and finding keywords from the given subtasks. Then we apply preprocessing to the given data. Next we filter the articles based on various parameters and finally apply the three models mentioned above. The results, i.e., the relevant articles returned for a query, are then compared.
## Data Set:
The data set is taken from the COVID-19 Open Research Dataset Challenge (CORD-19) by the Allen Institute for AI and consists of around 60k articles. The metadata of the papers comes from bioRxiv, medRxiv, PMC and CZI, and a JSON file is present for each article. The metadata.csv file consists of 60,000 rows x 19 columns: cord_uid, sha, source_x, pubmed_uid, pmcid, license, url, doi, title, abstract, publish_time, authors, journal, Microsoft Academic Paper ID, WHO, has_pdf_parse, has_pmc_xml_parse, full_text_file.
## Analysing the problem statement:
### Keyword extraction:
A fundamental task in analysing the problem statement is finding the relevant keywords related to the given tasks. Keyword extraction (also known as keyword detection or keyword analysis) is a text analysis technique that automatically extracts the most important words and expressions from a text. It helps summarize the content of a text and recognize the main topics being discussed. We used keyword extraction to find relevant words from the queries, and with the help of these keywords we can find the relevant articles for a given query.
Process of keyword extraction:
1. First, we stored all the queries in a single string, "text".
2. Tokenization is a very common task in NLP: it is the task of chopping a character sequence into pieces, called tokens, while throwing away certain characters, like punctuation, at the same time. We applied this tokenization technique using a spaCy tokenizer to break the string down into words.
3. After removing punctuation, stopwords and custom stop words, we got the keywords. We then reduced the number of keywords by taking only those with a count of less than 100, in order to filter more precise and relevant articles. A code sketch of this process follows the figures below.
> 
Fig 1. Keyword extraction process flow
> 
Fig 2. Extracted keywords
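The snippet below is a minimal sketch of this keyword-extraction step. It assumes spaCy's blank English tokenizer; the custom stop-word list shown is a hypothetical placeholder, and only the count-below-100 filter follows the description above.

```python
from collections import Counter

import spacy

# Blank English pipeline: gives tokenization plus lexical attributes
# (is_punct, is_stop) without needing a trained model download.
nlp = spacy.blank("en")

# Hypothetical custom stop words; the project used its own list.
custom_stop_words = {"disease", "virus", "infection"}

def extract_keywords(text, max_count=100):
    doc = nlp(text.lower())
    # Drop punctuation, whitespace, built-in stop words and custom stop words
    tokens = [tok.text for tok in doc
              if not tok.is_punct and not tok.is_space
              and not tok.is_stop and tok.text not in custom_stop_words]
    counts = Counter(tokens)
    # Keep only words whose count is below the threshold, for more precise keywords
    return [word for word, count in counts.items() if count < max_count]

# text = " ".join(all_queries)   # all sub-task queries joined into one string
# keywords = extract_keywords(text)
```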
## Pre-Processing
Data preprocessing is a technique used to convert raw data into a clean data set. The clean data should have features that can be easily interpreted by the subsequent algorithm, and we need the data in the more systematic form of a dataframe to apply algorithms to it. There were a total of 59,311 articles related to COVID-19 stored in JSON files. We used the glob module to find files with the .json extension recursively. Then we created a dictionary with features like paper id, abstract, body text, authors, title, journal, and abstract summary from the JSON files. At the end we converted the dictionary to a dataframe.
After creating the dataframe, we filtered it: we deleted rows having columns with null values and also deleted rows with duplicate body text or abstract summary. A code sketch of these steps follows Fig 3. The dataframe information after preprocessing is shown below:
> 
Fig 3. Dataframe information post preprocessing
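Here is a minimal sketch of the JSON-to-dataframe step described above, assuming the standard CORD-19 JSON layout and a placeholder dataset root path; the exact fields kept in the project (e.g. journal, abstract summary) are simplified.

```python
import glob
import json

import pandas as pd

# Recursively collect every .json file under the (placeholder) dataset root
json_paths = glob.glob("CORD-19/**/*.json", recursive=True)

records = []
for path in json_paths:
    with open(path) as f:
        article = json.load(f)
    records.append({
        "paper_id": article["paper_id"],
        "title": article["metadata"]["title"],
        "authors": ", ".join(a["last"] for a in article["metadata"]["authors"]),
        "abstract": " ".join(p["text"] for p in article.get("abstract", [])),
        "body_text": " ".join(p["text"] for p in article["body_text"]),
    })

df = pd.DataFrame(records)
# Drop rows with null values and rows with duplicate body text or abstract
df = (df.dropna()
        .drop_duplicates(subset=["body_text"])
        .drop_duplicates(subset=["abstract"]))
```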
## Filtering
### 1. English language
We need articles written only in English. For this, we used the langdetect library to detect the language in which each article is written. From every article, we took the first 50 words from the "Body Text" or "Abstract Summary" and applied the detect function to the string created by those 50 words. The language detection algorithm used by this function is non-deterministic, which means that if you run it on text which is either too short or too ambiguous, you might get different results each time; we therefore settled on 50 words as the sample size. In this library, each language is identified by two lowercase letters. The articles turned out to be written in 12 different languages.
> 
Fig 4. Total number of articles per language; 32,653 articles were found to be written in English
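The language filter could look roughly like the sketch below, assuming the df built in the previous step; langdetect's non-determinism is tamed by fixing a seed, and detection falls back to the abstract when the body text is empty.

```python
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0   # make the non-deterministic detector reproducible

def detect_language(row):
    # Prefer the body text; fall back to the abstract if it is empty
    text = row["body_text"] if row["body_text"].strip() else row["abstract"]
    sample = " ".join(text.split()[:50])    # first 50 words only
    try:
        return detect(sample)               # two-lowercase-letter code, e.g. "en"
    except Exception:                       # detection fails on empty/odd strings
        return "unknown"

df["language"] = df.apply(detect_language, axis=1)
english_df = df[df["language"] == "en"]     # ~32,653 articles
```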
### 2. Covid related articles
After filtering for English, we were left with about 32,653 articles. From these, we then filtered the articles that contain the word "covid-19" or one of its synonyms. We searched and read some articles on the internet to find synonyms of "covid-19" and collected about 22 of them.
>
Fig 5. Synonyms of "Covid-19"
To filter the articles containing these synonyms, we used boolean indexing together with the .str.contains() function, which returns a boolean value indicating whether a given word is contained within a string. First, we created a boolean index list for all 32,653 articles with every value set to False. Then, using .str.contains(), we found the articles containing each synonym and updated the list, replacing False with True at those indexes. Using boolean indexing we then selected only the articles with a True value in the list. After filtering, we were left with about 1,820 articles that contained these synonyms; a sketch of this filter is shown below.
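A rough sketch of this boolean-indexing filter, assuming the english_df dataframe from the previous step; the synonym list here is a small hypothetical subset of the ~22 terms collected in the project.

```python
import pandas as pd

# Hypothetical subset of the ~22 synonyms collected for "covid-19"
covid_synonyms = ["covid-19", "covid19", "sars-cov-2", "2019-ncov",
                  "novel coronavirus", "wuhan coronavirus"]

# Start with an all-False mask, one entry per English article
mask = pd.Series(False, index=english_df.index)
for synonym in covid_synonyms:
    # Flip the mask to True wherever the body text mentions this synonym
    mask |= english_df["body_text"].str.contains(synonym, case=False, na=False)

covid_df = english_df[mask]   # ~1,820 articles remain after filtering
```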
## Basic text mining methods
We have used basic text mining techniques, namely bag of words and TF-IDF, and implemented them on our dataset.
### 1. Bag of Words
In this technique, we created a bag of words from all the data in the abstract column (processed with NLTK) and vectorized the column to produce feature vectors. The query vector is built from the same bag of words; we computed its distance to each feature vector and returned the closest one as the output for our query string. A sketch of this retrieval loop follows the figures below.
>
Fig 6. Bag of Words block diagram
> 
Fig 7. output: bag of words
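As referenced above, here is a minimal bag-of-words sketch, assuming the covid_df dataframe and its abstract column from the earlier sketches, using sklearn's CountVectorizer and cosine similarity for the distance computation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Build the bag of words over the (NLTK-processed) abstracts
vectorizer = CountVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(covid_df["abstract"])   # one row per article

def bow_search(query, top_k=1):
    # Build the query vector from the same vocabulary
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    best = scores.argsort()[::-1][:top_k]        # closest articles first
    return covid_df.iloc[best][["title", "abstract"]]

# bow_search("What do we know about diagnostics and surveillance?")
```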
### 2. TF-IDF
In this technique, we focus not only on the term frequency in a document, as in bag of words, but also on each term's importance across all the documents, as indicated by its tf-idf value. We used TfidfVectorizer() from sklearn to implement this technique; the columns used are the title and abstract columns after processing them with NLTK. A sketch follows the figures below.
> 
Fig 8. tf-idf block diagram
> 
Fig 9. output: tf-idf
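The TF-IDF variant of the same retrieval loop might look like the sketch below (title and abstract are simply concatenated here; the actual notebook may combine them differently).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Combine the processed title and abstract columns into one corpus
corpus = covid_df["title"] + " " + covid_df["abstract"]
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(corpus)

def tfidf_search(query, top_k=5):
    query_vec = tfidf.transform([query])
    # linear_kernel on L2-normalised tf-idf vectors equals cosine similarity
    scores = linear_kernel(query_vec, tfidf_matrix).ravel()
    top = scores.argsort()[::-1][:top_k]
    return covid_df.iloc[top]["title"]
```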
## Universal Sentence Encoder (USE)
The Universal Sentence Encoder is a text embedding model. It encodes text into high-dimensional vectors that are used in tasks such as text classification and semantic similarity. The input is variable-length text (a word, sentence, or small paragraph), and the output is a fixed 512-dimensional vector.
So far we have seen the bag of words and TF-IDF models on the articles provided for the task. The disadvantage of those models is that they do not preserve the semantic information of the text.
The USE model, on the other hand, preserves semantic information. We used two approaches to check the output of the USE model on the same articles:
1. With lemmatization.
2. Without lemmatization.
> 
Fig 10. input: filtered articles
Steps:
1. Import the necessary packages.
2. Load the USE model from TensorFlow Hub, i.e., "https://tfhub.dev/google/universal-sentence-encoder/4".
3. Keep a copy of the useful_info as train_data.
4. Keep all the sub-tasks or queries mentioned under the main task in a list.
```python
query = ["What do we know about diagnostics and surveillance?",
         "How widespread current exposure is to be able to make immediate policy recommendations on mitigation measures. Denominators for testing and a mechanism for rapidly sharing that information, including demographics, to the extent possible. Sampling methods to determine asymptomatic disease (e.g., use of serosurveys (such as convalescent samples) and early detection of disease (e.g., use of screening of neutralizing antibodies such as ELISAs)"]
```
:::info
Note: only two sub-tasks are considered in the query list
:::
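A minimal sketch of steps 1-4, assuming the tensorflow_hub package and the covid_df / useful_info names used in the earlier sketches; it only shows how the model is loaded and how text is turned into 512-dimensional embeddings.

```python
import tensorflow_hub as hub

# Step 2: load the pretrained USE model from TensorFlow Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Step 3: keep a copy of the useful_info column as train_data
train_data = covid_df["useful_info"].copy()

# Step 4: the sub-task queries (full text as listed above)
query = ["What do we know about diagnostics and surveillance?",
         "How widespread current exposure is ..."]

query_embeddings = embed(query)                   # shape: (len(query), 512)
article_embeddings = embed(train_data.tolist())   # shape: (num articles, 512)
```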
### With lemmatization:
Here, we apply lemmatization to the useful_info data column.
1. Remove punctuation, newlines and double white spaces, and trim leading and trailing white space in the useful_info column.
2. Convert the entire text to lower case.
3. Apply word tokenization on the above-processed data.
4. Store the tokenized words in info_tokenize.
5. For lemmatization, we used a WordNet lemmatizer to lemmatize info_tokenize. (WordNet is a lexical database for the English language that aims at establishing structured semantic relationships between words.)
6. The WordNet lemmatizer takes a part-of-speech tag as its second argument; the default is noun.
7. During lemmatization, remove the stop words and store the remaining keywords.
8. Once the lemmatization is done, we run the model on the lemmatized words and store the resulting embeddings in a variable for further use.
> 
Fig 11. final_keywords obtained after lemmatization
9. We encode each query in the list with the USE model.
10. We use a linear kernel function to compute the similarity scores between each query embedding and the embeddings obtained from info_tokenize.
11. The similarity scores are sorted, and we take the top k relevant articles for each query.
12. Steps 9-11 are repeated for each query in the list; a code sketch combining these steps follows Fig 12.
>
Fig 12. Output: USE model with lemmatization
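Below is a rough, end-to-end sketch of this lemmatization path, assuming NLTK's WordNet lemmatizer and the covid_df / useful_info names from the earlier sketches; the exact cleaning rules in the notebook may differ.

```python
import re

import nltk
import tensorflow_hub as hub
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import linear_kernel

# One-time NLTK downloads for tokenization, WordNet and stop words
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("stopwords")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def lemmatize_text(text):
    # Steps 1-2: drop punctuation, collapse whitespace, lower-case
    text = re.sub(r"[^\w\s]", " ", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    # Steps 3-7: tokenize, remove stop words, lemmatize (default POS is noun)
    tokens = word_tokenize(text)
    return " ".join(lemmatizer.lemmatize(tok) for tok in tokens
                    if tok not in stop_words)

final_keywords = covid_df["useful_info"].apply(lemmatize_text)

# Step 8: embed the lemmatized text with USE
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
article_embeddings = embed(final_keywords.tolist()).numpy()

# Steps 9-11: embed a query, score it with a linear kernel, return the top k titles
def top_k_articles(q, k=5):
    scores = linear_kernel(embed([q]).numpy(), article_embeddings).ravel()
    return covid_df.iloc[scores.argsort()[::-1][:k]]["title"]
```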
### Without lemmatization:
1. We train the USE model on the train_data column.
2. We calculate the similarity scores and retrieve top k relevant articles as we did in the above approach.
>
Fig 13. Output: USE model without lemmatization
## Comparison
### 1. Comparing the USE model with lemmatization and without lemmatization
1. The outputs of both approaches are the same.
2. The only difference between the two methods is the time taken to run the model.
3. The model without lemmatization takes more time to run, but lemmatization itself is expensive to compute: on the entire set of articles it would take around 24 hours. Since we filtered the data during preprocessing, the number of articles is reduced, so it takes much less time on the filtered data.
### 2. Comparing the Bag of words, TF-IDF and USE model outputs
Sample query for which the comparisons have been done:
Query: "How widespread current exposure is to be able to make immediate policy recommendations on mitigation measures. Denominators for testing and a mechanism for rapidly sharing that information, including demographics, to the extent possible. Sampling methods to determine asymptomatic disease (e.g., use of serosurveys (such as convalescent samples) and early detection of disease (e.g., use of screening of neutralizing antibodies such as ELISAs)"
The outputs of the models are
>
Fig 14. Bag of words
>
Fig 15. TF-IDF
>
Fig 16. USE model
## Conclusion
We conclude that the Universal Sentence Encoder gave better results for our use case than Bag of Words and TF-IDF in terms of desired output.