---
title: COVID_19 NLP Project Report
author: Team 2
---

# COVID-19 Report

## Team Members

- Kalyani Goyal - 170010009
- Pulaksh Garg - 170010015
- Rahul Baviskar - 170010004
- Raj Hansini Khoiwal - 170010014
- Raj Jagtap - 170030004
- Rajat Kumar - 1913105

## Table Of Contents

- [Introduction](#Introduction)
- [Preprocessing](#Pre-processing)
  - [Reading the Dataset](#Reading-the-Dataset)
  - [Cleaning up Dataset](#Cleaning-up-Dataset)
  - [Filtering Text in Dataset](#Filtering-Text-in-Dataset)
  - [Filtering English Documents](#Filtering-English-Documents)
  - [Removing small papers from Dataset](#Removing-small-papers-from-Dataset)
  - [Generating Stopwords](#Generating-Stopwords)
- [Sentence Transformers](#Sentence-Transformers)
- [TF-IDF](#TF-IDF)
- [Topic Modelling : LDA](#Latent-Dirichlet-Allocation)
- [Word as Vector](#Word-as-Vector)
  - [GloVe Vectors](#GloVe-Vectors)
  - [Document to Vector](#Document-to-Vector)
  - [Question to Vector](#Question-to-Vector)
- [Cosine Similarity](#Cosine-Similarity)
- [KNN Approach](#KNN-Approach)
- [1D CNN](#1D-CNN)
- [Text Summarisation](#Text-Summarisation)
- [Analysis](#Analysis)
  - [TF-IDF](#TF-IDF1)
  - [Word Vectors](#Word-to-Vectors-Cosine-Similarity-amp-KNN-Approach)
  - [1D CNN](#1D-CNN1)
  - [Sentence Transformers](#Sentence-Transformers-BERT)
  - [LDA](#LDA)
- [Experimental Results Comparison](#Experimental-Results-Comparison)
- [Related Work](#Related-Work)
- [Conclusion](#Conclusion)
- [References](#References)

## Introduction

> The task chosen is "What do we know about virus genetics, origin, and evolution? What do we know about Real-time tracking of whole genomes?"

The aim is to take the above query and apply machine learning techniques to retrieve the answer from the corpus (a group of documents). We applied five different methods for information retrieval. Each method is explained below, followed by an analysis and comparison that qualitatively weighs the merits and demerits of each method.

## Pre-processing

We use several functions to convert the raw JSON files into a standard, usable format.

### Reading the Dataset

> all json paths ---> read_dataset ---> dataframe

We use the `read_dataset` function to read all the JSON files sequentially and store their *title*, *authors*, *abstract*, *body_text*, etc. in a *pandas* dataframe.

### Cleaning up Dataset

> dataframe ---> data_cleanup ---> dataframe

We use the `data_cleanup` function to remove all duplicate and empty entries from the dataframe. Duplicates are found based on the pairs (*abstract*, *body_text*) or (*abstract*, *filtered_body_text*).

### Filtering Text in Dataset

> dataframe ---> text_filtering ---> dataframe

We use the `text_filtering` function to
1. Remove text inside [] and () as it is most likely references and comments.
2. Keep only alphanumeric characters, since special characters such as <img src="https://render.githubusercontent.com/render/math?math=$\alpha , \beta, \sum$" > cannot be used by the methods below.
3. Remove \n and \t.
4. Remove extra spaces.
5. Lowercase all characters.

### Filtering English Documents

> dataframe ---> filter_english ---> dataframe

We use the `filter_english` function to select only English documents from the dataset. We use the *langdetect* module to detect the language of the filtered body text, and keep a document only if its language is detectable and is English.
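A minimal sketch of this filtering step, assuming the dataframe exposes a `filtered_body_text` column as described above; the exact signature of `filter_english` in our code may differ.

```python
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make language detection deterministic across runs

def filter_english(df, text_col="filtered_body_text"):
    """Keep only documents whose filtered body text is detectably English."""
    def is_english(text):
        try:
            return detect(text) == "en"
        except Exception:
            # language could not be detected (e.g. empty or non-text content)
            return False
    return df[df[text_col].apply(is_english)].reset_index(drop=True)
```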
### Removing small papers from Dataset

> dataframe ---> remove_small_papers ---> dataframe

We use the `remove_small_papers` function to remove small papers from the dataset: if the filtered body text is shorter than 500 characters, the document is not selected.

### Generating Stopwords

> The stop words in the model come from three sources, namely:

* Customized stop words like 'doi', 'preprint', etc.
* Stopwords imported from the spacy.lang.en.stop_words library
* Stopwords imported from library

A final set combining all of the above was generated and then applied to the corpus. In the function `words_freq_atleast_2`, the stop words are removed from the text of the document passed as the argument of the function.

## Sentence Transformers

> **Idea**: This method encodes the data using the *sentence-transformers* library.

* First, data preprocessing included removing square-bracketed content, content in parentheses, all punctuation except hyphens, and extra spaces, and lowercasing everything. Non-English papers and papers with fewer than 500 words were also filtered out.
* Using *nltk tokenize* we split the body of each research paper into sentences.
* We use *sentence-transformers* to generate an embedding with 768 components for each sentence.
* We then average the sentence embeddings over a paper and also compute the norm (root mean square) of the resulting matrix.
* Using these embeddings we calculate the cosine similarity with our questions (sentences) and list the most similar results.

The main problem encountered was that this method is computationally quite taxing.
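A minimal sketch of this encoding and ranking step; the model name `bert-base-nli-mean-tokens` is an assumption (any 768-dimensional sentence-transformers model fits the description above), and the helper names are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed model: any 768-dimensional sentence-transformers model matches the description.
model = SentenceTransformer("bert-base-nli-mean-tokens")

def paper_embedding(sentences):
    """Encode each sentence (768 components) and average over the paper."""
    sentence_vectors = model.encode(sentences)   # shape: (num_sentences, 768)
    return np.mean(sentence_vectors, axis=0)

def rank_papers(question, paper_vectors, top_k=5):
    """Rank papers by cosine similarity between the question and paper embeddings."""
    q = model.encode([question])[0]
    sims = paper_vectors @ q / (np.linalg.norm(paper_vectors, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:top_k]
```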
## TF-IDF

> **Idea**: TF-IDF weighting down-weights words that are common across the corpus, acting as an additional stop-word removal step on top of the normalization techniques learned in the course. It works on the term-document matrix generated initially.

* TF-IDF measures how important a word is to a document. The matrix comprises D document vectors of vocabulary size V; in the TF-IDF matrix each document has its own V-sized vector.
* The query (question) is converted to a similar V-sized vector by pre-processing the query.
* Using cosine similarity, the indexes of the top N most relevant documents are stored in a list.
* This list is passed to the function `generate_summary()`, which performs text summarization, retrieves the K sentences most relevant to the query, and displays the final answer.
* Note that N and K are both input parameters to be set by the user.

The imports used for TF-IDF are:
* `TfidfVectorizer` from `sklearn.feature_extraction.text`
* `linear_kernel` from `sklearn.metrics.pairwise`

The formula for the TF-IDF matrix is given below:

<center> <img src="https://render.githubusercontent.com/render/math?math=$tf\text{-}idf=tf_{dt} * log(\frac{D}{d_{ft}})$" > </center> <br>

* where <img src="https://render.githubusercontent.com/render/math?math=$tf_{dt}$" > is the term frequency
* *D* is the total number of documents
* <img src="https://render.githubusercontent.com/render/math?math=$d_{ft}$" > is the number of documents in the corpus in which the term occurs.

### Example run

* The query given to the model is "What is the evidence that livestock could be infected (field surveillance, genetic sequencing, receptor binding) Evidence of whether farmers are infected, and whether farmers could have played a role in the origin."
* The results generated are as follows:

> * Summary for article: A questionnaire-based survey on the uptake and use of cattle vaccines in the UK Eighty-six per cent (n=229/266) of respondents indicated that they had vaccinated their cattle in the past year, with more dairy farmers vaccinating compared to beef farmers (Table 3). Timing of vaccination may also be important.
> * Summary for article: Environmental Economics and Policy Studies Environmental effects of intensification of agriculture: livestock production and regulation As the distance from the city center increases, rents from livestock production may decline because of higher transportation costs, until rents from livestock and forestry are equal at X.
> * Summary for article: Pandemic shows deep vulnerabilities Third, we need to extinguish for good the ugly inequity of healthy food for the wealthy and junk food for poor people. And these programs can purchase all or nearly all of their food from local farmers using sustainable farming practices, to strengthen city-region links and support struggling farmers.
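A minimal sketch of the TF-IDF retrieval step described above, using the `TfidfVectorizer` and `linear_kernel` imports listed; the function name, `documents` input, and N parameter are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_n_documents(documents, query, n=10):
    """Return indexes of the N documents most similar to the query under TF-IDF."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(documents)   # D x V sparse matrix
    query_vec = vectorizer.transform([query])          # 1 x V vector
    # linear_kernel on L2-normalised TF-IDF rows equals cosine similarity
    scores = linear_kernel(query_vec, doc_matrix).flatten()
    return scores.argsort()[::-1][:n]
```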
## Latent Dirichlet Allocation

This method uses topic modelling to find the hidden topics in the documents and then compares their relative scores with the question.

* **Finding Topics**: The first step is to find the topics in the given set of documents. This is done by first creating a bag of words and then using the gensim library function `LdaMulticore`, providing parameters such as the number of topics, the dictionary, etc. The function yields the different topics in the corpus. For our use we chose 10 topics.
* **Relative Scoring of Topics**: In this step we score the topics based on their similarity with the question keywords and obtain the topic that best describes the question.
* **Per Document Topic Distribution**: We compute the per-document topic distribution of the dataset, which gives us a matrix where each row describes the relative proportion of each topic in a document.
* **Find the most similar documents**: Now that we have the topic that best describes the question, we select the ten documents with the highest proportion of that topic in the per-document topic distribution. These documents best describe the topic that best describes our question.

- LDA consumes a lot of computing time and power but gives good results.
- Scores generated for topics with respect to the question "What do we know about virus genetics, origin and its evolution?":

```python3
['real-time', 'evolution', 'tracking', 'genome', 'know', 'virus', 'origin', 'genetics']
Score: 0.8712934851646423
Topic: 0.014*"sequence" + 0.013*"sample" + 0.010*"patient" + 0.008*"gene" + 0.008*"result"
Score: 0.014302533119916916
Topic: 0.041*"cell" + 0.015*"gene" + 0.013*"mouse" + 0.010*"protein" + 0.010*"disease"
Score: 0.01430114172399044
Topic: 0.024*"cell" + 0.014*"antibody" + 0.012*"protein" + 0.009*"1" + 0.009*"mouse"
Score: 0.01430098619312048
Topic: 0.017*"patient" + 0.016*"disease" + 0.014*"health" + 0.013*"case" + 0.009*"outbreak"
Score: 0.014300837181508541
Topic: 0.042*"protein" + 0.021*"cell" + 0.011*"rna" + 0.009*"viral" + 0.008*"sequence"
Score: 0.014300500974059105
Topic: 0.020*"cell" + 0.013*"patient" + 0.012*"influenza" + 0.012*"calf" + 0.010*"group"
Score: 0.01430040318518877
Topic: 0.010*"data" + 0.009*"cell" + 0.009*"case" + 0.008*"bat" + 0.008*"model"
Score: 0.014300144277513027
Topic: 0.019*"patient" + 0.017*"disease" + 0.011*"vaccine" + 0.010*"clinical" + 0.009*"respiratory"
Score: 0.014300143346190453
Topic: 0.014*"cell" + 0.011*"protein" + 0.010*"sample" + 0.010*"patient" + 0.009*"viral"
Score: 0.014299794100224972
Topic: 0.051*"cell" + 0.016*"viral" + 0.013*"protein" + 0.010*"activity" + 0.010*"response"
```

- The per-document topic scores show the following distribution for six documents:

```python3
[[(1, 0.2968149), (4, 0.5133625), (7, 0.102317214), (9, 0.08552837)],
 [(0, 0.8750857), (2, 0.11611986)],
 [(6, 0.99899405)],
 [(3, 0.20245415), (9, 0.79017067)],
 [(0, 0.71342075), (7, 0.28082848)],
 [(3, 0.983916)]]
```
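A minimal sketch of the topic-modelling step with gensim's `LdaMulticore`, assuming `tokenized_docs` is a list of token lists produced by the pre-processing above; parameter values other than the 10 topics mentioned are illustrative.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

def fit_lda(tokenized_docs, num_topics=10):
    """Build a bag-of-words corpus and fit an LDA model with 10 topics."""
    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary, passes=2)
    return lda, dictionary, corpus

# Per-document topic distribution, e.g. [(1, 0.29), (4, 0.51), ...] per document:
# lda, dictionary, corpus = fit_lda(tokenized_docs)
# doc_topics = [lda.get_document_topics(bow) for bow in corpus]
```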
## Word as Vector

> **Idea**: If we represent words as vectors, then we can measure word similarity with a vector distance metric.

In this method we convert words to vectors in an *n*-dimensional space. This allows us to measure similarity and difference between words mathematically, based on the meaning of the words. For this project we use the GloVe vector mapping to convert words to vectors.

### GloVe Vectors

> [GloVe Vectors website](https://nlp.stanford.edu/projects/glove/)

GloVe stands for Global Vectors for Word Representation. It is an algorithm designed by a Stanford University team to find word similarities, trained on a corpus of Wikipedia articles containing nearly 6 billion tokens. The algorithm uses word-word co-occurrences to compute a similarity index for two words.

Let's take an example of two pairs of words: (*dog*, *puppy*) and (*dog*, *chair*).
* Many articles can be found in which *dog* and *puppy* co-occur. Hence *dog* and *puppy* will be relatively close to each other in their vector representations.
* On the other hand, *dog* and *chair* do not co-occur in many articles, so *dog* and *chair* will be far from each other.

If we check the cosine similarity of (*dog*, *puppy*) and (*dog*, *chair*), then
* cosine_sim (*dog*, *puppy*) = 0.7236376216894685
* cosine_sim (*dog*, *chair*) = 0.26111826072490374

we can observe that *dog* and *puppy* are much closer to each other than *dog* and *chair*.

### Document to Vector

We use a multistep sequential process to convert a document to a vector.

> #### Steps to make Document Vector
> 1. Tokenize and lemmatize the filtered body text to get a list of keywords.
> 2. Remove stopwords.
> 3. Keep keywords with at least 2 occurrences in the document.
> 4. Convert all selected keywords to vectors.
> 5. Take the weighted mean of these vectors over the document vocabulary to get the vector for the document.

### Question to Vector

We convert the question to a vector as well, using a similar process.

> #### Steps to make Query Vector
> 1. Tokenize and lemmatize the question to get a list of keywords.
> 2. Remove stopwords.
> 3. Convert all selected keywords to vectors.
> 4. Take the weighted mean of these vectors over the question vocabulary to get the vector for the question.
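A minimal sketch of the document-vector construction, assuming the GloVe vectors have been loaded into a dict `glove` mapping word to numpy array and that `keywords` lists each keyword once per occurrence (so the mean is frequency-weighted); names and the 100-dimensional default are illustrative.

```python
import numpy as np

def document_vector(keywords, glove, dim=100):
    """Weighted mean of GloVe vectors: keywords appear once per occurrence,
    so frequent keywords contribute proportionally more to the mean."""
    vectors = [glove[w] for w in keywords if w in glove]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# The query vector is built the same way from the question's keywords.
```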
## Cosine Similarity

> **Idea**: Use cosine similarity to measure the relevance of a document to the question.

We use the document vectors and the query vector to measure the similarity of each document with the question, with cosine similarity as the metric of closeness.

<center> <img src="https://render.githubusercontent.com/render/math?math=$CosineSimilarity(\vec{u}, \vec{v}) = \frac{\vec{u}.\vec{v}}{||\vec{u}||*||\vec{v}||}$" > </center> <br>

If the keywords of the question and the document do not match, we get vectors that are far apart, so their cosine similarity index is <img src="https://render.githubusercontent.com/render/math?math=$\approx 0$" >. On the other hand, if the keywords of the question and the document match, we get vectors that are close, so their cosine similarity index is <img src="https://render.githubusercontent.com/render/math?math=$\approx 1$" >.

Using this approach we compute the similarity index between the question and each document based on their vectors. Then we rank the documents by similarity score *(largest to smallest)* and show the top *k* documents with the highest cosine similarity.

## KNN Approach

> **Idea**: Use Euclidean distance to find the *k* nearest neighbours of the query vector.

We use the document vectors and the query vector to measure the closeness of each document to the question, with Euclidean distance as the metric.

<center> <img src="https://render.githubusercontent.com/render/math?math=$EuclideanDistance(\vec{u}, \vec{v}) = ||\vec{u} - \vec{v}||_2$" > </center> <br>

If the document and query vectors are far apart, their Euclidean distance is large; if they are close to each other, their Euclidean distance is small. Using this approach we rank the documents by the distance *(smallest to largest)* between the question and document vectors, and show the top *k* documents.
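A minimal sketch of both ranking rules applied to the document and query vectors built above, assuming `doc_vectors` is a (D, n) numpy array; the two functions differ only in the metric used for ordering.

```python
import numpy as np

def rank_by_cosine(doc_vectors, query_vec, k=5):
    """Rank documents by cosine similarity with the query (largest first)."""
    sims = (doc_vectors @ query_vec) / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(-sims)[:k]

def rank_by_knn(doc_vectors, query_vec, k=5):
    """Rank documents by Euclidean distance to the query (smallest first)."""
    dists = np.linalg.norm(doc_vectors - query_vec, axis=1)
    return np.argsort(dists)[:k]
```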
## 1D CNN

*A 1D Convolutional Neural Network works well for text classification, so we use that idea here to get a document embedding from the output of its Average Pooling layer. The Average Pooling output is a low-dimensional feature representation of the document used for classifying its cluster. A cluster contains similar documents, so this output can act as a document embedding when we are searching for a particular article. A 1D CNN should work better than plain averaging of a document's word vectors because it considers all possible phrases of the filter size with the given stride parameter.*

- Load the data into a DataFrame.
- Work on a slice of the DataFrame for the experiment.
- Clean the body text.
- Filter articles which are in English and remove those which are likely noise.
- Load the pre-trained GloVe vectors.
- Apply K-Means clustering to get labels.
- Train a simple 1D convolutional network to predict the cluster.
- Take the output of the Average Pooling layer of the 1D CNN as the embedding for the document.
- Get the norm of each document embedding and store it in a vector.
- Get the keywords from the question after tokenizing the question.
- Get the keyword embedding using the Average Pooling layer output of the 1D CNN.
- Get the norm of the keyword embedding.
- Compute the cosine similarity using matrix-vector multiplication and the norms of the document and keyword embeddings.
- Get the top 5 articles that have the highest cosine similarity with the keywords of the question.
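A minimal sketch of such a network and the embedding extraction, written with Keras as an assumption (the report does not name the framework); the sequence length, filter count, and layer names are illustrative, not the exact architecture used.

```python
import tensorflow as tf

def build_cnn(seq_len=500, glove_dim=100, num_clusters=10):
    """1D CNN that predicts a document's K-Means cluster; the pooling layer
    output doubles as the document embedding."""
    inputs = tf.keras.Input(shape=(seq_len, glove_dim))   # GloVe vector per token
    x = tf.keras.layers.Conv1D(128, kernel_size=5, activation="relu")(inputs)
    pooled = tf.keras.layers.GlobalAveragePooling1D(name="avg_pool")(x)
    outputs = tf.keras.layers.Dense(num_clusters, activation="softmax")(pooled)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# After training on the K-Means labels, reuse the pooling output as the embedding:
# embedder = tf.keras.Model(model.input, model.get_layer("avg_pool").output)
# doc_embeddings = embedder.predict(doc_sequences)
```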
## Text Summarisation

> **Idea**: We rank the sentences according to a score and take the top 8 sentences as our summary.

- The first step is to get the list of top-ranked documents.
- We then tokenize each document into a list of sentences.
- From the formatted documents we generate a list of all words, removing stopwords.
- We then measure the weighted frequency of all remaining words.
- We use these word weights to calculate a score for each sentence (see the sketch at the end of this section): if a sentence contains a word, we add that word's weighted frequency to the sentence's score.
- Each sentence thus gets a score based on the method above.
- Finally, we take the 8 sentences with the highest scores; these sentences form our summary.

> **Example Run**

Article: "Arctic animals on land include small plant-eaters like ground squirrels, hares, lemmings, and voles; large plant-eaters like moose, caribou/reindeer, and musk ox; and meat-eaters like weasels, wolverine, wolf, fox, bear, and birds of prey. Climate-related changes are likely to cause cascading impacts involving many species of plants and animals. Compared to ecosystems in warmer regions, arctic systems generally have fewer species filling similar roles. Thus when arctic species are displaced, there can be important implications for species that depend upon them. For example, mosses and lichens are particularly vulnerable to warming. Because these plants form the base of important food chains, providing primary winter food sources for reindeer/caribou and other species, their decline will have far-reaching impacts throughout the ecosystem. A decline in reindeer and caribou populations will affect species that hunt them (including wolves, wolverines, and people) as well as species that scavenge on them (such as arctic foxes and various birds)....

> **Weighted Word Frequencies**

- species 1.0
- populations 0.7777777777777778
- lemmings 0.6666666666666666
- arctic 0.6666666666666666
- decline 0.6666666666666666
- animals 0.5555555555555556
- snow 0.5555555555555556
- ground 0.4444444444444444
- voles 0.4444444444444444
- plants 0.4444444444444444
- predators 0.4444444444444444
- land 0.3333333333333333
- birds 0.3333333333333333
- prey 0.3333333333333333
- impacts 0.3333333333333333
- particularly 0.3333333333333333
- food 0.3333333333333333
- winter 0.3333333333333333
- ice 0.3333333333333333
- result 0.3333333333333333
- musk 0.2222222222222222
- ox 0.2222222222222222
- weasels 0.2222222222222222
- fox 0.2222222222222222
- ...

We can see that the words most relevant to the document appear most frequently, so they carry more weight than others, which helps us find the most important sentences in the document.

> **Sentence Scores**

- Ice crust formation resulting from freeze-thaw events affects most arctic land animals by encapsulating their food plants in ice, severely limiting forage availability and sometimes killing the plants. [5.333333333333332]
- Thus, a decline in lemmings can also indirectly result in a decline in populations of other prey species such as waders and other birds. [5.222222222222221]
- A decline in lemming populations would be very likely to result in an even stronger decline in populations of these specialist predators. [4.555555555555555]
- Because these plants form the base of important food chains, providing primary winter food sources for reindeer/caribou and other species, their decline will have far-reaching impacts throughout the ecosystem. [4.5555555555555545]
- More generalist predators, such as the arctic fox, switch to other prey species when lemming populations are low. [4.0]
- In winter, lemmings and voles live and forage in the space between the frozen ground of the tundra and the snow, almost never appearing on the surface. [3.666666666666667]
- Thus when arctic species are displaced, there can be important implications for species that depend upon them. [3.3333333333333335]
- Climate-related changes are likely to cause cascading impacts involving many species of plants and animals. [3.333333333333333]
- Compared to ecosystems in warmer regions, arctic systems generally have fewer species filling similar roles. [2.666666666666667]
- ...

> **Summary Generated**

Ice crust formation resulting from freeze-thaw events affects most arctic land animals by encapsulating their food plants in ice, severely limiting forage availability and sometimes killing the plants. Thus, a decline in lemmings can also indirectly result in a decline in populations of other prey species such as waders and other birds. A decline in lemming populations would be very likely to result in an even stronger decline in populations of these specialist predators. Because these plants form the base of important food chains, providing primary winter food sources for reindeer/caribou and other species, their decline will have far-reaching impacts throughout the ecosystem. More generalist predators, such as the arctic fox, switch to other prey species when lemming populations are low. In winter, lemmings and voles live and forage in the space between the frozen ground of the tundra and the snow, almost never appearing on the surface. Thus when arctic species are displaced, there can be important implications for species that depend upon them. Climate-related changes are likely to cause cascading impacts involving many species of plants and animals.
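A minimal sketch of the frequency-based sentence scoring described above, using nltk for tokenization and assuming the weights are counts normalised by the maximum count (consistent with the most frequent word getting weight 1.0 in the example); the function name is illustrative.

```python
from collections import Counter
import nltk  # requires: nltk.download('punkt')

def summarize(text, stopwords, top_n=8):
    """Score sentences by the summed weighted frequency of their words."""
    sentences = nltk.sent_tokenize(text)
    words = [w for w in nltk.word_tokenize(text.lower())
             if w.isalnum() and w not in stopwords]
    counts = Counter(words)
    max_count = max(counts.values())
    weights = {w: c / max_count for w, c in counts.items()}  # most frequent word -> 1.0

    def score(sentence):
        return sum(weights.get(w, 0.0) for w in nltk.word_tokenize(sentence.lower()))

    ranked = sorted(sentences, key=score, reverse=True)[:top_n]
    # keep the selected sentences in their original order for readability
    return " ".join(s for s in sentences if s in ranked)
```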
## Analysis

We analysed our models on a small subset of the dataset consisting of around 10,000 papers, due to limited time and computing resources: most of these methods consume a lot of time and computational power.

### TF-IDF

#### Pros
1. Uses a better technique to remove stop words: it down-weights words that are common to all documents, since they play no role in deciding document similarity, rather than just fixing a stopword list up front and removing those words.
2. Estimates the importance of a word to a document.

#### Cons
1. As the number of documents and the vocabulary grow, the computational complexity of the TF-IDF matrix becomes very high.
2. The matrix is usually sparse and there is also noise involved.
3. This method does not take into account the statistical view that there is a distribution of words in the corpus.

### Word to Vectors (Cosine Similarity & KNN Approach)

#### Pros
1. Finds topic relevance based on the meaning of words, not just the number of occurrences.

#### Cons
1. Computationally heavy, as it needs to calculate a norm for each document.
2. Takes time when the number of documents gets larger (currently >100k).
3. Takes time to calculate document vectors, as GloVe only has vectors for words; we therefore need to average over all words.

Both the TF-IDF and word-vector approaches use cosine similarity to rank documents, but word vectors work better than TF-IDF. This is because TF-IDF only takes the number of occurrences of a word into account to give a score, whereas word vectors capture the meaning of words. The word-vector method is also less computationally expensive than TF-IDF because the matrix size in TF-IDF is <img src="https://render.githubusercontent.com/render/math?math=$D * V$" > but in word vectors it is <img src="https://render.githubusercontent.com/render/math?math=$D * n$" >, where <img src="https://render.githubusercontent.com/render/math?math=$D$" > is the number of documents, <img src="https://render.githubusercontent.com/render/math?math=$V$" > is the size of the vocabulary, and <img src="https://render.githubusercontent.com/render/math?math=$n$" > is a constant we choose. In the COVID-19 dataset, <img src="https://render.githubusercontent.com/render/math?math=$D \approx 100,000$" >, <img src="https://render.githubusercontent.com/render/math?math=$V \approx 300,000$" >, and <img src="https://render.githubusercontent.com/render/math?math=$n = 100$" >. So, since <img src="https://render.githubusercontent.com/render/math?math=$n \ll V$" >, the word-vector approach takes much less time to run than TF-IDF.

### 1D CNN

#### Pros
1. Considers word vectors for every *h*-word phrase, where *h* is the filter size, so it captures more detail than plain word averaging.
2. With multiple *channels* (filter sizes), it can capture very good details of the sentences in a document.
3. Less computationally expensive than *LSTM* and *RNN* models.

#### Cons
1. Does not consider word order across very long sentences.
2. Not very linguistic.

### Sentence Transformers (BERT)

#### Pros
1. Finds topic relevance semantically.
2. Based on the state-of-the-art BERT.

#### Cons
1. Computationally heavy, as it needs to calculate a norm for each document.
2. Takes a lot of time when the number of documents gets larger.
3. Only the first 768 tokens are used for every sentence.
4. Not multilingual; works only for English sentences.

### LDA

- There are many benefits to using LDA. It is a soft grouping method, meaning any document can belong to any number of topics, so the relationships between words and documents are captured more fairly than with other methods. It takes synonymy, polysemy and context into account.
- By changing the parameters of the LDA function, we can observe changes in the generated topics and arrive at the parameters that best fit our model. We experimented with the number of topics and found that increasing it gives a better description of our dataset, but the overlap between topics also increases.
- Decreasing the number of dictionary words in the corpus dramatically decreases the computation time of the model.

#### Pros
1. Finds topic relationships between different documents.
2. LDA is a probabilistic model capable of expressing uncertainty about the placement of topics across texts and the assignment of words to topics.
3. LDA is good at identifying coherent topics.

#### Cons
1. Computationally expensive.
2. Cannot be used for hard clustering of documents into topics.

## Experimental Results Comparison

The highest-ranked research papers for the question "What do we know about virus genetics, origin, and evolution? What do we know about Real-time tracking of whole genomes?" found by the different methods were:

- GloVe Averaging:
  - Quasispecies and the implications for virus persistence and escape
- KNN:
  - Functional genomics as a tool in virus research
- LDA:
  - Virus survey in populations of two subspecies of bent-winged bats (Miniopterus orianae bassanii and oceanensis) in south-eastern Australia reveals a high prevalence of diverse herpesviruses
- Transformers:
  - Thermal-induced unfolding-refolding of a nucleocapsid COVN protein
- 1D-CNN:
  - Sequencing the whole genome of infected human cells obtained from diseased patients

## Related Work

- Kim, Convolutional Neural Networks for Sentence Classification, 2014
  > A series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks.
- MaksimEkin, Notebook on COVID-19 Literature Clustering
  > Used a TF-IDF vectorizer and PCA to find clusters of documents using K-Means clustering.
- Mohit Iyyer et al., Deep Unordered Composition Rivals Syntactic Methods for Text Classification, 2015
  > This model, the deep averaging network (DAN), works in three simple steps:
  > - take the vector average of the embeddings associated with an input sequence of tokens
  > - pass that average through one or more feedforward layers
  > - perform (linear) classification on the final layer's representation

## Conclusion

Different methods taught in the course were applied, and on careful analysis the pros and cons of each were studied. Some methods are principled techniques while others are statistics-based. All of them were helpful in retrieving information from the raw corpus and giving us relevant answers to the corresponding questions. In terms of comparison, we cannot state that one specific model is the solution to every problem; each method has its own merits, which are unique to the problem at hand.

## References

- Mohit Iyyer et al., Deep Unordered Composition Rivals Syntactic Methods for Text Classification, 2015
- Kim, Convolutional Neural Networks for Sentence Classification, 2014
- Jeffrey Pennington et al., GloVe: Global Vectors for Word Representation, 2014
- Christopher Manning, Representations for Language: From Word Embeddings to Sentence Meanings, Stanford University, 2017
- https://www.kaggle.com/maksimeren/covid-19-literature-clustering
- https://www.kaggle.com/ivanegapratama/covid-eda-initial-exploration-tool
- https://towardsdatascience.com/comparing-text-summarization-techniques-d1e2e465584e
