# SDG-ACE Hack Group 1

---

### Group members

Jason Verrall, Hayden B, Jaspal Panesar, Zhengyang Jin

---

### Introduction

Two different approaches: NLP and an ML model

---

### NLP Approach

* Consolidate keywords from various sources
* Find keyword matches in journal titles
* Expert-identified 'known good' journals are separated out
* From the 'known good' journals, a Bag of Words (BoW) is created
* Articles from title-matched journals are searched using the top BoW terms
* The resulting journals are returned, ranked by similarity

---

### ML Approach

* Supervised
  * KNeighborsClassifier for classification
  * SVD
* Unsupervised
  * Using expert knowledge and word embeddings
  * Cosine similarity between document and category

---

### NLP Tools 1

#### Tokenization

Each document is tokenised at the word level. Punctuation and stop words are removed, and all text is lower-cased.

---

#### Lemmatization

This extracts the root form of most words.

All tokens in the processed corpus are then counted to find the most common ones, and uninformative words such as "development", "country", "report" and "also" are deleted.

---

### NLP Tools 2

#### TF-IDF

TF-IDF weights the count of each word in a document by the inverse of that word's frequency across all documents. This weakens high-frequency words and strengthens low-frequency ones, so that every word receives a quantitative score.

#### Cosine Similarity

Cosine similarity measures the angle between two term vectors: 1 for identical direction, 0 for no overlap.

---

### Why journal titles

* Previous approaches classified the entire corpus based on article content
* There is value in letting researchers select potentially relevant journals for a field of interest
* Complements a whole-corpus approach and doubles as a quick knowledge-discovery tool

---

### Possible uses

* A researcher who wants to identify potentially relevant journals for their field, e.g. to receive ToCs as they are published for situational awareness
* Expanding the scope of journal content, e.g. going from region-specific publications to global or other regional publications
* Training expert systems

---

### Worked example

Our user is an energy industry analyst/researcher

* Journal titles in dataset = **12,912**
* SDG7 keywords consolidated from Bergen, Elsevier, Sirius etc. = **511**
* Tokenised journal titles matched to keywords = **71**
* Top 5 most frequent words in the 71 journals = **8,404**

---

List of 6 journal titles provided by the user:

* 'Energy Policy', 'Energy Research & Social Science', 'Journal Of Cleaner Production', 'Nature Climate Change', 'Renewable Energy', 'Solar Energy'
* All 6 are in the list of 71 'possibly relevant' journals
* Remaining possibly relevant journals = **65**

---

### Simple term frequency

Top words used in the 6 expert journals = 5,191

Use **cosine similarity** to match the group of 6 expert journals against the most frequent terms in the abstracts of each of the remaining 65 journals

---

![](https://i.imgur.com/TGo2qxs.png)

So the results aren't great! The mean H index for the top 5 journals here is **64.8** (https://www.scimagojr.com/journalrank.php)

---

Use cosine similarity to match as before, but based on the **TF-IDF score** for each term

![](https://i.imgur.com/Te3AGp0.png)

This looks better even to a non-expert; the mean H index for the top 5 journals is **88.2**

---

## Classifying

### Methods

#### Supervised learning

First we used the labelled data of the extended dataset for supervised learning.

We tried the KNeighborsClassifier and SVD techniques.

---

##### Test data preprocessing

There is a lot of defective data in the csv file: 34,492 records lack an abstract. For these we used the article title as the default value.

---

We counted the journals: there are 12,912 in total, which is too many and will hurt classification accuracy.
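---

As an illustration of the supervised route, a minimal scikit-learn sketch of classifying titles with KNeighborsClassifier. The titles, labels and `k=3` are invented stand-ins for the real dataset, and TF-IDF features stand in for the binary matrix used in the hack:

```python
# Illustrative sketch: vectorise article titles and classify them against
# SDG labels with scikit-learn's KNeighborsClassifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

titles = [
    "Solar photovoltaic adoption in rural communities",
    "Wind turbine efficiency under variable loads",
    "Groundwater contamination and public health",
    "Sanitation infrastructure in informal settlements",
] * 10  # repeated so each class has enough neighbours
labels = ["SDG7", "SDG7", "SDG6", "SDG6"] * 10

# Sparse document-term matrix of TF-IDF scores
X = TfidfVectorizer().fit_transform(titles)

# 80% training / 20% test, as in the hack
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

On real data the class imbalance and 12,912-journal label space make this much harder than the toy run suggests.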
---

##### Article titles

![](https://github.com/BlinkingStalker/SDG-ACE-2020/blob/master/graph/output_18_0.png?raw=true)

##### Labeled data

![](https://github.com/BlinkingStalker/SDG-ACE-2020/raw/master/graph/output_21_0.png)

---

#### Cleaning the training data

As the data is not balanced, it needed further processing: classes that were particularly large or small were removed. And since the SDGs were only introduced in recent years, older papers were excluded.

---

### Applying the classification model

We decided on multi-class training so that the model has more classification capability. We converted the file into a binary matrix and used the KNeighborsClassifier from sklearn directly as the training model, splitting the dataset into 20% test data and 80% training data. The final results are as follows:

![](https://github.com/BlinkingStalker/SDG-ACE-2020/blob/master/graph/OUtput212.png?raw=true)

*Thanks to the NESTA Team for sharing code*

---

## Unsupervised Learning

Manually labelling data is often very expensive. We need unsupervised ways to solve the problem, which frees up a lot of resources for other tasks.

The main idea:

![](https://github.com/BlinkingStalker/SDG-ACE-2020/raw/master/graph/unspvzd_model.png)

*Thanks to the NESTA Team for sharing code*

---

#### 3.2.1 Data cleaning

Essentially the same as the preprocessing described earlier.

---

#### 3.2.2 Enrichment

This step is carried out per label; its main purpose is to expand each category's thesaurus, through four specific methods:

1. Use experts or search engines to provide 3-5 representative words for each category;
2. Use WordNet to add synonyms of the words found in the previous step to the thesaurus;
3. Use the existing category thesaurus to find representative documents for each category (threshold 70%), and add the words in those documents to the category thesaurus;
4. Use word embeddings to find similar words to add to the thesaurus.

NB: the words found in each step must appear in the documents.

---

#### 3.2.3 Consolidation

Consolidation filters out the less discriminative words in the category thesaurus produced by the enrichment step, keeping only high-quality words. The filtering criterion is the following formula:

![](https://i.imgur.com/kf5mkaS.png)

TF(w,c) is the frequency of word w in category c, the right-hand side of the numerator is the average frequency of word w across all categories, and the denominator is the variance of word w's frequency across the categories other than c. When FAC(w,c) is below a certain threshold, the word w is deleted from the category.

---

#### 3.2.4 Similarity

The last step is to calculate the cosine similarity between document d and category l. For vectorisation, LSA is used: singular value decomposition of the word-document and word-label matrices generates their respective latent semantic spaces, and the resulting vectors are compared by cosine similarity.

---

## Summary

---

### Next steps

* Journal publication quality
* Use of language, e.g. differences in term usage between different professions

___
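---

For reference, the consolidation criterion from §3.2.3 can be transcribed from its textual description as follows (my notation, a sketch to be checked against the pictured formula):

$$
\mathrm{FAC}(w,c)=\frac{\mathrm{TF}(w,c)-\dfrac{1}{|C|}\sum_{c'\in C}\mathrm{TF}(w,c')}{\operatorname{Var}_{c'\neq c}\bigl(\mathrm{TF}(w,c')\bigr)}
$$

where $C$ is the set of all categories; words with $\mathrm{FAC}(w,c)$ below the threshold are removed from category $c$.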
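---

The document-to-category matching of §3.2.4 can be sketched with scikit-learn, projecting documents and category thesauri into a shared LSA space via truncated SVD and ranking categories by cosine similarity. The documents, thesauri and `n_components=2` below are invented for illustration:

```python
# Illustrative sketch of step 3.2.4: embed documents and category term
# lists in one LSA space, then rank categories per document by cosine
# similarity.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Affordable solar power for off-grid villages",
    "Chlorination and safe drinking water supply",
]
# Each category is represented by its (enriched, consolidated) thesaurus.
categories = {
    "SDG7": "energy solar wind renewable electricity power",
    "SDG6": "water sanitation drinking hygiene wastewater",
}

vec = TfidfVectorizer()
X = vec.fit_transform(documents + list(categories.values()))

# LSA: truncated SVD projects documents and labels into a low-rank
# latent semantic space.
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)

doc_vecs, cat_vecs = Z[: len(documents)], Z[len(documents):]
sims = cosine_similarity(doc_vecs, cat_vecs)  # rows: docs, cols: categories

for doc, row in zip(documents, sims):
    best = max(zip(categories, row), key=lambda t: t[1])[0]
    print(f"{doc!r} -> {best}")
```

In the real pipeline the word-document and word-label matrices are far larger and the number of latent dimensions is a tuning choice.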