# Final Report, Notes

## Title

Natural language classifier based on Reddit posts for information retrieval

## Abstract (tarek)

This work aims to interpret user input by deriving benefit from textual clues embedded in it, using datasets collected from many thousands of Reddit posts in three subreddits: `politics`, `datascience` and `listentothis`. We also discuss the implications of the classifier chosen for training on these datasets and why it was optimal for this work in comparison with other classifiers. **(Is this how research questions can be approached?)**

## Introduction (tarek)

Understanding how we communicate is not an easy task, and the amount of textual information available on the internet has grown exponentially. Natural Language Processing has not yet been wholly perfected and faces many challenges; it is also challenged by the fact that language, and the way it is used, is constantly changing. Consider being able to interact with a machine that can *somewhat* interpret the input and categorize the subject at hand. This work approaches that core idea.

## Background (Kevin)

## Method

### Collection (mosh)

The data collection was done with the Python library `praw` (reference here), which allows access to the data through the Reddit API for a registered user. The collected dataset consists of nearly 28,600 Reddit posts, scraped from three subreddits: `politics`, `datascience` and `listentothis`. Each post has its title, selftext, top-level comments and the related subreddit. Because Reddit is essentially a link aggregator, it is quite rare to find selftext in many posts. That is why the top-level comments were also taken into consideration when shaping the attributes of the dataset. It also appeared that the posts often contained HTML tags, URLs and other trailing characters (not to mention stopwords and other components needing NLP preprocessing).
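The cleaning step for this kind of text could be sketched as follows. This is an illustrative assumption, not the project's actual code: the function name `clean_text`, the regex patterns, and the tiny stopword list are all placeholders (a real run would use a fuller stopword set, e.g. from NLTK, and feed in posts fetched via `praw`).

```python
import re

# Illustrative stopword list; the project would presumably use a fuller
# set (e.g. from NLTK). All names and patterns here are assumptions.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "it", "to", "of"}

def clean_text(raw: str) -> str:
    """Strip HTML tags, URLs, non-letter characters and stopwords."""
    text = re.sub(r"<[^>]+>", " ", raw)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remaining non-letter characters
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)

# Hypothetical collection sketch (requires praw and registered API credentials):
#   import praw
#   reddit = praw.Reddit(client_id=..., client_secret=..., user_agent=...)
#   for post in reddit.subreddit("politics").hot(limit=100):
#       print(clean_text(post.title + " " + post.selftext))

print(clean_text("Check <b>this</b> out: https://example.com"))
# -> "check this out"
```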
So we cleaned the selftext and top-level comments with the Python module `re`, which stands for regular expressions. Furthermore, as we went on evaluating different models, we found no extra benefit in keeping selftext as a separate attribute in our dataset, so we joined it with the top-level comments under the column `text`.

- [x] ToDo: Distribution in subreddits

![](https://i.imgur.com/kEZeQvC.png)

### Model, Data (Niko)

- Google Docs link to my text: https://docs.google.com/document/d/1LiN7AXg5syasaavoFc-aMau4EJ6Qd8wMLdmpnHwudSE/edit?usp=sharing
- Modelling approach: mixture between experimentation and reasoning -> using the model that was most promising based on scoring results.
- Introduction of the models we used.
- Basically an overview of how we worked with the models and what our goal was.
- Ethical considerations here?

## Results (Niko)

Google Docs link to my text: https://docs.google.com/document/d/1LiN7AXg5syasaavoFc-aMau4EJ6Qd8wMLdmpnHwudSE/edit?usp=sharing

#### First: Results from presentation

- See final presentation

#### Then: New insights

- Imbalanced dataset
- Small number of datascience entries
- Datascience is often misclassified as politics
- Accuracy is probably misleading because of the imbalance: politics has the highest accuracy but also the highest number of entries
- Future: a better-balanced dataset. Also: a different subreddit? Datascience seems not very talkative...

### Discussion (Niko, Mosh, Kevin, Tarek)

<!-- List your questions here -->

- Ethical considerations here?

## Conclusion ()