# MUSA 550 Final Project: Karma Mine
## Guiding Questions
- are there superusers? and are they mostly robots or real people?
- what do front page superusers have in common? (k-means?)
- what do front page posts have in common?
- does reddit's formula for front page content promote more posts from robots than from humans?
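
A first pass at the superuser question could simply count how often each author appears among collected front page posts. The sketch below uses made-up author names and a hypothetical cutoff (`THRESHOLD`); it assumes each collected post carries an `author` field, which we still need to confirm in the data we pull.

```python
from collections import Counter

# Made-up author names standing in for the `author` field of collected
# front page posts (assumption: each post exposes its author).
front_page_authors = ["user_a", "user_b", "user_a", "user_c", "user_a", "user_b"]

# Count front page appearances per author.
posts_per_author = Counter(front_page_authors)

# Hypothetical cutoff: anyone with at least this many front page posts
# is treated as a "superuser" for the purposes of this sketch.
THRESHOLD = 3
superusers = sorted(
    author for author, count in posts_per_author.items() if count >= THRESHOLD
)

print(posts_per_author.most_common())
print(superusers)
```

The cutoff is arbitrary here; on real data it would make more sense to look at the distribution of counts before picking one.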
## Things We'll Need to Know
- there is an equation that gets things on the front page - what is it?
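
One reference point for this: the "hot" ranking function from Reddit's formerly open-source code base. The sketch below reproduces that historical formula in plain Python. Whether the current front page still uses it is an assumption we would need to verify, and the function and constant names are our own.

```python
from math import log10

# Offset subtracted from the post's creation time (Unix seconds) in the
# historical open-source implementation.
REDDIT_EPOCH_OFFSET = 1134028003


def hot(ups, downs, created_utc):
    """Historical 'hot' score: log-scaled net votes plus a time bonus."""
    score = ups - downs
    # Votes count logarithmically: 10 net votes add 1, 100 add 2, and so on.
    order = log10(max(abs(score), 1))
    sign = 1 if score > 0 else -1 if score < 0 else 0
    # Newer posts score higher: every 45000 seconds (12.5 hours) adds 1.
    seconds = created_utc - REDDIT_EPOCH_OFFSET
    return round(sign * order + seconds / 45000, 7)


# A post with 10x the net votes ranks like a post that is 12.5 hours newer.
print(hot(100, 0, 1600000000) - hot(10, 0, 1600000000))
print(hot(10, 0, 1600045000) - hot(10, 0, 1600000000))
```

If this formula still applies, accounts that post early and collect votes quickly have a built-in advantage, which is worth keeping in mind for the bot question.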
## Requirements (At least 3) - Are they possible with this?
- Data is collected through a means more sophisticated than downloading
- YES, through the Reddit API
- At least one of the datasets contains more than 1,000,000 rows
- YES, there are 130,000 active subreddits, 303 million posts in 2020, and 430 million active users, so the chance of reaching 1 million rows is high
- It combines data collected from 3 or more different sources
- Source 1: Reddit API. Source 2: a dataset showing the countries that post the most, which we could link to the countries that have the most bots??
- The analysis of the data is reasonably complex, involving multiple steps (geospatial joins/operations, data shaping, data frame operations, etc)
- YES, we will have data shaping and data frame operations
- You use one of the analysis techniques for urban street networks (e.g., osmnx, pandana), clustering (e.g., scikit-learn), or raster datasets
- YES, k-means, clustering with scikit-learn.
- You perform a machine learning analysis with scikit-learn as part of the analysis.
- YES
- The webpage includes a significant interactive component (cross-filtering, interactive widgets, etc)
- YES, we can add this in the webpage, adding in widgets?
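
For the clustering requirement, a minimal k-means sketch with scikit-learn might look like the following. The two per-user features and all of the data are made up for illustration; the real features would come from whatever we end up collecting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up features per user (assumptions, not real data):
# column 0 = front page posts per week, column 1 = mean OP comments per post.
casual = rng.normal(loc=[1.0, 5.0], scale=0.5, size=(50, 2))
heavy = rng.normal(loc=[40.0, 0.5], scale=0.5, size=(50, 2))
X = np.vstack([casual, heavy])

# Scale features so neither column dominates the distance calculation.
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means with two clusters and get one cluster label per user.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

print(np.bincount(labels))
```

On real data the number of clusters is not known in advance, so `n_clusters=2` is only a placeholder.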
## Notes
- parsing the comments for accusations that op is a robot
- sentiment analysis on the posts
- connection to big news stories (another data set)
- looking for clickbait words in titles
- can we classify posts as clickbait
- sentiment analysis on comments
- pulling past posts to identify reposts?
- after and before
- news effect on sentiment of posts: average sentiment of news vs. average sentiment on r/all
- words in common?
- are some subreddits more likely than others to end up on the front page?
- before we get into robots (in the pres)
- the bot narrative:
- are there a lot of op comments in the posts?
- op comment amount and content might tell us if robot or not
- complexity of sentiment in comments
- similar posts but one on the front page and the other not -- what's the difference?
- can identify hot posts and then query for the same content elsewhere
- might help with robot id
- are bots giving bots awards? (`'awarders':post['data']['awarders']`)
- `'is_original_content':post['data']['is_original_content']`
- `'created_utc':post['data']['created_utc']`
- `'num_crossposts':post['data']['num_crossposts']`
- `'removal_reason':post['data']['removal_reason']`
- `'is_crosspostable':post['data']['is_crosspostable']`
- links to news sites - which ones? left leaning, right leaning? if there are news posts -
- Other things we want to know what they are:
- `'domain':post['data']['domain'],`
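
The `post['data'][...]` lookups noted above can be gathered into one helper that turns a listing into flat rows. This is a sketch under assumptions: it presumes the response is a listing with posts under `data` → `children`, the helper name `rows_from_listing` is our own, and the sample listing is made up.

```python
# Fields from the notes above that we want to keep for each post.
FIELDS = [
    "awarders",
    "is_original_content",
    "created_utc",
    "num_crossposts",
    "removal_reason",
    "is_crosspostable",
    "domain",
]


def rows_from_listing(listing):
    """Return one flat dict per post, keeping only the fields in FIELDS."""
    rows = []
    for post in listing["data"]["children"]:
        data = post["data"]
        # .get() leaves a field as None when a post does not include it.
        rows.append({field: data.get(field) for field in FIELDS})
    return rows


# Made-up listing with a single post, shaped like the assumed response.
sample_listing = {
    "data": {
        "children": [
            {
                "data": {
                    "awarders": [],
                    "is_original_content": False,
                    "created_utc": 1600000000.0,
                    "num_crossposts": 2,
                    "removal_reason": None,
                    "is_crosspostable": True,
                    "domain": "example.com",
                    "title": "not kept",
                }
            }
        ]
    }
}

rows = rows_from_listing(sample_listing)
print(rows)
```

A list of flat dicts like this can be passed straight to a data frame constructor for the shaping steps later on.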