Class: MAT381E
Lecturer: Atabey Kaygun
Authors: Onur Özkan (@Onurzkan) & Furkan Pehlivan (@furkanpehli1)
Sentiment analysis can help us decipher the mood and emotions of the general public and gather insightful information about a given context. It is the process of analyzing data and classifying it according to the needs of the research.
We want to perform a sentiment analysis of an international scandal on Twitter using the Twitter API. Since the latest event on the agenda all over the world is Will Smith's behavior at the Oscars, we decided to narrow our scope and analyze this topic.
Will Smith has always had a good public image. The latest scandal changed that, which makes our subject both suitable for sentiment analysis and interesting. We would like to see the public's reaction on Twitter to an international scandal.
Because the Twitter API is not openly available, we had to apply for access. In this section we discuss how we applied.
First, we signed up on Twitter's official developer website, filled out some forms, and provided information about why we need the Twitter API.
After our application was reviewed, Twitter sent us an email requesting more detailed information about our app, as you can see below.
Finally, they granted us access to the Developer Platform, where we created our app. Once all of that was done, we were ready to go!
In Python, a library is a collection of related modules: bundles of code that can be reused across different programs. Briefly, libraries in Python (and in programming in general) make our lives much easier. In our project we used 11 different libraries, but we want to emphasize the 3 main libraries that were essential:
TextBlob is a Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks. In our project we used TextBlob for sentiment analysis.
Tweepy is an open-source, easy-to-use Python library for accessing the Twitter API. To work with the Twitter API, we chose to use Tweepy.
The deep_translator library is a flexible, free, and unlimited tool for translating between different languages in a simple way using multiple translators.
This library helps make our analysis more accurate, because TextBlob cannot analyze languages other than English. So we used deep_translator to translate the tweets that are not in English.
As we discussed in the first chapter, the Twitter API is not openly available, so we need an authenticated user to access it.
First we created a JSON file for our credentials, read it, and transformed it into a pandas DataFrame. Then we assigned our credentials to variables, as you can see in the code block below. Because we used Twitter API v2, we used Tweepy's OAuth 1.0a User Context method.
After assigning the credential values to variables, we passed them to Tweepy's OAuth1UserHandler function and got authenticated. Then, to get an API token, we passed the auth variable into a variable named api. We are going to use this API object in future steps.
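The steps above could be sketched roughly as follows. This is a minimal sketch, not our exact code: the credential file layout and key names are assumptions, and tweepy is imported lazily inside the function so the sketch stays self-contained.

```python
import json

# Hypothetical sketch of the authentication step described above.
# The key names in credentials.json are assumptions.
def buildApi(path="credentials.json"):
    import tweepy  # imported lazily; assumes tweepy is installed
    with open(path) as f:
        creds = json.load(f)
    # OAuth 1.0a User Context authentication
    auth = tweepy.OAuth1UserHandler(
        creds["api_key"],
        creds["api_key_secret"],
        creds["access_token"],
        creds["access_token_secret"],
    )
    return tweepy.API(auth)  # the `api` object used in later steps
```

Calling buildApi() would return the authenticated api object that the rest of the project relies on.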
We declared 5 functions, named retrieveTweets, tweetCleaner, sentimentAnalysis, pieChart, and lineChart, for different purposes.
retrieveTweets
The first function, retrieveTweets, helps us retrieve tweet data within a given timeline. This function needs two arguments.
To get the data we used Tweepy's Cursor class, which basically handles pagination for API methods.
We provided different parameters to that class:
- api.search_30_day defines our token and search method; to retrieve tweet data older than 30 days we used the full-archive method.
- The query parameter lets us query for specific tweets. To retrieve accurate data, we used hashtags.
- The fromDate and toDate parameters let us declare the timeline for which we want to retrieve data.
After that, we wanted to limit our retrieving function, because even though we have access to the Twitter API, it is restricted access. So we limited the number of tweets retrieved each time the function is triggered.
Finally, we converted the tweet data into a Python list and returned it.
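The steps above could be sketched as follows. This is a hypothetical sketch, not our exact code: the hashtag, the limit default, and the "dev" environment label are assumptions, and an authenticated api object from the previous chapter is assumed to be in scope. tweepy is imported lazily so the sketch stays self-contained.

```python
# Hypothetical sketch of the retrieveTweets function described above.
def retrieveTweets(fromDate, toDate, query="#WillSmith", limit=100):
    import tweepy  # assumes an authenticated `api` object exists in scope
    tweets = []
    # Cursor paginates the premium search endpoint for us
    for tweet in tweepy.Cursor(
        api.search_full_archive,  # full-archive premium search
        label="dev",              # hypothetical dev-environment label
        query=query,
        fromDate=fromDate,        # format: YYYYMMDDHHmm
        toDate=toDate,
    ).items(limit):               # cap the number of tweets per call
        tweets.append(tweet)
    return tweets
```

The .items(limit) call is what enforces the per-trigger cap described above.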
tweetCleaner
The second function, tweetCleaner, helps us manipulate the tweet text, because when we retrieved tweets we saw unnecessary text that could affect the correctness of our sentiment analysis. To get a more accurate analysis, we had to clean the tweet text and format it our way.
To do that, we went through all tweets, and if a tweet was a retweet (RT), we split the text once on ":" and took the second element.
Then we used a regex to find unnecessary URLs; if there were any, we removed them all with a for loop, and finally stripped the text before returning it.
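The cleaning logic described above can be sketched like this. It is a minimal sketch, not our exact code; the URL regex is an assumption.

```python
import re

# Minimal sketch of the tweetCleaner logic described above.
def tweetCleaner(text):
    # Retweets start with "RT @user:"; keep only the text after the colon
    if text.startswith("RT") and ":" in text:
        text = text.split(":", 1)[1]
    # Find URLs with a simple regex and remove each one
    for url in re.findall(r"https?://\S+", text):
        text = text.replace(url, "")
    return text.strip()
```

For example, an input like "RT @user: Great show https://t.co/abc" would come out as "Great show".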
sentimentAnalysis
This function is one of the core components of our project: the sentiment analysis itself happens here. We used nested functions to keep the analysis organized.
There are 3 nested functions in sentimentAnalysis:
- getSubjectivity: finds the subjectivity of the tweet.
- getPolarity: gives us the polarity of the tweet.
- getAnalysis: takes the polarity of the tweet and, using that value, labels the tweet as Positive, Neutral, or Negative.
pieChart
We created this function to avoid repeatedly writing the same pie chart creation and plotting code.
lineChart
We created this function to avoid repeatedly writing the same line chart creation and plotting code.
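Based on the descriptions above, the getAnalysis labeling rule and the two chart wrappers could be sketched as follows. This is a minimal sketch, not our exact code: the zero thresholds, function signatures, and file-saving behavior are assumptions. (TextBlob polarity values fall in [-1, 1].)

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Sketch of the getAnalysis labeling rule (thresholds at 0 are assumed)
def getAnalysis(polarity):
    if polarity < 0:
        return "Negative"
    elif polarity == 0:
        return "Neutral"
    return "Positive"

# Sketch of the reusable pieChart wrapper described above
def pieChart(values, labels, title, path):
    fig, ax = plt.subplots()
    ax.pie(values, labels=labels, autopct="%1.1f%%")
    ax.set_title(title)
    fig.savefig(path)
    plt.close(fig)

# Sketch of the reusable lineChart wrapper described above
def lineChart(x, y, title, path):
    fig, ax = plt.subplots()
    ax.plot(x, y, marker="o")
    ax.set_title(title)
    fig.savefig(path)
    plt.close(fig)
```

Wrapping the plotting calls this way is what lets the later chapters draw many charts without repeating boilerplate.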
For this section we use the retrieveTweets function that we created, but to automate retrieval we used a for loop and changed the start/end dates by converting them to datetime.
In this for loop we processed the data and collected all the different variables as lists to concatenate them into a DataFrame:
- With the tweetCleaner function and TextBlob, we cleaned the tweets.
- Tweets not in English were translated with GoogleTranslator.
- Each tweet was analyzed with the sentimentAnalysis function.
To avoid getting stuck with the Twitter API, we decided to store different types of data. In this part we discuss how we managed to store the data.
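The date-windowing part of the loop could be sketched as follows. The start date, the number of days, and the YYYYMMDDHHmm string format are assumptions (the format follows the premium search API's fromDate/toDate convention).

```python
from datetime import datetime, timedelta

# Sketch of generating daily retrieval windows, as described above.
start = datetime(2022, 3, 26)  # hypothetical start date
windows = []
for day in range(3):           # hypothetical number of days
    fromDate = start + timedelta(days=day)
    toDate = fromDate + timedelta(days=1)
    # fromDate/toDate strings in the YYYYMMDDHHmm format the API expects
    windows.append((fromDate.strftime("%Y%m%d%H%M"),
                    toDate.strftime("%Y%m%d%H%M")))
# Each (fromDate, toDate) pair would then be passed to retrieveTweets
```

Each window produced here corresponds to one retrieveTweets call inside the loop.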
First, we stored every kind of data in JSON format under a local path named ./storage/, naming each file dynamically every time we retrieved data.
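The dynamic JSON storage described above could be sketched like this. The ./storage/ path comes from the text; the timestamped filename pattern is an assumption.

```python
import json
import os
from datetime import datetime

# Sketch of storing a batch of tweet data as a dynamically named JSON file.
def storeTweets(tweets, directory="./storage"):
    os.makedirs(directory, exist_ok=True)
    # Timestamp makes every filename unique per retrieval (assumed pattern)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(directory, f"tweets_{stamp}.json")
    with open(path, "w") as f:
        json.dump(tweets, f)
    return path
```

Each call writes one retrieval batch to its own file, so no run overwrites a previous one.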
We stored all the Twitter data that comes with each tweet, as above, but we also created a local database containing only the data we wanted. To do that, we first created a DataFrame with our desired data, and once it was created we wrote it to the local working directory.
While writing the DataFrame in CSV format, we used append mode, because we retrieve the data inside a for loop.
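Appending each loop iteration's batch to one CSV could look like this sketch. The filename and column names are assumptions; the header is written only on the first call so it is not repeated on every append.

```python
import os
import pandas as pd

# Sketch of appending each retrieval batch to a single local CSV.
def appendBatch(df, path="tweets.csv"):
    # mode="a" appends; write the header only if the file is new
    df.to_csv(path, mode="a", index=False,
              header=not os.path.exists(path))

batch = pd.DataFrame({"Tweets": ["good", "bad"],
                      "Analysis": ["Positive", "Negative"]})
appendBatch(batch)
appendBatch(batch)  # second call appends rows without repeating the header
```

The header guard matters: without it, every loop iteration would insert a stray header row into the data.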
Storage was a key concept for our project due to the Twitter API's limitations, so we built our architecture around it.
After we had successfully stored our data, we read the file back and did our manipulation on that.
We grouped the data by Date and Analysis and counted the tweet column to analyze how many Positive, Negative, and Neutral tweets there were for each date around when the Oscars happened.
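The grouping step could be sketched as follows. The column names and sample values are assumptions for illustration.

```python
import pandas as pd

# Sketch of grouping by Date and Analysis and counting the tweet column.
df = pd.DataFrame({
    "Date": ["2022-03-27", "2022-03-27", "2022-03-28"],
    "Analysis": ["Positive", "Negative", "Negative"],
    "Tweets": ["great", "awful", "bad"],
})
# One count per (Date, Analysis) pair
counts = df.groupby(["Date", "Analysis"])["Tweets"].count()
```

The resulting series has one row per (Date, Analysis) pair, which is exactly the shape the bar charts below need.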
We created a bar chart with multiple bars to visualize the positive and negative tweet counts for every date, using loc to find the positive and negative values.
Line charts make it easy to see the upward or downward trends in positive, negative, and neutral tweets for each date. So we created a line chart for each of them by getting the dates and values for each label and passing them to our function lineChart.
With the pie charts we could examine when the most and the fewest positive, negative, and neutral tweets were tweeted in our extracted data, and see the behavior. So we created pie charts with our function pieChart.
With crosstab we can see the same results as in 8.1 with a different approach.
For a different visualization, we transferred it to a heatmap to see how the emotions changed.
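The crosstab step could be sketched as follows; the column names and sample values are assumptions. The resulting table is what would then be handed to a heatmap for plotting.

```python
import pandas as pd

# Sketch of cross-tabulating dates against sentiment labels.
df = pd.DataFrame({
    "Date": ["2022-03-27", "2022-03-27", "2022-03-28"],
    "Analysis": ["Positive", "Negative", "Negative"],
})
# Rows are dates, columns are labels, cells are tweet counts
table = pd.crosstab(df["Date"], df["Analysis"])
```

Missing combinations are filled with 0, so the table is rectangular and ready for a heatmap.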
We wanted to see which devices tweeted the most negative and positive tweets.
We also wanted to see in which languages the most negative and positive tweets were tweeted.
To analyze the incident more deeply, we took hourly tweets from the date of the Oscars, due to our limited access to full-archive tweets. It can be seen later that the emotions changed when Will Smith won an Oscar and when the slap leaked to the internet.
We ran our 5. Retrieving Tweets, 6. Processing Tweets, and 7. Storing Data parts again, on an hourly basis, to gather the storage data below.
The analysis is done the same way as in 8.1.
The bar chart is created the same way as in 8.2, but with hourly data.
The line chart is created the same way as in 8.3, but with hourly data.
A downtrend in negative tweets can be seen when Will Smith won the Oscar, and an uptrend after the slap leaked to the internet.
The negative correlation of the negative tweets can be seen in this case.
In this project we had to deal with several problems, mainly the Twitter API's limitations: HTTP(S) request limits (with Twitter's full-archive access we could make at most 50 requests, with 100 tweets per request), retrieving tweets only through pagination, and TextBlob's limited accuracy and precision.
Since there were limits on HTTP(S) requests, we had to minimize our testing process to achieve the best results with the fewest errors.
In the end, we tried our best to process and visualize the limited data that we gathered and stored.