
Twitter Sentiment Analysis for Will Smith


Class: MAT381E

Lecturer: Atabey Kaygun

Authors: Onur Özkan (@Onurzkan) & Furkan Pehlivan (@furkanpehli1)


Sentiment analysis is the process of analyzing text data and classifying it according to the needs of the research. It can help us decipher the mood and emotions of the general public and gather insightful information about a topic.

We wanted to perform a sentiment analysis of an international scandal on Twitter using the Twitter API. Since the latest event on the agenda all over the world was Will Smith's behavior at the Oscars, we decided to narrow our scope and analyze this topic.

Will Smith has always had a good public image. The latest scandal changed that, which makes the subject both suitable for sentiment analysis and interesting. We wanted to see the public reaction on Twitter to an international scandal.

(Figure: the slap)


1. Application for Twitter API

Because the Twitter API is not openly accessible, we had to apply for access. In this section we discuss how we applied.

First, we signed up on Twitter's official developer website, filled out some forms, and provided information about why we needed the Twitter API.

After our application was reviewed, Twitter sent us an email asking for more detailed information about our app, as you can see below.

(Figure: the Twitter developer application email)

Lastly, they granted us access to the Developer Platform, where we created our app. Once all that was done, we were ready to go!

(Figure: our app created on the Twitter Developer Platform)


2. Libraries

```python
from textblob import TextBlob
import tweepy
import pandas as pd
import json
import regex as re
import os
import datetime
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from deep_translator import GoogleTranslator
```

In Python, a library is a collection of related modules. It contains bundles of code that can be reused across different programs; in short, libraries make our lives much easier. In our project we used 11 different libraries, but we want to emphasize the 3 main ones that are essential:

2.1. Textblob

TextBlob is a Python library for processing textual data. It provides a simple API for natural language processing (NLP).

In our project we used TextBlob for sentiment analysis.
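As a minimal sketch of the kind of call we rely on (the sample sentence below is made up), TextBlob exposes polarity and subjectivity through its sentiment property:

```python
from textblob import TextBlob

# Hypothetical example sentence; not from our dataset.
blob = TextBlob("I absolutely loved his performance")
print(blob.sentiment.polarity)      # float in [-1, 1]; > 0 leans positive
print(blob.sentiment.subjectivity)  # float in [0, 1]; higher is more subjective
```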

2.2. Tweepy

Tweepy is an open-source, easy-to-use Python library for accessing the Twitter API. We chose Tweepy to work with the Twitter API.

2.3. Deep_translator

The deep_translator library is a flexible, free, and unlimited tool for translating between languages in a simple way using multiple translation backends.

This library helped us make our analysis more accurate, because TextBlob cannot analyze languages other than English. So we used it to translate the tweets that were not in English.
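As a quick, hedged illustration (the Turkish sentence below is made up), the translator call we use looks like this:

```python
from deep_translator import GoogleTranslator

# Hypothetical non-English tweet; translated to English before analysis.
text_tr = "Will Smith'in Oscar'daki davranışı şok ediciydi."
text_en = GoogleTranslator(source='auto', target='en').translate(text_tr)
print(text_en)
```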


3. Twitter API Authentication

As we discussed in the first chapter, the Twitter API is not openly accessible, so we need an authenticated user to access it.

In advance, we created a JSON file for our credentials, read it, and transformed it into a pandas Series. Then we assigned the credentials to variables, as you can see in the code block below. For authentication we used Tweepy's OAuth 1.0a User Context method.

```python
APIKeys = pd.read_json('credentials.json', typ='series')
APIKey = APIKeys['APIKey']
APISecretKey = APIKeys['APISecretKey']
BearerToken = APIKeys['BearerToken']
AccessToken = APIKeys['AccessToken']
AccessTokenSecret = APIKeys['AccessTokenSecret']
```
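For reference, here is a minimal sketch of what credentials.json might look like; the keys match the code above, and the values are placeholders, not real credentials:

```json
{
  "APIKey": "YOUR-API-KEY",
  "APISecretKey": "YOUR-API-SECRET-KEY",
  "BearerToken": "YOUR-BEARER-TOKEN",
  "AccessToken": "YOUR-ACCESS-TOKEN",
  "AccessTokenSecret": "YOUR-ACCESS-TOKEN-SECRET"
}
```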

After assigning the credential values to variables, we passed them to Tweepy's OAuth1UserHandler and got authenticated; then we passed the auth object to tweepy.API and stored the result in a variable named api. We will use this API handle in the following steps.

```python
auth = tweepy.OAuth1UserHandler(
    APIKey, APISecretKey, AccessToken, AccessTokenSecret
)
api = tweepy.API(auth)
```

4. Functions

We declared 5 functions, named retrieveTweets, tweetCleaner, sentimentAnalysis, pieChart, and lineChart, for different purposes.

4.1. Data Collection via retrieveTweets

The first function, retrieveTweets, retrieves tweets within a given time window. It takes two parameters: a start date and an end date.

To get the data we used Tweepy's Cursor class, which basically handles pagination for API methods.

We provided different parameters to that class:

  1. api.search_30_day selects the search method and our environment label. This premium endpoint lets us retrieve tweets from the past 30 days.
  2. The query parameter lets us search for specific tweets; to retrieve relevant data we queried by hashtag.
  3. The fromDate and toDate parameters define the time window for which we want to retrieve data.

After that we also wanted to limit our retrieval, because even though we have access to the Twitter API, that access is rate-limited. So we limited the number of tweets retrieved each time the function is triggered.

Finally, we converted the tweet data into a Python list and returned it.

```python
def retrieveTweets(startDate, endDate):
    return list(tweepy.Cursor(api.search_30_day, "SentimentAnlysis30Days",
                              query="#willsmith",
                              fromDate=startDate, toDate=endDate).items(100))
```

4.2. Cleaning Data with tweetCleaner

The second function, tweetCleaner, helps us clean the tweet text: when we retrieved tweets, we saw unnecessary text that could affect the correctness of our sentiment analysis. To get a more accurate analysis, we had to manipulate the tweet text and format it our way.

To do that, we check every tweet: if it contains 'RT @', we split the text once on ':' and take the second element.

Next we used a regex to find unnecessary URLs; if there are any, we remove them all with a for loop and finally strip the text before returning it.

```python
def tweetCleaner(tweet):
    # For retweets, drop the "RT @user:" prefix and keep only the quoted text.
    if 'RT @' in tweet:
        tweet = tweet.split(':', 1)[1].strip()
    # Find and remove every URL in the tweet.
    urls = re.findall(r'(https?://[^\s]+)', tweet)
    for url in urls:
        tweet = tweet.replace(url, '')
    return tweet.strip()
```
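For example, on a made-up retweet the function strips the "RT @user:" prefix and the URL:

```python
# Hypothetical input; not a real tweet from our dataset.
raw = "RT @someone: Can't believe that happened https://t.co/abc123"
print(tweetCleaner(raw))  # -> "Can't believe that happened"
```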

4.3. Sentiment Analysis with sentimentAnalysis

This function is one of the core components of this project: this is where we actually do the sentiment analysis. We used nested functions to keep the analysis organized.

There are 3 nested functions in sentimentAnalysis:

  1. getSubjectivity: finds the subjectivity score of the tweet.
  2. getPolarity: gives us the polarity score of the tweet.
  3. getAnalysis: takes the polarity of the tweet and, based on its value, labels the tweet as Positive, Neutral, or Negative.
```python
def sentimentAnalysis(tweet):
    def getSubjectivity(text):
        return TextBlob(text).sentiment.subjectivity

    def getPolarity(text):
        return TextBlob(text).sentiment.polarity

    def getAnalysis(score):
        if score < 0:
            return "Negative"
        elif score == 0:
            return "Neutral"
        else:
            return "Positive"

    return [tweet], [getSubjectivity(tweet)], [getPolarity(tweet)], [getAnalysis(getPolarity(tweet))]
```
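For example, calling it on a made-up sentence returns four single-element lists: the tweet, its subjectivity, its polarity, and the label:

```python
# Hypothetical input; returns ([tweet], [subjectivity], [polarity], [label]).
text, subjectivity, polarity, label = sentimentAnalysis("I am so disappointed in him")
print(subjectivity, polarity, label)
```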

4.4. Creating Pie Chart with pieChart

We created this function to avoid repeating the pie-chart creation and plotting code.

```python
def pieChart(values, label, title, size):
    plt.rcParams['figure.figsize'] = size
    plt.pie(values, labels=label, shadow=True)
    plt.title(title)
    plt.show()
```

4.5. Creating Line Chart with lineChart

We created this function to avoid repeating the line-chart creation and plotting code.

```python
def lineChart(label, values, color, title):
    plt.subplots(figsize=(28, 10))
    plt.plot(label, values, color=color, marker='o')
    plt.title(title + ' Vs Date', fontsize=14)
    plt.xlabel('Date', fontsize=16)
    plt.ylabel(title, fontsize=16)
    plt.grid(True)
    plt.show()
```

5. Retrieving Tweets

In this section we use the retrieveTweets function we created; to automate the retrieval, we ran it in a for loop and advanced the start date by converting it to a datetime, adding a day, and converting it back.

```python
startDate = "202204280000"
endDate = "202205280000"
mainTweetList = []

for i in range(7):
    mainTweetList.append(retrieveTweets(startDate, endDate))
    startDate = datetime.datetime.strptime(startDate, "%Y%m%d%H%M") + datetime.timedelta(days=+1)
    startDate = startDate.strftime("%Y%m%d%H%M")
    print(startDate)
```

6. Processing Tweets

In this for loop we processed the data and collected the different variables as lists, so we could later concatenate them into a DataFrame:

  1. We declared a storage list to keep the raw tweets and reuse them if needed, because Twitter allows only a limited number of requests and retrieved tweets for full-archive access.
  2. We declared lists for all of the data we wanted to keep.
  3. In the for loop we appended all the desired data.
  4. Using our tweetCleaner function and TextBlob, we cleaned the tweets.
  5. Then we had to translate the tweets: with an if check, we translated the tweets that are not in English using GoogleTranslator.
  6. Finally, the appending continued with the sentiment data returned by our sentimentAnalysis function.
```python
storage = []
processedData = []
textTweet, langTweet, sourceTweet, geoTweet, dateTweet = [], [], [], [], []
tweetList, subjectivityList, polarityList, analysisList = [], [], [], []

for tweets in mainTweetList:
    for tweet in tweets:
        # Keep the raw tweet so we can reuse it without another API request.
        storage.append(tweet)
        textTweet.append(tweetCleaner(tweet.text))
        dateTweet.append(tweet.created_at.strftime('%Y-%m-%d'))
        langTweet.append(tweet.lang)
        sourceTweet.append(tweet.source)
        # Clean and tokenize the text.
        tweetBlob = TextBlob(tweetCleaner(tweet.text))
        sentence = ' '.join(tweetBlob.words)
        # Translate non-English tweets before the sentiment analysis.
        if tweet.lang != 'en':
            sentence = GoogleTranslator(source='auto', target='en').translate(sentence)
        [returnedTweet, returnedSubjectivityList, returnedPolarityList, returnedAnalysisList] = sentimentAnalysis(sentence)
        tweetList.append(returnedTweet)
        subjectivityList.append(returnedSubjectivityList)
        polarityList.append(returnedPolarityList)
        analysisList.append(returnedAnalysisList)
```

7. Storing Data

To avoid getting stuck against the Twitter API limits, we decided to store different types of data. In this part we discuss how we managed to store them.

First, we stored the complete raw data in JSON format under a local path named ./storage/, naming each file dynamically every time we retrieved data.

```python
# Writing the raw data to a storage file in JSON format
storage_data_df = pd.DataFrame(storage)
storage_data = pd.DataFrame.to_json(storage_data_df)
storage_data_js = json.loads(storage_data)
storage_data = json.dumps(storage_data_js, indent=2)
with open('./storage/' + startDate + '-' + endDate + '.json', "w", encoding='utf-8') as file:
    file.write(storage_data)
```

Above we stored all of the Twitter data that comes with each tweet, but we also created a local database containing only the data we wanted. To do that, we first created a DataFrame with the desired columns and then wrote it to the local working directory.

```python
dateDf = pd.DataFrame(dateTweet, columns=['Date'])
tweetListDf = pd.DataFrame(tweetList, columns=['Tweets'])
subjectivityListDf = pd.DataFrame(subjectivityList, columns=['Subjectivity'])
polarityListDf = pd.DataFrame(polarityList, columns=['Polarity'])
analysisListDf = pd.DataFrame(analysisList, columns=['Analysis'])
langListDf = pd.DataFrame(langTweet, columns=['Language'])
sourceListDf = pd.DataFrame(sourceTweet, columns=['Source Device'])
processedDataDf = pd.concat([tweetListDf, subjectivityListDf, polarityListDf,
                             analysisListDf, dateDf, langListDf, sourceListDf], axis=1)
processedDataDf
```

While writing the DataFrame in CSV format we used append mode, because we retrieve data inside a for loop.

```python
mainStoragePath = './main-storage.csv'
processedDataDf.to_csv(mainStoragePath, mode='a', index=False,
                       header=not os.path.exists(mainStoragePath))
```

Storing was a key concept of our project because of the Twitter API limits, so we built our architecture around it.
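For example, here is a hedged sketch (the file name is illustrative) of how a stored raw batch can be read back later without spending any API requests:

```python
import json

# Reload a previously stored batch of raw tweet data from the storage folder.
with open('./storage/202204280000-202205280000.json', encoding='utf-8') as file:
    stored = json.load(file)
```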

After we successfully stored our data, we read the file back and did our manipulations on that.

```python
pulledData = pd.read_csv(mainStoragePath)
pulledData
```

8. Data Analysis & Visualization

8.1. Analysing Data For Each Date

We grouped the data by Date and Analysis and counted the Tweets column to analyze how many Positive, Negative, and Neutral tweets there were on each date around the Oscars.

  • The 94th Academy Awards (Oscars) took place on March 27, 2022; in GMT+3 that is March 28, 2022, 03:00.
```python
groupedData = pulledData.groupby(["Date", "Analysis"])["Tweets"].count()
pd.DataFrame(groupedData)
```

8.2. Creating Bar Chart

We created a bar chart with multiple bars to visualize the positive and negative tweet counts for every date.

  • For that we gathered the dates as the X index by taking the first-level index values of the grouped data.
  • Then, for the Y values, we used loc to select the positive and negative counts.
  • After that we created the bars for both the positive and negative values.
  • Finally, labeling the X axis and adding the title is all that is left.
```python
plt.subplots(figsize=(30, 12))
X = list(groupedData.index.get_level_values(0).drop_duplicates())
X_axis = np.arange(len(X))
YNegative = groupedData.loc[:, "Negative"].values
YPositive = groupedData.loc[:, "Positive"].values
plt.bar(X_axis - 0.2, YNegative, 0.4, label='Negative')
plt.bar(X_axis + 0.2, YPositive, 0.4, label='Positive')
plt.xticks(X_axis, X)
plt.xlabel("Date")
plt.ylabel("Number of Tweets")
plt.title("Number of Tweet Analysis for Each Date")
plt.legend()
plt.show()
```

(Figure: bar chart of tweet analysis per date)

8.3. Creating Line Charts

Line charts make it easy to see the upward and downward trends in positive, negative, and neutral tweets for each date. So we created a line chart for each of them by getting the dates and values and passing them to our lineChart function.

8.3.1 Line Chart For Negative Tweets

```python
Date = list(groupedData.index.get_level_values(0).drop_duplicates())
Negativity = groupedData.loc[:, "Negative"].values
lineChart(Date, Negativity, "red", "Negativity")
```

(Figure: negative tweets vs. date)

8.3.2 Line Chart For Positive Tweets

```python
Positivity = groupedData.loc[:, "Positive"].values
lineChart(Date, Positivity, "green", "Positivity")
```

(Figure: positive tweets vs. date)

8.3.3 Line Chart For Neutral Tweets

```python
Neutral = groupedData.loc[:, "Neutral"].values
lineChart(Date, Neutral, "orange", "Neutral")
```

(Figure: neutral tweets vs. date)

8.4. Creating Pie Charts

With the pie charts we could examine, from our extracted data, on which dates the most and fewest positive, negative, and neutral tweets were posted, and see the behavior. So we created the pie charts with our pieChart function.

8.4.1. Pie Chart For Negative Tweets

```python
pieChart(groupedData.loc[:, "Negative"].values,
         groupedData.index.get_level_values(0).drop_duplicates(),
         "Negative Tweets", (8, 8))
```

(Figure: pie chart of negative tweets)

8.4.2. Pie Chart For Positive Tweets

```python
pieChart(groupedData.loc[:, "Positive"].values,
         groupedData.index.get_level_values(0).drop_duplicates(),
         "Positive Tweets", (8, 8))
```

(Figure: pie chart of positive tweets)

8.4.3. Pie Chart For Neutral Tweets

```python
pieChart(groupedData.loc[:, "Neutral"].values,
         groupedData.index.get_level_values(0).drop_duplicates(),
         "Neutral Tweets", (8, 8))
```

(Figure: pie chart of neutral tweets)

8.5. Creating Crosstab & Heatmap

With the crosstab we can see the same results as in 8.1, with a different approach.

```python
cross = pd.crosstab(pulledData["Date"], pulledData["Analysis"])
cross
```

For a different visualization, we turned it into a heatmap to see how the emotions changed.

```python
sns.heatmap(cross, cmap='Oranges')
plt.show()
```

(Figure: heatmap of tweet analysis by date)

8.6. Some Fun Fact Charts

8.6.1 Negative & Positive Tweets by Source Device (Pie Charts)

We wanted to see which devices tweeted the most negative and positive tweets.

```python
sourceGrouped = pulledData.groupby(["Analysis", "Source Device"]).size()
sourceGroupedN = sourceGrouped.loc["Negative"]
pieChart(sourceGroupedN.values, sourceGroupedN.index,
         "Negative Tweets From Which Device in Timeline", (18, 18))
```

(Figure: pie chart of negative tweets by source device)

```python
sourceGroupedP = sourceGrouped.loc["Positive"]
pieChart(sourceGroupedP.values, sourceGroupedP.index,
         "Positive Tweets From Which Device in Timeline", (35, 35))
```

(Figure: pie chart of positive tweets by source device)

8.6.2 Negative & Positive Tweets by Language (Pie Charts)

We also wanted to see in which languages the most negative and positive tweets were written.

```python
langGrouped = pulledData.groupby(["Analysis", "Language"]).size()
langGroupedN = langGrouped.loc["Negative"]
pieChart(langGroupedN.values, langGroupedN.index,
         "Negative Tweets From Which Language in Timeline", (18, 18))
```

(Figure: pie chart of negative tweets by language)

```python
langGroupedP = langGrouped.loc["Positive"]
pieChart(langGroupedP.values, langGroupedP.index,
         "Positive Tweets From Which Language in Timeline", (18, 18))
```

(Figure: pie chart of positive tweets by language)


9. Data Analysis & Visualization For Hourly Data

To dig deeper into the incident, we retrieved hourly tweets from the date of the Oscars, since our full-archive access was limited. It can be seen below that the emotions changed when Will Smith won an Oscar and when the slap spread across the internet.

  • The 94th Academy Awards (Oscars) took place on March 27, 2022; in GMT+3 that is March 28, 2022, 03:00.

9.1. Extracting The Hourly Data

We ran our 5. Retrieving Tweets, 6. Processing Tweets, and 7. Storing Data steps again with an hourly approach to gather the storage data below.

```python
hourStoragePath = './main-storage-hour.csv'
pulledDataHourly = pd.read_csv(hourStoragePath)
pulledDataHourly
```

9.2. Analysing Data For Each Hour

The analysis is done the same way as in 8.1.

```python
groupedDataHourly = pulledDataHourly.groupby(["Date", "Analysis"])["Tweets"].count()
pd.DataFrame(groupedDataHourly)
```

9.3. Creating Bar Chart For Hourly Data

The bar chart is created the same way as in 8.2, but with the hourly data.

```python
plt.subplots(figsize=(30, 12))
X = list(groupedDataHourly.index.get_level_values(0).drop_duplicates())
X_axis = np.arange(len(X))
YNegative = groupedDataHourly.loc[:, "Negative"].values
YPositive = groupedDataHourly.loc[:, "Positive"].values
plt.bar(X_axis - 0.2, YNegative, 0.4, label='Negative')
plt.bar(X_axis + 0.2, YPositive, 0.4, label='Positive')
plt.xticks(X_axis, X)
plt.xlabel("Date")
plt.ylabel("Number of Tweets")
plt.title("Number of Tweet Analysis for Each Date")
plt.legend()
plt.show()
```

(Figure: hourly tweet analysis bar chart)

9.4. Creating Line Chart For Hourly Data

The line charts are created the same way as in 8.3, but with the hourly data.

9.4.1 Line Chart For Negative Tweets

The downtrend in negative tweets when Will Smith won the Oscar, and the uptrend after the slap spread across the internet, can both be seen.

```python
Date = list(groupedDataHourly.index.get_level_values(0).drop_duplicates())
Negativity = groupedDataHourly.loc[:, "Negative"].values
lineChart(Date, Negativity, "red", "Negativity")
```

(Figure: negative tweets vs. hour)

9.4.2 Line Chart For Positive Tweets

The negative correlation with the negative tweets can be seen in this case.

```python
Positivity = groupedDataHourly.loc[:, "Positive"].values
lineChart(Date, Positivity, "green", "Positivity")
```

(Figure: positive tweets vs. hour)

9.4.3 Line Chart For Neutral Tweets

```python
Neutral = groupedDataHourly.loc[:, "Neutral"].values
lineChart(Date, Neutral, "orange", "Neutral")
```

(Figure: neutral tweets vs. hour)


10. Conclusion

In this project we had to deal with several problems: Twitter API limitations such as HTTP(S) request limits (with Twitter's full-archive access we could make at most 50 requests, with 100 tweets per request), the fact that tweets can only be retrieved through pagination, and TextBlob's accuracy and precision.

(Figure: Developer Platform dashboard)

We mentioned that HTTP(S) requests were limited, so we had to minimize our testing process to achieve the best results with as few errors as possible.

In the end, we did our best to process and visualize the limited data that we gathered and stored.