---
title: "CogCom2022"
author: "Daniella Varga"
date: "2022-12-19"
output:
  pdf_document: default
  html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = "D:/Desktop/AU/CogCom/TakeHomeExam")
```
```{r}
rm(list=ls()) #cleans working environment
getwd()
```
```{r}
library(tidyverse)
library(dplyr)
library(sentimentr)
library(syuzhet)
library(ggplot2)
library(graphics)
library(stats)
library(lme4)
```
```{r}
# reading data
billboard <- read.csv("D:/Desktop/AU/CogCom/TakeHomeExam/billboardHot100_1999-2019.csv")
# keep only the top 10 instead of the top 100
billboard <- billboard %>%
  filter(Weekly.rank >= 1 & Weekly.rank <= 10)
```
```{r}
# New data frame with only the unique lyrics (no repetition)
df_unique <- billboard %>%
  distinct(Lyrics)
# Use get_sentiment() from the syuzhet package to compute a sentiment score for each set of lyrics
df_unique$sentiment <- get_sentiment(df_unique$Lyrics)
# Merge the sentiment scores back into the original data frame
df <- left_join(billboard, df_unique[, c("Lyrics", "sentiment")], by = "Lyrics")
df2 <- df %>%
  select(Artists, Name, Genre, Lyrics, Weekly.rank, Week, sentiment)
```
### VISUALIZING SENTIMENT SCORES OVER TIME ###
```{r}
# Convert the week variable to a date or datetime format
df2$Week <- as.Date(df2$Week, format = "%Y-%m-%d")
# Create a time series plot with a gradient color trend line
p2 <- ggplot(data = df2, aes(x = Week, y = sentiment)) +
  geom_line() +
  geom_smooth(aes(color = after_stat(y)), method = "gam", se = FALSE) +
  scale_color_gradient2(low = "blue", mid = "yellow", high = "red",
                        midpoint = mean(df2$sentiment),
                        name = "Sentiment Score") +
  labs(title = "Sentiment score over time", x = "Week", y = "Sentiment score") +
  theme_bw()
p2
# Save the plot as a JPEG file
ggsave("p2.jpg", p2, device = "jpg")
```
### VISUALIZING RELATIONSHIP BETWEEN SENTIMENT SCORE AND POPULARITY (WEEKLY.RANK) ###
```{r}
# Score each point on the first principal component of sentiment and Weekly.rank;
# the score is used only to color the points with a gradient
df2$sent <- predict(prcomp(~sentiment+Weekly.rank, df2))[,1]
# Create a scatter plot of the sentiment scores and the weekly ranks
plot <- ggplot(df2, aes(x = sentiment, y = Weekly.rank, color = sent)) +
  geom_point(size = 2.5, show.legend = FALSE) +
  labs(title = "Relationship between sentiment scores and weekly ranks", x = "Sentiment score", y = "Weekly rank") +
  theme_minimal() +
  scale_color_gradient(low = "#b846ff", high = "#ffaf5a")
plot
# Save the plot as a JPEG file
ggsave("plot.jpg", plot, device = "jpeg")
```
The scatterplot of sentiment against weekly rank shows several parallel lines of points. Because `Weekly.rank` only takes the integer values 1 to 10 after filtering, each rank forms its own horizontal band, so the lines by themselves do not demonstrate multiple linear relationships. Groups or subgroups in the data (for example genres or artists) could still differ in how sentiment relates to weekly rank.
To analyze the relationship between sentiment and weekly rank further, you can model it statistically. One option is multiple linear regression, which models a dependent variable as a function of two or more independent variables.
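The multiple-regression idea above can be sketched as follows. This is a minimal illustration on simulated data, not the analysis of the Billboard file: the data frame `toy` and the model names `fit_additive` and `fit_interaction` are hypothetical, and the block is not evaluated when the document is knit.

```r
# Simulated stand-in for df2 (hypothetical values, not the Billboard data)
set.seed(1)
toy <- data.frame(
  sentiment   = rnorm(200),
  Genre       = factor(sample(c("Pop", "Rap", "Rock"), 200, replace = TRUE)),
  Weekly.rank = sample(1:10, 200, replace = TRUE)
)

# Multiple linear regression: sentiment and genre as joint predictors of rank
fit_additive <- lm(Weekly.rank ~ sentiment + Genre, data = toy)

# The interaction term lets the sentiment slope differ by genre,
# which is one way to look for subgroup-specific relationships
fit_interaction <- lm(Weekly.rank ~ sentiment * Genre, data = toy)

# F test for nested models: does the interaction improve the fit?
anova(fit_additive, fit_interaction)
```

With the real data, `toy` would be replaced by `df2`. Since `Weekly.rank` is an ordinal variable, a linear model is only a first approximation.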
```{r}
# Shapiro-Wilk test (shapiro.test() accepts 3 to 5000 values; with large samples even small deviations from normality give small p-values)
shapiro.test(df_unique$sentiment)
#Anderson-Darling test
library(nortest)
ad.test(df2$sentiment)
```
```{r}
# The sentiment scores are not normally distributed (Shapiro-Wilk test, p-value...), so use a non-parametric test instead of t.test()
# Work on a copy so that df2 keeps its numeric sentiment scores
df4 <- df2
# Recode sentiment as a binary label: scores above 0 are "positive", all others (including 0) are "negative"
df4$sentiment <- ifelse(df4$sentiment > 0, "positive", "negative")
# Wilcoxon rank-sum test: the non-parametric alternative to the two-sample t-test
wilcox.test(Weekly.rank ~ sentiment, data = df4)
```
```{r}
# Linear Models
model1 <- lm(Weekly.rank ~ sentiment, data = df2)
model2 <- lm(Weekly.rank ~ Artists, data = df2)
model3 <- lm(Weekly.rank ~ Genre, data = df2)
AIC(model1, model2, model3)
# Print the model summary
#summary(model1)
#summary(model2)
#summary(model3)
```
In the output of the summary function for model1, the p-value for the sentiment coefficient is 7.55e-08. This means that, if sentiment and Weekly.rank were truly unrelated, a coefficient at least this far from zero would be observed with a probability of about 7.55e-08 (far below 0.001%).
A p-value below 0.05 is conventionally considered statistically significant. Here the p-value of 7.55e-08 is much lower than 0.05, so the relationship between sentiment and Weekly.rank is statistically significant.
Based on these results, you can reject the null hypothesis that sentiment is not associated with Weekly.rank. Statistical significance alone does not say how strong the association is.
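The p-value discussed above can be read directly from the coefficient table of a fitted model. The sketch below uses simulated data, so `toy`, `fit`, and the effect size of -0.3 are assumptions for illustration only; the block is not evaluated when the document is knit.

```r
# Simulated stand-in for df2 (hypothetical values, not the Billboard data)
set.seed(42)
toy <- data.frame(sentiment = rnorm(500))
toy$Weekly.rank <- 5.5 - 0.3 * toy$sentiment + rnorm(500)

fit <- lm(Weekly.rank ~ sentiment, data = toy)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs   <- summary(fit)$coefficients
slope   <- coefs["sentiment", "Estimate"]
p_value <- coefs["sentiment", "Pr(>|t|)"]

# 95% confidence interval for the slope, which shows the size of the
# effect in addition to its statistical significance
ci <- confint(fit, "sentiment", level = 0.95)
```

Reporting the slope and its confidence interval next to the p-value makes it easier to judge whether a significant effect is also a meaningful one.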
```{r}
# Calculate the correlation coefficient between the sentiment scores and the weekly ranks
df2 %>%
summarize(correlation = cor(sentiment, Weekly.rank, method = "pearson"))
```
A correlation coefficient of -0.05469163 indicates a very weak negative relationship between the sentiment scores and the weekly ranks: as the sentiment scores increase, the weekly rank number tends to decrease slightly (a lower number means a higher chart position). A coefficient closer to -1 or 1 would indicate a stronger relationship.
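As a possible complement to the point estimate, `cor.test()` also returns a confidence interval and a p-value. The sketch below runs on simulated data (`toy` is hypothetical) and additionally shows Spearman's rank correlation, on the assumption that a rank-based measure may suit the ordinal `Weekly.rank` better. The block is not evaluated when the document is knit.

```r
# Simulated stand-in for df2 (hypothetical values, not the Billboard data)
set.seed(7)
toy <- data.frame(sentiment = rnorm(300))
toy$Weekly.rank <- sample(1:10, 300, replace = TRUE)

# Pearson correlation with confidence interval and p-value
pearson <- cor.test(toy$sentiment, toy$Weekly.rank, method = "pearson")
pearson$estimate
pearson$conf.int
pearson$p.value

# Spearman rank correlation; exact = FALSE avoids the warning about
# exact p-values when the data contain ties
spearman <- cor.test(toy$sentiment, toy$Weekly.rank,
                     method = "spearman", exact = FALSE)
spearman$estimate
```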
### RAP SONGS BEFORE AND AFTER 2010 ###
```{r}
# Filter the data to include only songs from the rap genre
df_rap <- df2 %>% filter(Genre == "Rap")
# Make sure Week is a Date (it was already converted above, so the values stay unchanged)
df3 <- mutate(df_rap, Week = as.Date(Week))
# Create a new column indicating whether a song charted before 2010 or from 2010 onwards
df3 <- mutate(df3, Year = ifelse(Week < as.Date("2010-01-01"), "before 2010", "after 2010"))
# Create a box plot with one panel per period
plot_rap <- ggplot(data = df3, aes(x = Year, y = sentiment, fill = factor(Year))) +
  stat_boxplot(geom = "errorbar",
               width = 0.15) +
  geom_boxplot() +
  scale_fill_manual(values = c("#b959e4", "#90EE90"),
                    name = "Date") +
  facet_wrap(~ Year, scales = "free_x") +
  labs(title = "Sentiment value of rap songs before and after 2010", x = " ", y = "Sentiment score")
plot_rap
# Save the plot as a JPEG file
ggsave("plot_rap.jpg", plot_rap, device = "jpeg")
```
To determine if rap songs after 2010 have lower sentiment values, you can use a t-test to compare the mean sentiment value of rap songs before 2010 to the mean sentiment value of rap songs after 2010.
If the data is not normally distributed, you can use a nonparametric test, such as the Wilcoxon rank-sum test, to compare the sentiment values of rap songs before and after 2010.
```{r}
# Split the data into two groups based on the Year column
before_2010 <- df3[df3$Year == "before 2010", ]
after_2010 <- df3[df3$Year == "after 2010", ]
# Perform a Mann-Whitney U test (Wilcoxon rank-sum test)
result <- wilcox.test(before_2010$sentiment, after_2010$sentiment)
print(result)
# The p-value is 0.1053, which is greater than the significance level of 0.05.
# The difference between the sentiment distributions of rap songs before and after 2010
# is therefore not statistically significant, and the null hypothesis cannot be rejected.
```