
---
title: "CogCom2022"
author: "Daniella Varga"
date: "2022-12-19"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = "D:/Desktop/AU/CogCom/TakeHomeExam")
```

```{r}
rm(list = ls()) # cleans working environment
getwd()
```

```{r}
library(tidyverse)
library(dplyr)
library(sentimentr)
library(syuzhet)
library(ggplot2)
library(graphics)
library(stats)
library(lme4)
```

```{r}
# Read the data
billboard <- read.csv("D:/Desktop/AU/CogCom/TakeHomeExam/billboardHot100_1999-2019.csv")

# Keep only the top 10 instead of the top 100
billboard <- billboard %>%
  filter(Weekly.rank >= 1 & Weekly.rank <= 10)
```

```{r}
# New dataframe with only the unique lyrics (no repetition)
df_unique <- billboard %>%
  distinct(Lyrics)

# Compute the sentiment for each set of lyrics
# (get_sentiment() comes from the syuzhet package)
df_unique$sentiment <- get_sentiment(df_unique$Lyrics)

# Merge the sentiment scores back into the original dataframe
df <- left_join(billboard, df_unique[, c("Lyrics", "sentiment")], by = "Lyrics")

df2 <- df %>%
  select(Artists, Name, Genre, Lyrics, Weekly.rank, Week, sentiment)
```

### VISUALIZING SENTIMENT SCORES OVER TIME ###

```{r}
# Convert the Week variable to a Date
df2$Week <- as.Date(df2$Week, format = "%Y-%m-%d")

# Create a time series plot with a gradient color trend line
# (after_stat(y) replaces the deprecated ..y.. notation)
p2 <- ggplot(data = df2, aes(x = Week, y = sentiment)) +
  geom_line() +
  geom_smooth(aes(color = after_stat(y)), method = "gam", se = FALSE) +
  scale_color_gradient2(low = "blue", mid = "yellow", high = "red",
                        midpoint = mean(df2$sentiment),
                        name = "Sentiment Score") +
  labs(title = "Sentiment score over time",
       x = "Week", y = "Sentiment score") +
  theme_bw()
p2

# Save the plot as a JPEG file
ggsave("p2.jpg", p2, device = "jpeg")
```

### VISUALIZING RELATIONSHIP BTW SENTIMENT SCORE AND POPULARITY(WEEKLY.RANK) ###

```{r}
# Use the first principal component of sentiment and weekly rank
# as the variable that drives the color gradient
df2$sent <- predict(prcomp(~ sentiment + Weekly.rank, df2))[, 1]

# Create a scatter plot of the sentiment scores and the weekly ranks
plot <- ggplot(df2, aes(x = sentiment, y = Weekly.rank, color = sent)) +
  geom_point(size = 2.5, show.legend = FALSE) +
  labs(title = "Relationship between sentiment scores and weekly ranks",
       x = "Sentiment score", y = "Weekly rank") +
  theme_minimal() +
  scale_color_gradient(low = "#b846ff", high = "#ffaf5a")
plot

# Save the plot as a JPEG file
ggsave("plot.jpg", plot, device = "jpeg")
```

The scatterplot of sentiment against weekly rank shows several line-like bands of points. This may indicate groups or subgroups in the data that differ in how sentiment relates to weekly rank. It may also partly reflect the structure of the data: Weekly.rank takes only the integer values 1 to 10, and each song keeps the same sentiment score for every week it spends in the top 10.

To analyze the relationship between sentiment and weekly rank further, you can use a statistical method that models it explicitly. One option is multiple linear regression, which models the relationship between a dependent variable and one or more independent variables.

```{r}
# Shapiro-Wilk test -> often considered unreliable above about 50 samples
# (shapiro.test() accepts between 3 and 5000 values)
shapiro.test(df_unique$sentiment)

# Anderson-Darling test
library(nortest)
ad.test(df2$sentiment)
```

```{r}
# Data => not normally distributed (Shapiro-Wilk test => p-value...)
# => non-parametric test instead of t.test()

# Work on a copy of df2
df4 <- df2

# Recode sentiment as "positive" (> 0) or "negative" (<= 0)
df4$sentiment <- ifelse(df4$sentiment > 0, "positive", "negative")

# Wilcoxon rank-sum test: the non-parametric alternative to the t-test,
# used because the data is not normally distributed
wilcox.test(Weekly.rank ~ sentiment, data = df4)
```

```{r}
# Linear models
model1 <- lm(Weekly.rank ~ sentiment, data = df2)
model2 <- lm(Weekly.rank ~ Artists, data = df2)
model3 <- lm(Weekly.rank ~ Genre, data = df2)

AIC(model1, model2, model3)

# Print the model summary
#summary(model1)
#summary(model2)
#summary(model3)
```

In the output of `summary(model1)`, the p-value for the sentiment coefficient is 7.55e-08. This means that, if there were no linear relationship between sentiment and Weekly.rank, a coefficient at least this far from zero would occur with a probability of less than 0.001%. By convention, a p-value below 0.05 is considered statistically significant. The p-value of 7.55e-08 is far below that threshold, so you can reject the null hypothesis that sentiment is unrelated to Weekly.rank. Statistical significance alone does not show that the relationship is strong or that sentiment causes a song's rank.

```{r}
# Calculate the correlation coefficient between the sentiment scores and the weekly ranks
df2 %>%
  summarize(correlation = cor(sentiment, Weekly.rank, method = "pearson"))
```

A correlation coefficient of -0.05469163 indicates a very weak negative relationship between the sentiment scores and the weekly ranks: as the sentiment scores increase, the weekly rank numbers decrease slightly (a lower rank number is a higher chart position). A coefficient closer to -1 or 1 would indicate a stronger relationship.

### RAP SONGS BEFORE AND AFTER 2010 ###

```{r}
# Filter the data to include only songs from the rap genre
df_rap <- df2 %>%
  filter(Genre == "Rap")

# Week is already a Date (converted above), so no further conversion is needed.
# Create a new column indicating whether a song charted before or after 1 January 2010
df3 <- mutate(df_rap,
              Year = ifelse(Week < as.Date("2010-01-01"),
                            "before 2010", "after 2010"))

# Create a box plot, one facet per period
# (x is left empty because the facet strips already label the periods)
plot_rap <- ggplot(data = df3, aes(x = "", y = sentiment, fill = factor(Year))) +
  stat_boxplot(geom = "errorbar", width = 0.15) +
  geom_boxplot() +
  scale_fill_manual(values = c("#b959e4", "#90EE90"), name = "Date") +
  facet_wrap(~ Year) +
  labs(title = "Sentiment value of rap songs before and after 2010",
       x = " ", y = "Sentiment score")
plot_rap

# Save the plot as a JPEG file
ggsave("plot_rap.jpg", plot_rap, device = "jpeg")
```

To determine whether rap songs after 2010 have lower sentiment values, you can use a t-test to compare the mean sentiment value of rap songs before 2010 with the mean after 2010. If the data is not normally distributed, you can use a non-parametric test instead, such as the Wilcoxon rank-sum test (also called the Mann-Whitney U test).

```{r}
# Split the data into two groups based on the Year column
before_2010 <- df3[df3$Year == "before 2010", ]
after_2010 <- df3[df3$Year == "after 2010", ]

# Perform a Mann-Whitney U test
result <- wilcox.test(before_2010$sentiment, after_2010$sentiment)
print(result)

# The p-value is 0.1053, which is greater than the significance level of 0.05.
# The difference in sentiment between rap songs before and after 2010 is
# therefore not statistically significant, and the null hypothesis cannot
# be rejected.
```
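
One caveat about the test above: `df3` holds one row per charting week, so a song that stays in the top 10 for many weeks contributes the same sentiment score many times. The following is a minimal sketch of one way to check how much this matters. It assumes a toy data frame (`toy`, with invented values standing in for `df3`) and keeps each song once, assigned to the period of its first charting week, before running the same test. The names `toy`, `songs`, and `result_songs` are hypothetical and not part of the original analysis.

```r
# Toy stand-in for df3 (invented values, NOT the Billboard data)
toy <- data.frame(
  Name      = c("A", "A", "A", "B", "C", "C", "D", "E", "E", "F"),
  Week      = as.Date(c("2005-01-01", "2005-01-08", "2005-01-15",
                        "2006-03-04", "2008-07-05", "2008-07-12",
                        "2012-02-04", "2015-05-02", "2015-05-09",
                        "2018-09-01")),
  sentiment = c(1.2, 1.2, 1.2, -0.5, 0.8, 0.8, -1.1, 0.3, 0.3, -2.0)
)

# Sort by week, then keep only the first charting week of each song
toy <- toy[order(toy$Week), ]
songs <- toy[!duplicated(toy$Name), ]

# Assign each song to a period based on its first charting week
songs$Year <- ifelse(songs$Week < as.Date("2010-01-01"),
                     "before 2010", "after 2010")

# Same Mann-Whitney U test, now with one observation per song
result_songs <- wilcox.test(sentiment ~ Year, data = songs)
print(result_songs)
```

With the real data, the same steps would be applied to `df3` (identifying songs by `Name`, possibly together with `Artists`), and the result compared with the week-level test.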
