Try   HackMD

CS 100: Studio 5

Data Cleaning in R

October 12, 2022

Instructions

This week in studio, you will be cleaning messy data using a few key R libraries that facilitate data cleaning. You will then do a mini-analysis of the clean data.

Upon completion of all tasks, a TA will give you credit for today's studio. If you do not manage to complete all the assigned work during the studio period, do not worry. You can continue to work on this assignment until Sunday, October 16 at 7 PM. Come by TA hours any time before then to show us your completed work and get credit for today's studio.

Objectives

By the end of this studio, you will be able to:

  • ensure any data you import into R are parsed correctly
  • identify and handle errors and inconsistencies in data
  • save a cleaned data set with a date tag, and reload it
  • merge datasets in R

Overview

In this studio, you will be investigating the relationship between COVID-19 vaccination rates and outbreaks around the world. For example, did countries with higher vaccination rates experience less severe outbreaks? (One hopes so!) You will also investigate how these factors relate to other health and wealth indicators.

Programming Tip: We will clean the data a bit, before proceeding with any analysis. We recommend that you run through the cleaning steps in the console, verifying that the output of each of your commands-one by one-is correct as you proceed, and then copying the verified lines of code into your R markdown file. That way, once your cleaning is complete, you will have also created a script, which you can run again if ever your data become contaminated. And more importantly, if ever you decide to make one or two small changes to your cleaning process, it should be a simple matter to run through all the other steps again automatically.

Data

In this studio, you will be working with COVID-19 and international development data.

For each country in the COVID-19 data set, the measurements include:

  • Country: name of the country
  • Deaths: total number of deaths due to COVID-19
  • Deaths pmp: deaths per million people due to COVID-19
  • Cases: number of confirmed cases of COVID-19
  • Infection rate pmp: infection rate per million people
  • Population: country's total population
  • Country code: World Health Organization (WHO) identification code
  • Vaccination rate: rate per 1000 people at which people were vaccinated against COVID-19 between January 2020 and September 2022

There are several organizations that study international development, so data are available from myriad sources: the United Nations, the CIA, the World Bank, etc. These data generally include various economic, health, education, and demographic metrics for countries worldwide. Of note, the data set we are providing includes each country's Human Development Index (HDI), a composite value measuring life expectancy, education, and per capita income. Countries are ranked by this value and then grouped into four tiers of development (very high, high, medium, and low). The HDI of the United States, for example, is very high.

Cleaning Numerical Data

Create a new R markdown file, load up some libraries, and load the COVID-19 data set:

```{r setup, include = FALSE}
# By default, stringsAsFactors = FALSE
covid <- read.csv("https://cs.brown.edu/courses/cs100/studios/data/5/covid_mod.csv")
```

After loading a data set into R, it is a good idea to check that R is treating each variable appropriately. Spend a minute exploring covid in the usual way (str, summary, etc.).

What do you notice about the variable Deaths? This variable stores the number of COVID-19 deaths reported in each country. You might expect this variable to be numeric. Is it?

You can check whether a variable is integer-valued using the is.integer function: is.integer(covid$Deaths).

This function returns FALSE, which means that Deaths is being stored as something other than a integer. Perhaps there are some non-integer value entries, like 100.5? Try is.numeric(covid$Deaths).

Hmmstill FALSE. To learn what type Deaths is, enter class(covid$Deaths).

As you can see Deaths is a character. Enter covid$Deaths itself to explore its different values.

Apparently, one of the values was entered as the string "twenty-one" rather than the number 21. Consequently, when R imported the data, the entire column was read in as strings, and the data type deemed "character". To correct the error, you will replace "twenty-one" with "21".

To do so, you first have to identify which entry to change. You can do this using the which function. Enter: which(covid$Deaths == "twenty-one"). The result should be [1] 11. You can thus change the corresponding value using: covid$Deaths[11] <- 21.

N.B. It is not really very important (except for debugging purposes) which particular entry needs to be changed. So you can more update the value in one line as follows: covid$Deaths[which(covid$Deaths == "twenty-one")] <- 21.

Enter covid$Deaths again, and notice that it's 11th value has been updated. As all the values are now integers, you can convert its type to integer using as.integer(covid$Deaths). After doing so, enter class(covid$Deaths) again to confirm that the conversion worked properly.

Hint: If you want to replace the Deaths variable in the covid data frame, which comprises strings, with a new one, which comprise integers, you have to use the assignment operator (i.e., <-) to overwrite the old with the new.

Next, you should check that all the values in your integer vector make sense. Take a look at the range of values in the vector: range(covid$Deaths).

You probably weren't expecting any negative values! You can take a closer look by creating at a histogram of the data: hist(covid$Deaths, xlab = "Confirmed Deaths", main = "Histogram of Confirmed Deaths").

It looks like most of the values are in a plausible range. To confirm this, let's filter the data for observations where the number of cases is less than 0: covid %>% filter(Deaths < 0).

There’s only one problematic observation.

At this stage, you would ideally go back to the source of the data, and find out what the actual value should be. Imagine you did this, and discovered that the missing (i.e., negative) value should have been 29. You can correct this error, as long as we have a way to access the problematic observation. And guess what; you do! Can you recall how?

As above, you can use the which function: covid$Deaths[which(covid$Deaths < 0)] <- 29.

Recreate and then re-examine the histogram: hist(covid$Deaths, xlab = "Confirmed Deaths", main = "Histogram of Confirmed Deaths").

Wow! That was a long haul, and at this point you have still cleaned only one variable!

Take a quick look at the data again using str. Do you see any other variables whose types are incorrect? If so (and you should!), try fixing them.

How did that work out? Probably not so well. If you tried applying as.numeric to variables like Deaths.pmp and Infection.rates.pmp that contain numeric values but were imported as strings, you probably received a warning: Warning: NAs introduced by coercion. That's not very promising! Indeed, if you view these variables' values before and after coercion, you will find many many NAs present in the new data that were not present in the old. This cleansing was much too harsh!

What went wrong is that numbers with commas in them are imported as strings and cannot easily be converted to numbers without first stripping out the commas. You can use the str_remove function in the stringr library to remove the commas: for example, str_remove(covid$Deaths.pmp, ","). After confirming that this function works as expected, replace the old covid$Deaths.pmp vector with the new comma-free one (Hint: use the assignment operator!), and then try again to convert this variable to a numeric. Hopefully, the warning has disappeared!

Follow this same procedure for covid$Infection.rates.pmp as well. Everything should be of the correct type now.

Before proceeding with your analysis, you should save a cleaned-up version of the data. You can do so in several ways: e.g., as a .csv file, a .tsv file, etc. But we can also save our data as an .rds file, which is R's unique data format. Regardless of the format you choose, all files will store the same values.

rds or csv: The benefit of using .rds is that it can preserve the variables’ data types as well so that nothing will have to be converted from one type to another when you re-import the data into R at some later date. For example, "POSIXlt" and "POSIXct" are data types specific to R (used for representing dates and times). Only an .rds file can save these types. A .csv file cannot store values as "POSIXlt" and "POSIXct", so saving to a .csv file will store all data of these types as strings instead. The disadvantage of using .rds is that you might not be able to open an .rds file in other applications, like Google Sheets.

We want both filenames to include today’s date, so that we know when we created the file.
Let’s create the csv file first:

filename_csv <- paste("cleanCovid", Sys.Date(), ".csv", sep = "")
write.csv(covid, file = filename_csv)

And now, the rds file:

filename_rds <- paste("cleanCovid", Sys.Date(), ".rds", sep = "")
saveRDS(covid, file = filename_rds)

Programming Tip: Having gone through all the cleaning steps line by line in the console, you should now make sure that they are all inserted, in order, into your R markdown file. You should then run your R markdown code to verify that it is working correctly. To do so, load and View the csv and/or rds files you saved to confirm that they are storing clean versions of the data.

Cleaning Categorical Data

Load (a slightly corrupted version of) the 2021hdi dataset into R. As above, be sure set stringsAsFactors to FALSE. Load the data into a variable called hdi.

```{r setup, include = FALSE}
# By default, stringsAsFactors = FALSE
hdi <- read.csv("https://cs.brown.edu/courses/cs100/studios/data/5/2021hdi_mod.csv")
```

This (slightly corrupted) dataset has a categorical variable, namely "Human.Development.Group", that needs cleaning. This variable can take on four values: Low, Medium, High, and Very High.

To start, let’s see how R is treating the variable: class(hdi$Human.Development.Group).

R is treating "Human.Development.Group" as an integer. Let's look at the values in the vector to see if we can figure out why: hdi$Human.Development.Group. The values of the vector range from 0 to 10, which explains why R is considering them to be integers.

Use the table function to tabulate the number of occurrences of each value: table(hdi$Human.Development.Group).

Within the 0 to 10 range, there are 5 different values, and most of them are 0, 1, 2, and 3. Given the name of the column you can probably guess that these are the numeric codes for Low, Medium, High, and Very High, but you would have to find out from the data source which is which. If possible, you should also go to the source to find out whether the value 10 is a typo or has some sort of meaning (such as standing in for a missing value).

It turns out that 0 stands for Low, 1 stands for Medium, 2 stands for High, and 3 stands for Very High. From this information, we may be able to correct the incorrect value ourselves.

First, let’s see which country has as its “Human.Development.Group” the value 10: hdi[which(hdi$Human.Development.Group == 10), ]$Country.

The country whose value is incorrect is Germany. Surely, Germany’s “Human.Development.Group” is “Very high”. Replace the erroneous value accordingly: hdi[which(hdi$Human.Development.Group == 10), ]$Human.Development.Group <- 3.

Take a look at the table of values again, now that you’ve made the correction: table(hdi$Human.Development.Group).

Things should be looking better!

All that is left to do, then, is to format the variable so that R interprets these values as categories. You can use the factor() function for this purpose. Type ?factor into the console to learn about this function. It takes three arguments: x (the input vector), levels (the current values), and labels (the categorical names for the values).

Create a new factor called hdg and assign it values: hdg <- factor(hdi$Human.Development.Group, levels = c(0, 1, 2, 3), labels = c("Low", "Medium", "High", "Very High")).

Confirm that hdg is in fact correctly stored as a factor: is.factor(hdg).

Finally, just to make sure we've done the conversion correctly, let's compare the table of values in the original vector and in our new vector:

Table of original values: table(hdi$Human.Development.Group).

Table of new values: table(hdg).

These are comparable, so we can now replace the original vector with the new one: hdi$Human.Development.Group <- hdg.

You should, as usual, also make sure that the data types of all the other variables are as expected: i.e., characters for countries, region, and subregion, and numerics for the rest. You can do so quickly using str(hdi), which displays the data type and the first few values of each variable. Assuming everything is looking good, save a cleaned up version of the hdi data.

Merging the Data

In the remainder of this studio, you are going to investigate COVID-19 outbreaks and vaccination rates as they relate to other country indicators, like human development group. In order to do so, you need to merge the data from various sources into a single data frame.

Hint: Refer back to Lecture 5a: Intro to R for a refresher on what it means to join two databases, and how to do so explicitly using the merge function in R.

Let’s start by joining the two data frames already in memory, covid and hdi. What variable (or variables) do you think we should join by? If your answer is “Country”, you are on the same page as us! But it is not going to be as simple as:

combined <- merge(covid, hdi, by = "Country")

Try this, and then try to identify the problem. Hint: How many rows are in covid? And how many are in combined? Use nrow to find out.

You should have noticed that some countries for which there exist COVID-19 data do not appear in combined and one of those countries is none other than the United States of America!

Here’s how you can list the missing countries:

covid[which(!(covid$Country %in% hdi$Country)), ]$Country

This command outputs the countries that are in the hdi data frame, but are not in covid.

The United States of America is missing from combined because it appears as "United States of America" in covid but as simply "United States" in hdi. This error must be fixed manually. There are a number of ways to correct this, including:

covid$Country[which(covid$Country == "United States of America")] <- "United States"

There are several more errors which must be fixed manually. For example, “Estonia” is spelled incorrectly in the covid data frame. This error can be corrected using the same pattern we used to standardize the name of the United States:

covid$Country[which(covid$Country == "Estona")] <- "Estonia"

In addition, Canada and Uruguay have spurious spaces after their names. As above:

covid$Country[which(covid$Country == "Canada ")] <- "Canada"
covid$Country[which(covid$Country == "Uruguay ")] <- "Uruguay"

Finally, taking the country names in the hdi, rather than the covid, data frame as the gold standard, here are the remaining corrections:

covid$Country[which(covid$Country == "Iran")] <- "Iran (Islamic Republic of)"
covid$Country[which(covid$Country == "South Korea")] <- "Korea (Republic of)"
covid$Country[which(covid$Country == "The United Kingdom")] <- "United Kingdom"
covid$Country[which(covid$Country == "Venezuela")] <- "Venezuela (Bolivarian Republic of)"
covid$Country[which(covid$Country == "Vietnam")] <- "Viet Nam"

Run this command again:

covid[which(!(covid$Country %in% hdi$Country)), ]$Country

Having made all of the corrections we can, let’s merge the data frames again:

clean_combined <- merge(covid, hdi, by = "Country")

How many rows are in clean_combined?

Data Analysis

Now that you’ve joined these data, you can do a more thorough analysis!

  1. Plot human development group versus vaccination rate. As human development group is a factor, box plots are recommended. What are the median vaccination rates for the different groups? Which countries have the greatest vaccination rates?

Hint: Use the factor function to assign an ordering to the “Human.Development.Group” factors:

clean_combined$Human.Development.Group <- factor(clean_combined$Human.Development.Group, levels = c("Low", "Medium", "High", "Very High"))
  1. Grouped by region, what are the average infection and death rates?

End of Studio

When you are done please call over a TA to review your work, and check you off for this studio. If you do not finish within the two hour studio period, remember to come to TA office hours to get checked off.