Lab 4: Generating Data

NOTE: Use the 111 Tables Documentation for this lab!

Learning Objective

The objective of this lab is to gain familiarity with sanitizing, visualizing, and gaining insights from data.

In all past assignments, the data that you worked with had been sanitized for you (i.e. you could expect Pyret to process the data without any extra work). The data in this lab, however, is being ported directly from Google Forms submissions, so there exists data that either isn't valid or isn't useful, and you need to find ways to effectively sort through such cases so that you can gather the insights you need.

If you don't feel more comfortable with sanitizing, visualizing, and interpreting data after working on this lab, come to TA hours! Your TAs would love to help in any way they can.

Backstory

You’ve just started at a new island, Rhode Island, in Poptropica and met with one of the local anthropology professors at Brown University.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

The professor wants to learn more about the behavior of people. Given your history of helping many other islands, they thought they could use your help.

They need your help to collect meaningful data and analyze it so that they can publish their next paper!

Your Mission

Brainstorm. What defines you as a person? What kind of information would be good for a report on human behavior and society? Some types of data are better for different things. For example, line graphs are often best at representing data in the form of numbers, while pie charts may be better for presenting tallies and votes.

For their paper, the professor would like a variety of data types, from numbers to strings. After thinking for a while, the professor decides that they want to study the life of a student at Brown.

Your lab TAs will give you a link to a Google Form with the professor's questions.

Fill out the form to contribute to the study!

Go to the response spreadsheet (also from your TAs), make a copy of it (select "File" > "Make a Copy"), and import it into your program. Make sure to convert the format of the cells to Plain Text (select "Format" > "Number" > "Plain text") and update the sharing permissions to "Anyone with the link". This copy of the spreadsheet is yours to tweak manually. There are some values that Pyret will not allow you to read in, such as empty cells and other lovely surprises for you to discover. Pyret will let you know when it finds a value it doesn't like, and you will have to correct these by hand directly in the spreadsheet.

To import the spreadsheet, include this at the top of your program:
(Replace the ssid with that of your own copy.)

include tables
include image
include shared-gdrive("dcic-2021", "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep")
include gdrive-sheets
import data-source as ds

# spreadsheet id from Google Sheets
ssid = "<Copy of Your SSID>" # aka that weird sequence of letters and numbers in the url
data-sheet = load-spreadsheet(ssid)
student-data =
  load-table: timestamp, average-sleep, weekend-sleep, concentration, classes, schoolwork-hours, extracurricular-activities, extracurricular-hours, on-campus, avg-commute
    source: data-sheet.sheet-by-name("Form Responses 1", true)
 end

Note: The highlighted portion is what your ssid looks like:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Preparing the Data

Now that we have the data collected, we need to make sure that the data is clean and processed before analysis.

  1. Talk about why it will be difficult to draw insights from the columns about concentration name and hours of extracurricular obligations. Be able to explain your reasoning to a TA. Create a new table (within Google Sheets) that doesn't include these columns, and use this new table for the rest of the lab.
    Hint: For sheets with multiple tables, you can change which one is being referenced by changing the string in the line source: data-sheet.sheet-by-name("Form Responses 1", true) from "Form Responses 1" to whatever your new table is called.

  2. Once you import your table of student data, you will notice that the values in certain columns are in the format some(x) or none. In order to turn these values into the data types we all know and love (String, Number, etc.) use the sanitize functions upon import. Section 3.6.2 of the Pyret Tables documentation gives a quick overview of each sanitizer.

Before writing your sanitizers, remember: There are some values that Pyret will not allow you to read in, such as empty cells and other lovely surprises for you to discover. Pyret will let you know when it finds a value it doesn't like, and you will have to correct these by hand directly in the spreadsheet by either editing the cells or deleting the entry itself.

  1. Look at the other columns. Discuss with your partner which columns might need to be cleaned up to make sense, and come up with a plan to do this. In particular, come up with functions to clean up the columns about the number of classes, the hours spent doing schoolwork, the number of extracurriculars, and the commute time. Think about some times when it might be better to clean the data using sanitizers and other times when it might be more prudent to do so manually.

  2. Write these functions, and create a new table with cleaned columns.


CHECKPOINT: Call over a TA once you reach this point.


Analyze the data

Now that the data has been tidied up, we need to put it together and analyze it in a way that will be meaningful for a research paper! What interesting ways can we represent this information? How should we visualize it?

  1. Filter your table so that you only have students with high numbers of extracurriculars (you can decide which value on the scale defines the cutoff).
  2. Filter your table to people who get 6-8 hours of sleep on an average weeknight.
    1. What is the average number of classes these people are taking?
    2. What percentage of them live on campus?
  3. Filter your table so that you only have students with low numbers of extracurriculars who get at least 7 hours of sleep. Once again, you can choose the cutoff.
  4. Create a scatterplot to compare hours of sleep on weeknights against minutes spent traveling to class.
  5. Create a scatterplot to compare hours spent working vs. hours spent sleeping.
  6. Create a scatterplot to compare average hours of sleep on weekends to average hours of sleep on weekdays.
  7. Create a new column that sums the number of hours spent sleeping and doing homework.
    1. Do you notice any impossible results?
    2. With your partner, discuss the pros and cons of filtering out responses that seem impossible.
  8. Create a new column that calculates the ratio of hours doing homework to hours sleeping (on weeknights).
  9. Find the average number of minutes people spend on their commute.

CHECKPOINT: Call over a TA once you reach this point.


Sharing the data

As the professor starts finishing their paper, they decide they want to find more correlation patterns in the data. Help the professor one last time by finding a cool correlation in the research!

  1. We want to transpose the table (flip rows and columns) to look at the average commute times and average hours of sleep on weekdays of students living on campus and students living off campus. Since we don't yet know how to transpose an entire table through code, we'll build the table we need manually.

    1. Filter the main table to compute average commute times and average hours of sleep on weekdays of students living on campus.
    2. Do the same for students not living on campus.
    3. Create a new table (in Pyret) named campus-living-table with the columns location, avg-commute, and avg-sleep-weekdays.
    4. Fill in the table with the appropriate data.
      Note: there should only be two rows one for those who are on campus and one for those who are not. It will look something like the following:
      Location Commute time Sleep on weekdays
      On campus Average Average
      Off campus Average Average
  2. Plot any two columns that you think might be related to one another. Do you notice any cool correlations?


CHECKPOINT: Call over a TA once you reach this point.


Presentation Time!

It's now time for the professor's paper to be published and presented to a bunch of other smart professors! The professor gets a little shy when it comes to answering questions and presenting so they have asked you to do all of it.

Discuss the answers to the following questions with your partner:

  1. What data preparation did you have to do before you could start your analysis?
  2. Why do messy data exist?
  3. When can you safely assume your data matches a specific format?
  4. How can we expand what we did today to work on even larger, even messier data sets?
  5. In what scenarios do people have to sort through messy data?
    1. Do you have any suggestions for how to clean data?
    2. What should they watch out for?
  6. How can we prevent "messy data"? What could we have changed in the form to make the data / results nicer?

CHECKPOINT: Call over a TA to check you off for lab.


Great job!

Thanks to your excellent work, the presentation was a success and the professor is very pleased. The professor has handed you the island medallion as a thank you gift!

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →