# Lab 4: Generating and Reporting About Data **NOTE:** Use the [111 Tables Documentation](https://hackmd.io/@cs111/table) for this lab! The objective of this lab is to gain familiarity with sanitizing, visualizing, and gaining insights from data. In all past assignments, the data that you worked with had been sanitized for you (i.e. you could expect Pyret to process the data without any extra work). The data in this lab, however, is being imported directly from Google Forms submissions, so there exists data that either isn't valid or isn't useful, and you need to find ways to effectively sort through such cases so that you can gather the insights you need. ## Where does this lab fit into the course? We are working up to the course project, in which you will get a real-world data set, prepare it for use, and build some tables and charts/plots to present your analysis of a specific question. Homework 4 focused on combining data across tables. This lab focuses on cleaning up data so that it’s ready for analysis and on manually creating tables that summarize analysis results. ## Working with Forms Data Imagine that you are helping with a research project that is studying aspects of student life at Brown. The lab you're working with has a survey form live: you're going to participate in the survey (as a respondent), then help analyze the collected data. **Task 1:** Your lab TAs will give you a link to a Google Form with the professor's questions. Respond to the questions (as if you were a participant in the study -- don't worry, the survey is anonymous) **Task 2:** Now we're going to load your lab's data into Pyret. Go to the response spreadsheet (also from your TAs), make a copy of it (select "File" > "Make a Copy"), and import it into your program. Make sure to convert the format of the cells to Plain Text (select "Format" > "Number" > "Plain text") and update the sharing permissions to "Anyone with the link". This copy of the spreadsheet is yours to tweak manually. **There are some values that Pyret will *not* allow you to read in**, such as empty cells and other lovely surprises for you to discover. Pyret will let you know when it finds a value it doesn't like, and you will have to correct these by hand directly in the spreadsheet. After you do a few of these corrections, you might be able to spot a pattern for the sorts of data Pyret doesn't like. To save time, feel free to correct multiple issues at once before trying to load the data into Pyret. To import the spreadsheet, include this at the top of your program: (replace the ssid with that of your own copy.) ``` include tables include image include shared-gdrive("dcic-2021", "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep") include gdrive-sheets include data-source # spreadsheet id from Google Sheets ssid = "<Copy of Your SSID>" # aka that weird sequence of letters and numbers in the url data-sheet = load-spreadsheet(ssid) student-data = load-table: timestamp, average-sleep, weekend-sleep, concentration, classes, schoolwork-hours, extracurricular-activities, extracurricular-hours, on-campus, avg-commute source: data-sheet.sheet-by-name("Form Responses 1", true) end ``` :::warning **Note:** The highlighted portion is what your ssid looks like: ![](https://i.imgur.com/JFUdXKb.png) ::: ### Preparing the Data Now that we have the data collected, we need to make sure that the data is clean and processed before analysis. **Task 3:** Talk about why it will be difficult to draw insights from the columns about concentration name and hours of extracurricular obligations. Be able to explain your reasoning to a TA. Create a new table (within Google Sheets) that doesn't include these columns, and use this new table for the rest of the lab. <br> *Hint:* For sheets with multiple tables, you can change which one is being referenced by changing the string in the line `source: data-sheet.sheet-by-name("Form Responses 1", true)` from `"Form Responses 1"` to whatever your new table is called. **Task 4:** Once you import your table of student data, you will notice that the values in certain columns are in the format `some(x)` or `none`. In order to turn these values into the data types we all know and love (`String`, `Number`, *etc.*) use the sanitize functions upon import. Section 3.6.2 of the [Pyret Tables documentation](https://www.pyret.org/docs/latest/tables.html) gives a quick overview of each sanitizer. :::warning Before writing your sanitizers, remember: **There are some values that Pyret will *not* allow you to read in**. Pyret will let you know when it finds a value it doesn't like. If you can't fix it with a sanitizer, you will have to correct these by hand directly in the spreadsheet by either editing the cells or deleting the entry itself. ::: **Task 5:** Look at the other columns. Discuss with your partner which columns might need to be cleaned up to make sense, and come up with a strategy to do this. In particular, come up with functions to clean up the columns about the number of classes, the hours spent doing schoolwork, the number of extracurriculars, and the commute time. Think about some times when it might be better to clean the data using sanitizers and other times when it might be more prudent to do so manually. **Task 6:** Write these functions, and create a new table with cleaned columns. ___ ### *CHECKPOINT:* Call over a TA once you reach this point. ___ ### Analyze the data Now that the data have been tidied up, let's do some analysis on the data. We are interested in relationships between the different pieces of information that were collected on the form. **Task 7:** First, let's understand the lives of students who get 6-8 hours of sleep on an average weeknight. We want to know: - What is the average number of classes these people are taking? - What percentage of them live on campus? Write the code to compute these two answers. **Task 8:** Now, we want to look at the total time students are reporting spending on various activities each day. Create a new column that sums the number of hours spent sleeping and doing homework. Do you notice any impossible results (you might or might not)? **Task 9:** Calculate the ratio of hours that each student spends doing homework to hours they spend sleeping on weeknights. Create a chart/plot to convey your results so that data is displayed for each student. ___ ### *CHECKPOINT:* Call over a TA once you reach this point. ___ ## Sharing the data... As the professor starts finishing their paper, they realize they are interested in a different question: what are the average commute times and average hours of weekday sleep for each of on-campus and off-campus students. The professor wants this data in table form, not just as separate numbers. Specifically, something like: | Location | Commute time | Sleep on weekdays | | ---------------- | ------------ | ----------------- | | On campus | Average | Average | | Off campus | Average | Average | **Task 10:** The table operations we've seen so far don't seem useful here. Discuss why not (your answer should have something to do with rows and columns). **Task 11:** Seeing as the built-in table operations aren't suited to this, we will build the needed table **manually**. Remember we do this as follows: ``` my-table = table: location, commute-time, weekday-sleep row: "On campus", ..., ... row: "Off campus", ..., ... end ``` For this task, you will replace each `...` with an expression that computes the value that should go in that cell. ___ ### *CHECKPOINT:* Call over a TA once you reach this point. ___ ## Reflecting on Data Analysis Discuss the answers to the following questions with your partner: 1. What data preparation did you have to do before you could start your analysis? 2. Why do messy data exist in tables? 3. When can you safely assume your data matches a specific format? 4. How can we expand what we did today to work on even larger, even messier data sets? 5. In what scenarios do people have to sort through messy data? - Do you have any suggestions for how to clean data? - What should they watch out for? 6. How can we prevent "messy data"? What could we have changed in the form to make the data / results nicer? 7. In research, interviewers’/questionnaires’ choice of words can affect the results they collect. Read this [short article](https://www.nngroup.com/articles/leading-questions/) about what leading questions are and how to avoid them. With your partner, **discuss** how the following leading questions could affect a researcher’s results and why the revised question is less biased. - How much did you enjoy your Blue Room coffee? *vs.* How would you rate your Blue Room coffee? - How healthy are you? *vs.* How would you rate your health? - How much do you prefer Andrews pizza over Ratty pizza? *vs.* Do you prefer Andrews pizza or Ratty pizza? 8. In what ways can standardizing data introduce bias in data? ___ ### *CHECKPOINT:* Call over a TA to check you off for lab. ------ > Brown University CSCI 0111 (Spring 2024) > Feedback form: tell us about your lab experience today [here](https://forms.gle/WPXM7ja6KwHdK8by6)! <iframe src="https://docs.google.com/forms/d/e/1FAIpQLSc0gyQHM9YTw8gqqghjr2tU_TbXDCztvfFLmMY4ul5vyJD3jg/viewform?embedded=true" width="640" height="372" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe>