# Lab 4: Generating and Reporting About Data

**NOTE:** Use the [111 Tables Documentation](https://hackmd.io/@cs111/table) for this lab!

The objective of this lab is to gain familiarity with sanitizing, visualizing, and gaining insights from data. We'll look at why human insight is important in data analysis and explore themes around responsible artificial intelligence usage.

In all past assignments, the data that you worked with had been sanitized for you (i.e., you could expect Pyret to process the data without any extra work). The data in this lab, however, is imported directly from Google Forms submissions, so it contains values that are invalid or simply not useful. You will need to find ways to sort through those cases effectively to sanitize the data, then run exploratory data analysis to uncover patterns and insights.

## Where does this lab fit into the course?

We are working up to the course project, in which you will get a real-world data set, prepare it for use, and build some tables and charts/plots to present your analysis of a specific question. Homework 4 focused on combining data across tables. This lab focuses on cleaning up data so that it's ready for analysis and on manually creating tables that summarize analysis results.

## Working with Forms Data

Imagine that you are helping with a research project that is studying aspects of student life at Brown. The lab you're working with has a survey form. You're going to participate in the survey (as a respondent), then help analyze the collected data.

**Task 1:** Your lab TAs will give you a link to a Google Form with the professor's questions. Respond to the questions as if you were a participant in the study (don't worry, the survey is anonymous).

**Task 2:** Now we're going to load your lab's data into Pyret. Go to the response spreadsheet (also from your TAs), make a copy of it (select *File* → *Make a Copy*), and import it into your program. Make sure to update the sharing permissions to *Anyone with the link*. This copy of the spreadsheet is yours to tweak manually.

**There are some values that Pyret will not allow you to read in,** such as empty cells and other lovely surprises for you to discover. Pyret will let you know when it finds a value it doesn't like, and you will have to correct these by hand directly in the spreadsheet. After you do a few of these corrections, you might be able to spot a pattern in the sorts of data Pyret doesn't like. To save time, feel free to correct multiple issues at once before trying to load the data into Pyret.

To import the spreadsheet, make a new Pyret file with the `dcic2024` context (important! Look back at our [Tables](https://hackmd.io/@cs111/table) documentation to learn how to do this) and **paste** this code into the file (replace the `ssid` with that of your own copy):

```pyret
include gdrive-sheets
include data-source

# spreadsheet id from Google Sheets: the long sequence of letters
# and numbers between /d/ and /edit in the URL
ssid = "<Copy of Your SSID>"
data-sheet = load-spreadsheet(ssid)

student-data = load-table: timestamp, campus, commute_range, sleep_hours, food_spot
  source: data-sheet.sheet-by-name("Form Responses 1", true)
end

# display the table
student-data
```

:::info
**Note:** The highlighted portion is what your ssid looks like:
![JFUdXKb](https://hackmd.io/_uploads/S1MImOo2lx.png)
:::

:::warning
**STOP!** Run your code! If you get an error like `No Spreadsheet with id ... found`, make sure you have updated the sharing permissions to **"Anyone with the link can view"**. Try to display the table to make sure it was imported correctly!
:::

## Preparing the Data

Now that we have the data collected, we need to make sure that it is clean and processed before analysis.

**Task 3:** Talk with your partner about why it will be difficult to draw insights from the columns about dorm locations and late-night food spots. Be able to explain your reasoning to a TA.

*Hint:* For spreadsheets with multiple sheets, you can change which one is being referenced by changing the string in the line `source: data-sheet.sheet-by-name("Form Responses 1", true)` from `"Form Responses 1"` to whatever your new sheet is called. Make sure to also change the columns being read in based on the ones that remain!

**Task 4:** Once you import your table of student data, you will notice that the values in certain columns are in the format `some(x)` or `none`. In order to turn these values into the data types we all know and love (`String`, `Number`, etc.), use the **sanitize functions** upon import. Section 3.6.2 of the [Pyret Tables Documentation](https://pyret.org/docs/latest/tables.html) gives a quick overview of each sanitizer.

:::warning
Before writing your sanitizers, remember: **There are some values that Pyret will not allow you to read in.** Pyret will let you know when it finds a value it doesn't like. If you can't fix a value with a sanitizer, you will have to correct it by hand directly in the spreadsheet, either by editing the cell or by deleting the entry itself.
:::
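As a concrete starting point, here is a minimal sketch of how the starter code's `load-table` might look with sanitizers added. It assumes the column names from the starter code, and that sleep hours should become numbers while everything else should become text; check your lab's actual responses to decide which sanitizer fits each column. Both `string-sanitizer` and `num-sanitizer` come from the `data-source` library that the starter code already includes.

```pyret
# load the table as before, but convert each column's some(x)/none
# values into plain Strings and Numbers on the way in
student-data = load-table: timestamp, campus, commute_range, sleep_hours, food_spot
  source: data-sheet.sheet-by-name("Form Responses 1", true)
  sanitize timestamp using string-sanitizer
  sanitize campus using string-sanitizer
  sanitize commute_range using string-sanitizer
  sanitize sleep_hours using num-sanitizer
  sanitize food_spot using string-sanitizer
end
```

If a column mixes text and numbers (free-response answers often do), `string-sanitizer` is the safer choice, and you can convert values later with a cleaning function.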
**Task 5:** Look at the other columns. Discuss with your partner which columns might need to be cleaned up to make sense, and come up with a strategy for doing this. In particular, come up with functions to clean up the columns about dorm location and late-night food spots. Think about when it might be better to clean the data using sanitizers versus when it might be more effective to do so manually.

**Task 6:** Write these functions, and create a new table with the cleaned columns.

---

### **CHECKPOINT: Call over a TA once you reach this point.**

---

## Analyzing the Data

Now that the data has been sanitized, let's run exploratory data analysis to find relationships among the different pieces of information that were collected on the form.

### What is Exploratory Data Analysis?

:::info
Exploratory data analysis (EDA) is the process of analyzing data to **find relationships, understand patterns, detect outliers, and answer questions.**
:::

For example, streaming services like Netflix use EDA to analyze user viewing data, find relationships between viewers who watch certain genres and the pages those viewers spend the most time on, and then use the results for more targeted marketing. Other industries that rely on EDA include finance (fraud and risk management), sports analytics (player performance), healthcare (predicting health outcomes), and retail (product pricing), among many other applications.

**Task 7:** EDA usually starts with a testable question. First, choose a research question that will guide your data analysis for the rest of the lab:

- **Do North vs. South Campus students report longer commute times?**
- **What's the relationship between sleep hours and late-night food choices?**

The next two tasks will walk you through computing the answer to the question you choose.

**Task 8:** Now, depending on the question you chose, add a categorizing column to your cleaned table (see the sketch after this list):

- **If your question is about commute times**, create a new column that categorizes campus locations as "North" or "South".
- **If your question is about food choices**, create a new column that categorizes food spots as "On campus" or "Off campus".
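One way to build such a column is to write a helper function that maps each raw value to a category, then apply it to every row with `build-column` (covered in the 111 Tables Documentation). Here is a minimal sketch for the commute-time question; the table name `cleaned-data`, the column name `campus`, and the dorm names in `south-dorms` are all hypothetical, so substitute the names and values that actually appear in your cleaned table.

```pyret
# Hypothetical list of South Campus locations; adjust it to match
# the responses in your data
south-dorms = [list: "Gregorian Quad", "Young Orchard"]

fun campus-region(dorm :: String) -> String:
  doc: "Categorize a dorm location as North or South Campus"
  if south-dorms.member(dorm):
    "South"
  else:
    "North"
  end
where:
  campus-region("Gregorian Quad") is "South"
  campus-region("Pembroke Campus") is "North"
end

# add a "region" column computed from each row's campus value
with-region = build-column(cleaned-data, "region",
  lam(r): campus-region(r["campus"]) end)
```

The same pattern works for the food-choices question: swap in a helper that returns `"On campus"` or `"Off campus"` instead.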
**Task 9:** Use these new columns and the other columns in the dataset to calculate the answer to your question. Create a [chart](https://pyret.org/docs/latest/chart.html) or [plot](https://pyret.org/docs/latest/plot.html#%28idx._%28gentag._617%29%29) to display your results.

In data science work, you are often communicating insights to people from non-technical backgrounds, so the way you communicate your insights needs to be clear and accessible. A big aspect of this is how you visualize your data.

Before diving in, take a moment to discuss with your partner which approach would be most effective for answering your question. Consider whether plotting your data, sorting it in a particular way, filtering for specific conditions, or combining multiple columns would be most helpful. Remember to refer back to the [111 Tables Documentation](https://hackmd.io/@cs111/table) to understand the visualization methods that will help you communicate your findings.
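As one possible shape for this analysis, the sketch below computes the average commute time for each region and displays the two averages as a bar chart. It assumes the `with-region` table from the previous sketch plus a numeric `commute-mins` column (a hypothetical cleaned-up version of `commute_range`); use the names from your own table. The same filter-and-average pattern will come in handy again when you build the summary table near the end of the lab.

```pyret
# add these to the top of your file, next to the existing includes
include chart
import statistics as S

fun avg-commute(t :: Table, region :: String) -> Number:
  doc: "Compute the mean commute time for rows in the given region"
  in-region = filter-with(t, lam(r): r["region"] == region end)
  S.mean(in-region.get-column("commute-mins"))
end

north-avg = avg-commute(with-region, "North")
south-avg = avg-commute(with-region, "South")

# display the two group averages side by side
render-chart(from-list.bar-chart(
    [list: "North", "South"],
    [list: north-avg, south-avg])).display()
```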
---

### **CHECKPOINT: Call over a TA once you reach this point.**

---

## Comparing with AI (Artificial Intelligence)

Now that you've walked through the steps of EDA yourself (data cleaning, analysis, and visualization), you will try the same task with ChatGPT. You will prompt it to clean, analyze, and visualize the data, then compare its approach with your own. This comparison will help you understand the limits of AI and start thinking about more responsible AI use.

**Task 10:** Download the Google Sheet (select *File* → *Download* → *Comma-Separated Values (.csv)*). Upload the *.csv* file to ChatGPT. Use **one** prompt to guide the LLM to clean, analyze, and visualize the data in order to address your chosen research question.

*Note: If uploading a CSV isn't working, you can also copy and paste the data from the Google Sheet directly into ChatGPT.*

**Task 11:** Reflect on the differences between your exploratory data analysis and the AI's approach. Be prepared to discuss at least 3 observations with a TA:

- What did you notice in the data that ChatGPT didn't catch or handle properly? What did it handle quickly?
- What local knowledge or context did you use that the AI didn't have or had to research?
- How did your understanding of Brown University help you make better decisions about the data?

**Task 12:** Connect your experiences as a Brown student to the bigger picture. The same way you needed Brown-specific information to understand patterns in student data, humans everywhere carry context and lived experience that helps them interpret information in ways AI can't. Discuss this question with your partner and write 3–5 sentences responding to the prompt:

*How can we effectively use AI tools while recognizing their limitations? What kind of context and guidance should we provide to AI to get meaningful responses?*

**Please place your prompt from Task 10 and response to Task 12: [Here!](https://forms.gle/JKuXGCtN1agck97o7)**

---

### **CHECKPOINT: Call over a TA once you reach this point.**

---

## Sharing the Data

Now that you've run exploratory data analysis and reflected on the importance of human insight in data analysis, it's time to present your findings in a clear, digestible format. This is a crucial skill in data work: being able to communicate insights effectively to different audiences and in multiple ways.

- **If your question is about commute times**, create a summary table that compares average commute times and average hours of weekday sleep for North Campus vs. South Campus students.
- **If your question is about food choices**, create a summary table that compares average commute times and average hours of weekday sleep for students whose late-night food spot is on campus vs. off campus.

The summary table should present the data in table form, not just as separate numbers. Specifically, something like:

| Location   | Commute time | Sleep on weekdays |
|------------|--------------|-------------------|
| On campus  | Average      | Average           |
| Off campus | Average      | Average           |

**Task 13:** The table operations we've seen so far don't seem useful here. Discuss why not (your answer should have something to do with rows and columns).

**Task 14:** Seeing as the built-in table operations aren't suited to this, we will build the needed table **manually**. Remember, we do this as follows:

```pyret
my-table = table: location, commute-time, weekday-sleep
  row: "On campus", ..., ...
  row: "Off campus", ..., ...
end
```

For this task, you will replace each `...` with an expression that computes the value that should go in that cell.

---

### **CHECKPOINT: Call over a TA once you reach this point.**

---

## Reflecting on Data Analysis

Discuss the answers to the following questions with your partner:

1. What data preparation did you have to do before you could start your analysis?
2. Why does messy data exist in tables?
   - In what scenarios do people have to sort through messy data?
   - Do you have any suggestions for how to clean data? What should people watch out for?
   - How can we prevent "messy data"? What could we have changed in the form to make the data/results nicer?
3. When can you safely assume your data matches a specific format?
4. How can we expand what we did today to work on even larger, even messier data sets?
5. Given what you observed about AI's data-cleaning approach versus your own, when might relying solely on AI-cleaned data lead to problematic conclusions?
6. How might the context you brought as a Brown student be similar to other types of local knowledge that communities have about their own information and data? What happens when that knowledge isn't included in an AI's (or even a human's) analysis?

---

### **CHECKPOINT: Call over a TA to check you off for lab.**

![giphy](https://hackmd.io/_uploads/BkhCF_j2gg.gif)

---

> Brown University CSCI 0111 (Spring 2025)
> Feedback form: tell us about your lab experience today [here!](https://docs.google.com/forms/d/e/1FAIpQLSei8dM4LK-3DLzpThZuu1geLzCm9LPGDdhdnegdc_hABkVafg/viewform)