Mini Project

Due: Friday, December 10, at 9:00PM EST.

This project has two parts: the action items on this handout, and some conceptual questions on Gradescope. The parts are designed so you can work on them independently, i.e. complete one before completing the other.

Note on TA hours: Although you are allowed to use up to 3 late days on this assignment, TA hours will be reduced after December 10th.

Setup and Handin

Setup

  • Download sample.csv here and make sure it is in the same folder as your mini_pandas.py.
  • Start by copying and pasting the following code into a file named mini_pandas.py. This contains all your necessary import statements, as well as a helper function that will store the contents of sample.csv in a Dataframe.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()


# Add columns to our dataframe

data = pd.DataFrame(columns=["location", "when", "type", "value", "age", "patient_id"])

# Read in information from our CSV

data = pd.concat([data, pd.read_csv("sample.csv", parse_dates=["when"])])

  • Please make sure you're able to import Pandas, which we covered in Lab 11. Link coming soon.
  • Do not put your name anywhere in your homework files.

Handin

You will be submitting one Python file to Gradescope under Mini Project:

  • mini_pandas.py, which will contain all of the code for your functions as well as the code snippet above

Remember your resources!

Coding Assignment

In this assignment, we'll be analyzing information about patient readings. However, instead of figuring out how to store this information internally, we'll be taking that data from an external file (sample.csv, which you downloaded above), loading it into our program into a Dataframe (which is done in the code snippet we provided above), and performing different analyses on it using pandas.

Part 1: Data Cleaning

In the past, we learned how to use sanitizers to clean data when importing from a Google sheet to a Pyret table. These sanitizers had their limitations, and we often had to clean and normalize our data in more specific ways using methods such as transform-column. Now, we'll be doing something similar in Pandas.

Task: Write a function, clean_data, that does the following to data, the Dataframe containing all our data from sample.csv:

  • Converts all the values in the type and location columns to all-lowercase
  • Replaces all instances of the value "heighth" in the type column with "height"
  • Keeps only the rows whose value is non-negative

Note: This function should not take in any inputs, nor should it return anything.

Hint!

There are multiple ways to transform columns of strings in pandas. As alluded to in class, one way to figure out how to do this is to read the pandas documentation, or to search online for a tutorial. We want you to practice this skill, so your TAs will ask you what you've tried to search for before giving you more hints.
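
As one illustration of the kinds of column operations a documentation search might turn up, here is a minimal sketch on a made-up dataframe. The name toy, its columns, and the function tidy_toy are all hypothetical and have nothing to do with sample.csv. It also shows one possible shape for a function that takes no inputs and returns nothing but still changes a dataframe defined at the top of the file.

```python
import pandas as pd

# Hypothetical toy data, unrelated to sample.csv.
toy = pd.DataFrame({
    "city": ["Boston", "PROVIDENCE", "providense"],
    "score": [3.0, -1.0, 7.5],
})


def tidy_toy():
    """Clean the module-level `toy` dataframe; no inputs, no return value."""
    global toy  # needed because the name `toy` is reassigned below
    # .str.lower() lowercases every string in the column
    toy["city"] = toy["city"].str.lower()
    # Series.replace swaps whole values that match exactly
    toy["city"] = toy["city"].replace("providense", "providence")
    # A boolean mask keeps only the rows where the condition is True
    toy = toy[toy["score"] >= 0]


tidy_toy()
print(toy)
```

This is only one of several ways to do each step; the pandas documentation lists alternatives.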

Part 2A: Analysis

Now that we've cleaned our data, we're going to perform analyses on it using pandas!

Task: Write a function, age_vs_height, which creates a scatter plot with age on the x-axis and height (values of rows where the type is equal to "height") on the y-axis.

Does age seem to always correlate with height? If not, for approximately which range of ages does age correlate with height? Answer this question in the comments of your code.

Hint!
  1. Create a dataframe that contains only the rows from data whose type is height (after being cleaned). Call your new dataframe height_data. As an example, if you wanted to filter your table for only ages under 11, you could do this:

    age_data = data[data["age"] < 11]

  2. After step 1, we should have a dataframe in which every row's type is height! It might look something like this (in the actual dataframe, you will have many more rows):

    location             when                 type    value  age  patient_id
    healthbridge clinic  2019-12-01 13:30:00  height  53.0   10   0
    healthbridge clinic  2019-12-01 13:30:00  height  68.0   40   3
    healthbridge clinic  2019-12-01 13:30:00  height  64.0   15   1

    How can we create a scatter plot with our newly filtered table? Take a look at this for help!
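
The two steps in the hint (filter, then plot) can be sketched on a made-up table. The dataframe pets and its columns are hypothetical, and plt.scatter is just one plotting call that could be used here; it is not necessarily the one the linked help page recommends.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, not taken from sample.csv.
pets = pd.DataFrame({
    "kind": ["dog", "cat", "dog", "dog"],
    "weight": [30.0, 9.0, 55.0, 42.0],
    "age": [2, 5, 7, 4],
})

# Step 1: keep only the rows whose kind is "dog"
dogs = pets[pets["kind"] == "dog"]

# Step 2: draw one point per remaining row, age on x and weight on y
plt.scatter(dogs["age"], dogs["weight"])
plt.xlabel("age")
plt.ylabel("weight")
```

Calling plt.show() afterwards opens the plot window when the file is run.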

Note: This function should not take in any inputs, nor should it return anything.

Note: To make sure you're using the cleaned version of data for your analysis, make sure you've called clean_data!

Task: In the code, call the function you just wrote to see your plot! Running the file should have your scatter plot pop up on screen.

Part 2B: More Analysis

Now that we've made one analysis, it's time for a more complex one!

Task: One of the locations in sample.csv is a clinic called Oakdale Health Center. Write a function, oakdale_temp_over_time, that creates a line plot that shows the average temperature (values of rows where the type is equal to "temperature") reading for the Oakdale Health Center over time.

This analysis will take a couple more steps than age_vs_height.

Hint!

You'll have to filter once for the location and then once again for the type. We suggest doing this in two separate steps, similar to how you filtered in Part 2A!

Once you have your final filtered table you might be wondering, "How can we compute an average across all readings with the same when?" We can do so using groupby, which groups rows by the values of one column and, once we choose an aggregation such as a mean, combines the values from another column within each group in the way we specify. Take a look at the documentation for groupby here!
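
To make the grouping idea concrete, here is a small sketch on made-up readings. The dataframe readings and the variable avg_by_day are hypothetical names. Note that selecting a single column before averaging gives back a pandas Series indexed by the grouping column, rather than a full dataframe.

```python
import pandas as pd

# Hypothetical toy data: two readings share the first timestamp.
readings = pd.DataFrame({
    "when": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
    "value": [10.0, 20.0, 40.0],
})

# Group rows by "when", then average the "value" column within each group
avg_by_day = readings.groupby("when")["value"].mean()
print(avg_by_day)
```

The first group averages 10.0 and 20.0, and the second group contains a single reading, so the result has one row per distinct timestamp.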

Note: Before creating your line plot, include the line plt.figure() before the line that creates the plot. This tells Python to create a new figure for your line plot instead of using the one that was used in age_vs_height.
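
A minimal sketch of what plt.figure() changes, using made-up numbers: without it, the second plotting call would draw onto the same figure as the first.

```python
import matplotlib.pyplot as plt

# The first plot lands on the figure matplotlib creates implicitly
plt.scatter([1, 2, 3], [4, 5, 6])

# Start a second, separate figure before drawing the next plot
plt.figure()
plt.plot([1, 2, 3], [2, 4, 8])

# plt.get_fignums() lists the figures that currently exist
print(plt.get_fignums())
```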

Note: This function should not take in any inputs, nor should it return anything.

Note: As with age_vs_height, to make sure you're using the cleaned version of data for your analysis, make sure you've called clean_data!

Note: If you're curious, more information about groupby can be found here!

Task: In the code, call the function you just wrote to see your plot! Running the file should now have both your scatter plot and your line graph pop up on screen.

Between which dates was the largest increase in average temperature? Write your answer as a comment in your code.

Part 3: A bit on testing

Although we didn't ask you to explicitly test clean_data, age_vs_height or oakdale_temp_over_time, you might find yourself in a position where it's important that you know that your code works before running it on the full data set (for example, if the data set were much, much larger).

Task: Create a smaller dataframe like the ones you used for testing your Pyret tables. You do not have to write pytest tests for this dataframe, but you should run each of clean_data, age_vs_height, and oakdale_temp_over_time on this dataframe and verify to yourself that the result is correct.

Updated 12/06!

We will leave it up to y’all how to best approach testing (change “data” to point to your small data frame, create test functions) as long as your code still runs on the provided data frame!

Note: Not sure how to make a smaller dataframe? Check this out.
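
One possible way to build a small dataframe by hand is from a dictionary of columns. In the sketch below, the name small_data and every row value are made up for illustration; only the column names come from the starter code.

```python
import pandas as pd

# Hypothetical hand-written rows using the starter code's column names.
small_data = pd.DataFrame({
    "location": ["Oakdale Health Center", "HealthBridge Clinic"],
    "when": pd.to_datetime(["2019-12-01 13:30:00", "2019-12-02 09:00:00"]),
    "type": ["Temperature", "heighth"],
    "value": [98.6, -1.0],
    "age": [30, 12],
    "patient_id": [0, 1],
})
print(small_data)
```

Including a few deliberately messy rows (mixed case, a misspelled type, a negative value) makes it easy to check by eye whether each cleaning step did what you expected.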

Conceptual Questions

Complete the Gradescope questions linked here. You are allowed to use PyCharm and reference materials when completing these questions, but you should not discuss your answers with classmates.

Theme Song

Kung Fu Fighting by Carl Douglas

Congratulations on completing your final CS0111 homework assignment! On behalf of the TA staff, it's been great getting to work with you all. Best of luck with the final!


Brown University CSCI 0111 (Fall 2021)
Do you have feedback? Fill out this form.