Due: Friday, December 10, at 9:00PM EST.
This project has two parts: the action items on this handout, and some conceptual questions on Gradescope. The parts are designed so you can work on them independently, i.e. complete one before completing the other.
Note on TA hours: Although you are allowed to use up to 3 late days on this assignment, TA hours will be reduced after December 10th.
Download sample.csv here and make sure it is in the same folder as your mini_pandas.py.
The starter code for mini_pandas.py contains all your necessary import statements, as well as a helper function that will store the contents of sample.csv in a Dataframe.
You will be submitting one Python file to Gradescope under Mini Project: mini_pandas.py, which will contain all of the code for your functions as well as the code snippet above.
In this assignment, we'll be analyzing information about patient readings. However, instead of figuring out how to store this information internally, we'll be taking that data from an external file (sample.csv, which you downloaded above), loading it into our program as a Dataframe (which is done in the code snippet we provided above), and performing different analyses on it using pandas.
In the past, we learned how to use sanitizers to clean data when importing from a Google sheet to a Pyret table. These sanitizers had their limitations, and we often had to clean and normalize our data in more specific ways using methods such as transform-column. Now, we'll be doing something similar in pandas.
Task: Write a function, clean_data, that does the following to data, the Dataframe containing all our data from sample.csv:
- Converts the type and location columns to all-lowercase
- Replaces "heighth" in the type column with "height"
- Ensures that every value is non-negative
Note: This function should not take in any inputs, nor should it return anything.
There are multiple ways to transform columns of strings in pandas. As alluded to in class, one way to figure out how to do this is to read the pandas documentation, or to search online for a tutorial. We want you to practice this skill, so your TAs will ask you what you've tried to search for before giving you more hints.
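To give a flavor of what such a search might turn up, here is a minimal sketch on a made-up table. The pets dataframe and its column names are hypothetical and have nothing to do with sample.csv; the two pandas calls shown (the .str accessor and Series.replace) are only one of several possible approaches.

```python
import pandas as pd

# Hypothetical toy data, unrelated to sample.csv.
pets = pd.DataFrame({
    "name": ["Rex", "MILO", "Luna"],
    "kind": ["Dog", "CAT", "Catt"],
})

# The .str accessor applies a string method to every entry in a column.
pets["kind"] = pets["kind"].str.lower()

# Series.replace swaps out entries that match a value exactly.
pets["kind"] = pets["kind"].replace("catt", "cat")

print(pets)
```

Note that each step assigns the result back to the column; neither call changes the original column in place.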
Now that we've cleaned our data, we're going to perform analyses on it using pandas!
Task: Write a function, age_vs_height, which creates a scatter plot with age on the x-axis and height (values of rows where the type is equal to "height") on the y-axis.
Does age seem to always correlate with height? If not, for approximately which range of ages does age correlate with height? Answer this question in the comments of your code.
Create a dataframe that contains only the rows from data whose type is height (after being cleaned). Call your new dataframe height_data. As an example, you could filter your table for only ages under 11 by indexing data with the condition data["age"] < 11.
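A minimal sketch of this kind of filter on a small made-up table is below. The toy dataframe and the names is_under_11 and under_11 are hypothetical; only the age and value column names are taken from this handout.

```python
import pandas as pd

# Hypothetical stand-in for the real data: a few made-up rows.
toy = pd.DataFrame({
    "age": [10, 40, 15, 8],
    "value": [53.0, 68.0, 64.0, 50.0],
})

# Comparing a column to a number gives a Series of True/False, one per row.
is_under_11 = toy["age"] < 11

# Indexing the dataframe with that Series keeps only the rows marked True.
under_11 = toy[is_under_11]

print(under_11)
```

The same pattern works with == when you want rows whose column matches a particular string.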
After step 1, we should have a dataframe where the type is only height! It might look something like this (in the actual dataframe, you will have many more rows):
location | when | type | value | age | patient ID |
---|---|---|---|---|---|
healthbridge clinic | 2019-12-01 13:30:00 | height | 53.0 | 10 | 0 |
healthbridge clinic | 2019-12-01 13:30:00 | height | 68.0 | 40 | 3 |
healthbridge clinic | 2019-12-01 13:30:00 | height | 64.0 | 15 | 1 |
How can we create a scatter plot with our newly filtered table? Take a look at this for help!
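As a rough sketch, one common way to draw a scatter plot from two dataframe columns is matplotlib's plt.scatter. This assumes matplotlib.pyplot is imported as plt (check the imports in your starter code); the toy dataframe and its values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; the real data comes from sample.csv.
toy = pd.DataFrame({"age": [10, 40, 15], "value": [53.0, 68.0, 64.0]})

# plt.scatter takes the x values first, then the y values.
points = plt.scatter(toy["age"], toy["value"])

# Axis labels make the plot much easier to read.
plt.xlabel("age")
plt.ylabel("value")
plt.title("value vs. age (toy data)")
```

Calling plt.show() afterwards displays the figure in a window.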
Note: This function should not take in any inputs, nor should it return anything.
Note: To make sure you're using the cleaned version of data for your analysis, make sure you've called clean_data!
Task: In the code, call the function you just wrote to see your plot! Running the file should have your scatter plot pop up on screen.
Now that we've made one analysis, it's time for a more complex one!
Task: One of the locations in sample.csv is a clinic called Oakdale Health Center. Write a function, oakdale_temp_over_time, that creates a line plot that shows the average temperature reading (values of rows where the type is equal to "temperature") for the Oakdale Health Center over time.
This analysis will take a couple more steps than age_vs_height.
You'll have to filter once for the location and then once again for the type. We suggest doing this in two separate steps similar to how you filtered in 2a!
Once you have your final filtered table you might be wondering, "How can we compute an average across all readings with the same when?" We can do so using groupby, which produces a new dataframe that groups by the values of one column, and combines values from another column in a way we specify. Take a look at the documentation for groupby here!
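Here is a minimal sketch of groupby on made-up readings. The toy dataframe and the name avg_by_when are hypothetical; only the when and value column names come from this handout. In this particular form, the result is a pandas Series indexed by the grouping column rather than a full dataframe.

```python
import pandas as pd

# Made-up readings: two share the same "when", one is on its own.
toy = pd.DataFrame({
    "when": ["2019-12-01", "2019-12-01", "2019-12-02"],
    "value": [98.0, 100.0, 99.5],
})

# Group rows that share a "when", then average "value" within each group.
avg_by_when = toy.groupby("when")["value"].mean()

print(avg_by_when)
```

The two readings from 2019-12-01 are combined into a single average, so the result has one entry per distinct when.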
Note: Include the line plt.figure() before the line that creates your line plot. This tells Python to create a new figure for your line plot instead of using the one that was used in age_vs_height.
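The effect of plt.figure() can be seen in a small sketch like the one below, which assumes matplotlib.pyplot is imported as plt. The numbers are made up, and first and second are hypothetical names used only to compare the two figures.

```python
import matplotlib.pyplot as plt

# The first plot lands on the current figure (one is created if none exists).
plt.scatter([1, 2, 3], [4, 5, 6])
first = plt.gcf()

# plt.figure() starts a fresh figure, so the next plot does not reuse the first.
plt.figure()
plt.plot([1, 2, 3], [2, 4, 8])
second = plt.gcf()

print(first is second)
```

Without the plt.figure() call, the line would be drawn on top of the scatter plot in the same figure.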
Note: This function should not take in any inputs, nor should it return anything.
Note: As with age_vs_height, to make sure you're using the cleaned version of data for your analysis, make sure you've called clean_data!
Note: If you're curious, more information about groupby can be found here!
Task: In the code, call the function you just wrote to see your plot! Running the file should now have both your scatter plot and your line graph pop up on screen.
Between which dates was the largest increase in average temperature? Write your answer as a comment in your code.
Although we didn't ask you to explicitly test clean_data, age_vs_height or oakdale_temp_over_time, you might find yourself in a position where it's important that you know that your code works before running it on the full data set (for example, if the data set were much, much larger).
Task: Create a smaller dataframe like the ones you used for testing your Pyret tables. You do not have to write pytest tests for this dataframe, but you should run each of clean_data, age_vs_height, and oakdale_temp_over_time on this dataframe and verify to yourself that the result is correct.
Updated 12/06!
We will leave it up to y’all how to best approach testing (change “data” to point to your small data frame, create test functions) as long as your code still runs on the provided data frame!
Note: Not sure how to make a smaller dataframe? Check this out.
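One possible way to build a small dataframe by hand is to pass a dictionary of columns to pd.DataFrame, as sketched below. The column names follow the example table earlier in this handout; the rows themselves are made up, and small is a hypothetical name.

```python
import pandas as pd

# Made-up rows, deliberately messy so the cleaning step has something to fix.
small = pd.DataFrame({
    "location": ["Healthbridge Clinic", "Oakdale Health Center"],
    "when": ["2019-12-01 13:30:00", "2019-12-02 09:00:00"],
    "type": ["heighth", "Temperature"],
    "value": [53.0, 98.6],
    "age": [10, 40],
    "patient ID": [0, 3],
})

print(small)
```

Each dictionary key becomes a column, and every list must have the same length. With only a handful of rows, you can work out by hand what each function should produce and compare.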
Complete the Gradescope questions linked here. You are allowed to use PyCharm and reference materials when completing these questions, but you should not discuss your answers with classmates.
Kung Fu Fighting by Carl Douglas
Congratulations on completing your final CS0111 homework assignment! On behalf of the TA staff, it's been great getting to work with you all. Best of luck with the final!
Brown University CSCI 0111 (Fall 2021)
Do you have feedback? Fill out this form.