Lab 11: Using Resources and Pandas

Oh no! You're flying over the city streets but can't seem to track down a duck on the loose! You will need to need to scour the interwebs to find information that can help you find the duck's path. To do this, it helps to know how to search effectively online. This week's lab is designed to help you and your professional duck-whisperer, Jeremy, learn how to do this! In other words, getting stuck and unstuck is part of the point this week, so don't get frustrated.

We will be working on searching the Internet for useful and trustworthy sources to get information and debug your code:

  • We'll start with a couple examples of scenarios and potential queries to get you familiarized with what an effective search query looks like.
  • Then we'll apply these skills to write a program that reads from and writes to a file while utilizing Python packages, with a particular emphasis on pandas.

Problem 1 - Googling for Python Questions

Instructions

We'll go through a couple scenarios in which you might want to search online for information. Each scenario has several queries. For each of these scenarios:

  1. Predict which example queries might be good and which might be bad.
  2. Google the queries and look into a couple of the search results. Try to note whether the same sites are consistently helpful.
  3. Write a ranking of the queries.

Then, after you've made a ranking for each scenario, think about the common elements of the good and bad queries.

Scenario #1: Getting Help on Error Messages

Jeremy decides to start out with some simple code, and writes the following: 2+"1", which throws an error: TypeError: unsupported operand type(s) for +: 'int' and 'str'. Assist him in searching online for help understanding the error.

Example queries:

  • python 2+"1"
  • TypeError: unsupported operand type(s) for +: 'int' and 'str'
  • python add int and string
  • python strings

Scenario #2: Getting Details on an Operation

Jeremy has grown quite fond of ASCII art, and he wants to be able to use print, but without a new line (as is printed by default). In other words, he wants print("(\") and print("(\") to print:
(\(\
rather than:
(\
(\

Example queries:

  • print("(\") but without new line
  • python print
  • python print without new line

Scenario #3: Finding an Appropriate Operation

Jeremy is doing list operations and wants to write a base case that checks if a list is empty, but he doesn't know how to do that.

Example queries:

  • “[]” python
  • check if list is empty
  • python lists
  • python check if list is empty
  • python list length

TASK: For scenarios 1, 2, and 3, which queries were the best? Which ones were the worst? Write in a Google Doc or on a piece of paper your responses to these questions. Make sure for each scenario to write one or two bullet points to explain your answers.


CHECKPOINT:

Call a TA over to discuss the questions above!


Problem 2 - Intro to Python Packages/pandas

Instructions

Jeremy is now equipped to navigate and effectively use the Internet to learn about programming! He wants to test his skills by tackling this topic that he's been hearing about a lot: file input and file output in Python.

Jeremy learns that you can write a Python program that reads the contents of a file on your computer, makes calculations, and even writes data to a new file on your machine.

We'll start with a brief explanation of what a package is. Then, the rest of the lab will consist of a number of explanations and practice problems meant to familiarize you with popular Python packages.

What's a package?

First, some terminology:

  • A package is a collection of files that make up a module.
  • A module is a file containing Python definitions and statements which can be imported into your code. A more generalized word for a module is a library.

There are hundreds of thousands of Python packages available online. Some are so commonly used that you'll find them in almost every large-scale Python application; others serve highly specific purposes.

VSCode makes it very easy to download and use packages in your projects.

Installing Packages using VSCode

If you followed our lab 0 instructions, you should already have pandas in your cs111-env environment. If you want to install a new package in this environment, you should go to VSCode > Terminal > New Terminal, check that the cs111-env is activated, and use the command

conda install -c conda-forge [package name]

For example, you should install matplotlib (a plotting package) if you haven't already, by running conda install -c conda-forge matplotlib. We won't use it for this lab, but you will use it in the mini project!


pandas

pandas is a really powerful and fun Python library for data manipulation/analysis, with easy syntax and fast operations. Because of this, it is the probably the most popular library for data analysis in Python programming language.

In this lab section, we're going to learn the basics of pandas and use its functionality to analyze some datasets.

To start using pandas in your code, include this line at the top of your Python file:

import pandas as pd

Understanding DataFrames

pandas is built around the concept of a DataFrame. Simply said, a DataFrame is a table. It has rows and columns. Each column in a DataFrame is a Series data structure, rows consist of elements inside Series.

A DataFrame can be constructed using built-in Python lists and dictionaries:

>>> import pandas as pd
>>> df = pd.DataFrame([
...      {'country': 'Kazakhstan', 'population': 17.04, 'square': 2724902},
...      {'country': 'Russia', 'population': 143.5, 'square': 17125191},
...      {'country': 'Belarus', 'population': 9l5, 'square': 207600},
...      {'country': 'Ukraine', 'population': 45.5, 'square': 603628}
... ])
>>> df
   country  population    square
0  Kazakhstan       17.04   2724902
1      Russia      143.50  17125191
2     Belarus        9.50    207600
3     Ukraine       45.50    603628
An alternate way to construct a DataFrame

You can also define a DataFrame as dict of columns:

df = pd.DataFrame({
...     'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
...     'population': [17.04, 143.5, 9.5, 45.5],
...     'square': [2724902, 17125191, 207600, 603628]
... })

Reading and Writing to Files

Reading and writing file data is incredibly easy using pandas, and pandas supports many file formats, including CSV, XML, HTML, Excel, JSON, and many more (check out the official pandas documentation).

For example, if we wanted to save our previous DataFrame df to a CSV file (spreadsheet), we only need a single line of code:

>>> df.to_csv('filename.csv')

We have saved our DataFrame, but what about reading data? No problem:

>>> df = pd.read_csv('filename.csv', sep=',')

Now that we know the basics of pandas, let's go ahead and analyze some datasets! Here are some links to our documentation and a cheat sheet if you get stuck. Sometimes it is also helpful to find answers on StackOverflow, but be careful there are many ways to perform a single action in pandas, and it can be easy to copy-paste a line of code without understanding what it does (which is a whole mess when it comes to debugging!)


Candy Data

Jeremy has recently been craving candy a lot, so we've been requested to revisit the candy dataset from Lab 3.

Unlike Pyret, Python has no built-in table functionality (like reading a table directly from Google Sheets, table functions, etc). To complete this lab, we're going to have to take advantage of Python's ability to mutate data, iterate through data, and read and write data to and from input/output files, specifically using pandas.

Setup

  1. Navigate to this GitHub file. In the top right of the dataset, click 'Raw'. You should see a wall of text on your browser.
  2. Go ahead and right click the page of text and click "Save as" to save the file as candy-data.csv inside your current directory.
  3. If trying this doesn't work (it might say it can only be saved as a .webarchive, not a .csv), instead create a new file in your VSCode project called candy-data.csv. This will open a blank file and you can copy and paste all the 'Raw' data into it.
  4. Create a Python file called lab11.py. This is where you'll implement the functions below.

Task 1: Read Candy Data using pandas

You and Jeremy should be experts on surfing the web for relevant information and answers now, so let's put those skills to the test. We aren't going to give you much guidance about how to complete these tasks; remember the takeaways from Part 1, and try to use online resources (but if you get stuck, the TAs are still here to help).

  1. Write a function or series of expressions that read from candy-data.csv and calculates the name of the candy with the highest win percentage.

    HINT: If you're not sure where to start, try following the steps below:

    • Figure out how to read a CSV file into Python using pandas note that this is different from what we saw in class, where we downloaded a CSV from the internet. Here, we're asking you to figure out how to read in a CSV file that is stored on your computer.
    • Try to print out a few win percentages for different candies (figure out how to access them from a DataFrame using our reference document).
    • Plan out your code! You should use function(s) from our Pandas Operations Summary to help you (even ones you might not have seen demonstrated in class yet!). You might run into an issue with row labels expand the hint below for more guidance.
    • Test your code to ensure that it works properly.
Dealing with row labels

If you tried to use sort_values in your plan, that's great! You're on the right track. You might notice that sort_values does not arrange the row labels so the row at the top does not magically become row 0. Try to use your newfound search skills to figure out how to use sort_values in a way that does re-label the rows!


  1. Write another function or series of expressions that write the results of your answer to Question 1 to a file named result-1.csv. How the results are stored in this file is up to you (as long as it's a valid CSV that opens in a spreadsheet program such as Excel or Google Sheets).

    HINT: This time, start by writing code that just writes the string "Hello, World" onto the first line of a file. Once you've done so, integrate data from the CSV. This might be easier with just the normal python write function (look up the documentation!).

  2. Write a function or series of expresions that read from candy-data.csv and writes the names of the candies with chocolate to a file named result-2.csv, such that each name is on a separate line. Your solution should not use a for-loop.

NOTE:

  • Your solutions should read directly from the file. Make sure not to copy the contents of the file into your code.

CHECKPOINT:

Call a TA over to go over your work from above!


Task 2: More Candy Data Manipulation

Again, for the following, refrain from using for-loops. Most of these can be done by using operations in our Pandas Operations Summary, along with the knowledge you learned in Task 1. For questions that ask for "top N" rows, you will have to search the web!

  1. Use pandas to get the candy with the highest sugar percentage.
  2. Use pandas to get candy that contains both chocolate and caramel.
  3. Save the DataFrame with candy containing chocolate and caramel as a csv file called chocolate_and_caramel.csv.
  4. Use pandas to find the top 5 most "boujee" candies, aka the ones with the highest price percents.
  5. Use pandas to find the top 3 most liked and popular non-chocolate candies (highest win percents).
  6. Now, use pandas to add a column to the candy data called too-sugary, which will store a Boolean value (True and False, rather than 1s and 0s) for each candy depending on if it's too sugary. In this case, if sugar-percent is 0.50 or higher, then it's too sugary.

CHECKPOINT:

Call a TA over to go over your work from above!


Key Takeaways + Resources

Googling for Python Questions

Things to consider when googling in order to debug your code:

  • Stack Overflow is a website useful for answering specific coding questions (but try to find posts with lots of upvotes)
  • Websites with tutorials such as GeeksForGeeks are more useful for explaining a particular concept or algorithm
  • If an answer contains concepts that you haven't seen before, keep searching there are often many ways to implement the same feature, and a different one might be more familiar
  • If you're not sure why an answer isn't working, double check that it uses Python 3.7 or higher (and not Python 2). The specific version of Python for CS 111 this semester is 3.9 (configured when you created your cs111-env environment).

Python Packages

More information about python packages (you can browse this whole site for more specific info):

CSV files:

  • CSV files are plain text files that arrange tabular data, with each piece of data separated by a comma
  • CSV files makes it easy to import/ export large chunks of data from spreadsheets or databases
  • You can use pandas to manipulate CSVs
  • You can also utilize Python CSV's package since Python already has a built-in CSV library, which we can import (not covered in this lab)

pandas

If you want to learn more about the power of pandas, below are a few resources that you can explore:

A real life Panda!


Brown University CSCI 0111 (Fall 2023)
Do you have feedback? Fill out this form.