
# Lab 11: Using Resources and Pandas

Oh no! You've been hanging "Missing" posters all over the city streets but can't seem to track down a pilot on the loose! You will need to scour the Internet to find information that can help you find the lost pilot. To do this, it helps to know how to search effectively online. This week's lab is designed to help you and your professional pilot-whisperer, Aidan, learn how to do this! In other words, getting stuck and unstuck is part of the point this week, so don't get frustrated.

We will be working on searching the Internet for useful and trustworthy sources to get information and debug your code:

- We'll start with a couple of example scenarios and potential queries to familiarize you with what an effective search query looks like.
- Then we'll apply these skills to write a program that reads from and writes to a file while using Python packages, with a particular emphasis on [`pandas`](https://pandas.pydata.org/docs/getting_started/overview.html).

## Lab Presentation Slides

[Lab 11 Presentation Slides](https://docs.google.com/presentation/d/1ykz5IfNBtQfzfVOggKE5il14FM-gr9U8/edit?usp=sharing&ouid=103145164288835407922&rtpof=true&sd=true)

## Problem 1 - Googling for Python Questions

### Instructions

We'll go through a couple of scenarios in which you might want to search online for information. Each scenario has several queries. For each of these scenarios:

1. Predict which example queries might be good and which might be bad.
2. Google the queries and look into a couple of the search results. Try to note whether the same sites are consistently helpful.
3. Write a ranking of the queries.

Then, after you've made a ranking for each scenario, think about the common elements of the good and bad queries.

### Scenario #1: Getting Help on Error Messages

Aidan decides to start out with some simple code, and writes the following: `2+"1"`, which throws an error: `TypeError: unsupported operand type(s) for +: 'int' and 'str'`. Assist him in searching online for help understanding the error.

Example queries:

- `python 2+"1"`
- `TypeError: unsupported operand type(s) for +: 'int' and 'str'`
- `python add int and string`
- `python strings`

### Scenario #2: Getting Details on an Operation

Aidan has grown quite fond of [ASCII art](https://i.pinimg.com/originals/d9/83/1d/d9831d5626c42e481cd4d96b3938f6f2.jpg), and he wants to be able to use `print`, but without a new line (as is printed by default). In other words, he wants `print("(\\")` followed by `print("(\\")` to print:

`(\(\`

rather than:

`(\`
`(\`

Example queries:

- `print("(\\") but without new line`
- `python print`
- `python print without new line`

### Scenario #3: Finding an Appropriate Operation

Aidan is doing list operations and wants to write a base case that checks if a list is empty, but he doesn't know how to do that.

Example queries:

- `“[]” python`
- `check if list is empty`
- `python lists`
- `python check if list is empty`
- `python list length`

***TASK:*** For scenarios 1, 2, and 3, which queries were the best? Which ones were the worst? Write your responses to these questions in a Google Doc or on a piece of paper. For each scenario, make sure to write one or two bullet points explaining your answers.

___

### CHECKPOINT:

**Call a TA over to discuss the questions above!**

___

## Problem 2 - Intro to Python Packages/`pandas`

### Instructions

Aidan is now equipped to navigate and effectively use the Internet to learn about programming! He wants to test his skills by tackling a topic that he's been hearing about a lot: file input and file output in Python.
Aidan learns that you can write a Python program that reads the contents of a file on your computer, makes calculations, and even writes data to a new file on your machine.

We'll start with a brief explanation of what a package is. Then, the rest of the lab will consist of a number of explanations and practice problems meant to familiarize you with popular Python packages.

### What's a package?

First, some terminology:

- A **module** is a file containing Python definitions and statements which can be *imported* into your code.
- A **package** is a collection of related **modules** that are bundled together. A more general, informal word for a package or module is a *library*.

There are *hundreds of thousands* of Python packages available online. Some are so commonly used that you'll find them in almost every large-scale Python application; others serve highly specific purposes.

Sometimes, you will have to install a library separately from installing the software necessary to run Python. Thankfully for us, Ed Workspaces comes with all of the libraries we need for this semester. If you take other CS courses, you might encounter the complicated world of *package* and *environment* management (an environment is just a fancy term for a custom programming setup that's configured only with the packages you need).

### `pandas`

`pandas` is a really powerful and fun Python library for data manipulation and analysis, with easy syntax and fast operations. Because of this, it is probably the most popular library for data analysis in Python.

In this lab section, we're going to learn the basics of `pandas` and use its functionality to analyze some datasets.

To start using `pandas` in your code, include this line at the top of your Python file:

```
import pandas as pd
```

#### Understanding DataFrames

`pandas` is built around the concept of a `DataFrame`. Simply put, a `DataFrame` is a table. It has rows and columns.
Each column in a `DataFrame` is a `Series` data structure, and each row is made up of one element from each `Series`.

A `DataFrame` can be constructed using built-in Python lists and dictionaries:

```
>>> import pandas as pd
>>> df = pd.DataFrame([
...     {'country': 'Kazakhstan', 'population': 17.04, 'square': 2724902},
...     {'country': 'Russia', 'population': 143.5, 'square': 17125191},
...     {'country': 'Belarus', 'population': 9.5, 'square': 207600},
...     {'country': 'Ukraine', 'population': 45.5, 'square': 603628}
... ])
```

```
>>> df
      country  population    square
0  Kazakhstan       17.04   2724902
1      Russia      143.50  17125191
2     Belarus        9.50    207600
3     Ukraine       45.50    603628
```

:::spoiler **An alternate way to construct a `DataFrame`**
You can also define a `DataFrame` as a `dict` of columns:

```
>>> df = pd.DataFrame({
...     'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
...     'population': [17.04, 143.5, 9.5, 45.5],
...     'square': [2724902, 17125191, 207600, 603628]
... })
```
:::

#### Reading and Writing to Files

Reading and writing file data is straightforward with `pandas`, which supports many file formats, including CSV, XML, HTML, Excel, and JSON (check out the official `pandas` documentation for the full list).

For example, if we wanted to save our previous DataFrame `df` to a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values) (spreadsheet), we only need a single line of code:

```
>>> df.to_csv('filename.csv')
```

We have saved our DataFrame, but what about reading data? No problem:

```
>>> df = pd.read_csv('filename.csv', sep=',')
```

In class, we also saw an example where we could ignore the headers defined by the CSV and re-label our columns:

```
>>> df = pd.read_csv('filename.csv', header=0, names=['col1', 'col2', 'col3'])
```

Now that we know the basics of `pandas`, let's go ahead and analyze some datasets! Here are some links to our documentation and a cheat sheet if you get stuck.
Sometimes it is also helpful to look for answers on Stack Overflow or to ask an AI chatbot, but be careful -- there are many ways to perform a single action in `pandas`, and it can be easy to copy-paste a line of code without understanding what it does (which is a whole mess when it comes to debugging!) or to end up with an operation that is inefficient, hard to test, or hard to work with.

* [Drill 27](https://www.gradescope.com/courses/1103553/assignments/6601031) -- this has the basic `pandas` syntax we talked about in lecture. **Refer to this and your notes from the 11/19 lecture for the basic operations to select Series/values from a dataframe, filter, and build/transform columns.**
* [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
* [Official Documentation](http://pandas.pydata.org/pandas-docs/stable/index.html) -- you can refer to it if you need, but it has a *lot* of information to comb through

:::spoiler **Not seeing the Pandas Cheat Sheet?**
Download the PDF from the above link to access the cheat sheet.
:::

---

### Candy Data

Aidan has recently been craving candy a lot, so we've been requested to revisit the candy dataset from [Lab 3](https://hackmd.io/@cs111/lab03-f25).

Unlike Pyret, Python has no built-in table functionality (like reading a table directly from Google Sheets, table functions, etc.). To complete this lab, we're going to have to take advantage of Python's ability to mutate data, iterate through data, and read and write data to and from input/output files, specifically using `pandas`.

### Setup

Fork [this Ed Workspace](https://edstem.org/us/courses/84795/workspaces/p4tLjvcF3ejokLn19x2vswcngOqjzq4f) (you can also find it listed in the public workspaces) and share it with your partner. The data we're using is stored in `candy-data.csv` and the file you'll be working in is `lab11.py`.
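
Before starting the tasks, it can help to confirm that writing and reading a CSV behaves the way you expect. The sketch below is only an illustration, not part of the lab solution: it reuses the country table from the `DataFrame` section, and the file name `countries.csv` and the temporary folder are made up for this example. It shows one detail worth knowing: by default `to_csv` also writes the row labels as an extra column, and passing `index=False` leaves them out.

```python
import os
import tempfile

import pandas as pd

# Small illustrative table (the country example from earlier in the lab).
df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [17.04, 143.5, 9.5, 45.5],
    'square': [2724902, 17125191, 207600, 603628],
})

# Write to a throwaway folder so nothing in your workspace is overwritten.
# 'countries.csv' is a hypothetical file name chosen for this sketch.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'countries.csv')

# Default behavior: the row labels (0, 1, 2, 3) are saved as a first column
# with an empty header, which read_csv then names 'Unnamed: 0'.
df.to_csv(path)
with_index = pd.read_csv(path)
print(with_index.columns.tolist())
# ['Unnamed: 0', 'country', 'population', 'square']

# With index=False, only the actual data columns are written.
df.to_csv(path, index=False)
df_again = pd.read_csv(path)
print(df_again.columns.tolist())
# ['country', 'population', 'square']
print(df_again.head())
```

If a DataFrame you read back in has a mysterious extra column, the saved row labels are a likely cause, and "pandas to_csv without index" is a good example of a specific search query.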

### Task 1: Read Candy Data using `pandas`

You and Aidan should be experts on surfing the web for relevant information and answers now, so let's put those skills to the test. We aren't going to give you much guidance about how to complete these tasks; remember the takeaways from Problem 1, and try to use online resources (but if you get stuck, the TAs are still here to help).

:::warning
**NOTE:** Your solutions should read directly from the `candy-data.csv` file. Make sure not to copy the contents of the file into your code.
:::

1. Write a function or series of expressions that reads from `candy-data.csv` and calculates the name of the candy with the highest win percentage.

:::info
**HINT:** If you're not sure where to start, try following the steps below:
- Read a CSV file into Python using `pandas`. *The path of your CSV file will be 'candy-data.csv', because we've set up the workspace so that the data file is in the same folder as the code file.*
- Try to print out a few win percentages for different candies using the `pandas` access operations we learned about.
- Plan out your code! You should be able to do this with the basic table building blocks we know about (row/column access, length, building/transforming columns, filtering, and sorting).
- If you're in the Thursday lab, we haven't yet covered sorting in class. Try reading the [`sort_values` documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html). It's ok if this page feels intimidating -- Python library functions are used in a lot of different ways, and the documentation has a lot of information that can be hard to comb through. Try using your searching skills to find examples of using the `sort_values` function!
- You might run into an issue with row labels -- expand the hint below for more guidance.
- Run your code and ensure that it works properly.

:::spoiler **Dealing with row labels**
You might notice that `sort_values` does not rearrange the row labels -- so the row at the top does not magically become row 0. Try to use your newfound search skills to figure out how to use `sort_values` in a way that *does* re-label the rows! Even if you know the answer (which we will cover, or have covered, in class on Friday), try to see if you can figure out an effective way to search for the answer.
:::

2. Write a function or series of expressions that reads from `candy-data.csv` and writes the names of the candies with chocolate to a file named `chocolates.csv`, such that each name is on a separate line. Your solution should **not** use a `for`-loop.

___

### CHECKPOINT:

**Call a TA over to go over your work from above!**

___

### Task 2: More Candy Data Manipulation

Again, for the following, refrain from using `for`-loops. Most of these can be done by using operations in our Pandas Operations Summary, along with the knowledge you learned in Task 1. For questions that ask for "top N" rows, you will have to search the web!

1. Use `pandas` to get the candy with the highest sugar percentage.
2. Use `pandas` to get all of the candy that contains both chocolate and caramel.
3. Save the `DataFrame` with candy containing chocolate and caramel as a CSV file called `chocolate_and_caramel.csv`.
4. Use `pandas` to find the top 5 most "boujee" candies, aka the ones with the highest price percents. See the hint below on how you might build a dataframe with just the first 5 rows.
5. Use `pandas` to find the top 3 most liked and popular **non-chocolate** candies (highest win percents).
6. Now, use `pandas` to add a column to the candy data called `too-sweet`, which will store a Boolean value (`True` or `False`, rather than 1s and 0s) for each candy depending on whether it's too sweet. In this case, a candy is too sweet if its sugar-percent is 0.50 or higher.

:::spoiler Hint for getting the first N rows of a dataframe
In class, we saw that `.loc` can be used in fancy ways, such as by passing in a row and a column at the same time (e.g. `my_df.loc[2, 'col1']`). In the previous lab, you learned about a Python term called *slicing*, which allows you to get only a part of a string or list by specifying a range. Try to use your search skills to see if `pandas` has a way of combining these two ideas!
:::

___

### CHECKPOINT:

**Call a TA over to go over your work from above!**

___

### Task 3: Continuing your Journey in CS!

Take some time to skim [this article](https://www.geeksforgeeks.org/imposter-syndrome-in-software-developers-am-i-a-fake-developer/) about imposter syndrome (a feeling of not accomplishing enough or being unable to accomplish goals) among programmers. Reflect on the following questions with your lab partner:

- Why did you take csci0111? Did this experience change your potential area of study/interest?
- How does csci0111 apply to your own interests inside and/or outside the field of CS?

As the semester comes to an end, remember that the TAs (in this course and in more advanced CS courses) are here to support you in navigating your own pathway within computer science, whether you plan to go into industry, want to do research, or want to apply computer science to another field of study!

Please remember to take care of yourself, and congratulations on finishing the last lab of csci0111!

___

## Key Takeaways + Resources

### Googling for Python Questions

Things to consider when googling in order to debug your code:

- [Stack Overflow](https://stackoverflow.com/) is a website useful for answering specific coding questions (but try to find posts with lots of upvotes)
- Websites with tutorials such as [GeeksForGeeks](https://www.geeksforgeeks.org/) are more useful for explaining a particular concept or algorithm
- If an answer contains concepts that you haven't seen before, keep searching -- there are often many ways to implement the same feature, and a different one might be more familiar
- If you're not sure why an answer isn't working, double check that it uses Python 3.7 or higher (and not Python 2). The specific version of Python for CS 111 this semester is 3.11 (configured when you created your `cs111-env` environment).

### Python Packages

More information about Python packages (you can browse this whole site for more specific info):

- [An Overview of Packaging for Python](https://packaging.python.org/overview/)

CSV files:

- CSV files are plain text files that arrange tabular data, with each piece of data separated by a comma
- CSV files make it easy to import/export large chunks of data from spreadsheets or databases
- You can use `pandas` to manipulate CSVs
- You can also use Python's built-in `csv` module, which can be imported without installing anything extra (not covered in this lab)

### `pandas`

If you want to learn more about the power of `pandas`, below are a few resources that you can explore:

- [12 Useful Pandas Techniques in Python for Data Manipulation](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/)
- [Pandas Tutorial](https://www.python-course.eu/pandas.php)

---

> Brown University CSCI 0111 (Fall 2025)
> Feedback form: tell us about your lab experience today [here](https://forms.gle/avVrN7H8u6hjiH8j7)!