# Pandas - Kaggles & Notebook 7 **Disclaimer**: These notes were created with the sole intention of creating an 'all-in-one' place to store information to use during the CSE6040 exams. As such I have included my solutions to the Notebooks within these notes and wrapped them in a spoiler block. If another individual chooses to use these notes for their own educational purposes please be aware that the intentional copying of my solutions for use on your homework is a violation of the Georgia Tech Code of Conduct. Information on the Code of Conduct can be found here: [Georgia Tech Code of Conduct](https://policylibrary.gatech.edu/student-life/student-code-conduct) ## 🧭 Emoji Legend (Note Formatting Guide) <details> <summary>🧭 Click to expand Emoji Legend</summary> | Icon | Meaning / Use Case | Example Usage | |:----:|--------------------|----------------| | 🧩 | **Main topic or section header** β€” introduces a major theme or concept | `## 🧩 Groupwise Analysis` | | πŸ”Έ | **Major concept or subsection** β€” represents a key idea under the main topic | `### πŸ”Έ Why Group Data?` | | πŸ”Ή | **Detailed operation or example** β€” used for methods, step-by-step logic, or code examples | `### πŸ”Ή groupby() minimum` | | βš™οΈ | **Mechanics** β€” explains what’s happening under the hood | β€œβš™οΈ Mechanics: Pandas scans column values and builds sub-DataFrames.” | | 🎯 | **Goal / Purpose** β€” clarifies what we’re trying to achieve | β€œπŸŽ― Goal: Count how many rows belong to each group.” | | πŸ’‘ | **Tip / Takeaway / Insight** β€” highlights key points, advice, or conceptual notes | β€œπŸ’‘ Takeaway: Use .agg() for multiple summary stats.” | | βœ… | **Shortcut / Confirmation** β€” shows equivalent methods or quick wins | β€œβœ… Equivalent: `reviews.points.value_counts()`” | | 🧠 | **Conceptual Insight** β€” emphasizes mental models or deeper understanding | β€œπŸ§  Think of groupby() as split β†’ apply β†’ combine.” | | πŸ“˜ | **Reference / Documentation Link** β€” links to 
external docs or sources | β€œπŸ“˜ [Pandas MultiIndex Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)” | | πŸͺ„ | **Trick / Tip** β€” optional visual for clever techniques or shortcuts | β€œπŸͺ„ Use .apply() to manipulate each group independently.” | | 🧭 | **Legend / Navigation Aid** β€” used for orientation or overview sections | β€œ## 🧭 Emoji Legend (Note Formatting Guide)” | --- πŸͺ„ *Tip:* All of these are Unicode emojis that work natively in HackMD, GitHub, Notion, and most Markdown renderers. You can insert them using your OS emoji picker: - **Windows:** `Win + .` - **macOS:** `Cmd + Ctrl + Space` </details> ## 🧭 Table of Contents [TOC] # Creating, Reading and Writing ## 🧩 Getting Started ### πŸ”Έ Why Data Creation and I/O Matter Before you can analyze data, you need to create, import, or export it in a structured way. This section covers how to set up Pandas, create basic data structures, and read or write files so you can start exploring data efficiently. ### πŸ”Ή Getting Started **🎯 Goal:** Initialize Pandas and prepare your environment for data analysis. **Setup:** ```python import pandas as pd ``` ## 🧩 Creating Data ### πŸ”ΈWhy Create Data Manually? Before analyzing real datasets, it’s important to understand how to build DataFrames and Series from scratch. This gives you full control for testing, debugging, or learning Pandas operations on simple examples before scaling up to large data sources. ### πŸ”Ή Creating Data **🎯 Goal:** Understand how to create Pandas data structures β€” DataFrames and Series β€” as the foundation for data analysis. **βš™οΈ Mechanics:** Pandas provides two core objects: * DataFrame β†’ A 2-D labeled table (rows + columns). * Series β†’ A 1-D labeled array (a single column of data). You can create them manually or from external data sources (like CSVs, SQL tables, or JSON files). 
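A quick sketch ties the two objects together before the detailed sections below (the values are invented, purely for illustration): a DataFrame is a 2-D table, and each of its columns, pulled out with bracket notation, is itself a Series.

```python
import pandas as pd

# A DataFrame is a 2-D labeled table...
df = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

# ...and each of its columns is a 1-D labeled Series
col = df['Yes']
```

This is why the DataFrame is often described as "a collection of Series sharing one index."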
### πŸ”Ή DataFrame

**🎯 Goal:** Create and understand a Pandas `DataFrame`, the table-like structure used for most operations.

**βš™οΈ Mechanics:**
A `DataFrame` is built using the `pd.DataFrame()` constructor. You typically pass in a dictionary, where:
* Keys β†’ Column names
* Values β†’ Lists (column data)

**Examples:**
```python
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
```
![image](https://hackmd.io/_uploads/B1mNH5_aee.png)

Rename the rows using the `index` parameter:
```python
pd.DataFrame({
    'Bob': ['I liked it.', 'It was awful.'],
    'Sue': ['Pretty good.', 'Bland.']
}, index=['Product A', 'Product B'])
```
![image](https://hackmd.io/_uploads/rkRuH5_all.png)

**πŸ’‘ Takeaway:**
A **DataFrame** is a collection of **Series** aligned by shared row labels (the index). Use `pd.DataFrame()` to build tabular data directly from Python structures like lists or dicts.

### πŸ”Ή Series

**🎯 Goal:** Learn how to create and label a one-dimensional data column.

**βš™οΈ Mechanics:**
A `Series` is created with `pd.Series()`, which stores data in a single labeled column. You can assign labels to rows using the `index` parameter, and a name to the Series itself with `name`.

**Examples:**
```python
pd.Series([1, 2, 3, 4, 5])
```
```python
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
```
![image](https://hackmd.io/_uploads/rJ0ILq_aeg.png)

## 🧩 Reading Data Files

### πŸ”Έ Why Read External Data?
Most real-world analysis starts with **importing datasets** rather than creating them manually. Reading data from external files (like CSV, Excel, or databases) allows you to analyze large, structured datasets efficiently within Pandas.

### πŸ”Ή Importing Data

**🎯 Goal:** Learn how to **import, inspect, and manage tabular data** from external files (like CSVs) using Pandas.

**βš™οΈ Mechanics:**
Use `pd.read_csv()` to load data into a DataFrame, then inspect it with built-in methods like `.shape` and `.head()`.
You can also specify an index column and export your cleaned dataset using `.to_csv()`. **Examples:** Read a CSV into a DataFrame: ```python wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv") ``` Check dataset dimensions: ```python wine_reviews.shape ``` Preview first few rows: ```python wine_reviews.head() ``` Specify a column as the index when reading: ```python wine_reviews = pd.read_csv( "../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0 ) ``` Save DataFrame to a new CSV: ```python wine_reviews.to_csv("my_dataframe.csv", index=True) ``` ## πŸ’‘ Summary of Reading Data Files | Operation | Description | Example | |------------|--------------|----------| | `import pandas as pd` | Imports the Pandas library for data manipulation and analysis. | `import pandas as pd` | | `pd.DataFrame()` | Creates a DataFrame (table) from a dictionary of lists or arrays. | `pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})` | | `index=` (in DataFrame) | Assigns custom row labels when creating a DataFrame. | `pd.DataFrame({'Bob': [...], 'Sue': [...]}, index=['A', 'B'])` | | `pd.Series()` | Creates a one-dimensional labeled array (like a single column). | `pd.Series([30, 35, 40], index=['2015', '2016', '2017'], name='Sales')` | | `.name` | Assigns or displays the name of a Series. | `s.name = 'Revenue'` | | `pd.read_csv()` | Reads a CSV file into a DataFrame. | `pd.read_csv('data.csv')` | | `.shape` | Returns the tuple (rows, columns) of the DataFrame. | `df.shape` | | `.head()` | Displays the first five rows of the DataFrame. | `df.head()` | | `index_col=` | Sets a specific column as the DataFrame’s index during import. | `pd.read_csv('data.csv', index_col=0)` | | `.to_csv()` | Exports a DataFrame to a CSV file. | `df.to_csv('output.csv', index=True)` | --- # Indexing, Selecting & Assigning ## 🧩 Native Accessors **🎯 Goal:** Learn how to access columns and individual entries within a Pandas DataFrame using native Python-style syntax. 
**βš™οΈ Mechanics:** * Access DataFrame columns using dot notation (`df.column`) or bracket notation (`df['column']`). * Access individual values using another pair of brackets `(df['column'][row_index])`. * Both return a Series when selecting a column, or a scalar when selecting one value. ### πŸ”Ή Accessing Columns Access a column as a Series using dot notation: ```python reviews.country ``` Access a column as a Series using bracket notation: ```python reviews['country'] ``` These are functionally the same - but bracket notation is considered safer if there are special characters. ### πŸ”Ή Accessing Individual Values Access a single entry by chaining the two accessors: ```python reviews['country'][0] ``` **πŸ’‘ Think of it like:** Grab the `country` Series from `reviews`, then select the first value. ## 🧩 Indexing in Pandas **🎯 Goal:** Learn how to select specific rows and columns in a DataFrame using index-based (`.iloc`) and label-based (`.loc`) selection. **βš™οΈ Mechanics:** * `.iloc` β†’ Selects data by position (numeric indices). * `.loc` β†’ Selects data by label (index or column names). * Both follow the pattern: `df[rows, columns]`. (Row first, column second unlike Python) ### πŸ”Ή Index-Based Selection (`.iloc`) Access data by integer position, similar to standard Python indexing. ```python reviews.iloc[0] # First row reviews.iloc[:, 0] # First column reviews.iloc[:3, 0] # First 3 rows, first column reviews.iloc[[0,1,2], 0] # Specific rows reviews.iloc[-5:] # Last 5 rows ``` ### πŸ”Ή Label-Based Selection (`.loc`) Access data by index labels or column names - not by numeric position. ```python reviews.loc[0, 'country'] # Row label 0, column 'country' reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']] ``` **🧠 Concept:** `.loc` relies on labels, not positions β€” so it’s aware of named indices or custom labels. 
**πŸ’‘ Tip:** Works better for most real datasets, since indices and columns often have semantic meaning (e.g., β€œCountry,” β€œYear,” etc.). ### πŸ”Ή Choosing Between .iloc and .loc **βš™οΈ Key Difference:** * `.iloc` β†’ Position-based and exclusive end range * `.loc` β†’ Label-based and inclusive end range ### πŸ’‘ Summary of Indexing in Pandas | Operation | Description | Example | |------------|--------------|----------| | **`df.iloc[i]`** | Selects a single row by **integer position**. | `reviews.iloc[0]` | | **`df.iloc[:, j]`** | Selects a single column by **integer position**. | `reviews.iloc[:, 0]` | | **`df.iloc[a:b, c]`** | Selects a **range of rows** (end **excluded**). | `reviews.iloc[:3, 0]` | | **`df.iloc[[a,b,c], d]`** | Selects **specific rows** by list of positions. | `reviews.iloc[[0,1,2], 0]` | | **`df.iloc[-n:]`** | Selects the **last n rows** using negative indices. | `reviews.iloc[-5:]` | | **`df.loc[row_label, column_label]`** | Selects by **index label** and **column name**. | `reviews.loc[0, 'country']` | | **`df.loc[:, ['col1','col2']]`** | Selects multiple columns by **label**. | `reviews.loc[:, ['taster_name','points']]` | | **`df.loc[row_start:row_end]`** | Selects a **range of labels** (end **included**). | `reviews.loc[0:10]` | | **Range behavior** | `.iloc` is **exclusive**; `.loc` is **inclusive**. | `iloc[0:10] β†’ 10 rows`<br>`loc[0:10] β†’ 11 rows` | | **Typical Use Case** | `.iloc` β†’ position-based selection.<br>`.loc` β†’ label-based selection. | β€” | ## 🧩 Manipulating the Index **🎯 Goal:** Change or redefine the DataFrame’s index to make data easier to organize or access. **βš™οΈ Mechanics:** Use set_index() to replace the default numeric index with one or more meaningful columns. ```python reviews.set_index("title") ``` **πŸ’‘ Tip:** Useful when the new index better represents the data’s structure (e.g., unique IDs, product names, dates). 
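The inclusive/exclusive range difference, and the behavior of `set_index()`, can be checked on a toy frame (a minimal sketch; the `title` and `points` values below are invented for illustration):

```python
import pandas as pd

# Toy stand-in for `reviews`, with the default 0..4 integer index
df = pd.DataFrame({'title': ['w1', 'w2', 'w3', 'w4', 'w5'],
                   'points': [87, 90, 92, 85, 88]})

by_position = df.iloc[0:3]  # positions 0, 1, 2 -> end EXCLUDED -> 3 rows
by_label = df.loc[0:3]      # labels 0 through 3 -> end INCLUDED -> 4 rows

# set_index returns a NEW DataFrame whose rows are labeled by 'title'
titled = df.set_index('title')
value = titled.loc['w1', 'points']  # label-based lookup now uses titles
```

Note that `df` itself is unchanged after `set_index`; the relabeled frame must be captured in a new variable (or assigned back).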
## 🧩 Conditional Selection

**🎯 Goal:** Filter rows based on conditions β€” like comparisons, inclusion, or null checks.

**βš™οΈ Mechanics:** Combine Boolean expressions with `&` (and), `|` (or), and use `.isin()` or `.notnull()` for advanced filtering.

### πŸ”Ή Basic Conditional Filtering

```python
reviews.loc[reviews.country == 'Italy']
```
Returns all rows where `country` is `"Italy"`.

### πŸ”Ή Multiple Conditions

Use parentheses around each condition:
```python
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
```
➑️ Wines from Italy and with 90+ points.

Use `|` for logical OR:
```python
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
```
➑️ Wines from Italy or with 90+ points.

### πŸ”Ή Using `isin()`

```python
reviews.loc[reviews.country.isin(['Italy', 'France'])]
```
➑️ Select rows where country is either `Italy` or `France`.

### πŸ”Ή Using `isnull()` / `notnull()`

```python
reviews.loc[reviews.price.notnull()]
```
➑️ Select rows where the `price` is not missing.

**πŸ’‘ Tip:** Always use parentheses when chaining multiple conditions to avoid precedence issues.

## 🧩 Assigning Data

**🎯 Goal:** Add or modify columns in a DataFrame.

**βš™οΈ Mechanics:** Use direct assignment (`df['new_col'] = ...`) to add new columns or overwrite existing ones.

### πŸ”Ή Assigning a Constant

```python
reviews['critic'] = 'everyone'
```
➑️ Adds a new column with the same value for all rows.

### πŸ”Ή Assigning a Sequence

```python
reviews['index_backwards'] = range(len(reviews), 0, -1)
```
➑️ Adds a column counting backward from the total row count.

## πŸ’‘ Summary of Manipulating, Filtering, and Assigning Data

| Concept | Description | Example |
|----------|--------------|----------|
| **Set Index** | Replaces DataFrame’s index with a column. | `df.set_index('title')` |
| **Conditional Filter** | Selects rows meeting logical conditions.
| `df.loc[(df.country == 'Italy') & (df.points >= 90)]` | | **Membership Filter** | Selects rows matching any value in a list. | `df.loc[df.country.isin(['Italy', 'France'])]` | | **Null Check** | Filters missing or non-missing values. | `df.loc[df.price.notnull()]` | | **Assign Constant** | Adds a new column with one constant value. | `df['critic'] = 'everyone'` | | **Assign Iterable** | Adds a column using a range or list. | `df['index_backwards'] = range(len(df), 0, -1)` | --- # Summary Functions and Maps # Grouping and Sorting ## 🧩 Groupwise Analysis ### πŸ”Έ Why Group Data? `map()` lets us transform data *one value at a time* in a column, but `groupby()` lets us **split the dataset into logical groups** and then perform an operation *within* each group. Think of `groupby()` as: > β€œBreak this DataFrame into smaller DataFrames where all rows share the same value in a column.” --- ### πŸ”Ή value_counts() vs groupby() **🎯 Goal:** Count how many rows fall into each group. **βš™οΈ Mechanics:** `groupby()` organizes data into subgroups based on unique values, then we can count, average, or aggregate within each subgroup. **Example:** ```python reviews.groupby('points').points.count() == reviews.points.value_counts() ``` ![image](https://hackmd.io/_uploads/HJICPGuaex.png) This groups all rows that share the same `points` value, then counts how many rows (reviews) are in each group. **πŸ’‘ Takeaway:** `value_counts()` is a simplified `groupby()` that automatically counts rows per unique value. --- ### πŸ”Ή groupby() minimum **🎯 Goal:** Summarize or analyze each group separately. **βš™οΈ Mechanics:** After grouping, you can access a specific column and apply any aggregation like `.min()`, `.max()`, `.mean()`, etc. **Example:** ```python reviews.groupby('points').price.min() ``` ![image](https://hackmd.io/_uploads/B1woxBOTxg.png) Each `points` group returns the minimum `price` found among its rows. 
**🧠 Think of it as:** A "slice" of the DataFrame for each group - like multiple smaller tables filtered by shared values.

**More control with apply():**
```python
# Get the first wine review title from each winery
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
```
![image](https://hackmd.io/_uploads/Ske0lBOTex.png)

**Multiple columns:**
```python
# Find the best wine by country and province
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
```
![image](https://hackmd.io/_uploads/SJSJWr_Tll.png)

**πŸ’‘ Takeaway:** `groupby()` creates one sub-DataFrame per unique group. `apply()` lets you perform custom operations on each sub-DataFrame.

---

### πŸ”Ή groupby() + agg()

**🎯 Goal:** Run several aggregate functions at once.

**βš™οΈ Mechanics:** `agg()` allows multiple summary operations in one step.

**Example:**
```python
reviews.groupby('country').price.agg([len, min, max])
```
![image](https://hackmd.io/_uploads/S1LXbHdage.png)

This returns a DataFrame with the number of rows, min price, and max price per country.

**πŸ’‘ Takeaway:** Use `agg()` when you want multiple summary stats instead of chaining multiple `min()` or `max()` calls.

---

### πŸ’‘ Summary of Groupby Section

| Operation | Description | Example |
|------------|--------------|----------|
| **value_counts()** | Quickly counts occurrences of each unique value in a Series. | `reviews.points.value_counts()` |
| **groupby()** | Splits the DataFrame into sub-DataFrames based on unique column values. | `reviews.groupby('points')` |
| **groupby().count()** | Counts rows in each group (same as `value_counts()` but flexible for any column). | `reviews.groupby('points').points.count()` |
| **groupby().min() / .max() / .mean()** | Applies summary functions to each group. | `reviews.groupby('points').price.min()` |
| **groupby().apply()** | Runs a custom function on each group (full control of each sub-DataFrame).
| `reviews.groupby('winery').apply(lambda df: df.title.iloc[0])` | | **groupby().agg([len, min, max])** | Runs multiple aggregation functions at once for concise summaries. | `reviews.groupby('country').price.agg([len, min, max])` | --- πŸͺ„ *Tip:* Use `groupby()` when you need to analyze or summarize data **per category** instead of across the entire DataFrame. ## 🧩 Multi-Indexes ### πŸ”Έ Why Multi-Indexes Exist When you group by **multiple columns**, Pandas creates a **MultiIndex** β€” a hierarchical structure that represents grouped levels of data. **🧠 Think of it as:** > β€œRows are organized by two (or more) keys instead of one.” This allows for more complex, layered summaries such as `(country β†’ province)` or `(year β†’ month)` groupings. --- ### πŸ”Ή Creating a Multi-Index **🎯 Goal:** Summarize data across multiple grouping keys. **βš™οΈ Mechanics:** When you call `groupby()` with multiple columns, Pandas creates one β€œlevel” of the index for each grouping column. **Example:** ```python countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len]) countries_reviewed.head() ``` ![image](https://hackmd.io/_uploads/BJwDmSuaxe.png) * `country` β†’ first index level * `province` β†’ second index level * `len` β†’ aggregated column showing review counts Each `(country, province)` pair now uniquely identifies a row. --- ### πŸ”Ή Working with Multi-Indexes **🎯 Goal:** Make hierarchical results readable or easy to manipulate. **βš™οΈ Mechanics:** Reverts a multiple column index back to a single, numerical index. This also converts the index columns to standard columns. **Example:** ```python countries_reviewed.reset_index() ``` ![image](https://hackmd.io/_uploads/HkkCNrualx.png) `country` and `province` return to being standard columns instead of index labels. 
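The grouping-then-flattening round trip above can be demonstrated end to end on a small synthetic stand-in for the wine data (the country/province values are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for `reviews` (values invented for illustration)
reviews = pd.DataFrame({
    'country':  ['Italy', 'Italy', 'France', 'France', 'France'],
    'province': ['Tuscany', 'Tuscany', 'Alsace', 'Alsace', 'Bordeaux'],
    'description': ['a', 'b', 'c', 'd', 'e'],
})

# Grouping by two columns produces a MultiIndex of (country, province) pairs
counts = reviews.groupby(['country', 'province']).description.agg([len])

# reset_index flattens the hierarchy back into ordinary columns
flat = counts.reset_index()
```

After `reset_index()`, `country` and `province` are plain columns again, so methods like `sort_values()` can use them directly.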
--- ### πŸ’‘ Summary of Multi-Index Section | Operation | Description | Example | |------------|--------------|----------| | **MultiIndex** | A hierarchical index created when grouping by multiple columns. Useful for layered summaries. | `reviews.groupby(['country','province']).description.agg([len])` | | **.index** | Returns the index object of a grouped DataFrame. Used to check if a result has a MultiIndex. | `countries_reviewed.index` | | **reset_index()** | Converts a MultiIndex back into standard columns for easier manipulation or sorting. | `countries_reviewed.reset_index()` | | **Multi-level access** | Lets you select or slice data using one or more levels of the MultiIndex. | `countries_reviewed.loc['Italy']` | | **Why use it** | Efficiently represents multi-key data (e.g., country β†’ province) in compact tabular form. | β€” | πŸ“˜ [Pandas MultiIndex Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) ## 🧩 Sorting ### πŸ”Έ Why Sorting Matters After using `groupby()`, Pandas returns results in **index order**, not based on the actual data values. That means your rows might be organized alphabetically by index labels rather than by numeric or meaningful order. **🧠 Think of it as:** > β€œ`groupby()` organizes results by how they’re labeled, not by what their values are.” To make summaries easier to read or analyze, we use **sorting** functions. --- ### πŸ”Ή sort_values() **🎯 Goal:** Reorder rows in a DataFrame based on one or more column values. **βš™οΈ Mechanics:** `sort_values()` sorts rows by column data β€” not by index. It returns a *new* DataFrame unless you use `inplace=True`. 
**Example:** ```python countries_reviewed = countries_reviewed.reset_index() countries_reviewed.sort_values(by='len') ``` ![image](https://hackmd.io/_uploads/Byu-wS_age.png) * Sorts rows by the `len` column in **ascending order** (lowest to highest) * Does not modify the original DataFrame unless specified **πŸͺ„ Tip** To sort from highest to lowest, set `ascending=False`. ```python countries_reviewed.sort_values(by='len', ascending=False) ``` ![image](https://hackmd.io/_uploads/Hk9aPH_age.png) --- ### πŸ”Ή sort_index() **🎯 Goal:** Reorder rows or columns based on index labels instead of values. **βš™οΈ Mechanics:** Useful when your DataFrame has a Multi-Index or when the order of index labels matters. **Example:** ```python countries_reviewed.sort_index() ``` ![image](https://hackmd.io/_uploads/HkIVdBu6gl.png) This sorts the DataFrame by its index. **πŸ’‘ When to use it:** After `groupby()` or hierarchical operations that leave your data with a structured index. --- ### πŸ”Ή Sorting by Multiple Columns **🎯 Goal:** Define a hierarchy of sorting β€” for example, first by country, then by length (count). **βš™οΈ Mechanics:** * Pandas sorts by the first column listed, then breaks ties using the next one(s). * Works with both numeric and string columns. **Example:** ```python countries_reviewed.sort_values(by=['country', 'len']) ``` ![image](https://hackmd.io/_uploads/HksrtrOTle.png) **πŸ’‘ Use case:** Organizing grouped summaries where you want a primary and secondary sorting key. --- ### πŸ’‘ Summary of Sorting Section | Operation | Description | Example | |------------|--------------|----------| | **sort_values(by=)** | Sorts the DataFrame by one or more columns in ascending order (lowest first by default). | `countries_reviewed.sort_values(by='len')` | | **sort_values(..., ascending=False)** | Sorts the DataFrame in descending order (highest values first). 
| `countries_reviewed.sort_values(by='len', ascending=False)` |
| **sort_index()** | Sorts rows or columns based on their index labels instead of column values. | `countries_reviewed.sort_index()` |
| **Multi-column sort** | Sorts by multiple columns in sequence (uses the next column to break ties). | `countries_reviewed.sort_values(by=['country', 'len'])` |
| **Why use it** | Ensures that grouped or summarized results are displayed in a meaningful and logical order. | β€” |

---

# Data Types and Missing Values

## 🧩 Dtypes

### πŸ”Ή Dtype Identification

**🎯 Goal:** Identify the data types of the data stored within either a Series or a DataFrame.

**βš™οΈ Mechanics:** Use the `.dtype` attribute on a single Series/column, or `.dtypes` on a whole DataFrame, to inspect the stored data types.

**Examples:**
View the data type of a single column named `price`:
```python
reviews.price.dtype
```
`dtype('float64')`

View the data type of every column:
```python
reviews.dtypes
```
![image](https://hackmd.io/_uploads/ByhO-tdplg.png)

* Note: Columns consisting entirely of strings do not get their own type; they are given the `object` type instead.

### πŸ”Ή Change Dtype

**🎯 Goal:** Convert a column (or Series) from one data type to another when such a conversion makes sense.

**βš™οΈ Mechanics:** Use the `.astype()` method to change the data type (`dtype`) of a column or Series. This is often done to make data compatible for calculations or to ensure consistent storage formats.

**Examples:**
Change the datatype of a single column (`int64` β†’ `float64`):
```python
reviews.points.astype('float64')
```
![image](https://hackmd.io/_uploads/rJC8VKdale.png)

View the data type of the DataFrame index (the index has its own dtype, too):
```python
reviews.index.dtype
```
`dtype('int64')`

## 🧩 Missing Data

### πŸ”Ή Not a Number (`NaN`)

**🎯 Goal:** Identify, select, and replace missing or invalid entries within a DataFrame.

**βš™οΈ Mechanics:**
* Missing values are represented by **`NaN`** (Not a Number).
* For technical reasons, `NaN` values are always stored as the `float64` dtype.
* Pandas provides several tools to detect and fill missing values.

**Examples:**
To select `NaN` entries:
```python
reviews[pd.isnull(reviews.country)]
```
![image](https://hackmd.io/_uploads/HyVRvYdalx.png)

To select not-`NaN` entries:
```python
reviews[pd.notnull(reviews.country)]
```

### πŸ”Ή Replace Missing Values

**🎯 Goal:** Fill or replace missing (`NaN`) entries within a DataFrame or Series using different strategies.

**βš™οΈ Mechanics:** Use the `.fillna()` method to replace all missing values in a column or Series. This can be done by assigning a constant value (like `"Unknown"`) or using a fill strategy such as *backfill* or *forward fill*.

**Examples:**
To replace `NaN` with `Unknown`:
```python
reviews.region_2.fillna("Unknown")
```
![image](https://hackmd.io/_uploads/Sk0ouKupgl.png)

**Alternate Strategy**
Use *backfill* or *forward fill* to propagate valid values into missing entries.
```python
# Backfill: replace with next valid value
reviews.region_2.fillna(method='bfill')

# Forward fill: replace with previous valid value
reviews.region_2.fillna(method='ffill')
```
* Note: Newer versions of Pandas deprecate the `method=` argument in favor of the dedicated `.bfill()` and `.ffill()` methods.

### πŸ”Ή Replacing Specific or Invalid Values

**🎯 Goal:** Update or correct specific entries within a DataFrame or Series, whether missing or non-missing.

**βš™οΈ Mechanics:** Use the `.replace()` method to substitute one or more specific values with new ones. This is useful for correcting errors, updating outdated entries, or replacing placeholder text such as `"Unknown"` or `"Invalid"`.

**Examples:**
Update a reviewer's Twitter handle:
```python
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
```
* Note: The `replace()` method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like `"Unknown"`, `"Undisclosed"`, `"Invalid"`, and so on.
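The fill and replace strategies described above can be sketched on a small synthetic Series (the region names are invented; `.ffill()` is used here as the modern spelling of the forward-fill strategy):

```python
import pandas as pd

# Synthetic Series with gaps (values invented for illustration)
region = pd.Series(['Napa', None, None, 'Sonoma', None])

filled = region.fillna('Unknown')   # constant fill
forward = region.ffill()            # propagate the previous valid value

# .replace() swaps a specific (non-missing) sentinel for a new value
handles = pd.Series(['@kerinokeefe', '@vossroger'])
renamed = handles.replace('@kerinokeefe', '@kerino')
```

Note that each call returns a new Series; the original `region` and `handles` are left untouched.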
# Renaming and Combining ## 🧩 Renaming ### πŸ”Ή Renaming Columns **🎯 Goal:** Change one or more column names in a DataFrame to more meaningful or updated labels. **βš™οΈ Mechanics:** Use the `.rename()` method with the `columns` keyword and a dictionary mapping **old names β†’ new names**. This updates only the specified columns, leaving the rest of the DataFrame unchanged. **Examples:** Rename the `points` column to `score`: ```python reviews.rename(columns={'points': 'score'}) ``` ### πŸ”Ή Renaming Index Values **🎯 Goal:** Change the labels (names) of specific rows in a DataFrame’s index. **βš™οΈ Mechanics:** Use the `.rename()` method with the `index` keyword and a dictionary mapping **old index values β†’ new labels**. This updates only the specified index entries while leaving all other rows untouched. **Examples:** Rename the first two index values from `0` and `1` to `firstEntry` and `secondEntry`: ```python reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'}) ``` ### πŸ”Ή Renaming Axis Names **🎯 Goal:** Assign or modify the *name* of the DataFrame’s row and column axes (not the individual labels). **βš™οΈ Mechanics:** Use the `.rename_axis()` method to change the *axis names* for rows or columns. This affects the axis titles, not the row or column values themselves. **Examples:** Rename the **row axis** to `"wines"` and the **column axis** to `"fields"`: ```python reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns') ``` ## 🧩 Combining Pandas provides three main methods for combining data: 1. `concat()` β†’ Stack DataFrames or Series along an axis. 2. `join()` β†’ Combine DataFrames by matching index labels. 3. `merge()` β†’ SQL-style joins (not covered here). ### πŸ”Ή concat() **🎯 Goal:** Combine multiple DataFrames or Series into one by stacking them along an axis (row-wise or column-wise). **βš™οΈ Mechanics:** `pd.concat()` takes a list of DataFrames or Series and merges them together. 
By default, it stacks them vertically (`axis=0`). **Examples:** Combine two YouTube datasets with identical columns: ```python canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv") british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv") pd.concat([canadian_youtube, british_youtube]) ``` ![image](https://hackmd.io/_uploads/H1200KOalx.png) ### πŸ”Ή left.join() or right.join() **🎯 Goal:** Combine two DataFrames based on matching **index values**, similar to a SQL join. **βš™οΈ Mechanics:** Use the `.join()` method to merge two DataFrames where their indexes align. You can specify suffixes (`lsuffix`, `rsuffix`) to differentiate overlapping column names. If indexes don’t match perfectly, rows with missing matches will contain `NaN`. **My Breakdown**: - **Think of it as:** a *search-and-append* operation. - **Goal:** Keep all rows from the left DataFrame and pull matching data from the right. - **Key idea:** For each row in the left DataFrame, pandas searches the right one for matching key(s) and copies those values over. - **If no match:** the new columns get `NaN`, but the row is still kept. - **If multiple matches:** the left row is repeated once per match. - **Common use:** Start with the table that connects to all others (the β€œhub”) and add context with left merges. 
- **Example mental model:** like using a dictionary:
  - Keys = the columns you match on (e.g., `cust_id`, `service`)
  - Values = the extra data being attached

**Examples:**
Create two YouTube DataFrames β€” one for Canada and one for the UK β€” and align them on `title` and `trending_date`:
```python
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')
```
![image](https://hackmd.io/_uploads/BkSjyqu6gg.png)

---

# Notebook 7 Notes

https://github.com/Frodo-Swagginz/CSE6040-Fall-2025/blob/main/module%201/Topic%207%20-%20Tidying%20Data/(M1T7)%20Notebook%207.ipynb

## 🧩 Tidy Data

A tidy dataset is a dataset that can be organized into a 2-D table in which:
1. each column represents a *variable*
2. each row represents an *observation*
3. each entry of the table represents a single *value*, which may come from either categorical (discrete) or continuous spaces.

++Tibble++: Stands for tidy table.

## 🧩 Canonical Order

A tibble is considered canonical if:
1. The variables appear in sorted order by name, ascending from left to right.
2. The rows appear in lexicographically sorted order by variable, ascending from top to bottom.
3. The row labels (Y.index) go from 0 to n-1, where n is the number of observations.

## 🧩 Iris Dataset

We have loaded the Iris Dataset into Vocareum and use several Pandas operations to view the dataset's information.

## 🧩 Exercises

### πŸ”Έ Exercise 0

**🎯 Goal:** Calculate 'prevalence', the ratio of cases to population, for DataFrame G, which is guaranteed to have two columns 'cases' and 'population'. Return a new DataFrame, H, that is a copy of G and contains a new column 'prevalence'.

**← Inputs:** DataFrame G that has columns 'cases' and 'population'.

![image](https://hackmd.io/_uploads/BJkN-Aspee.png)

**β†’ Outputs:** DataFrame H, a copy of G, with a new column 'prevalence'.
We cannot use `G['prevalence'] = G['cases'] / G['population']`.

**βš™οΈ Pseudocode:**
1. Create a copy of G, name it H
2. Create our new column, `H['prevalence']`, and use `.apply()` on `H` to calculate prevalence: cases/population

**Solution**:
```python
### Solution - Exercise 0
def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    # Step 1: Create a copy to avoid editing the original G
    H = G.copy()
    # Step 2: H['prevalence'] assigns a new column within H
    ## .apply() on H, using axis=1 and result_type='expand'. axis=1 applies the function ACROSS the columns, i.e. row by row
    ## 'expand' turns list-like results into columns (shouldn't apply here, but it's a safety valve)
    H['prevalence'] = H.apply(lambda x: x['cases'] / x['population'], axis=1, result_type='expand')
    return H
```

**βš™οΈ Key Pandas Concepts:**
* `G.copy()` β†’ creates a copy of the DataFrame
* `H['prevalence'] =` β†’ creates a new column labeled 'prevalence'
* `.apply(..., axis=1, result_type='expand')` β†’ `axis=1` computes a new value per row based on the other columns; `result_type='expand'` converts list-like results into columns (not necessary here, but safer)

**πŸ“˜ References**:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

### πŸ”Έ Exercise 1

**🎯 Goal:** Given a tibble X, return a copy Y that is canonically structured.

**← Inputs:** DataFrame X that is non-canonical.

![image](https://hackmd.io/_uploads/r16B7RsTel.png)

**β†’ Outputs:** DataFrame Y, a copy of X, that is canonical.

**βš™οΈ Pseudocode:**
1. Alphabetically sort the list-like object of the column headers
2. Create a copy of X, Y, that sorts data on the sorted headers (Think vertically sorted)
3. Sort the *values* of Y based on the sorted headers
🧠 By sorting on the sorted headers, Pandas will appropriately handle any 'ties' that occur within the first column based on the contents of the second
4.
Reset the index column

**Solution**:
```python
### Solution - Exercise 1
def canonicalize_tibble(X):
    # Enforce Property 1: The variables appear in sorted order by name, ascending from left to right.
    ## X.columns accesses the METADATA that holds the column headers (i.e. headers are labels, not rows/columns)
    ## We sort those labels alphabetically using sorted()
    ## Create a copy of X, called Y, whose columns are reordered by the sorted headers.
    ## Thoughtful Insight: Think of this step as moving whole columns (vertical strips of data) left to right
    var_names = sorted(X.columns)
    Y = X[var_names].copy()

    # Enforce Property 2: Rows must be sorted
    ## We have reordered the columns; now we reorder the rows.
    ## .sort_values() reorders rows based on column values
    ### by= provides the sorting parameter
    ### ascending=True sorts from lowest to highest
    Y = Y.sort_values(by=var_names, ascending=True)

    # Enforce Property 3: Reset row labels
    Y = Y.reset_index(drop=True)
    return Y
```

**βš™οΈ Key Pandas Concepts:**
* `X.columns` β†’ returns a list-like object containing the column headers of X
* `sorted()` β†’ sorts a list alphabetically
* `sort_values(by=sorted(X.columns), ascending=True)` β†’ sorts the rows by the listed columns from smallest to largest (`ascending=True`); `by=` provides our sorting rule, which here is the alphabetically sorted header list
* `.reset_index(drop=True)` β†’ returns a DataFrame with a fresh 0..n-1 index; `drop=True` discards the old index instead of keeping it as a column (note: this is not in-place unless you pass `inplace=True`)

**πŸ“˜ References**:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

### πŸ”Έ Exercise 2

**🎯 Goal:** Given two tibbles, A and B, determine if they are equivalent. Equivalence is defined as having identical variables and observations, but allows the variables and observations to be out of order.
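To make "equivalent but out of order" concrete, here is a toy pair (hypothetical data) that compares equal once both are put into canonical order:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical tibbles: same variables and observations, shuffled order
A = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
B = pd.DataFrame({'y': ['b', 'a'], 'x': [2, 1]})

def canonicalize(df):
    cols = sorted(df.columns)            # Property 1: sort columns by name
    out = df[cols].sort_values(by=cols)  # Property 2: sort rows lexicographically
    return out.reset_index(drop=True)    # Property 3: fresh 0..n-1 index

assert_frame_equal(canonicalize(A), canonicalize(B))  # raises nothing: equivalent
```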
**← Inputs:** DataFrames A and B.

![image](https://hackmd.io/_uploads/BkNcRJ26xg.png)

**β†’ Outputs:** DataFrames AA and BB are passed into a `try-except` block that checks `assert_frame_equal(AA, BB)` and returns the appropriate statement depending on whether the assertion passes.

**βš™οΈ Pseudocode:**
1. Put each DataFrame, A and B, into canonical order
2. Use a `try-except` block to either verify equivalence or report the `AssertionError`

**Solution**:
```python
### Solution - Exercise 2
def tibbles_are_equivalent(A, B):
    from pandas.testing import assert_frame_equal

    # Step 1: convert both DataFrames to canonical order
    # Property 1
    var_names_a = sorted(A.columns)
    var_names_b = sorted(B.columns)
    AA = A[var_names_a].copy()
    BB = B[var_names_b].copy()

    # Property 2
    AA = AA.sort_values(by=var_names_a, ascending=True)
    BB = BB.sort_values(by=var_names_b, ascending=True)

    # Property 3
    AA = AA.reset_index(drop=True)
    BB = BB.reset_index(drop=True)

    try:
        assert_frame_equal(AA, BB)
        return 'Tibbles are equivalent!'
    except AssertionError as error_message:
        return str(error_message)
```

**βš™οΈ Key Pandas Concepts:**
* `assert_frame_equal(df1, df2)` β†’ compares two DataFrames and raises an `AssertionError` describing any differences

**πŸ“˜ References**: https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_frame_equal.html

### πŸ”Έ Exercise 3

**🎯 Goal:** Manually `melt` a 'wide' DataFrame to tidy the *values* and turn columns into rows. Melting involves:
1. Extract the melted *column* names into a new variable (the key).
2. Convert the values associated with those columns into a new variable as well.

**← Inputs:**
* A DataFrame, `df`
* A list of the *column* names that will serve as values, `col_vals`
* Name of the new variable, `key`
* Name of the column to hold the values, `value`

| country | 1999 | 2000 |
| -------- | -------- | -------- |
| Afghanistan | 745 | 2666 |
| Brazil | 37737 | 80488 |
| China | 212258 | 213766 |

**β†’ Outputs:** A manually melted DataFrame, Y, that is a copy of the original.
(The result is in 'long' format.)

**βš™οΈ Pseudocode:**
1. Create a copy of the original DataFrame
2. Identify the identifier column(s)
   * Compare all column names in `a` with `col_vals`
   * Keep the column(s) that are *not* in `col_vals`
   * Assign that column name to `id_col`
3. Initialize an empty list (iterable) to collect the new DataFrames
4. Loop through each value in `col_vals` (1999, 2000)
   * For each year:
     * Create a new temporary DataFrame (`temp`) with:
       * `"country"`: taken from `a[id_col]`
       * `"year"`: set to the current year (loop variable)
       * `"cases"`: the values from that year's column (`a[year]`)
     * Append `temp` to `frames`
5. Concatenate all DataFrames in `frames` into one
   * Use `pd.concat(frames, ignore_index=True)`
   * This stacks each mini DataFrame vertically into one tidy table
6. Return the final long-form DataFrame

**Solution**:
```python
### Solution - Exercise 3
def melt(df, col_vals, key, value):
    assert type(df) is pd.DataFrame

    # Step 1: Create a copy of our original DataFrame
    a = df.copy()

    # Step 2: Determine the identifier column(s)
    ## Note: col_vals = [1999, 2000]
    ## Note: a.columns = Index(['country', 1999, 2000])
    ## Loop over a.columns, collect the column(s) not present in col_vals into a list, then grab the first one
    id_col = [c for c in a.columns if c not in col_vals][0]  # assumes one id column, returns 'country'

    # Step 3: For each column name in col_vals, build a small DataFrame
    ## Note: Each temp DataFrame should look like:
    ## pd.DataFrame({'country': ['Afghanistan', 'Brazil', 'China'],
    ##               'year': 1999,
    ##               'cases': [745, 37737, 212258]})
    ## The only thing that changes on the second pass is that 'year' is 2000 and the cases become the 2000 values

    ## Create a blank list to hold our values (DataFrames look like df({key: [values]}))
    frames = []
    # Loop over the years in col_vals
    for year in col_vals:
        # Each key:value pair below defines a new column in our temporary DataFrame
        temp = pd.DataFrame({
            # id_col ('country') is the name of the new column in temp;
            # a[id_col] pulls the country column from a
            id_col: a[id_col],
            # key ('year') is the name; year is the current loop value, hence there is no a['year']
            key: year,
            # value ('cases') is the name; a[year] refers to the column name from the original.
            # Our loop is over [1999, 2000], so it pulls the 1999 column on the first pass and the 2000 column on the second.
            value: a[year]
        })
        frames.append(temp)

    # Step 4: concatenate our DataFrames into one using pd.concat(), which accepts an iterable
    Y = pd.concat(frames, ignore_index=True)
    return Y
```

**βš™οΈ Key Pandas Concepts:**
* `pd.DataFrame({})` β†’ creates a new DataFrame
* `pd.concat(iterable, ignore_index=True)` β†’ combines the DataFrames stored in an iterable into one DataFrame; `ignore_index=True` resets the row labels
* `DataFrame.columns` β†’ returns an index-like list of column names
* `a[id_col]` (dynamic column selection) β†’ access the data for a column using its name stored in a variable

**πŸ“˜ References**: https://pandas.pydata.org/docs/reference/api/pandas.concat.html

### πŸ”Έ Exercise 4

**🎯 Goal:** Manually `cast` a too-long DataFrame where observations are split across multiple rows. Tidy up the DataFrame by placing observations in a column under their relevant variable.

**← Inputs:**
* A DataFrame, `df`
* A `key` column: its values will become the new headers
* A `value` column: its values will fill the new columns
* A `join_how` parameter for `pd.merge`: determines how rows align when merging

![image](https://hackmd.io/_uploads/rJx24znaxx.png)

**β†’ Outputs:** A wide DataFrame in which:
* Each unique key becomes its own column.
* Fixed variables remain.
* Values are aligned by the fixed variables.
* Merge behaviour is controlled by `join_how`

**βš™οΈ Pseudocode:**
1. Identify fixed variables
   * Extract columns that are not the key or value and keep them constant
2. Initialize an empty DataFrame with only our fixed variables
3.
Loop through each unique key:
   * Subset `df` to only rows matching that key
   * Keep the `fixed_vars` and `value` column
   * Rename the `value` column to the current key's name
   * Merge this small DataFrame with the growing tibble of the fixed variables using the supplied `how=join_how`

**Solution**:
```python
### Solution - Exercise 4
def cast(df, key, value, join_how='outer'):
    """Casts the input data frame into a tibble,
    given the key column and value column.
    """
    assert type(df) is pd.DataFrame
    assert key in df.columns and value in df.columns
    assert join_how in ['outer', 'inner']

    # Step 1: Identify the fixed variables and keep them constant
    ## df.columns returns a list of headers: Index(['country', 'year', 'type', 'count'])
    ## df.columns.difference returns the headers that are not in the supplied list: Index(['country', 'year'])
    fixed_vars = df.columns.difference([key, value])

    # Step 2: Create an empty tibble (DataFrame) to hold our new values
    ## Define the initial columns, fixed_vars
    tibble = pd.DataFrame(columns=fixed_vars)  # start as an empty tibble

    # Step 3: Loop through each UNIQUE key
    ## For each unique entry in the key column (e.g., 'cases', 'population')
    for k in df[key].unique():
        ## df[df[key] == k] builds a True/False mask to pull the rows of df whose key column matches the current loop value ('cases' or 'population')
        ### i.e. we're only pulling rows where key == 'cases', and then where key == 'population'
        ## [list(fixed_vars) + [value]].copy() tells pandas which columns to keep.
        ## This builds the column list ['country', 'year', 'value']
        # So: take only the rows where the key equals the current k value, and then,
        # from those rows, keep only the columns ['country', 'year', 'value']
        sub = df[df[key] == k][list(fixed_vars) + [value]].copy()

        ## Rename the value column to the current key name
        sub = sub.rename(columns={value: k})

        ## Merge this subset into the growing tibble
        tibble = pd.merge(tibble, sub, on=list(fixed_vars), how=join_how)
    return tibble
```

**βš™οΈ Key Pandas Concepts:**
* `df.columns.difference([key, value])` β†’ returns the column names not in `[key, value]`
* `df[key].unique()` β†’ gets all unique key values
* `DataFrame.rename()` β†’ renames columns dynamically within each loop
* `pd.merge(..., how=join_how)` β†’ merges two DataFrames on shared columns; `how` controls whether to perform an inner or outer join
* `pd.merge()` β†’ combines horizontally based on shared keys
* `df[rows][columns]` β†’ is how we access DataFrame values

**πŸ“˜ References**:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html
https://pandas.pydata.org/docs/reference/api/pandas.merge.html

### πŸ”Έ Exercise 5

**🎯 Goal:** Split a text column into multiple new columns using a splitter function.

**← Inputs:**
* A DataFrame, `df`, pictured below
* A column name to split, `key`
* A list of new column names, `into`
* A regex splitter function, `splitter=default_splitter`

![image](https://hackmd.io/_uploads/Bk4wDm3Tex.png)

**β†’ Outputs:** A new DataFrame with:
* The original DataFrame's columns, with the exception of `key`
* New columns listed in `into`, filled from the split

**βš™οΈ Pseudocode:**
1. Apply the splitter to the `key` column
2. Turn the resulting Series of lists into a DataFrame
3. Drop the old `key` column from the original DataFrame
4.
Concatenate side-by-side (by index), stacking horizontally

**Solution**:
```python
### Solution - Exercise 5
def separate(df, key, into, splitter=default_splitter):
    """Given a data frame, separates one of its columns,
    the key, into new variables.
    """
    assert type(df) is pd.DataFrame
    assert key in df.columns

    # Hint: http://stackoverflow.com/questions/16236684/apply-pandas-function-to-column-to-create-multiple-new-columns

    # Step 1: Apply the splitter to each value in the key column (e.g. the 'rate' column), returning a Series of lists
    ## '745/19987071' β†’ ['745', '19987071']
    split_cols = df[key].apply(splitter)

    # Step 2: Convert the resulting lists into a new DataFrame with the specified column names
    ## into = ['cases', 'population']
    ## .tolist() converts a Pandas Series into a Python list
    # So, create a new DataFrame where the data is:
    # 0  ['745', '19987071']
    # 1  ['2666', '20595360'] ...
    # and the column labels come from into = ['cases', 'population'],
    # with element [0] of each list landing in column [0]
    split_df = pd.DataFrame(split_cols.tolist(), columns=into)

    # Step 3: Combine the split columns with the original data frame (drop the original key)
    out = pd.concat([df.drop(columns=[key]), split_df], axis=1)
    return out
```

**βš™οΈ Key Pandas Concepts:**
* `df[key]` β†’ select one column (Series)
* `.apply(func)` on a Series β†’ run `func` element-wise
* `.tolist()` β†’ turns a Series of lists into a list of lists
* `pd.DataFrame(data, columns=...)` β†’ create a new DataFrame by supplying data and columns (columns can be a list)
* `df.drop(columns=[...])` β†’ remove the supplied columns
* `pd.concat([...], axis=1)` β†’ horizontal join; aligns by index

**πŸ“˜ References**:
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
https://pandas.pydata.org/docs/reference/api/pandas.Series.to_list.html

### πŸ”Έ Exercise 6

**🎯 Goal:** Combine multiple columns into a single column using a supplied join function.
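For intuition, a minimal sketch of the "unite" idea. The joiner below is a stand-in of my own; the notebook's `str_join_elements` is assumed to behave similarly:

```python
import pandas as pd

# Hypothetical joiner, standing in for the notebook's str_join_elements
def join_with_underscore(values):
    return '_'.join(str(v) for v in values)

# Toy data: combine two string columns into one
df = pd.DataFrame({'century': ['19', '20'], 'two_digit_year': ['99', '07']})
df['year'] = df[['century', 'two_digit_year']].apply(join_with_underscore, axis=1)
df = df.drop(columns=['century', 'two_digit_year'])
print(df['year'].tolist())  # ['19_99', '20_07']
```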
**← Inputs:**
* A DataFrame, `df`
* A list of column names in `df` to combine, `cols`
* A name for the new column, `new_var`
* The supplied helper function, `combine=str_join_elements`

![image](https://hackmd.io/_uploads/rJnpiX3Tll.png)

**β†’ Outputs:** A new DataFrame that:
* Combines the *values* within the supplied `cols` list and names the resulting Series `new_var`
* Drops the Series listed in `cols`

**βš™οΈ Pseudocode:**
1. Create a copy of the DataFrame
2. Apply the combine function row-by-row to the selected columns and store the result as a new column
3. Drop the original columns that were merged

**Solution**:
```python
### Solution - Exercise 6
def unite(df, cols, new_var, combine=str_join_elements):
    # Hint: http://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe
    """Combines multiple columns into a single new column."""
    assert isinstance(df, pd.DataFrame)

    # Step 1: Create a copy
    a = df.copy()

    # Step 2: Apply combine() to each row's selected columns
    ## a[new_var] β†’ creates a new column with the name found within new_var
    ## a[cols] β†’ within the DataFrame a, access the columns listed in the cols variable
    ## .apply(lambda x: combine(x), axis=1) β†’ row-by-row, apply the function to the identified cols
    a[new_var] = a[cols].apply(lambda x: combine(x), axis=1)

    # Step 3: Drop the original columns after combining
    a = a.drop(columns=cols)
    return a
```

**βš™οΈ Key Pandas Concepts:**
* `df['column_name'] =` β†’ set up a new column within the DataFrame
* `= df['column_name']` β†’ access the supplied `column_name` within the DataFrame
* `.apply(..., axis=1)` β†’ apply a function row-by-row to operate across multiple columns
* `df.drop(columns=['column_name'])` β†’ drop the supplied `column_name` from the `df`

**πŸ“˜ References**: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

### πŸ”Έ Exercise 7

**🎯 Goal:** Given a wide table `who`, turn it into a tidy table with the
following properties:
* All the `'new...'` columns of `who` become *values* of a *single variable*, `case_type`. Store the counts associated with each `case_type` value as a new variable called `'count'`.
* Remove the `iso2` and `iso3` columns, since they are redundant with `country` (which you should keep!).
* Keep the `year` column as a variable.
* Remove all not-a-number (NaN) counts. Hint: You can test for a NaN using Python's `math.isnan()`.
* Convert the counts to integers. (Because of the presence of NaNs, the counts will otherwise be treated as floating-point values, which is undesirable since you do not expect to see non-integer counts.)

**← Inputs:** A DataFrame, `who`, with identifier columns and many `"new..."` measurement columns.

![image](https://hackmd.io/_uploads/r1eT-426xe.png)

**β†’ Outputs:** `who2`, a tidy DataFrame with:
* identifiers: `country`, `year`
* variable name: `case_type`, whose values come from the columns labeled `"new..."`
* measurement: `count`, which holds the values from the corresponding `"new..."` column

**βš™οΈ Pseudocode:**
1. Create a copy of the original
2. Drop redundant columns
3. Identify and store fixed variables
4. Create a list of all column names that start with "new"
5. Initialize an empty list to hold mini DataFrames
6. Loop through each `"new..."` column:
   * Select the fixed id columns plus the current loop value
   * Rename the current column (`c`) to `count`
   * Create a new column `case_type` and assign it the value `c`
   * Remove rows with missing counts (i.e. NaN values)
   * Reorder columns
   * Append the temp DataFrame to the list
7. Concatenate all temp DataFrames
8. Convert all values within `count` to type integer
9. Sort the values and reset the index

**Solution**:
```python
def tidy_data(who):
    assert isinstance(who, pd.DataFrame)

    # Step 1.
    # Create a copy
    a = who.copy()

    # Step 2: Drop redundant columns
    a = a.drop(columns=['iso2', 'iso3'])

    # Step 3: Identify and store fixed variables
    fixed = ['country', 'year']

    # Step 4: Build a list of all column headers that start with 'new'
    case_cols = sorted([c for c in a.columns if c.startswith('new')])

    # Step 5: Create a list to store temporary DataFrames
    ## Initiate an empty list
    frames = []
    ## Loop over each item within case_cols (list of "new..." variables)
    for c in case_cols:
        ## fixed = ['country', 'year']
        ## [c] β†’ wraps the current loop value of case_cols in a list
        ### The first element of case_cols is 'new_ep_f014'
        ## a[fixed + [c]] β†’ returns a small DataFrame with the columns defined above
        temp = a[fixed + [c]].copy()

        ## Rename c to 'count'
        temp = temp.rename(columns={c: 'count'})

        ## Create a new column 'case_type' and assign it the value of c
        temp['case_type'] = c

        ## temp['count'].notna() β†’ returns a Series of Booleans that is False where the value is NaN
        ## temp[temp['count'].notna()] β†’ only keep the rows that are True
        temp = temp[temp['count'].notna()]

        ## fixed = ['country', 'year']
        ## Reorder and keep only these columns, in this order
        temp = temp[fixed + ['case_type', 'count']]
        frames.append(temp)

    # Step 6: Concatenate all DataFrames within the list frames
    who2 = pd.concat(frames, ignore_index=True)

    # Step 7: Convert all values within 'count' to type integer
    who2['count'] = who2['count'].astype(int)

    # Step 8: Sort the values deterministically and reset the index
    who2 = who2.sort_values(['country', 'year', 'case_type']).reset_index(drop=True)
    return who2
```

**βš™οΈ Key Pandas Concepts:**
* `cols = [c for c in df.columns if c.startswith('new')]` β†’ list comprehension that selects columns by pattern
* `df.rename(columns={old_col_name: 'new_column_name'})` β†’ rename a column
* `df[df['col_name'].notna()]` β†’ Boolean filter to remove NaN values
* `df['col_name'].astype(int)` β†’ converts the values within `col_name` to the
specified type

**πŸ“˜ References**:

### πŸ”Έ Exercise 8

**🎯 Goal:** Take the DataFrame `who2` and split the `case_type` column into three new variables:
* `'type'`, to hold the TB type, having possible values of rel, ep, sn, and sp;
* `'gender'`, to hold the gender as a string having possible values of female and male; and
* `'age_group'`, to hold the age group as a string having possible values of 0-14, 15-24, 25-34, 35-44, 45-54, 55-64, and 65+.

**← Inputs:** The DataFrame `who2` from Exercise 7.

**β†’ Outputs:** A new DataFrame, `who3`, that splits `case_type` into three separate columns.

**βš™οΈ Pseudocode:**
1. Extract the 3 components using regex capture groups.
2. Map gender and age to the required labels.
3. Concatenate the final DataFrame.

**Solution**:
```python
### Solution - Exercise 8
def split_who(who2):
    assert isinstance(who2, pd.DataFrame)

    # Step 1: Extract type, gender, age from case_type (handles both 'new_sp_m014' and 'newrel_m65')
    ## df['col_name'].str.extract(r'')
    parts = who2['case_type'].str.extract(r'^new_?(?P<type>rel|ep|sn|sp)_(?P<gender>[mf])(?P<age>\d+)$')

    # Step 2: Map gender and age to the required labels
    gender_map = {'m': 'male', 'f': 'female'}
    age_map = {
        '014': '0-14',
        '1524': '15-24',
        '2534': '25-34',
        '3544': '35-44',
        '4554': '45-54',
        '5564': '55-64',
        '65': '65+'
    }
    parts['gender'] = parts['gender'].map(gender_map)
    parts['age_group'] = parts['age'].map(age_map)

    # Step 3: Assemble the final tibble
    who3 = pd.concat(
        [who2[['country', 'year', 'count']].reset_index(drop=True),
         parts[['type', 'gender', 'age_group']].reset_index(drop=True)],
        axis=1
    )

    # Optional: order columns neatly
    who3 = who3[['count', 'country', 'year', 'type', 'gender', 'age_group']]
    return who3
```

**βš™οΈ Key Pandas Concepts:**
* `df['col_name'].str.extract(r'')` β†’ extract information from a string using regex capture groups
* `df['age_group'] = df['age'].map(age_map)` β†’ maps each raw code to its intended label
(i.e. '014' maps to '0-14')

**πŸ“˜ References**: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html
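Pandas also ships built-in counterparts to some of these hand-rolled exercises, which make handy cross-checks. A minimal sketch with toy data (column names are my own, not the notebook's):

```python
import pandas as pd

# Built-in counterpart to Exercise 3's manual melt (loop + pd.concat)
wide = pd.DataFrame({'country': ['Afghanistan', 'Brazil'],
                     1999: [745, 37737],
                     2000: [2666, 80488]})
long = pd.melt(wide, id_vars=['country'], var_name='year', value_name='cases')
print(long.shape)  # (4, 3)

# Built-in counterpart to Exercise 5's separate: split a string column on '/'
rates = pd.DataFrame({'rate': ['745/19987071', '2666/20595360']})
parts = rates['rate'].str.split('/', expand=True)
parts.columns = ['cases', 'population']
print(parts['cases'].tolist())  # ['745', '2666']
```

These are worth knowing for the exam as quick sanity checks against the manual implementations, even where the notebook requires writing the logic by hand.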