# Pandas - Kaggle & Notebook 7
**Disclaimer**: These notes were created with the sole intention of creating an 'all-in-one' place to store information to use during the CSE6040 exams. As such I have included my solutions to the Notebooks within these notes and wrapped them in a spoiler block.
If another individual chooses to use these notes for their own educational purposes please be aware that the intentional copying of my solutions for use on your homework is a violation of the Georgia Tech Code of Conduct.
Information on the Code of Conduct can be found here: [Georgia Tech Code of Conduct](https://policylibrary.gatech.edu/student-life/student-code-conduct)
## 🧭 Emoji Legend (Note Formatting Guide)
<details>
<summary>🧭 Click to expand Emoji Legend</summary>
| Icon | Meaning / Use Case | Example Usage |
|:----:|--------------------|----------------|
| 🧩 | **Main topic or section header** - introduces a major theme or concept | `## 🧩 Groupwise Analysis` |
| 🔸 | **Major concept or subsection** - represents a key idea under the main topic | `### 🔸 Why Group Data?` |
| 🔹 | **Detailed operation or example** - used for methods, step-by-step logic, or code examples | `### 🔹 groupby() minimum` |
| ⚙️ | **Mechanics** - explains what's happening under the hood | "⚙️ Mechanics: Pandas scans column values and builds sub-DataFrames." |
| 🎯 | **Goal / Purpose** - clarifies what we're trying to achieve | "🎯 Goal: Count how many rows belong to each group." |
| 💡 | **Tip / Takeaway / Insight** - highlights key points, advice, or conceptual notes | "💡 Takeaway: Use .agg() for multiple summary stats." |
| ✅ | **Shortcut / Confirmation** - shows equivalent methods or quick wins | "✅ Equivalent: `reviews.points.value_counts()`" |
| 🧠 | **Conceptual Insight** - emphasizes mental models or deeper understanding | "🧠 Think of groupby() as split → apply → combine." |
| 📘 | **Reference / Documentation Link** - links to external docs or sources | "📘 [Pandas MultiIndex Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)" |
| 🪄 | **Trick / Tip** - optional visual for clever techniques or shortcuts | "🪄 Use .apply() to manipulate each group independently." |
| 🧭 | **Legend / Navigation Aid** - used for orientation or overview sections | "## 🧭 Emoji Legend (Note Formatting Guide)" |
---
🪄 *Tip:* All of these are Unicode emojis that work natively in HackMD, GitHub, Notion, and most Markdown renderers.
You can insert them using your OS emoji picker:
- **Windows:** `Win + .`
- **macOS:** `Cmd + Ctrl + Space`
</details>
## 🧭 Table of Contents
[TOC]
# Creating, Reading and Writing
## 🧩 Getting Started
### 🔸 Why Data Creation and I/O Matter
Before you can analyze data, you need to create, import, or export it in a structured way. This section covers how to set up Pandas, create basic data structures, and read or write files so you can start exploring data efficiently.
### 🔹 Getting Started
**🎯 Goal:** Initialize Pandas and prepare your environment for data analysis.
**Setup:**
```python
import pandas as pd
```
## 🧩 Creating Data
### 🔸 Why Create Data Manually?
Before analyzing real datasets, it's important to understand how to build DataFrames and Series from scratch. This gives you full control for testing, debugging, or learning Pandas operations on simple examples before scaling up to large data sources.
### 🔹 Creating Data
**🎯 Goal:** Understand how to create Pandas data structures, DataFrames and Series, as the foundation for data analysis.
**⚙️ Mechanics:**
Pandas provides two core objects:
* DataFrame - a 2-D labeled table (rows + columns).
* Series - a 1-D labeled array (a single column of data).
You can create them manually or from external data sources (like CSVs, SQL tables, or JSON files).
### 🔹 DataFrame
**🎯 Goal:** Create and understand a Pandas `DataFrame`, the table-like structure used for most operations.
**⚙️ Mechanics:**
A `DataFrame` is built using the `pd.DataFrame()` constructor.
You typically pass in a dictionary, where:
* Keys → Column names
* Values → Lists (column data)
**Examples:**
```python
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
```

Rename the rows using the `index` parameter:
```python
pd.DataFrame({
    'Bob': ['I liked it.', 'It was awful.'],
    'Sue': ['Pretty good.', 'Bland.']
}, index=['Product A', 'Product B'])
```

**💡 Takeaway:**
A **DataFrame** is a collection of **Series** aligned by shared row labels (the index). Use `pd.DataFrame()` to build tabular data directly from Python structures like lists or dicts.
### 🔹 Series
**🎯 Goal:** Learn how to create and label a one-dimensional data column.
**⚙️ Mechanics:**
A `Series` is created with `pd.Series()`, which stores data in a single labeled column. You can assign labels to rows using the `index` parameter, and a name to the Series itself with `name`.
**Examples:**
```python
pd.Series([1, 2, 3, 4, 5])
```
```python
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
```

## 🧩 Reading Data Files
### 🔸 Why Read External Data?
Most real-world analysis starts with **importing datasets** rather than creating them manually. Reading data from external files (like CSV, Excel, or databases) allows you to analyze large, structured datasets efficiently within Pandas.
### 🔹 Importing Data
**🎯 Goal:** Learn how to **import, inspect, and manage tabular data** from external files (like CSVs) using Pandas.
**⚙️ Mechanics:**
Use `pd.read_csv()` to load data into a DataFrame, then inspect it with built-in methods like `.shape` and `.head()`. You can also specify an index column and export your cleaned dataset using `.to_csv()`.
**Examples:**
Read a CSV into a DataFrame:
```python
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv")
```
Check dataset dimensions:
```python
wine_reviews.shape
```
Preview first few rows:
```python
wine_reviews.head()
```
Specify a column as the index when reading:
```python
wine_reviews = pd.read_csv(
"../input/wine-reviews/winemag-data-130k-v2.csv",
index_col=0
)
```
Save DataFrame to a new CSV:
```python
wine_reviews.to_csv("my_dataframe.csv", index=True)
```
## 💡 Summary of Creating, Reading and Writing Data
| Operation | Description | Example |
|------------|--------------|----------|
| `import pandas as pd` | Imports the Pandas library for data manipulation and analysis. | `import pandas as pd` |
| `pd.DataFrame()` | Creates a DataFrame (table) from a dictionary of lists or arrays. | `pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})` |
| `index=` (in DataFrame) | Assigns custom row labels when creating a DataFrame. | `pd.DataFrame({'Bob': [...], 'Sue': [...]}, index=['A', 'B'])` |
| `pd.Series()` | Creates a one-dimensional labeled array (like a single column). | `pd.Series([30, 35, 40], index=['2015', '2016', '2017'], name='Sales')` |
| `.name` | Assigns or displays the name of a Series. | `s.name = 'Revenue'` |
| `pd.read_csv()` | Reads a CSV file into a DataFrame. | `pd.read_csv('data.csv')` |
| `.shape` | Returns the tuple (rows, columns) of the DataFrame. | `df.shape` |
| `.head()` | Displays the first five rows of the DataFrame. | `df.head()` |
| `index_col=` | Sets a specific column as the DataFrame's index during import. | `pd.read_csv('data.csv', index_col=0)` |
| `.to_csv()` | Exports a DataFrame to a CSV file. | `df.to_csv('output.csv', index=True)` |
---
# Indexing, Selecting & Assigning
## 🧩 Native Accessors
**🎯 Goal:** Learn how to access columns and individual entries within a Pandas DataFrame using native Python-style syntax.
**⚙️ Mechanics:**
* Access DataFrame columns using dot notation (`df.column`) or bracket notation (`df['column']`).
* Access individual values using another pair of brackets (`df['column'][row_index]`).
* Both return a Series when selecting a column, or a scalar when selecting one value.
### 🔹 Accessing Columns
Access a column as a Series using dot notation:
```python
reviews.country
```
Access a column as a Series using bracket notation:
```python
reviews['country']
```
These are functionally the same - but bracket notation is considered safer, and is required if the column name contains spaces or special characters.
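For example, a hypothetical column whose name contains a space can only be reached with brackets:
```python
fruits = pd.DataFrame({'apple count': [3, 5]})
fruits['apple count']   # works
# fruits.apple count    # not valid Python syntax
```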
### 🔹 Accessing Individual Values
Access a single entry by chaining the two accessors:
```python
reviews['country'][0]
```
**💡 Think of it like:**
Grab the `country` Series from `reviews`, then select the first value.
## 🧩 Indexing in Pandas
**🎯 Goal:** Learn how to select specific rows and columns in a DataFrame using index-based (`.iloc`) and label-based (`.loc`) selection.
**⚙️ Mechanics:**
* `.iloc` → selects data by position (numeric indices).
* `.loc` → selects data by label (index or column names).
* Both follow the pattern `df.iloc[rows, columns]` / `df.loc[rows, columns]`: row selector first, column selector second (the opposite of native Python indexing, which is column-first).
### 🔹 Index-Based Selection (`.iloc`)
Access data by integer position, similar to standard Python indexing.
```python
reviews.iloc[0] # First row
reviews.iloc[:, 0] # First column
reviews.iloc[:3, 0] # First 3 rows, first column
reviews.iloc[[0,1,2], 0] # Specific rows
reviews.iloc[-5:] # Last 5 rows
```
### 🔹 Label-Based Selection (`.loc`)
Access data by index labels or column names - not by numeric position.
```python
reviews.loc[0, 'country'] # Row label 0, column 'country'
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]
```
**🧠 Concept:** `.loc` relies on labels, not positions, so it's aware of named indices or custom labels.
**💡 Tip:** Works better for most real datasets, since indices and columns often have semantic meaning (e.g., "Country", "Year", etc.).
### 🔹 Choosing Between .iloc and .loc
**⚙️ Key Difference** (see the sketch below):
* `.iloc` → position-based and exclusive end range
* `.loc` → label-based and inclusive end range
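A quick standalone sketch of the off-by-one difference (assuming a default integer index):
```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})  # default integer index 0..9

len(df.iloc[0:5])  # 5 rows: positions 0-4, end excluded
len(df.loc[0:5])   # 6 rows: labels 0-5, end included
```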
### 💡 Summary of Indexing in Pandas
| Operation | Description | Example |
|------------|--------------|----------|
| **`df.iloc[i]`** | Selects a single row by **integer position**. | `reviews.iloc[0]` |
| **`df.iloc[:, j]`** | Selects a single column by **integer position**. | `reviews.iloc[:, 0]` |
| **`df.iloc[a:b, c]`** | Selects a **range of rows** (end **excluded**). | `reviews.iloc[:3, 0]` |
| **`df.iloc[[a,b,c], d]`** | Selects **specific rows** by list of positions. | `reviews.iloc[[0,1,2], 0]` |
| **`df.iloc[-n:]`** | Selects the **last n rows** using negative indices. | `reviews.iloc[-5:]` |
| **`df.loc[row_label, column_label]`** | Selects by **index label** and **column name**. | `reviews.loc[0, 'country']` |
| **`df.loc[:, ['col1','col2']]`** | Selects multiple columns by **label**. | `reviews.loc[:, ['taster_name','points']]` |
| **`df.loc[row_start:row_end]`** | Selects a **range of labels** (end **included**). | `reviews.loc[0:10]` |
| **Range behavior** | `.iloc` is **exclusive**; `.loc` is **inclusive**. | `iloc[0:10]` → 10 rows<br>`loc[0:10]` → 11 rows |
| **Typical Use Case** | `.iloc` → position-based selection.<br>`.loc` → label-based selection. | — |
## 🧩 Manipulating the Index
**🎯 Goal:** Change or redefine the DataFrame's index to make data easier to organize or access.
**⚙️ Mechanics:**
Use `set_index()` to replace the default numeric index with one or more meaningful columns.
```python
reviews.set_index("title")
```
**💡 Tip:** Useful when the new index better represents the data's structure (e.g., unique IDs, product names, dates).
## 🧩 Conditional Selection
**🎯 Goal:** Filter rows based on conditions, like comparisons, inclusion, or null checks.
**⚙️ Mechanics:**
Combine Boolean expressions with `&` (and), `|` (or), and use `.isin()` or `.notnull()` for advanced filtering.
### 🔹 Basic Conditional Filtering
```python
reviews.loc[reviews.country == 'Italy']
```
Returns all rows where `country` is `"Italy"`.
### 🔹 Multiple Conditions
Use parentheses around each condition:
```python
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
```
➡️ Wines from Italy and with 90+ points.
Use `|` for logical OR:
```python
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
```
➡️ Wines from Italy or with 90+ points.
### 🔹 Using `isin()`
```python
reviews.loc[reviews.country.isin(['Italy', 'France'])]
```
➡️ Select rows where country is either `Italy` or `France`.
### 🔹 Using `isnull()` / `notnull()`
```python
reviews.loc[reviews.price.notnull()]
```
➡️ Select rows where the `price` is not missing.
**💡 Tip:** Always use parentheses when chaining multiple conditions to avoid precedence issues.
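Without parentheses, Python binds `&` more tightly than `==`, so the unparenthesized version fails before Pandas ever filters anything (a sketch of the failure mode):
```python
# Raises an error: parsed as reviews.country == ('Italy' & reviews.points) >= 90
# reviews.loc[reviews.country == 'Italy' & reviews.points >= 90]

# Correct: wrap each condition
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
```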
## 🧩 Assigning Data
**🎯 Goal:** Add or modify columns in a DataFrame.
**⚙️ Mechanics:**
Use direct assignment (`df['new_col'] = ...`) to add new columns or overwrite existing ones.
### 🔹 Assigning a Constant
```python
reviews['critic'] = 'everyone'
```
➡️ Adds a new column with the same value for all rows.
### 🔹 Assigning a Sequence
```python
reviews['index_backwards'] = range(len(reviews), 0, -1)
```
➡️ Adds a column counting backward from the total row count.
## 💡 Summary of Manipulating, Filtering, and Assigning Data
| Concept | Description | Example |
|----------|--------------|----------|
| **Set Index** | Replaces the DataFrame's index with a column. | `df.set_index('title')` |
| **Conditional Filter** | Selects rows meeting logical conditions. | `df.loc[(df.country == 'Italy') & (df.points >= 90)]` |
| **Membership Filter** | Selects rows matching any value in a list. | `df.loc[df.country.isin(['Italy', 'France'])]` |
| **Null Check** | Filters missing or non-missing values. | `df.loc[df.price.notnull()]` |
| **Assign Constant** | Adds a new column with one constant value. | `df['critic'] = 'everyone'` |
| **Assign Iterable** | Adds a column using a range or list. | `df['index_backwards'] = range(len(df), 0, -1)` |
---
# Summary Functions and Maps
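**🎯 Goal:** Summarize a column with one statistic, or transform every value in it.
A minimal sketch of this chapter's two ideas, assuming the same `reviews` DataFrame used throughout these notes:
```python
reviews.points.describe()           # count, mean, std, quartiles
reviews.points.mean()               # a single summary statistic
reviews.taster_name.unique()        # distinct values
reviews.taster_name.value_counts()  # distinct values with counts

# map(): transform each value (here, re-center scores on the mean)
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
```
💡 `map()` returns a new, transformed Series; it does not modify `reviews` in place.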
# Grouping and Sorting
## 🧩 Groupwise Analysis
### 🔸 Why Group Data?
`map()` lets us transform data *one value at a time* in a column,
but `groupby()` lets us **split the dataset into logical groups** and then perform an operation *within* each group.
Think of `groupby()` as:
> "Break this DataFrame into smaller DataFrames where all rows share the same value in a column."
---
### 🔹 value_counts() vs groupby()
**🎯 Goal:** Count how many rows fall into each group.
**⚙️ Mechanics:**
`groupby()` organizes data into subgroups based on unique values, then we can count, average, or aggregate within each subgroup.
**Example:**
```python
reviews.groupby('points').points.count()
# ✅ Equivalent: reviews.points.value_counts()
```

This groups all rows that share the same `points` value, then counts how many rows (reviews) are in each group.
**💡 Takeaway:**
`value_counts()` is a simplified `groupby()` that automatically counts rows per unique value.
---
### 🔹 groupby() minimum
**🎯 Goal:** Summarize or analyze each group separately.
**⚙️ Mechanics:**
After grouping, you can access a specific column and apply any aggregation like `.min()`, `.max()`, `.mean()`, etc.
**Example:**
```python
reviews.groupby('points').price.min()
```

Each `points` group returns the minimum `price` found among its rows.
**🧠 Think of it as:**
A "slice" of the DataFrame for each group - like multiple smaller tables filtered by shared values.
**More control with apply():**
```python
#Get the first wine review title from each winery
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
```

**Multiple columns:**
```python
# Find the best wine by country and province
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
```

**💡 Takeaway:**
`groupby()` creates one sub-DataFrame per unique group.
`apply()` lets you perform custom operations on each sub-DataFrame.
---
### 🔹 groupby() + agg()
**🎯 Goal:** Run several aggregate functions at once.
**⚙️ Mechanics:**
`agg()` allows multiple summary operations in one step.
**Example:**
```python
reviews.groupby('country').price.agg([len, min, max])
```

This returns a DataFrame with the number of rows, min price, and max price per country.
**💡 Takeaway:**
Use `agg()` when you want multiple summary stats instead of chaining multiple `min()` or `max()` calls.
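String aggregation names work too, and avoid passing Python builtins (a sketch; `'size'` counts rows like `len` does, while `'count'` would skip missing prices):
```python
reviews.groupby('country').price.agg(['size', 'min', 'max'])
```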
---
### 💡 Summary of Groupby Section
| Operation | Description | Example |
|------------|--------------|----------|
| **value_counts()** | Quickly counts occurrences of each unique value in a Series. | `reviews.points.value_counts()` |
| **groupby()** | Splits the DataFrame into sub-DataFrames based on unique column values. | `reviews.groupby('points')` |
| **groupby().count()** | Counts rows in each group (same as `value_counts()` but flexible for any column). | `reviews.groupby('points').points.count()` |
| **groupby().min() / .max() / .mean()** | Applies summary functions to each group. | `reviews.groupby('points').price.min()` |
| **groupby().apply()** | Runs a custom function on each group (full control of each sub-DataFrame). | `reviews.groupby('winery').apply(lambda df: df.title.iloc[0])` |
| **groupby().agg([len, min, max])** | Runs multiple aggregation functions at once for concise summaries. | `reviews.groupby('country').price.agg([len, min, max])` |
---
🪄 *Tip:* Use `groupby()` when you need to analyze or summarize data **per category** instead of across the entire DataFrame.
## 🧩 Multi-Indexes
### 🔸 Why Multi-Indexes Exist
When you group by **multiple columns**, Pandas creates a **MultiIndex**:
a hierarchical structure that represents grouped levels of data.
**🧠 Think of it as:**
> "Rows are organized by two (or more) keys instead of one."
This allows for more complex, layered summaries such as
`(country → province)` or `(year → month)` groupings.
---
### 🔹 Creating a Multi-Index
**🎯 Goal:** Summarize data across multiple grouping keys.
**⚙️ Mechanics:**
When you call `groupby()` with multiple columns,
Pandas creates one "level" of the index for each grouping column.
**Example:**
```python
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed.head()
```

* `country` → first index level
* `province` → second index level
* `len` → aggregated column showing review counts
Each `(country, province)` pair now uniquely identifies a row.
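A sketch of pulling groups back out by label, assuming the `countries_reviewed` result above:
```python
countries_reviewed.loc['Italy']               # all provinces for one country
countries_reviewed.loc[('Italy', 'Tuscany')]  # one (country, province) row
```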
---
### 🔹 Working with Multi-Indexes
**🎯 Goal:** Make hierarchical results readable or easy to manipulate.
**⚙️ Mechanics:**
`reset_index()` reverts a multi-level index back to a single, numerical index. It also converts the old index levels back into standard columns.
**Example:**
```python
countries_reviewed.reset_index()
```

`country` and `province` return to being standard columns instead of index labels.
---
### 💡 Summary of Multi-Index Section
| Operation | Description | Example |
|------------|--------------|----------|
| **MultiIndex** | A hierarchical index created when grouping by multiple columns. Useful for layered summaries. | `reviews.groupby(['country','province']).description.agg([len])` |
| **.index** | Returns the index object of a grouped DataFrame. Used to check if a result has a MultiIndex. | `countries_reviewed.index` |
| **reset_index()** | Converts a MultiIndex back into standard columns for easier manipulation or sorting. | `countries_reviewed.reset_index()` |
| **Multi-level access** | Lets you select or slice data using one or more levels of the MultiIndex. | `countries_reviewed.loc['Italy']` |
| **Why use it** | Efficiently represents multi-key data (e.g., country → province) in compact tabular form. | — |
📘 [Pandas MultiIndex Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)
## 🧩 Sorting
### 🔸 Why Sorting Matters
After using `groupby()`, Pandas returns results in **index order**, not based on the actual data values.
That means your rows might be organized alphabetically by index labels rather than by numeric or meaningful order.
**🧠 Think of it as:**
> "`groupby()` organizes results by how they're labeled, not by what their values are."
To make summaries easier to read or analyze, we use **sorting** functions.
---
### 🔹 sort_values()
**🎯 Goal:** Reorder rows in a DataFrame based on one or more column values.
**⚙️ Mechanics:**
`sort_values()` sorts rows by column data, not by index.
It returns a *new* DataFrame unless you use `inplace=True`.
**Example:**
```python
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')
```

* Sorts rows by the `len` column in **ascending order** (lowest to highest)
* Does not modify the original DataFrame unless specified
**🪄 Tip**
To sort from highest to lowest, set `ascending=False`.
```python
countries_reviewed.sort_values(by='len', ascending=False)
```

---
### 🔹 sort_index()
**🎯 Goal:** Reorder rows or columns based on index labels instead of values.
**⚙️ Mechanics:**
Useful when your DataFrame has a Multi-Index or when the order of index labels matters.
**Example:**
```python
countries_reviewed.sort_index()
```

This sorts the DataFrame by its index.
**💡 When to use it:**
After `groupby()` or hierarchical operations that leave your data with a structured index.
---
### 🔹 Sorting by Multiple Columns
**🎯 Goal:** Define a hierarchy of sorting: for example, first by country, then by length (count).
**⚙️ Mechanics:**
* Pandas sorts by the first column listed, then breaks ties using the next one(s).
* Works with both numeric and string columns.
**Example:**
```python
countries_reviewed.sort_values(by=['country', 'len'])
```

**💡 Use case:**
Organizing grouped summaries where you want a primary and secondary sorting key.
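`ascending` can also take a list, setting the direction per column (a small sketch):
```python
# country A-Z, but within each country the largest len first
countries_reviewed.sort_values(by=['country', 'len'], ascending=[True, False])
```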
---
### 💡 Summary of Sorting Section
| Operation | Description | Example |
|------------|--------------|----------|
| **sort_values(by=)** | Sorts the DataFrame by one or more columns in ascending order (lowest first by default). | `countries_reviewed.sort_values(by='len')` |
| **sort_values(..., ascending=False)** | Sorts the DataFrame in descending order (highest values first). | `countries_reviewed.sort_values(by='len', ascending=False)` |
| **sort_index()** | Sorts rows or columns based on their index labels instead of column values. | `countries_reviewed.sort_index()` |
| **Multi-column sort** | Sorts by multiple columns in sequence (uses the next column to break ties). | `countries_reviewed.sort_values(by=['country', 'len'])` |
| **Why use it** | Ensures that grouped or summarized results are displayed in a meaningful and logical order. | — |
---
# Data Types and Missing Values
## 🧩 Dtypes
### 🔹 Dtype Identification
**🎯 Goal:** Identify the data types of the data stored within either a Series or a DataFrame.
**⚙️ Mechanics:**
Use the `.dtype` attribute on a single Series (or column) and `.dtypes` on a whole DataFrame.
**Examples:**
View the data type of a single column named `price`:
```python
reviews.price.dtype
```
`dtype('float64')`
View the data type of every column:
```python
reviews.dtypes
```

* Note: Columns consisting entirely of strings do not get their own type; they are given the `object` type instead.
### 🔹 Change Dtype
**🎯 Goal:** Convert a column (or Series) from one data type to another when such a conversion makes sense.
**⚙️ Mechanics:**
Use the `.astype()` method to change the data type (`dtype`) of a column or Series. This is often done to make data compatible for calculations or to ensure consistent storage formats.
**Examples:**
Change the datatype of a single column (`int64` → `float64`):
```python
reviews.points.astype('float64')
```

The DataFrame index has its own dtype as well:
```python
reviews.index.dtype
```
`dtype('int64')`
## 🧩 Missing Data
### 🔹 Not a Number (`NaN`)
**🎯 Goal:** Identify, select, and replace missing or invalid entries within a DataFrame.
**⚙️ Mechanics:**
* Missing values are represented by **`NaN`** (Not a Number).
* For technical reasons, `NaN` values are always stored with the `float64` dtype, so an integer column gains a float dtype as soon as it holds a missing value (see the sketch after this list).
* Pandas provides several tools to detect and fill missing values.
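A one-line standalone sketch of the float promotion:
```python
pd.Series([1, 2, None]).dtype   # dtype('float64'): the missing value forces floats
```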
**Examples:**
To select `NaN` entries:
```python
reviews[pd.isnull(reviews.country)]
```

To select non-`NaN` entries:
```python
reviews[pd.notnull(reviews.country)]
```
### 🔹 Replace Missing Values
**🎯 Goal:** Fill or replace missing (`NaN`) entries within a DataFrame or Series using different strategies.
**⚙️ Mechanics:**
Use the `.fillna()` method to replace all missing values in a column or Series. This can be done by assigning a constant value (like `"Unknown"`) or using a fill strategy such as *backfill* or *forward fill*.
**Examples:**
To replace `NaN` with `Unknown`:
```python
reviews.region_2.fillna("Unknown")
```

**Alternate Strategy**
Use *backfill* or *forward fill* to propagate valid values into missing entries.
```python
# Backfill: replace with next valid value
reviews.region_2.fillna(method='bfill')
# Forward fill: replace with previous valid value
reviews.region_2.fillna(method='ffill')
```
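**💡 Note:** Newer Pandas releases deprecate the `method=` argument to `fillna()`; `reviews.region_2.bfill()` and `reviews.region_2.ffill()` are the direct replacements.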
### 🔹 Replacing Specific or Invalid Values
**🎯 Goal:** Update or correct specific entries within a DataFrame or Series, whether missing or non-missing.
**⚙️ Mechanics:**
Use the `.replace()` method to substitute one or more specific values with new ones. This is useful for correcting errors, updating outdated entries, or replacing placeholder text such as `"Unknown"` or `"Invalid"`.
**Examples:**
Update a reviewer's Twitter handle:
```python
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
```
* Note: The `replace()` method is worth mentioning here because it's handy for replacing missing data that is given some kind of sentinel value in the dataset: things like `"Unknown"`, `"Undisclosed"`, `"Invalid"`, and so on.
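For example, a sentinel can be converted into a real missing value so the `fillna()` tools above apply (a sketch; assumes NumPy is imported):
```python
import numpy as np

reviews.region_2.replace("Unknown", np.nan)
```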
# Renaming and Combining
## 🧩 Renaming
### 🔹 Renaming Columns
**🎯 Goal:** Change one or more column names in a DataFrame to more meaningful or updated labels.
**⚙️ Mechanics:**
Use the `.rename()` method with the `columns` keyword and a dictionary mapping **old names → new names**. This updates only the specified columns, leaving the rest of the DataFrame unchanged.
**Examples:**
Rename the `points` column to `score`:
```python
reviews.rename(columns={'points': 'score'})
```
### 🔹 Renaming Index Values
**🎯 Goal:** Change the labels (names) of specific rows in a DataFrame's index.
**⚙️ Mechanics:**
Use the `.rename()` method with the `index` keyword and a dictionary mapping **old index values → new labels**. This updates only the specified index entries while leaving all other rows untouched.
**Examples:**
Rename the first two index values from `0` and `1` to `firstEntry` and `secondEntry`:
```python
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})
```
### 🔹 Renaming Axis Names
**🎯 Goal:** Assign or modify the *name* of the DataFrame's row and column axes (not the individual labels).
**⚙️ Mechanics:**
Use the `.rename_axis()` method to change the *axis names* for rows or columns. This affects the axis titles, not the row or column values themselves.
**Examples:**
Rename the **row axis** to `"wines"` and the **column axis** to `"fields"`:
```python
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
```
## 🧩 Combining
Pandas provides three main methods for combining data:
1. `concat()` → stack DataFrames or Series along an axis.
2. `join()` → combine DataFrames by matching index labels.
3. `merge()` → SQL-style joins (not covered here).
### 🔹 concat()
**🎯 Goal:** Combine multiple DataFrames or Series into one by stacking them along an axis (row-wise or column-wise).
**⚙️ Mechanics:**
`pd.concat()` takes a list of DataFrames or Series and merges them together. By default, it stacks them vertically (`axis=0`).
**Examples:**
Combine two YouTube datasets with identical columns:
```python
canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")
pd.concat([canadian_youtube, british_youtube])
```

### 🔹 left.join() or right.join()
**🎯 Goal:** Combine two DataFrames based on matching **index values**, similar to a SQL join.
**⚙️ Mechanics:**
Use the `.join()` method to merge two DataFrames where their indexes align. You can specify suffixes (`lsuffix`, `rsuffix`) to differentiate overlapping column names. If indexes don't match perfectly, rows with missing matches will contain `NaN`.
**My Breakdown**:
- **Think of it as:** a *search-and-append* operation.
- **Goal:** Keep all rows from the left DataFrame and pull matching data from the right.
- **Key idea:**
For each row in the left DataFrame, pandas searches the right one for matching key(s) and copies those values over.
- **If no match:** the new columns get `NaN`, but the row is still kept.
- **If multiple matches:** the left row is repeated once per match.
- **Common use:** Start with the table that connects to all others (the "hub") and add context with left merges.
- **Example mental model:** like using a dictionary:
- Keys = the columns you match on (e.g., `cust_id`, `service`)
- Values = the extra data being attached
**Examples:**
Create two YouTube DataFrames, one for Canada and one for the UK, and align them on `title` and `trending_date`:
```python
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
```

---
# Notebook 7 Notes
https://github.com/Frodo-Swagginz/CSE6040-Fall-2025/blob/main/module%201/Topic%207%20-%20Tidying%20Data/(M1T7)%20Notebook%207.ipynb
## 🧩 Tidy Data
A tidy dataset is a data set that can be organized into a 2-D table in which:
1. each column represents a *variable*
2. each row represents an *observation*
3. each entry of the table represents a single *value*, which may come from either categorical (discrete) or continuous spaces.
++Tibble++: Stands for tidy table.
## 🧩 Canonical Order
A tibble is considered canonical if:
1. The variables appear in sorted order by name, ascending from left to right.
2. The rows appear in lexicographically sorted order by variable, ascending from top to bottom.
3. The row labels (Y.index) go from 0 to n-1, where n is the number of observations.
## 🧩 Iris Dataset
The Iris dataset is loaded into Vocareum, and several Pandas operations are used to view the dataset's information.
## 🧩 Exercises
### 🔸 Exercise 0
**🎯 Goal:** Calculate 'prevalence', the ratio of cases to population, using a DataFrame G that is guaranteed to have two columns, 'cases' and 'population'. Return a new DataFrame, H, that is a copy of G and contains a new column 'prevalence'.
**✅ Inputs:** DataFrame G that has columns 'cases' and 'population'.

**✅ Outputs:** DataFrame H, a copy of G, with a new column 'prevalence'. We cannot use `G['prevalence'] = G['cases'] / G['population']`.
**⚙️ Pseudocode:**
1. Create a copy of G, name it H
2. Create our new column, `H['prevalence']`, and use `.apply()` on `H` to calculate prevalence: cases/population
**Solution**:
```python
### Solution - Exercise 0
def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    # Step 1: Create a copy to avoid editing the original G
    H = G.copy()
    # Step 2: H['prevalence'] assigns a new column within H
    ## .apply() on H, using axis=1 and result_type='expand'. axis=1 applies the function ACROSS the columns, i.e. row by row
    ## 'expand' turns list-like results into columns (shouldn't apply here, but it's a safety valve)
    H['prevalence'] = H.apply(lambda x: x['cases'] / x['population'], axis=1, result_type='expand')
    return H
```
**⚙️ Key Pandas Concepts:**
* `G.copy()` → creates a copy of the DataFrame
* `H['prevalence'] =` → creates a new column labeled 'prevalence'
* `.apply(..., axis=1, result_type='expand')` → `axis=1` computes a new value per row based on the other columns; `result_type='expand'` converts list-like results into columns (not necessary here, but safer)
**📘 References**:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
### 🔸 Exercise 1
**🎯 Goal:** Given a tibble X, return a copy Y that is canonically structured.
**✅ Inputs:** DataFrame X that is non-canonical.

**✅ Outputs:** DataFrame Y, a copy of X, that is canonical.
**⚙️ Pseudocode:**
1. Alphabetically sort the list-like object of the column headers
2. Create a copy of X, Y, that sorts data on the sorted headers (Think vertically sorted)
3. Sort the *values* of Y based on the sorted headers
🧠 By sorting on the sorted headers, Pandas will appropriately handle any 'ties' that occur within the first column based on the contents of the second
4. Reset the index column
**Solution**:
```python
### Solution - Exercise 1
def canonicalize_tibble(X):
    # Enforce Property 1: The variables appear in sorted order by name, ascending from left to right.
    ## X.columns accesses the METADATA that holds the column headers (i.e. headers are labels, not rows/columns)
    ## we sort those labels alphabetically using sorted()
    ## Create a copy of X, called Y, with columns ordered by the sorted headers.
    ## Thoughtful Insight: Think of this step as moving the data around VERTICALLY
    var_names = sorted(X.columns)
    Y = X[var_names].copy()
    # Enforce Property 2: Rows must be sorted
    ## We have moved the data around vertically, but now must move it around horizontally.
    ## .sort_values() reorders rows based on column values
    ### by= provides the sorting parameter
    ### ascending=True sorts from lowest to highest
    Y = Y.sort_values(by=var_names, ascending=True)
    # Enforce Property 3: Reset row labels
    Y = Y.reset_index(drop=True)
    return Y
```
**⚙️ Key Pandas Concepts:**
* `X.columns` → returns a list-like object containing the column headers of X
* `sorted()` → sorts a list alphabetically
* `sort_values(by=sorted(X.columns), ascending=True)` → sorts the *values* within each column from smallest to largest (`ascending=True`); `by=` provides our sorting rule, which in this case is alphabetical due to referencing `sorted()`
* `.reset_index(drop=True)` → `drop=True` discards the old index while `reset_index` creates a new default one (this returns a new DataFrame; it is not in-place unless `inplace=True`)
**📘 References**:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
### 🔸 Exercise 2
**🎯 Goal:** Given two tibbles, A and B, determine if they are equivalent. Equivalence is defined as having identical variables and observations, but allows the variables and observations to be out of order.
**✅ Inputs:** DataFrames A and B.

**✅ Outputs:** DataFrames AA and BB that are passed into a `try`/`except` block that checks `assert_frame_equal(AA, BB)` and returns the appropriate statement depending on whether the assertion passes.
**⚙️ Pseudocode:**
1. Put each DataFrame, A and B, into canonical order
2. Use a `try`/`except` block to either verify equivalence or provide an `error`
**Solution**:
```python
### Solution - Exercise 2
def tibbles_are_equivalent(A, B):
    from pandas.testing import assert_frame_equal
    # Step 1: convert both DataFrames to canonical order
    # Property 1
    var_names_a = sorted(A.columns)
    var_names_b = sorted(B.columns)
    AA = A[var_names_a].copy()
    BB = B[var_names_b].copy()
    # Property 2
    AA = AA.sort_values(by=var_names_a, ascending=True)
    BB = BB.sort_values(by=var_names_b, ascending=True)
    # Property 3
    AA = AA.reset_index(drop=True)
    BB = BB.reset_index(drop=True)
    try:
        assert_frame_equal(AA, BB)
        return 'Tibbles are equivalent!'
    except AssertionError as error_message:
        return str(error_message)
```
**⚙️ Key Pandas Concepts:**
* `assert_frame_equal(df1, df2)` → compares two DataFrames and raises an `AssertionError` describing any differences
**📘 References**:
https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_frame_equal.html
### 🔸 Exercise 3
**🎯 Goal:** Manually `melt` a 'wide' DataFrame to tidy the *values* and turn columns into rows. Melting involves:
1. Extract the *column* values into a new variable.
2. Convert the values associated with the column values into a new variable as well.
**✅ Inputs:**
* A DataFrame, `df`
* A list of the *column* names that will serve as values, `col_vals`
* Name of the new variable, `key`
* Name of the column to hold the values, `value`
| country | 1999| 2000 |
| -------- | -------- | -------- |
| Afghanistan | 745 | 2666 |
| Brazil | 37737 | 80488 |
| China | 212258 | 213766 |
**✅ Outputs:** A manually melted DataFrame, Y, that is a 'long' copy of the original.
**⚙️ Pseudocode:**
1. Create a copy of the original DataFrame
2. Identify the identifier column(s)
* Compare all column names in `a` with `col_vals`
* Keep the column(s) that are *not* in `col_vals`
* Assign that column name to `id_col`
3. Initialize an empty list (iterable) to collect the new DataFrames
4. Loop through each value in `col_vals` (1999, 2000)
* For each year:
* Create a new temporary DataFrame (`temp`) with:
* `"country"`: taken from `a[id_col]`
* `"year"`: set to the current year (loop variable)
* `"cases"`: the values from that year's column (`a[year]`)
* Append `temp` to `frames`
5. Concatenate all DataFrames in `frames` into one
* Use `pd.concat(frames, ignore_index=True)`
* This stacks each mini DataFrame vertically into one tidy table
6. Return the final long-form DataFrame
**Solution**:
```python
### Solution - Exercise 3
def melt(df, col_vals, key, value):
    assert type(df) is pd.DataFrame
    # Step 1: Create a copy of our original DataFrame
    a = df.copy()
    # Step 2: Determine the identifier column(s)
    ## Note: col_vals = [1999, 2000]
    ## Note: a.columns = Index(['country', 1999, 2000])
    ## Loop over a.columns, collect the id column(s) not present in col_vals, then grab the first
    id_col = [c for c in a.columns if c not in col_vals][0]  # assumes one id column, returns 'country'
    # Step 3: For each column name in col_vals, build a small DataFrame
    ## Note: Each temp DataFrame should look like: pd.DataFrame({'country': ['Afghanistan', 'Brazil', 'China'],
    ##                                                           'year': 1999,
    ##                                                           'cases': [745, 37737, 212258]})
    ## The only thing that changes on the second loop is that 'year' is 2000 and the cases become the values for 2000
    ## Create a blank list to hold our DataFrames (DataFrames look like df({key: [values]}))
    frames = []
    # Loop over the years in col_vals
    for year in col_vals:
        # Each key:value pair below defines a new column in our temporary DataFrame
        temp = pd.DataFrame({
            # id_col ('country') is the name of the new column in temp; a[id_col] pulls the country column from a
            id_col: a[id_col],
            # key ('year') is the name; year is the current loop value, hence there is no a['year']
            key: year,
            # value ('cases') is the name; a[year] refers to the column name from the original,
            # so the loop pulls the 1999 column on the first pass and the 2000 column on the second
            value: a[year]
        })
        frames.append(temp)
    # Step 4: Concatenate our DataFrames into one using pd.concat(), which accepts an iterable
    Y = pd.concat(frames, ignore_index=True)
    return Y
```
**⚙️ Key Pandas Concepts:**
* `pd.DataFrame({})` → creates a new DataFrame
* `pd.concat(iterable, ignore_index=True)` → combines DataFrames stored in an iterable into one DataFrame; `ignore_index=True` resets the row labels
* `DataFrame.columns` → returns an index-like list of column names
* `a[id_col]` (dynamic column selection) → access the data for a column using its name stored in a variable
**📘 References**:
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
### 🔸 Exercise 4
**🎯 Goal:** Manually `cast` a too-long DataFrame where observations are split across multiple rows. Tidy up the DataFrame by placing observations in a column under their relevant variable.
**✅ Inputs:**
* A DataFrame, `df`
* A `key` column: values will become the new headers
* A `value` column: values will fill the new columns
* A `join_how` parameter for `pd.merge`: determines how rows align when merging

**✅ Outputs:** A wide DataFrame in which:
* Each unique key becomes its own column.
* Fixed variables remain.
* Values are aligned by the fixed variables.
* Merge behaviour is controlled by `join_how`.
**⚙️ Pseudocode:**
1. Identify fixed variables
* Extract columns that are not the key or value and keep them constant
2. Initialize an empty DataFrame with only our fixed variables
3. Loop through each unique key:
* Subset `df` to only rows matching that key
* Keep the `fixed_vars` and `value` column
* Rename the `value` column to the current key's name
* Merge this small DataFrame with the growing tibble of the fixed variables using the supplied `how=join_how`
**Solution**:
```python
### Solution - Exercise 4
def cast(df, key, value, join_how='outer'):
    """Casts the input data frame into a tibble,
    given the key column and value column.
    """
    assert type(df) is pd.DataFrame
    assert key in df.columns and value in df.columns
    assert join_how in ['outer', 'inner']
    # Step 1: Identify the fixed variables and keep them constant
    ## df.columns returns the headers: Index(['country', 'year', 'type', 'count'])
    ## df.columns.difference returns the headers that are not in the supplied list: Index(['country', 'year'])
    fixed_vars = df.columns.difference([key, value])
    # Step 2: Create an empty tibble (DataFrame) to hold our new values
    ## define the initial columns, fixed_vars
    tibble = pd.DataFrame(columns=fixed_vars)  # start as an empty tibble
    # Step 3: Loop through each UNIQUE key
    ## For each unique entry in the key column (e.g., 'cases', 'population')
    for k in df[key].unique():
        ## df[df[key] == k] builds a True/False mask to pull rows where the key column of df matches the current loop value ('cases' or 'population')
        ### i.e. we're only pulling rows where key == 'cases', and then where key == 'population'
        ## [list(fixed_vars) + [value]] tells pandas which columns to keep; this creates a list like ['country', 'year', 'value']
        # So: take only the rows where the key equals the current k value, and then,
        # from those rows, keep only the columns ['country', 'year', 'value']
        sub = df[df[key] == k][list(fixed_vars) + [value]].copy()
        ## Rename the value column to the current key name
        sub = sub.rename(columns={value: k})
        ## Merge this subset into the growing tibble
        tibble = pd.merge(tibble, sub, on=list(fixed_vars), how=join_how)
    return tibble
```
**⚙️ Key Pandas Concepts:**
* `df.columns.difference([key, value])` → returns the column labels that are not part of `[key, value]`
* `df[key].unique()` → gets all unique key values
* `DataFrame.rename()` → renames columns dynamically within each loop
* `pd.merge(..., how=join_how)` → combines two DataFrames horizontally on shared key columns; `how` controls whether to perform an inner or outer join
* `df[rows][columns]` → row filter first, then column selection, to access DataFrame values
**📘 References**:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html
https://pandas.pydata.org/docs/reference/api/pandas.merge.html
### 🔸 Exercise 5
**🎯 Goal:** Split a text column into multiple new columns using a splitter function.
**✅ Inputs:**
* A DataFrame, `df`, pictured below
* A column name to split, `key`
* A list of new column names, `into`
* A Regex Splitter function, `splitter=default_splitter`

**✅ Outputs:** A new DataFrame with:
* Original DataFrame columns with the exception of `key`
* New columns listed in `into`, filled from the split
**⚙️ Pseudocode:**
1. Apply the splitter to the `key` column
2. Turn the resulting Series of lists into a DataFrame
3. Drop the old `key` column from the original DataFrame
4. Concatenate side-by-side (by index), stacking horizontally
**Solution**:
```python
### Solution - Exercise 5
def separate(df, key, into, splitter=default_splitter):
    """Given a data frame, separates one of its columns, the key,
    into new variables.
    """
    assert type(df) is pd.DataFrame
    assert key in df.columns
    # Hint: http://stackoverflow.com/questions/16236684/apply-pandas-function-to-column-to-create-multiple-new-columns
    # Step 1: Apply the splitter to each value in the key column (e.g. the 'rate' column), returning a Series of lists
    ## '745/19987071' → ['745', '19987071']
    split_cols = df[key].apply(splitter)
    # Step 2: Convert the resulting lists into a new DataFrame with the specified column names
    ## into = ['cases', 'population']
    ## .tolist() converts a Pandas Series into a Python list
    # So, create a new DataFrame where the data is:
    # 0  ['745', '19987071']
    # 1  ['2666', '20595360'] ...
    # and the column labels come from into = ['cases', 'population'],
    # where data column [0] is applied to into[0]
    split_df = pd.DataFrame(split_cols.tolist(), columns=into)
    # Step 3: Combine the split columns with the original data frame (drop the original key)
    out = pd.concat([df.drop(columns=[key]), split_df], axis=1)
    return out
```
**⚙️ Key Pandas Concepts:**
* `df[key]` → select one column (Series)
* `.apply(func)` on a Series → run `func` element-wise
* `.tolist()` → turns a Series of lists into a list of lists
* `pd.DataFrame(data, columns=...)` → create a new DataFrame by supplying data and columns (columns can be a list)
* `df.drop(columns=[...])` → remove the supplied column
* `pd.concat([...], axis=1)` → horizontal join; aligns by index
**📘 References**:
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
https://pandas.pydata.org/docs/reference/api/pandas.Series.to_list.html
### 🔸 Exercise 6
**🎯 Goal:** Combine multiple columns into a single column using a supplied join function.
**✅ Inputs:**
* A DataFrame, `df`
* A list of column names in `df` to combine, `cols`
* A name for the new column, `new_var`
* The supplied helper function, `combine=str_join_elements`

**✅ Outputs:** A new DataFrame that:
* Combines the *values* within the supplied `cols` list into a new Series named `new_var`
* Drops the columns provided within `cols`
**⚙️ Pseudocode:**
1. Create a copy of the DataFrame
2. Apply the combine function row-by-row to the selected columns and store the result as a new column
3. Drop the original columns that were merged
**Solution**:
```python
### Solution - Exercise 6
def unite(df, cols, new_var, combine=str_join_elements):
    # Hint: http://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe
    """Combines multiple columns into a single new column."""
    assert isinstance(df, pd.DataFrame)
    # Step 1: Create a copy
    a = df.copy()
    # Step 2: Apply combine() to each row's selected columns
    ## a[new_var] → creates a new column with the name found within new_var
    ## a[cols] → within the DataFrame a, access the columns named in the cols variable
    ## .apply(lambda x: combine(x), axis=1) → row by row, apply the function to the identified cols
    a[new_var] = a[cols].apply(lambda x: combine(x), axis=1)
    # Step 3: Drop the original columns after combining
    a = a.drop(columns=cols)
    return a
```
**⚙️ Key Pandas Concepts:**
* `df['column_name'] =` → set up a new column within the DataFrame
* `= df['column_name']` → access the supplied `column_name` within the DataFrame
* `.apply(..., axis=1)` → apply a function row by row, to operate across multiple columns
* `df.drop(columns='column_name')` → drop the supplied `column_name` from the `df`
**📘 References**:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
### 🔸 Exercise 7
**🎯 Goal:** Given a wide table `who`, turn it into a tidy table with the following properties:
* All the `'new...'` columns of `who` become *values* of a *single variable*, `case_type`. Store the counts associated with each `case_type` value as a new variable called `'count'`.
* Remove the `iso2` and `iso3` columns, since they are redundant with `country` (which you should keep!).
* Keep the `year` column as a variable.
* Remove all not-a-number (NaN) counts. Hint: You can test for a NaN using Python's `math.isnan()`.
* Convert the counts to integers. (Because of the presence of NaNs, the counts will otherwise be treated as floating-point values, which is undesirable since you do not expect to see non-integer counts.)
**✅ Inputs:**
A DataFrame, `who`, with identifier columns and many `"new..."` measurement columns.

**✅ Outputs:** `who2`, a tidy DataFrame with:
* identifiers: `country`, `year`
* variable name: `case_type`, whose values come from any columns labeled `"new..."`
* measurement: `count`, which holds the counts taken from the `"new..."` columns
**⚙️ Pseudocode:**
1. Create a copy of the original
2. Drop redundant columns
3. Identify and store fixed variables
4. Create a list of all column names that starts with "new"
5. Initialize an empty list to hold mini DataFrames
6. Loop through each `"new..."` column:
* Select fixed id columns and add the current loop value to the list of columns
* Rename the current loop value column name (`c`) to `count`
* Create a new column `case_type` and assign it a value of `c`
* Remove rows with missing counts (i.e. NaN values)
* Reorder columns
* Append temp DataFrame to empty list
7. Concatenate all temp DataFrames
8. Convert all values within `count` to type integer
9. Sort the values based on the list and reset the index
**Solution**:
```python
def tidy_data(who):
    assert isinstance(who, pd.DataFrame)
    # Step 1: Create a copy
    a = who.copy()
    # Step 2: Drop redundant columns
    a = a.drop(columns=['iso2', 'iso3'])
    # Step 3: Identify and store fixed variables
    fixed = ['country', 'year']
    # Step 4: Build a list of all column headers that start with 'new'
    case_cols = sorted([c for c in a.columns if c.startswith('new')])
    # Step 5: Create a list to store temporary DataFrames
    ## Initiate an empty list
    frames = []
    ## Loop over each item within case_cols (list of "new..." variables)
    for c in case_cols:
        ## fixed = ['country', 'year']
        ## [c] → a one-element list holding the current value of our loop over case_cols
        ### the first element of case_cols is 'new_ep_f014'
        ## a[fixed + [c]] → returns a small DataFrame with the columns defined above
        temp = a[fixed + [c]].copy()
        ## Rename c to 'count'
        temp = temp.rename(columns={c: 'count'})
        ## Create a new column 'case_type' and assign the value of c to it
        temp['case_type'] = c
        ## temp['count'].notna() → returns a Series of Booleans that is False where the value is NaN
        ## temp[temp['count'].notna()] → only keep rows that are True
        temp = temp[temp['count'].notna()]
        ## Reorder and keep only these columns, in this order
        temp = temp[fixed + ['case_type', 'count']]
        frames.append(temp)
    # Step 6: Concatenate all DataFrames within the frames list
    who2 = pd.concat(frames, ignore_index=True)
    # Step 7: Turn all values within 'count' into integers
    who2['count'] = who2['count'].astype(int)
    # Step 8: Sort the values deterministically and reset the index
    who2 = who2.sort_values(['country', 'year', 'case_type']).reset_index(drop=True)
    return who2
```
**⚙️ Key Pandas Concepts:**
* `cols = [c for c in df.columns if c.startswith('new')]` → list comprehension that selects columns by pattern
* `df.rename(columns={old_col_name: 'new_column_name'})` → rename a column
* `df[df['col_name'].notna()]` → Boolean filter to remove NaN values
* `df['col_name'].astype(int)` → converts values within `col_name` to the specified type
### 🔸 Exercise 8
**🎯 Goal:** Take the DataFrame `who2` and split the `case_type` column into three new variables:
* `'type'`, to hold the TB type, having possible values of rel, ep, sn, and sp;
* `'gender'`, to hold the gender as a string having possible values of female and male; and
* `'age_group'`, to hold the age group as a string having possible values of 0-14, 15-24, 25-34, 35-44, 45-54, 55-64, and 65+.
**✅ Inputs:** The DataFrame `who2` from Exercise 7.
**✅ Outputs:** A new DataFrame `who3` that splits `case_type` into three separate columns.
**⚙️ Pseudocode:**
1. Extract the 3 components using Regex and grouping them.
2. Map gender and age to required labels.
3. Concatenate the final DataFrame.
**Solution**:
```python
### Solution - Exercise 8
def split_who(who2):
    assert isinstance(who2, pd.DataFrame)
    # Step 1: Extract type, gender, age from case_type (handles both 'new_sp_m014' and 'newrel_m65')
    ## df['col_name'].str.extract(r'...') pulls out named Regex groups as columns
    parts = who2['case_type'].str.extract(r'^new_?(?P<type>rel|ep|sn|sp)_(?P<gender>[mf])(?P<age>\d+)$')
    # Step 2: Map gender and age to required labels
    gender_map = {'m': 'male', 'f': 'female'}
    age_map = {
        '014': '0-14', '1524': '15-24', '2534': '25-34', '3544': '35-44',
        '4554': '45-54', '5564': '55-64', '65': '65+'
    }
    parts['gender'] = parts['gender'].map(gender_map)
    parts['age_group'] = parts['age'].map(age_map)
    # Step 3: Assemble the final tibble
    who3 = pd.concat(
        [who2[['country', 'year', 'count']].reset_index(drop=True),
         parts[['type', 'gender', 'age_group']].reset_index(drop=True)],
        axis=1
    )
    # Optional: order columns neatly
    who3 = who3[['count', 'country', 'year', 'type', 'gender', 'age_group']]
    return who3
```
**⚙️ Key Pandas Concepts:**
* `df['col_name'].str.extract(r'...')` → extract information from a string using Regex capture groups
* `df['age_group'] = df['age'].map(age_map)` → maps each code to the intended label (i.e. '014' maps to '0-14')
**📘 References**:
https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html