# Pandas 4
---
title: Agenda
description:
duration: 300
card_type: cue_card
---
### Content
- Multi-indexing
- Melting
- `pd.melt()`
- Pivoting
- `pd.pivot()`
- `pd.pivot_table()`
- Binning
- `pd.cut()`
---
title: Multi-Indexing
description:
duration: 2400
card_type: cue_card
---
Code:
``` python=
# !pip install --upgrade gdown
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
```
> **Output**
```
Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 25.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 64.8MB/s]
```
Code:
```python=
import pandas as pd
import numpy as np
movies = pd.read_csv('movies.csv', index_col=0)
directors = pd.read_csv('directors.csv', index_col=0)
data = movies.merge(directors, how='left', left_on='director_id', right_on='id')
data.drop(['director_id','id_y'], axis=1, inplace=True)
```
### Multi-Indexing
**Which director, according to you, should be considered the most productive?**
- Should we decide based on the **number of movies** directed?
- Or take the **quality of the movies** into consideration as well?
- Or maybe look at the **amount of business** the movies are doing?
To simplify, let's find out who has directed the maximum number of movies.
Code:
``` python=
data.groupby(['director_name'])['title'].count().sort_values(ascending=False)
```
> **Output**
```
director_name
Steven Spielberg 26
Clint Eastwood 19
Martin Scorsese 19
Woody Allen 18
Robert Rodriguez 16
..
Paul Weitz 5
John Madden 5
Paul Verhoeven 5
John Whitesell 5
Kevin Reynolds 5
Name: title, Length: 199, dtype: int64
```
`Steven Spielberg` has directed the maximum number of movies.
**But does it make `Steven` the most productive director?**
- Chances are, he might have been active for more years than the other directors.
**How can we calculate the active years for every director?**
- We can find the `min` and `max` of `year` for each director and subtract the former from the latter.
Code:
``` python=
data_agg = data.groupby(['director_name'])[["year", "title"]].aggregate({"year":['min','max'], "title": "count"})
data_agg
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/235/original/k.png?1708848751" width=350 height=450>
\
Notice,
- `director_name` column has turned into **row labels**.
- There are multiple levels for the column names.
This is called a **Multi-index DataFrame**.
- It can have **multiple indexes along a dimension**.
- The number of dimensions remains the same, though.
- Multi-level indexes are **possible both for rows and columns**.
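To make the points above concrete, here is a minimal sketch on a hypothetical toy DataFrame (the names `toy`, `rows_multi` and `cols_multi` are made up for illustration and are not part of the movies data). It shows a multi-level index on the rows and on the columns, and that the frame stays 2-dimensional in both cases.

```python
import pandas as pd

# Hypothetical toy data, not the movies dataset used in this lecture.
toy = pd.DataFrame({
    "director": ["A", "A", "B", "B"],
    "year": [2001, 2002, 2001, 2003],
    "revenue": [10, 20, 30, 40],
})

# Passing two keys to set_index gives a two-level row index.
rows_multi = toy.set_index(["director", "year"])

# Aggregating with lists of functions gives two-level column labels.
cols_multi = toy.groupby("director").agg({"year": ["min", "max"], "revenue": ["sum"]})

print(rows_multi.index.nlevels)           # number of row levels
print(cols_multi.columns.nlevels)         # number of column levels
print(rows_multi.ndim, cols_multi.ndim)   # dimensions of each frame
```

In both cases the extra levels live inside one axis; no third dimension is added.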
Code:
``` python=
data_agg.columns
```
> **Output**
```
MultiIndex([( 'year', 'min'),
( 'year', 'max'),
('title', 'count')],
)
```
The top-level column names (level 0 in pandas) are `year` and `title`.
**What would happen if we print the column `year` of this multi-index dataframe?**
Code:
``` python=
data_agg["year"]
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/236/original/l.png?1708848778" width=325 height=450>
\
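As a small illustration of selecting from multi-level columns, the sketch below uses a hypothetical toy frame (`toy` and `toy_agg` are made-up names). Selecting a top-level label returns the sub-columns beneath it, while a full tuple picks out one specific column.

```python
import pandas as pd

# Hypothetical toy data, not the movies dataset.
toy = pd.DataFrame({
    "director": ["A", "A", "B"],
    "year": [2001, 2005, 2010],
    "title": ["t1", "t2", "t3"],
})
toy_agg = toy.groupby("director").agg({"year": ["min", "max"], "title": "count"})

# Top-level label: a DataFrame holding the 'min' and 'max' sub-columns.
year_block = toy_agg["year"]

# Full tuple: a single column, returned as a Series.
year_min = toy_agg[("year", "min")]

print(year_block)
print(year_min)
```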
**How can we convert multi-level back to only one level of columns?**
- e.g. `year_min`, `year_max`, `title_count`
Code:
``` python=
data_agg = data.groupby(['director_name'])[["year","title"]].aggregate(
{"year":['min', 'max'], "title": "count"})
data_agg.columns = ['_'.join(col) for col in data_agg.columns]
data_agg
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/237/original/m.png?1708848795" width=500 height=450>
\
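Since each column label was a tuple such as `('year', 'min')`, we can simply join its parts with `_`.
Named aggregation gives the same kind of flat column names directly: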
Code:
``` python=
data.groupby('director_name')[['year', 'title']].aggregate(
year_max=('year','max'),
year_min=('year','min'),
title_count=('title','count')
)
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/238/original/n.png?1708848812" width=500 height=450>
\
The columns look good, but we may want to turn the row labels back into a proper column as well.
**Converting row labels into a column using `reset_index` -**
Code:
``` python=
data_agg.reset_index()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/239/original/o.png?1708848833" width=520 height=420>
\
**Using the new features, can we find the most productive director?**
1. First calculate how many years the director has been active.
Code:
``` python=
data_agg["yrs_active"] = data_agg["year_max"] - data_agg["year_min"]
data_agg
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/240/original/p.png?1708848854" width=575 height=450>
\
2. Then calculate rate of directing movies by `title_count`/`yrs_active`.
Code:
``` python=
data_agg["movie_per_yr"] = data_agg["title_count"] / data_agg["yrs_active"]
data_agg
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/241/original/q.png?1708848870" width=660 height=425>
\
3. Finally, sort the values.
Code:
``` python=
data_agg.sort_values("movie_per_yr", ascending=False)
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/242/original/r.png?1708848887" width=600 height=425>
\
**Conclusion:**
- `Tyler Perry` turns out to be truly the most productive director.
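The three steps above can be sketched end to end on a hypothetical toy dataset (all names and values below are made up). One assumption added here that is not in the lecture: a director whose movies all fall in a single year has `yrs_active` equal to 0, and plain division would then return `inf`, so this sketch treats a zero span as missing before dividing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; director "C" released both movies in the same year.
toy = pd.DataFrame({
    "director_name": ["A", "A", "A", "B", "B", "C", "C"],
    "year": [2000, 2005, 2010, 2001, 2003, 2015, 2015],
    "title": ["a1", "a2", "a3", "b1", "b2", "c1", "c2"],
})

toy_agg = toy.groupby("director_name").agg(
    year_min=("year", "min"),
    year_max=("year", "max"),
    title_count=("title", "count"),
)
toy_agg["yrs_active"] = toy_agg["year_max"] - toy_agg["year_min"]

# Assumption of this sketch: replace a 0-year span with NaN to avoid inf.
span = toy_agg["yrs_active"].replace(0, np.nan)
toy_agg["movie_per_yr"] = toy_agg["title_count"] / span

# NaN rates are placed last by sort_values.
ranked = toy_agg.sort_values("movie_per_yr", ascending=False)
print(ranked)
```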
---
title: Pfizer data
description:
duration: 900
card_type: cue_card
---
For this topic, we will be using data on a few drugs being developed by **Pfizer**.
Dataset: https://drive.google.com/file/d/173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ/view?usp=sharing
Code:
``` python=
!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
```
> **Output**
```
Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
0% 0.00/1.51k [00:00<?, ?B/s]
100% 1.51k/1.51k [00:00<00:00, 8.41MB/s]
```
**What is the data about?**
- Temperature (K)
- Pressure \(P)
The data is recorded at **1-hour intervals** every day to monitor the drug stability in a drug development test.
These data points are therefore used to **identify the optimal set of values of parameters** for the stability of the drugs.
Let's explore this dataset -
Code:
``` python=
data = pd.read_csv('Pfizer_1.csv')
data
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/245/original/s.png?1708849997" width=750 height=350>
\
Code:
``` python=
data.info()
```
> **Output**
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 18 non-null object
1 Drug_Name 18 non-null object
2 Parameter 18 non-null object
3 1:30:00 16 non-null float64
4 2:30:00 16 non-null float64
5 3:30:00 12 non-null float64
6 4:30:00 14 non-null float64
7 5:30:00 16 non-null float64
8 6:30:00 18 non-null int64
9 7:30:00 16 non-null float64
10 8:30:00 14 non-null float64
11 9:30:00 16 non-null float64
12 10:30:00 18 non-null int64
13 11:30:00 16 non-null float64
14 12:30:00 18 non-null int64
dtypes: float64(9), int64(3), object(3)
memory usage: 2.2+ KB
```
---
title: Melting
description:
duration: 1500
card_type: cue_card
---
### Melting
As we saw earlier, the dataset has **18 rows** and **15 columns**.
If you notice further, you'll see:
- The columns are `1:30:00`, `2:30:00`, `3:30:00`, and so on.
- `Temperature` and `Pressure` for each date are in separate rows.
**Can we restructure our data into a better format?**
- Maybe we can have a column for `time`, with `timestamps` as the column value.
**Where will the Temperature/Pressure values go?**
- We can similarly create one column containing the values of these parameters.
- **"Melt" the timestamp columns into two columns**: the timestamp and the corresponding value.
**How can we restructure our data into having every row corresponding to a single reading?**
Code:
``` python=
pd.melt(data, id_vars=['Date', 'Parameter', 'Drug_Name'])
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/253/original/u.png?1708852935" width=550 height=425>
\
This converts our data from `wide` to `long` format.
Notice that `id_vars` is the set of columns that remain unmelted.
**How does `pd.melt()` work?**
- Pass in the **DataFrame**.
- Pass in the **column names that we don't want to melt**.
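A minimal sketch of these two steps on a hypothetical wide table (`wide` and `long` are made-up names, shaped like the Pfizer data but much smaller). Every column not listed in `id_vars` is melted into the default `variable` and `value` columns.

```python
import pandas as pd

# Hypothetical wide table: one column per timestamp.
wide = pd.DataFrame({
    "Date": ["d1", "d1"],
    "Parameter": ["Temperature", "Pressure"],
    "1:30:00": [23.0, 12.0],
    "2:30:00": [22.0, 13.0],
})

# Date and Parameter stay as they are; the two timestamp columns are melted.
long = pd.melt(wide, id_vars=["Date", "Parameter"])

print(long)
print(long.shape)  # 2 rows x 2 melted columns give 4 rows
```

Each original row contributes one output row per melted column, which is why the long table has more rows and fewer columns.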
But we can provide better names to these new columns.
**How can we rename the columns "variable" and "value" as per our original dataframe?**
Code:
``` python=
data_melt = pd.melt(data,id_vars = ['Date', 'Drug_Name', 'Parameter'],
var_name = "time",
value_name = 'reading')
data_melt
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/254/original/v.png?1708853002" width=550 height=425>
\
**Conclusion:**
- The labels of the timestamp columns are conveniently **melted into a single column**, `time`.
- All the values are retained in the `reading` column.
- The labels of columns such as `1:30:00`, `2:30:00` have now become categories of the `time` column (named `variable` by default).
- The values from the columns we melted are stored in the `reading` column (named `value` by default).
---
title: Quiz-1
description:
duration: 60
card_type: quiz_card
---
# Question
Can we pass a list of columns like `["Date", "Parameter"]` to the `var_name` parameter of `pd.melt()` on our Pfizer dataset?
# Choices
- [ ] Yes
- [x] No
---
title: Break & Doubt Resolution
description:
duration: 600
card_type: cue_card
---
#### Quiz-1 Explanation
The columns of our current DataFrame have a single level, so `var_name` must be a single name; we can't pass multiple values to it.
### Break & Doubt Resolution
`Instructor Note:`
* Take this time (up to 5-10 mins) to give a short break to the learners.
* Meanwhile, you can ask them to share their doubts (if any) regarding the topics covered so far.
---
title: Pivoting
description:
duration: 1500
card_type: cue_card
---
### Pivoting
Now suppose we want to convert our data back to the **wide format**.
The reason could be to maintain the structure for storing or some other purpose.
Notice,
- The variables `Date`, `Drug_Name` and `Parameter` will remain the same.
- The column names will be extracted from the column `time`.
- The values will be extracted from the column `reading`.
**How can we restructure our data back to the original wide format?**
Code:
``` python=
data_melt.pivot(index=['Date','Drug_Name','Parameter'], # Columns used to make new frame’s index
columns = 'time', # Column used to make new frame’s columns
values='reading') # Column used for populating new frame’s values.
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/255/original/y.png?1708853437" width=750 height=350>
\
Notice that `pivot()` is the exact opposite of `melt()`.
We get a **multi-level row index** here, but we can return to a single index using `reset_index()`.
Code:
``` python=
data_melt.pivot(index=['Date','Drug_Name','Parameter'],
columns = 'time',
values='reading').reset_index()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/256/original/z.png?1708853500" width=750 height=350>
\
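The round trip from wide to long and back can be checked on a hypothetical toy table (`wide`, `long` and `back` are made-up names). This is only a sketch: it assumes the toy data has no missing readings and that each `Date`/`Parameter` pair appears once.

```python
import pandas as pd

# Hypothetical wide table, shaped like the Pfizer data but much smaller.
wide = pd.DataFrame({
    "Date": ["d1", "d1"],
    "Parameter": ["Temperature", "Pressure"],
    "1:30:00": [23.0, 12.0],
    "2:30:00": [22.0, 13.0],
})

# Wide -> long.
long = wide.melt(id_vars=["Date", "Parameter"], var_name="time", value_name="reading")

# Long -> wide again.
back = (
    long.pivot(index=["Date", "Parameter"], columns="time", values="reading")
        .reset_index()
)
back.columns.name = None  # drop the leftover 'time' label on the column axis

# pivot sorts the index, so sort both frames the same way before comparing.
same = back.sort_values(["Date", "Parameter"]).reset_index(drop=True).equals(
    wide.sort_values(["Date", "Parameter"]).reset_index(drop=True)
)
print(same)
```

Row order may differ after the round trip because `pivot()` sorts by the index, which is why both frames are sorted before the comparison.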
Code:
``` python=
data_melt.head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/246/original/x.png?1708850607" width=500 height=200>
\
Now if you notice,
- We are using 2 rows to log readings for a single experiment.
**Can we restructure our data further by splitting the `Parameter` column into separate Temperature and Pressure columns?**
- A format like `Date | time | Drug_Name | Pressure | Temperature` would be suitable.
- We want to **split one single column into multiple columns**.
**How can we divide the `Parameter` column again?**
Code:
``` python=
data_tidy = data_melt.pivot(index=['Date','time', 'Drug_Name'],
columns = 'Parameter',
values='reading')
data_tidy
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/247/original/x1.png?1708850772" width=550 height=450>
\
Notice that a **multi-index** dataframe has been created.
We can use `reset_index()` to remove the multi-index.
Code:
``` python=
data_tidy = data_tidy.reset_index()
data_tidy
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/248/original/x2.png?1708851048" width=600 height=400>
\
After `reset_index()`, the column axis still carries the leftover name `Parameter`. We can clear it by setting it to `None`.
Code:
``` python=
data_tidy.columns.name = None
data_tidy.head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/248/original/x2.png?1708851048" width=600 height=400>
---
title: Pivot Table
description:
duration: 1500
card_type: cue_card
---
### Pivot Table
Now suppose we want to find some insights, like **mean temperature day-wise**.
**Can we use pivot to find the day-wise mean value of temperature for each drug?**
Code:
``` python=
data_tidy.pivot(index=['Drug_Name'],
columns = 'Date',
values=['Temperature'])
```
> **Output**
```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-25-760683e83d71> in <cell line: 1>()
----> 1 data_tidy.pivot(index=['Drug_Name'],
2 columns = 'Date',
3 values=['Temperature'])
/usr/local/lib/python3.10/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
329 stacklevel=find_stack_level(),
330 )
--> 331 return func(*args, **kwargs)
332
333 # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in pivot(self, index, columns, values)
8565 from pandas.core.reshape.pivot import pivot
8566
-> 8567 return pivot(self, index=index, columns=columns, values=values)
8568
8569 _shared_docs[
/usr/local/lib/python3.10/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
329 stacklevel=find_stack_level(),
330 )
--> 331 return func(*args, **kwargs)
332
333 # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no
/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/pivot.py in pivot(data, index, columns, values)
538 # [List[Any], ExtensionArray, ndarray[Any, Any], Index, Series]"; expected
539 # "Hashable"
--> 540 return indexed.unstack(columns_listlike) # type: ignore[arg-type]
541
542
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in unstack(self, level, fill_value)
9110 from pandas.core.reshape.reshape import unstack
9111
-> 9112 result = unstack(self, level, fill_value)
9113
9114 return result.__finalize__(self, method="unstack")
/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
474 if isinstance(obj, DataFrame):
475 if isinstance(obj.index, MultiIndex):
--> 476 return _unstack_frame(obj, level, fill_value=fill_value)
477 else:
478 return obj.T.stack(dropna=False)
/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
497 def _unstack_frame(obj: DataFrame, level, fill_value=None):
498 assert isinstance(obj.index, MultiIndex) # checked by caller
--> 499 unstacker = _Unstacker(obj.index, level=level, constructor=obj._constructor)
500
501 if not obj._can_fast_transpose:
/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/reshape.py in __init__(self, index, level, constructor)
135 )
136
--> 137 self._make_selectors()
138
139 @cache_readonly
/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/reshape.py in _make_selectors(self)
187
188 if mask.sum() < len(self.index):
--> 189 raise ValueError("Index contains duplicate entries, cannot reshape")
190
191 self.group_index = comp_index
ValueError: Index contains duplicate entries, cannot reshape
```
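**Why did we get an error?**
- If you notice, the error reports **duplicate entries**: each drug has several temperature readings on the same date.
- `pivot()` needs every index/column combination to be unique, because it has no way of deciding which reading to keep.
- What we actually need is the **average** of the temperature values throughout a day.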
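The failure can be reproduced on a hypothetical toy table (`tidy` and its values are made up). The sketch first lists the rows that share a `Drug_Name`/`Date` pair, then shows that `pivot()` raises a `ValueError` for them.

```python
import pandas as pd

# Hypothetical tidy data: drugA has two readings on the same date.
tidy = pd.DataFrame({
    "Drug_Name": ["drugA", "drugA", "drugB"],
    "Date": ["d1", "d1", "d1"],
    "Temperature": [20.0, 30.0, 25.0],
})

# Rows sharing a (Drug_Name, Date) pair are the ones pivot cannot place.
dupes = tidy.duplicated(subset=["Drug_Name", "Date"], keep=False)
print(tidy[dupes])

try:
    tidy.pivot(index="Drug_Name", columns="Date", values="Temperature")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(err)
```

Checking with `duplicated()` before pivoting is one way to find out whether an aggregation step is needed.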
**What can we do to get our required mean values then?**
Code:
``` python=
pd.pivot_table(data_tidy, index='Drug_Name', columns='Date', values=['Temperature'], aggfunc='mean')
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/250/original/x4.png?1708851728" width=500 height=220>
\
This function is similar to `pivot()`, with an extra feature of an aggregator.
**How does `pivot_table()` work?**
- The initial parameters are the same as what we use in `pivot()`.
- As an extra parameter, we pass the **type of aggregator**.
**Note:**
- We could have done this using `groupby` too.
- In fact, `pivot_table` uses `groupby` in the backend to group the data and perform the aggregation.
- The only difference is in the shape of the output we get from the two functions.
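To illustrate the note above, this sketch computes the same day-wise means on a hypothetical toy table in two ways (`tidy`, `via_pivot` and `via_groupby` are made-up names). The `groupby` result is a long Series, so `unstack()` is used here to spread `Date` across the columns for comparison.

```python
import pandas as pd

# Hypothetical tidy data with repeated readings per drug and date.
tidy = pd.DataFrame({
    "Drug_Name": ["drugA", "drugA", "drugB", "drugB"],
    "Date": ["d1", "d1", "d1", "d2"],
    "Temperature": [20.0, 30.0, 25.0, 15.0],
})

# pivot_table: group, aggregate and reshape in one call.
via_pivot = pd.pivot_table(tidy, index="Drug_Name", columns="Date",
                           values="Temperature", aggfunc="mean")

# groupby: group and aggregate, then reshape explicitly with unstack.
via_groupby = (
    tidy.groupby(["Drug_Name", "Date"])["Temperature"].mean().unstack("Date")
)

print(via_pivot)
print(via_groupby)
```

Combinations with no data (drugA on d2 here) appear as `NaN` in both results.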
**Similarly, what if we want to find the minimum values of temperature and pressure on a particular date?**
Code:
``` python=
pd.pivot_table(data_tidy, index='Drug_Name', columns='Date', values=['Temperature', 'Pressure'], aggfunc='min')
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/251/original/x5.png?1708851925" width=700 height=200>
---
title: Quiz-2
description:
duration: 60
card_type: quiz_card
---
# Question
Consider a dataset containing information about sales transactions, with columns '**Date**', '**Product**', '**Quantity**', and '**Revenue**'.
After performing a pivot operation on the '**Product**' column, which of the following statements is most likely to be true?
# Choices
- [ ] The resulting DataFrame will have more rows than the original dataset.
- [x] Each unique value in the 'Product' column will become a separate column in the pivoted DataFrame.
- [ ] The total revenue across all products will remain unchanged after the pivot operation.
---
title: Binning
description:
duration: 1500
card_type: cue_card
---
#### Quiz-2 Explanation
When performing a pivot operation in Pandas, each unique value in the specified columns (in this case, the 'Product' column) becomes a separate column in the pivoted DataFrame.
### Binning
Sometimes, we want our data to be in **categorical** form instead of **continuous/numerical**.
- Let's say, instead of knowing the specific test values, I want to know which category they fall into.
- Depending on the level of granularity, we want to have - Low, Medium, High, Very High.
**How can we derive bins/buckets from continuous data?**
- Use `pd.cut()`
Let's try to use this on our `Temperature` column to categorise the data into bins.
But to define categories, let's first check `min` and `max` temperature values.
Code:
``` python=
data_tidy
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/088/993/original/p1.png?1725868246" width=600 height=450>
\
Code:
``` python=
print(data_tidy['Temperature'].min(), data_tidy['Temperature'].max())
```
> **Output**
8.0 58.0
Here,
- Min value = 8
- Max value = 58
Let's keep some buffer for future values and take the range from 5-60 (instead of 8-58).
We'll divide this range into **4 bins**, each spanning 10-15 degrees.
Code:
``` python=
temp_points = [5, 20, 35, 50, 60]
temp_labels = ['low','medium','high','very_high'] # labels define the severity of the resultant output of the test
data_tidy['temp_cat'] = pd.cut(data_tidy['Temperature'], bins=temp_points, labels=temp_labels)
data_tidy.head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/088/994/original/p2.png?1725868296" width=600 height=200>
\
Code:
``` python=
data_tidy['temp_cat'].value_counts()
```
> **Output**
low 50
medium 38
high 15
very_high 5
Name: temp_cat, dtype: int64
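**Note:** By default, `pd.cut()` creates intervals of the form (x, y], which include the right endpoint but exclude the left one. A value exactly equal to the lowest bin edge therefore falls outside every bin unless `include_lowest=True` is passed.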
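The effect of the interval endpoints can be seen on a few hypothetical values that sit exactly on the bin edges (`temps` and the three result names are made up). The sketch compares the default behaviour with the `include_lowest=True` and `right=False` options of `pd.cut()`.

```python
import pandas as pd

# Hypothetical temperatures placed exactly on bin edges.
temps = pd.Series([5, 20, 35, 60])
edges = [5, 20, 35, 50, 60]
labels = ["low", "medium", "high", "very_high"]

# Default: intervals are (5, 20], (20, 35], (35, 50], (50, 60].
default_cut = pd.cut(temps, bins=edges, labels=labels)

# include_lowest=True also puts the value 5 into the first bin.
lowest_cut = pd.cut(temps, bins=edges, labels=labels, include_lowest=True)

# right=False flips the intervals to [5, 20), [20, 35), [35, 50), [50, 60).
left_cut = pd.cut(temps, bins=edges, labels=labels, right=False)

print(pd.DataFrame({"temp": temps, "default": default_cut,
                    "include_lowest": lowest_cut, "right_false": left_cut}))
```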
---
title: Quiz-3
description:
duration: 60
card_type: quiz_card
---
# Question
How does the `pd.cut()` function behave when applied to an already categorical column to further segment it into categories?
# Choices
- [x] An error is raised
- [ ] It bins the values further
- [ ] No change occurs
---
title: Unlock Assignment & ask learner to solve in live class
description:
duration: 1800
card_type: cue_card
---
* <span style="color:skyblue">Unlock the assignment for learners</span> by clicking the **“question mark”** button on the top bar.
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/078/685/original/Screenshot_2024-06-19_at_7.17.12_PM.png?1718804854" width=200 />
* If you face any difficulties using this feature, please refer to this video on how to unlock assignments.
* <span style="color:red">**Note:** The following video is strictly for instructor reference only. [VIDEO LINK](https://www.loom.com/share/15672134598f4b4c93475beda227fb3d?sid=4fb31191-ae8c-4b18-bf81-468d2ffd9bd4)</span>
### Conducting a Live Assignment Solution Session:
1. Once you unlock the assignments, ask if anyone in the class would like to solve a question live by sharing their screen.
2. Select a learner and grant permission by navigating to <span style="color:skyblue">**Settings > Admin > Unmuted Audience Can Share**, then select **Audio, Video, and Screen**.</span>
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/111/113/original/image.png?1740484517" width=400 />
3. Allow the selected learner to share their screen and guide them through solving the question live.
4. Engage with both the learner sharing the screen and other students in the class to foster an interactive learning experience.
### Practice Coding Question(s)
You can pick the following question and solve it during the lecture itself.
This will help the learners to get familiar with the problem solving process and motivate them to solve the assignments.
<span style="background-color: pink;">Make sure to start the doubt session before you solve this question.</span>
Q. https://www.scaler.com/hire/test/problem/103383/ - Year Wise Life Expectancy