# Pandas 5
---
title: Agenda
description:
duration: 300
card_type: cue_card
---
### Content
- Null/Missing values
- `None` vs `NaN` values
- `isna()` & `isnull()`
- Removing null values
- `dropna()`
- Data Imputation
- `fillna()`
- String methods
- Datetime values
- Writing to a file
---
title: Data Preparation
description:
duration: 900
card_type: cue_card
---
### Data Preparation
Code:
``` python=
# !pip install --upgrade gdown
!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
```
> **Output**
```
Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
100% 1.51k/1.51k [00:00<00:00, 6.52MB/s]
```
Code:
``` python=
import pandas as pd
import numpy as np
data = pd.read_csv('Pfizer_1.csv')

# Wide -> long: turn the time-of-day columns into rows
data_melt = pd.melt(data, id_vars=['Date', 'Drug_Name', 'Parameter'],
                    var_name='time',
                    value_name='reading')

# Long -> tidy: one column per Parameter reading
data_tidy = data_melt.pivot(index=['Date', 'time', 'Drug_Name'],
                            columns='Parameter',
                            values='reading')
data_tidy = data_tidy.reset_index()
data_tidy.columns.name = None
```
Code:
``` python=
data.head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/289/original/a1.png?1708870288" width=650 height=220>
\
Code:
``` python=
data_melt.head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/290/original/b1.png?1708870320" width=500 height=200>
\
Code:
``` python=
data_tidy.head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/291/original/c1.png?1708870337" width=500 height=200>
---
title: None vs NaN
description:
duration: 1200
card_type: cue_card
---
### `None` vs `NaN`
If you notice, there are many `NaN` values in our data.
**What are these `NaN` values?**
- They are basically **missing/null values**.
- A null value signifies an **empty cell/no data**.
There can be 2 kinds of missing values:
1. `None`
2. `NaN` (Not a Number)
**What's the difference between `None` and `NaN`?**
Both `None` and `NaN` can be used for missing values, but their representation and behaviour may differ based on the **column's data type**.
Code:
``` python=
type(None)
```
> **Output**
NoneType
Code:
``` python=
type(np.nan)
```
> **Output**
float
1. **`None` in non-numeric columns**: `None` can be used directly, and it will appear as `None`.
2. **`None` in numeric columns**: Pandas automatically converts `None` to `NaN`.
3. **`NaN` in numeric columns**: `NaN` is used to represent missing values and appears as `NaN`.
4. **`NaN` in non-numeric columns**: `NaN` can be used, and it appears as `NaN`.
Code:
``` python=
pd.Series([1, np.nan, 2, None])
```
> **Output**
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
For **numerical** type, Pandas changes `None` to `NaN`.
Code:
``` python=
pd.Series(["1", "np.nan", "2", None])
```
> **Output**
0 1
1 np.nan
2 2
3 None
dtype: object
Code:
``` python=
pd.Series(["1", "np.nan", "2", np.nan])
```
> **Output**
0 1
1 np.nan
2 2
3 NaN
dtype: object
For **object** type, the `None` is preserved and not changed to `NaN`.
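Note one more quirk worth knowing: `NaN` is **not equal to itself**, so equality checks cannot be used to detect missing values. A quick sketch:

Code:
``` python=
print(np.nan == np.nan)                # False, NaN never equals anything, even itself
print(None == None)                    # True, None is a regular Python singleton
print(pd.isna(np.nan), pd.isna(None))  # True True, isna() detects both
```
This is why Pandas provides dedicated methods for detecting missing values, which we cover next.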
---
title: isna() & isnull()
description:
duration: 1200
card_type: cue_card
---
### `isna()` & `isnull()`
**How to get the count of missing values for each row/column?**
- `df.isna()`
- `df.isnull()`
Code:
``` python=
data.isna().head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/292/original/d1.png?1708870500" width=700 height=135>
\
Code:
``` python=
data.isnull().head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/292/original/d1.png?1708870500" width=700 height=135>
\
Notice that both `isna()` and `isnull()` give the same results.
**But why do we have two methods, `isna()` and `isnull()` for the same operation?**
- `isnull()` is just an alias for `isna()`
Code:
``` python=
pd.isnull
```
> **Output**
pandas.core.dtypes.missing.isna
def isna(obj: object) -> bool | npt.NDArray[np.bool_] | NDFrame
Code:
``` python=
pd.isna
```
> **Output**
pandas.core.dtypes.missing.isna
def isna(obj: object) -> bool | npt.NDArray[np.bool_] | NDFrame
As we can see, the function signature is the same for both.
- `isna()` returns a **boolean dataframe**, with each cell as a boolean value.
- This value corresponds to **whether the cell has a missing value**.
- On top of this, we can use `.sum()` to find the count of the missing values.
Code:
``` python=
data.isna().sum()
```
> **Output**
Date 0
Drug_Name 0
Parameter 0
1:30:00 2
2:30:00 2
3:30:00 6
4:30:00 4
5:30:00 2
6:30:00 0
7:30:00 2
8:30:00 4
9:30:00 2
10:30:00 0
11:30:00 2
12:30:00 0
dtype: int64
This gives us the total number of missing values in each column.
**How can we get the number of missing values in each row?**
Code:
``` python=
data.isna().sum(axis=1)
```
> **Output**
0 1
1 1
2 4
3 4
4 3
5 3
6 1
7 1
8 1
9 1
10 2
11 2
12 1
13 1
14 0
15 0
16 0
17 0
dtype: int64
**Note:** The default for `sum()` is `axis=0`, i.e., column-wise.
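As a quick aside, chaining `.sum()` twice gives the grand total of missing values in the whole DataFrame:

Code:
``` python=
data.isna().sum().sum()  # column-wise counts, summed again -> 26 for this dataset
```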
**We now have identified the null count, but how do we deal with them?**
We have two options:
- Delete the rows/columns containing the null values.
- Fill the missing values with some data/estimate.
Let's first look at deleting the rows.
---
title: Removing null values
description:
duration: 1200
card_type: cue_card
---
### Removing null values
**How can we drop rows containing null values?**
Code:
``` python=
data.dropna()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/294/original/e1.png?1708870927" width=700 height=200>
\
Notice that rows with even a single missing value have been deleted.
**What if we want to delete the columns having missing values?**
Code:
``` python=
data.dropna(axis=1)
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/296/original/f1.png?1708871063" width=600 height=600>
\
Notice that every column which had even a single missing value has been deleted.
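Deleting every affected row or column is often too aggressive. `dropna()` also takes a few parameters to control this; a quick sketch:

Code:
``` python=
data.dropna(how='all')           # drop a row only if ALL of its values are missing
data.dropna(thresh=12)           # keep only rows with at least 12 non-null values
data.dropna(subset=['2:30:00'])  # drop rows with a null in this column only
```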
**But what are the problems with deleting rows/columns?**
- Loss of valuable data
So instead of dropping, it would be better to **fill the missing values with some data**.
---
title: Break & Doubt Resolution
description:
duration: 600
card_type: cue_card
---
### Break & Doubt Resolution
`Instructor Note:`
* Take this time (up to 5-10 mins) to give a short break to the learners.
* Meanwhile, you can ask them to share their doubts (if any) regarding the topics covered so far.
---
title: Data Imputation
description:
duration: 2400
card_type: cue_card
---
### Data Imputation
**How can we fill the missing values with some data?**
Code:
``` python=
data.fillna(0).head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/299/original/g1.png?1708873423" width=700 height=250>
\
**What is `fillna(0)` doing?**
- It fills all the missing values with 0.
We can do the same on a particular column too.
Code:
``` python=
data['2:30:00'].fillna(0)
```
> **Output**
0 22.0
1 13.0
2 17.0
3 22.0
4 0.0
5 0.0
6 35.0
7 19.0
8 47.0
9 24.0
10 9.0
11 12.0
12 19.0
13 4.0
14 13.0
15 22.0
16 14.0
17 9.0
Name: 2:30:00, dtype: float64
**Note:**
Handling missing values completely depends on the business problem.
However, in general practice (assuming you have a large dataset) -
- if the missing values are minimal (\<5% of rows), dropping them is acceptable.
- for substantial missing values (\>10% of rows), use a suitable imputation technique.
- if a column has over 50% of null values, drop that column (unless it's very crucial for the analysis).
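To apply these thresholds, here is a minimal sketch that computes the percentage of missing values per column (the mean of a boolean column is the fraction of `True` values):

Code:
``` python=
missing_pct = data.isna().mean() * 100  # fraction of nulls per column, as a percentage
missing_pct.sort_values(ascending=False)
```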
**What other values can we use to fill the missing values?**
We can use some kind of estimator too.
- mean (average value)
- median
- mode (most frequently occurring value)
**How would you calculate the mean of the column `2:30:00`?**
Code:
``` python=
data['2:30:00'].mean()
```
> **Output**
18.8125
Now let's fill the `NaN` values with the mean value of the column.
Code:
``` python=
data['2:30:00'].fillna(data['2:30:00'].mean())
```
> **Output**
0 22.0000
1 13.0000
2 17.0000
3 22.0000
4 18.8125
5 18.8125
6 35.0000
7 19.0000
8 47.0000
9 24.0000
10 9.0000
11 12.0000
12 19.0000
13 4.0000
14 13.0000
15 22.0000
16 14.0000
17 9.0000
Name: 2:30:00, dtype: float64
But this doesn't feel right. What could be wrong with this?
**Can we use the mean across all compounds as our estimate?**
- Different drugs have different characteristics.
- We can't simply do an average and fill the null values.
**Then what could be the solution here?**
We could fill the null values of respective compounds with their respective means.
**How can we form a column with the mean temperature of each compound?**
- We can use `apply()`
Let's first create a function to calculate the mean.
Code:
``` python=
def temp_mean(x):
    # x is the sub-DataFrame for one drug; add its group mean as a column
    x['Temperature_avg'] = x['Temperature'].mean()
    return x
```
Now we can form a new column based on the average values of temperature for each drug.
Code:
``` python=
data_tidy = data_tidy.groupby(["Drug_Name"], group_keys=False).apply(temp_mean)
data_tidy
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/316/original/z3.png?1708874958" width=700 height=400>
\
Now we fill the null values in `Temperature` using this new column.
Code:
``` python=
data_tidy['Temperature'].fillna(data_tidy["Temperature_avg"], inplace=True)
data_tidy
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/317/original/z4.png?1708875052" width=700 height=400>
\
Code:
``` python=
data_tidy.isna().sum()
```
> **Output**
Date 0
time 0
Drug_Name 0
Pressure 13
Temperature 0
Temperature_avg 0
dtype: int64
Great!
We have removed the null values from our `Temperature` column.
Let's do the same for `Pressure`.
Code:
``` python=
def pr_mean(x):
    # x is the sub-DataFrame for one drug; add its group mean as a column
    x['Pressure_avg'] = x['Pressure'].mean()
    return x

data_tidy = data_tidy.groupby(["Drug_Name"], group_keys=False).apply(pr_mean)
data_tidy['Pressure'].fillna(data_tidy["Pressure_avg"], inplace=True)
data_tidy
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/318/original/z5.png?1708875175" width=700 height=375>
\
Code:
``` python=
data_tidy.isna().sum()
```
> **Output**
Date 0
time 0
Drug_Name 0
Pressure 0
Temperature 0
Temperature_avg 0
Pressure_avg 0
dtype: int64
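As a side note, the same group-wise imputation can be written more compactly with `groupby().transform()`, with no helper functions needed (a sketch, equivalent in spirit to the `apply()` approach above):

Code:
``` python=
for col in ['Temperature', 'Pressure']:
    group_mean = data_tidy.groupby('Drug_Name')[col].transform('mean')  # per-drug mean, aligned to rows
    data_tidy[col] = data_tidy[col].fillna(group_mean)
```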
**How to decide if we should impute the missing values with `mean`, `median` or `mode`?**
1. `Mean`: Use when dealing with numerical data that is normally distributed and not heavily skewed by outliers.
2. `Median`: Preferable when data is skewed or contains outliers. It's suitable for ordinal or interval data.
3. `Mode`: Suitable for categorical or nominal data where there are distinct categories.
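For reference, a minimal sketch of all three imputations on one column (`mode()` returns a Series of the most frequent values, so we take the first):

Code:
``` python=
col = data['2:30:00']
col.fillna(col.mean())     # mean: symmetric numerical data
col.fillna(col.median())   # median: skewed data / outliers
col.fillna(col.mode()[0])  # mode: categorical data
```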
---
title: Quiz-1
description:
duration: 60
card_type: quiz_card
---
# Question
What would be the value at the 4th index after running the following code snippet?
```python=
sample = pd.Series(['1', '2', '3', np.nan, None])
sample.fillna(0)
```
# Choices
- [x] 0
- [ ] None
- [ ] NaN
- [ ] Error
---
title: Quiz-2
description:
duration: 60
card_type: quiz_card
---
# Question
Based on the given DataFrame, which of the following statements regarding data imputation is most accurate?
```
| CustomerID | TransactionAmount | Gender | Age | ProductCategory |
|----------------|---------------------|----------------|-------|-------------------|
| 101 | 20 | Male | 35 | Apparel |
| 102 | NaN | Female | 28 | NaN |
| 103 | 15 | Female | NaN | Electronics |
| 104 | 30 | NaN | 42 | Electronics |
| 105 | 150 | Male | 30 | Apparel |
```
# Choices
- [ ] Imputing missing values in the "TransactionAmount" column using the mean of the available values may not be suitable due to potential skewness caused by outliers.
- [ ] Imputing missing values in the "TransactionAmount" column using the median of the available values may be suitable to handle skewness due to outliers.
- [ ] The presence of missing values in the "Gender" column can be effectively handled by imputing the most frequent category (mode).
- [x] All of the above
---
title: String methods
description:
duration: 1200
card_type: cue_card
---
#### Quiz-2 Explanation
* Option A is correct because imputing missing values in the "TransactionAmount" column with the mean may not be appropriate if the data contains outliers. Outliers can significantly skew the mean, leading to inaccurate imputations.
* Option B is correct because when the data is skewed, the median, which is robust to outliers, can impute the missing values better.
* Option C is correct because for the categorical "Gender" column, the most frequently occurring category can be used for imputation, as gender is unlikely to exhibit significant variation in a dataset of customer transactions.
### String methods
**What kind of questions can we use string methods for?**
- Find rows which contain a particular string.
Say,
**How can you filter rows containing "hydrochloride" in their drug name?**
Code:
``` python=
data_tidy.loc[data_tidy['Drug_Name'].str.contains('hydrochloride')].head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/313/original/p1.png?1708874201" width=700 height=160></br>
- So in general, we will be using the following format: `Series.str.function()`
- `Series.str` can be used to access the values of the series as strings and apply several methods to it.
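A few other commonly used `.str` methods, as a quick sketch on our `Drug_Name` column:

Code:
``` python=
data_tidy['Drug_Name'].str.lower()            # lowercase every value
data_tidy['Drug_Name'].str.len()              # length of each string
data_tidy['Drug_Name'].str.replace(' ', '_')  # substitute characters
```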
Now suppose we want to form a new column based on the year of the experiments.
**How can we form a column containing the year?**
Code:
``` python=
data_tidy['Date'].str.split('-')
```
> **Output**
0 [15, 10, 2020]
1 [15, 10, 2020]
2 [15, 10, 2020]
3 [15, 10, 2020]
4 [15, 10, 2020]
...
103 [17, 10, 2020]
104 [17, 10, 2020]
105 [17, 10, 2020]
106 [17, 10, 2020]
107 [17, 10, 2020]
Name: Date, Length: 108, dtype: object
To extract the year, we need to select the last element of each list.
Code:
``` python=
data_tidy['Date'].str.split('-').apply(lambda x:x[2])
```
> **Output**
0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Name: Date, Length: 108, dtype: object
But there are certain problems with this approach.
- The **dtype of the output is still `object`**; we would prefer a numeric type.
- The date format will **not always be day-month-year**; it can vary.
Thus, to work with such date-time type of data, we can use a special method from Pandas.
---
title: Datetime
description:
duration: 1200
card_type: cue_card
---
### Datetime
**How can we handle datetime data types?**
- We can use the `to_datetime()` function of Pandas
- It takes as input:
    - An array/scalar of values in a recognizable date/time format
    - `dayfirst`: indicates whether the day comes first in the date format used
    - `yearfirst`: indicates whether the year comes first in the date format used
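A quick illustration of why `dayfirst` matters, using a made-up ambiguous date string:

Code:
``` python=
print(pd.to_datetime('05-10-2020'))                 # month first by default -> 2020-05-10
print(pd.to_datetime('05-10-2020', dayfirst=True))  # day first -> 2020-10-05
```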
Let's first merge our `Date` and `time` columns into a new `timestamp` column.
Code:
``` python=
data_tidy['timestamp'] = data_tidy['Date']+ " "+ data_tidy['time']
data_tidy.drop(['Date', 'time'], axis=1, inplace=True)
data_tidy.head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/312/original/q1.png?1708874145" width=700 height=160>
\
Now let's convert our `timestamp` column into **datetime**.
Code:
``` python=
data_tidy['timestamp'] = pd.to_datetime(data_tidy['timestamp'])
data_tidy
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/311/original/s1.png?1708873759" width=700 height=335>
\
Code:
``` python=
data_tidy.info()
```
> **Output**
<class 'pandas.core.frame.DataFrame'>
Int64Index: 108 entries, 0 to 107
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Drug_Name 108 non-null object
1 Pressure 108 non-null float64
2 Temperature 108 non-null float64
3 Temperature_avg 108 non-null float64
4 Pressure_avg 108 non-null float64
5 temp_cat 108 non-null category
6 timestamp 108 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(4), object(1)
memory usage: 10.3+ KB
The type of `timestamp` column has been changed from `object` to `datetime`.
Now, let's look at a single timestamp using Pandas.
**How can we extract information from a single timestamp using Pandas?**
Code:
``` python=
ts = data_tidy['timestamp'][0]
ts
```
> **Output**
Timestamp('2020-10-15 10:30:00')
Code:
``` python=
ts.year, ts.month, ts.day, ts.month_name()
```
> **Output**
(2020, 10, 15, 'October')
Code:
```python=
ts.hour, ts.minute, ts.second
```
> **Output**
(10, 30, 0)
This data parsing from `string` to `datetime` makes it easier to work with such data.
We can use this data from the columns as a whole using the `.dt` accessor.
Code:
``` python=
data_tidy['timestamp'].dt
```
> **Output**
<pandas.core.indexes.accessors.DatetimeProperties object at 0x7c2e78c72b60>
- `dt` gives properties of the values in a column.
- From this `DatetimeProperties` object of the `timestamp` column, we can extract `year`.
Code:
``` python=
data_tidy['timestamp'].dt.year
```
> **Output**
0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Name: timestamp, Length: 108, dtype: int64
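The `.dt` accessor exposes the other components too; a short sketch deriving several parts at once (the new column names are just illustrative):

Code:
``` python=
data_tidy['year'] = data_tidy['timestamp'].dt.year
data_tidy['month'] = data_tidy['timestamp'].dt.month_name()
data_tidy['hour'] = data_tidy['timestamp'].dt.hour
```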
We can use `strftime` (**short for string format time**) to modify our datetime format.
Let's learn this with the help of a few examples.
Code:
``` python=
data_tidy['timestamp'][0]
```
> **Output**
Timestamp('2020-10-15 10:30:00')
Code:
``` python=
print(data_tidy['timestamp'][0].strftime('%Y')) # formatter for year
```
> **Output**
2020
Similarly, we can combine the format types to modify the datetime format as per our convenience.
A comprehensive list of other formats can be found here: https://pandas.pydata.org/docs/reference/api/pandas.Period.strftime.html
Code:
``` python=
data_tidy['timestamp'][0].strftime('%m-%d')
```
> **Output**
'10-15'
---
title: Quiz-3
description:
duration: 60
card_type: quiz_card
---
# Question
Given the following dataset:
```python=
df = pd.DataFrame([[1, '2020-01-01'], [2, '1998-01-12'], [3, '2012-11-05'],
                   [4, '2000-12-03'], [5, '1960-04-23'], [6, '2008-08-15']],
                  columns=["ID", "birth_dates"])
df['birth_dates'] = pd.to_datetime(df['birth_dates'])
```
What would be the output of the following code?
```python=
df.iloc[2]['birth_dates'].year - df.iloc[1]['birth_dates'].year
```
# Choices
- [ ] 8
- [x] 14
- [ ] 22
- [ ] -22
---
title: Writing to a file
description:
duration: 600
card_type: cue_card
---
### Writing to a file
**How can we write our dataframe to a CSV file?**
- We have to provide the `path` and `file_name` in which we want to store the data.
Code:
``` python=
data_tidy.to_csv('pfizer_tidy.csv', sep=",", index=False)
```
Setting `index=False` will not include the index column while writing.
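To sanity-check the written file, we can read it back (a quick sketch; `parse_dates` re-parses the `timestamp` column on load):

Code:
``` python=
check = pd.read_csv('pfizer_tidy.csv', parse_dates=['timestamp'])
check.head()
```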
### Extra-reading material
- [**Coding Exercise (Pandas)**](https://colab.research.google.com/drive/1yn1OGCBJJQJp1sIljkmJwdf0uySIJUhO?usp=sharing)
---
title: Launch a feedback poll
description: To gather valuable feedback regarding pace adjustment
duration: 30
card_type: poll_card
---
# Description
Which of the following best depicts your current level of confidence about the pace and difficulty of the material covered in the last 3 lectures?
# Choices
- Super confident: Feeling super confident and comfortable with pace, ready to conquer the next lesson.
- Somewhat confident: Grasping most concepts and comfortable with content & pace, but a few concepts need brushing up.
- Not so confident: While I understand some concepts, I'm finding the pace a bit too fast at times.
- Feeling a bit lost: I'm finding it difficult to keep up with the pace or grasp certain topics.
- Completely lost: I'm struggling significantly with the pace and difficulty of the material.
---
title: Unlock Assignment & ask learner to solve in live class
description:
duration: 1800
card_type: cue_card
---
* <span style="color:skyblue">Unlock the assignment for learners</span> by clicking the **"question mark"** button on the top bar.
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/078/685/original/Screenshot_2024-06-19_at_7.17.12_PM.png?1718804854" width=200 />
* If you face any difficulties using this feature, please refer to this video on how to unlock assignments.
* <span style="color:red">**Note:** The following video is strictly for instructor reference only. [VIDEO LINK](https://www.loom.com/share/15672134598f4b4c93475beda227fb3d?sid=4fb31191-ae8c-4b18-bf81-468d2ffd9bd4)</span>
### Conducting a Live Assignment Solution Session:
1. Once you unlock the assignments, ask if anyone in the class would like to solve a question live by sharing their screen.
2. Select a learner and grant permission by navigating to <span style="color:skyblue">**Settings > Admin > Unmuted Audience Can Share**, then select **Audio, Video, and Screen**.</span>
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/111/113/original/image.png?1740484517" width=400 />
3. Allow the selected learner to share their screen and guide them through solving the question live.
4. Engage with both the learner sharing the screen and other students in the class to foster an interactive learning experience.
### Practice Coding Question(s)
You can pick the following question and solve it during the lecture itself.
This will help the learners get familiar with the problem-solving process and motivate them to solve the assignments.
<span style="background-color: pink;">Make sure to start the doubt session before you solve this question.</span>
Q. https://www.scaler.com/hire/test/problem/102720/ - Change Date Format