# Pandas 5
---
title: Agenda
description:
duration: 300
card_type: cue_card
---
### Content
- Null/Missing values
- `None` vs `NaN` values
- `isna()` & `isnull()`
- Removing null values
- `dropna()`
- Data Imputation
- `fillna()`
- String methods
- Datetime values
- Writing to a file
---
title: Data Preparation
description:
duration: 900
card_type: cue_card
---
### Data Preparation
Code:
``` python=
# !pip install --upgrade gdown
!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
```
> **Output**
```
Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
100% 1.51k/1.51k [00:00<00:00, 6.52MB/s]
```
Code:
``` python=
import pandas as pd
import numpy as np
data = pd.read_csv('Pfizer_1.csv')

# Wide -> long: turn the time-of-day columns into rows
data_melt = pd.melt(data, id_vars=['Date', 'Drug_Name', 'Parameter'],
                    var_name='time',
                    value_name='reading')

# Long -> tidy: one column per Parameter reading
data_tidy = data_melt.pivot(index=['Date', 'time', 'Drug_Name'],
                            columns='Parameter',
                            values='reading')
data_tidy = data_tidy.reset_index()
data_tidy.columns.name = None
```
Code:
``` python=
data.head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/289/original/a1.png?1708870288" width=650 height=220>
\
Code:
``` python=
data_melt.head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/290/original/b1.png?1708870320" width=500 height=200>
\
Code:
``` python=
data_tidy.head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/291/original/c1.png?1708870337" width=500 height=200>
---
title: None vs NaN
description:
duration: 1200
card_type: cue_card
---
### `None` vs `NaN`
If you notice, there are many `NaN` values in our data.
**What are these `NaN` values?**
- They are basically **missing/null values**.
- A null value signifies an **empty cell/no data**.
There can be 2 kinds of missing values:
1. `None`
2. `NaN` (Not a Number)
**What's the difference between `None` and `NaN`?**
Both `None` and `NaN` can be used for missing values, but their representation and behaviour may differ based on the **column's data type**.
Code:
``` python=
type(None)
```
> **Output**
NoneType
Code:
``` python=
type(np.nan)
```
> **Output**
float
1. **`None` in non-numeric columns**: `None` can be used directly, and it will appear as `None`.
2. **`None` in numeric columns**: Pandas automatically converts `None` to `NaN`.
3. **`NaN` in numeric columns**: `NaN` is used to represent missing values and appears as `NaN`.
4. **`NaN` in non-numeric columns**: `NaN` can be used, and it appears as `NaN`.
Code:
``` python=
pd.Series([1, np.nan, 2, None])
```
> **Output**
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
For **numerical** type, Pandas changes `None` to `NaN`.
Code:
``` python=
pd.Series(["1", "np.nan", "2", None])
```
> **Output**
0 1
1 np.nan
2 2
3 None
dtype: object
Code:
``` python=
pd.Series(["1", "np.nan", "2", np.nan])
```
> **Output**
0 1
1 np.nan
2 2
3 NaN
dtype: object
For **object** type, the `None` is preserved and not changed to `NaN`.
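Note one more quirk worth knowing: `NaN` is **not equal to itself**, so equality checks cannot be used to detect missing values. A quick sketch:

Code:
``` python=
print(np.nan == np.nan)                # False, NaN never equals anything, even itself
print(None == None)                    # True, None is a regular Python singleton
print(pd.isna(np.nan), pd.isna(None))  # True True, isna() detects both
```
This is why Pandas provides dedicated methods for detecting missing values, which we cover next.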
---
title: isna() & isnull()
description:
duration: 1200
card_type: cue_card
---
### `isna()` & `isnull()`
**How to get the count of missing values for each row/column?**
- `df.isna()`
- `df.isnull()`
Code:
``` python=
data.isna().head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/292/original/d1.png?1708870500" width=700 height=135>
\
Code:
``` python=
data.isnull().head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/292/original/d1.png?1708870500" width=700 height=135>
\
Notice that both `isna()` and `isnull()` give the same results.
**But why do we have two methods, `isna()` and `isnull()` for the same operation?**
- `isnull()` is just an alias for `isna()`
Code:
``` python=
pd.isnull
```
> **Output**
pandas.core.dtypes.missing.isna
def isna(obj: object) -> bool | npt.NDArray[np.bool_] | NDFrame
Code:
``` python=
pd.isna
```
> **Output**
pandas.core.dtypes.missing.isna
def isna(obj: object) -> bool | npt.NDArray[np.bool_] | NDFrame
As we can see, the function signature is the same for both.
- `isna()` returns a **boolean dataframe**, with each cell as a boolean value.
- This value corresponds to **whether the cell has a missing value**.
- On top of this, we can use `.sum()` to find the count of the missing values.
Code:
``` python=
data.isna().sum()
```
> **Output**
Date 0
Drug_Name 0
Parameter 0
1:30:00 2
2:30:00 2
3:30:00 6
4:30:00 4
5:30:00 2
6:30:00 0
7:30:00 2
8:30:00 4
9:30:00 2
10:30:00 0
11:30:00 2
12:30:00 0
dtype: int64
This gives us the total number of missing values in each column.
**How can we get the number of missing values in each row?**
Code:
``` python=
data.isna().sum(axis=1)
```
> **Output**
0 1
1 1
2 4
3 4
4 3
5 3
6 1
7 1
8 1
9 1
10 2
11 2
12 1
13 1
14 0
15 0
16 0
17 0
dtype: int64
**Note:** The default for `sum()` is `axis=0`, i.e., column-wise.
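As a quick aside, chaining `.sum()` twice gives the grand total of missing values in the whole DataFrame:

Code:
``` python=
data.isna().sum().sum()  # column-wise counts, summed again -> 26 for this dataset
```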
**We now have identified the null count, but how do we deal with them?**
We have two options:
- Delete the rows/columns containing the null values.
- Fill the missing values with some data/estimate.
Let's first look at deleting the rows.
---
title: Removing null values
description:
duration: 1200
card_type: cue_card
---
### Removing null values
**How can we drop rows containing null values?**
Code:
``` python=
data.dropna()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/294/original/e1.png?1708870927" width=700 height=200>
\
Notice that rows with even a single missing value have been deleted.
**What if we want to delete the columns having missing values?**
Code:
``` python=
data.dropna(axis=1)
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/296/original/f1.png?1708871063" width=600 height=600>
\
Notice that every column which had even a single missing value has been deleted.
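Deleting every affected row or column is often too aggressive. `dropna()` also takes a few parameters to control this; a quick sketch:

Code:
``` python=
data.dropna(how='all')           # drop a row only if ALL of its values are missing
data.dropna(thresh=12)           # keep only rows with at least 12 non-null values
data.dropna(subset=['2:30:00'])  # drop rows with a null in this column only
```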
**But what are the problems with deleting rows/columns?**
- Loss of valuable data
So instead of dropping, it would be better to **fill the missing values with some data**.
---
title: Break & Doubt Resolution
description:
duration: 600
card_type: cue_card
---
### Break & Doubt Resolution
`Instructor Note:`
* Take this time (up to 5-10 mins) to give a short break to the learners.
* Meanwhile, you can ask them to share their doubts (if any) regarding the topics covered so far.
---
title: Data Imputation
description:
duration: 2400
card_type: cue_card
---
### Data Imputation
**How can we fill the missing values with some data?**
Code:
``` python=
data.fillna(0).head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/299/original/g1.png?1708873423" width=700 height=250>
\
**What is `fillna(0)` doing?**
- It fills all the missing values with 0.
We can do the same on a particular column too.
Code:
``` python=
data['2:30:00'].fillna(0)
```
> **Output**
0 22.0
1 13.0
2 17.0
3 22.0
4 0.0
5 0.0
6 35.0
7 19.0
8 47.0
9 24.0
10 9.0
11 12.0
12 19.0
13 4.0
14 13.0
15 22.0
16 14.0
17 9.0
Name: 2:30:00, dtype: float64
**Note:**
Handling missing values completely depends on the business problem.
However, in general practice (assuming you have a large dataset) -
- if the missing values are minimal (\<5% of rows), dropping them is acceptable.
- for substantial missing values (\>10% of rows), use a suitable imputation technique.
- if a column has over 50% of null values, drop that column (unless it's very crucial for the analysis).
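To apply these thresholds, here is a minimal sketch that computes the percentage of missing values per column (the mean of a boolean column is the fraction of `True` values):

Code:
``` python=
missing_pct = data.isna().mean() * 100  # fraction of nulls per column, as a percentage
missing_pct.sort_values(ascending=False)
```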
**What other values can we use to fill the missing values?**
We can use some kind of estimator too.
- mean (average value)
- median
- mode (most frequently occurring value)
**How would you calculate the mean of the column `2:30:00`?**
Code:
``` python=
data['2:30:00'].mean()
```
> **Output**
18.8125
Now let's fill the `NaN` values with the mean value of the column.
Code:
``` python=
data['2:30:00'].fillna(data['2:30:00'].mean())
```
> **Output**
0 22.0000
1 13.0000
2 17.0000
3 22.0000
4 18.8125
5 18.8125
6 35.0000
7 19.0000
8 47.0000
9 24.0000
10 9.0000
11 12.0000
12 19.0000
13 4.0000
14 13.0000
15 22.0000
16 14.0000
17 9.0000
Name: 2:30:00, dtype: float64
But this doesn't feel right. What could be wrong with this?
**Can we use the mean across all compounds as our estimate?**
- Different drugs have different characteristics.
- We can't simply do an average and fill the null values.
**Then what could be the solution here?**
We could fill the null values of respective compounds with their respective means.
**How can we form a column with the mean temperature of each compound?**
- We can use `apply()`
Let's first create a function to calculate the mean.
Code:
``` python=
def temp_mean(x):
    # x is the sub-DataFrame for one drug; add its group mean as a column
    x['Temperature_avg'] = x['Temperature'].mean()
    return x
```
Now we can form a new column based on the average values of temperature for each drug.
Code:
``` python=
data_tidy = data_tidy.groupby(["Drug_Name"], group_keys=False).apply(temp_mean)
data_tidy
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/316/original/z3.png?1708874958" width=700 height=400>
\
Now we fill the null values in `Temperature` using this new column.
Code:
``` python=
data_tidy['Temperature'].fillna(data_tidy["Temperature_avg"], inplace=True)
data_tidy
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/317/original/z4.png?1708875052" width=700 height=400>
\
Code:
``` python=
data_tidy.isna().sum()
```
> **Output**
Date 0
time 0
Drug_Name 0
Pressure 13
Temperature 0
Temperature_avg 0
dtype: int64
Great!
We have removed the null values from our `Temperature` column.
Let's do the same for `Pressure`.
Code:
``` python=
def pr_mean(x):
    # x is the sub-DataFrame for one drug; add its group mean as a column
    x['Pressure_avg'] = x['Pressure'].mean()
    return x

data_tidy = data_tidy.groupby(["Drug_Name"], group_keys=False).apply(pr_mean)
data_tidy['Pressure'].fillna(data_tidy["Pressure_avg"], inplace=True)
data_tidy
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/318/original/z5.png?1708875175" width=700 height=375>
\
Code:
``` python=
data_tidy.isna().sum()
```
> **Output**
Date 0
time 0
Drug_Name 0
Pressure 0
Temperature 0
Temperature_avg 0
Pressure_avg 0
dtype: int64
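As a side note, the same group-wise imputation can be written more compactly with `groupby().transform()`, with no helper functions needed (a sketch, equivalent in spirit to the `apply()` approach above):

Code:
``` python=
for col in ['Temperature', 'Pressure']:
    group_mean = data_tidy.groupby('Drug_Name')[col].transform('mean')  # per-drug mean, aligned to rows
    data_tidy[col] = data_tidy[col].fillna(group_mean)
```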
**How to decide if we should impute the missing values with `mean`, `median` or `mode`?**
1. `Mean`: Use when dealing with numerical data that is normally distributed and not heavily skewed by outliers.
2. `Median`: Preferable when data is skewed or contains outliers. It's suitable for ordinal or interval data.
3. `Mode`: Suitable for categorical or nominal data where there are distinct categories.
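For reference, a minimal sketch of all three imputations on one column (`mode()` returns a Series of the most frequent values, so we take the first):

Code:
``` python=
col = data['2:30:00']
col.fillna(col.mean())     # mean: symmetric numerical data
col.fillna(col.median())   # median: skewed data / outliers
col.fillna(col.mode()[0])  # mode: categorical data
```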
---
title: Quiz-1
description:
duration: 60
card_type: quiz_card
---
# Question
What would be the value at the 4th index after running the following code snippet?
```python=
sample = pd.Series(['1', '2', '3', np.nan, None])
sample.fillna(0)
```
# Choices
- [x] 0
- [ ] None
- [ ] NaN
- [ ] Error
---
title: Quiz-2
description:
duration: 60
card_type: quiz_card
---
# Question
Based on the given DataFrame, which of the following statements regarding data imputation is most accurate?
```
| CustomerID | TransactionAmount | Gender | Age | ProductCategory |
|----------------|---------------------|----------------|-------|-------------------|
| 101 | 20 | Male | 35 | Apparel |
| 102 | NaN | Female | 28 | NaN |
| 103 | 15 | Female | NaN | Electronics |
| 104 | 30 | NaN | 42 | Electronics |
| 105 | 150 | Male | 30 | Apparel |
```
# Choices
- [ ] Imputing missing values in the "TransactionAmount" column using the mean of the available values may not be suitable due to potential skewness caused by outliers.
- [ ] Imputing missing values in the "TransactionAmount" column using the median of the available values may be suitable to handle skewness due to outliers.
- [ ] The presence of missing values in the "Gender" column can be effectively handled by imputing the most frequent category (mode).
- [x] All of the above
---
title: String methods
description:
duration: 1200
card_type: cue_card
---
#### Quiz-2 Explanation
* Option A is correct because imputing missing values in the "TransactionAmount" column with the mean may not be appropriate if the data contains outliers. Outliers can significantly skew the mean, leading to inaccurate imputations.
* Option B is correct because when the data is skewed, the median, which is robust to outliers, can impute the missing values better.
* Option C is correct because for the categorical "Gender" column, the most frequently occurring category can be used for imputation, as gender is unlikely to exhibit significant variation in a dataset of customer transactions.
### String methods
**What kind of questions can we use string methods for?**
- Find rows which contain a particular string.
Say,
**How can you filter rows containing "hydrochloride" in their drug name?**
Code:
``` python=
data_tidy.loc[data_tidy['Drug_Name'].str.contains('hydrochloride')].head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/313/original/p1.png?1708874201" width=700 height=160></br>
- So in general, we will be using the following format: `Series.str.function()`
- `Series.str` can be used to access the values of the series as strings and apply several methods to it.
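A few other commonly used `.str` methods, as a quick sketch on our `Drug_Name` column:

Code:
``` python=
data_tidy['Drug_Name'].str.lower()            # lowercase every value
data_tidy['Drug_Name'].str.len()              # length of each string
data_tidy['Drug_Name'].str.replace(' ', '_')  # substitute characters
```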
Now suppose we want to form a new column based on the year of the experiments.
**How can we form a column containing the year?**
Code:
``` python=
data_tidy['Date'].str.split('-')
```
> **Output**
0 [15, 10, 2020]
1 [15, 10, 2020]
2 [15, 10, 2020]
3 [15, 10, 2020]
4 [15, 10, 2020]
...
103 [17, 10, 2020]
104 [17, 10, 2020]
105 [17, 10, 2020]
106 [17, 10, 2020]
107 [17, 10, 2020]
Name: Date, Length: 108, dtype: object
To extract the year, we need to select the last element of each list.
Code:
``` python=
data_tidy['Date'].str.split('-').apply(lambda x:x[2])
```
> **Output**
0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Name: Date, Length: 108, dtype: object
But there are certain problems with this approach.
- The **dtype of the output is still `object`**; we would prefer a numeric type.
- The date format will **not always be day-month-year**; it can vary.
Thus, to work with such date-time type of data, we can use a special method from Pandas.
---
title: Datetime
description:
duration: 1200
card_type: cue_card
---
### Datetime
**How can we handle datetime data types?**
- We can use the `to_datetime()` function of Pandas
- It takes as input:
    - An array/scalar of values in a recognizable date/time format
    - `dayfirst`: indicates whether the day comes first in the date format used
    - `yearfirst`: indicates whether the year comes first in the date format used
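A quick illustration of why `dayfirst` matters, using a made-up ambiguous date string:

Code:
``` python=
print(pd.to_datetime('05-10-2020'))                 # month first by default -> 2020-05-10
print(pd.to_datetime('05-10-2020', dayfirst=True))  # day first -> 2020-10-05
```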
Let's first merge our `Date` and `time` columns into a new `timestamp` column.
Code:
``` python=
data_tidy['timestamp'] = data_tidy['Date']+ " "+ data_tidy['time']
data_tidy.drop(['Date', 'time'], axis=1, inplace=True)
data_tidy.head()
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/312/original/q1.png?1708874145" width=700 height=160>
\
Now let's convert our `timestamp` column into **datetime**.
Code:
``` python=
data_tidy['timestamp'] = pd.to_datetime(data_tidy['timestamp'])
data_tidy
```
> **Output**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/311/original/s1.png?1708873759" width=700 height=335>
\
Code:
``` python=
data_tidy.info()
```
> **Output**
<class 'pandas.core.frame.DataFrame'>
Int64Index: 108 entries, 0 to 107
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Drug_Name 108 non-null object
1 Pressure 108 non-null float64
2 Temperature 108 non-null float64
3 Temperature_avg 108 non-null float64
4 Pressure_avg 108 non-null float64
5 temp_cat 108 non-null category
6 timestamp 108 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(4), object(1)
memory usage: 10.3+ KB
The type of `timestamp` column has been changed from `object` to `datetime`.
Now, let's look at a single timestamp using Pandas.
**How can we extract information from a single timestamp using Pandas?**
Code:
``` python=
ts = data_tidy['timestamp'][0]
ts
```
> **Output**
Timestamp('2020-10-15 10:30:00')
Code:
``` python=
ts.year, ts.month, ts.day, ts.month_name()
```
> **Output**
(2020, 10, 15, 'October')
Code:
```python=
ts.hour, ts.minute, ts.second
```
> **Output**
(10, 30, 0)
This data parsing from `string` to `datetime` makes it easier to work with such data.
We can use this data from the columns as a whole using the `.dt` accessor.
Code:
``` python=
data_tidy['timestamp'].dt
```
> **Output**
<pandas.core.indexes.accessors.DatetimeProperties object at 0x7c2e78c72b60>
- `dt` gives properties of the values in a column.
- From this `DatetimeProperties` object of the `timestamp` column, we can extract `year`.
Code:
``` python=
data_tidy['timestamp'].dt.year
```
> **Output**
0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Name: timestamp, Length: 108, dtype: int64
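The `.dt` accessor exposes the other components too; a short sketch deriving several parts at once (the new column names are just illustrative):

Code:
``` python=
data_tidy['year'] = data_tidy['timestamp'].dt.year
data_tidy['month'] = data_tidy['timestamp'].dt.month_name()
data_tidy['hour'] = data_tidy['timestamp'].dt.hour
```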
We can use `strftime` (**short for string format time**) to modify our datetime format.
Let's learn this with the help of a few examples.
Code:
``` python=
data_tidy['timestamp'][0]
```
> **Output**
Timestamp('2020-10-15 10:30:00')
Code:
``` python=
print(data_tidy['timestamp'][0].strftime('%Y')) # formatter for year
```
> **Output**
2020
Similarly, we can combine the format types to modify the datetime format as per our convenience.
A comprehensive list of other formats can be found here: https://pandas.pydata.org/docs/reference/api/pandas.Period.strftime.html
Code:
``` python=
data_tidy['timestamp'][0].strftime('%m-%d')
```
> **Output**
'10-15'
---
title: Quiz-3
description:
duration: 60
card_type: quiz_card
---
# Question
Given the following dataset:
```python=
df = pd.DataFrame([[1, '2020-01-01'], [2, '1998-01-12'], [3, '2012-11-05'],
                   [4, '2000-12-03'], [5, '1960-04-23'], [6, '2008-08-15']],
                  columns=["ID", "birth_dates"])
df['birth_dates'] = pd.to_datetime(df['birth_dates'])
```
What would be the output of the following code?
```python=
df.iloc[2]['birth_dates'].year - df.iloc[1]['birth_dates'].year
```
# Choices
- [ ] 8
- [x] 14
- [ ] 22
- [ ] -22
---
title: Writing to a file
description:
duration: 600
card_type: cue_card
---
### Writing to a file
**How can we write our dataframe to a CSV file?**
- We have to provide the `path` and `file_name` in which we want to store the data.
Code:
``` python=
data_tidy.to_csv('pfizer_tidy.csv', sep=",", index=False)
```
Setting `index=False` will not include the index column while writing.
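To sanity-check the written file, we can read it back (a quick sketch; `parse_dates` re-parses the `timestamp` column on load):

Code:
``` python=
check = pd.read_csv('pfizer_tidy.csv', parse_dates=['timestamp'])
check.head()
```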
### Extra-reading material
- [**Coding Exercise (Pandas)**](https://colab.research.google.com/drive/1yn1OGCBJJQJp1sIljkmJwdf0uySIJUhO?usp=sharing)
---
title: Launch a feedback poll
description: To gather valuable feedback regarding pace adjustment
duration: 30
card_type: poll_card
---
# Description
Which of the following best depicts your current level of confidence about the pace and difficulty of the material covered in the last 3 lectures?
# Choices
- Super confident: Feeling super confident and comfortable with pace, ready to conquer the next lesson.
- Somewhat confident: Grasping most concepts and comfortable with content & pace, but a few concepts need brushing up.
- Not so confident: While I understand some concepts, I'm finding the pace a bit too fast at times.
- Feeling a bit lost: I'm finding it difficult to keep up with the pace or grasp certain topics.
- Completely lost: I'm struggling significantly with the pace and difficulty of the material.
---
title: Unlock Assignment & ask learner to solve in live class
description:
duration: 1800
card_type: cue_card
---
* <span style="color:skyblue">Unlock the assignment for learners</span> by clicking the **"question mark"** button on the top bar.
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/078/685/original/Screenshot_2024-06-19_at_7.17.12_PM.png?1718804854" width=200 />
* If you face any difficulties using this feature, please refer to this video on how to unlock assignments.
* <span style="color:red">**Note:** The following video is strictly for instructor reference only. [VIDEO LINK](https://www.loom.com/share/15672134598f4b4c93475beda227fb3d?sid=4fb31191-ae8c-4b18-bf81-468d2ffd9bd4)</span>
### Conducting a Live Assignment Solution Session:
1. Once you unlock the assignments, ask if anyone in the class would like to solve a question live by sharing their screen.
2. Select a learner and grant permission by navigating to <span style="color:skyblue">**Settings > Admin > Unmuted Audience Can Share**, then select **Audio, Video, and Screen**.</span>
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/111/113/original/image.png?1740484517" width=400 />
3. Allow the selected learner to share their screen and guide them through solving the question live.
4. Engage with both the learner sharing the screen and other students in the class to foster an interactive learning experience.
### Practice Coding Question(s)
You can pick the following question and solve it during the lecture itself.
This will help the learners get familiar with the problem-solving process and motivate them to solve the assignments.
<span style="background-color: pink;">Make sure to start the doubt session before you solve this question.</span>
Q. https://www.scaler.com/hire/test/problem/102720/ - Change Date Format