---
title: Importing and preparing data
description:
duration: 300
card_type: cue_card
---
## Importing and preparing data (from previous lecture)
Code
``` python=
import pandas as pd
import numpy as np

!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ

data = pd.read_csv('Pfizer_1.csv')

# Wide -> long: one row per (Date, Drug_Name, Parameter, time) combination
data_melt = pd.melt(data, id_vars=['Date', 'Drug_Name', 'Parameter'],
                    var_name='time',
                    value_name='reading')

# Long -> tidy: one column per Parameter (Pressure and Temperature)
data_tidy = data_melt.pivot(index=['Date', 'time', 'Drug_Name'],
                            columns='Parameter',
                            values='reading')
data_tidy = data_tidy.reset_index()
data_tidy.columns.name = None   # drop the leftover 'Parameter' axis label
```
```
Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
0% 0.00/1.51k [00:00<?, ?B/s]
100% 1.51k/1.51k [00:00<00:00, 6.05MB/s]
```
Code
``` python=
data.head()
```
Date Drug_Name Parameter 1:30:00 2:30:00 \
0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0
1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0
2 15-10-2020 docetaxel injection Temperature NaN 17.0
3 15-10-2020 docetaxel injection Pressure NaN 22.0
4 15-10-2020 ketamine hydrochloride Temperature 24.0 NaN
3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 \
0 NaN 21.0 21.0 22 23.0 21.0 22.0 20
1 NaN 11.0 13.0 14 16.0 16.0 24.0 18
2 18.0 NaN 17.0 18 NaN NaN 23.0 23
3 22.0 NaN 22.0 23 NaN NaN 27.0 26
4 NaN 27.0 NaN 26 25.0 24.0 23.0 22
11:30:00 12:30:00
0 20.0 21
1 19.0 20
2 25.0 25
3 29.0 28
4 21.0 20
Code
``` python=
data_melt.head()
```
Date Drug_Name Parameter time reading
0 15-10-2020 diltiazem hydrochloride Temperature 1:30:00 23.0
1 15-10-2020 diltiazem hydrochloride Pressure 1:30:00 12.0
2 15-10-2020 docetaxel injection Temperature 1:30:00 NaN
3 15-10-2020 docetaxel injection Pressure 1:30:00 NaN
4 15-10-2020 ketamine hydrochloride Temperature 1:30:00 24.0
Code
``` python=
data_tidy.head()
```
Date time Drug_Name Pressure Temperature
0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0
1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0
2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0
3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0
4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0
---
title: Handling Missing Values
description:
duration: 200
card_type: cue_card
---
# Handling Missing Values
If you notice, there are many "NaN" values in our data
Code
``` python=
data.head()
```
Date Drug_Name Parameter 1:30:00 2:30:00 \
0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0
1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0
2 15-10-2020 docetaxel injection Temperature NaN 17.0
3 15-10-2020 docetaxel injection Pressure NaN 22.0
4 15-10-2020 ketamine hydrochloride Temperature 24.0 NaN
3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 \
0 NaN 21.0 21.0 22 23.0 21.0 22.0 20
1 NaN 11.0 13.0 14 16.0 16.0 24.0 18
2 18.0 NaN 17.0 18 NaN NaN 23.0 23
3 22.0 NaN 22.0 23 NaN NaN 27.0 26
4 NaN 27.0 NaN 26 25.0 24.0 23.0 22
11:30:00 12:30:00
0 20.0 21
1 19.0 20
2 25.0 25
3 29.0 28
4 21.0 20
#### What are these "NaN" values?
They are basically **missing values**
#### What are missing values?
A Missing Value signifies an **empty cell/no data**
There can be 2 kinds of missing values:
1. `None`
2. `NaN` (short for Not a Number)
#### What's the difference between `None` and `NaN`?
The difference mainly lies in their datatypes
Code
``` python=
type(None)
```
NoneType
Code
``` python=
type(np.nan)
```
float
**`None`** is used for missing values in columns with **non-numeric
entries**
- e.g., strings
**`NaN`** occurs in columns with **numeric entries**
Note:
Pandas uses these values nearly **interchangeably**, converting between
them where appropriate, based on column datatype
Code
``` python=
pd.Series([1, np.nan, 2, None])
```
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
For **numerical** types, Pandas converts **`None` to `NaN`**
Code
``` python=
pd.Series(["1", "np.nan", "2", None])  # note: "np.nan" in quotes is just an ordinary string, not a missing value
```
0 1
1 np.nan
2 2
3 None
dtype: object
Code
``` python=
pd.Series(["1", "np.nan", "2", np.nan])
```
0 1
1 np.nan
2 2
3 NaN
dtype: object
For the **object** type, **`None` is preserved** and not changed to `NaN`
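As a brief aside, this is also why we need dedicated checks like `isna()` (covered next) rather than `==` comparisons: `NaN` never compares equal to anything, not even itself. A quick sketch, assuming the imports from the top of this notebook:
``` python=
print(np.nan == np.nan)   # False: NaN is unequal to everything, including itself
print(pd.isna(np.nan))    # True
print(pd.isna(None))      # True: pd.isna() detects both kinds of missing value
```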
---
title: Checking missing values using isna and isnull
description:
duration: 200
card_type: cue_card
---
#### How to know the count of missing values for each row/column?
Code
``` python=
data.isna().head()
```
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 \
0 False False False False False True False False
1 False False False False False True False False
2 False False False True False False True False
3 False False False True False False True False
4 False False False False True True False True
6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 11:30:00 12:30:00
0 False False False False False False False
1 False False False False False False False
2 False True True False False False False
3 False True True False False False False
4 False False False False False False False
We can also use `isnull()` to get the same results
Code
``` python=
data.isnull().head()
```
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 \
0 False False False False False True False False
1 False False False False False True False False
2 False False False True False False True False
3 False False False True False False True False
4 False False False False True True False True
6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 11:30:00 12:30:00
0 False False False False False False False
1 False False False False False False False
2 False True True False False False False
3 False True True False False False False
4 False False False False False False False
#### But, why do we have two methods, "isna" and "isnull" for the same operation?
`isnull()` is just an alias for `isna()`
Code
``` python=
pd.isnull
```
<function pandas.core.dtypes.missing.isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'>
Code
``` python=
pd.isna
```
<function pandas.core.dtypes.missing.isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'>
As we can see, the function signature is the same for both
`isna()` returns a **boolean dataframe**, with each cell as a boolean
value
This value corresponds to **whether the cell has a missing value**
On top of this, we can use `.sum()` to find the count
Code
``` python=
data.isna().sum()
```
Date 0
Drug_Name 0
Parameter 0
1:30:00 2
2:30:00 2
3:30:00 6
4:30:00 4
5:30:00 2
6:30:00 0
7:30:00 2
8:30:00 4
9:30:00 2
10:30:00 0
11:30:00 2
12:30:00 0
dtype: int64
This gives us the total number of missing values in each column
#### Can we also get the number of missing values in each row?
Code
``` python=
data.isna().sum(axis=1)
```
0 1
1 1
2 4
3 4
4 3
5 3
6 1
7 1
8 1
9 1
10 2
11 2
12 1
13 1
14 0
15 0
16 0
17 0
dtype: int64
Note:
By default, `sum()` uses `axis=0`, i.e., it sums down each column
#### We have identified the null count, but how do we deal with them?
We have two options:
- delete the rows/columns containing the null values
- fill the missing values with some data/estimate
Let's first look at deleting the rows
---
title: Removing null values
description:
duration: 200
card_type: cue_card
---
#### How can we drop rows containing null values?
Code
``` python=
data.dropna()
```
Date Drug_Name Parameter 1:30:00 2:30:00 \
14 17-10-2020 docetaxel injection Temperature 12.0 13.0
15 17-10-2020 docetaxel injection Pressure 20.0 22.0
16 17-10-2020 ketamine hydrochloride Temperature 13.0 14.0
17 17-10-2020 ketamine hydrochloride Pressure 8.0 9.0
3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 \
14 14.0 15.0 16.0 17 18.0 19.0 20.0 21
15 22.0 22.0 22.0 23 25.0 26.0 27.0 28
16 15.0 16.0 17.0 18 19.0 20.0 21.0 22
17 10.0 11.0 11.0 12 12.0 11.0 12.0 13
11:30:00 12:30:00
14 22.0 23
15 29.0 28
16 23.0 24
17 14.0 15
Rows with **even a single missing value** have been deleted
#### What if we want to delete the columns having missing values?
Code
``` python=
data.dropna(axis=1)
```
Date Drug_Name Parameter 6:30:00 10:30:00 \
0 15-10-2020 diltiazem hydrochloride Temperature 22 20
1 15-10-2020 diltiazem hydrochloride Pressure 14 18
2 15-10-2020 docetaxel injection Temperature 18 23
3 15-10-2020 docetaxel injection Pressure 23 26
4 15-10-2020 ketamine hydrochloride Temperature 26 22
5 15-10-2020 ketamine hydrochloride Pressure 9 9
6 16-10-2020 diltiazem hydrochloride Temperature 38 40
7 16-10-2020 diltiazem hydrochloride Pressure 23 24
8 16-10-2020 docetaxel injection Temperature 49 56
9 16-10-2020 docetaxel injection Pressure 27 28
10 16-10-2020 ketamine hydrochloride Temperature 12 13
11 16-10-2020 ketamine hydrochloride Pressure 15 16
12 17-10-2020 diltiazem hydrochloride Temperature 16 14
13 17-10-2020 diltiazem hydrochloride Pressure 8 11
14 17-10-2020 docetaxel injection Temperature 17 21
15 17-10-2020 docetaxel injection Pressure 23 28
16 17-10-2020 ketamine hydrochloride Temperature 18 22
17 17-10-2020 ketamine hydrochloride Pressure 12 13
12:30:00
0 21
1 20
2 25
3 28
4 20
5 11
6 42
7 27
8 58
9 30
10 15
11 18
12 10
13 14
14 23
15 28
16 24
17 15
Every column that had even a single missing value has been deleted
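Dropping also doesn't have to be all-or-nothing. As a minimal sketch, `dropna()` accepts a `subset` parameter (only consider certain columns) and a `thresh` parameter (keep rows having at least that many non-null values); the column name and threshold below are illustrative:
``` python=
data.dropna(subset=['2:30:00'])   # drop a row only if its '2:30:00' reading is missing
data.dropna(thresh=12)            # keep rows that have at least 12 non-null values
```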
#### But what are the problems with deleting rows/columns?
One of the major problems:
- loss of data
Instead of dropping, it would be better to **fill the missing values
with some data**
---
title: Replacing null values using fillna
description:
duration: 200
card_type: cue_card
---
#### How can we fill the missing values with some data?
Code
``` python=
data.fillna(0).head()
```
Date Drug_Name Parameter 1:30:00 2:30:00 \
0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0
1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0
2 15-10-2020 docetaxel injection Temperature 0.0 17.0
3 15-10-2020 docetaxel injection Pressure 0.0 22.0
4 15-10-2020 ketamine hydrochloride Temperature 24.0 0.0
3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 \
0 0.0 21.0 21.0 22 23.0 21.0 22.0 20
1 0.0 11.0 13.0 14 16.0 16.0 24.0 18
2 18.0 0.0 17.0 18 0.0 0.0 23.0 23
3 22.0 0.0 22.0 23 0.0 0.0 27.0 26
4 0.0 27.0 0.0 26 25.0 24.0 23.0 22
11:30:00 12:30:00
0 20.0 21
1 19.0 20
2 25.0 25
3 29.0 28
4 21.0 20
**What is fillna(0) doing?**
It fills all missing values with 0
We can do the same on a particular column too
Code
``` python=
data['2:30:00'].fillna(0)
```
0 22.0
1 13.0
2 17.0
3 22.0
4 0.0
5 0.0
6 35.0
7 19.0
8 47.0
9 24.0
10 9.0
11 12.0
12 19.0
13 4.0
14 13.0
15 22.0
16 14.0
17 9.0
Name: 2:30:00, dtype: float64
#### What other values can we use to fill the missing values?
We can also use some **kind of estimator**
- An estimator like the **mean or median**
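Besides constants and estimators, pandas also offers `ffill()`/`bfill()`, which copy the previous/next non-null value; a quick sketch (note that in this wide layout adjacent rows belong to different drugs, so these fills are shown only to illustrate the API):
``` python=
data['2:30:00'].ffill()   # copy the previous non-null value downward
data['2:30:00'].bfill()   # copy the next non-null value upward
```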
#### How would you calculate the mean of the column `2:30:00`?
Code
``` python=
data['2:30:00'].mean()
```
18.8125
Now let's fill the NaN values with the mean value of the column
Code
``` python=
data['2:30:00'].fillna(data['2:30:00'].mean())
```
0 22.0000
1 13.0000
2 17.0000
3 22.0000
4 18.8125
5 18.8125
6 35.0000
7 19.0000
8 47.0000
9 24.0000
10 9.0000
11 12.0000
12 19.0000
13 4.0000
14 13.0000
15 22.0000
16 14.0000
17 9.0000
Name: 2:30:00, dtype: float64
But this doesn't feel right. What could be wrong with this?
#### Can we use the mean across all compounds as our estimate?
- **Different drugs** have **different characteristics**
- We can't simply take one overall average and fill the null values with it
**Then what could be a solution here?**
We could fill the null values of **each compound with that compound's
own mean**
#### How can we form a column with the mean temperature of each compound?
We can use the `apply` method that we learnt earlier
Let's first create a function to calculate the mean
Code
``` python=
def temp_mean(x):
x['Temperature_avg'] = x['Temperature'].mean() # We will name the new col Temperature_avg
return x
```
Now we can form a new column based on the average values of temperature
for each drug
Code
``` python=
data_tidy=data_tidy.groupby(["Drug_Name"], group_keys=False).apply(temp_mean)
data_tidy
```
Date time Drug_Name Pressure Temperature \
0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0
1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0
2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0
3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0
4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0
.. ... ... ... ... ...
103 17-10-2020 8:30:00 docetaxel injection 26.0 19.0
104 17-10-2020 8:30:00 ketamine hydrochloride 11.0 20.0
105 17-10-2020 9:30:00 diltiazem hydrochloride 9.0 13.0
106 17-10-2020 9:30:00 docetaxel injection 27.0 20.0
107 17-10-2020 9:30:00 ketamine hydrochloride 12.0 21.0
Temperature_avg
0 24.848485
1 30.387097
2 17.709677
3 24.848485
4 30.387097
.. ...
103 30.387097
104 17.709677
105 24.848485
106 30.387097
107 17.709677
[108 rows x 6 columns]
Now we fill the null values in Temperature using this new column!
Code
``` python=
data_tidy['Temperature'].fillna(data_tidy["Temperature_avg"], inplace=True)
data_tidy
```
Date time Drug_Name Pressure Temperature \
0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0
1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0
2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0
3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0
4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0
.. ... ... ... ... ...
103 17-10-2020 8:30:00 docetaxel injection 26.0 19.0
104 17-10-2020 8:30:00 ketamine hydrochloride 11.0 20.0
105 17-10-2020 9:30:00 diltiazem hydrochloride 9.0 13.0
106 17-10-2020 9:30:00 docetaxel injection 27.0 20.0
107 17-10-2020 9:30:00 ketamine hydrochloride 12.0 21.0
Temperature_avg
0 24.848485
1 30.387097
2 17.709677
3 24.848485
4 30.387097
.. ...
103 30.387097
104 17.709677
105 24.848485
106 30.387097
107 17.709677
[108 rows x 6 columns]
Code
``` python=
data_tidy.isna().sum()
```
Date 0
time 0
Drug_Name 0
Pressure 13
Temperature 0
Temperature_avg 0
dtype: int64
Great!!
We have removed the null values of our Temperature column
Let's do the same for Pressure
Code
``` python=
def pr_mean(x):
x['Pressure_avg'] = x['Pressure'].mean()
return x
data_tidy=data_tidy.groupby(["Drug_Name"]).apply(pr_mean)
data_tidy['Pressure'].fillna(data_tidy["Pressure_avg"], inplace=True)
data_tidy
```
<ipython-input-27-df55c441df36>:4: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use
>>> .groupby(..., group_keys=False)
To adopt the future behavior and silence this warning, use
>>> .groupby(..., group_keys=True)
data_tidy=data_tidy.groupby(["Drug_Name"]).apply(pr_mean)
Date time Drug_Name Pressure Temperature \
0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0
1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0
2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0
3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0
4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0
.. ... ... ... ... ...
103 17-10-2020 8:30:00 docetaxel injection 26.0 19.0
104 17-10-2020 8:30:00 ketamine hydrochloride 11.0 20.0
105 17-10-2020 9:30:00 diltiazem hydrochloride 9.0 13.0
106 17-10-2020 9:30:00 docetaxel injection 27.0 20.0
107 17-10-2020 9:30:00 ketamine hydrochloride 12.0 21.0
Temperature_avg Pressure_avg
0 24.848485 15.424242
1 30.387097 25.483871
2 17.709677 11.935484
3 24.848485 15.424242
4 30.387097 25.483871
.. ... ...
103 30.387097 25.483871
104 17.709677 11.935484
105 24.848485 15.424242
106 30.387097 25.483871
107 17.709677 11.935484
[108 rows x 7 columns]
Code
``` python=
data_tidy.isna().sum()
```
Date 0
time 0
Drug_Name 0
Pressure 0
Temperature 0
Temperature_avg 0
Pressure_avg 0
dtype: int64
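As an aside, the whole groupby-apply-fillna sequence can be written more compactly with `groupby(...).transform`, which returns a Series aligned with the original index; a minimal sketch of the equivalent Temperature fill:
``` python=
drug_means = data_tidy.groupby('Drug_Name')['Temperature'].transform('mean')
data_tidy['Temperature'] = data_tidy['Temperature'].fillna(drug_means)
```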
---
title: Quiz-1
description: Quiz-1
duration: 60
card_type: quiz_card
---
# Question
What would be the value at the 4th index after running the following code snippet?
```python=
sample = pd.Series(['1', '2', '3', np.NaN, None])
sample.fillna(0)
```
# Choices
- [x] 0
- [ ] None
- [ ] NaN
- [ ] Error
---
title: Pandas Cut
description:
duration: 200
card_type: cue_card
---
### Explanation of Quiz-1:
Pandas interprets `None` as a missing value, so `fillna(0)` replaces it
with 0, just like `NaN`.
# Pandas Cut
Sometimes, we would want our data to be in a **categorical format instead
of continuous values**.
Let's say, instead of knowing the specific test readings, we just want to
know their level. Depending on the granularity we want, the levels could
be: Low, Medium, High, Very High
We could define more (or fewer) categories
> But how can bucketisation of continuous data help?
- We can get the count of each category
- We can get an idea of which bin (range of values) most of the
temperature readings lie in
Let's try to use this on our Temperature column to categorise the data
into bins
But to define the categories, let's first check the min and max
temperature values
Code
``` python=
data_tidy
```
Date time Drug_Name Pressure Temperature \
0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0
1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0
2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0
3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0
4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0
.. ... ... ... ... ...
103 17-10-2020 8:30:00 docetaxel injection 26.0 19.0
104 17-10-2020 8:30:00 ketamine hydrochloride 11.0 20.0
105 17-10-2020 9:30:00 diltiazem hydrochloride 9.0 13.0
106 17-10-2020 9:30:00 docetaxel injection 27.0 20.0
107 17-10-2020 9:30:00 ketamine hydrochloride 12.0 21.0
Temperature_avg Pressure_avg
0 24.848485 15.424242
1 30.387097 25.483871
2 17.709677 11.935484
3 24.848485 15.424242
4 30.387097 25.483871
.. ... ...
103 30.387097 25.483871
104 17.709677 11.935484
105 24.848485 15.424242
106 30.387097 25.483871
107 17.709677 11.935484
[108 rows x 7 columns]
Code
``` python=
print(data_tidy['Temperature'].min(), data_tidy['Temperature'].max())
```
8.0 58.0
The min value is 8 and the max value is 58.
- Let's keep some buffer for future values and take the range 5-60
(instead of 8-58)
- Let's divide this range into 4 bins, each 10-15 units wide
Code
``` python=
temp_points = [5, 20, 35, 50, 60]
temp_labels = ['low','medium','high','very_high'] # Here labels define the severity of the resultant output of the test
data_tidy['temp_cat'] = pd.cut(data_tidy['Temperature'], bins=temp_points, labels=temp_labels)
data_tidy.head()
```
Date time Drug_Name Pressure Temperature \
0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0
1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0
2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0
3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0
4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0
Temperature_avg Pressure_avg temp_cat
0 24.848485 15.424242 low
1 30.387097 25.483871 medium
2 17.709677 11.935484 medium
3 24.848485 15.424242 low
4 30.387097 25.483871 medium
Code
``` python=
data_tidy['temp_cat'].value_counts()
```
low 50
medium 38
high 15
very_high 5
Name: temp_cat, dtype: int64
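Note that `pd.cut` builds right-closed intervals by default: (5, 20], (20, 35], (35, 50], (50, 60]. So a reading of exactly 20 lands in `low`, while 20.5 lands in `medium`; a quick sketch reusing the bins defined above:
``` python=
pd.cut([20.0, 20.5], bins=temp_points, labels=temp_labels)
# ['low', 'medium']
```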
---
title: Quiz-2
description: Quiz-2
duration: 60
card_type: quiz_card
---
# Question
What happens when we use `pd.cut()` on an already categorical column, like `temp_cat`, to try to categorise it even further?
# Choices
- [x] Error
- [ ] It further bins the values
- [ ] No change
---
title: String function and motivation for datetime
description:
duration: 200
card_type: cue_card
---
### Explanation of Quiz-2
Since the categorical column's labels are strings (object dtype), the
numeric bin edges cannot be compared with string values, so `pd.cut()`
raises an error
# String function and motivation for datetime
#### What kind of questions can we use string methods for?
Finding rows which contain a particular string
Say,
#### How can you filter rows containing "hydrochloride" in their drug
name?
Code
``` python=
data_tidy.loc[data_tidy['Drug_Name'].str.contains('hydrochloride')].head()
```
Date time Drug_Name Pressure Temperature \
0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0
2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0
3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0
5 15-10-2020 11:30:00 ketamine hydrochloride 9.0 21.0
6 15-10-2020 12:30:00 diltiazem hydrochloride 20.0 21.0
Temperature_avg Pressure_avg temp_cat
0 24.848485 15.424242 low
2 17.709677 11.935484 medium
3 24.848485 15.424242 low
5 17.709677 11.935484 medium
6 24.848485 15.424242 medium
So in general, we will be using the following format:
> Series.str.function()
Series.str can be used to **access the values of the series as strings**
and apply several methods to it.
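Other vectorised string methods follow the same `Series.str.function()` pattern; a few illustrative examples using the standard `.str` accessor:
``` python=
data_tidy['Drug_Name'].str.upper()              # 'DILTIAZEM HYDROCHLORIDE', ...
data_tidy['Drug_Name'].str.len()                # length of each drug name
data_tidy['Drug_Name'].str.startswith('keta')   # boolean mask, like contains
```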
Now suppose we want to form a new column based on the year of the
experiments.
#### How can we form a column containing the year?
Code
``` python=
data_tidy['Date'].str.split('-')
```
0 [15, 10, 2020]
1 [15, 10, 2020]
2 [15, 10, 2020]
3 [15, 10, 2020]
4 [15, 10, 2020]
...
103 [17, 10, 2020]
104 [17, 10, 2020]
105 [17, 10, 2020]
106 [17, 10, 2020]
107 [17, 10, 2020]
Name: Date, Length: 108, dtype: object
To extract the year we need to select the last element of each list
Code
``` python=
data_tidy['Date'].str.split('-').apply(lambda x:x[2])
```
0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Name: Date, Length: 108, dtype: object
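As a side note, the same selection can stay inside the `.str` accessor, which also supports indexing into the split lists; a one-line sketch:
``` python=
data_tidy['Date'].str.split('-').str[-1]   # last piece of each split, i.e. the year (still object dtype)
```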
Either way, there are certain problems with this approach:
- The **dtype of the output is still object**; we would prefer a
number type
- The date format will **not always be day-month-year**; it can
vary
Thus, to work with such date-time data, pandas provides a special
function
---
title: Datetime
description:
duration: 200
card_type: cue_card
---
# Datetime
#### How can we handle date-time data types?
- We can do this using the `to_datetime()` function of pandas
- It takes as input:
    - Array/scalars with values in a proper date/time format
    - `dayfirst`: indicates whether the day comes first in the date
format used (see the sketch below)
    - `yearfirst`: indicates whether the year comes first in the date format
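For instance, `dayfirst` matters for ambiguous dates like '05-10-2020' (5 October vs 10 May); a small sketch:
``` python=
pd.to_datetime('05-10-2020 10:30:00', dayfirst=True)
# Timestamp('2020-10-05 10:30:00')
```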
Let's first merge our `Date` and `time` columns into a new timestamp
column
Code
``` python=
data_tidy['timestamp'] = data_tidy['Date']+ " "+ data_tidy['time']
```
Code
``` python=
data_tidy.drop(['Date', 'time'], axis=1, inplace=True)
```
Code
``` python=
data_tidy.head()
```
Drug_Name Pressure Temperature Temperature_avg \
0 diltiazem hydrochloride 18.0 20.0 24.848485
1 docetaxel injection 26.0 23.0 30.387097
2 ketamine hydrochloride 9.0 22.0 17.709677
3 diltiazem hydrochloride 19.0 20.0 24.848485
4 docetaxel injection 29.0 25.0 30.387097
Pressure_avg temp_cat timestamp
0 15.424242 low 15-10-2020 10:30:00
1 25.483871 medium 15-10-2020 10:30:00
2 11.935484 medium 15-10-2020 10:30:00
3 15.424242 low 15-10-2020 11:30:00
4 25.483871 medium 15-10-2020 11:30:00
Let's convert our `timestamp` column now
Code
``` python=
data_tidy['timestamp'] = pd.to_datetime(data_tidy['timestamp'])  # exercise: try specifying the datetime format yourself via the `format` parameter
data_tidy
```
Drug_Name Pressure Temperature Temperature_avg \
0 diltiazem hydrochloride 18.0 20.0 24.848485
1 docetaxel injection 26.0 23.0 30.387097
2 ketamine hydrochloride 9.0 22.0 17.709677
3 diltiazem hydrochloride 19.0 20.0 24.848485
4 docetaxel injection 29.0 25.0 30.387097
.. ... ... ... ...
103 docetaxel injection 26.0 19.0 30.387097
104 ketamine hydrochloride 11.0 20.0 17.709677
105 diltiazem hydrochloride 9.0 13.0 24.848485
106 docetaxel injection 27.0 20.0 30.387097
107 ketamine hydrochloride 12.0 21.0 17.709677
Pressure_avg temp_cat timestamp
0 15.424242 low 2020-10-15 10:30:00
1 25.483871 medium 2020-10-15 10:30:00
2 11.935484 medium 2020-10-15 10:30:00
3 15.424242 low 2020-10-15 11:30:00
4 25.483871 medium 2020-10-15 11:30:00
.. ... ... ...
103 25.483871 low 2020-10-17 08:30:00
104 11.935484 low 2020-10-17 08:30:00
105 15.424242 low 2020-10-17 09:30:00
106 25.483871 low 2020-10-17 09:30:00
107 11.935484 medium 2020-10-17 09:30:00
[108 rows x 7 columns]
Code
``` python=
data_tidy.info()
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 108 entries, 0 to 107
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Drug_Name 108 non-null object
1 Pressure 108 non-null float64
2 Temperature 108 non-null float64
3 Temperature_avg 108 non-null float64
4 Pressure_avg 108 non-null float64
5 temp_cat 108 non-null category
6 timestamp 108 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(4), object(1)
memory usage: 10.3+ KB
The **type of `timestamp` column** has been **changed to `datetime`**
from `object`
Now, let's look at a single timestamp using Pandas
#### How can we **extract information** from a single **timestamp** using Pandas?
Code
``` python=
ts = data_tidy['timestamp'][0]
ts
```
Timestamp('2020-10-15 10:30:00')
#### Extracting individual information from date
Code
``` python=
ts.year, ts.month, ts.day, ts.month_name()
```
(2020, 10, 15, 'October')
... and so on
We can similarly extract minutes and seconds
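For instance, a quick sketch on the same timestamp:
``` python=
ts.hour, ts.minute, ts.second, ts.day_name()
# (10, 30, 0, 'Thursday')
```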
#### Parsing the strings into date-time objects makes the data easier to work with
We can work on a whole column at once using the `.dt` accessor
Code
``` python=
data_tidy['timestamp'].dt
```
<pandas.core.indexes.accessors.DatetimeProperties object at 0x7c2e78c72b60>
- **`dt` gives properties of the values in a column**
- From the **`DatetimeProperties` of the `timestamp` column**, we can
**extract the `year`**
Code
``` python=
data_tidy['timestamp'].dt.year
```
0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Name: timestamp, Length: 108, dtype: int64
We can use `strftime`, short for string-format time, to modify our
datetime format
Let's learn this with the help of a few examples
Code
``` python=
data_tidy['timestamp'][0]
```
Timestamp('2020-10-15 10:30:00')
Code
``` python=
print(data_tidy['timestamp'][0].strftime('%Y')) # Formatter for year
```
2020
A comprehensive list of other formats can be found here:
<https://pandas.pydata.org/docs/reference/api/pandas.Period.strftime.html>
Similarly, we can combine the format types to modify the date-time format
as per our convenience
Code
``` python=
data_tidy['timestamp'][0].strftime('%m-%d')
```
```
'10-15'
```
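The same formatting also works on the whole column through the `.dt` accessor; a brief sketch:
``` python=
data_tidy['timestamp'].dt.strftime('%d %b %Y, %H:%M')   # e.g. '15 Oct 2020, 10:30'
```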
---
title: Quiz-3
description: Quiz-3
duration: 60
card_type: quiz_card
---
# Question
Given the following dataset:
```python=
df = pd.DataFrame([[1, '2020-01-01'], [2, '1998-01-12'], [3, '2012-11-05'],
[4, '2000-12-03'], [5, '1960-04-23'], [6, '2008-08-15']],
columns=["ID", "birth_dates"])
df['birth_dates'] = pd.to_datetime(df['birth_dates'])
```
What would be the output of the following code?
```python=
df.iloc[2]['birth_dates'].year - df.iloc[1]['birth_dates'].year
```
# Choices
- [ ] 8
- [x] 14
- [ ] 22
- [ ] -22
---
title: Writing to file
description:
duration: 200
card_type: cue_card
---
# Writing to file
#### How can we write our dataframe to a csv file?
- We have to **provide the path and file name** in which we want to
store the data
Code
``` python=
data_tidy.to_csv('pfizer_tidy.csv', sep=",", index=False)
# setting index=False will not include the index column while writing
```
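To sanity-check the export, we can read the file straight back. Note that CSV does not store dtypes, so the timestamp column comes back as plain strings unless we re-parse it; `parse_dates` is the standard `read_csv` parameter for that:
``` python=
pd.read_csv('pfizer_tidy.csv', parse_dates=['timestamp']).head()
```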