Pandas - 4 (1 hour and 30 minutes)

# Pandas 4

---
title: Agenda
description:
duration: 300
card_type: cue_card
---

### Content

- Multi-indexing
- Melting
    - `pd.melt()`
- Pivoting
    - `pd.pivot()`
    - `pd.pivot_table()`
- Binning
    - `pd.cut()`

---
title: Multi-Indexing
description:
duration: 2400
card_type: cue_card
---

Code:
``` python=
# !pip install --upgrade gdown
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
```

> **Output**

```
Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 25.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 64.8MB/s]
```

Code:
```python=
import pandas as pd
import numpy as np

movies = pd.read_csv('movies.csv', index_col=0)
directors = pd.read_csv('directors.csv', index_col=0)

data = movies.merge(directors, how='left', left_on='director_id', right_on='id')
data.drop(['director_id','id_y'], axis=1, inplace=True)
```

### Multi-Indexing

**Which director, according to you, should be considered the most productive?**

- Should we decide based on the **number of movies** directed?
- Or take the **quality of the movies** into consideration as well?
- Or maybe look at the **amount of business** the movies are doing?

To simplify, let's calculate who has directed the maximum number of movies.

Code:
``` python=
data.groupby(['director_name'])['title'].count().sort_values(ascending=False)
```

> **Output**

```
director_name
Steven Spielberg    26
Clint Eastwood      19
Martin Scorsese     19
Woody Allen         18
Robert Rodriguez    16
                    ..
Paul Weitz           5
John Madden          5
Paul Verhoeven       5
John Whitesell       5
Kevin Reynolds       5
Name: title, Length: 199, dtype: int64
```

`Steven Spielberg` has directed the maximum number of movies.

**But does it make `Steven` the most productive director?**

- Chances are, he might have been active for more years than the other directors.
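The counting step above can be tried on a table small enough to check by hand. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from `movies.csv`.

```python
import pandas as pd

# Hypothetical toy data, not from movies.csv
toy = pd.DataFrame({
    "director_name": ["A", "A", "B", "A", "C", "B"],
    "title": ["m1", "m2", "m3", "m4", "m5", "m6"],
})

# Count titles per director, then put the largest counts first
counts = (
    toy.groupby("director_name")["title"]
    .count()
    .sort_values(ascending=False)
)
print(counts)
```

The result is a Series whose index holds the director names, which is why the names show up as row labels rather than as a column.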
**How can we calculate the active years for every director?**

- We can subtract the `min` year from the `max` year.

Code:
``` python=
data_agg = data.groupby(['director_name'])[["year", "title"]].aggregate({"year":['min','max'], "title": "count"})
data_agg
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/235/original/k.png?1708848751" width=350 height=450>

\
Notice,
- The `director_name` column has turned into **row labels**.
- There are multiple levels for the column names.

This is called a **Multi-index DataFrame**.

- It can have **multiple index levels along a dimension**.
- The number of dimensions remains the same, though.
- Multi-level indexes are **possible both for rows and columns**.

Code:
``` python=
data_agg.columns
```

> **Output**

```
MultiIndex([( 'year',   'min'),
            ( 'year',   'max'),
            ('title', 'count')],
           )
```

The top-level (level 0) column names are `year` and `title`.

**What would happen if we print the column `year` of this multi-index dataframe?**

Code:
``` python=
data_agg["year"]
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/236/original/l.png?1708848778" width=325 height=450>

\
**How can we convert multi-level back to only one level of columns?**

- e.g. `year_min`, `year_max`, `title_count`

Code:
``` python=
data_agg = data.groupby(['director_name'])[["year","title"]].aggregate(
    {"year":['min', 'max'], "title": "count"})

data_agg.columns = ['_'.join(col) for col in data_agg.columns]
data_agg
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/237/original/m.png?1708848795" width=500 height=450>

\
Since the column labels were tuples, we can just join their parts.
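To make the tuple-joining step concrete, here is a minimal sketch on a made-up three-row table. The `toy` frame and its values are hypothetical; only the pattern mirrors the code above.

```python
import pandas as pd

# Hypothetical toy data, not from the movies dataset
toy = pd.DataFrame({
    "director_name": ["A", "A", "B"],
    "year": [2001, 2005, 2010],
    "title": ["m1", "m2", "m3"],
})

agg = toy.groupby("director_name")[["year", "title"]].aggregate(
    {"year": ["min", "max"], "title": "count"}
)

# Each column label is a tuple such as ('year', 'min')
print(list(agg.columns))

# Joining the parts of every tuple gives flat, single-level names
flat_names = ["_".join(col) for col in agg.columns]
agg.columns = flat_names
print(agg)
```

After the assignment, `agg.columns` is an ordinary single-level index, so a column can be selected with one label such as `agg["year_max"]`.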
Alternatively, named aggregation gives flat column names directly, without the join step:

Code:
``` python=
data.groupby('director_name')[['year', 'title']].aggregate(
    year_max=('year','max'),
    year_min=('year','min'),
    title_count=('title','count')
)
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/238/original/n.png?1708848812" width=500 height=450>

\
The columns look good, but we may want to turn the row labels back into a regular column as well.

**Converting row labels into a column using `reset_index` -**

Code:
``` python=
data_agg.reset_index()
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/239/original/o.png?1708848833" width=520 height=420>

\
**Using the new features, can we find the most productive director?**

1. First, calculate how many years the director has been active.

Code:
``` python=
data_agg["yrs_active"] = data_agg["year_max"] - data_agg["year_min"]
data_agg
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/240/original/p.png?1708848854" width=575 height=450>

\
2. Then, calculate the rate of directing movies as `title_count`/`yrs_active`.

Code:
``` python=
data_agg["movie_per_yr"] = data_agg["title_count"] / data_agg["yrs_active"]
data_agg
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/241/original/q.png?1708848870" width=660 height=425>

\
3. Finally, sort the values.

Code:
``` python=
data_agg.sort_values("movie_per_yr", ascending=False)
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/242/original/r.png?1708848887" width=600 height=425>

\
**Conclusion:**

- By this measure (movies per active year), `Tyler Perry` turns out to be the most productive director.

---
title: Pfizer data
description:
duration: 900
card_type: cue_card
---

For this topic, we will use data on a few drugs being developed by **Pfizer**.
Dataset: https://drive.google.com/file/d/173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ/view?usp=sharing

Code:
``` python=
!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
```

> **Output**

```
Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
  0% 0.00/1.51k [00:00<?, ?B/s]
100% 1.51k/1.51k [00:00<00:00, 8.41MB/s]
```

**What is the data about?**

- Temperature (K)
- Pressure (P)

The readings are recorded at 1-hour intervals every day to monitor drug stability during a drug development test.

These data points are then used to **identify the optimal set of parameter values** for the stability of the drugs.

Let's explore this dataset -

Code:
``` python=
data = pd.read_csv('Pfizer_1.csv')
data
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/245/original/s.png?1708849997" width=750 height=350>

\
Code:
``` python=
data.info()
```

> **Output**

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       18 non-null     object
 1   Drug_Name  18 non-null     object
 2   Parameter  18 non-null     object
 3   1:30:00    16 non-null     float64
 4   2:30:00    16 non-null     float64
 5   3:30:00    12 non-null     float64
 6   4:30:00    14 non-null     float64
 7   5:30:00    16 non-null     float64
 8   6:30:00    18 non-null     int64
 9   7:30:00    16 non-null     float64
 10  8:30:00    14 non-null     float64
 11  9:30:00    16 non-null     float64
 12  10:30:00   18 non-null     int64
 13  11:30:00   16 non-null     float64
 14  12:30:00   18 non-null     int64
dtypes: float64(9), int64(3), object(3)
memory usage: 2.2+ KB
```

---
title: Melting
description:
duration: 1500
card_type: cue_card
---

### Melting

As we saw earlier, the dataset has **18 rows** and **15 columns**.

If you look closer, you'll see:

- The columns are `1:30:00`, `2:30:00`, `3:30:00`, and so on.
- The `Temperature` and `Pressure` readings for each date are in separate rows.
**Can we restructure our data into a better format?**

- Maybe we can have a column for `time`, with `timestamps` as the column value.

**Where will the Temperature/Pressure values go?**

- We can similarly create one column containing the values of these parameters.
- **"Melt" the timestamp columns into two columns** - timestamp and corresponding values

**How can we restructure our data into having every row corresponding to a single reading?**

Code:
``` python=
pd.melt(data, id_vars=['Date', 'Parameter', 'Drug_Name'])
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/253/original/u.png?1708852935" width=550 height=425>

\
This converts our data from `wide` to `long` format.

Notice that `id_vars` is the set of variables that remain unmelted.

**How does `pd.melt()` work?**

- Pass in the **DataFrame**.
- Pass in the **column names that we don't want to melt**.

The two new columns are named `variable` and `value` by default, but we can give them better names.

**How can we rename the columns "variable" and "value" as per our original dataframe?**

Code:
``` python=
data_melt = pd.melt(data,id_vars = ['Date', 'Drug_Name', 'Parameter'],
            var_name = "time",
            value_name = 'reading')
data_melt
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/254/original/v.png?1708853002" width=550 height=425>

\
**Conclusion:**

- The labels of the timestamp columns are conveniently **melted into a single column** - `time`
- All the values are retained in the `reading` column.
- The labels of columns such as `1:30:00`, `2:30:00` have now become categories of the `time` column (named `variable` by default).
- The values from the columns we are melting are stored in the `reading` column (named `value` by default).

---
title: Quiz-1
description:
duration: 60
card_type: quiz_card
---

# Question

Can we use a list of columns like ["Date", "Parameter"] for the var_name parameter in `pd.melt()` on our Pfizer dataset?
# Choices

- [ ] Yes
- [x] No

---
title: Break & Doubt Resolution
description:
duration: 600
card_type: cue_card
---

#### Quiz-1 Explanation

Our current dataframe has a single-level column index, so we can't pass multiple values to the `var_name` parameter.

### Break & Doubt Resolution

`Instructor Note:`
* Take this time (up to 5-10 mins) to give a short break to the learners.
* Meanwhile, you can ask them to share their doubts (if any) regarding the topics covered so far.

---
title: Pivoting
description:
duration: 1500
card_type: cue_card
---

### Pivoting

Now suppose we want to convert our data back to the **wide format**.

The reason could be to maintain the structure for storage or some other purpose.

Notice,

- The variables `Date`, `Drug_Name` and `Parameter` will remain the same.
- The column names will be extracted from the column `time`.
- The values will be extracted from the column `reading`.

**How can we restructure our data back to the original wide format?**

Code:
``` python=
data_melt.pivot(index=['Date','Drug_Name','Parameter'],  # Columns used to make new frame’s index
                columns = 'time',                        # Column used to make new frame’s columns
                values='reading')                        # Column used for populating new frame’s values.
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/255/original/y.png?1708853437" width=750 height=350>

\
Notice that `pivot()` is essentially the inverse of `melt()`.

We get a **multi-level row index** here, but we can get back a single index using `reset_index()`.
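The melt and pivot round trip is easier to verify on a table small enough to read at a glance. The sketch below uses a made-up `wide` frame (hypothetical dates and readings, with the `Drug_Name` column left out for brevity) and assumes a pandas version where `pivot()` accepts a list for `index`, as in the code above.

```python
import pandas as pd

# Hypothetical wide table: one column per timestamp
wide = pd.DataFrame({
    "Date": ["d1", "d1", "d2", "d2"],
    "Parameter": ["Temperature", "Pressure", "Temperature", "Pressure"],
    "1:30:00": [20.0, 10.0, 22.0, 11.0],
    "2:30:00": [21.0, 12.0, 23.0, 13.0],
})

# Wide -> long: one row per (Date, Parameter, time) reading
long_df = pd.melt(wide, id_vars=["Date", "Parameter"],
                  var_name="time", value_name="reading")

# Long -> wide again; reset_index() turns the row labels back into columns
back = long_df.pivot(index=["Date", "Parameter"],
                     columns="time", values="reading").reset_index()
back.columns.name = None  # drop the leftover 'time' label on the columns

print(long_df)
print(back)
```

The values come back unchanged, but the rows are ordered by the index labels (`Pressure` before `Temperature` within each date), so the round trip does not necessarily preserve the original row order.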
Code:
``` python=
data_melt.pivot(index=['Date','Drug_Name','Parameter'],
                columns = 'time',
                values='reading').reset_index()
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/256/original/z.png?1708853500" width=750 height=350>

\
Code:
``` python=
data_melt.head()
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/246/original/x.png?1708850607" width=500 height=200>

\
Now if you notice,

- We are using 2 rows to log the readings for a single experiment.

**Can we further restructure our data by splitting the `Parameter` column into Temperature and Pressure?**

- A format like `Date | time | Drug_Name | Pressure | Temperature` would be suitable.
- We want to **split one single column into multiple columns**.

**How can we divide the `Parameter` column again?**

Code:
``` python=
data_tidy = data_melt.pivot(index=['Date','time', 'Drug_Name'],
                            columns = 'Parameter',
                            values='reading')

data_tidy
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/247/original/x1.png?1708850772" width=550 height=450>

\
Notice that a **multi-index** dataframe has been created.

We can use `reset_index()` to remove the multi-index.

Code:
``` python=
data_tidy = data_tidy.reset_index()
data_tidy
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/248/original/x2.png?1708851048" width=600 height=400>

\
The columns index is still named `Parameter`; we can reset that name to `None`.

Code:
``` python=
data_tidy.columns.name = None
data_tidy.head()
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/248/original/x2.png?1708851048" width=600 height=400>

---
title: Pivot Table
description:
duration: 1500
card_type: cue_card
---

### Pivot Table

Now suppose we want to find some insights, like **mean temperature day-wise**.
**Can we use pivot to find the day-wise mean value of temperature for each drug?**

Code:
``` python=
data_tidy.pivot(index=['Drug_Name'],
                columns = 'Date',
                values=['Temperature'])
```

> **Output**

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-760683e83d71> in <cell line: 1>()
----> 1 data_tidy.pivot(index=['Drug_Name'],
      2                 columns = 'Date',
      3                 values=['Temperature'])

/usr/local/lib/python3.10/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    329                     stacklevel=find_stack_level(),
    330                 )
--> 331             return func(*args, **kwargs)
    332
    333         # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in pivot(self, index, columns, values)
   8565         from pandas.core.reshape.pivot import pivot
   8566
-> 8567         return pivot(self, index=index, columns=columns, values=values)
   8568
   8569     _shared_docs[

/usr/local/lib/python3.10/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    329                     stacklevel=find_stack_level(),
    330                 )
--> 331             return func(*args, **kwargs)
    332
    333         # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no

/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/pivot.py in pivot(data, index, columns, values)
    538     # [List[Any], ExtensionArray, ndarray[Any, Any], Index, Series]"; expected
    539     # "Hashable"
--> 540     return indexed.unstack(columns_listlike)  # type: ignore[arg-type]
    541
    542

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in unstack(self, level, fill_value)
   9110         from pandas.core.reshape.reshape import unstack
   9111
-> 9112         result = unstack(self, level, fill_value)
   9113
   9114         return result.__finalize__(self, method="unstack")

/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    474     if isinstance(obj, DataFrame):
    475         if isinstance(obj.index, MultiIndex):
--> 476             return _unstack_frame(obj, level, fill_value=fill_value)
    477         else:
    478             return obj.T.stack(dropna=False)

/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
    497 def _unstack_frame(obj: DataFrame, level, fill_value=None):
    498     assert isinstance(obj.index, MultiIndex)  # checked by caller
--> 499     unstacker = _Unstacker(obj.index, level=level, constructor=obj._constructor)
    500
    501     if not obj._can_fast_transpose:

/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/reshape.py in __init__(self, index, level, constructor)
    135             )
    136
--> 137         self._make_selectors()
    138
    139     @cache_readonly

/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/reshape.py in _make_selectors(self)
    187
    188         if mask.sum() < len(self.index):
--> 189             raise ValueError("Index contains duplicate entries, cannot reshape")
    190
    191         self.group_index = comp_index

ValueError: Index contains duplicate entries, cannot reshape
```

**Why did we get an error?**

- We want the **average** of the temperature values across a day, but `pivot()` only reshapes the data; it does not aggregate.
- The error points to **duplicate entries**: `pivot()` needs each index/column combination to appear only once, and here every drug has several readings per date.

**What can we do to get our required mean values then?**

Code:
``` python=
pd.pivot_table(data_tidy, index='Drug_Name', columns='Date', values=['Temperature'], aggfunc=np.mean)
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/250/original/x4.png?1708851728" width=500 height=220>

\
This function is similar to `pivot()`, with the extra feature of an aggregator.

**How does `pivot_table()` work?**

- The initial parameters are the same as what we use in `pivot()`.
- As an extra parameter, we pass the **type of aggregator**.

**Note:**

- We could have done this using `groupby` too.
- In fact, `pivot_table` uses `groupby` in the backend to group the data and perform the aggregation.
- The only difference is in the shape of the output we get from the two functions.
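To illustrate the relationship between `pivot_table` and `groupby` noted above, here is a small sketch on invented readings. The `toy` frame is hypothetical; the string `"mean"` is used for `aggfunc` here as an alternative spelling of the same aggregation.

```python
import pandas as pd

# Hypothetical tidy readings (values invented for illustration)
toy = pd.DataFrame({
    "Drug_Name": ["X", "X", "X", "Y", "Y"],
    "Date": ["d1", "d1", "d2", "d1", "d1"],
    "Temperature": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# pivot() would fail on this data because ('X', 'd1') appears twice;
# pivot_table() aggregates the duplicates instead
table = pd.pivot_table(toy, index="Drug_Name", columns="Date",
                       values="Temperature", aggfunc="mean")

# The same means via groupby: the result is a long Series with a
# two-level row index, which unstack() reshapes into the wide layout
grouped = toy.groupby(["Drug_Name", "Date"])["Temperature"].mean()

print(table)
print(grouped.unstack())
```

Drug `Y` has no reading on `d2` in this toy data, so that cell is `NaN` in the wide table, while the `groupby` Series simply has no row for it.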
**Similarly, what if we want to find the minimum values of temperature and pressure on a particular date?**

Code:
``` python=
pd.pivot_table(data_tidy, index='Drug_Name', columns='Date', values=['Temperature', 'Pressure'], aggfunc=np.min)
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/066/251/original/x5.png?1708851925" width=700 height=200>

---
title: Quiz-2
description:
duration: 60
card_type: quiz_card
---

# Question

Consider a dataset containing information about sales transactions, with columns '**Date**', '**Product**', '**Quantity**', and '**Revenue**'. After performing a pivot operation on the '**Product**' column, which of the following statements is most likely to be true?

# Choices

- [ ] The resulting DataFrame will have more rows than the original dataset.
- [x] Each unique value in the 'Product' column will become a separate column in the pivoted DataFrame.
- [ ] The total revenue across all products will remain unchanged after the pivot operation.

---
title: Binning
description:
duration: 1500
card_type: cue_card
---

#### Quiz-2 Explanation

When performing a pivot operation in Pandas, each unique value in the specified column (in this case, the 'Product' column) becomes a separate column in the pivoted DataFrame.

### Binning

Sometimes, we want our data to be in **categorical** form instead of **continuous/numerical**.

- Let's say, instead of knowing the specific test values for a month, we want to know which category they fall into.
- Depending on the level of granularity, we want to have - Low, Medium, High, Very High.

**How can we derive bins/buckets from continuous data?**

- use `pd.cut()`

Let's try to use this on our `Temperature` column to categorise the data into bins.

But to define categories, let's first check the `min` and `max` temperature values.
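Before turning to the real column, here is a minimal sketch of how `pd.cut()` maps numbers to labelled bins. The `temps` values, `edges` and `labels` are made up for illustration; with the default `right=True`, each bin includes its right edge and excludes its left edge.

```python
import pandas as pd

# Hypothetical temperature readings (invented values)
temps = pd.Series([8.0, 20.0, 20.5, 35.0, 58.0])

# Five edges define four bins; there must be one label per bin
edges = [5, 20, 35, 50, 60]
labels = ["low", "medium", "high", "very_high"]

# Default right=True: bins are (5, 20], (20, 35], (35, 50], (50, 60]
cats = pd.cut(temps, bins=edges, labels=labels)
print(cats.tolist())
```

A value that sits exactly on an edge, such as `20.0`, lands in the bin to its left (`low`), while `20.5` is the first value to fall into `medium`.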
Code:
``` python=
data_tidy
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/088/993/original/p1.png?1725868246" width=600 height=450>

\
Code:
``` python=
print(data_tidy['Temperature'].min(), data_tidy['Temperature'].max())
```

> **Output**

    8.0 58.0

Here,

- Min value = 8
- Max value = 58

Let's keep some buffer for future values and take the range from 5-60 (instead of 8-58).

We'll divide this range into **4 bins**, each 10 to 15 units wide.

Code:
``` python=
temp_points = [5, 20, 35, 50, 60]
temp_labels = ['low','medium','high','very_high']
# labels define the severity of the resultant output of the test
data_tidy['temp_cat'] = pd.cut(data_tidy['Temperature'], bins=temp_points, labels=temp_labels)
data_tidy.head()
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/088/994/original/p2.png?1725868296" width=600 height=200>

\
Code:
``` python=
data_tidy['temp_cat'].value_counts()
```

> **Output**

    low          50
    medium       38
    high         15
    very_high     5
    Name: temp_cat, dtype: int64

**Note:** By default, `pd.cut()` creates intervals of the form (x, y], which include the right endpoint but exclude the left one.

---
title: Quiz-3
description:
duration: 60
card_type: quiz_card
---

# Question

How does the `pd.cut()` function behave when applied to an already categorical column to further segment it into categories?

# Choices

- [x] An error is raised
- [ ] It bins the values further
- [ ] No change occurs

---
title: Unlock Assignment & ask learner to solve in live class
description:
duration: 1800
card_type: cue_card
---

* Unlock the assignment for learners by clicking the **“question mark”** button on the top bar.

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/078/685/original/Screenshot_2024-06-19_at_7.17.12_PM.png?1718804854" width=200 />

* If you face any difficulties using this feature, please refer to this video on how to unlock assignments.
* **Note:** The following video is strictly for instructor reference only. [VIDEO LINK](https://www.loom.com/share/15672134598f4b4c93475beda227fb3d?sid=4fb31191-ae8c-4b18-bf81-468d2ffd9bd4)

### Conducting a Live Assignment Solution Session:

1. Once you unlock the assignments, ask if anyone in the class would like to solve a question live by sharing their screen.
2. Select a learner and grant permission by navigating to **Settings > Admin > Unmuted Audience Can Share**, then select **Audio, Video, and Screen**.

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/111/113/original/image.png?1740484517" width=400 />

3. Allow the selected learner to share their screen and guide them through solving the question live.
4. Engage with both the learner sharing the screen and other students in the class to foster an interactive learning experience.

### Practice Coding Question(s)

You can pick the following question and solve it during the lecture itself. This will help learners get familiar with the problem-solving process and motivate them to solve the assignments.

Make sure to start the doubt session before you solve this question.

Q. https://www.scaler.com/hire/test/problem/103383/ - Year Wise Life Expectancy