
# Pandas 1 --- title: Agenda description: duration: 300 card_type: cue_card --- ### Agenda - Introduction to Pandas - DataFrame & Series - Creating DataFrame from Scratch (Post-read) - Basic ops on a DataFrame - Basic ops on Columns - Accessing column(s) - Check for unique values - Rename column - Deleting column(s) - Creating new column(s) - Basic ops on Rows - Implicit/Explicit index - Indexing in Series - Slicing in Series - loc/iloc - Indexing/Slicing in DataFrame --- title: Introduction to Pandas description: duration: 900 card_type: cue_card --- ### Pandas Installation ``` python= !pip install pandas ``` ### Importing Pandas - You should be able to import Pandas after installing it. - We'll import `pandas` using its **alias name** `pd`. ``` python= import pandas as pd import numpy as np ``` ### Why use Pandas? - The major **limitation of numpy** is that it can only work with one datatype at a time. - Most real-world datasets contain a mix of different datatypes. - **names of a place would be string** - **population of a place would be int** It is difficult to work with data having **heterogeneous values** using Numpy. On the other hand, Pandas can work with numbers and strings together. ### Problem Statement - Imagine that you are a Data Scientist with McKinsey. - McKinsey wants to understand the relation between GDP per capita and life expectancy for their clients. - The company has obtained data from various surveys conducted in different countries over several years. - The acquired data includes information on - - Country - Population Size - Life Expectancy - GDP per Capita - We have to analyse the data and draw inferences that are meaningful to the company. 
### Loading the dataset Dataset: https://drive.google.com/file/d/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_/view?usp=sharing Code: ``` python= !wget "https://drive.google.com/uc?export=download&id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_" -O mckinsey.csv ``` > **Output** ``` --2023-01-20 17:42:26-- https://drive.google.com/uc?export=download&id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_ Resolving drive.google.com (drive.google.com)... 173.194.202.139, 173.194.202.113, 173.194.202.138, ... Connecting to drive.google.com (drive.google.com)|173.194.202.139|:443... connected. HTTP request sent, awaiting response... 303 See Other Location: https://doc-0s-68-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/butcf5ctf36986lddjpg19e59js0o9vk/1674236475000/14302370361230157278/*/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_?e=download&uuid=9df7f5c8-a0e7-4340-a593-8f51f15cb620 [following] Warning: wildcards not supported in HTTP. --2023-01-20 17:42:27-- https://doc-0s-68-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/butcf5ctf36986lddjpg19e59js0o9vk/1674236475000/14302370361230157278/*/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_?e=download&uuid=9df7f5c8-a0e7-4340-a593-8f51f15cb620 Resolving doc-0s-68-docs.googleusercontent.com (doc-0s-68-docs.googleusercontent.com)... 173.194.203.132, 2607:f8b0:400e:c05::84 Connecting to doc-0s-68-docs.googleusercontent.com (doc-0s-68-docs.googleusercontent.com)|173.194.203.132|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 83785 (82K) [text/csv] Saving to: ‘mckinsey.csv’ mckinsey.csv 100%[===================>] 81.82K --.-KB/s in 0.001s 2023-01-20 17:42:27 (78.4 MB/s) - ‘mckinsey.csv’ saved [83785/83785] ``` **Now how should we read this dataset?** Pandas makes it very easy to work with these kinds of files. 
Code: ``` python= df = pd.read_csv('mckinsey.csv') # storing the data in df df ``` > **Output** <img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/065/852/original/i.png?1708491297" width=600 height=425> --- title: DataFrame and Series description: duration: 1500 card_type: cue_card --- ### DataFrame and Series **What can we observe from the above dataset?** We can see that it has: - 6 columns - 1704 rows **What do you think is the datatype of `df`?** Code: ``` python= type(df) ``` > **Output** ``` pandas.core.frame.DataFrame ``` It is a **Pandas DataFrame** #### What is a Pandas DataFrame? - A DataFrame is a **table-like (structured)** representation of data in Pandas. - Considered as a **counterpart of 2D matrix** in Numpy. <img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/822/original/Image_2.png?1694454212" width=650 height=450> \ **How can we access a column, say `country` of the dataframe?** Code: ``` python= df["country"] ``` > **Output** ``` 0 Afghanistan 1 Afghanistan 2 Afghanistan 3 Afghanistan 4 Afghanistan ... 1699 Zimbabwe 1700 Zimbabwe 1701 Zimbabwe 1702 Zimbabwe 1703 Zimbabwe Name: country, Length: 1704, dtype: object ``` As you can see, we get all the values present in the **country** column. **What is the data-type of a column?** Code: ``` python= type(df["country"]) ``` > **Output** ``` pandas.core.series.Series ``` It is a **Pandas Series** #### What is a Pandas Series? - A **Series** in Pandas is what a **Vector** is in Numpy. **What exactly does that mean?** - It means that a Series is a **single column of data**. - Multiple Series are stacked together to form a DataFrame. <img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/823/original/Image_3.png?1694454319" width=600 height=250> \ Now we have understood what Series and DataFrame are. 
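The relationship between the two types can be checked on a tiny, made-up table. This is only a sketch: the names `toy`, `col` and `rebuilt` and the values are hypothetical and are not taken from `mckinsey.csv`.

```python
import pandas as pd

# Hypothetical toy data (not from mckinsey.csv)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "population": [10, 20, 30],
})

col = toy["country"]
print(type(toy).__name__)   # DataFrame
print(type(col).__name__)   # Series

# Placing the individual Series side by side gives back the same DataFrame
rebuilt = pd.concat([toy["country"], toy["population"]], axis=1)
print(rebuilt.equals(toy))  # True
```

Each column pulled out with `[]` is a Series, and joining the columns again along `axis=1` reproduces the original table.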
**How can we find the datatype, name, and total entries in each column?**

Code:
``` python=
df.info()
```

> **Output**

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     1704 non-null   object 
 1   year        1704 non-null   int64  
 2   population  1704 non-null   int64  
 3   continent   1704 non-null   object 
 4   life_exp    1704 non-null   float64
 5   gdp_cap     1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
```

`df.info()` gives a list of columns with:

- **Name** of each column
- **How many non-null values** (cells that are not blank) each column has.
- **Type of values** in each column - int, float, etc.

Columns holding strings (or mixed values), such as `country`, show their **Dtype** as `object`.

**What if we want to see the first few rows in the dataset?**

Code:
``` python=
df.head()
```

> **Output**

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/824/original/Image_4.png?1694454638 height=225 width=550>

**`df.head()` shows the top 5 rows by default.**

We can also pass in the number of rows that we want to see.

Code:
``` python=
df.head(20)
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/825/original/Image_5.png?1694454995" width="550" height="700">

Similarly, we can use **`df.tail()` if we wish to see the last few rows**.

Code:
``` python=
df.tail(20)
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/826/original/Image_6.png?1694455315" width="520" height="600">

\
**How can we find the shape of a dataframe?**

Code:
``` python=
df.shape
```

> **Output**

```
(1704, 6)
```

Similar to Numpy, it gives the **number of rows and columns** as a tuple.
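To make the inspection calls above concrete, here is a minimal sketch on a made-up frame that contains one missing value. The frame `toy` and its values are hypothetical; `count()` and `isna()` are used here only to show what the non-null column of `info()` is counting.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing life_exp value
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "life_exp": [70.1, np.nan, 65.3, 80.0],
})

print(toy.shape)    # (4, 2) -> 4 rows, 2 columns
print(toy.head(2))  # first 2 rows
print(toy.tail(1))  # last row

non_null = toy.count()      # non-null entries per column
missing = toy.isna().sum()  # missing entries per column
print(non_null["life_exp"], missing["life_exp"])  # 3 1
```

In `mckinsey.csv` every column reports 1704 non-null values, which means no cell is missing.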
### Post-read

- [**DataFrame from Scratch**](https://colab.research.google.com/drive/1x3ct95RtIIQTJeGbyuuYaMociVp90ww6?usp=sharing)

---
title: Basic operations on columns
description:
duration: 2400
card_type: cue_card
---

### Basic operations on Columns

**What operations can we do using columns?**

- Add a column
- Delete a column
- Rename a column

We can see that our dataset has 6 columns.

**How can we get the names of all these columns?**

We can do it in two ways:

1. `df.columns`
2. `df.keys()`

Code:
``` python=
df.columns # using attribute `columns` of dataframe
```

> **Output**

```
Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')
```

Code:
``` python=
df.keys() # using method `keys()` of dataframe
```

> **Output**

```
Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')
```

**Note:**

- Here, `Index` is the Pandas class used to store the **labels** of an axis of a series/dataframe (in this case, the column names).
- It is an immutable sequence used for indexing.

**How can we access these columns?**

Code:
``` python=
df['country'].head() # accessing a single column
```

> **Output**

```
0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object
```

Code:
``` python=
df[['country', 'life_exp']].head() # accessing multiple columns
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/828/original/Image_8.png?1694455741" width=210 height=200>

\
**And what if we pass a single column name inside a list?**

``` python=
df[['country']].head()
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/829/original/Image_9.png?1694455883" width=130 height=200>

\
**Note:**

- Notice how this output type is different from our earlier output using `df['country']`
- `['country']` gives a Series while `[['country']]` gives a DataFrame.
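The difference between single and double brackets can be verified directly. A minimal sketch on a hypothetical two-row frame (the name `toy` and its values are made up):

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"country": ["A", "B"], "life_exp": [70.1, 65.3]})

single = toy["country"]     # a Series (one dimension)
double = toy[["country"]]   # a DataFrame with exactly one column

print(type(single).__name__, single.shape)  # Series (2,)
print(type(double).__name__, double.shape)  # DataFrame (2, 1)

# `columns` and `keys()` report the same labels
print(list(toy.columns) == list(toy.keys()))  # True
```

The shapes make the distinction visible: the Series is one-dimensional, while the one-column DataFrame still has rows and columns.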
**How can we find the countries that have been surveyed?** We can find the unique values in the `country` column. Code: ``` python= df['country'].unique() ``` > **Output** ``` array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium', 'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia', 'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana', 'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti', 'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. 
Rep.', 'Korea, Rep.', 'Kuwait', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania', 'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman', 'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Puerto Rico', 'Reunion', 'Romania', 'Rwanda', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Sierra Leone', 'Singapore', 'Slovak Republic', 'Slovenia', 'Somalia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan', 'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan', 'Tanzania', 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Uganda', 'United Kingdom', 'United States', 'Uruguay', 'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Yemen, Rep.', 'Zambia', 'Zimbabwe'], dtype=object)
```

**What if you also want to check the count of occurrences of each country in the dataframe?**

Code:
``` python=
df['country'].value_counts()
```

> **Output**

```
Afghanistan          12
Pakistan             12
New Zealand          12
Nicaragua            12
Niger                12
                     ..
Eritrea              12
Equatorial Guinea    12
El Salvador          12
Egypt                12
Zimbabwe             12
Name: country, Length: 142, dtype: int64
```

**Note:** `value_counts()` shows the output in **decreasing order of frequency**. Here every country appears 12 times, so all the counts are equal.
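Since every country in the dataset appears the same number of times, the ordering of `value_counts()` is easier to see on data with unequal counts. A small sketch with a made-up Series (the name `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical Series with unequal frequencies
toy = pd.Series(["India", "Chile", "India", "Peru", "India", "Chile"])

print(list(toy.unique()))  # ['India', 'Chile', 'Peru'] (order of first appearance)
print(toy.nunique())       # 3

counts = toy.value_counts()
print(counts.index.tolist())  # ['India', 'Chile', 'Peru']
print(counts.tolist())        # [3, 2, 1] (most frequent first)
```

`nunique()` is a shortcut when only the number of distinct values is needed, for example the 142 countries reported in the output above.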
**What if we want to change the name of a column?**

We can rename the column by

- passing a dictionary with `old_name:new_name` pairs
- specifying `axis=1`

Code:
``` python=
df.rename({"population":"Population", "country":"Country" }, axis=1)
```

>**Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/831/original/Image_10.png?1694456046" width=500 height=400>

\
Alternatively, we can also rename the column

- without specifying `axis`
- by using the `columns` parameter

Code:
``` python=
df.rename(columns={"country":"Country"})
```

>**Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/832/original/Image_11.png?1694456119" width=500 height=400>

\
If we try and check the original dataframe `df` -

```python=
df
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/833/original/Image_12.png?1694456346" width=500 height=400>

\
We can clearly see that the column names are still the same and have not changed.

The changes don't happen in the original dataframe unless we set the parameter `inplace` to `True`.

``` python=
df.rename({"country":"Country"}, axis=1, inplace=True)
df
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/837/original/Image_13.png?1694456744" width=500 height=400>

\
**Note**

- `.rename` has a default value of `axis=0`, i.e. it renames row labels unless `axis=1` or `columns` is given.
- If two columns have the **same name**, then `df['column']` will display both columns.

There's another way of accessing the column values.

``` python=
df.Country
```

>**Output**

```
0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: Country, Length: 1704, dtype: object
```

This, however, doesn't work every time.
**What do you think could be the problem here?**

- If the column names are **not valid Python identifiers**
    - Starting with a **number**: e.g., `2nd`
    - Containing a **whitespace**: e.g., `Roll Number`
- If the column names conflict with **methods or attributes of the DataFrame**
    - e.g. `shape`

We already know the continent in which each country lies. So we probably don't need this column.

**How can we delete columns from a dataframe?**

Code:
``` python=
df.drop('continent', axis=1)
```

>**Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/838/original/Image_14.png?1694456901" width=450 height=400>

\
Here, we passed two arguments to `drop()`:

- column name
- axis

By default, the value of `axis` is 0, which refers to rows.

An alternative to the above approach is using the `columns` parameter as we did in `rename()`.

Code:
``` python=
df.drop(columns=['continent'])
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/839/original/Image_15.png?1694457185" width=450 height=400>

\
As you can see, the column `continent` is dropped.

**Has the column been permanently deleted?**

Code:
``` python=
df.head()
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/840/original/Image_16.png?1694457584" width=500 height=200>

\
No, the column `continent` is still there in the original dataframe.

**Do you see what's happening here?**

`drop()` returned a **new dataframe** with column `continent` dropped; the original `df` was left untouched.

**How can we permanently drop the column?**

- We can either **re-assign** it
    - `df = df.drop('continent', axis=1)`
- Or we can **set the parameter `inplace=True`**
    - By default, `inplace=False`.

Code:
``` python=
df.drop('continent', axis=1, inplace=True)
```

**What if we want to create a new column?**

- We can either use values from **existing columns**.
- Or we can create our own values.
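Both ways of creating a column, together with the non-destructive behaviour of `drop()` discussed above, can be sketched on a small made-up frame. The frame `toy`, the `rank` and `bad` columns and all values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "gdp_cap": [100.0, 200.0, 300.0],
    "population": [10, 20, 30],
})

# New column from existing columns (element-wise product)
toy["gdp"] = toy["gdp_cap"] * toy["population"]

# New column from our own values; the length must match the number of rows
toy["rank"] = [3, 2, 1]

try:
    toy["bad"] = [1, 2]  # only 2 values for 3 rows
except ValueError as err:
    print("ValueError:", err)

# drop() returns a new DataFrame; toy itself is unchanged
trimmed = toy.drop(columns=["rank"])
print(list(trimmed.columns))  # ['gdp_cap', 'population', 'gdp']
print(list(toy.columns))      # ['gdp_cap', 'population', 'gdp', 'rank']
```

The failed assignment leaves `toy` untouched, and `rank` disappears from `toy` only if the result of `drop()` is re-assigned or `inplace=True` is passed.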
**How can we create a column using values from an existing column?**

Code:
``` python=
df["year+7"] = df["year"] + 7
df.head()
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/843/original/Image_18.png?1694458249" width=450 height=200>

\
As we see, a new column `year+7` is created from the column `year`.

We can also use values from two columns to form a new column.

**Which two columns can we use to create a new column `gdp`?**

Code:
``` python=
df['gdp'] = df['gdp_cap'] * df['population']
df.head()
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/846/original/Image_19.png?1694458469" width=550 height=200>

\
As you can see

- An additional column has been created.
- Values in this column are the **product of the respective values in the `gdp_cap` and `population` columns**.

**What other operations can we use?**

- Addition
- Subtraction
- Division

**How can we create a new column from our own values?**

- We can either **create a list**.
- Or we can **create a Pandas Series** from a list/numpy array for our new column.

Code:
``` python=
df["Own"] = list(range(1704)) # the number of values must equal the number of rows
df
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/848/original/Image_20.png?1694458598" width=650 height=400>

\
Before we move to ops on rows, let's drop the newly created columns.

``` python=
df.drop(columns=["Own", 'gdp', 'year+7'], inplace=True) # `axis` is not needed when `columns` is used
df
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/850/original/Image_21.png?1694458701" width=450 height=400>

---
title: Quiz-1
description:
duration: 60
card_type: quiz_card
---

# Question
Given a dataframe consisting of 5 columns, which is the correct code to drop the 3rd column from the start?

# Choices

- [x] df.drop(df.columns[-3], axis=1)
- [ ] df.drop(df.columns[3], axis=1)
- [ ] df.drop(df.columns[-3], axis=0)
- [ ] df.drop(df.columns[3], axis=0)

---
title: Break & Doubt Resolution
description:
duration: 600
card_type: cue_card
---

### Break & Doubt Resolution

`Instructor Note:`
* Take this time (up to 5-10 mins) to give a short break to the learners.
* Meanwhile, you can ask them to share their doubts (if any) regarding the topics covered so far.

---
title: Basic operations on rows
description:
duration: 1800
card_type: cue_card
---

### Basic operations on Rows

**Just like columns, do rows also have labels? Yes.**

- **Can we change row labels (like we did for columns)?**
- **What if we want to start indexing from 1 (instead of 0)?**

Code:
``` python=
df.index = list(range(1, df.shape[0]+1)) # create a list of indices of same length
df
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/851/original/Image_22.png?1694458818" width=450 height=400>

\
As you can see, the indexing now starts from 1 instead of 0.

### Explicit and Implicit Indices

**What are these row labels/indices exactly?**

- They can be called identifiers of a particular row.
- Specifically known as **explicit indices**.

**Additionally, can a series/dataframe also use Python style indexing? Yes.**

- The Python style indices are known as **implicit indices**.

**How can we access the explicit index of a particular row?**

- using `df.index[]`
- Takes the **implicit index** of a row to give its **explicit index**.

Code:
``` python=
df.index[1] # implicit index 1 gave explicit index 2
```

> **Output**

```
2
```

**But why not use just implicit indexing?**

Explicit indices can be changed to any value of any datatype.

- e.g.
explicit index of 1st row can be changed to `first` - Or something like a floating point value, say `1.0` Code: ``` python= df.index = np.arange(1, df.shape[0]+1, dtype='float') df ``` > **Output** <img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/852/original/Image_23.png?1694459033" width=450 height=400> \ As we can see, the indices are now floating point values. Now to understand string indices, let's take a small subset of our original dataframe. ``` python= sample = df.head() sample ``` > **Output** <img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/854/original/Image_24.png?1694459095" width=450 height=220> \ **What if we want to use string indices?** Code: ``` python= sample.index = ['a', 'b', 'c', 'd', 'e'] sample ``` > **Output** <img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/855/original/Image_25.png?1694459166" width=450 height=220> \ This shows us that we can use almost anything as our explicit index. Now, let's reset our indices back to integers. Code: ``` python= df.index = np.arange(1, df.shape[0]+1, dtype='int') ``` **What if we want to access any particular row (say first row)?** Let's first see for one column. Later, we can generalise the same for the entire dataframe. Code: ``` python= ser = df["Country"] ser.head(20) ``` > **Output** ``` 1 Afghanistan 2 Afghanistan 3 Afghanistan 4 Afghanistan 5 Afghanistan 6 Afghanistan 7 Afghanistan 8 Afghanistan 9 Afghanistan 10 Afghanistan 11 Afghanistan 12 Afghanistan 13 Albania 14 Albania 15 Albania 16 Albania 17 Albania 18 Albania 19 Albania 20 Albania Name: Country, dtype: object ``` We can simply use its indices much like we do in a Numpy array. 
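Before trying this on `ser`, the relationship between explicit and implicit indices described above can be sketched on a tiny Series. The data and labels below are made up; `Index.get_loc` is used here only to translate a label back into its position.

```python
import pandas as pd

# Hypothetical Series whose explicit labels start at 1
toy = pd.Series(["x", "y", "z"], index=[1, 2, 3])

first_label = toy.index[0]        # implicit position 0 -> explicit label 1
pos_of_3 = toy.index.get_loc(3)   # explicit label 3 -> implicit position 2
print(first_label, pos_of_3)      # 1 2

# Explicit labels can be replaced wholesale; the values stay where they are
toy.index = ["a", "b", "c"]
print(toy["b"])                   # y
print(list(toy))                  # ['x', 'y', 'z']
```

Positions always run from 0 to `len - 1`, whereas labels are whatever we assign to the index.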
**So, how will we then access the 13th element?**

Code:
``` python=
ser[12]
```

> **Output**

```
'Afghanistan'
```

**What about accessing a subset of rows (say 6th to 15th)?**

Code:
``` python=
ser[5:15]
```

> **Output**

```
6     Afghanistan
7     Afghanistan
8     Afghanistan
9     Afghanistan
10    Afghanistan
11    Afghanistan
12    Afghanistan
13        Albania
14        Albania
15        Albania
Name: Country, dtype: object
```

This is known as `Slicing`.

Notice something different though?

- **Indexing in Series** used **explicit indices**: `ser[12]` returned the row labelled 12, not the 13th row.
- **Slicing** however used **implicit indices**

Let's try the same for the dataframe.

**How can we access a row in a dataframe?**

``` python=
df[0]
```

> **Output**

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-62-ad11118bc8f3> in <module>
----> 1 df[0]

/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except
KeyError as err: -> 3363 raise KeyError(key) from err 3364 3365 if is_scalar(key) and isna(key) and not self.hasnans: KeyError: 0 ``` Notice that this syntax is exactly same as how we tried accessing a column. - `df[x]` looks for column with name `x` **How can we access a slice of rows in the dataframe?** Code: ``` python= df[5:15] ``` Woah, so the slicing works. This can be a cause for confusion. To avoid this, Pandas provides special indexers, `loc` and `iloc`. --- title: loc description: duration: 600 card_type: cue_card --- ### 1. `loc` - Allows indexing and slicing that always references the explicit index. Code: ``` python= df.loc[1] ``` > **Output** ``` Country Afghanistan year 1952 population 8425333 life_exp 28.801 gdp_cap 779.445314 Name: 1, dtype: object ``` Code: ``` python= df.loc[1:3] ``` > **Output** <img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/856/original/Image_26.png?1694459518" width=400 height=150> \ Did you notice something strange here? - The **range is inclusive** of **end point** for `loc`. - **Row with label 3** is **included** in the result. --- title: Quiz-2 description: duration: 60 card_type: quiz_card --- # Question For the given Series: ```python= demo = pd.Series(['a','b','c','d','e'], index=[1,5,3,7,3]) ``` What would `demo.loc[1:3]` return? # Choices - [ ] First 3 elements - [ ] First 5 elements - [x] Error --- title: Quiz-3 description: duration: 60 card_type: quiz_card --- # Question What would the following code print? ``` python df.loc[9::-3] ``` # Choices - [ ] rows 0, 3, 6, 9, in this order - [x] rows 9, 6, 3, 0 in this order - [ ] None of these --- title: iloc description: duration: 600 card_type: cue_card --- ### 2. `iloc` - Allows indexing and slicing that always references the implicit index. 
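A side-by-side sketch of the two indexers on a made-up frame whose labels start at 1, mirroring the re-indexed `df` used in this section. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with explicit labels 1..4
toy = pd.DataFrame(
    {"year": [1952, 1957, 1962, 1967]},
    index=[1, 2, 3, 4],
)

print(toy.loc[1, "year"])   # 1952 -> row labelled 1
print(toy.iloc[1, 0])       # 1957 -> row at position 1

print(len(toy.loc[1:3]))    # 3 -> labels 1, 2 and 3 (end label included)
print(len(toy.iloc[1:3]))   # 2 -> positions 1 and 2 (end position excluded)
```

The same integer therefore selects different rows depending on whether it is read as a label (`loc`) or as a position (`iloc`).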

Code:
``` python=
df.iloc[1]
```

> **Output**

```
Country       Afghanistan
year                 1957
population        9240934
life_exp           30.332
gdp_cap         820.85303
Name: 2, dtype: object
```

**Will `iloc` also consider the range inclusive?**

Code:
``` python=
df.iloc[0:2]
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/047/857/original/Image_27.png?1694459781" width=420 height=100>

\
No, because **`iloc` works with implicit Python-style indices**, where the end of a slice is excluded.

**Which one should we use?**

- Generally, explicit indexing is considered to be better than implicit indexing.
- But it is recommended to always state the choice by using `loc` or `iloc`, rather than plain `[]`, to avoid any confusion.

**What if we want to access multiple non-consecutive rows at the same time?**

Code:
``` python=
df.iloc[[1, 10, 100]]
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/065/877/original/i_24.png?1708492176" width=400 height=125>

\
We can just **pack the indices in `[]`** and pass them to `loc` or `iloc`.

**What about negative index?
Which would work between `iloc` and `loc`?** Code: ``` python= df.iloc[-1] # Works and gives last row in dataframe ``` > **Output** ``` Country Zimbabwe year 2007 population 12311143 life_exp 43.487 gdp_cap 469.709298 Name: 1704, dtype: object ``` Code: ``` python= df.loc[-1] # Does not work ``` > **Output** ``` --------------------------------------------------------------------------- KeyError Traceback (most recent call last) /usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3801 try: -> 3802 return self._engine.get_loc(casted_key) 3803 except KeyError as err: /usr/local/lib/python3.10/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() /usr/local/lib/python3.10/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() KeyError: -1 The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) <ipython-input-50-b14e7cd6c153> in <cell line: 1>() ----> 1 df.loc[-1] 2 3 # Does not work /usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in __getitem__(self, key) 1071 1072 maybe_callable = com.apply_if_callable(key, self.obj) -> 1073 return self._getitem_axis(maybe_callable, axis=axis) 1074 1075 def _is_scalar_access(self, key: tuple): /usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis) 1310 # fall thru to straight lookup 1311 self._validate_key(key, axis) -> 1312 return self._get_label(key, axis=axis) 1313 1314 def _get_slice_axis(self, slice_obj: slice, axis: int): /usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in _get_label(self, label, axis) 1258 def _get_label(self, label, axis: int): 1259 # GH#5567 this will fail if the label is not present in the 
axis.
-> 1260         return self.obj.xs(label, axis=axis)
   1261 
   1262     def _handle_lowerdim_multi_index_axis0(self, tup: tuple):

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
   4054                     new_index = index[loc]
   4055         else:
-> 4056             loc = index.get_loc(key)
   4057 
   4058             if isinstance(loc, np.ndarray):

/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:
-> 3804                 raise KeyError(key) from err
   3805             except TypeError:
   3806                 # If we have a listlike key, _check_indexing_error will raise

KeyError: -1
```

**So, why did `iloc[-1]` work, but `loc[-1]` didn't?**

- Because **`iloc` works with positional indices, while `loc` works with assigned labels**.
- `[-1]` here points to the **row at the last position** in `iloc`.
- For `loc`, `-1` is just a label, and no row has the label `-1`.

---
title: Quiz-4
description:
duration: 60
card_type: quiz_card
---

# Question
How to select records from 30th to 40th row for the last 3 columns using `iloc`?

# Choices

- [x] df.iloc[29:40,-3:]
- [ ] df.iloc[30:39,-3:]
- [ ] df.iloc[29:30,:-3]
- [ ] df.iloc[31:41,:-3]

---
title: Columns as Row Index
description:
duration: 400
card_type: cue_card
---

**Can we use one of the columns as row index?**

Code:
``` python=
temp = df.set_index("Country")
temp
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/065/878/original/i_25.png?1708492193" width=425 height=450>

\
**Note:** `set_index()` removes the column being used as the new index by default (`drop=True`), so this parameter need not be passed explicitly.

**Now what would the row corresponding to index `Afghanistan` give?**

Code:
``` python=
temp.loc['Afghanistan']
```

> **Output**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/065/879/original/i_26.png?1708492205" width=425 height=425>

\
As you can see, we got all the rows having index `Afghanistan`.

Generally, it is advisable to keep unique indices.
But it also depends on the use-case. **How can we reset our indices back to integers?** Code ``` python= df.reset_index() ``` > **Output** <img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/065/880/original/i_27.png?1708492217" width=500 height=400> \ Notice that it's creating a new column `index`. **How can we reset our index without creating this new column?** Code: ``` python= df.reset_index(drop=True) # by using drop=True we can prevent creating a new column ``` > **Output** <img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/065/881/original/i_28.png?1708492236" width=450 height=400> \ Great! Now let's do this in place. Code: ``` python df.reset_index(drop=True, inplace=True) ``` --- title: Quiz-5 description: duration: 60 card_type: quiz_card --- # Question What will the following code print? ```python= a = pd.Series(['a','b','c'], index=[1,2,2]) print(a[2]) ``` # Choices - [x] both b and c - [ ] only b - [ ] only c --- title: Unlock Assignment & ask learner to solve in live class description: duration: 1800 card_type: cue_card --- * Unlock the assignment for learners by clicking the **“question mark”** button on the top bar. <img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/078/685/original/Screenshot_2024-06-19_at_7.17.12_PM.png?1718804854" width=200 /> * If you face any difficulties using this feature, please refer to this video on how to unlock assignments. * **Note:** The following video is strictly for instructor reference only. [VIDEO LINK](https://www.loom.com/share/15672134598f4b4c93475beda227fb3d?sid=4fb31191-ae8c-4b18-bf81-468d2ffd9bd4) ### Conducting a Live Assignment Solution Session: 1. Once you unlock the assignments, ask if anyone in the class would like to solve a question live by sharing their screen. 2. Select a learner and grant permission by navigating to **Settings > Admin > Unmuted Audience Can Share**, then select **Audio, Video, and Screen**. 
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/111/113/original/image.png?1740484517" width=400 /> 3. Allow the selected learner to share their screen and guide them through solving the question live. 4. Engage with both the learner sharing the screen and other students in the class to foster an interactive learning experience. ### Practice Coding Question(s) You can pick the following question and solve it during the lecture itself. This will help the learners to get familiar with the problem solving process and motivate them to solve the assignments. Make sure to start the doubt session before you solve this question. Q. https://www.scaler.com/hire/test/problem/23260/ - The selected rows