---
title: Virgil - Intro To Pandas Seaborn - S31 Load And Overview Data
tags: Virgil, LearnWorld, IntroPandasSeaborn
---

<a target="_blank" href="https://colab.research.google.com/drive/1HfzWnrxn42wL575iuHwa_OZynSEdwjpi"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>

***Quick tips about Google Colab***

To run a code cell: `cmd/ctrl + Enter`

To comment out a code line: `cmd/ctrl + /`

To move a code block to the right: `cmd/ctrl + ]`

To move a code block to the left: `cmd/ctrl + [`

# **INTRODUCTION TO PANDAS**

![](https://i.pinimg.com/474x/6a/18/df/6a18dff64059bb388ed1046c0f2cc350.jpg)

**Python** is the most popular programming language used in Data Science. Besides being fast to work with, Python offers many libraries dedicated to specific Data Science jobs, from analysing data to running statistical tests.

From today until the end of the course, we will use Pandas as our primary tool to load and analyse data.

In today's session, let's get our hands on some of the very basic concepts, including:

1. Pandas Components: DataFrame, Series, Index
2. Load and Overview Data
3. Selection and Filter
4. Sort

Let's get started!

## 1. Import libraries

```python
# Import Pandas
import pandas as pd
```

## 2. Read and Overview

#### Load .csv

```python
# Load CSV file
# Tip: If your data is on Dropbox, change the end of the link from dl=0 to dl=1
df = pd.read_csv('https://www.dropbox.com/s/zhxqmtf7fr3sabt/demographic_data.csv?dl=1')
df
```

<div>
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>Country Name</th> <th>Country Code</th> <th>Birth rate</th> <th>Internet users</th> <th>Income Group</th> </tr>
</thead>
<tbody>
<tr> <th>0</th> <td>Aruba</td> <td>ABW</td> <td>10.244</td> <td>78.9</td> <td>High income</td> </tr>
<tr> <th>1</th> <td>Afghanistan</td> <td>AFG</td> <td>35.253</td> <td>5.9</td> <td>Low income</td> </tr>
<tr> <th>2</th> <td>Angola</td> <td>AGO</td> <td>45.985</td> <td>19.1</td> <td>Upper middle income</td> </tr>
<tr> <th>3</th> <td>Albania</td> <td>ALB</td> <td>12.877</td> <td>57.2</td> <td>Upper middle income</td> </tr>
<tr> <th>4</th> <td>United Arab Emirates</td> <td>ARE</td> <td>11.044</td> <td>88.0</td> <td>High income</td> </tr>
<tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr>
<tr> <th>190</th> <td>Yemen, Rep.</td> <td>YEM</td> <td>32.947</td> <td>20.0</td> <td>Lower middle income</td> </tr>
<tr> <th>191</th> <td>South Africa</td> <td>ZAF</td> <td>20.850</td> <td>46.5</td> <td>Upper middle income</td> </tr>
<tr> <th>192</th> <td>Congo, Dem.
Rep.</td> <td>COD</td> <td>42.394</td> <td>2.2</td> <td>Low income</td> </tr>
<tr> <th>193</th> <td>Zambia</td> <td>ZMB</td> <td>40.471</td> <td>15.4</td> <td>Lower middle income</td> </tr>
<tr> <th>194</th> <td>Zimbabwe</td> <td>ZWE</td> <td>35.715</td> <td>18.5</td> <td>Low income</td> </tr>
</tbody>
</table>
<p>195 rows × 5 columns</p>
</div>

```python
# Set new index
df = df.set_index('Country Name')
df
```

<div>
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>Country Code</th> <th>Birth rate</th> <th>Internet users</th> <th>Income Group</th> </tr>
<tr> <th>Country Name</th> <th></th> <th></th> <th></th> <th></th> </tr>
</thead>
<tbody>
<tr> <th>Aruba</th> <td>ABW</td> <td>10.244</td> <td>78.9</td> <td>High income</td> </tr>
<tr> <th>Afghanistan</th> <td>AFG</td> <td>35.253</td> <td>5.9</td> <td>Low income</td> </tr>
<tr> <th>Angola</th> <td>AGO</td> <td>45.985</td> <td>19.1</td> <td>Upper middle income</td> </tr>
<tr> <th>Albania</th> <td>ALB</td> <td>12.877</td> <td>57.2</td> <td>Upper middle income</td> </tr>
<tr> <th>United Arab Emirates</th> <td>ARE</td> <td>11.044</td> <td>88.0</td> <td>High income</td> </tr>
<tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr>
<tr> <th>Yemen, Rep.</th> <td>YEM</td> <td>32.947</td> <td>20.0</td> <td>Lower middle income</td> </tr>
<tr> <th>South Africa</th> <td>ZAF</td> <td>20.850</td> <td>46.5</td> <td>Upper middle income</td> </tr>
<tr> <th>Congo, Dem. Rep.</th> <td>COD</td> <td>42.394</td> <td>2.2</td> <td>Low income</td> </tr>
<tr> <th>Zambia</th> <td>ZMB</td> <td>40.471</td> <td>15.4</td> <td>Lower middle income</td> </tr>
<tr> <th>Zimbabwe</th> <td>ZWE</td> <td>35.715</td> <td>18.5</td> <td>Low income</td> </tr>
</tbody>
</table>
<p>195 rows × 4 columns</p>
</div>

#### Overview

```python
# Selection --> []
# Function/method call --> ()
# Show the first 10 rows (head() shows 5 rows by default)
df.head(10)
```

<div>
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>Country Code</th> <th>Birth rate</th> <th>Internet users</th> <th>Income Group</th> </tr>
<tr> <th>Country Name</th> <th></th> <th></th> <th></th> <th></th> </tr>
</thead>
<tbody>
<tr> <th>Aruba</th> <td>ABW</td> <td>10.244</td> <td>78.9000</td> <td>High income</td> </tr>
<tr> <th>Afghanistan</th> <td>AFG</td> <td>35.253</td> <td>5.9000</td> <td>Low income</td> </tr>
<tr> <th>Angola</th> <td>AGO</td> <td>45.985</td> <td>19.1000</td> <td>Upper middle income</td> </tr>
<tr> <th>Albania</th> <td>ALB</td> <td>12.877</td> <td>57.2000</td> <td>Upper middle income</td> </tr>
<tr> <th>United Arab Emirates</th> <td>ARE</td> <td>11.044</td> <td>88.0000</td> <td>High income</td> </tr>
<tr> <th>Argentina</th> <td>ARG</td> <td>17.716</td> <td>59.9000</td> <td>High income</td> </tr>
<tr> <th>Armenia</th> <td>ARM</td> <td>13.308</td> <td>41.9000</td> <td>Lower middle income</td> </tr>
<tr> <th>Antigua and Barbuda</th> <td>ATG</td> <td>16.447</td> <td>63.4000</td> <td>High income</td> </tr>
<tr> <th>Australia</th> <td>AUS</td> <td>13.200</td> <td>83.0000</td> <td>High income</td> </tr>
<tr> <th>Austria</th> <td>AUT</td> <td>9.400</td> <td>80.6188</td> <td>High income</td> </tr>
</tbody>
</table>
</div>

```python
# Show the last 5 rows
df.tail()
```

<div>
<style scoped> .dataframe tbody tr
th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>Country Code</th> <th>Birth rate</th> <th>Internet users</th> <th>Income Group</th> </tr>
<tr> <th>Country Name</th> <th></th> <th></th> <th></th> <th></th> </tr>
</thead>
<tbody>
<tr> <th>Yemen, Rep.</th> <td>YEM</td> <td>32.947</td> <td>20.0</td> <td>Lower middle income</td> </tr>
<tr> <th>South Africa</th> <td>ZAF</td> <td>20.850</td> <td>46.5</td> <td>Upper middle income</td> </tr>
<tr> <th>Congo, Dem. Rep.</th> <td>COD</td> <td>42.394</td> <td>2.2</td> <td>Low income</td> </tr>
<tr> <th>Zambia</th> <td>ZMB</td> <td>40.471</td> <td>15.4</td> <td>Lower middle income</td> </tr>
<tr> <th>Zimbabwe</th> <td>ZWE</td> <td>35.715</td> <td>18.5</td> <td>Low income</td> </tr>
</tbody>
</table>
</div>

```python
# Show 5 random rows
df.sample(5)
```

<div>
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>Country Code</th> <th>Birth rate</th> <th>Internet users</th> <th>Income Group</th> </tr>
<tr> <th>Country Name</th> <th></th> <th></th> <th></th> <th></th> </tr>
</thead>
<tbody>
<tr> <th>Burundi</th> <td>BDI</td> <td>44.151</td> <td>1.30</td> <td>Low income</td> </tr>
<tr> <th>Indonesia</th> <td>IDN</td> <td>20.297</td> <td>14.94</td> <td>Lower middle income</td> </tr>
<tr> <th>Algeria</th> <td>DZA</td> <td>24.738</td> <td>16.50</td> <td>Upper middle income</td> </tr>
<tr> <th>Bangladesh</th> <td>BGD</td> <td>20.142</td> <td>6.63</td> <td>Lower middle income</td> </tr>
<tr> <th>Cuba</th> <td>CUB</td> <td>10.400</td> <td>27.93</td> <td>Upper middle income</td> </tr>
</tbody>
</table>
</div>

```python
# Show the shape of the DataFrame (no parentheses: shape is an attribute, not a method)
df.shape
```

    (195, 4)

```python
# Show info
df.info()
```

    <class 'pandas.core.frame.DataFrame'>
    Index: 195 entries, Aruba to Zimbabwe
    Data columns (total 4 columns):
     #   Column          Non-Null Count  Dtype
    ---  ------          --------------  -----
     0   Country Code    195 non-null    object
     1   Birth rate      195 non-null    float64
     2   Internet users  195 non-null    float64
     3   Income Group    195 non-null    object
    dtypes: float64(2), object(2)
    memory usage: 7.6+ KB

```python
# Overview of numerical columns
df.describe()
```

<div>
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>Birth rate</th> <th>Internet users</th> </tr>
</thead>
<tbody>
<tr> <th>count</th> <td>195.000000</td> <td>195.000000</td> </tr>
<tr> <th>mean</th> <td>21.469928</td> <td>42.076471</td> </tr>
<tr> <th>std</th> <td>10.605467</td> <td>29.030788</td> </tr>
<tr> <th>min</th> <td>7.900000</td> <td>0.900000</td> </tr>
<tr> <th>25%</th> <td>12.120500</td> <td>14.520000</td> </tr>
<tr> <th>50%</th> <td>19.680000</td> <td>41.000000</td> </tr>
<tr> <th>75%</th> <td>29.759500</td> <td>66.225000</td> </tr>
<tr> <th>max</th> <td>49.661000</td> <td>96.546800</td> </tr>
</tbody>
</table>
</div>

```python
# Choose a column
df['Income Group']
```

    Country Name
    Aruba                           High income
    Afghanistan                      Low income
    Angola                  Upper middle income
    Albania                 Upper middle income
    United Arab Emirates            High income
                                   ...
    Yemen, Rep.             Lower middle income
    South Africa            Upper middle income
    Congo, Dem. Rep.                 Low income
    Zambia                  Lower middle income
    Zimbabwe                         Low income
    Name: Income Group, Length: 195, dtype: object

```python
# Overview of categorical columns
df['Income Group'].describe()
```

    count             195
    unique              4
    top       High income
    freq               67
    Name: Income Group, dtype: object

```python
# value_counts, nunique, unique
df['Income Group'].value_counts()
```

    High income            67
    Lower middle income    50
    Upper middle income    48
    Low income             30
    Name: Income Group, dtype: int64

```python
df['Income Group'].unique()
```

    array(['High income', 'Low income', 'Upper middle income',
           'Lower middle income'], dtype=object)

```python
df['Income Group'].nunique()
```

    4
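
If the Dropbox link is unavailable, the same overview calls can be practised on a tiny DataFrame built by hand. The sketch below is only an illustration: `small_df` copies the first five rows shown earlier, and `summarize_column` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

# Hand-built copy of the first five rows of the dataset shown earlier
# (assumption: used only so the example runs without downloading the CSV).
data = {
    'Country Name': ['Aruba', 'Afghanistan', 'Angola', 'Albania',
                     'United Arab Emirates'],
    'Country Code': ['ABW', 'AFG', 'AGO', 'ALB', 'ARE'],
    'Birth rate': [10.244, 35.253, 45.985, 12.877, 11.044],
    'Internet users': [78.9, 5.9, 19.1, 57.2, 88.0],
    'Income Group': ['High income', 'Low income', 'Upper middle income',
                     'Upper middle income', 'High income'],
}
small_df = pd.DataFrame(data).set_index('Country Name')


def summarize_column(df, column):
    """Hypothetical helper: gather the categorical overview of one column."""
    series = df[column]                      # selection uses []
    return {
        'n_rows': df.shape[0],               # shape is an attribute, so no ()
        'n_unique': series.nunique(),        # methods are called with ()
        'values': list(series.unique()),     # distinct values, in order of appearance
        'counts': series.value_counts().to_dict(),
    }


summary = summarize_column(small_df, 'Income Group')
print(small_df.shape)
print(summary)
```

Because the sample has only five rows, the counts here will not match the full dataset; the point is to see which calls need `()` and which do not.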