---
title: Virgil - Intro To Pandas Seaborn - S31 Load And Overview Data
tags: Virgil, LearnWorld, IntroPandasSeaborn
---
<a target="_blank" href="https://colab.research.google.com/drive/1HfzWnrxn42wL575iuHwa_OZynSEdwjpi"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
***Quick tips about Google Colab***
To run a code cell: `cmd/ctrl + Enter`
To comment out a code line: `cmd/ctrl + /`
To move a code block to the right: `cmd/ctrl + ]`
To move a code block to the left: `cmd/ctrl + [`
# **INTRODUCTION TO PANDAS**

**Python** is one of the most popular programming languages in Data Science. Besides being quick to write and read, it offers a large ecosystem of libraries dedicated to specific Data Science tasks, from data analysis to statistical testing.
From today until the end of the course, we will use Pandas as our primary tool to load and analyse data. In today's session, let's get hands-on with some of the basic concepts, including:
1. Pandas Components: DataFrame, Series, Index
2. Load and Overview Data
3. Selection and Filter
4. Sort
Let's get started!
## 1. Import libraries
```python
# Import Pandas
import pandas as pd
```
## 2. Read and Overview
#### Load .csv
```python
# Load a CSV file from a URL
# Tip: if your data is on Dropbox, change the end of the link from dl=0 to dl=1
df = pd.read_csv('https://www.dropbox.com/s/zhxqmtf7fr3sabt/demographic_data.csv?dl=1')
df
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Country Name</th>
<th>Country Code</th>
<th>Birth rate</th>
<th>Internet users</th>
<th>Income Group</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Aruba</td>
<td>ABW</td>
<td>10.244</td>
<td>78.9</td>
<td>High income</td>
</tr>
<tr>
<th>1</th>
<td>Afghanistan</td>
<td>AFG</td>
<td>35.253</td>
<td>5.9</td>
<td>Low income</td>
</tr>
<tr>
<th>2</th>
<td>Angola</td>
<td>AGO</td>
<td>45.985</td>
<td>19.1</td>
<td>Upper middle income</td>
</tr>
<tr>
<th>3</th>
<td>Albania</td>
<td>ALB</td>
<td>12.877</td>
<td>57.2</td>
<td>Upper middle income</td>
</tr>
<tr>
<th>4</th>
<td>United Arab Emirates</td>
<td>ARE</td>
<td>11.044</td>
<td>88.0</td>
<td>High income</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>190</th>
<td>Yemen, Rep.</td>
<td>YEM</td>
<td>32.947</td>
<td>20.0</td>
<td>Lower middle income</td>
</tr>
<tr>
<th>191</th>
<td>South Africa</td>
<td>ZAF</td>
<td>20.850</td>
<td>46.5</td>
<td>Upper middle income</td>
</tr>
<tr>
<th>192</th>
<td>Congo, Dem. Rep.</td>
<td>COD</td>
<td>42.394</td>
<td>2.2</td>
<td>Low income</td>
</tr>
<tr>
<th>193</th>
<td>Zambia</td>
<td>ZMB</td>
<td>40.471</td>
<td>15.4</td>
<td>Lower middle income</td>
</tr>
<tr>
<th>194</th>
<td>Zimbabwe</td>
<td>ZWE</td>
<td>35.715</td>
<td>18.5</td>
<td>Low income</td>
</tr>
</tbody>
</table>
<p>195 rows × 5 columns</p>
</div>
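To experiment with `pd.read_csv` without a network connection, the same call can be pointed at CSV text held in memory. This is only a sketch: `csv_text` and `sample_df` are hypothetical names, and the three rows are copied from the table above.

```python
import io

import pandas as pd

# Three rows copied from the output above, written as CSV text
csv_text = """Country Name,Country Code,Birth rate,Internet users,Income Group
Aruba,ABW,10.244,78.9,High income
Afghanistan,AFG,35.253,5.9,Low income
Angola,AGO,45.985,19.1,Upper middle income
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```

Because no index column is specified, pandas assigns the default 0, 1, 2, ... index, just as in the full dataset.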
```python
# Use the 'Country Name' column as the row index
df = df.set_index('Country Name')
df
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Country Code</th>
<th>Birth rate</th>
<th>Internet users</th>
<th>Income Group</th>
</tr>
<tr>
<th>Country Name</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Aruba</th>
<td>ABW</td>
<td>10.244</td>
<td>78.9</td>
<td>High income</td>
</tr>
<tr>
<th>Afghanistan</th>
<td>AFG</td>
<td>35.253</td>
<td>5.9</td>
<td>Low income</td>
</tr>
<tr>
<th>Angola</th>
<td>AGO</td>
<td>45.985</td>
<td>19.1</td>
<td>Upper middle income</td>
</tr>
<tr>
<th>Albania</th>
<td>ALB</td>
<td>12.877</td>
<td>57.2</td>
<td>Upper middle income</td>
</tr>
<tr>
<th>United Arab Emirates</th>
<td>ARE</td>
<td>11.044</td>
<td>88.0</td>
<td>High income</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>Yemen, Rep.</th>
<td>YEM</td>
<td>32.947</td>
<td>20.0</td>
<td>Lower middle income</td>
</tr>
<tr>
<th>South Africa</th>
<td>ZAF</td>
<td>20.850</td>
<td>46.5</td>
<td>Upper middle income</td>
</tr>
<tr>
<th>Congo, Dem. Rep.</th>
<td>COD</td>
<td>42.394</td>
<td>2.2</td>
<td>Low income</td>
</tr>
<tr>
<th>Zambia</th>
<td>ZMB</td>
<td>40.471</td>
<td>15.4</td>
<td>Lower middle income</td>
</tr>
<tr>
<th>Zimbabwe</th>
<td>ZWE</td>
<td>35.715</td>
<td>18.5</td>
<td>Low income</td>
</tr>
</tbody>
</table>
<p>195 rows × 4 columns</p>
</div>
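A few things about `set_index` are worth seeing in isolation: it returns a new DataFrame rather than changing the original, the same result can be requested at load time with the `index_col` parameter of `read_csv`, and `reset_index` reverses it. The sketch below uses a hypothetical in-memory sample (`csv_text`) built from rows shown above.

```python
import io

import pandas as pd

# Three rows copied from the output above (hypothetical in-memory sample)
csv_text = """Country Name,Country Code,Birth rate,Internet users,Income Group
Aruba,ABW,10.244,78.9,High income
Afghanistan,AFG,35.253,5.9,Low income
Angola,AGO,45.985,19.1,Upper middle income
"""

by_default = pd.read_csv(io.StringIO(csv_text))

# set_index returns a new DataFrame; by_default keeps its 0, 1, 2 index
by_country = by_default.set_index('Country Name')

# index_col sets the index while loading
by_country_at_load = pd.read_csv(io.StringIO(csv_text), index_col='Country Name')

# With a labelled index, a value can be looked up by country name
print(by_country.loc['Angola', 'Birth rate'])

# reset_index moves the index back into an ordinary column
restored = by_country.reset_index()
print(restored.columns.tolist())
```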
#### Overview
```python
# Selection uses square brackets: []
# Function/method calls use parentheses: ()
# Show the first 10 rows (head() shows 5 by default)
df.head(10)
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Country Code</th>
<th>Birth rate</th>
<th>Internet users</th>
<th>Income Group</th>
</tr>
<tr>
<th>Country Name</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Aruba</th>
<td>ABW</td>
<td>10.244</td>
<td>78.9000</td>
<td>High income</td>
</tr>
<tr>
<th>Afghanistan</th>
<td>AFG</td>
<td>35.253</td>
<td>5.9000</td>
<td>Low income</td>
</tr>
<tr>
<th>Angola</th>
<td>AGO</td>
<td>45.985</td>
<td>19.1000</td>
<td>Upper middle income</td>
</tr>
<tr>
<th>Albania</th>
<td>ALB</td>
<td>12.877</td>
<td>57.2000</td>
<td>Upper middle income</td>
</tr>
<tr>
<th>United Arab Emirates</th>
<td>ARE</td>
<td>11.044</td>
<td>88.0000</td>
<td>High income</td>
</tr>
<tr>
<th>Argentina</th>
<td>ARG</td>
<td>17.716</td>
<td>59.9000</td>
<td>High income</td>
</tr>
<tr>
<th>Armenia</th>
<td>ARM</td>
<td>13.308</td>
<td>41.9000</td>
<td>Lower middle income</td>
</tr>
<tr>
<th>Antigua and Barbuda</th>
<td>ATG</td>
<td>16.447</td>
<td>63.4000</td>
<td>High income</td>
</tr>
<tr>
<th>Australia</th>
<td>AUS</td>
<td>13.200</td>
<td>83.0000</td>
<td>High income</td>
</tr>
<tr>
<th>Austria</th>
<td>AUT</td>
<td>9.400</td>
<td>80.6188</td>
<td>High income</td>
</tr>
</tbody>
</table>
</div>
```python
# Show the last 5 rows
df.tail()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Country Code</th>
<th>Birth rate</th>
<th>Internet users</th>
<th>Income Group</th>
</tr>
<tr>
<th>Country Name</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Yemen, Rep.</th>
<td>YEM</td>
<td>32.947</td>
<td>20.0</td>
<td>Lower middle income</td>
</tr>
<tr>
<th>South Africa</th>
<td>ZAF</td>
<td>20.850</td>
<td>46.5</td>
<td>Upper middle income</td>
</tr>
<tr>
<th>Congo, Dem. Rep.</th>
<td>COD</td>
<td>42.394</td>
<td>2.2</td>
<td>Low income</td>
</tr>
<tr>
<th>Zambia</th>
<td>ZMB</td>
<td>40.471</td>
<td>15.4</td>
<td>Lower middle income</td>
</tr>
<tr>
<th>Zimbabwe</th>
<td>ZWE</td>
<td>35.715</td>
<td>18.5</td>
<td>Low income</td>
</tr>
</tbody>
</table>
</div>
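Both `head` and `tail` take an optional row count `n`, which defaults to 5. A minimal sketch on a hypothetical 8-row frame (`toy` is not part of the course data):

```python
import pandas as pd

# A hypothetical 8-row frame, just large enough to see the defaults
toy = pd.DataFrame({'value': range(8)})

print(len(toy.head()))    # default n=5
print(len(toy.head(3)))   # first 3 rows
print(len(toy.tail(2)))   # last 2 rows
print(len(toy.head(-2)))  # negative n: all rows except the last 2
```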
```python
# Show 5 random rows
df.sample(5)
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Country Code</th>
<th>Birth rate</th>
<th>Internet users</th>
<th>Income Group</th>
</tr>
<tr>
<th>Country Name</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Burundi</th>
<td>BDI</td>
<td>44.151</td>
<td>1.30</td>
<td>Low income</td>
</tr>
<tr>
<th>Indonesia</th>
<td>IDN</td>
<td>20.297</td>
<td>14.94</td>
<td>Lower middle income</td>
</tr>
<tr>
<th>Algeria</th>
<td>DZA</td>
<td>24.738</td>
<td>16.50</td>
<td>Upper middle income</td>
</tr>
<tr>
<th>Bangladesh</th>
<td>BGD</td>
<td>20.142</td>
<td>6.63</td>
<td>Lower middle income</td>
</tr>
<tr>
<th>Cuba</th>
<td>CUB</td>
<td>10.400</td>
<td>27.93</td>
<td>Upper middle income</td>
</tr>
</tbody>
</table>
</div>
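Since `sample` draws rows at random, the output above will differ on every run. If a repeatable draw is needed, passing `random_state` fixes the seed. A minimal sketch on a hypothetical frame (`toy` and the seed value 42 are arbitrary choices):

```python
import pandas as pd

# A hypothetical 20-row frame
toy = pd.DataFrame({'value': range(20)})

# The same random_state gives the same rows each time
first_draw = toy.sample(5, random_state=42)
second_draw = toy.sample(5, random_state=42)
print(first_draw.equals(second_draw))

# frac samples a fraction of the rows instead of a fixed count
quarter = toy.sample(frac=0.25, random_state=42)
print(len(quarter))
```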
```python
# Show the shape of the DataFrame as (rows, columns); shape is an attribute, so no parentheses
df.shape
```
(195, 4)
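`shape` is a plain Python tuple, so it can be unpacked into a row count and a column count. A small sketch on a hypothetical frame (`toy` is for illustration only):

```python
import pandas as pd

# A hypothetical frame with 3 rows and 2 columns
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Unpack the (rows, columns) tuple
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len() gives the number of rows; size gives rows * columns
print(len(toy))
print(toy.size)
```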
```python
# Show info
df.info()
```
    <class 'pandas.core.frame.DataFrame'>
    Index: 195 entries, Aruba to Zimbabwe
    Data columns (total 4 columns):
     #   Column          Non-Null Count  Dtype  
    ---  ------          --------------  -----  
     0   Country Code    195 non-null    object 
     1   Birth rate      195 non-null    float64
     2   Internet users  195 non-null    float64
     3   Income Group    195 non-null    object 
    dtypes: float64(2), object(2)
    memory usage: 7.6+ KB
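`info()` prints its report rather than returning values. When the same facts are needed in code, `dtypes` and `isna().sum()` give the column types and the missing-value counts as objects you can work with. A sketch on a hypothetical three-row sample copied from the data above:

```python
import pandas as pd

# Three rows copied from the dataset shown earlier (hypothetical sample)
sample_df = pd.DataFrame({
    'Country Code': ['ABW', 'AFG', 'AGO'],
    'Birth rate': [10.244, 35.253, 45.985],
    'Internet users': [78.9, 5.9, 19.1],
})

# Data type of each column, as a Series
print(sample_df.dtypes)

# Number of missing values in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```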
```python
# Overview of numerical columns
df.describe()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Birth rate</th>
<th>Internet users</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>195.000000</td>
<td>195.000000</td>
</tr>
<tr>
<th>mean</th>
<td>21.469928</td>
<td>42.076471</td>
</tr>
<tr>
<th>std</th>
<td>10.605467</td>
<td>29.030788</td>
</tr>
<tr>
<th>min</th>
<td>7.900000</td>
<td>0.900000</td>
</tr>
<tr>
<th>25%</th>
<td>12.120500</td>
<td>14.520000</td>
</tr>
<tr>
<th>50%</th>
<td>19.680000</td>
<td>41.000000</td>
</tr>
<tr>
<th>75%</th>
<td>29.759500</td>
<td>66.225000</td>
</tr>
<tr>
<th>max</th>
<td>49.661000</td>
<td>96.546800</td>
</tr>
</tbody>
</table>
</div>
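Each row of the `describe()` table corresponds to a method that can be called on its own. The sketch below checks this on the five birth-rate values from the first rows of the dataset (`birth_rate` and `summary` are hypothetical names):

```python
import pandas as pd

# Birth rates of the first five countries shown earlier
birth_rate = pd.Series([10.244, 35.253, 45.985, 12.877, 11.044], name='Birth rate')

# describe() returns a Series indexed by statistic name
summary = birth_rate.describe()
print(summary)

# The same statistics, computed one at a time
print(birth_rate.mean())
print(birth_rate.median())        # same as the '50%' row
print(birth_rate.quantile(0.75))  # same as the '75%' row
```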
```python
# Choose a column
df['Income Group']
```
    Country Name
    Aruba                           High income
    Afghanistan                      Low income
    Angola                  Upper middle income
    Albania                 Upper middle income
    United Arab Emirates            High income
                                   ...
    Yemen, Rep.             Lower middle income
    South Africa            Upper middle income
    Congo, Dem. Rep.                 Low income
    Zambia                  Lower middle income
    Zimbabwe                         Low income
    Name: Income Group, Length: 195, dtype: object
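Selecting with a single column name returns a Series, as in the output above. Passing a list of names (double square brackets) returns a DataFrame instead, even for one column. A sketch on a hypothetical three-row sample copied from the data:

```python
import pandas as pd

# Three rows copied from the dataset (hypothetical sample)
sample_df = pd.DataFrame(
    {
        'Country Code': ['ABW', 'AFG', 'AGO'],
        'Income Group': ['High income', 'Low income', 'Upper middle income'],
    },
    index=pd.Index(['Aruba', 'Afghanistan', 'Angola'], name='Country Name'),
)

one_column = sample_df['Income Group']      # Series
as_frame = sample_df[['Income Group']]      # DataFrame with one column
two_columns = sample_df[['Country Code', 'Income Group']]

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(two_columns.shape)
```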
```python
# Overview of categorical columns
df['Income Group'].describe()
```
    count             195
    unique              4
    top       High income
    freq               67
    Name: Income Group, dtype: object
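For a text column, `describe()` reports `count`, `unique`, `top` (the most frequent value) and `freq` (how often it occurs). The sketch below shows how `top` and `freq` relate to `value_counts()`, using the income groups of the first six countries in the dataset (`income` is a hypothetical name):

```python
import pandas as pd

# Income groups of the first six countries shown earlier
income = pd.Series(
    ['High income', 'Low income', 'Upper middle income',
     'Upper middle income', 'High income', 'High income'],
    name='Income Group',
)

summary = income.describe()
counts = income.value_counts()

# top is the most frequent category, freq is its count
print(summary['top'], summary['freq'])
print(counts.idxmax(), counts.max())
```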
```python
# Count the rows in each category (see also unique and nunique below)
df['Income Group'].value_counts()
```
    High income            67
    Lower middle income    50
    Upper middle income    48
    Low income             30
    Name: Income Group, dtype: int64
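When proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A sketch using the income groups of the first six countries in the dataset (`income` and `shares` are hypothetical names):

```python
import pandas as pd

# Income groups of the first six countries shown earlier
income = pd.Series(
    ['High income', 'Low income', 'Upper middle income',
     'Upper middle income', 'High income', 'High income'],
    name='Income Group',
)

# normalize=True returns each category's share of the rows
shares = income.value_counts(normalize=True)
print(shares)
```

The shares always add up to 1, so multiplying by 100 gives percentages.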
```python
# List the distinct categories
df['Income Group'].unique()
```
    array(['High income', 'Low income', 'Upper middle income',
           'Lower middle income'], dtype=object)
```python
# Count the distinct categories
df['Income Group'].nunique()
```
4
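`unique()` lists categories in order of first appearance rather than alphabetically, and `nunique()` matches its length when the column has no missing values (by default `nunique()` skips them). A sketch on the income groups of the first six countries (`income` and `categories` are hypothetical names):

```python
import pandas as pd

# Income groups of the first six countries shown earlier
income = pd.Series(
    ['High income', 'Low income', 'Upper middle income',
     'Upper middle income', 'High income', 'High income'],
    name='Income Group',
)

# Distinct values, in the order they first appear
categories = income.unique()
print(list(categories))

# Number of distinct values
print(income.nunique())
```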