---
title: Pandas
tags: Python Cheatbook
---
# Pandas
Pandas is a library usually used for processing tabular data from various data sources such as `.csv` or SQL database.
# Dataframe
A Pandas `DataFrame` is bascially a table. It contains rows and columns (and also indexes). Here is a example below.
CSV file:
| Name | Age |
| -------- | --- |
| Mr.Pizza | 23 |
| Ms.Cola | 26 |
Imported as pandas `DataFrame`, where first column is assigned as non-labeled index:
| | Name | Age |
|:----- | -------- | --- |
| 0 | Mr.Pizza | 23 |
| 1 | Ms.Cola | 26 |
Dataframe are made with `Series` object, in this case if we do `df['Age']` we will get a `Series` object with all the values in `Age` column.
Initialize empty dataframe for no reason.
```python=
import pandas as pd
df = pd.DataFrame()
```
==**Checking your dataframe**==
```python=
print(df.info()) # check datatypes
print(df.head()) # check top 5
print(df.tail()) # check bottom 5
```
:::danger
Always check datatypes of your dataframe, otherwise you won't be able to do proper plotting or advance functions. :warning:
:::
# Creation
## Read seperated value files (CSV, TSV, etc.)
```python=
import pandas as pd
import os
df = pd.read_csv(
os.path.join(path, file),
sep='\t',
header=None,
comment='#',
names=['A', 'B', 'C', 'D', 'E'])
# you can also do this: names=list('abcdef')
```
- `pd.read_csv(..)` reads a csv file with following arguments
- arg1 : filepath.
- `sep:str` character that seperates the values usually `,` or `\t`.
- `header:int` header row.
- `comment:str` lines that are to be ignored if they start with this character.
- `names:[str]` list of column names.
:::info
There is alot more to `read_csv`, please check the manual page: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
:::
# Dataframe Manipulation
==**Append a new dataframe to existing dataframe**==
```python
df = df.append(new_df, ignore_index=True)
```
- `ignore_index` will not preserve the indexes.
**Set multi-index or change indexes**
```python
# Change index
df.set_index('col1', inplace=True)
# Multi-index
df.set_index(['col1', 'col2'], inplace=True)
```
- `inplace` overwrites the exsiting object, which means it doesn't require the object to be reassigned. Set it false if you want to return a new object.
**Get unique index values**
You can use this to get a unique list of index values on a given index. For example, if I want to know all the distinct `col2` values, I can do the following:
```python
df.index.unique(level='col2')
```
==**Removing columns from your dataframe**==
```python
df = df.drop(columns=['z'])
```
- Note how I reassigned the `df` because I am not using `inpalce=True`.
**Changing data types**
```python
df = df.astype({'x': 'float', 'y': 'float'})
```
> Text Type: str
> Numeric Types: int, float, complex
> Sequence Types: list, tuple, range
> Mapping Type: dict
> Set Types: set, frozenset
> Boolean Type: bool
> Binary Types: bytes, bytearray, memoryview
https://www.w3schools.com/python/python_datatypes.asp
==**Modifying all values in a specific column**==
For example if we want to make every one 10 years older. It will update `Age` column inside your dataframe (it preserves the dataframe, so don't worry about losing it).
```python
df['Age'] = df['Age'].apply(lambda x: x+10)
```
# Selecting data
## by Index
There are two common options
- `loc` select by labeled index.
- `iloc` select by integer index.
```python
df.iloc[0] # integer index
df.loc['col1'] # label index
```
Note there are two ways to return data
- `iloc[0]` returns a series
- `iloc[[0]]` returns a dataframe
This page does a good job in explaining all the ways to select data using loc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
### Using Cross-section
In some cases where the data has multi-index it gets harder to extract using `loc`. Thats where cross-section comes in. For example if I want to extract data based on 2nd index, I can run the following function.
```python
ds.xs('col2', level=1)
```
- `level` describes the index level, starts from 0.
More here: https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.xs.html?highlight=cross%20section
# Plotting
## Scatter
```python
import matplotlib.pyplot as plt
df = ...
...
df.plot.scatter(x='col1', y='col2', c='DarkBlue')
plt.imshow(plt.imread('optional_img.png'), zorder=-1) # optional img for background
plt.show()
```

plot with multiple colours for each item in the dataframe.
```python
df.plot.scatter(x='x', y='y', c=[i for i, v in enumerate(df.index.to_list())], colormap='viridis')
```

To save the plots to a file call `plt.savefig` instead of `plt.show`.
```python
plt.savefig(os.path.join('my_folder','figure_name'), format='png')
```
# Resources
1. Pandas API reference: https://pandas.pydata.org/pandas-docs/stable/reference/index.html