Pandas - HackMD

--- title: Pandas tags: Python Cheatbook --- # Pandas Pandas is a library usually used for processing tabular data from various data sources such as `.csv` or SQL database. # Dataframe A Pandas `DataFrame` is bascially a table. It contains rows and columns (and also indexes). Here is a example below. CSV file: | Name | Age | | -------- | --- | | Mr.Pizza | 23 | | Ms.Cola | 26 | Imported as pandas `DataFrame`, where first column is assigned as non-labeled index: | | Name | Age | |:----- | -------- | --- | | 0 | Mr.Pizza | 23 | | 1 | Ms.Cola | 26 | Dataframe are made with `Series` object, in this case if we do `df['Age']` we will get a `Series` object with all the values in `Age` column. Initialize empty dataframe for no reason. ```python= import pandas as pd df = pd.DataFrame() ``` ==**Checking your dataframe**== ```python= print(df.info()) # check datatypes print(df.head()) # check top 5 print(df.tail()) # check bottom 5 ``` :::danger Always check datatypes of your dataframe, otherwise you won't be able to do proper plotting or advance functions. :warning: ::: # Creation ## Read seperated value files (CSV, TSV, etc.) ```python= import pandas as pd import os df = pd.read_csv( os.path.join(path, file), sep='\t', header=None, comment='#', names=['A', 'B', 'C', 'D', 'E']) # you can also do this: names=list('abcdef') ``` - `pd.read_csv(..)` reads a csv file with following arguments - arg1 : filepath. - `sep:str` character that seperates the values usually `,` or `\t`. - `header:int` header row. - `comment:str` lines that are to be ignored if they start with this character. - `names:[str]` list of column names. :::info There is alot more to `read_csv`, please check the manual page: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html ::: # Dataframe Manipulation ==**Append a new dataframe to existing dataframe**== ```python df = df.append(new_df, ignore_index=True) ``` - `ignore_index` will not preserve the indexes. **Set multi-index or change indexes** ```python # Change index df.set_index('col1', inplace=True) # Multi-index df.set_index(['col1', 'col2'], inplace=True) ``` - `inplace` overwrites the exsiting object, which means it doesn't require the object to be reassigned. Set it false if you want to return a new object. **Get unique index values** You can use this to get a unique list of index values on a given index. For example, if I want to know all the distinct `col2` values, I can do the following: ```python df.index.unique(level='col2') ``` ==**Removing columns from your dataframe**== ```python df = df.drop(columns=['z']) ``` - Note how I reassigned the `df` because I am not using `inpalce=True`. **Changing data types** ```python df = df.astype({'x': 'float', 'y': 'float'}) ``` > Text Type: str > Numeric Types: int, float, complex > Sequence Types: list, tuple, range > Mapping Type: dict > Set Types: set, frozenset > Boolean Type: bool > Binary Types: bytes, bytearray, memoryview https://www.w3schools.com/python/python_datatypes.asp ==**Modifying all values in a specific column**== For example if we want to make every one 10 years older. It will update `Age` column inside your dataframe (it preserves the dataframe, so don't worry about losing it). ```python df['Age'] = df['Age'].apply(lambda x: x+10) ``` # Selecting data ## by Index There are two common options - `loc` select by labeled index. - `iloc` select by integer index. ```python df.iloc[0] # integer index df.loc['col1'] # label index ``` Note there are two ways to return data - `iloc[0]` returns a series - `iloc[[0]]` returns a dataframe This page does a good job in explaining all the ways to select data using loc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html ### Using Cross-section In some cases where the data has multi-index it gets harder to extract using `loc`. Thats where cross-section comes in. For example if I want to extract data based on 2nd index, I can run the following function. ```python ds.xs('col2', level=1) ``` - `level` describes the index level, starts from 0. More here: https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.xs.html?highlight=cross%20section # Plotting ## Scatter ```python import matplotlib.pyplot as plt df = ... ... df.plot.scatter(x='col1', y='col2', c='DarkBlue') plt.imshow(plt.imread('optional_img.png'), zorder=-1) # optional img for background plt.show() ``` ![](https://i.imgur.com/mBBIqMG.png) plot with multiple colours for each item in the dataframe. ```python df.plot.scatter(x='x', y='y', c=[i for i, v in enumerate(df.index.to_list())], colormap='viridis') ``` ![](https://i.imgur.com/dfmYwId.png) To save the plots to a file call `plt.savefig` instead of `plt.show`. ```python plt.savefig(os.path.join('my_folder','figure_name'), format='png') ``` # Resources 1. Pandas API reference: https://pandas.pydata.org/pandas-docs/stable/reference/index.html