# Python data science libraries (41-54)
41. Pandas indexing and slicing - demo
42. Operations on single data frames/series
43. Operations on single data frames/series - demo
44. Operations between data frames/series
45. Operations between data frames/series - demo
46. Other useful pandas functionalities
47. Pandas data types
48. Pandas data types - demo
49. Pandas group by statement
50. Pandas group by statement - demo
51. ---- Part 3 - Data visualisations ----
52. Matplotlib basics
53. Seaborn basics
54. Chapter summary
# 41. Pandas indexing and slicing - demo
## Slicing and modifying pandas data frames/series
Lecture agenda:
- Slicing pandas dataframes
- Modifying pandas dataframes
- Slicing pandas data series
Nothing special
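A minimal sketch of the agenda items above; the data frame and column names (`A`, `B`) are made up for illustration:

```python
import pandas as pd

# Hypothetical data frame for illustration
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Slicing: single column (series), multiple columns, rows by label/position
col = df['A']              # pandas Series
subset = df[['A', 'B']]    # data frame with the selected columns
rows = df.loc[0:1]         # rows by index label (end-inclusive)
first_row = df.iloc[0]     # row by position

# Modifying: derive a new column, overwrite a single value
df['C'] = df['A'] + df['B']
df.loc[0, 'A'] = 99
```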
---
# 42. Operations on single data frames/series
► Summarization \
► Summarization with numpy \
► Operations on data frame elements \
► Operations with single variables \
► Comparison operations \
► Pandas series
## Summarization
> Operation: \
applied column-wise by default
> Axis \
axis=0 works down the rows (default); axis=1 works across the columns
> Summarization types\
Function
>> For numerical data:
.min() - Minimum.\
.median() - Median.\
.sum() - Sum.\
.describe() - Provides several useful statistics.
>> For categories:
.nunique()\
.mode()
>> For single columns :
.unique()\
.value_counts()
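A short sketch of the summarization calls listed above, on a made-up data frame:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'x': [1, 2, 3, 4], 'group': ['a', 'a', 'b', 'b']})

# Numerical summarizations
print(df['x'].min())      # smallest value
print(df['x'].median())   # middle value
print(df['x'].sum())      # total
print(df.describe())      # several statistics for numerical columns

# Categorical summarizations
print(df['group'].nunique())       # number of distinct values
print(df['group'].mode())          # most frequent value(s)
print(df['group'].unique())        # array of distinct values
print(df['group'].value_counts())  # frequency of each value
```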
## Numpy and Pandas
> ► Numpy functions can be used with pandas.\
► Numpy operations must be compatible with pandas column data types.
> Transforming pandas to numpy
>> df[["col1", "col2"]].to_numpy()
## TRANSFORMING DATAFRAME ELEMENTS
>► In data frames, these operations are usually performed on separate columns.\
► Operations must be compatible with column data type.
NUMPY CAN BE USED!
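A small sketch of using numpy on a column and converting to a plain array; the column names are invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'price': [10.0, 100.0, 1000.0]})

# Numpy ufuncs apply element-wise to a numeric column
df['log_price'] = np.log10(df['price'])

# Selected columns can be converted to a plain numpy array
arr = df[['price', 'log_price']].to_numpy()
```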
## Operations on strings
> .str
>> works on a single column (series)\
apply to multiple columns by using a loop
>>> e.g. Dataframe[""].str.upper()
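A runnable sketch of `.str` on one column and the loop pattern for several columns (data frame and column names invented):

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'name': [' alice ', ' bob '], 'city': [' ny ', ' la ']})

# .str works on a single column (series)
df['name'] = df['name'].str.upper()

# Multiple string columns need a loop
for col in ['name', 'city']:
    df[col] = df[col].str.strip()
```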
## Operations with single variables
> just use: + (do not forget to use '' for strings)
## Comparison operations
> Creating boolean columns
>> ► Pandas has a wide array of functionalities for performing comparisons on single / multiple columns.\
Operations must be compatible with the column data type.\
Very useful when slicing a pandas data frame!
> Example
>> Classical comparisons (==, >, ..)\
.between(low, high)\
.isin(['value 1', 'value 2', ...])\
.str.contains('string')
## Pandas series
Not much different
---
# 43. Operations on single data frames/series - demo
## Comparison operators
useful!
```
df
df[['total_bill', 'tip']] > 10
df['total_bill'].between(20, 30)
df[['smoker', 'day']].isin(['No', 'Thur'])
df['smoker'].str.contains('Y')
df[df['total_bill'].between(20, 30)]
```
---
# 44. Operations between data frames/series
> Basic rules - data frames
>> Operations between data frames occur on elements with matching index and column names.\
Missing values will be generated for unpaired rows/columns.
>Pandas methods
Operations between pandas series\
Data frames and data series\
Merging
## Basic rules - data frames
ordering does not matter much; addition is based on index and column name\
a missing index or column name will result in NaN
> booleans
>> light switch! addition acts like OR: True+True=True; True+False=True; False+True=True; False+False=False

## Pandas methods
Can be used to avoid NaN values by using the "fill_value" argument.
> add
>> df1.add(df2, fill_value=0)
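A sketch of what `fill_value` changes, on two tiny data frames with partially overlapping indexes (invented for the example):

```python
import pandas as pd

# Hypothetical frames sharing only index 1
df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [10, 20]}, index=[1, 2])

# Plain + yields NaN for the unpaired indexes 0 and 2
plain = df1 + df2

# fill_value=0 treats the missing side as 0 instead
filled = df1.add(df2, fill_value=0)
```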
## Operations between pandas series
Operations between two pandas series are based on elements with the same index.
> data series name can be different
Missing values will be generated for unpaired indexes. (nan) \
Operation must be compatible with series data types !\
Pandas series is equivalent to a single column of a pandas data frame.\
Operations between columns within the same data frame are performed frequently.
## Data frames and data series
> Operation can be performed:
>> By matching data series index with data frame column names.\
By matching data series index with data frame row index.\
> Behaviour can be controlled by using pandas methods.
>> axis = 0 - match data series index with data frame index.\
axis = 1 - match data series index with data frame columns.
with df1.add(se1, axis=0), each series value is added to every variable (column) of the row (case/index) whose index matches
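The two axis behaviours can be sketched like this (data and names invented):

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

# axis=0: match series index with data frame row index;
# each row gets its own offset, applied to every column
row_offsets = pd.Series([100, 200], index=[0, 1])
by_rows = df.add(row_offsets, axis=0)

# axis=1: match series index with data frame column names;
# each column gets its own offset
col_offsets = pd.Series([1, 2], index=['A', 'B'])
by_cols = df.add(col_offsets, axis=1)
```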
## Merging
>> pd.concat
>Axis argument can be used to control the merge.
---
# 45. Operations between data frames/series - demo
## Ignore index
```
df1 = base_df_1[['A', 'B']]
df2 = base_df_2[['A', 'B']]
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)
print('First df :')
display(df1)
print('Second df :')
display(df2)
print('Result :')
display(df3)
```
## Non matching columns
```
df1 = base_df_1[['A', 'B', 'C']]
df2 = base_df_2[['A', 'B']]
df3 = pd.concat([df1, df2], axis=0)
print('First df :')
display(df1)
print('Second df :')
display(df2)
print('Result :')
display(df3)
```
---
# 46. Other useful pandas functionalities
> Useful functionalities
>> - Head and tail
>> - Handling missing values
>> - Handling duplicates
>> - Dropping rows/columns
>> - Sorting
>> - Renaming rows/columns
>> - Mapping
## Head and tail
> by number of rows
>> df.head(n)\
>> df.tail(n)
## Handling missing values
None can represent a missing value
> var = None
>> Can be detected with ==: 'print(var == None)'\
>> Can be detected with is: 'print(var is None)'
np.nan is also used to represent missing values
> but with var = np.nan: print(var == np.nan) is False (NaN never compares equal, not even to itself)\
print(np.isnan(var)) is True\
print(var is np.nan) is True
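The behaviour above, plus `pd.isna`, which handles both `None` and `np.nan` uniformly:

```python
import numpy as np
import pandas as pd

var = np.nan

print(var == np.nan)   # False: NaN never compares equal, even to itself
print(np.isnan(var))   # True: the reliable numeric check
print(var is np.nan)   # True here, but identity checks on floats are fragile

# pandas' own check covers None and NaN alike
print(pd.isna(None), pd.isna(np.nan))
```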
## Handling duplicates
df.duplicated(subset=['Name'])\
df.drop_duplicates(ignore_index=True)
## Dropping rows/columns
df = df.drop(index=[2])\
df = df.filter(['b', 'C'], axis=1)
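A runnable sketch chaining the duplicate and drop calls above (data invented):

```python
import pandas as pd

# Hypothetical data frame with one exact duplicate row
df = pd.DataFrame({'Name': ['Ann', 'Bob', 'Ann'], 'Age': [20, 21, 20]})

# Flag duplicates; the first occurrence is not flagged by default
dupes = df.duplicated(subset=['Name'])

# Drop full-row duplicates and renumber the index
deduped = df.drop_duplicates(ignore_index=True)

# Drop a row by index label, then keep only chosen columns
trimmed = deduped.drop(index=[1]).filter(['Name'], axis=1)
```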
## Sorting
> Sort by index
>> df.sort_index()\
>> (ascending=False) <-Reverse order
> Sort by value
>> df.sort_values(by='Age', ignore_index=True)
> df.sort_values(by=['Age', 'Grade'])
`data = {`
`'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Harry'],`
` 'Age': [22, 19, 21, 20, 22, 21, 19, 20],`
` 'Grade': [85, 95, 77, 88, 92, 76, 99, 89],`
` 'Subject': ['Math', 'Physics', 'Chemistry', 'Math', 'Physics', 'Chemistry', 'Math', 'Physics']`
`}`
`df = pd.DataFrame(data)`
`df`
## Renaming rows/columns
`df = pd.DataFrame({ `
` 'A': [1, 2, np.nan, 4, 5],`
` 'B': [np.nan, 7, 8, 9, 10],`
` 'C': ['a', 'b', 'c', 'd', 'e'],`
` 'D': [100, 200, 300, 400, 500]`
`})`
`df`
### Rename columns
`df = df.rename(mapper={'A': 'P', 'B': 'b'}, axis=1)`
### Rename rows
`df = df.rename(mapper={0: 'zero', 1: 'one'}, axis=0)`
## Mapping
`data = {`
` 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Harry'],`
` 'Age': [22, 19, 21, 20, 22, 21, 19, 20],`
` 'Grade': [20, 95, 77, 88, 30, 76, 99, 89],`
` 'Subject': ['Math', 'Physics', 'Chemistry', 'Math', 'Physics', 'Chemistry', 'Math', 'Physics']`
`}`
`df = pd.DataFrame(data)`
`df`
```
map_dict = {
    'Math': 'M102',
    'Physics': 'P102',
    'Chemistry': 'C102'
}
df['Subject'] = df['Subject'].map(map_dict)
df
```
## Question
```
# Let's assume 'M102', 'P102', 'C102' are the locations of each subject
qdf = pd.DataFrame(data)
qdf
map_dict = {
    'Math': 'M102',
    'Physics': 'P102',
    'Chemistry': 'C102'
}
qdf['location'] = qdf['Subject'].map(map_dict)
qdf
# Mapping into a new 'location' column keeps 'Subject' intact;
# assigning back to qdf['Subject'] would overwrite it
```
## check
```
def check_grade(grade):
    if grade > 50:
        return 'passed'
    else:
        return 'failed'

df['Status_1'] = df['Grade'].map(check_grade)
df

df['Status'] = df['Grade'].map(lambda x: 'passed' if x > 50 else 'failed')
df
```
---
# 47. Pandas data types
DATA TYPE INFERENCE - READ CSV\
► When loading data frame with "read_csv" pandas infers data types automatically.\
► We can provide data types with "dtype" argument:\
► Single data type.\
► Dict of data types.
Common:
> NUMERICAL DATA AND BOOLEANS
>> Numerical data: int32, int64, ..., float32, float64, ...
>> Booleans: bool
> OBJECT DATA TYPE
>> Flexible data type: strings, objects of custom classes, lists, sets, ...
► If a single column contains multiple data types, it will be assigned the object data type.\
► Operations on elements in columns with mixed data types will be based on their actual, underlying data types.\
► Behaviour can sometimes be unpredictable!\
not recommended
## CHECKING DATA TYPES
For a dataframe:
> df.dtypes\
> df.info()
Single column (series):
> df['col'].dtype
## Data type conversions
converting data types
> df.astype(new_data_type)
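A sketch of `astype` on a made-up data frame, including the dict form for several columns at once:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'x': ['1', '2', '3'], 'y': [1.9, 2.5, 3.1]})

# Convert a single column; astype returns a new object
df['x'] = df['x'].astype('int64')

# Dict form converts selected columns at once;
# note float -> int drops the decimal part rather than rounding
df = df.astype({'y': 'int64'})
```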
---
# 48. Pandas data types - demo
## Dtype conversions (D)
converting floats to integers simply drops the decimal part (no rounding)

---
# 49. Pandas group by statement
## Grouping
> based on criteria \
criteria defined by column(s)
After grouping, subsets can be processed separately (even summarizations)
> df.groupby('col')
## Processing groups
iterating yields one (key, subset) tuple per loop iteration\
e.g.: df.groupby('col')['targetcol'].sum() or .count()
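A minimal sketch of both styles, aggregation and iteration, on invented data:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'day': ['Mon', 'Mon', 'Tue'], 'sales': [10, 20, 30]})

# Aggregate one target column per group
totals = df.groupby('day')['sales'].sum()
counts = df.groupby('day')['sales'].count()

# Or iterate: one (key, subset) tuple per loop iteration
for key, subset in df.groupby('day'):
    print(key, len(subset))
```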
---
# 50. Pandas group by statement - demo
Not much different!
`import pandas as pd`
`import numpy as np`
`import seaborn as sns`
`import random `
`random.seed(2)`
---
# 51. ---- Part 3 - Data visualisations ----
Data visualizations
Scatter plots
Line plots
Histograms
Box plots
> whisker lengths: 1.5 × IQR
>> IQR = Q3 (75%) - Q1 (25%), so whiskers extend up to 150% of the IQR beyond the quartiles
>> ...not max and min
Bar charts\
Python libraries:
> Matplotlib (normally imported as plt)\
> Seaborn (normally imported as sns)
>> Specialized for plotting data from data frames\
Can be used for creating a variety of plots\
quicker to write than matplotlib but less flexible
---
# 52. Matplotlib basics
## Lecture agenda
- Single plots
- Plotting in a grid
`import numpy as np`
`import matplotlib.pyplot as plt`
`import seaborn as sns`
`sns.set_theme()`
`plt.plot([1, 2, 3, 4], [1, 4, 2, 3])`
just like R, nothing much different....
---
# 53. Seaborn basics
`import seaborn as sns`
`import matplotlib.pyplot as plt`
`sns.set_theme() `
> Load the built-in 'tips' dataset
`tips = sns.load_dataset("tips")`
`tips.head()`
> Total bill vs tip
` sns.scatterplot(data=tips, x="total_bill", y="tip", hue='time')`
## Adding a regression line makes this much easier to interpret, but how can I? Thanks, AI bot
` sns.lmplot(data=tips, x="total_bill", y="tip", hue='time') `
looks terrible. time for mantra!
```
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set(context='notebook', style='white', palette=None, font='sans-serif', font_scale=1, color_codes=False, rc=None)
sns.set_theme(style="ticks", rc=custom_params)
sns.lmplot(data=tips, x="total_bill", y="tip", hue='time', markers=['o', 'v'], legend=True, legend_out=None)
```
---
# 54. Chapter summary
nothing special