# Python data science libraries (41-54)

41. Pandas indexing and slicing - demo
42. Operations on single data frames/series
43. Operations on single data frames/series - demo
44. Operations between data frames/series
45. Operations between data frames/series - demo
46. Other useful pandas functionalities
47. Pandas data types
48. Pandas data types - demo
49. Pandas group by statement
50. Pandas group by statement - demo
51. ---- Part 3 - Data visualisations ----
52. Matplotlib basics
53. Seaborn basics
54. Chapter summary

# 41. Pandas indexing and slicing - demo

## Slicing and modifying pandas data frames/series

Lecture agenda:
- Slicing pandas data frames
- Modifying pandas data frames
- Slicing pandas data series

Nothing special.

---

# 42. Operations on single data frames/series

► Summarization \
► Summarization with numpy \
► Operations on data frame elements \
► Operations with single variables \
► Comparison operations \
► Pandas series

## Summarization

> Operations relate to columns by default: `axis=0` (down the rows, per column) is the default, `axis=1` works across the columns (per row).

> Summarization functions

>> For numerical data: \
>> `.min()` - minimum \
>> `.median()` - median \
>> `.sum()` - sum \
>> `.describe()` - provides several useful statistics

>> For categories: \
>> `.nunique()` \
>> `.mode()`

>> For single columns: \
>> `.unique()` \
>> `.value_counts()`

## Numpy and pandas

> ► Numpy functions can be used with pandas. \
> ► The numpy operation must be compatible with the pandas column data types.

> Transforming pandas to numpy:
>> `DataframeName[["", ""]].to_numpy()`

## Transforming data frame elements

> ► In data frames, these operations are usually performed on separate columns. \
> ► Operations must be compatible with the column data type. Numpy can be used!

## Operations on strings

> `.str`
>> Works on a single column (series); handle multiple columns with a loop.
>>> e.g.
>>> `Dataframe[""].str.upper()`

## Operations with single variables

> Just use `+` (do not forget the quotes `''` for strings).

## Comparison operations

> Creating boolean columns
>> ► Pandas has a wide array of functionalities for performing comparisons on single / multiple columns. \
>> Operations must be compatible with the column data type. \
>> Very useful when slicing a pandas data frame!

> Examples
>> Classical comparisons (`==`, `>`, ...) \
>> `.between(low, high)` \
>> `.isin(['value 1', 'value 2', ...])` \
>> `.str.contains('string')`

## Pandas series

Not much different.

---

# 43. Operations on single data frames/series - demo

## Comparison operators

Useful!

```python
df
df[['total_bill', 'tip']] > 10
df['total_bill'].between(20, 30)
df[['smoker', 'day']].isin(['No', 'Thur'])
df['smoker'].str.contains('Y')
df[df['total_bill'].between(20, 30)]
```

---

# 44. Operations between data frames/series

> Basic rules - data frames
>> Operations between data frames occur on elements with matching index and column names. \
>> Missing values will be generated for unpaired rows/columns.

> Pandas methods \
> Operations between pandas series \
> Data frames and data series \
> Merging

## Basic rules - data frames

Ordering is not that important; addition is based on index and column name. \
A missing value (row index or column name) will result in `NaN`.

> Booleans
>> Light switch! `+ + = +`, `+ - = +`, `- + = +`, `- - = -` (the pattern of logical OR)

![image](https://hackmd.io/_uploads/HJBwFOfY0.png)

## Pandas methods

Can be used to avoid `NaN` values via the `fill_value` argument.

> add
>> `df1.add(df2, fill_value=0)`

## Operations between pandas series

Operations between two pandas series are based on elements with the same index.

> The series names can differ.

Missing values (`NaN`) will be generated for unpaired indexes. \
The operation must be compatible with the series data types! \
A pandas series is equivalent to a single column of a pandas data frame. \
Operations between columns within the same data frame are performed frequently.
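A minimal sketch of the alignment rules above, using made-up series (`s1`, `s2` and their values are illustrative, not from the lecture):

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Plain "+" aligns on index: 'a' and 'd' have no partner, so they become NaN
total = s1 + s2                      # a: NaN, b: 21.0, c: 32.0, d: NaN

# .add() with fill_value treats the missing partner as 0 instead of producing NaN
filled = s1.add(s2, fill_value=0)    # a: 10.0, b: 21.0, c: 32.0, d: 3.0

print(total)
print(filled)
```

The same `fill_value` mechanism works between data frames, as in `df1.add(df2, fill_value=0)` above.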
## Data frames and data series

> Operations can be performed:
>> By matching the data series index with the data frame column names. \
>> By matching the data series index with the data frame row index.

> The behaviour can be controlled with pandas methods:
>> `axis=0` - match the series index with the data frame index. \
>> `axis=1` - match the series index with the data frame columns.

With `df1.add(se1, axis=0)`, each series value is added to all variables (columns) in the matching row (case/index).

## Merging

>> `pd.concat`

> The `axis` argument can be used to control the merge.

## Question

![image](https://hackmd.io/_uploads/Skp9YOMtR.png)

---

# 45. Operations between data frames/series - demo

## Ignore index

```python
df1 = base_df_1[['A', 'B']]
df2 = base_df_2[['A', 'B']]
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)

print('First df:')
display(df1)
print('Second df:')
display(df2)
print('Result:')
display(df3)
```

## Non matching columns

```python
df1 = base_df_1[['A', 'B', 'C']]
df2 = base_df_2[['A', 'B']]
df3 = pd.concat([df1, df2], axis=0)

print('First df:')
display(df1)
print('Second df:')
display(df2)
print('Result:')
display(df3)
```
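The data-frame/series axis matching described in section 44 can be sketched like this (hypothetical frame and offset series, not from the lecture):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]}, index=['r1', 'r2'])

# axis=0: match the series index with the data frame row index,
# so each series value is added to every column of its row
row_offsets = pd.Series([100, 200], index=['r1', 'r2'])
by_row = df.add(row_offsets, axis=0)   # r1: A=101, B=110; r2: A=202, B=220

# axis=1: match the series index with the data frame column names,
# so each series value is added to every row of its column
col_offsets = pd.Series([1000, 2000], index=['A', 'B'])
by_col = df.add(col_offsets, axis=1)   # A: 1001, 1002; B: 2010, 2020
```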
---

# 46. Other useful pandas functionalities

> Useful functionalities
>> - Head and tail
>> - Handling missing values
>> - Handling duplicates
>> - Dropping rows/columns
>> - Sorting
>> - Renaming rows/columns
>> - Mapping

## Head and tail

> By number of cases:
>> `df.head(n)` \
>> `df.tail(n)`

## Handling missing values

`None` can represent a missing value.

> `var = None`
>> Can be detected with `==`: `print(var == None)` \
>> Can be detected with `is`: `print(var is None)`

`np.nan` is also used to represent missing values, but note:

> `print(var == np.nan)`: False \
> `print(np.isnan(var))`: True \
> `print(var is np.nan)`: True

## Handling duplicates

`df.duplicated(subset=['Name'])` \
`df.drop_duplicates(ignore_index=True)`

## Dropping rows/columns

`df = df.drop(index=[2])` \
`df = df.filter(['b', 'C'], axis=1)`

## Sorting

```python
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Harry'],
    'Age': [22, 19, 21, 20, 22, 21, 19, 20],
    'Grade': [85, 95, 77, 88, 92, 76, 99, 89],
    'Subject': ['Math', 'Physics', 'Chemistry', 'Math', 'Physics', 'Chemistry', 'Math', 'Physics']
}
df = pd.DataFrame(data)
df
```

> Sort by index
>> `df.sort_index()` \
>> `df.sort_index(ascending=False)` - reverse order

> Sort by value
>> `df.sort_values(by='Age', ignore_index=True)` \
>> `df.sort_values(by=['Age', 'Grade'])`

## Renaming rows/columns

```python
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 7, 8, 9, 10],
    'C': ['a', 'b', 'c', 'd', 'e'],
    'D': [100, 200, 300, 400, 500]
})
df
```

### Rename columns

`df = df.rename(mapper={'A': 'P', 'B': 'b'}, axis=1)`

### Rename rows

`df = df.rename(mapper={0: 'zero', 1: 'one'}, axis=0)`

## Mapping

```python
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Harry'],
    'Age': [22, 19, 21, 20, 22, 21, 19, 20],
    'Grade': [20, 95, 77, 88, 30, 76, 99, 89],
    'Subject': ['Math', 'Physics', 'Chemistry', 'Math', 'Physics', 'Chemistry', 'Math', 'Physics']
}
df = pd.DataFrame(data)
df
```
```python
map_dict = {
    'Math': 'M102',
    'Physics': 'P102',
    'Chemistry': 'C102'
}
df['Subject'] = df['Subject'].map(map_dict)
df
```

## Question

Assume 'M102', 'P102' and 'C102' are the locations of each subject, and we want them as a new 'location' column:

```python
# df['Subject'] was already overwritten with the codes above,
# so we cannot map it again - start from a fresh copy instead
qdf = pd.DataFrame(data)
qdf

map_dict = {
    'Math': 'M102',
    'Physics': 'P102',
    'Chemistry': 'C102'
}
qdf['location'] = qdf['Subject'].map(map_dict)
qdf
```

## Check

`.map()` also accepts a function (or a lambda):

```python
def check_grade(grade):
    if grade > 50:
        return 'passed'
    else:
        return 'failed'

df['Status_1'] = df['Grade'].map(check_grade)
df

df['Status'] = df['Grade'].map(lambda x: 'passed' if x > 50 else 'failed')
df
```

---

# 47. Pandas data types

Data type inference - read_csv:

► When loading a data frame with `read_csv`, pandas infers data types automatically. \
► We can provide data types with the `dtype` argument: \
► a single data type, or \
► a dict of data types.

Common types:

> Numerical data and booleans
>> Numerical data: \
>> `int32`, `int64`, ... \
>> `float32`, `float64`, ... \
>> Booleans: `bool`

> Object data type
>> A flexible data type: strings, objects of custom classes, lists, sets, ...

► If a single column contains multiple data types, it will be assigned the object data type. \
► Operations on elements in columns with mixed data types are based on their actual, underlying data types. \
► The behaviour can sometimes be unpredictable - not recommended!

## Checking data types

For a data frame:

> `df.dtypes` \
> `df.info()`

For a single column (series):

> `df['col'].dtype`

## Data type conversions

> `df.astype(new_data_type)`

---

# 48. Pandas data types - demo

## Dtype conversions

When converting floats to integers, the decimal part is simply dropped (truncated).

![image](https://hackmd.io/_uploads/Byn4odMtR.png)
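A small sketch of the conversion behaviour above, with made-up values (`astype(int)` truncates, it does not round):

```python
import pandas as pd

s = pd.Series([1.9, 2.2, 3.7])       # dtype: float64

# the decimal part is dropped, not rounded
ints = s.astype(int)                  # [1, 2, 3]

# a dict converts selected columns of a data frame
df = pd.DataFrame({'a': ['1', '2'], 'b': [1.5, 2.5]})
df = df.astype({'a': 'int64', 'b': 'float32'})
print(df.dtypes)
```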
---

# 49. Pandas group by statement

## Grouping

> Grouping is based on criteria defined by one or more columns.

After grouping, the subsets can be processed separately (including summarizations):

> `df.groupby('col')`

## Processing groups

Iterating over a groupby object yields one `(name, group)` tuple per loop iteration.

e.g. `df.groupby('col')['targetcol'].sum()` or `.count()`

---

# 50. Pandas group by statement - demo

Not much different!

```python
import pandas as pd
import numpy as np
import seaborn as sns
import random

random.seed(2)
```

---

# 51. ---- Part 3 - Data visualisations ----

Data visualizations:

- Scatter plots
- Line plots
- Histograms
- Box plots
  - Whisker length: 1.5 IQR, where IQR = Q3 (75%) - Q1 (25%), i.e. the whiskers extend up to 150% of the IQR beyond the quartiles
  - ... not to the max and min
- Bar charts

Python libraries:

> Matplotlib (normally imported as `plt`) \
> Seaborn (normally imported as `sns`)
>> Specialized for plotting data from data frames. \
>> Can be used for creating a variety of plots. \
>> Quicker to use than matplotlib, but less flexible.

---

# 52. Matplotlib basics

## Lecture agenda

- Single plots
- Plotting in a grid

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()
plt.plot([1, 2, 3, 4], [1, 4, 2, 3])
```

Just like R, nothing much different...

---

# 53. Seaborn basics

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme()
```

> Load the built-in 'tips' dataset

```python
tips = sns.load_dataset("tips")
tips.head()
```

> Total bill vs tip

```python
sns.scatterplot(data=tips, x="total_bill", y="tip", hue='time')
```

## Add the regression line

Regression lines would make this much more sense, but how? Thank the AI bot:

```python
sns.lmplot(data=tips, x="total_bill", y="tip", hue='time')
```

Looks terrible. Time for the mantra!

```python
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set(context='notebook', style='white', palette=None, font='sans-serif',
        font_scale=1, color_codes=False, rc=None)
sns.set_theme(style="ticks", rc=custom_params)
sns.lmplot(data=tips, x="total_bill", y="tip", hue='time',
           markers=['o', 'v'], legend=True, legend_out=None)
```
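The matplotlib lecture lists "plotting in a grid" but shows no grid code; a minimal sketch with made-up data (the file name `grid.png` is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# a 2x2 grid of subplots; axes is a 2x2 array of Axes objects
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].plot([1, 2, 3, 4], [1, 4, 2, 3])
axes[0, 0].set_title("Line")

axes[0, 1].scatter([1, 2, 3, 4], [4, 1, 3, 2])
axes[0, 1].set_title("Scatter")

axes[1, 0].hist([1, 1, 2, 2, 2, 3, 4, 4])
axes[1, 0].set_title("Histogram")

axes[1, 1].bar(["a", "b", "c"], [3, 1, 2])
axes[1, 1].set_title("Bar")

fig.tight_layout()
fig.savefig("grid.png")
```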
---

# 54. Chapter summary

Nothing special.