# Clustering And Dimensionality Reduction - Deep Dive

# Python data science libraries (27-40) 1

27. Chapter agenda
28. ---- Part 1 - Numpy ----
29. Indexing & slicing in numpy
30. Indexing & slicing in numpy - demo
31. Operations on single numpy arrays
32. Operations on single numpy arrays - demo
33. Operations between numpy arrays & broadcasting
34. Operations between numpy arrays & broadcasting - demo
35. Merging numpy arrays
36. Data types in numpy
37. Matrix operations in numpy
38. ---- Part 2 - pandas ----
39. Pandas indexing and slicing
40. Creating data frames

---

## 27. Chapter agenda

For data processing & manipulation:
► Numpy (28-37)
► Pandas (38-50)

For data visualization (51-53):
► Matplotlib
► Seaborn

Summary (54)

---

## 28. ---- Part 1 - Numpy ----

- Numpy
- Numpy topic

#### NUMPY ARRAYS

► Numpy stores data in multi-dimensional arrays.
► All the data in a single numpy array must be of the same data type (this is what makes it efficient).
► Numpy arrays are objects of the "numpy.ndarray" class.

#### NUMPY ARRAY PROCESSING

► NumPy comes with a wide collection of utilities that allow us to process numpy arrays.
► Numpy can perform mathematical operations on entire arrays all at once, with no need for loops.
► Performing computations in numpy is much faster and handier than doing the same task with basic iterable data types.
► Numpy is the backbone of other important data science libraries (e.g. pandas, scipy).

Topic:
► Creating.
► Slicing.
► Modifying.
► Searching for elements.
► Operations on single numpy arrays.
► Operations including more than one array.
► Broadcasting.
► Merging.
► Numpy datatypes.
► Matrix operations.

---

### 29. Indexing & slicing in numpy

- Indexing
- Slicing
- Slicing by using arrays
- Boolean indexing

#### INDEXING

► In Numpy arrays, the position of each element is determined by its index.
▸ Elements can be accessed or modified using their index.
▸ Each dimension of a numpy array has its own index.
► Indexes start at 0.
► Negative indexes count from the end: `-1` is the last element.

#### Selecting Numpy array element

Single element (just like how matrices work, we all know linear algebra, right?): `M[r, c]`

Multiple elements: use `start:stop`; the slice runs up to `stop - 1`, so the stop index itself is not included.

Whole row or column: leave the bounds empty (`:`), which is the same as `0:last+1`.

#### Slicing with arrays

Numpy array magic: put the wanted rows or columns in lists, e.g. `R = [a, b, c]` and `C = [a, b, c]`, then use `M[R, :][:, C]` or `M[np.ix_(R, C)]`.

p.s. `np.ix_()` constructs an open mesh from multiple sequences.

#### Boolean indexing (True and False)

► Boolean arrays and boolean masks can be used to retrieve or modify numpy array elements.
► Boolean indexing can be used together with standard indexing.

---

### 30. Demo

#### Creating, slicing and modifying

**Lecture agenda**
- Slicing numpy arrays
- Modifying numpy arrays
- Searching for elements

Let's play with numpy. I am using Visual Studio because my PC selected it for me.

`import numpy as np` (never forget to import the library)

#### Creating numpy arrays

`name = np.array([value1, value2, value3])`

Two-dimensional numpy array: separate the rows with ",", e.g. `name = np.array([[value1, value2], [value3, value4]])`

Check attributes: `print(name.shape)`

##### Function

**Creating different shapes of zero array**
`name = np.zeros(shape=(r, c))`

**Creating different shapes of single unique value array**
`name = np.full(shape=(r, c), fill_value=value)`

**arange array**
`name = np.arange(start, stop, step)` (stop is not included)
`name = np.arange(value)`

*Question then: how to create a 2-dimensional arange array?*
`np.array` expects the data as a single argument, so create `name1` and `name2` with `np.arange` and pass them together as one list: `np.array([name1, name2])`. Writing `np.array(name1, name2)` does not work, because the second argument is read as the data type.

**random array**
`name = np.random.randint(minimum_value, maximum_value, (r, c))` (the maximum value is not included)

Quick reminder - **python cheat sheet!**
![image](https://hackmd.io/_uploads/r1xA1eYuC.png)
![image](https://hackmd.io/_uploads/BklkxeF_R.png)
source: [Perso.limsi.fr](https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf)

#### Slicing

`name[start:stop:step]`

**Slice by another array (can also be sliced by a boolean array)**
`name1[name2]`

#### Obtain information

`name1[target]`

**Mask**
`name_mask = name_array > target_value` (or `<`)

**Masking**
`name_array[name_array > target_value]` (or `<`)

#### Modifying

**Single element**
`name[r, c] = new_value`

**Row / column**
`name[r, :] = np.array([new_value1, new_value2, new_value3])` for a row, `name[:, c] = np.array([new_value1, new_value2, new_value3])` for a column

Can also be based on index arrays or a boolean expression:
`name[np.ix_(row_idx, col_idx)] = new_value`
`name[name < target_value] = new_value`

#### Searching

**Find the index of elements**
`np.where(name > condition_value)` (quite useless for me)

**Find the unique elements (they come back sorted)**
`np.unique(name)`

**Count how many of each**
`np.unique(name, return_counts=True)`

### 31. Operations on single numpy arrays

- Operations on single elements
- Array summarization
- Reshaping

#### Function

`np.sqrt()` - square root.
`np.exp()` - exponential.
`np.log()` - logarithm.
`np.max()` - maximum value.
`np.mean()` - mean value.

**NUMPY ARRAYS AND SCALARS**
With `+`, the scalar is added to every element.
With `>`, every element is compared with the scalar and the result is a boolean array.

**Summarization**
Array axes represent the dimensions of a numpy array.
axis 0: rows
axis 1: columns

*How to remember which one is a row and which is a column? "Row" has an r, drawn as r----, so it runs along the x direction. "Column" has an l, so it runs downward like the l.*

#### Array shape

► Information on the number of dimensions and the size of each dimension.
► Can be retrieved through the `shape` attribute.
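
To make the scalar operations, the axis convention and the `shape` attribute above concrete, here is a minimal sketch. The array `scores` and its values are made up purely for illustration; only standard numpy calls are used.

```python
import numpy as np

# Hypothetical 2x3 example array (values chosen only for illustration).
scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

shape = scores.shape            # (rows, columns)
col_sums = scores.sum(axis=0)   # collapse axis 0 (rows): one value per column
row_sums = scores.sum(axis=1)   # collapse axis 1 (columns): one value per row
plus_ten = scores + 10          # scalar arithmetic is applied element-wise
mask = scores > 3               # scalar comparison gives a boolean array

print(shape)      # (2, 3)
print(col_sums)   # [5 7 9]
print(row_sums)   # [ 6 15]
```

Note that summarizing along `axis=0` removes the row dimension, which is why the result has one entry per column.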
#### ARRAY RESHAPING

`np.reshape(array, (r, c))`
`array.reshape((r, c))`

Pass the shape positionally; the keyword name for it differs between numpy versions.

**Array shapes must be compatible** (the total number of elements has to stay the same).
*The new array is filled row by row, in order.*

##### Transpose

Swaps rows and columns entirely (rows <==> columns).

##### EXPANDING DIMENSIONALITY

▸ Artificially expanding the dimensionality of an array to make it compatible with other arrays.
`np.expand_dims(array, axis=n)`, with n = 0 (the 1D array becomes a row) or n = 1 (it becomes a column)

---

### 32. Demo

Lecture agenda
- Operations on 1-D and 2-D arrays
- Array reshaping

#### Array operations

Square root of each element `np.sqrt(ArrayName)`
Exponential (base e) of each element `np.exp(ArrayName)`
Log of each element `np.log(ArrayName)`
Addition `ArrayName + Number`
Multiplication `ArrayName * Number`
Comparison `ArrayName > Number`, which results in a **boolean** array

**Aggregation**
Sum `np.sum(ArrayName)`
Average `np.mean(ArrayName)`
Maximum `np.max(ArrayName)`
Minimum `np.min(ArrayName)`
Sort `ArrayName.sort()` (sorts in place; `np.sort(ArrayName)` returns a sorted copy instead)

#### 2-D arrays - Aggregation operations

`np.Aggregation(ArrayName, axis=number)`, where `Aggregation` stands for `sum`, `mean`, `max`, etc.
`axis=0` aggregates down the rows and gives one result per column; `axis=1` aggregates across the columns and gives one result per row.

#### Array reshaping

Transpose `NewArrayName = ArrayName.T`
Custom reshape `NewArrayName = np.reshape(ArrayName, (RowSize, ColumnSize))`
Add new dimension `NewArrayName = np.expand_dims(ArrayName, axis=number)`
Downgrade to 1D `NewArrayName = np.squeeze(ArrayName)`
*`np.squeeze` only removes dimensions of size 1, so the downgrade to 1D only works for data that has a single row or a single column.*

---

### 33. Operations between numpy arrays & broadcasting

- Operations between arrays having same shapes
- Broadcasting

**Same shapes**
Operations are executed element-wise.

#### Broadcasting

Occurs when an operation involves arrays of different shapes.
The arrays "stretch" in order to obtain the same shape.

Rules:
► When comparing arrays that have the same dimensionality:
▸ Compare the dimensions one by one.
▸ Dimensions are compatible if they are equal, or if one of them equals 1.
► When comparing arrays that have different dimensionality:
▸ Prepend dimensions of size 1 to the array of lower dimensionality until the dimensionalities match.
► Start with the final dimension and move towards the initial dimension (for 2D arrays: compare columns first, then rows).
► For each dimension: the array with the smaller size in that dimension replicates itself to match the size of the larger array in that dimension.

---

### 34. Demo

#### Operations between numpy arrays

See [32. Previous demo](https://hackmd.io/-dLZpEglTL-AgsBJ9Gm6RQ?view#32-Demo). The logic is the same.

#### Broadcasting

`ArrayName1 + ArrayName2`

I tried to operate on shapes (3,2) and (6,1). This does not work: 3 and 6 are neither equal nor 1, so numpy raises an error instead of stretching the arrays to a common shape.

---

### 35. Merging numpy arrays

Lecture agenda
- Merging numpy arrays
- Concatenations
- Stacking

#### Concatenations

Merge 1D `NewArrayName = np.concatenate((ArrayName, ArrayName))`
Merge 2D `NewArrayName = np.concatenate((ArrayName, ArrayName), axis=number)`
*Reminder: `axis=0` joins along the rows (the arrays end up under each other), `axis=1` joins along the columns (the arrays end up side by side).*
*To merge a 1D array with a 2D array, do not forget to expand it to 2D first with `np.expand_dims(ArrayName, axis=number)`.*
More than two arrays can be merged at once: `np.concatenate((ArrayNamea, ArrayNameb, ArrayNamec), axis=number)`

#### Stacking

Vertical stacking `np.vstack((ArrayNamea, ArrayNameb))`
Column stacking `np.column_stack((ArrayNamea, ArrayNameb))`

`np.concatenate` can only combine arrays with the same number of dimensions; stacking can overcome a difference in dimensions (e.g. a 1D array with a 2D array).

---

### 36. Data types in numpy

- Datatypes

`np.int32`: This is a 32-bit integer type. It can represent integer values from -2147483648 to 2147483647. It uses 32 bits of memory, which equals 4 bytes.
`np.int64`: This is a 64-bit integer type. It can represent integer values from -9223372036854775808 to 9223372036854775807. It uses 64 bits of memory, which equals 8 bytes.
`np.float32` / `np.float64`: 32-bit and 64-bit floating point types.
`np.bool_`: boolean (do not forget the underscore).

- Datatype conversions

A numpy array can contain only elements of one datatype.

#### Datatype conversions

`ArrayName.astype(np.DataTypeName)`
When converting to int, everything behind the decimal point is lost.

---

### 37. Matrix operations in numpy

- Creating matrices
- Matrix addition and multiplication

#### Creating matrices

See [30. previous demo](https://hackmd.io/-dLZpEglTL-AgsBJ9Gm6RQ?view#Function) to create an array / matrix.

Identity matrix `MatrixName = np.eye(MatrixSize)`
Matrix of ones `MatrixName = np.ones((RowSize, ColumnSize))`
Random matrix `MatrixName = np.random.random((RowSize, ColumnSize))`

#### Matrix addition and multiplication

Addition: use **+**
Multiplication: **`Matrixa @ Matrixb`**, not **`Matrixa * Matrixb`** (`*` multiplies element-wise)
Transpose: `NewMatrixName = MatrixName.T`

---

## 38. ---- Part 2 - pandas ----

- Pandas intro
- Pandas topic

#### WHAT IS PANDAS

► A python library that helps manage and analyze data, especially when it is in a table format.
► Install: `pip install pandas`, or `py -m pip install pandas` for Windows users in cmd (I always forget how to use Windows correctly).
► Import: `import pandas as pd`

##### MAIN DATA FORMATS

Two main ways of storing data: **series** or **frames**.

###### Pandas data series

► A single-column data frame.
► Each element is described by its **index**.
► All elements in a pandas series must have the **same** datatype.
► Can contain missing values.
► Objects of type "pandas.core.series.Series".

###### Pandas data frames

► Use a **table** format, similar to 2D numpy arrays.
► Each element is determined by its index and column name.
► All entries inside a single column must have the same datatype.
► Columns can contain missing values.
► Objects of the "pandas.core.frame.DataFrame" class.

Take home message: a series has 1 variable only, while a frame can have multiple variables.

#### PANDAS OPERATIONS

► Pandas provides a wide range of tools for working with data frames and series.
► Operations are data type specific.
▸ Example operations: summarizations, data type conversions.

#### PANDAS TOPIC

► Indexing in pandas.
► Operations on single data frames / series.
► Operations including multiple data frames / series.
► Useful utilities.
► Group by.

---

### 39. Pandas indexing and slicing

- Slicing with .loc
- Selecting whole columns
- Slicing with booleans
- Slicing with .iloc
- Indexing pandas data series
Columns represent features, while rows contain instances of the dataset.

#### SLICING PANDAS DATAFRAME

► **Rows** are the **first** dimension while **columns** are the **second** dimension.
► Retrieving / modifying elements is similar to numpy.

Two main methods for slicing:
`.loc` - uses row and column **names**.
`.iloc` - uses row and column **positions**.

##### SLICING with .loc

Here `Name` is the data frame.

Single element:
`Name.loc[IndexNumber, 'ColumnName']`

Multiple elements:
`Name.loc[IndexNumberStart:IndexNumberEnd, 'ColumnNameStart':'ColumnNameEnd']` (a range is written without the extra `[]`; with `.loc` both ends are included)
`Name.loc[[IndexNumber, IndexNumber], ['ColumnName', 'ColumnName']]`
`Name.loc[bool_arr_1, bool_arr_2]`

If 'ColumnName' is **wrapped in** [], this returns a pandas data frame:
`Name.loc[[IndexNumber, IndexNumber], ['ColumnName']]`

If 'ColumnName' is **without** [], this returns a pandas series:
`Name.loc[[IndexNumber, IndexNumber], 'ColumnName']`

##### Considerations

► Use ":" to select whole rows / columns.
► Negative indexing cannot be used, because `.loc` works with names, not positions.
► Boolean arrays can be used.

##### Selecting whole columns

`Name.loc[:, 'ColumnName']` or `Name['ColumnName']`

*p.s. `Name[['ColumnName']]` returns a data frame instead of a series.*

##### Slicing with booleans

This can slice by booleans! Just use the boolean array as a filter.

##### SLICING with .iloc

Row / column names are **ignored** and the data frame is **sliced** as if it was a **2D numpy array**.
`Name.iloc[:row, :column]`

##### Indexing pandas data series

One or **more** elements can be **retrieved**.
The index is used for **retrieving elements**.

Single element `Name[IndexNumber]`
Multiple elements `Name[IndexNumberStart:IndexNumberEnd]` (when slicing by position this returns the elements up to `IndexNumberEnd - 1`)
`Name[[IndexNumber, IndexNumber]]`
By booleans `Name[[True, False, ...]]` (one True or False per element, separated by ",")

---

### 40. Demo

- Creating data frames
- Pandas DataFrame
- Pandas Series

#### Pandas DataFrame

Based on numpy first, then pandas:
`data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])`

**Define the column names**
`columns = ['Column1', 'Column2', 'Column3']`

**Define the index names**
`index = ['Row1', 'Row2', 'Row3']`

**Create a DataFrame**
`DataFrameName = pd.DataFrame(data, columns=columns, index=index)`

**Properties check**
`type(DataFrameName)`

**Dataframe shape**
`DataFrameName.shape`

**Dataframe length**
`len(DataFrameName)`

**Create pandas dataframe with no index / column names**
Create a 2D NumPy array `data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])`
Create a DataFrame `DataFrameName = pd.DataFrame(data)`

**Assign index / column names**
**The index can be alphabetic!!**
`DataFrameName.index = ['Row1', 'Row2', 'Row3']`
`DataFrameName.columns = ['Column1', 'Column2', 'Column3']`

##### Create dataframe from dict of lists

Example:

    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 18, 41, 28],
        'City': ['New York', 'Los Angeles', 'London', 'Berlin', 'Sydney']
    }
    DataFrameName = pd.DataFrame(data)
    DataFrameName

Input at the variable level (one entry per column).

##### Create dataframe from list of dictionaries

    data = [
        {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
        {'Name': 'Bob', 'Age': 32, 'City': 'Los Angeles'},
        {'Name': 'Charlie', 'Age': 18, 'City': 'London'},
        {'Name': 'David', 'Age': 41, 'City': 'Berlin'},
        {'Name': 'Eve', 'Age': 28, 'City': 'Sydney'}
    ]
    DataFrameName = pd.DataFrame(data)
    DataFrameName

Input at the case level (one entry per row).

##### Create data frame from dict of dicts

    data = {
        'A': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50, 5: 60},
        'B': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6},
        'C': {0: 'Yes', 1: 'No', 2: 'Cat', 3: 'Dog', 4: 'Rabbit', 5: 'Fish'},
        'D': {0: 'Thur', 1: 'Sun', 2: 'Thur', 3: 'Fri', 4: 'Sun', 5: 'Thur'},
        'E': {0: True, 1: False, 2: False, 3: False, 4: True, 5: False},
        'F': {0: True, 1: False, 2: False, 3: False, 4: True, 5: False},
        'G': {0: 6.7, 1: 2.2, 2: 3.4, 3: 11.1, 4: 12, 5: 22}
    }
    DataFrameName = pd.DataFrame(data)
    DataFrameName

Input at the variable level.
The inner keys do not necessarily need to be in order. I switched to `'F': {0: True, 1: False, 4: False, 2: False, 3: False, 5: False},` and it still works perfectly fine.

##### Create data frame from list of lists

    data = [
        ['Alice', 25, 'New York'],
        ['Bob', 32, 'Los Angeles'],
        ['Charlie', 18, 'London'],
        ['David', 41, 'Berlin'],
        ['Eve', 28, 'Sydney']
    ]
    DataFrameName = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
    DataFrameName

Input at the case level first, then name the variables later.

##### Loading dataframe from a csv file

    df = pd.read_csv(
        filepath_or_buffer='CSVFileName.csv',
        index_col='AssignedIDColumnName'
    )

    df = pd.read_csv(
        filepath_or_buffer='CSVFileName.csv',
    )

The second one is without 'index_col', so pandas will assign an index for the dataframe.
The ID column does not necessarily need to be the first column.
**Key: {}**

##### Save dataframe to file

    DataName = {
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 18, 41, 28],
        'City': ['New York', 'Los Angeles', 'London', 'Berlin', 'Sydney']
    }
    DataFrameName = pd.DataFrame(DataName)
    DataFrameName.to_csv(
        path_or_buf='CSVName.csv',
        sep=','
    )

#### Pandas Series

##### Create from list

    SeriesName = pd.Series(
        data=[10, 40, 50],
        index=['Bob', 'Anna', 'Peter'],
        name='series1'
    )
    print(type(SeriesName))
    SeriesName

##### Create from data frame

    SeriesName = DataFrameName['Name']
    print(type(SeriesName))
    SeriesName

##### Create from dict

    d1 = {'Height': 175, 'Age': 25, 'Weight': 70}
    SeriesName = pd.Series(d1)
    SeriesName

##### Create pandas data frame from series

    DataFrameName = pd.DataFrame(SeriesName)
    DataFrameName

I found out that multiple series can also be used to create the pandas data frame.
`DataFrameName = pd.DataFrame([SeriesName1, SeriesName2, SeriesName3])`
Each series becomes one row of the data frame.

With assigned index names:
`DataFrameName = pd.DataFrame([SeriesName1, SeriesName2, SeriesName3], index=["AssignIndexA", "AssignIndexB", "AssignIndexC"])`

---

## Next note

### Python data science libraries (41-54) 2

41. Pandas indexing and slicing - demo
42. Operations on single data frames/series
43. Operations on single data frames/series - demo
44. Operations between data frames/series
45. Operations between data frames/series - demo
46. Other useful pandas functionalities
47. Pandas data types
48. Pandas data types - demo
49. Pandas group by statement
50. Pandas group by statement - demo
51. ---- Part 3 - Data visualisations ----
52. Matplotlib basics
53. Seaborn basics
54. Chapter summary