# Clustering And Dimensionality Reduction - Deep Dive
# Python data science libraries (27-40) 1
27. Chapter agenda
28. ---- Part 1 - Numpy ----
29. Indexing & slicing in numpy
30. Indexing & slicing in numpy - demo
31. Operations on single numpy arrays
32. Operations on single numpy arrays - demo
33. Operations between numpy arrays & broadcasting
34. Operations between numpy arrays & broadcasting - demo
35. Merging numpy arrays
36. Data types in numpy
37. Matrix operations in numpy
38. ---- Part 2 - pandas ----
39. Pandas indexing and slicing
40. Creating data frames
---
## 27. Chapter agenda
For data processing & manipulation.
► Numpy. (28-37)
► Pandas.(38-50)
For data visualization: (51-53)
► Matplotlib.
► Seaborn.
Summary(54)
---
## 28. ---- Part 1 - Numpy ----
Numpy
Numpy topic
NUMPY ARRAYS
Numpy stores data by using multi-dimensional arrays.
► All the data in a single numpy array must be of the same data type (this is what makes it efficient).
Numpy arrays are objects of the "numpy.ndarray" class.
NUMPY ARRAY PROCESSING
► NumPy comes with a wide collection of utilities that allow us to process numpy arrays.
► Numpy can perform mathematical operations on entire arrays all at once, with no need for loops.
► Performing computations in numpy is much faster and handier than doing the same task with basic iterable data types.
► Numpy is the backbone of other important data science libraries (e.g. pandas, scipy).
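A minimal sketch of the "no need for loops" point above. The data and variable names are made up for illustration.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]  # illustrative data

# Plain Python: visit every element in a loop.
doubled_list = [v * 2 for v in values]

# Numpy: one expression is applied to the whole array at once.
arr = np.array(values)
doubled_arr = arr * 2

print(type(arr))    # the numpy.ndarray class
print(arr.dtype)    # one shared data type for all elements
print(doubled_arr)
```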
Topic:
► Creating.
► Slicing.
► Modifying.
► Searching for elements.
► Operations on single numpy arrays
► Operations including more than one array.
► Broadcasting.
► Merging.
► Numpy datatypes.
► Matrix operations.
---
### 29. Indexing & slicing in numpy
Indexing.
Slicing.
Slicing by using arrays.
Boolean indexing.
#### INDEXING
► In Numpy arrays, the position of each element is determined by its index.
▸ Elements can be accessed or modified using their index.
▸ Each dimension of a numpy array has its own index.
► Indexes start at 0.
(Negative indexes count from the end, so -1 is the last element.)
#### Selecting
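Numpy array element
Single element: just like how matrices work in linear algebra, `M[r, c]` (row `r`, column `c`).
Multiple elements: use a slice `start:stop`, where `:` means "to" and the stop index is not included.
Last element: index `-1`.
Whole row or column: leave the bounds out and write `:` on its own, or spell it out as `0:last+1`.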
#### Slicing with arrays
Numpy array magic
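To select several rows or columns at once, store their indexes in lists, e.g. `R = [a, b, c]` and `C = [a, b, c]`, then use `M[R, :][:, C]` or `M[np.ix_(R, C)]`.
p.s. the `np.ix_()` function constructs an open mesh from multiple sequences, so every row in `R` is combined with every column in `C`.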
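A small sketch of the indexing and slicing rules above, using a made-up 3x4 array `M`.

```python
import numpy as np

M = np.arange(12).reshape(3, 4)   # rows 0..2, columns 0..3, values 0..11

single = M[1, 2]       # row 1, column 2 -> 6
last = M[-1, -1]       # last row, last column -> 11
row = M[0, :]          # whole first row
col = M[:, 1]          # whole second column
block = M[0:2, 1:3]    # rows 0-1, columns 1-2 (stop not included)

R = [0, 2]
C = [1, 3]
sub_a = M[R, :][:, C]      # pick rows first, then columns
sub_b = M[np.ix_(R, C)]    # same result in one step
paired = M[R, C]           # pairs the indexes up instead: M[0, 1] and M[2, 3]
```

Note that `M[R, C]` without `np.ix_` returns only the paired elements, not the full sub-matrix.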
#### Boolean indexing
(True and False)
► Boolean arrays and boolean masks can be used to retrieve or modify numpy array elements.
► Boolean indexing can be used together with standard indexing.
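A minimal sketch of boolean masks, for retrieving, for combining with standard indexing, and for modifying. The array `scores` is made up.

```python
import numpy as np

scores = np.array([[5, 12, 7],
                   [20, 3, 15]])

mask = scores > 10          # boolean array with the same shape as scores
selected = scores[mask]     # 1-D array of the elements where mask is True

# Boolean mask on the rows together with a standard column index:
# rows whose first column is > 10, then only column 2.
picked = scores[scores[:, 0] > 10, 2]

# Modifying through a mask (on a copy, so scores stays untouched).
capped = scores.copy()
capped[capped > 10] = 10
```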
---
### 30. Demo
#### Creating, slicing and modifying
**Lecture agenda**
- Slicing numpy arrays
- Modifying numpy arrays
- Searching for elements
Let's play with numpy. I am using Visual Studio because my PC selected it for me.
`import numpy as np` (never forget to import the library)
#### Creating numpy arrays
`name = np.array([value1, value2, value3])`
For a two-dimensional numpy array, pass a list of rows separated by ",": `name = np.array([[value1, value2], [value3, value4]])`
Check attributes with `print(name.shape)`
##### Function
**Creating different shapes of zero array**
`name = np.zeros(shape=(r,c))`
**Creating different shapes of single unique value array**
`name = np.full(shape=(r,c), fill_value=value)`
**arange array** `name = np.arange(start, stop, step)` (stop is not included)
`name = np.arange(value)`
*Question then: how to create a 2-dimensional arange array?*
`np.array` expects a single array-like argument,
so one way is to create `name1` and `name2` with `np.arange` and wrap them in a list: `np.array([name1, name2])`. Another way is to reshape a single range: `np.arange(r * c).reshape(r, c)`.
**random array**
`name = np.random.randint(minimum value, maximum value(not included), (r, c))`
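A short sketch that runs the creation functions above with made-up shapes and values, including both ways of building a 2-D arange array.

```python
import numpy as np

zeros = np.zeros(shape=(2, 3))                 # 2x3 array of 0.0
sevens = np.full(shape=(2, 3), fill_value=7)   # 2x3 array filled with 7
evens = np.arange(0, 10, 2)                    # 0, 2, 4, 6, 8 (stop not included)

# Two ways to get a 2-D arange array:
grid = np.arange(6).reshape(2, 3)
stacked = np.array([np.arange(3), np.arange(3, 6)])

# Random integers from 0 up to (not including) 10, shape 2x3.
random_ints = np.random.randint(0, 10, (2, 3))
print(random_ints.shape)
```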
Quick reminder: **python cheat sheet!**
source: [Perso.limsi.fr](https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf)
#### Slicing
`name[start:stop:step]`
**slice by another array (can be sliced by boolean array)**
`name1[name2]`
#### obtain information
`name1[target]`
**mask**
`name_mask = name_array >(or<) target_value`
**masking**
`name_array[name_array >(or<) target_value]`
#### Modifying
**single element**
`name[r,c] = new_value`
**row / column**
`name[r, :] = np.array([new_value1, new_value2, new_value3...])` for a row, `name[:, c] = ...` for a column
Modification can also be based on index arrays or a boolean expression:
`name[np.ix_(row_idx, col_idx)] = new_value`
`name[name < target_value] = new_value`
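The modifying patterns above, applied one after another to a made-up 3x3 array so the effect of each step can be followed.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)   # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

a[0, 0] = 100                        # single element
a[1, :] = np.array([10, 20, 30])     # whole row 1
a[:, 2] = -1                         # whole column 2 (scalar is repeated)
a[np.ix_([0, 2], [0, 1])] = 0        # rows 0 and 2, columns 0 and 1
a[a < 0] = 99                        # every element matching the condition

print(a)
```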
#### Searching
**Find index of elements**
`np.where(name > condition_value)`
@quite useless for me
**Find the unique elements @the result also comes back sorted**
`np.unique(name)`
**count how many of them**
`np.unique(name, return_counts=True)`
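A minimal sketch of the searching functions on a made-up array. The three-argument form of `np.where` is an extra not covered in the notes above; it can make the function more useful in practice.

```python
import numpy as np

x = np.array([3, 1, 3, 7, 1, 3])

idx = np.where(x > 2)                  # tuple holding one index array per dimension
positions = idx[0]                     # positions where the condition holds

values, counts = np.unique(x, return_counts=True)   # sorted unique values + counts

# Extra: np.where(condition, a, b) picks from a where True, from b elsewhere.
replaced = np.where(x > 2, x, 0)
```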
### 31. Operations on single numpy arrays
Operations on single elements.
Array summarization.
Reshaping.
#### Function
`np.sqrt()` - square root.
`np.exp()` - exponential.
`np.log()` - logarithm.
`np.max()` - maximum value.
`np.mean()` - mean value.
**NUMPY ARRAYS AND SCALARS**
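`array + scalar`: the scalar is added to every element.
`array > scalar`: every element is compared, and the result is a boolean array.
**Summarization**
array axes (represent numpy array dimensions)
axis 0: rows (it runs down the rows, so aggregating along it gives one result per column)
axis 1: columns (it runs across the columns, so aggregating along it gives one result per row)
*How to remember which one is row or column? Row has an r, which stretches out like r----, so a row goes along the x axis. Column has an l, so a column goes down like an l.*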
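A small sketch of how the `axis` argument changes a summary, on a made-up 2x3 array.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

total = np.sum(m)                 # no axis: everything collapses to one number
col_sums = np.sum(m, axis=0)      # collapse the rows -> one value per column
row_sums = np.sum(m, axis=1)      # collapse the columns -> one value per row
row_means = np.mean(m, axis=1)

print(m.shape, col_sums.shape, row_sums.shape)
```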
#### Array shape
► Information on number of dimensions and size of each dimension.
► Can be retrieved through the shape attribute.
#### ARRAY RESHAPING
`np.reshape(array, (r, c))`
`array.reshape((r, c))`
**Array shapes must be compatible.**
*filled by row in order.*
##### transpose
Swaps rows and columns entirely (rows <==> columns): `array.T`
##### EXPANDING DIMENSIONALITY
▸ Artificially expanding dimensionality of an array to make it compatible with other arrays.
`np.expand_dims(array, axis=n)`: with n=0 a 1-D array becomes a single row, with n=1 a single column.
---
### 32. Demo
Lecture agenda
- Operations on 1-D and 2-D arrays
- Array reshaping
#### Array operations
Square root of each element `np.sqrt(ArrayName)`
Exponential (base e) of each element `np.exp(ArrayName)`
Log of each element `np.log(ArrayName)`
Addition `ArrayName + Number`
Multiplication `ArrayName * Number`
Comparison `ArrayName > Number` results in a **boolean** array
**Aggregation**
Sum `np.sum(ArrayName)`
Average `np.mean(ArrayName)`
Maximum `np.max(ArrayName)`
Minimum `np.min(ArrayName)`
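Sort `ArrayName.sort()` (sorts in place; `np.sort(ArrayName)` returns a sorted copy)
#### 2-D arrays - Aggregation operations
`np.Aggregation(ArrayName, axis=number)`, where `Aggregation` stands for `sum`, `mean`, `max`, `min`...
number = 0 (one result per column) or 1 (one result per row)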
#### Array reshaping
Transpose `NewArrayName = ArrayName.T`
Custom reshape `NewArrayName = np.reshape(ArrayName, (RowSize, ColumnSize))`
Add new dimension `NewArrayName = np.expand_dims(ArrayName, axis=number)`
Downgrade to 1-D `NewArrayName = np.squeeze(ArrayName)`
*`np.squeeze` only removes dimensions of size 1, so downgrading only suits data that has a single row or column.*
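A quick sketch of the reshaping operations from this demo, tracking the shape at every step. The array `v` is made up.

```python
import numpy as np

v = np.arange(6)                      # shape (6,)

m = np.reshape(v, (2, 3))             # filled row by row: [[0, 1, 2], [3, 4, 5]]
t = m.T                               # transpose: shape (3, 2)

row_vec = np.expand_dims(v, axis=0)   # shape (1, 6), a single row
col_vec = np.expand_dims(v, axis=1)   # shape (6, 1), a single column

back = np.squeeze(col_vec)            # size-1 dimension removed: shape (6,)
```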
---
### 33. Operations between numpy arrays & broadcasting
Operations between arrays having same shapes
Broadcasting
**Same shapes**
Operations will be executed element-wise.
#### Broadcasting
Occurs when an operation involves arrays of different shapes
Arrays will "stretch" in order to obtain same shape
Rules:
- When comparing arrays that have the same dimensionality:
  ► Compare the dimension sizes one by one.
  ► Dimensions are compatible if:
    ▸ They are equal.
    ▸ One of them equals 1.
- When comparing arrays that have different dimensionality:
  ► Add dimensions of size 1 to the front of the array of lower dimensionality until the dimensionalities match.
► Start with the final dimension and move towards the initial dimension.
► For each dimension:
  ▸ The array of size 1 in that dimension replicates itself to match the size of the larger array in that dimension.
Comparison order for 2-D arrays: columns first, then rows.
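A minimal sketch of the broadcasting rules, with made-up arrays. The comments walk through how each shape is stretched.

```python
import numpy as np

m = np.ones((3, 2))                 # shape (3, 2)
row = np.array([10, 20])            # shape (2,)  -> treated as (1, 2) -> stretched to (3, 2)
col = np.array([[1], [2], [3]])     # shape (3, 1) -> stretched to (3, 2)

plus_row = m + row      # row is repeated for each of the 3 rows
plus_col = m + col      # col is repeated for each of the 2 columns
outer = col + row       # (3, 1) with (2,): both stretch to (3, 2)

print(plus_row.shape, plus_col.shape, outer.shape)
```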
---
### 34. Demo
#### Operations between numpy arrays
See [32. Previous demo](https://hackmd.io/-dLZpEglTL-AgsBJ9Gm6RQ?view#32-Demo). The logic is the same.
#### Broadcasting
`ArrayName1+ArrayName2`
I tried to add arrays of shapes (3,2) and (6,1). This does not stretch them to (6,6); it raises an error, because the first dimensions (3 and 6) are neither equal nor 1.
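A small sketch reproducing that experiment, next to a pair of shapes that really does stretch to (6, 6). The arrays are filled with zeros only to keep the focus on the shapes.

```python
import numpy as np

a = np.zeros((3, 2))
b = np.zeros((6, 1))

try:
    result = a + b            # 2 vs 1 is fine, but 3 vs 6 is incompatible
    failed = False
except ValueError as err:
    failed = True
    print(err)

# Compatible pair: (6, 1) with (1, 6) -> both stretch to (6, 6).
ok_shape = (np.zeros((6, 1)) + np.zeros((1, 6))).shape
```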
---
### 35. Merging numpy arrays
Lecture agenda
- Merging numpy arrays
- Concatenations
- Stacking
#### Concatenations
Merge
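1D `NewArrayName = np.concatenate((ArrayName, ArrayName))`
2D `NewArrayName = np.concatenate((ArrayName, ArrayName), axis=number)`
*Reminder: here axis=0 joins along the rows (the result has more rows) and axis=1 joins along the columns (the result has more columns).*
*Do not forget to expand a 1-D array to 2D with `np.expand_dims(ArrayName, axis=number)` before joining it with a 2-D array.*
Multiple arrays can be joined at once with `np.concatenate((ArrayNamea, ArrayNameb, ArrayNamec), axis=number)`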
#### Stacking
Vertical stacking `np.vstack((ArrayNamea, ArrayNameb))`
Column stacking `np.column_stack((ArrayNamea, ArrayNameb))`
`np.concatenate` can only combine arrays with the same number of dimensions; the stacking functions overcome this by promoting 1-D arrays to 2-D first.
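A short sketch comparing concatenation and stacking on made-up arrays, including the 1-D with 2-D case.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
m = np.array([[1, 2, 3],
              [4, 5, 6]])

flat = np.concatenate((a, b))                  # 1-D: one longer array
more_rows = np.concatenate((m, m), axis=0)     # shape (4, 3)
more_cols = np.concatenate((m, m), axis=1)     # shape (2, 6)

# Joining a 1-D array to a 2-D one: expand it to 2-D first ...
with_row = np.concatenate((m, np.expand_dims(a, axis=0)), axis=0)
# ... or let vstack do the promotion.
stacked = np.vstack((m, a))

columns = np.column_stack((a, b))              # each 1-D array becomes a column
```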
---
### 36. Data types in numpy
- Datatypes
`np.int32`: This is a 32-bit integer type.
It can represent integer values from -2147483648 to 2147483647. It uses 32 bits of memory, which equals 4 bytes.
`np.int64`: This is a 64-bit integer type.
It can represent integer values from -9223372036854775808 to 9223372036854775807. It uses 64 bits of memory, which equals 8 bytes.
`np.float32` and `np.float64`: 32-bit and 64-bit floating point types.
`np.bool_`: boolean type (do not forget the underscore).
- Datatype conversions
A numpy array can contain only elements of one datatype.
#### Datatype conversions
Converting a float to an int loses everything behind the decimal point.
`ArrayName.astype(np.DataTypeName)`
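A minimal sketch of datatype conversion with made-up values. It shows the decimal part being dropped and the memory size per element.

```python
import numpy as np

floats = np.array([1.9, -2.7, 3.0])

ints = floats.astype(np.int32)      # decimals are cut off, not rounded
print(ints.dtype, ints.itemsize)    # itemsize is the size of one element in bytes

flags = np.array([0, 1, 5]).astype(np.bool_)   # 0 -> False, anything else -> True

# One datatype per array: mixing ints and floats gives a float array.
mixed = np.array([1, 2.5])
```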
---
### 37. Matrix operations in numpy
- Creating matrices
- Matrix addition and multiplication
#### Creating matrices
See [30. previous demo](https://hackmd.io/-dLZpEglTL-AgsBJ9Gm6RQ?view#Function) to create array/ matrix.
Identity matrix `MatrixName = np.eye(MatrixSize)`
Matrix of ones `MatrixName = np.ones((RowSize, ColumnSize))`
Random matrix `MatrixName = np.random.random((RowSize, ColumnSize))`
#### Matrix addition and multiplication
Addition: use **+**
Multiplication: **`Matrixa @ Matrixb`**
Not **`Matrixa*Matrixb`** (that multiplies element by element)
Transpose: `NewMatrixName = MatrixName.T`
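A small sketch of the difference between `@` and `*`, using a made-up 2x2 matrix.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
identity = np.eye(2)
ones = np.ones((2, 2))

added = A + ones          # element-wise addition
product = A @ A           # true matrix multiplication
elementwise = A * A       # each element multiplied by itself
unchanged = A @ identity  # multiplying by the identity matrix changes nothing
transposed = A.T
```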
---
## 38. ---- Part 2 - pandas ----
Pandas intro
Pandas topic
#### WHAT IS PANDAS
► A Python library that helps manage and analyze data, especially when it is in a table format.
► Install `pip install pandas`
`py -m pip install pandas` for Windows users in cmd
@I always forget how to use Windows correctly
► import `import pandas as pd`
##### MAIN DATA FORMATS
There are two main ways of storing data:
**series** or **frames**.
###### Pandas data series
Like a single-column data frame.
Each element is described by its **index**.
All elements in pandas series must have **same** datatype.
Can contain missing values.
Objects of type "pandas.core.series.Series".
###### Pandas data frames
using **table** format.
Similar to 2D numpy arrays.
Each element is determined by index and column name.
All entries inside single column must have same datatype.
Columns can contain missing values.
Objects of "pandas.core.frame.DataFrame" class.
Take-home message: a series has only 1 variable, while a frame can have multiple variables.
#### PANDAS OPERATIONS
► Pandas provides wide range of tools for working with data frames and series.
► Operations are data type specific.
▸ Example operations:
Summarizations.
Data type conversions.
#### PANDAS TOPIC
► Indexing in pandas.
► Operations on single data frames / series.
► Operations including multiple data frames / series.
► Useful utilities.
► Group by.
---
### 39. Pandas indexing and slicing
Slicing with .loc.
Selecting whole columns.
Slicing with booleans.
Slicing with .iloc.
Indexing pandas data series.
Columns represent features while rows contain instances of the dataset
#### SLICING PANDAS DATAFRAME
► **Rows** are **first** dimension while **columns** are **second** dimension.
► Retrieving / modifying elements is similar as in numpy.
Two main methods for slicing:
`.loc` - uses row and column **names**.
`.iloc` - uses row and column **positions**.
##### SLICING with .loc
Single element:
(`Name` is the data frame)
`Name.loc[IndexNumber, 'ColumnName']`
Multiple elements:
`Name.loc[IndexNumberStart:IndexNumberEnd, 'ColumnNameStart':'ColumnNameEnd']` (no extra [] around a slice; both ends are included)
`Name.loc[[IndexNumber,IndexNumber], ['ColumnName','ColumnName']]`
`Name.loc[bool_arr_1, bool_arr_2]`
If you **have** [] on 'ColumnName', this will return a pandas data frame. `Name.loc[[IndexNumber,IndexNumber], ['ColumnName']]`
If 'ColumnName' is **without** [], this will return a pandas series. `Name.loc[[IndexNumber,IndexNumber], 'ColumnName']`
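A minimal sketch of the `.loc` patterns above. The data frame, its row labels (`r1`, `r2`, `r3`) and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 32, 18],
        "City": ["New York", "Los Angeles", "London"],
    },
    index=["r1", "r2", "r3"],
)

one = df.loc["r2", "Age"]                       # single element
block = df.loc["r1":"r2", "Name":"Age"]         # label slices include both ends
picked = df.loc[["r1", "r3"], ["Name", "City"]] # explicit lists of labels

as_frame = df.loc[["r1", "r3"], ["Age"]]        # [] around the column -> data frame
as_series = df.loc[["r1", "r3"], "Age"]         # no [] -> series

print(type(as_frame), type(as_series))
```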
##### Considerations
Use ":" to select whole rows / columns.
Indexing with `.loc`:
cannot use negative indexing (labels are looked up, not positions)
can use boolean arrays
##### Selecting whole columns
`Name.loc[:,'ColumnName']` or `Name['ColumnName']`
*p.s. double brackets, `Name[['ColumnName']]`, return a data frame instead of a series*
##### Slicing with booleans
Data frames can be sliced by booleans!
Just use the boolean array as a filter: only the rows (or columns) marked True are kept.
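A small sketch of boolean filtering on a made-up data frame. The condition `Age > 20` is only an example.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 32, 18]})

row_mask = df["Age"] > 20                    # boolean series, one value per row

older = df.loc[row_mask]                     # keep only the rows marked True
older_names = df.loc[row_mask, "Name"]       # filter rows, then pick a column

# A boolean list can filter the columns as well (one value per column).
name_only = df.loc[row_mask, [True, False]]
```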
##### SLICING with .iloc
Row/column names will be **ignored** and the data frame will be **sliced** as if it were a **2D numpy array**.
`Name.iloc[:row,:column]`
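A minimal sketch of `.iloc` on a made-up data frame with text row labels, to show that only positions matter.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 32, 18]},
    index=["r1", "r2", "r3"],
)

first_age = df.iloc[0, 1]       # row position 0, column position 1
top_left = df.iloc[:2, :1]      # first 2 rows, first column (stop not included)
last_name = df.iloc[-1, 0]      # negative positions work with .iloc
```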
##### Indexing pandas data series
One or **more** elements can be **retrieved**.
The index is used for **retrieving elements**.
Single element
`Name[IndexNumber]`
Multiple
`Name[IndexNumberStart:IndexNumberEnd]` (with integer positions, the end is not included)
`Name[[IndexNumber, IndexNumber]]`
By booleans
`Name[[True, False, ...]]` (one True or False per element, separated by ",")
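A short sketch of series indexing, assuming a made-up series with text labels as its index.

```python
import pandas as pd

s = pd.Series([10, 40, 50], index=["Bob", "Anna", "Peter"], name="series1")

single = s["Anna"]                  # one element by its index label
several = s[["Bob", "Peter"]]       # list of labels -> smaller series
label_slice = s["Bob":"Anna"]       # label slices include both ends
by_bool = s[[True, False, True]]    # one boolean per element
filtered = s[s > 20]                # boolean condition as a filter
```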
---
### 40. Demo - Creating data frames
- Pandas DataFrame
- Pandas Series
#### Pandas DataFrame
Build the data with numpy first, then pass it to pandas.
`data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])`
**Define the column names**
`columns = ['Column1', 'Column2', 'Column3']`
**Define the index names**
`index = ['Row1', 'Row2', 'Row3']`
**Create a DataFrame**
`DataFrameName = pd.DataFrame(data, columns=columns, index=index)`
**Properties check** `type(DataFrameName)`
**Dataframe shape** `DataFrameName.shape`
**Dataframe length** `len(DataFrameName)`
**Create pandas dataframe with no index / column names**
Create a 2D NumPy array `data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])`
Create a DataFrame `DataFrameName= pd.DataFrame(data)`
**Assign index / column names**
**Index labels can be text, not only numbers!!**
`DataFrameName.index = ['Row1', 'Row2', 'Row3']`
`DataFrameName.columns = ['Column1', 'Column2', 'Column3']`
##### Create dataframe from dict of lists
Example:
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 32, 18, 41, 28],
'City': ['New York', 'Los Angeles', 'London', 'Berlin', 'Sydney']
}
DataFrameName = pd.DataFrame(data)
DataFrameName
Input at the variable level: each key is one variable (column).
##### Create dataframe from list of dictionaries
data = [
{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
{'Name': 'Bob', 'Age': 32, 'City': 'Los Angeles'},
{'Name': 'Charlie', 'Age': 18, 'City': 'London'},
{'Name': 'David', 'Age': 41, 'City': 'Berlin'},
{'Name': 'Eve', 'Age': 28, 'City': 'Sydney'}
]
DataFrameName = pd.DataFrame(data)
DataFrameName
Input at the case level: each dictionary is one case (row).
##### Create data frame from dict of dicts
data = {
'A': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50, 5: 60},
'B': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6},
'C': {0: 'Yes', 1: 'No', 2: 'Cat', 3: 'Dog', 4: 'Rabbit', 5: 'Fish'},
'D': {0: 'Thur', 1: 'Sun', 2: 'Thur', 3: 'Fri', 4: 'Sun', 5: 'Thur'},
'E': {0: True, 1: False, 2: False, 3: False, 4: True, 5: False},
'F': {0: True, 1: False, 2: False, 3: False, 4: True, 5: False},
'G': {0: 6.7, 1: 2.2, 2: 3.4, 3: 11.1, 4: 12, 5: 22}
}
DataFrameName = pd.DataFrame(data)
DataFrameName
Input at the variable level: each outer key is one variable (column), and the inner keys are the index.
The inner keys do not necessarily need to be in order. I switched to `'F': {0: True, 1: False, 4: False, 2: False, 3: False, 5: False},` and it still works perfectly fine.
##### Create data frame from list of lists
data = [
['Alice', 25, 'New York'],
['Bob', 32, 'Los Angeles'],
['Charlie', 18, 'London'],
['David', 41, 'Berlin'],
['Eve', 28, 'Sydney']
]
DataFrameName = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
DataFrameName
Input at the case level first (one inner list per row), then name the variables afterwards.
##### Loading dataframe from a csv file
df = pd.read_csv(
filepath_or_buffer='CSVFileName.csv',
index_col='AssignedIDColumnName'
)
df = pd.read_csv(
filepath_or_buffer='CSVFileName.csv',
)
This one is without `index_col`, so pandas will assign a default integer index (0, 1, 2, ...) to the dataframe.
The ID column does not necessarily need to be the first column.
**Key: {}**
##### Save dataframe to file
DataName = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 32, 18, 41, 28],
'City': ['New York', 'Los Angeles', 'London', 'Berlin', 'Sydney']
}
DataFrameName = pd.DataFrame(DataName)
DataFrameName.to_csv(
path_or_buf='CSVName.csv',
sep=','
)
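A minimal sketch of a save-and-load round trip. To avoid touching the disk it writes to an in-memory `io.StringIO` buffer instead of a real file; this is an assumption of the sketch, as are the column names and the `index=False` option. A file name such as 'CSVName.csv' works the same way.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "ID": [101, 102]})

# Save: index=False leaves the automatic 0, 1, ... index out of the file.
buffer = io.StringIO()
df.to_csv(path_or_buf=buffer, sep=",", index=False)

# Load without index_col: pandas assigns a default integer index.
buffer.seek(0)
default_index = pd.read_csv(filepath_or_buffer=buffer)

# Load with index_col: the ID column (here the second column) becomes the index.
buffer.seek(0)
id_index = pd.read_csv(filepath_or_buffer=buffer, index_col="ID")
```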
#### Pandas Series
##### Create from list
SeriesName = pd.Series(
data = [10, 40, 50],
index= ['Bob', 'Anna', 'Peter'],
name='series1'
)
print(type(SeriesName))
SeriesName
##### Create from data frame
SeriesName = DataFrameName['Name']
print(type(SeriesName))
SeriesName
##### Create from dict
d1 = {'Height': 175, 'Age': 25, 'Weight': 70}
SeriesName = pd.Series(d1)
SeriesName
##### Create pandas data frame from series
DataFrameName = pd.DataFrame(SeriesName)
DataFrameName
I found out that multiple series can be used to create the pandas data frame; each series becomes one row.
`DataFrameName = pd.DataFrame([SeriesName1, SeriesName2, SeriesName3])`
With assigned index names
`DataFrameName = pd.DataFrame([SeriesName1, SeriesName2, SeriesName3], index=["AssignIndexA", "AssignIndexB", "AssignIndexC"])`
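A small sketch of building a data frame from several series. The series contents and the labels `PersonA` / `PersonB` are made up. The dict form at the end is an alternative that turns each series into a column instead of a row.

```python
import pandas as pd

s1 = pd.Series({"Height": 175, "Age": 25, "Weight": 70})
s2 = pd.Series({"Height": 160, "Age": 31, "Weight": 55})

# List of series: each series becomes one row, its labels become the columns.
by_rows = pd.DataFrame([s1, s2], index=["PersonA", "PersonB"])

# Dict of series: each series becomes one column, its labels become the index.
by_columns = pd.DataFrame({"PersonA": s1, "PersonB": s2})

print(by_rows.shape, by_columns.shape)
```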
---
## Next note
### Python data science libraries (41-54) 2
41. Pandas indexing and slicing - demo
42. Operations on single data frames/series
43. Operations on single data frames/series - demo
44. Operations between data frames/series
45. Operations between data frames/series - demo
46. Other useful pandas functionalities
47. Pandas data types
48. Pandas data types - demo
49. Pandas group by statement
50. Pandas group by statement - demo
51. ---- Part 3 - Data visualisations ----
52. Matplotlib basics
53. Seaborn basics
54. Chapter summary