# AI Workflow: Feature Engineering and Bias Detection - Lecture note
https://www.coursera.org/learn/ibm-ai-workflow-feature-engineering-bias-detection?specialization=ibm-ai-workflow
## Data Transformations & Feature Engineering
### Imbalanced datasets
#### Measurement
* Accuracy is not a suitable metric for imbalanced datasets, since a model can score highly while ignoring the minority class entirely
* e.g. Consider a medical dataset where only 5% of patients have a rare condition and the other 95% do not. A model that predicts that every patient is disease-free achieves an accuracy of 95%, yet it completely fails to identify the 5% of individuals who are actually ill, making it essentially useless for its primary purpose of medical diagnosis.

(Source: https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers)
#### Sampling Techniques
* Under-sampling: reduce the number of majority-class samples
* Easy way to balance the dataset
* Downside: part of the data is discarded, so information is wasted
* Over-sampling: increase the size of the minority class using one of the approaches below (a resampling sketch follows this list)
* Random Over-sampling: simplest form, randomly duplicates examples from the minority class
* SMOTE (Synthetic Minority Oversampling Technique): generates synthetic samples by interpolating between existing minority class examples
* ADASYN (Adaptive Synthetic Sampling): similar to SMOTE but focuses on harder-to-learn examples
* SMOTENC (SMOTE for Nominal and Continuous data): an extension of SMOTE for datasets with mixed categorical and continuous features
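A rough resampling sketch, assuming the `imbalanced-learn` package (not named in the lecture) and a synthetic dataset with a 5% minority class:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic data mirroring the medical example: 5% positive class
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# Under-sampling: drop majority examples until the classes are balanced
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("under-sampled:", Counter(y_u))

# Random over-sampling: duplicate minority examples
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
print("over-sampled:", Counter(y_o))

# SMOTE: synthesize new minority examples by interpolating between neighbors
X_s, y_s = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:", Counter(y_s))
```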
#### Models that handle imbalanced data
* [Support Vector Machines (SVMs)](https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html)
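A small sketch in the spirit of the linked scikit-learn example, which handles imbalance by giving the rare class a larger misclassification penalty via `class_weight` (synthetic data as a stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Plain SVM vs. one that penalizes errors on the minority class (label 1) 10x more
clf_plain = SVC(kernel="linear").fit(X, y)
clf_weighted = SVC(kernel="linear", class_weight={1: 10}).fit(X, y)

# class_weight="balanced" derives the weights from class frequencies instead
clf_balanced = SVC(kernel="linear", class_weight="balanced").fit(X, y)
```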
### Dimensionality Reduction
* When the number of features becomes unwieldy, it may be necessary to consider dimension reduction techniques
* Some advantages:
* Enable exploratory data analysis (EDA) visualization in high-dimensional data
* Remove multicollinearity
* Eliminate redundant features
* Help deal with the curse of dimensionality
* Identify structure for supervised learning
#### PCA
* Decomposes a dataset with many features into orthogonal components that explain a maximal amount of the variance (see the sketch below)
* Other approaches are covered under manifold learning in the next section
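A minimal PCA sketch with scikit-learn (Iris used as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of the total variance explained by each orthogonal component
print(pca.explained_variance_ratio_)
```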

### Manifold learning
* Used mainly in the context of visualization
* Generalizes linear techniques such as PCA so that they are more sensitive to nonlinear structure in the data
* PCA and Independent Component Analysis (ICA) provide linear projections but do not capture non-linear structures
#### tSNE
* Converts similarities between data points into joint probabilities and minimizes the KL divergence between the low-dimensional embedding and high-dimensional data
* It often provides a different representation of high-dimensional data compared to linear techniques
* Usage: mainly for visualization; not common in pipelines because its cost function is non-convex, so different runs do not always yield the same result (see the sketch below)
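A small t-SNE sketch with scikit-learn, using the digits data as a stand-in; fixing `random_state` only pins one local optimum of the non-convex objective:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# 2-D embedding colored by digit label
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```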
### Topic Modeling - lab
* My code: https://drive.google.com/file/d/1irYCtWarYc-o8WtC8HS22oQk0PqHtNUV/view?usp=sharing
* **We use pyLDAvis==3.2.2 because the latest version of pyLDAvis doesn't have the `sklearn` module**
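Not the lab code itself, but a comparable sketch assuming `pyLDAvis==3.2.2` (which still ships `pyLDAvis.sklearn`) and 20 Newsgroups as a stand-in corpus:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.sklearn

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

# Document-term matrix of word counts
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
dtm = vectorizer.fit_transform(docs)

# Fit an LDA topic model with 10 topics
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(dtm)

# Build the interactive topic visualization and save it to HTML
panel = pyLDAvis.sklearn.prepare(lda, dtm, vectorizer)
pyLDAvis.save_html(panel, "lda_topics.html")
```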
## Outlier/bias detection
### AIF360 (tutorial)
* Source code: https://github.com/Trusted-AI/AIF360
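An assumed usage sketch (not taken from the tutorial): computing group fairness metrics with AIF360 on the UCI Adult dataset, with `sex` as the protected attribute. `AdultDataset` expects the raw `adult.*` files to be downloaded into aif360's data folder beforehand.

```python
from aif360.datasets import AdultDataset
from aif360.metrics import BinaryLabelDatasetMetric

dataset = AdultDataset()
privileged = [{"sex": 1}]      # encoded as 1 = Male
unprivileged = [{"sex": 0}]    # encoded as 0 = Female

metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=unprivileged,
                                  privileged_groups=privileged)

# Difference in favorable-outcome rates between the groups (0 means parity)
print("mean difference:", metric.mean_difference())
# Ratio of favorable-outcome rates (1 means parity)
print("disparate impact:", metric.disparate_impact())
```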
### Outlier detection
* Outlier: a data point that falls outside an expected distribution or pattern (such as one learned with classical statistics)
* Novelty detection: assumes the training data contain no outliers and flags new observations that deviate from them (outlier detection works the other way around: the training data themselves may contain outliers)
* Applications:
* Handling imbalanced datasets
* Monitoring data fed to a deployed model
#### Algorithms
* Elliptic envelope (outlier)
* It can perform poorly or break down in high-dimensional settings (apply a dimensionality-reduction technique such as PCA first)
* One class SVM (outlier + novelty)
* To define the border, a scalar parameter and a kernel must be selected
* Isolation Forest (outlier)
* Like a random forest, it is an ensemble of trees, but each tree isolates observations with random splits; points that are isolated after only a few splits are likely outliers
* Local Outlier Factor (outlier + novelty)
* Works well on high-dimensional datasets
=> Visualization is a crucial way to check whether an algorithm is behaving as expected (see the sketch below)
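A sketch comparing the four scikit-learn estimators above on toy 2-D data; the `contamination` / `nu` values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),    # inlier cloud
               rng.uniform(-6, 6, size=(10, 2))])  # a few scattered outliers

estimators = {
    "Elliptic Envelope": EllipticEnvelope(contamination=0.05),
    "One-Class SVM": OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1),
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=0),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
}

for name, est in estimators.items():
    labels = est.fit_predict(X)  # +1 = inlier, -1 = outlier
    print(f"{name}: {np.sum(labels == -1)} points flagged as outliers")
```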
## Unsupervised learning
* Unsupervised learning techniques can be beneficial throughout ***the design thinking process*** because they excel at summarizing and clustering data. Here's how they can be used in each stage:
* **Empathize:** Identify patterns in user data to better understand their needs and pain points
* **Define:** Cluster user feedback and pain points to define the problem statement more clearly
* **Ideate:** Analyze existing solutions and identify patterns to spark new ideas
* **Prototype:** Use clustering to group similar prototype features and test different variations
* **Test:** Analyze user feedback on prototypes to identify patterns and areas for improvement
* Overall, unsupervised learning helps designers make sense of large datasets, uncover hidden patterns, and generate more informed solutions.
### Clustering
#### Algorithms
* **Classical algorithms:**
* k-means: Groups data points into k clusters based on distance
* Hierarchical: Builds a hierarchy of clusters by iteratively merging or splitting them
* **Contemporary algorithms:**
* Spectral Clustering: Uses eigenvectors of a similarity matrix to identify clusters
* Affinity Propagation: Finds clusters by exchanging messages between data points
* **Mixture Models:**
* Assume the data come from a mixture of probability distributions and aim to characterize those distributions (GMM and DPGMM); see the sketch after this list
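A brief sketch of these families with scikit-learn on toy blob data; the hyperparameters are illustrative assumptions:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Classical: k-means with k = 3
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Contemporary: spectral clustering on a similarity graph of the points
spectral_labels = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X)

# Mixture model: Gaussian mixture with 3 components
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Dirichlet-process-style variant: unneeded components get near-zero weight,
# so the effective number of clusters need not be fixed in advance
bgmm = BayesianGaussianMixture(n_components=10, random_state=0).fit(X)
print(bgmm.weights_.round(3))
```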
#### Evaluation
* Silhouette score - range: [-1, 1] (see the sketch after this list)
* -1 - indicates incorrect clustering
* 0 - indicates highly overlapping clusters
* 1 - indicates dense, well-separated clusters
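A short sketch computing the silhouette score for k-means solutions with different numbers of clusters (scikit-learn, toy data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Higher silhouette suggests denser, better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")
```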
#### Observations
* No single clustering method outperforms all others
* Silhouette score can be misleading for certain cluster shapes
* Choosing the appropriate clustering method and evaluation metric depends on the specific dataset and problem
#### Additional Points
* The passage emphasizes the importance of visualizing clustering assignments to understand their strengths and weaknesses
* It highlights the flexibility of spectral clustering and GMM
* It mentions the existence of variants of GMMs that automatically determine the number of clusters