# AI Workflow: Feature Engineering and Bias Detection - Lecture note
https://www.coursera.org/learn/ibm-ai-workflow-feature-engineering-bias-detection?specialization=ibm-ai-workflow
## Data Transformations & Feature Engineering
### Imbalanced datasets
#### Measurement
* Accuracy is not a suitable metric for imbalanced datasets, since a model can score highly while ignoring the minority class entirely
* e.g. Consider a medical dataset where only 5% of patients have a rare condition and the other 95% do not. A model that predicts that every patient is disease-free achieves an accuracy of 95%, yet it completely fails to identify the 5% of individuals who are actually ill, making it essentially useless for its primary purpose of medical diagnosis.

(Source: https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers)
#### Sampling Techniques
* Under-sampling: reduce the number of majority-class samples
* Easy way to balance the dataset
* Downside: part of the data is discarded, so information is wasted
* Over-sampling: increase the size of the minority class using one of the approaches below (a resampling sketch follows this list)
* Random Over-sampling: simplest form, randomly duplicates examples from the minority class
* SMOTE (Synthetic Minority Oversampling Technique): generates synthetic samples by interpolating between existing minority class examples
* ADASYN (Adaptive Synthetic Sampling): similar to SMOTE but focuses on harder-to-learn examples
* SMOTENC (SMOTE for Nominal and Continuous data): an extension of SMOTE for datasets with mixed categorical and continuous features
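A rough resampling sketch, assuming the `imbalanced-learn` package (not named in the lecture) and a synthetic dataset with a 5% minority class:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic data mirroring the medical example: 5% positive class
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# Under-sampling: drop majority examples until the classes are balanced
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("under-sampled:", Counter(y_u))

# Random over-sampling: duplicate minority examples
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
print("over-sampled:", Counter(y_o))

# SMOTE: synthesize new minority examples by interpolating between neighbors
X_s, y_s = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:", Counter(y_s))
```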
#### Models that handle imbalanced data
* [Support Vector Machines (SVMs)](https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html)
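A small sketch in the spirit of the linked scikit-learn example, which handles imbalance by giving the rare class a larger misclassification penalty via `class_weight` (synthetic data as a stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Plain SVM vs. one that penalizes errors on the minority class (label 1) 10x more
clf_plain = SVC(kernel="linear").fit(X, y)
clf_weighted = SVC(kernel="linear", class_weight={1: 10}).fit(X, y)

# class_weight="balanced" derives the weights from class frequencies instead
clf_balanced = SVC(kernel="linear", class_weight="balanced").fit(X, y)
```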
### Dimensionality Reduction
* When the number of features becomes unwieldy, it may be necessary to consider dimension reduction techniques
* Some advantages:
* Enable exploratory data analysis (EDA) visualization in high-dimensional data
* Remove multicollinearity
* Eliminate redundant features
* Help deal with the curse of dimensionality
* Identify structure for supervised learning
#### PCA
* Decomposes a dataset with many features into orthogonal components that explain a maximal amount of the variance (see the sketch below)
* Other approaches are covered under manifold learning in the next section
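A minimal PCA sketch with scikit-learn (Iris used as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of the total variance explained by each orthogonal component
print(pca.explained_variance_ratio_)
```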

### Manifold learning
* Used mainly in the context of visualization
* Generalizes linear techniques such as PCA so that they are more sensitive to nonlinear structure in the data
* PCA and Independent Component Analysis (ICA) provide linear projections but do not capture non-linear structures
#### tSNE
* Converts similarities between data points into joint probabilities and minimizes the KL divergence between the low-dimensional embedding and high-dimensional data
* It often provides a different representation of high-dimensional data compared to linear techniques
* Usage: mainly for visualization; not common in pipelines because its cost function is non-convex, so different runs do not always yield the same result (see the sketch below)
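A small t-SNE sketch with scikit-learn, using the digits data as a stand-in; fixing `random_state` only pins one local optimum of the non-convex objective:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# 2-D embedding colored by digit label
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```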
### Topic Modeling - lab
* My code: https://drive.google.com/file/d/1irYCtWarYc-o8WtC8HS22oQk0PqHtNUV/view?usp=sharing
* **We use pyLDAvis==3.2.2 because the latest version of pyLDAvis doesn't have the `sklearn` module**
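Not the lab code itself, but a comparable sketch assuming `pyLDAvis==3.2.2` (which still ships `pyLDAvis.sklearn`) and 20 Newsgroups as a stand-in corpus:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.sklearn

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

# Document-term matrix of word counts
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
dtm = vectorizer.fit_transform(docs)

# Fit an LDA topic model with 10 topics
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(dtm)

# Build the interactive topic visualization and save it to HTML
panel = pyLDAvis.sklearn.prepare(lda, dtm, vectorizer)
pyLDAvis.save_html(panel, "lda_topics.html")
```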
## Outlier/bias detection
### AIF360 (tutorial)
* Source code: https://github.com/Trusted-AI/AIF360
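An assumed usage sketch (not taken from the tutorial): computing group fairness metrics with AIF360 on the UCI Adult dataset, with `sex` as the protected attribute. `AdultDataset` expects the raw `adult.*` files to be downloaded into aif360's data folder beforehand.

```python
from aif360.datasets import AdultDataset
from aif360.metrics import BinaryLabelDatasetMetric

dataset = AdultDataset()
privileged = [{"sex": 1}]      # encoded as 1 = Male
unprivileged = [{"sex": 0}]    # encoded as 0 = Female

metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=unprivileged,
                                  privileged_groups=privileged)

# Difference in favorable-outcome rates between the groups (0 means parity)
print("mean difference:", metric.mean_difference())
# Ratio of favorable-outcome rates (1 means parity)
print("disparate impact:", metric.disparate_impact())
```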
### Outlier detection
* Outlier: a data point that falls outside an expected distribution or pattern (such as one learned with classical statistics)
* Novelty detection: assumes the training data contain no outliers and flags new observations that deviate from them (outlier detection works the other way around: the training data themselves may contain outliers)
* Applications:
* Handling imbalanced datasets
* Monitoring data fed to a deployed model
#### Algorithms
* Elliptic envelope (outlier)
* It can perform poorly or break down in high-dimensional settings (apply a dimensionality-reduction technique such as PCA first)
* One class SVM (outlier + novelty)
* To define the border, a scalar parameter and a kernel must be selected
* Isolation Forest (outlier)
* Like a random forest, it is an ensemble of trees, but each tree isolates observations with random splits; points that are isolated after only a few splits are likely outliers
* Local Outlier Factor (outlier + novelty)
* Works well on high-dimensional datasets
=> Visualization is a crucial way to check whether an algorithm is behaving as expected (see the sketch below)
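A sketch comparing the four scikit-learn estimators above on toy 2-D data; the `contamination` / `nu` values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),    # inlier cloud
               rng.uniform(-6, 6, size=(10, 2))])  # a few scattered outliers

estimators = {
    "Elliptic Envelope": EllipticEnvelope(contamination=0.05),
    "One-Class SVM": OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1),
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=0),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
}

for name, est in estimators.items():
    labels = est.fit_predict(X)  # +1 = inlier, -1 = outlier
    print(f"{name}: {np.sum(labels == -1)} points flagged as outliers")
```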
## Unsupervised learning
* Unsupervised learning techniques can be beneficial throughout ***the design thinking process*** because they excel at summarizing and clustering data. Here's how they can be used in each stage:
* **Empathize:** Identify patterns in user data to better understand their needs and pain points
* **Define:** Cluster user feedback and pain points to define the problem statement more clearly
* **Ideate:** Analyze existing solutions and identify patterns to spark new ideas
* **Prototype:** Use clustering to group similar prototype features and test different variations
* **Test:** Analyze user feedback on prototypes to identify patterns and areas for improvement
* Overall, unsupervised learning helps designers make sense of large datasets, uncover hidden patterns, and generate more informed solutions.
### Clustering
#### Algorithms
* **Classical algorithms:**
* k-means: Groups data points into k clusters based on distance
* Hierarchical: Builds a hierarchy of clusters by iteratively merging or splitting them
* **Contemporary algorithms:**
* Spectral Clustering: Uses eigenvectors of a similarity matrix to identify clusters
* Affinity Propagation: Finds clusters by exchanging messages between data points
* **Mixture Models:**
* Assume the data come from a mixture of probability distributions and aim to characterize those distributions (GMM and DPGMM); see the sketch after this list
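A brief sketch of these families with scikit-learn on toy blob data; the hyperparameters are illustrative assumptions:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Classical: k-means with k = 3
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Contemporary: spectral clustering on a similarity graph of the points
spectral_labels = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X)

# Mixture model: Gaussian mixture with 3 components
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Dirichlet-process-style variant: unneeded components get near-zero weight,
# so the effective number of clusters need not be fixed in advance
bgmm = BayesianGaussianMixture(n_components=10, random_state=0).fit(X)
print(bgmm.weights_.round(3))
```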
#### Evaluation
* Silhouette score - range: [-1, 1] (see the sketch after this list)
* -1 - indicates incorrect clustering
* 0 - indicates highly overlapping clusters
* 1 - indicates dense, well-separated clusters
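A short sketch computing the silhouette score for k-means solutions with different numbers of clusters (scikit-learn, toy data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Higher silhouette suggests denser, better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")
```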
#### Observations
* No single clustering method outperforms all others
* Silhouette score can be misleading for certain cluster shapes
* Choosing the appropriate clustering method and evaluation metric depends on the specific dataset and problem
#### Additional Points
* The passage emphasizes the importance of visualizing clustering assignments to understand their strengths and weaknesses
* It highlights the flexibility of spectral clustering and GMM
* It mentions the existence of variants of GMMs that automatically determine the number of clusters