[Go back to main schedule page](https://hackmd.io/KhkZGZhyRt6pu4lbEHi6ow?view)
## Introduction
- Today we will create a classifier to detect patients with portosystemic shunts.
## Agenda
1. Introduction to basic tools
2. Data preprocessing
3. Model selection
4. Model training
5. Evaluation
6. Interpretation
## 1. Tools used today
- Jupyter provides an interactive computing environment for data analysis and collaboration.
- [Scikit-learn](https://scikit-learn.org/stable/) offers a wide range of machine learning algorithms and evaluation techniques.
- Pandas simplifies data manipulation and analysis tasks.
- NumPy provides efficient array operations and mathematical functions.
- [XGBoost](https://xgboost.readthedocs.io/en/stable/) is an open-source machine learning library that provides a gradient boosting framework.
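
To follow along, a quick check that all of these libraries are installed and importable (versions will differ on your machine):

```python
# Quick environment check: import each library used in this session and print its version.
# If an import fails, install the missing package, e.g. `pip install xgboost`.
import numpy as np
import pandas as pd
import sklearn
import xgboost as xgb

print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)
print("xgboost:", xgb.__version__)
```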
## 2. Data Preprocessing
- Data preprocessing plays a crucial role in machine learning as it helps prepare the data for model training.
- Key steps in data preprocessing include (a pandas/scikit-learn sketch follows this list):
- Handling missing values
- Dealing with outliers
- Feature selection or engineering
- Data normalization or scaling
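
A minimal sketch of these steps, assuming a hypothetical CSV `shunt_patients.csv` with clinical measurements and a binary `shunt` label (the file and column names are placeholders, not the real dataset):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical input: clinical measurements plus a binary "shunt" label column.
df = pd.read_csv("shunt_patients.csv")

# Handle missing values: drop columns that are mostly empty, then impute the rest.
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.3].index)
df = df.fillna(df.median(numeric_only=True))

# Deal with outliers: clip numeric features to the 1st-99th percentile range.
numeric_cols = df.select_dtypes(include="number").columns.drop("shunt")
df[numeric_cols] = df[numeric_cols].clip(
    lower=df[numeric_cols].quantile(0.01),
    upper=df[numeric_cols].quantile(0.99),
    axis=1,
)

# Scale the selected features (in practice, fit the scaler on the training split only).
X = pd.DataFrame(StandardScaler().fit_transform(df[numeric_cols]), columns=numeric_cols)
y = df["shunt"]
```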
## 3. Model Selection
- Model selection is the process of choosing the best machine learning algorithm and its hyperparameters for a given problem.
- It involves (see the sketch after this list):
- Evaluating different algorithms and their performance on the data
- Tuning hyperparameters to optimize model performance
- Selecting the best-performing model based on evaluation metrics
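
A sketch of comparing candidate algorithms with cross-validation; the synthetic data below stands in for the preprocessed shunt features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for the preprocessed training data.
X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "xgboost": XGBClassifier(eval_metric="logloss", random_state=0),
}

# Score each algorithm with 5-fold cross-validation and compare mean ROC AUC.
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```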
## 4. Model Training
- Model training involves training the machine learning model using labeled data to learn patterns and make predictions.
- Steps for model training using XGBoost (sketched in code after this list):
- Split the data into training and testing sets
- Define the problem as a classification task
- Set up the XGBoost model with appropriate hyperparameters
- Fit the model to the training data
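
A minimal training sketch following these steps; the hyperparameter values are illustrative, not tuned for the shunt data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the preprocessed features X and binary labels y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split the data into training and testing sets (stratified to preserve class balance).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Define the problem as binary classification and set illustrative hyperparameters.
model = XGBClassifier(
    objective="binary:logistic",
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    eval_metric="logloss",
    random_state=0,
)

# Fit the model to the training data.
model.fit(X_train, y_train)
```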
## 5. Evaluation
- Evaluating the performance of a machine learning model is essential to assess its effectiveness.
- Common evaluation metrics (accuracy, precision, recall, F1 score, ROC AUC, confusion matrix) are all provided by Scikit-learn (see the example below).
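
For example, continuing from the training sketch above (reusing the fitted `model` and the held-out `X_test`/`y_test`):

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

# Predict on the held-out test set and report common metrics.
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```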
## 6. Model Interpretation
- Model interpretation helps us understand the inner workings of a trained machine learning model and gain insights into its decision-making process.
- Some techniques for model interpretation include (the first two are sketched after this list):
- Feature permutation importance: It measures the impact of permuting a feature on the model's performance, indicating the feature's importance.
- XGBoost built-in feature importance: XGBoost provides a built-in method to calculate feature importance based on the number of times a feature is used to split the data across all trees in the ensemble.
- [SHAP (SHapley Additive exPlanations)](https://analyticsindiamag.com/a-complete-guide-to-shap-shapley-additive-explanations-for-practitioners/#:~:text=What%20is%20SHAP%3F-,SHAP%20or%20SHAPley%20Additive%20exPlanations%20is%20a%20visualization%20tool%20that,explainable%20by%20visualizing%20its%20output.) values: SHAP values provide a unified measure of feature importance that takes into account feature interactions and provides explanations for individual predictions.
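
Sketches of the first two techniques, continuing from the training sketch above and reusing `model`, `X_test`, and `y_test` (SHAP requires the separate `shap` package and is not shown here):

```python
from sklearn.inspection import permutation_importance

# Permutation importance: how much does shuffling each feature hurt the test score?
perm = permutation_importance(
    model, X_test, y_test, n_repeats=10, scoring="roc_auc", random_state=0
)
for idx in perm.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {perm.importances_mean[idx]:.4f} "
          f"+/- {perm.importances_std[idx]:.4f}")

# XGBoost built-in importance: number of times each feature is used to split the data
# ("weight"), aggregated across all trees in the ensemble.
print(model.get_booster().get_score(importance_type="weight"))
```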
## 7. Example: Creating a Shunt Classifier
- Let's walk through an example of creating a binary classifier using XGBoost:
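
Since the clinical dataset itself is not included in these notes, here is an end-to-end sketch on synthetic, imbalanced stand-in data; swap in the real preprocessed shunt features and labels:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic, imbalanced stand-in for the shunt dataset (1 = shunt, 0 = no shunt).
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=6, weights=[0.8, 0.2], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# scale_pos_weight compensates for the class imbalance during training.
model = XGBClassifier(
    objective="binary:logistic",
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    eval_metric="logloss",
    random_state=42,
)
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_proba = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, y_proba))
print(classification_report(y_test, model.predict(X_test)))
```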
## Session notes:
A classification task is a basic example of machine learning.
Machine learning uses data to teach a machine to classify or predict an outcome.
AI / deep learning / LLMs / object prediction are more data-intensive approaches.
OpenCV can be used for image processing. Using a pre-trained model can speed up training an algorithm for specific data/tasks.
Q: What are we looking for to know if our project is applicable for machine learning/AI?
Generally, anything with fewer than ~1000 [samples/embryos/seeds/patients] is prone to bias. Check for more than 25% missing data in any data category. Even when imputing missing data, there should be an arbitrary threshold, e.g. 30% missing or less.
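
For example, the missing-data fraction per column can be checked in pandas (the DataFrame and column names below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in DataFrame with some missing values.
df = pd.DataFrame({
    "bile_acid": [12.0, np.nan, 35.0, np.nan],
    "age": [2, 4, np.nan, 7],
    "shunt": [1, 0, 1, 0],
})

# Fraction of missing values per column, worst first.
missing_frac = df.isna().mean().sort_values(ascending=False)
print(missing_frac)

# Columns over an (arbitrary) 30% missing threshold are candidates to drop.
print(list(missing_frac[missing_frac > 0.30].index))
```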
Before training on the full dataset, use a smaller subset of the data to check/validate all tools on the data type. This allows for identification of the best model for the data.
Downsampling and oversampling can be used to reduce bias in an imbalanced dataset.
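
For example, one simple way to rebalance classes with scikit-learn's `resample` (toy data, oversampling the minority class):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: 6 negatives, 2 positives.
df = pd.DataFrame({"feature": range(8), "label": [0, 0, 0, 0, 0, 0, 1, 1]})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Oversample the minority class (with replacement) up to the majority size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts())
```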
Certain commands are faster than others. Think boolean indexing instead of the `drop()` command in Python.
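
For example (the DataFrame and the age cutoff are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": [2, 15, 4, 30], "shunt": [1, 0, 1, 0]})

# Instead of computing which rows to remove with drop():
#     df.drop(df[df["age"] > 10].index)
# keep the rows of interest directly with boolean indexing:
young = df[df["age"] <= 10]
print(young)
```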
Q: Is the answer really clear to trained professionals?
The model was run three times because the people in vet med were not sure of the results. That's why the pre-processing is very important: the data needs to be unbiased.
Pay attention to the bias-variance trade-off for real-world data.
If you are working on a deadly disease, consider whether you would rather have a higher false-positive or false-negative rate for your samples. Is it a breast cancer diagnosis, identifying a previous shunt, or segmentation of cell types?
Defining the desired results of the algorithm is an important first step, as this will define the pre-processing steps taken across the dataset.
To combat over-fitting, cross-validate the training-set results with k-fold cross-validation.
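
A sketch with synthetic stand-in data (5 folds is an arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for the training split only; the test set is not touched here.
X_train, y_train = make_classification(n_samples=400, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    XGBClassifier(eval_metric="logloss", random_state=0),
    X_train, y_train, cv=cv, scoring="roc_auc",
)
print(scores, scores.mean())
```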
Do not touch the test set until the final evaluation; using it earlier can create bias.
A grid, random, or Bayesian search can be used to tune the hyperparameters of the machine learning algorithm. There are existing packages/modules that automate hyperparameter tuning (with mixed results).
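
For example, a random search over a few XGBoost hyperparameters (the search space and synthetic data are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Synthetic stand-in for the training data.
X_train, y_train = make_classification(n_samples=400, n_features=10, random_state=0)

# Distributions to sample hyperparameters from.
param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(2, 8),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=0),
    param_distributions,
    n_iter=20,
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```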
[explainerdashboard](https://explainerdashboard.readthedocs.io/en/latest/) is a library for building interactive dashboards that explain scikit-learn-compatible machine learning models.
Q: What were some of the hardest parts of working on this project? What could the domain experts have done to help more?
The data was the hardest part. Understanding the data and having a cleaned dataset at the beginning would have been better.
Including medical images with this dataset could have assisted in the process.
Having the conversation of "here is the data you are giving me, and here is the data I need" helps. Shadowing the clinicians to understand the data more would also help.
---
## Some interesting tools and resources
- [paperswithcode](https://paperswithcode.com)
- [PyTorch](https://pytorch.org)
- [TensorFlow](https://www.tensorflow.org)
- [huggingface](https://huggingface.co)
---