# Machine Learning

## Overview

### What is Machine Learning?

![Screen Shot 2023-12-20 at 5.48.35 PM](https://hackmd.io/_uploads/B1uIkebDT.png)
[source](https://medium.com/machine-learning-for-humans)

![Screen Shot 2023-12-20 at 5.51.42 PM](https://hackmd.io/_uploads/rkWzxgWP6.png)
[source](https://www.datasciencecentral.com/profiles/blogs/artificial-intelligence-vs-machine-learning-vs-deep-learning)

### History of AI in Medicine

![Screen Shot 2023-12-20 at 5.53.27 PM](https://hackmd.io/_uploads/SJT_eg-P6.png)
`Kaul V, Enslin S, Gross SA. History of artificial intelligence in medicine. Gastrointest Endosc. 2020 Oct;92(4):807-812. doi: 10.1016/j.gie.2020.06.040. Epub 2020 Jun 18. PMID: 32565184.`

![Screen Shot 2023-12-20 at 5.55.05 PM](https://hackmd.io/_uploads/rkekWg-Dp.png)
`Beam AL, Kohane IS. Big Data and Machine Learning in Health Care. JAMA. 2018 Apr 3;319(13):1317-1318. doi: 10.1001/jama.2017.18391. PMID: 29532063.`

![Screen Shot 2023-12-20 at 5.55.54 PM](https://hackmd.io/_uploads/B1a-Ze-vp.png)
`Matheny M, Israni ST, Ahmed M, Whicher D (Eds.). 2019. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. Special Publication, National Academy of Medicine, Washington, DC.`

## Machine Learning Models

![Screen Shot 2023-12-21 at 10.36.36 AM](https://hackmd.io/_uploads/HkP5j0bDp.png)
[source](https://vitalflux.com/dummies-notes-supervised-vs-unsupervised-learning/)

## Stages of Machine Learning Workflows for EHRs

1. Define the machine learning question
2. Train the machine learning model
3. Internal validation of the model (cross-validation)
4. External validation of the model
5. Integrate the model into a health information system (HIS; e.g., EHR) as a clinical decision support (CDS) tool

## 1. Define the Machine Learning Question

- **Supervised machine learning**: you have a label (target) that you are interested in predicting.
  - You have a population of ICU patients and you want to predict the risk of mortality. Your label is mortality. Your features might include demographics, vital signs, medications, problems, medical history, medical comorbidities, and other risk factors.
  - You have a population of diabetic patients, and you want to predict their hemoglobin A1c. Your label is HbA1c.
- **Unsupervised machine learning**: you don't have a label or target; instead, you want to uncover hidden structure, such as clusters or subgroups.
  - You have a population of chronic pain patients, and you want to see if you can discover clinically relevant clusters or subgroups within your population.

### Classification vs. Regression

- **Classification**: there is a finite number of discrete classes (categories) into which your label can fall.
  - Binary: exactly two classes
    - Mortality (alive, deceased)
  - Multiclass: three or more classes
    - Destination after hospital discharge (home, short-term care, long-term care, deceased)
    - Cancer grading and staging
- **Regression**: there is not a finite number of possible values for your label.
  - HbA1c, systolic blood pressure, creatinine clearance

![Screen Shot 2023-12-21 at 10.40.50 AM](https://hackmd.io/_uploads/BJti3AZDa.png)
[source](https://www.springboard.com/blog/data-science/regression-vs-classification/)

![Screen Shot 2023-12-21 at 10.41.05 AM](https://hackmd.io/_uploads/HkFi2CWPT.png)
[source](https://www.geeksforgeeks.org/one-vs-rest-strategy-for-multi-class-classification/)

### Supervised Machine Learning Models

| Model | Binary Classification | Multiclass Classification | Regression |
| -------- | -------- | -------- | -------- |
| Generalized Linear Models (GLM) | Yes (logistic regression) | Yes (multinomial logistic regression) | Yes (linear regression) |
| Random Forests | Yes | Yes | Yes |
| Support Vector Machines (SVM) | Yes | Yes | Yes |
| Artificial Neural Networks (ANN) | Yes | Yes | Yes |
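The notes do not name a library, but for concreteness, here is how the four model families in the table map onto scikit-learn estimators (a sketch under the assumption of scikit-learn, not a prescription):

```python
# Each model family in the table has both classification and regression
# variants in scikit-learn (assumed library; the notes name no implementation).
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.neural_network import MLPClassifier, MLPRegressor

model_families = {
    # family name:   (classifier,                 regressor)
    "GLM":           (LogisticRegression(),       LinearRegression()),
    "Random Forest": (RandomForestClassifier(),   RandomForestRegressor()),
    "SVM":           (SVC(probability=True),      SVR()),
    "ANN":           (MLPClassifier(),            MLPRegressor()),
}
```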
## 2. Train the Machine Learning Model

1. Select one or more models to train and compare.
2. Randomly shuffle your data set and randomly split it into a training set (70%) and a testing set (30%).
3. Train each model on the training set.
4. Test the performance of each model on the testing set using one or more metrics (code sketches for steps 2-4 appear after the External Validation section below):
   a. Sensitivity - **type II error** (false negatives) - "rule out"
   b. Specificity - **type I error** (false positives) - "rule in"
   c. Receiver operating characteristic (ROC) curve
   d. Area under the ROC curve (AUROC)
5. Compare the models to each other.
6. Consider the performance of each model in the clinical context.

## 3. Internal Validation (Cross-Validation)

- Performed using your original data set
- Evaluates the robustness of the model
- Steps (sketched in code after the External Validation section):
  - Randomly shuffle your original data set and randomly split it into k subsets of equal size.
    - Common choice: k = 10 (10-fold cross-validation)
  - Train the model k times. For the i-th iteration, use the i-th subset as the testing set and all other subsets (combined together) as the training set.

![Screen Shot 2023-12-21 at 10.49.39 AM](https://hackmd.io/_uploads/BJXsRRWD6.png)
[source](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

## 4. External Validation

- During internal validation (cross-validation), you used the original data set (randomly partitioned into k subsets).
- During external validation, you use a **different data set**.
- During external validation, you do not train your model! Instead, you take the model that you have already trained and test it on the new data set.
- Examples of external validation:
  - Original data set: patients at RIH. New data set: patients from a hospital in California.
  - Original data set: ICU patients at RIH. New data set: hospital floor patients at RIH.
  - Original data set: all RIH patients from 2005 to 2010. New data set: all RIH patients from 2010 to 2015.
- External validation is important because it shows that your model is **generalizable** and **reproducible**.
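To make steps 2-4 concrete, here are minimal sketches (assuming Python with scikit-learn; `make_classification` stands in for a real EHR extract). First, step 2: shuffle, split 70/30, train, and compute the metrics listed above.

```python
# Step 2, a sketch: train/test split plus sensitivity, specificity, and AUROC.
# make_classification is a synthetic stand-in for real EHR features/labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Randomly shuffle and split: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} AUROC={auroc:.2f}")
```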
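Step 3, internal validation: 10-fold cross-validation on the original data set, sketched under the same assumptions.

```python
# Step 3, a sketch: 10-fold cross-validation on the original (synthetic) data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# k = 10: each fold serves once as the testing set; the rest form the training set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring="roc_auc")
print(f"AUROC per fold: {np.round(scores, 2)}; mean={scores.mean():.2f}")
```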
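Step 4, external validation: the already-trained model is tested on a different data set, with no retraining.

```python
# Step 4, a sketch: evaluate the already-trained model on an external data set.
# Note there is no .fit() call on the external data. A second synthetic data
# set with a different seed stands in for a different site or time period,
# so performance is expected to change.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_orig, y_orig = make_classification(n_samples=1000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_orig, y_orig)  # trained once, on the original site

X_ext, y_ext = make_classification(n_samples=500, n_features=20, random_state=7)  # "new" site
print(f"external AUROC: {roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]):.2f}")
```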
## 5. Integrate the Model into an HIS as a CDS Tool

- Authenticate the user (clinician, patient, family member)
  - **SMART on FHIR** apps: open standard for authenticating applications to an HIS
- Read data from the HIS
  - **Fast Healthcare Interoperability Resources (FHIR)**: open standard for reading data from an HIS (a read sketch appears after the example below)
  - **United States Core Data for Interoperability (USCDI)**: federal standard for select health care data fields
- Present results and recommendations back to the user

## Example: Semi-Automated Curation of the Problem List

- An inaccurate, incomplete, or outdated problem list can lead to suboptimal clinical care.
  - Suppose a patient has diabetes, but diabetes is not on the problem list. When reviewing the problem list in advance of seeing the patient, the clinician may not realize that the patient has diabetes, and thus may not prepare (in advance) the additional screenings and preventive care that diabetic patients require.
- The medication list can be used to automatically suggest changes to the problem list, such as additions, deletions, and modifications.
  - If a patient is regularly filling a prescription for metformin, there is a high probability that diabetes should be on the patient's problem list (if there is no other condition on the problem list for which metformin is indicated).

| Stage | Step | Details |
| -------- | -------- | -------- |
| 1 | Define the machine learning question: **Can we use medications to predict the probability that a patient has diabetes?** | Features: **medications** <br> Labels: **diabetes (yes or no)** <br> Supervised or unsupervised: **supervised** <br> Classification or regression: **classification** <br> Binary or multiclass: **binary** |
| 2 | Train the machine learning model | **Models**: GLM, random forest, SVM, ANN <br> **Metrics**: sensitivity, specificity, AUROC <br> **Training data set**: all primary care patients seen at Brown Medicine from 2010 to 2015 |
| 3 | Internal validation (cross-validation) | All primary care patients seen at Brown Medicine from 2010 to 2015 |
| 4 | External validation | All primary care patients seen at Kaiser Permanente from 2010 to 2015 |
| 5 | Integrate the model into a health information system (EHR, HIE, etc.) as a CDS tool | Build a SMART on FHIR app that integrates into the EHR and makes suggestions when a patient has a high probability of having diabetes but does not have diabetes on their problem list |
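A sketch of stages 1-2 of this example: medications become one-hot "does the patient take drug X" features, and diabetes is the binary label. The patients, medication names, and data below are fabricated for illustration, not Brown Medicine data.

```python
# A sketch: predict diabetes from the medication list (scikit-learn assumed).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

meds = ["metformin", "lisinopril", "insulin_glargine", "atorvastatin"]
patients = pd.DataFrame(
    [[1, 0, 0, 1],   # fabricated one-hot medication features, one row per patient
     [0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 0]],
    columns=meds,
)
has_diabetes = [1, 0, 1, 0]  # binary label, e.g., from chart review

model = RandomForestClassifier(random_state=0).fit(patients, has_diabetes)

new_patient = pd.DataFrame([[1, 0, 0, 0]], columns=meds)  # fills metformin only
print(model.predict_proba(new_patient)[0, 1])  # P(diabetes) to surface in the CDS tool
```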
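For stage 5, the app would read the medication list over FHIR. A minimal sketch follows; the base URL, patient ID, and token are hypothetical, and a real SMART on FHIR app would first obtain the token through the SMART authorization flow.

```python
# A hedged sketch of a FHIR read (Python `requests` assumed; endpoint is hypothetical).
import requests

FHIR_BASE = "https://ehr.example.org/fhir"  # hypothetical FHIR endpoint
token = "..."                               # obtained via the SMART on FHIR launch
headers = {"Authorization": f"Bearer {token}", "Accept": "application/fhir+json"}

# Fetch the patient's active medications (FHIR MedicationRequest resources)
resp = requests.get(
    f"{FHIR_BASE}/MedicationRequest",
    params={"patient": "12345", "status": "active"},
    headers=headers,
)
bundle = resp.json()  # a FHIR Bundle; each entry wraps one resource
for entry in bundle.get("entry", []):
    print(entry["resource"].get("medicationCodeableConcept", {}).get("text"))
```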
## Supervised Learning

- Labeled data sets
- Train, or "supervise," algorithms to classify data or predict outcomes accurately
- The model can measure its accuracy and learn over time
- Two main tasks: classification and regression

![Screen Shot 2023-12-21 at 11.01.19 AM](https://hackmd.io/_uploads/SJVdWJGva.png)

## Unsupervised Learning

- Unlabeled data sets
- Discovers hidden patterns in the data and extracts useful insights from it
- Finds the underlying structure of a data set, groups the data according to similarities, and represents the data set in a compressed format

![Screen Shot 2023-12-21 at 11.01.37 AM](https://hackmd.io/_uploads/SJJcZkzD6.png)

### Clustering

| Common Types of Clustering Algorithms |
| -------- |
| K-means |
| Hierarchical |
| Self-Organizing Maps (SOM) |
| DBSCAN |
| Gaussian Mixture Models |
| Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) |
| Affinity Propagation |
| Ordering Points to Identify the Clustering Structure (OPTICS) |

![Screen Shot 2023-12-21 at 11.07.55 AM](https://hackmd.io/_uploads/ryPlXJzw6.png)

::: info
**Advantages**
- Can tackle more complex tasks where labeled data are unavailable, since no labels are required
- Unlabeled data are easier to obtain than labeled data
- Can uncover previously unknown patterns in data sets, providing new insight into a topic such as a disease of interest

**Disadvantages**
- Used mostly for exploratory analysis, since there is no target output to predict
- Results are challenging to validate
- Can be less accurate, because the input data are unlabeled and the algorithm does not know the correct output in advance
:::

### K-Means

![Screen Shot 2024-01-16 at 4.45.13 PM](https://hackmd.io/_uploads/S1CxFuEKa.png)

### Self-Organizing Map (SOM)

![Screen Shot 2024-01-16 at 4.45.39 PM](https://hackmd.io/_uploads/SyEGYOVYT.png)

### Two-Level Clustering

![Screen Shot 2024-01-16 at 4.45.57 PM](https://hackmd.io/_uploads/HyBQKOVta.png)
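To make the clustering sections above concrete, a minimal K-means sketch (assuming scikit-learn; `make_blobs` stands in for real patient features, such as the chronic pain cohort mentioned earlier):

```python
# K-means, a sketch: partition unlabeled data into k clusters (scikit-learn assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # synthetic stand-in

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster assignment per patient
print(kmeans.cluster_centers_)  # one centroid per discovered subgroup
```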
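And a sketch of two-level clustering in the SOM-then-K-means style: level 1 trains a SOM whose nodes act as prototypes, and level 2 clusters those prototypes. The third-party `minisom` package is an assumption here; the notes do not prescribe a library.

```python
# Two-level clustering, a sketch: SOM (level 1) + K-means on SOM prototypes (level 2).
# Assumes the third-party `minisom` package (pip install minisom).
from minisom import MiniSom
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, n_features=4, centers=3, random_state=0)

# Level 1: a 6x6 SOM gives 36 prototype nodes summarizing the data
som = MiniSom(6, 6, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, num_iteration=1000)

# Level 2: cluster the 36 prototype vectors with K-means
prototypes = som.get_weights().reshape(-1, 4)  # shape (36, 4)
node_cluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(prototypes)

# Each patient inherits the cluster of its best-matching SOM node
winners = [som.winner(x) for x in X]           # (row, col) of the closest node
labels = [node_cluster[r * 6 + c] for r, c in winners]
print(labels[:10])
```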