Inès Krissaane
# Rare Disease Prediction

AI in Health & Care Study Group, June 26-28 2019

Industrial leader: Giovanni Charles

Academic leader: Thomas House

Group: David Haw, Diego Perez Ruiz, Clement Twumasi, Ines Krissaane, Xioxi Pang, Connor Toal

## Rare Disease Prediction

#### Objectives

We are interested in machine learning methods that are effective and secure on siloed data. There has been a lot of interest in Federated Learning and Zero-knowledge computation to tackle issues with disparate, sensitive data. Our hope is that methods like these could make machine learning feasible for rare disease prediction while preserving patient privacy.

This research raises questions about the level of privacy that can be guaranteed, whether there are methods that allow hospitals to moderate the data leaving their systems, and what restrictions (if any) this would place on predictive performance.

#### Questions

- Value of genetic information given phenotype: $\Pr(\text{Fabry} \mid X_{\text{gene}})$ is close to 1, but what about $\Pr(\text{Fabry} \mid \text{phenotype}_1, \dots, \text{phenotype}_n)$ and $\Pr(\text{phenotype}_1 \mid \text{phenotype}_n)$?
- Value of new phenotypic information given current phenotypic information
- Many covariates
- Uncertainty in risks
- Acceptable values of model accuracy/performance
- Hierarchical data: feature importance

#### Methods or Ideas

- Regression
- Random Forest (decision trees)
- Gradient Boosting
- Autoencoders (deep learning)
- Feature selection
- Lasso, Ridge, ElasticNet
- Fused priors
- Clustering
- PCA (dimensionality reduction)

## Data

100k genomes project https://www.genomicsengland.co.uk/about-genomics-england/the-100000-genomes-project/

Fabry Disease https://en.wikipedia.org/wiki/Fabry_disease

- Stroke
- Cardiac
- Acroparesthesia

# Presentation

## Introduction

A disease has prevalence $p\ll 1$ **within a given phenotype**.
\begin{eqnarray}
Pr(+\vert Disease)&=&\alpha\nonumber\\
Pr(-\vert Well)&=&\beta\nonumber\\
Pr(Disease\vert +)&=&\frac{Pr(+\vert Disease)Pr(Disease)}{Pr(+\vert Disease)Pr(Disease)+Pr(+\vert Well)Pr(Well)}\nonumber\\
&=&\frac{\alpha p}{\alpha p+(1-\beta)(1-p)}\nonumber
\end{eqnarray}

Here $\alpha$ is the sensitivity of the test and $\beta$ its specificity, so $Pr(+\vert Well)=1-\beta$ and $Pr(Well)=1-p$.

By updating $p$, conditional on new phenotypic information, we can improve accuracy.

Clinical support:

* "$X\%$ of people with this phenotype have this disease."
* "The main contributors are ..."
* "Try testing for ..."
* "It will improve accuracy to $Y\%$ if positive and $X\%$ if negative."

The data:

![](https://i.imgur.com/D0p1MsV.png)

Methods:

- Logistic regression
- Lasso, Ridge, ElasticNet
- Fused priors
- Random Forest (decision trees)
- K-means clustering
- Hierarchical agglomerative clustering

## 1/ Clustering

Clustering is a broad set of techniques for finding subgroups of observations within a data set. We can use an **unsupervised clustering-based anomaly/outlier detection approach** to detect implausible observations in EHR data and to identify specific structure in phenotypes.

![](https://i.imgur.com/Hz5bdIa.png)

#### Agglomerative clustering

Hierarchical clustering is a cluster analysis method that produces a tree-based representation (i.e. a dendrogram) of the data. Objects in the dendrogram are linked together based on their similarity.

Steps:

- Compute the (dis)similarity between every pair of objects in the data set. We chose the Euclidean distance.
- Use a linkage function to group objects into a hierarchical cluster tree, based on the distance information. We chose Ward's minimum variance method, which minimizes the total within-cluster variance.
- At each step, merge the pair of clusters with the minimum between-cluster distance.
- Determine where to cut the hierarchical tree into clusters. This creates a partition of the data.
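
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration on made-up 2-D points, not the code used during the study group: the names `ward_cost`, `ward_clustering` and `toy_points` are hypothetical, and the naive pairwise search is only practical for small data sets. The merge cost is the usual Ward criterion, i.e. the increase in within-cluster sum of squares caused by merging two clusters, and stopping at `n_clusters` plays the role of cutting the dendrogram.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clustering(points, n_clusters):
    """Naive agglomerative clustering: merge the cheapest pair until n_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per observation
    merges = []  # (merge cost, size of merged cluster), in merge order
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        cost = ward_cost(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        # combinations() yields i < j, so delete j first to keep index i valid
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
        merges.append((cost, len(merged)))
    return clusters, merges


# Hypothetical toy data: two well-separated groups of three points each
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
toy_clusters, toy_merges = ward_clustering(toy_points, n_clusters=2)
```

The recorded merge costs in `toy_merges` are the heights one would plot in a dendrogram; a large jump between consecutive costs suggests a natural place to cut.
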

One way to measure how well the generated cluster tree reflects the data is to compute the correlation between the cophenetic distances and the original distance data. If the clustering is valid, the linking of objects in the cluster tree should correlate strongly with the distances between objects in the original distance matrix.

One of the problems with hierarchical clustering is that it does not tell us how many clusters there are, or where to cut the dendrogram to form clusters.

![](https://i.imgur.com/Wom3d4U.png)

![](https://i.imgur.com/PmgVoEZ.png)

#### K-means

K-means clustering is an unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e. k clusters), where k is the number of groups pre-specified by the analyst. It classifies objects into multiple groups (i.e. clusters), such that objects within the same cluster are as similar as possible (i.e. high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e. low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e. centroid), which corresponds to the mean of the points assigned to the cluster.

Steps:

- Specify the number of clusters (k).
- Randomly select k objects from the data set as the initial cluster centers or means.
- Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.
- For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in the kth cluster; p is the number of variables.
- Iteratively minimize the total within-cluster sum of squares. Iterate until the cluster assignments stop changing or the maximum number of iterations is reached.

We can use the **Elbow method** to determine the optimal number of clusters (here we chose k = 10).
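
As a rough illustration of the k-means steps and of the quantity plotted by the Elbow method, here is a minimal pure-Python sketch. It is not the implementation used for the figures: the function `kmeans`, its `seed` and `max_iter` parameters, and the toy points are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (assignment, centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k observations as initial centers
    assignment = None
    for _ in range(max_iter):
        # Assign each observation to its closest centroid
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Update each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_point(members)
    wcss = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, wcss


# Hypothetical toy data: two well-separated groups of three points each
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]

# Elbow method: total within-cluster sum of squares for a range of k
elbow = {k: kmeans(toy_points, k)[2] for k in range(1, 5)}
labels, centers, _ = kmeans(toy_points, 2)
```

Plotting `elbow` against k and looking for the point where the curve flattens is the Elbow heuristic; on this toy data the drop from k = 1 to k = 2 dominates, which matches the two groups built into the points.
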

![](https://i.imgur.com/Vr3QFKb.png)

All patients are represented by points in the plot, using the principal components and the results of the k-means method:

![](https://i.imgur.com/kqQ24qJ.png)

![](https://i.imgur.com/ffBd9gC.png)

We can compare the two cluster distributions obtained with k-means for the two groups using a Chi-square goodness-of-fit test. (H0: the cluster distribution of the patients with the disease follows the same distribution as that of the group without the disease.)

![](https://i.imgur.com/1nMW3LL.png)

*To go further*: K-means Clustering via Principal Component Analysis http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf

**Papers**:

- A Clustering Approach for Detecting Implausible Observation Values in Electronic Health Records Data. https://www.biorxiv.org/content/biorxiv/early/2019/03/07/570564.full.pdf
- Flexible, cluster-based analysis of the electronic medical record of sepsis with composite mixture models. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6014629/
- Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis. https://www.ncbi.nlm.nih.gov/pubmed/24323995
- Electronic Health Record Based Algorithm to Identify Patients with Autism Spectrum Disorder. https://www.ncbi.nlm.nih.gov/pubmed/27472449
- Generalized Louvain Method for Community Detection in Large Networks. https://www.ncbi.nlm.nih.gov/pubmed/27472449
- Detection of gene communities in multi-networks reveals cancer drivers. https://www.nature.com/articles/srep17386
- Hennig, C. (2004). Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930-945.
- Hennig, C. (2005). A method for visual cluster validation. In: Weihs, C. and Gaul, W. (eds.): Classification - The Ubiquitous Challenge. Springer, Heidelberg, 153-160.
- Seber, G. A. F. (1984). Multivariate Observations. New York: Wiley.

A nice tutorial: http://girke.bioinformatics.ucr.edu/GEN242/pages/mydoc/Rclustering.html

## 2/ Variable Selection

This section discusses the implementation of the stability selection procedure to estimate the probabilities of selection of covariates for the sparse PLS method. It is based on the paper by Meinshausen and Bühlmann (2010).

### Implementation in R

To implement this procedure in R, we use the function *logit.spls.stab* from the **library("plsgenomics")**. We use the bigger dataset, containing all of the simulated data. We evaluate a grid of hyper-parameter values on multiple sub-samplings of the data. The stability selection procedure selects the covariates that are selected by most of the models among the grid of hyper-parameters. We vary $\lambda \in \{0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95\}$. Results on the bigger dataset are shown in the figure below.

![](https://i.imgur.com/g5LZa0a.png)

#### Working with a reduced version of the dataset

To demonstrate the implementation of different classification methods, we use a reduced version of the bigger dataset containing only 10% of all the cases. First, we select the covariates to use in our classifiers with the stability selection procedure described above.

![](https://i.imgur.com/5Nrc5eE.png)

### Decision Trees

A decision tree is a tree-shaped diagram that is used to determine a course of action. Each branch of the tree represents a possible decision, occurrence or reaction.

Implementation in R:

![](https://i.imgur.com/Fh8zLdy.png)

#### Confusion Matrix and Statistics

| Prediction/Reference | 0 | 1 |
| -------- | -------- | -------- |
| 0 | 6971 | 163 |
| 1 | 7 | 630 |

- Accuracy: 0.9781
- 95% CI: (0.9746, 0.9813)
- No Information Rate: 0.898
- P-Value [Acc > NIR]: < 2.2e-16

## Reference:

Stability selection by Nicolai Meinshausen and Peter Bühlmann https://rss.onlinelibrary.wiley.com/doi/epdf/10.1111/j.1467-9868.2010.00740.x

## Resources

#### Imbalanced data

- https://pdfs.semanticscholar.org/c1a9/5197e15fa99f55cd0cb2ee14d2f02699a919.pdf

#### Machine Learning and Deep Learning

- https://www.researchgate.net/publication/326260671_Robust_ensemble_learning_to_identify_rare_disease_patients_from_electronic_health_records
- https://towardsdatascience.com/extreme-rare-event-classification-using-autoencoders-in-keras-a565b386f098
- https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f

#### Related Researchers and papers

Richard Samworth http://www.statslab.cam.ac.uk/~rjs57/

- Paper: Sparse principal component analysis via axis-aligned random projections
- Link: https://arxiv.org/pdf/1712.05630.pdf
- R package: https://cran.r-project.org/web/packages/SPCAvRP/SPCAvRP.pdf

More papers:

- https://web.stanford.edu/group/SOL/papers/fused-lasso-JRSSB.pdf
- http://www.statslab.cam.ac.uk/~rjs57/rssb_1034.pdf
- https://www.di.ens.fr/~fbach/fbach_bolasso_icml2008.pdf

## 3/ Testing

We tested a quick model that works with a simple representation of a patient on a real patient dataset. The Genomics England Research dataset was the most feasible to access over the study period.

### Cohort selection

This was a multi-step process to filter the 100k participants down to cohorts of patients that are appropriate for the rare disease classification task.

|Step|Description|
|----|-----------|
|Select diagnosed patients|Select patients that have a `pathogenic` or `likely pathogenic` causative variant reported through the GeL programme|
|Select patients with strong genomic evidence for a recessive disease|To increase our cohort size, we also assessed patients who had not yet been diagnosed. We created a pipeline to select patients who had a variant which: <ol><li>Was "tiered" according to the GeL bioinformatics pipeline</li><li>Was reported as clinically significant on ClinVar</li><li>Was associated with a recessive disease</li><li>Was compound heterozygous</li></ol>|
|Select patients with clinically relevant diagnoses|We limited the scope of our study to diagnoses that would be clinically relevant to doctors and are likely to have an economic impact. These were our criteria: <ul><li>Acute diseases should be excluded, since their corresponding medical records are unlikely to have much of a phenotypic history</li><li>Simple diagnoses should be excluded. These diagnoses would more likely be picked up earlier in the patient's journey. Trivial diagnoses also provide limited value to clinicians</li><li>The diagnoses must be confirmable by genetic testing. This validates the strength of the previous step.</li></ul>|
|Select feasible cohorts|We then picked diseases with cohorts larger than 100 patients, to put a lower bound on the training data available for a model|

### Results

We then trained and validated our model on the remaining cohorts. We randomly sampled GeL participants to create negative data points.
The results are below:

|Diagnoses|Genes|Cohort size|Accuracy|
|---------|-----|-----------|--------|
|Cystic fibrosis|CFTR|139|0.59|
|Polycystic Kidney Disease|PKD1, PKD2, PKHD1|126|0.68|
|Mitochondrial DNA depletion syndrome, Spinocerebellar ataxia with epilepsy, Alpers-Huttenlocher syndrome, Mitochondrial neurogastrointestinal encephalopathy|POLG|129|0.63|
|Usher Syndrome Type II|USH2A, ADGRV1|325|0.75|
|Cohen Syndrome|VPS13B|149|0.59|

## 4/ Next steps

* Run intensive models on the GeL dataset
* Write research proposals to BioBank and the GeL dataset
* Create a framework for integration with health economic models
* Apply for cloud credits for computation
* Find an academic leader for publishing
