# Rare Disease Prediction
AI in Health & Care Study Group
June 26-28 2019
Industrial leader : Giovanni Charles
Academic leader : Thomas House
Group :
David Haw
Diego Perez Ruiz
Clement Twumasi
Ines Krissaane
Xioxi Pang
Connor Toal
## Rare Disease Prediction
#### Objectives
We are interested in machine learning methods that are effective and secure on siloed data. There has been a lot of interest in Federated Learning and Zero-knowledge computation as ways to tackle issues with disparate, sensitive data. Our hope is that methods like these could make machine learning feasible for rare disease prediction while preserving patient privacy. This research raises three questions: what level of privacy can be guaranteed; whether there are methods that allow hospitals to moderate the data leaving their systems; and what restrictions (if any) this would place on predictive performance.
#### Questions
- Value of genetic information GIVEN phenotype
  $\Pr(\text{Fabry} \mid X_{\text{gene}})$ is close to 1.
  What about $\Pr(\text{Fabry} \mid \text{phenotype}_1, \ldots, \text{phenotype}_n)$ and $\Pr(\text{phenotype}_1 \mid \text{phenotype}_n)$?
- Value of new phenotypic information GIVEN current phenotypic information
- Many covariates
- Uncertainty in risks
- Acceptable values of model accuracy/performance
- Hierarchical data : feature importance
#### Methods or Ideas
- Regression
- Random Forest (decision trees)
- Gradient Boosting
- Autoencoders (deep learning)
- Feature selection
- Lasso, Ridge, ElasticNet
- Fused priors
- Clustering
- PCA (dimensionality reduction)
## Data
100k genomes project
https://www.genomicsengland.co.uk/about-genomics-england/the-100000-genomes-project/
Fabry Disease
https://en.wikipedia.org/wiki/Fabry_disease
- Stroke
- Cardiac
- Acroparesthesia
# Presentation
## Introduction
A disease has prevalence $p\ll 1$ **within a given phenotype**.
\begin{eqnarray}
\Pr(+\mid \text{Disease})&=&\alpha\nonumber\\
\Pr(-\mid \text{Well})&=&\beta\nonumber\\
\Pr(\text{Disease}\mid +)&=&\frac{\Pr(+\mid \text{Disease})\Pr(\text{Disease})}{\Pr(+\mid \text{Disease})\Pr(\text{Disease})+\Pr(+\mid \text{Well})\Pr(\text{Well})}\nonumber\\
&=&\frac{\alpha p}{\alpha p+(1-\beta)(1-p)}\nonumber
\end{eqnarray}
By updating $p$, conditional on new phenotypic information, we can improve accuracy.
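
As a small numerical illustration of the formula above, the sketch below evaluates $\Pr(\text{Disease}\mid +)$ for a few prevalences. Here $\alpha$ is the sensitivity and $\beta$ the specificity of the test; the function name and the numerical values are illustrative assumptions, not figures from the study.

```python
def posterior_given_positive(p, alpha, beta):
    """Pr(Disease | +) for prevalence p, sensitivity alpha and specificity beta."""
    true_positive = alpha * p                    # Pr(+ | Disease) Pr(Disease)
    false_positive = (1.0 - beta) * (1.0 - p)    # Pr(+ | Well) Pr(Well)
    return true_positive / (true_positive + false_positive)


# Illustrative values only (assumed, not taken from the data).
alpha, beta = 0.95, 0.95
for p in (0.0001, 0.01, 0.2):
    post = posterior_given_positive(p, alpha, beta)
    print(f"prevalence p = {p:<7} Pr(Disease | +) = {post:.4f}")
```

With $\alpha$ and $\beta$ held fixed, the posterior rises with $p$, which is why conditioning on a more specific phenotype (a larger $p$) can make a positive result more informative.
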
Clinical support:
* "$X\%$ of people with this phenotype have this disease."
* "The main contributors are ..."
* "Try testing for ..."
* "It will improve accuracy to $Y\%$ if positive and $Z\%$ if negative."
The data:
![](https://i.imgur.com/D0p1MsV.png)
Methods:
- Logistic regression
- Lasso, Ridge, ElasticNet
- Fused priors
- Random Forest (decision trees)
- K-means clustering
- Hierarchical agglomerative clustering
## 1/ Clustering
Clustering is a broad set of techniques for finding subgroups of observations within a data set. We can use an **unsupervised clustering-based anomaly/outlier detection approach** to detect implausible observations in EHR data and to identify structure within phenotypes.
![](https://i.imgur.com/Hz5bdIa.png)
#### Agglomerative clustering
Hierarchical clustering is a cluster analysis method that produces a tree-based representation (i.e. a dendrogram) of the data. Objects in the dendrogram are linked together based on their similarity.
Steps :
- Compute the (dis)similarity between every pair of objects in the data set. We chose the Euclidean distance.
- Use a linkage function to group objects into a hierarchical cluster tree, based on the distance information. We chose Ward's minimum variance method, which minimizes the total within-cluster variance.
- At each step, the pair of clusters with the minimum between-cluster distance is merged.
- Determine where to cut the hierarchical tree into clusters. This creates a partition of the data.

One way to measure how well the cluster tree reflects the data is to compute the correlation between the cophenetic distances and the original distances. If the clustering is valid, the linking of objects in the cluster tree should correlate strongly with the distances between objects in the original distance matrix.
One of the problems with hierarchical clustering is that it does not tell us how many clusters there are, or where to cut the dendrogram to form clusters.
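
The merging steps above can be sketched in a few lines of plain Python. This is a naive illustration of Ward's criterion with Euclidean distance, not the implementation used for the figures; the function name `ward_clusters` is an assumption, and stopping at `n_clusters` plays the role of cutting the dendrogram at a chosen number of clusters.

```python
def ward_clusters(points, n_clusters):
    """Greedy agglomerative clustering with Ward's minimum variance criterion.

    Each cluster is stored as (size, centroid, member indices). At every step the
    pair whose merge gives the smallest increase in within-cluster sum of squares
    is merged, until n_clusters clusters remain.
    """
    clusters = [(1, list(p), [i]) for i, p in enumerate(points)]

    def merge_cost(a, b):
        # Ward's increase in within-cluster sum of squares:
        # n_a * n_b / (n_a + n_b) * ||centroid_a - centroid_b||^2
        sq_dist = sum((x - y) ** 2 for x, y in zip(a[1], b[1]))
        return a[0] * b[0] / (a[0] + b[0]) * sq_dist

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        size = a[0] + b[0]
        # The centroid of the merged cluster is the size-weighted mean of the two centroids.
        centroid = [(a[0] * x + b[0] * y) / size for x, y in zip(a[1], b[1])]
        merged = (size, centroid, a[2] + b[2])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    return [sorted(c[2]) for c in clusters]


# Toy data (assumed): two well-separated groups of points.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(ward_clusters(toy, n_clusters=2))
```

The pairwise search makes this cubic in the number of observations, so it is only suitable for small examples; a library implementation would be used on real EHR data.
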
![](https://i.imgur.com/Wom3d4U.png)
![](https://i.imgur.com/PmgVoEZ.png)
#### K-means
K-means clustering is an unsupervised machine learning algorithm for partitioning a given data set into k groups (i.e. k clusters), where k is the number of groups pre-specified by the analyst. It assigns objects to clusters such that objects within the same cluster are as similar as possible (i.e. high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e. low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e. centroid), which corresponds to the mean of the points assigned to the cluster.
Steps :
- Specify the number of clusters (k).
- Randomly select k objects from the data set as the initial cluster centers or means.
- Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.
- For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in that cluster; p is the number of variables.
- Iteratively minimize the total within-cluster sum of squares. Iterate until the cluster assignments stop changing or the maximum number of iterations is reached.

We can use the **Elbow method** to determine the optimal number of clusters (here we chose k = 10).
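
The steps above can be sketched in plain Python as follows. This is a minimal illustration of the standard (Lloyd) iteration, not the code used for the figures; the function name, the `seed` argument and the toy data are assumptions.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids, wcss) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Randomly select k observations as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign each observation to its closest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update each centroid to the mean of the observations assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Total within-cluster sum of squares for the final assignment.
    wcss = sum(
        sum((x - m) ** 2 for x, m in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )
    return labels, centroids, wcss


# Toy data (assumed): two well-separated groups of points.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids, wcss = kmeans(toy, k=2)
print(labels, centroids, wcss)
```

The returned `wcss` is the total within-cluster sum of squares that the last step minimizes, so running the function for several values of k gives the curve that is inspected when choosing k.
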
![](https://i.imgur.com/Vr3QFKb.png)
All patients are represented as points in the plots below, using the principal components and the results of the k-means method:
![](https://i.imgur.com/kqQ24qJ.png)
![](https://i.imgur.com/ffBd9gC.png)
We can compare the cluster distributions obtained with k-means for the two groups using a Chi-square goodness-of-fit test (H0: the cluster distribution for the patients with the disease follows the same distribution as for the group without the disease).
![](https://i.imgur.com/1nMW3LL.png)
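
A minimal sketch of such a goodness-of-fit test is given below. It treats the cluster proportions of the group without the disease as the reference distribution, as in H0 above. The function names and the counts are illustrative assumptions, and the p-value uses a simple series expansion that is only meant for moderate values of the statistic.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with df degrees of freedom.

    Uses the series expansion of the regularized lower incomplete gamma function;
    p-values far below about 1e-15 are not resolved and are returned as 0.
    """
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= half_x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def cluster_goodness_of_fit(disease_counts, control_counts):
    """Chi-square statistic, degrees of freedom and p-value for the disease group,
    with expected counts taken from the control group's cluster proportions.
    Every control count must be positive."""
    n_disease = sum(disease_counts)
    n_control = sum(control_counts)
    expected = [n_disease * c / n_control for c in control_counts]
    stat = sum((o - e) ** 2 / e for o, e in zip(disease_counts, expected))
    df = len(disease_counts) - 1
    return stat, df, chi2_sf(stat, df)


# Made-up cluster counts for three clusters (not study results).
stat, df, p_value = cluster_goodness_of_fit([30, 10, 20], [400, 300, 300])
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.4f}")
```

A two-sample test of homogeneity on the full contingency table would be a reasonable alternative when the control proportions are themselves estimated from a small sample.
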
*To go further* :
K-means Clustering via Principal Component Analysis
http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf
**Papers**:
- A Clustering Approach for Detecting Implausible Observation Values in Electronic Health Records Data https://www.biorxiv.org/content/biorxiv/early/2019/03/07/570564.full.pdf
- Flexible, cluster-based analysis of the electronic medical record of sepsis with composite mixture models - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6014629/
- Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis - https://www.ncbi.nlm.nih.gov/pubmed/24323995
- Electronic Health Record Based Algorithm to Identify Patients with Autism Spectrum Disorder - https://www.ncbi.nlm.nih.gov/pubmed/27472449
- Generalized Louvain Method for Community Detection in Large Networks
https://www.ncbi.nlm.nih.gov/pubmed/27472449
- Detection of gene communities in multi-networks reveals cancer drivers - https://www.nature.com/articles/srep17386
- Hennig, C. (2004). Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930-945.
- Hennig, C. (2005). A method for visual cluster validation. In: Weihs, C. and Gaul, W. (eds.): Classification - The Ubiquitous Challenge. Springer, Heidelberg, 153-160.
- Seber, G. A. F. (1984). Multivariate Observations. New York: Wiley.
A nice tutorial:
http://girke.bioinformatics.ucr.edu/GEN242/pages/mydoc/Rclustering.html
## 2/ Variable Selection
This section discusses the implementation of the stability selection procedure to estimate the selection probabilities of covariates for the sparse PLS method. It is based on the paper by Meinshausen and Bühlmann (2010).
### Implementation in R
To implement this procedure in R, we use the function *logit.spls.stab* from **library("plsgenomics")**. We use the bigger simulated dataset. For each candidate value of the hyper-parameters, the model is fitted on multiple sub-samplings of the data.
The stability selection procedure keeps the covariates that are selected by most of the models across the grid of hyper-parameters. We vary $\lambda \in \{0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95\}$. Results on the bigger dataset are shown in the figure below.
![](https://i.imgur.com/g5LZa0a.png)
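
The resampling logic of stability selection can be sketched independently of the base model. The Python sketch below is not the plsgenomics implementation: it swaps the sparse PLS fit for a simple stand-in selector that keeps covariates whose absolute correlation with the outcome is at least $\lambda$. The function names, the number of resamplings and the 0.75 threshold are assumptions chosen for illustration.

```python
import math
import random

LAMBDAS = [0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95]  # grid quoted in the text


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def select(X, y, lam):
    """Stand-in sparse selector (assumption): keep covariate j if |corr(x_j, y)| >= lam."""
    return {j for j in range(len(X[0])) if abs(pearson([row[j] for row in X], y)) >= lam}


def stability_selection(X, y, lambdas=LAMBDAS, n_resamp=50, threshold=0.75, seed=0):
    """Return (max selection probability per covariate, list of stable covariates)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = {lam: [0] * p for lam in lambdas}
    for _ in range(n_resamp):
        # Subsample half of the observations without replacement.
        idx = rng.sample(range(n), n // 2)
        Xs, ys = [X[i] for i in idx], [y[i] for i in idx]
        for lam in lambdas:
            for j in select(Xs, ys, lam):
                counts[lam][j] += 1
    # Selection probability of each covariate, maximized over the lambda grid.
    max_prob = [max(counts[lam][j] / n_resamp for lam in lambdas) for j in range(p)]
    stable = [j for j in range(p) if max_prob[j] >= threshold]
    return max_prob, stable
```

Calling `stability_selection(X, y)` with `X` as a list of covariate rows and `y` as a list of 0/1 outcomes returns the estimated selection probabilities, which correspond to the kind of quantity plotted in the figure above.
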
#### Working with a reduced version of the dataset
To demonstrate the implementation of different classification methods, we use a reduced version of the bigger dataset containing only 10% of all the cases. First, we select the covariates to use in our classifiers with the stability selection procedure described above.
![](https://i.imgur.com/5Nrc5eE.png)
### Decision Trees
A decision tree is a tree-shaped diagram that is used to determine a course of action. Each branch of the tree represents a possible decision, occurrence or reaction.
Implementation in R:
![](https://i.imgur.com/Fh8zLdy.png)
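
To illustrate how a single node of such a tree can be chosen, the sketch below searches for the binary split with the lowest weighted Gini impurity, as CART-style trees commonly do. It is a Python illustration under that assumption, not the R code behind the figure; the function names and toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2.0 * p1 * (1.0 - p1)


def best_split(X, y):
    """Return (feature index, threshold, weighted Gini impurity) of the best binary split."""
    best = None
    n = len(y)
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[j] <= thr]
            right = [lab for row, lab in zip(X, y) if row[j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, thr, score)
    return best


# Toy data (assumed): the first covariate separates the two classes perfectly.
X = [[0, 5], [1, 3], [2, 8], [3, 1]]
y = [0, 0, 1, 1]
print(best_split(X, y))
```

A full tree repeats this search recursively on the left and right subsets until a stopping rule is met.
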
#### Confusion Matrix and Statistics
| Prediction/Reference | 0 | 1 |
| -------- | -------- | -------- |
| 0| 6971 | 163 |
| 1 | 7 | 630 |

- Accuracy : 0.9781
- 95% CI : (0.9746, 0.9813)
- No Information Rate : 0.898
- P-Value [Acc > NIR] : < 2.2e-16
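
The headline figures can be recomputed directly from the confusion matrix. The sketch below also derives sensitivity, specificity and precision, assuming that class 1 is the disease class (the table itself does not say which class is treated as positive).

```python
# Counts from the confusion matrix above (rows: prediction, columns: reference).
tn, fn = 6971, 163  # predicted 0: reference 0, reference 1
fp, tp = 7, 630     # predicted 1: reference 0, reference 1

total = tn + fn + fp + tp
accuracy = (tn + tp) / total
# No Information Rate: accuracy of always predicting the most frequent reference class.
no_information_rate = max(tn + fp, fn + tp) / total
sensitivity = tp / (tp + fn)  # share of reference-1 cases that were predicted 1
specificity = tn / (tn + fp)  # share of reference-0 cases that were predicted 0
precision = tp / (tp + fp)    # share of predicted-1 cases that were reference 1

print(f"accuracy            = {accuracy:.4f}")
print(f"no information rate = {no_information_rate:.3f}")
print(f"sensitivity         = {sensitivity:.3f}")
print(f"specificity         = {specificity:.3f}")
print(f"precision           = {precision:.3f}")
```

Because roughly 90% of the reference labels are 0, accuracy alone is flattering here; under the assumption above, the sensitivity for class 1 is about 0.79.
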
## Reference:
Stability selection by Nicolai Meinshausen and Peter Bühlmann
https://rss.onlinelibrary.wiley.com/doi/epdf/10.1111/j.1467-9868.2010.00740.x
## Resources
#### Imbalanced data
https://pdfs.semanticscholar.org/c1a9/5197e15fa99f55cd0cb2ee14d2f02699a919.pdf
#### Machine Learning and Deep Learning
https://www.researchgate.net/publication/326260671_Robust_ensemble_learning_to_identify_rare_disease_patients_from_electronic_health_records
https://towardsdatascience.com/extreme-rare-event-classification-using-autoencoders-in-keras-a565b386f098
https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f
#### Related Researchers and papers
Richard Samworth
http://www.statslab.cam.ac.uk/~rjs57/
Paper: Sparse principal component analysis via axis-aligned random projections
Link: https://arxiv.org/pdf/1712.05630.pdf
R package: https://cran.r-project.org/web/packages/SPCAvRP/SPCAvRP.pdf
More papers: https://web.stanford.edu/group/SOL/papers/fused-lasso-JRSSB.pdf
http://www.statslab.cam.ac.uk/~rjs57/rssb_1034.pdf
https://www.di.ens.fr/~fbach/fbach_bolasso_icml2008.pdf
## 3/ Testing
We tested a quick model, using a simple representation of each patient, on a real patient dataset. The Genomics England Research dataset was the most feasible to access over the study period.
### Cohort selection
This was a multi-step process to filter from the 100k participants to cohorts of patients that are appropriate for the rare disease classification task.
|Step|Description|
|----|-----------|
|Select diagnosed patients|Select patients that have a `pathogenic` or `likely pathogenic` causative variant reported through the GeL programme|
|Select patients with strong genomic evidence for a recessive disease|To increase our cohort size, we also assessed patients who had not yet been diagnosed. We created a pipeline to select patients who had a variant which: <ol><li>Was "tiered" according to the GeL bioinformatics pipeline</li><li>Was reported as clinically significant on ClinVar</li><li>Was associated with a recessive disease</li><li>Was compound heterozygous</li></ol>|
|Select patients with clinically relevant diagnoses|We limited the scope of our study to diagnoses that would be clinically relevant to doctors and are likely to have an economic impact. These were our criteria: <ul><li>Acute diseases should be excluded, since their corresponding medical records are unlikely to have much of a phenotypic history</li><li>Simple diagnoses should be excluded. These diagnoses would more likely be picked up earlier in the patient's journey. Trivial diagnoses also provide limited value to clinicians</li><li>The diagnoses must be confirmable by genetic testing. This validates the strength of the previous step.</li></ul>|
|Select feasible cohorts|We then picked diseases with cohorts larger than 100 patients, to put a lower bound on the training data available for a model|
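
The filtering steps in the table can be read as a chain of predicates followed by a cohort-size cut. The sketch below is one possible reading, with the first two steps combined as a union. The record layout and every field name are hypothetical and do not come from the GeL data model.

```python
from collections import Counter


def is_eligible(patient):
    """Apply the first three steps to one patient record (a dict with hypothetical keys)."""
    diagnosed = patient["variant_class"] in {"pathogenic", "likely pathogenic"}
    strong_recessive_evidence = (
        patient["tiered"]
        and patient["clinvar_significant"]
        and patient["recessive_disease"]
        and patient["compound_heterozygous"]
    )
    clinically_relevant = (
        not patient["acute"]
        and not patient["simple_diagnosis"]
        and patient["confirmable_by_genetic_test"]
    )
    return (diagnosed or strong_recessive_evidence) and clinically_relevant


def feasible_cohorts(patients, min_size=100):
    """Keep diseases whose eligible cohort is larger than min_size patients."""
    counts = Counter(p["disease"] for p in patients if is_eligible(p))
    return {disease: n for disease, n in counts.items() if n > min_size}
```

Here `feasible_cohorts(patients)` returns a mapping from disease name to cohort size for the diseases that pass all four steps.
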
### Results
We then trained and validated our model on the remaining cohorts. We randomly sampled GeL participants to create negative data points. The results are below:
|Diagnoses|Genes|Cohort size|Accuracy|
|---------|-----|-----------|--------|
|Cystic fibrosis|CFTR|139|0.59|
|Polycystic Kidney Disease|PKD1, PKD2, PKHD1|126|0.68|
|Mitochondrial DNA depletion syndrome, Spinocerebellar ataxia with epilepsy, Alpers-Huttenlocher syndrome, Mitochondrial neurogastrointestinal encephalopathy|POLG|129|0.63|
|Usher Syndrome Type II|USH2A, ADGRV1|325|0.75|
|Cohen Syndrome|VPS13B|149|0.59|
## 4/ Next steps
* Run intensive models on GeL dataset
* Write research proposals to BioBank and the GeL dataset
* Create framework for integration with health economic models
* Apply for cloud credits for computation
* Find an academic leader for publishing