# Missing value estimation methods for DNA microarrays
[TOC]
## Introduction
<div style="text-align: justify">
Gene expression microarray experiments can generate data sets with multiple missing expression values. In this paper, automated methods are investigated for estimating missing data because many algorithms for gene expression analysis require a complete matrix of gene array values as input. Methods for imputing missing data are needed to minimize the effect of incomplete data sets on analysis.
The paper implements and evaluates three methods for estimating missing values: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. The methods are evaluated under various parameter settings and on different real data sets, and their robustness to the amount of missing data is assessed over the range of 1–20% missing values.
The results show that KNNimpute provides a more robust and sensitive method for missing value estimation than SVDimpute, and that both far outperform the commonly used row average.
</div>
## KNNimpute algorithm
<div style="text-align: justify">
The KNN-based method selects genes with expression profiles similar to the gene of interest and uses them to impute missing values. Suppose gene A has a missing value in experiment 1. The method finds K other genes that have a value present in experiment 1 and whose expression is most similar to A in experiments 2–N (where N is the total number of experiments). A weighted average of the experiment 1 values from these K closest genes is then used as the estimate for the missing value in gene A.
Euclidean distance was used as a metric for gene similarity after log-transforming the data to reduce the effect of outliers.
The performance was measured over different data sets and different values of K. The method is very accurate, with the estimated values showing only a 6–26% average deviation from the true values. It is also successful at estimating missing values for genes that are expressed in small clusters. With KNN-based estimation on a noisy time series data set with 10% of entries missing, approximately 88% of the values are estimated with a normalized RMS error under 0.25. Under low apparent noise levels in time series data, as many as 94% of values are estimated within 0.25 of the original value.
The algorithm is also robust to an increasing percentage of missing values, with at most a 10% decrease in accuracy when 20% of the data are missing.
</div>
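
The steps above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the function name `knn_impute`, the use of `NaN` to mark missing entries, the inverse-distance weights, and the averaging of squared differences over the shared experiments are all assumptions made for this sketch. The input is assumed to be a genes × experiments matrix that has already been log-transformed.

```python
import numpy as np

def knn_impute(X, k=10, eps=1e-12):
    """Fill NaNs in a genes x experiments matrix from the k most similar genes.

    Hypothetical helper for illustration; weights are inverse distances,
    which is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        target = X[i]
        candidates = []
        for g in range(X.shape[0]):
            # A neighbor must be a different gene with a value in experiment j.
            if g == i or np.isnan(X[g, j]):
                continue
            # Compare only experiments observed in both genes
            # (experiment j is excluded automatically: target[j] is NaN).
            shared = ~np.isnan(target) & ~np.isnan(X[g])
            if not shared.any():
                continue
            diff = target[shared] - X[g, shared]
            dist = np.sqrt(np.mean(diff ** 2))  # Euclidean, length-normalized
            candidates.append((dist, X[g, j]))
        if not candidates:
            continue  # no usable neighbor: leave the entry as NaN
        candidates.sort(key=lambda pair: pair[0])
        dists, values = map(np.array, zip(*candidates[:k]))
        weights = 1.0 / (dists + eps)  # closer genes contribute more
        filled[i, j] = np.sum(weights * values) / np.sum(weights)
    return filled

# Small usage example: gene 0 is missing its value in the last experiment.
data = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.1],
    [0.9, 1.9, 2.9],
    [8.0, 9.0, 1.0],
])
print(knn_impute(data, k=2))
```

With `k=2` the two genes closest to gene 0 dominate the estimate, and the distant fourth gene is ignored entirely.
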
## SVDimpute algorithm
<div style="text-align: justify">
In this method, we use singular value decomposition to obtain a set of mutually orthogonal expression patterns (eigengenes) that can be linearly combined to approximate the expression of all genes in the data set.
The expression matrix A is decomposed as A = UΣV<sup>T</sup>. The rows of V<sup>T</sup> are the eigengenes, and the diagonal matrix Σ holds the corresponding singular values (called eigenvalues in the paper). We then identify the most significant eigengenes by sorting them by their corresponding eigenvalue. Once the k most significant eigengenes from V<sup>T</sup> are selected, we estimate a missing value j in gene i by first regressing this gene against the k eigengenes and then using the regression coefficients to reconstruct j from a linear combination of the k eigengenes.
Through experimentation, it was concluded that the most accurate estimation is achieved when approximately 20% of the eigengenes are used for estimation.
Although SVD-based estimation provides significantly higher accuracy than row average on all data sets, its performance is sensitive to the type of analyzed data.
SVDimpute yields the best results on time-series data with low noise levels, where it even performs better than KNN. The KNN-based method exhibits higher performance for both noisy time series data and non-time series data.
</div>
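
A rough Python sketch of the regression step described above follows. Because an SVD needs a complete matrix, the sketch starts from a row-mean fill and repeats the estimate until it stops changing; that starting point, the stopping rule, the function name `svd_impute`, and its parameters are assumptions of this example rather than details taken from the summary above.

```python
import numpy as np

def svd_impute(X, n_eigengenes, n_iter=50, tol=1e-6):
    """Fill NaNs by regressing each gene on the top eigengenes.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumption: begin with row means so the SVD sees a complete matrix.
    row_means = np.nanmean(X, axis=1, keepdims=True)
    filled = np.where(missing, row_means, X)
    for _ in range(n_iter):
        # Rows of vt are eigengenes, ordered by decreasing singular value.
        _, _, vt = np.linalg.svd(filled, full_matrices=False)
        basis = vt[:n_eigengenes]  # shape: (k, n_experiments)
        updated = filled.copy()
        for i in np.where(missing.any(axis=1))[0]:
            observed = ~missing[i]
            # Regress the observed values of gene i on the k eigengenes.
            coef, *_ = np.linalg.lstsq(
                basis[:, observed].T, X[i, observed], rcond=None
            )
            # Reconstruct the missing values from the same linear combination.
            updated[i, missing[i]] = coef @ basis[:, missing[i]]
        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break
    return filled

# Usage example on a rank-1 matrix with one entry removed.
full = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
holed = full.copy()
holed[0, 2] = np.nan
print(svd_impute(holed, n_eigengenes=1))
```

The number of eigengenes is the only tuning parameter here, which matches the sensitivity to that choice noted in the text.
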
<image src="https://i.imgur.com/fsKbkq0.png" style="display:block; margin-left: auto; margin-right: auto;"></image>
<div style=" margin: auto; width: 20%;padding:5px;">
Figure 1: SVD
</div>
## Row average
<div style="text-align: justify">
Missing log2-transformed values are replaced by the average expression over the row, or ‘row average’. This approach is not optimal because it does not take the correlation structure of the data into consideration.
Estimation by row (gene) average, although an improvement upon replacing missing values with zeros, yielded drastically lower accuracy than either KNN or SVD-based estimation. In contrast to SVD and KNN, row average does not take advantage of the rich information provided by the expression patterns of other genes in the data set.
</div>
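
For comparison, row average takes only a few lines. The sketch below assumes missing entries are stored as `NaN` in a genes × experiments NumPy array; the function name `row_average_impute` is made up for this example.

```python
import numpy as np

def row_average_impute(X):
    """Replace each NaN with the mean of the observed values in its row."""
    X = np.asarray(X, dtype=float)
    # nanmean ignores missing entries when averaging each gene's row.
    row_means = np.nanmean(X, axis=1, keepdims=True)
    return np.where(np.isnan(X), row_means, X)

example = np.array([[1.0, np.nan, 3.0],
                    [4.0, 5.0, 6.0]])
print(row_average_impute(example))
```

Each estimate depends only on the gene's own row, which is why the method cannot use information from other genes.
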
## Methodology and Results
<div style="text-align: justify">
Three microarray data sets were used: a time series data set, a noisy time series data set, and a non-time series data set. Each data set was pre-processed for the evaluation by removing rows and columns containing missing expression values, yielding ‘complete’ matrices. The methods were then evaluated over each data set as follows. Between 1% and 20% of the data were deleted at random to create test data sets. Each method was then used to recover the introduced missing values, and the estimated values were compared to those in the original data set. Estimation accuracy was measured as the Root Mean Squared (RMS) difference between the imputed matrix and the original matrix, divided by the average data value in the complete data set. This normalization allows estimation accuracy to be compared between different data sets.
</div><br>
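
The evaluation procedure can be sketched as two small helpers: one that deletes a random fraction of a complete matrix, and one that computes the normalized RMS error as described above. The names `mask_random_entries` and `normalized_rms_error` are hypothetical, and the sketch assumes the average data value is nonzero and that the RMS is taken over the whole matrix, following the wording of the text.

```python
import numpy as np

def mask_random_entries(complete, fraction, rng):
    """Return a copy of `complete` with about `fraction` of entries set to NaN."""
    complete = np.asarray(complete, dtype=float)
    n_missing = int(round(fraction * complete.size))
    # Pick distinct flat positions to delete.
    idx = rng.choice(complete.size, size=n_missing, replace=False)
    masked = complete.copy()
    masked.flat[idx] = np.nan
    return masked

def normalized_rms_error(imputed, complete):
    """RMS difference between the matrices, divided by the mean data value."""
    imputed = np.asarray(imputed, dtype=float)
    complete = np.asarray(complete, dtype=float)
    rms = np.sqrt(np.mean((imputed - complete) ** 2))
    return rms / np.mean(complete)  # assumes a nonzero mean

# Usage example: delete 10% of a toy matrix, fill with the global mean
# (a deliberately crude stand-in for a real imputation method), and score it.
rng = np.random.default_rng(0)
complete = rng.normal(loc=5.0, scale=1.0, size=(20, 10))
masked = mask_random_entries(complete, 0.10, rng)
crude = np.where(np.isnan(masked), np.nanmean(masked), masked)
print(normalized_rms_error(crude, complete))
```

Any of the three imputation methods can be substituted for the crude fill to compare their errors on the same masked matrix.
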
<div>
<image src="https://i.imgur.com/CKxPdrE.png" style="display:block; margin-left: auto; margin-right: auto;"></image>
</div>
<div style=" margin: auto; width: 20%;padding:5px;">
Figure 2: Results
</div>
## Conclusions
<div style="text-align: justify">
KNN- and SVD-based methods provide fast and accurate ways of estimating missing values for microarray data. Both methods far surpass row average by taking advantage of the correlation structure of the data to estimate missing expression values. Based on the results, the authors recommend the KNN-based method for imputing missing values.
KNN-based imputation shows less deterioration in performance as the percentage of missing entries increases. In addition, KNNimpute is more robust than SVDimpute to the type of data for which estimation is performed, performing better on non-time series or noisy data. KNNimpute is also less sensitive to the exact parameters used (number of nearest neighbors), whereas the SVD-based method shows a sharp deterioration in performance when a non-optimal fraction of eigengenes is used. Finally, KNNimpute has the advantage of providing accurate estimates for missing values in genes that belong to small, tight expression clusters.
</div>
## References
<div style=" font-style: italic;">
1. Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520–525.
</div>