---
tags: SOP, WIP
---
# 3. XCMS correspondence, hard filling and imputation
## Correspondence
The final step in metabolomics preprocessing is correspondence, which matches detected chromatographic peaks between samples (and, depending on the settings, also within samples if they are adjacent).
The input to this step is the **xcms object from Alignment**. The method to perform correspondence in xcms is `groupChromPeaks`. We will use the peak density method to group chromatographic peaks. The algorithm combines chromatographic peaks depending on the density of peaks along the retention time axis (defined by `bw`) within small slices along the m/z dimension (defined by `binSize`).

*Figure (from the xcms tutorial). Upper panel: chromatogram for an m/z slice with multiple chromatographic peaks. Lower panel: identified chromatographic peaks at their retention time (x-axis) and sample index within the experiment (y-axis) for different values of the `bw` parameter. The black line represents the peak density estimate; grouping of peaks (based on the provided settings) is indicated by grey rectangles.*
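The effect described above can be mimicked with a small base-R sketch. This is a toy illustration only, not the actual xcms implementation: peak apexes are grouped when the density estimate, smoothed with bandwidth `bw`, merges them into a single mode. All retention times below are made up.

```
# Toy illustration (not the xcms internals): how the density bandwidth
# decides whether nearby peaks form one feature or two.
# Made-up apex retention times (s) of peaks from two features,
# each detected in three samples:
rt <- c(99.5, 100.0, 100.5,   # peaks around 100 s
        109.5, 110.0, 110.5)  # peaks around 110 s

# Count the modes (local maxima) of the smoothed peak density
count_modes <- function(rt, bw) {
  d <- density(rt, bw = bw)
  sum(diff(sign(diff(d$y))) == -2)
}

count_modes(rt, bw = 2)   # small bw: the two features remain separate
count_modes(rt, bw = 10)  # large bw: everything merges into one group
```

In xcms itself, this grouping is additionally restricted to narrow m/z slices (`binSize`) and filtered by `minFraction`.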
---
The key parameters in correspondence are `bw` (i.e. the bandwidth in the RT domain) and `minFraction` (i.e. the minimum proportion of samples per sample group in which a peak must be present for it to be considered a feature). `minFraction` is more intuitive to set than `bw`, which needs more care and effort. Normally, `minFraction` is set to 0.4 to guarantee that most features of interest remain in the dataset. That means only chromatographic peaks present in at least 40% of the samples **per sample group** (e.g. sQCs, ltQCs, real samples, etc.) are grouped into a feature. The sample group assignment is specified with the `sampleGroups` argument (set at the peak picking step). A much lower `minFraction` will include more noise and group-specific features (and is thus problematic for the subsequent pre-processing steps), whereas a much higher `minFraction` will result in higher-quality data but will be very conservative and risk losing information, especially features that may be present in specific sample groups only (and could thus potentially even be a perfect classifier).
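To make the per-group threshold concrete, here is a quick back-of-the-envelope calculation in base R. The group sizes below are hypothetical, and the exact rounding is handled internally by xcms; this only shows the approximate number of samples per group in which a peak must be present at `minFraction = 0.4`.

```
# Hypothetical sample group sizes for one experiment
group_sizes <- c(sQC = 12, ltQC = 8, sample = 60)
minFraction <- 0.4

# Approximate minimum number of samples per group in which a peak
# must be detected to count toward a feature
min_samples <- ceiling(group_sizes * minFraction)
min_samples
```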
`bw` is the bandwidth (standard deviation, or half width at half maximum) of the Gaussian smoothing kernel applied to the peak density chromatogram, which can roughly be translated to the maximum expected RT deviation across samples (see the figure above) ([REF](http://stanstrup.github.io/material/presentations/1.%20XCMS.html#/28)). Below we explore how to estimate an appropriate `bw` for the data set. Please note that `bw` is also estimated in the earlier IPO step; however, the IPO estimate tends to be extremely conservative (often below 1 s) and is thus not suitable in practice. `bw` is instrument-dependent and, as we are using a UPLC system, it is normally observed in the range of 2-5 s ([REF](https://www.nature.com/articles/nprot.2017.151)). A dataset-wise investigation is nevertheless highly recommended. Below we plot `bw` against the number of features to facilitate a better visualization of a valid `bw` value.
```
# Scan a range of bw values and record the resulting number of features
BWG <- seq(0.5, 5, by = 0.5)
feature_number <- numeric(length(BWG))
for (i in seq_along(BWG)) {
  pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
                          minFraction = 0.4, bw = BWG[i], binSize = 0.015)
  xdata_BWG <- groupChromPeaks(xdata, param = pdp)
  feature_number[i] <- nrow(featureValues(xdata_BWG))
}
feature_number
jpeg("BWG_vs_feature_number.jpg")
plot(BWG, feature_number, type = "b",
     xlab = "bw (s)", ylab = "Number of features")
dev.off()
```
---
An example plot:

There is often a "critical" point where the line changes slope, indicating that the correspondence algorithm transitions from one state (splitting the same feature into several features) to another (merging different features into one). Merging different features into one is detrimental to the data and irreversible, so a slightly conservative value of `bw` just before this critical point is recommended. Feature splitting, on the other hand, can be corrected by batchCorr in a later procedure (batchCorr Drift Correction). Please note that the `bw` used in the correspondence is usually smaller than that used in Retention Time Alignment, because we can expect less deviation of chromatographic peaks between samples after the alignment. `binSize` should be set to a value small enough to avoid grouping together peaks from different metabolites that have similar m/z and retention time. Here we set `binSize` to 0.015; `binSize` can be optimized by IPO.
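One (hypothetical) way to locate the critical point programmatically is to look for the largest change in slope of the bw-vs-feature-number curve. The feature numbers below are made up for illustration; with real data, use the `BWG` and `feature_number` vectors from the scan above.

```
# Hypothetical bw scan results (substitute your own BWG / feature_number)
BWG <- seq(0.5, 5, by = 0.5)
feature_number <- c(8900, 8400, 8000, 7700, 7500,
                    7450, 7100, 6600, 6100, 5700)

# Slope between consecutive bw values, and where the slope changes most
slope <- diff(feature_number) / diff(BWG)
change <- diff(slope)
critical <- BWG[which.max(abs(change)) + 1]
critical  # choose a bw slightly below this value
```

Treat this purely as a helper for reading the plot; a visual inspection of the curve remains the primary guide.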
Now we can set the optimized parameters to the function and perform the Correspondence.
```
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
                        minFraction = 0.4, bw = 2.4, binSize = 0.015)
xdata <- groupChromPeaks(xdata, param = pdp)
saveRDS(xdata, file = "xdata_correspondance.rds")
```
From here, we can extract the feature table from the xcms object.
```
# Define a function to extract the feature table from an xcms object
getTable <- function(XObj) {
  table <- featureValues(XObj, value = "into")   # features x samples
  cat('\nConversion of', deparse(substitute(XObj)), 'into peak table')
  cat('\n', ncol(table), 'samples (rows in the output)')
  cat('\n', nrow(table), 'features (columns in the output)')
  perc <- paste0('(', signif(100 * sum(is.na(table)) / prod(dim(table)), 3), '% of data)')
  cat('\n', sum(is.na(table)), 'missing values', perc)
  # Proportion of missing values per sample and per feature
  NASamp <- apply(table, 2, function(x) sum(is.na(x))) / nrow(table)
  NAFeat <- apply(table, 1, function(x) sum(is.na(x))) / ncol(table)
  par(mfrow = c(1, 2))
  hist(NASamp, main = 'Missingness per sample', xlab = 'Proportion')
  hist(NAFeat, main = 'Missingness per feature', xlab = 'Proportion')
  par(mfrow = c(1, 1))
  # Name features as mz@rt and return a samples-by-features table
  featInfo <- featureDefinitions(XObj)
  rownames(table) <- paste(featInfo$mzmed, featInfo$rtmed, sep = '@')
  t(table)
}
#Extract the feature table
FT_NOFill <- getTable(xdata)
write.csv(FT_NOFill, file = "project_mode_FT_NOFill_new.csv")
```
In SOP 1 it is written that we can evaluate the prefilter in this SOP (SOP 3) after correspondence. The evaluation can be performed by:
- The number of features obtained after correspondence. It should not be too low or too high, typically around 5000-8000. But remember there is room for variation; the instrument can have a good/bad day.
- From peak picking we get the number of regions of interest (ROIs) and the number of peaks. They should not deviate too much. If the number of ROIs is much higher than the number of peaks, we have probably captured noise in the ROIs and the prefilter should be increased. Although this information is obtained after peak picking, these two factors are best evaluated together after correspondence.
## Hard filling
After correspondence, the feature matrix contains NA values for samples in which no chromatographic peak was detected in the feature's m/z-rt region. While in many cases there might indeed be no peak signal in the respective region, it might also be that there is signal but the peak detection algorithm failed to detect a chromatographic peak (e.g. because the signal was too low or too noisy). xcms provides the `fillChromPeaks` method to fill in intensity data for such missing values from the original files. The filled-in peaks are added to the chromPeaks matrix. Different options to define the m/z-rt region of the features are available. With the `ChromPeakAreaParam()` parameter object used below, the feature area is defined (by default) using the m/z and rt ranges of all of its (detected) chromatographic peaks: the lower m/z value of the area is the lower quartile (25% quantile) of the "mzmin" values of all peaks of the feature, the upper m/z value is the upper quartile (75% quantile) of the "mzmax" values, the lower rt value is the lower quartile of the "rtmin" values, and the upper rt value is the upper quartile of the "rtmax" values. This ensures that the signal is integrated from a feature-specific area.
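The quartile-based area definition can be illustrated with a toy base-R example. The peak boundaries below are made up, and the real computation happens inside xcms; this only shows how the default `ChromPeakAreaParam` area is derived from the detected peaks of one feature.

```
# Made-up boundaries of the detected chromatographic peaks
# belonging to a single feature (four samples):
peaks <- data.frame(
  mzmin = c(200.0801, 200.0803, 200.0802, 200.0805),
  mzmax = c(200.0815, 200.0817, 200.0816, 200.0819),
  rtmin = c(180.2, 180.6, 180.4, 181.0),
  rtmax = c(186.5, 186.9, 186.7, 187.3)
)

# Feature-specific integration area, following the default definition:
# lower bounds from the 25% quantile, upper bounds from the 75% quantile
area <- c(
  mzmin = quantile(peaks$mzmin, 0.25, names = FALSE),
  mzmax = quantile(peaks$mzmax, 0.75, names = FALSE),
  rtmin = quantile(peaks$rtmin, 0.25, names = FALSE),
  rtmax = quantile(peaks$rtmax, 0.75, names = FALSE)
)
area
```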
```
xdata <- fillChromPeaks(xdata, param = ChromPeakAreaParam())
saveRDS(xdata, "xdata_filled.rds")
# Compare the number of missing values per sample before and after filling
Missing_beforeFill <- apply(featureValues(xdata, filled = FALSE), MARGIN = 2,
                            FUN = function(z) sum(is.na(z)))
Missing_afterFill <- apply(featureValues(xdata), MARGIN = 2,
                           FUN = function(z) sum(is.na(z)))
Miss <- cbind.data.frame(Missing_beforeFill, Missing_afterFill)
print(Miss)
```
## Imputation
After the hard filling of missing peaks, there is likely some missingness left in the feature matrix. We use the in-house package [StatTools](https://gitlab.com/CarlBrunius/StatTools) to perform the imputation with either the random forest (RF) algorithm or PLS. RF might arguably be the better choice, but PLS is significantly faster, especially for very large data sets.
Importantly, RF cannot extrapolate beyond the training data, i.e. it cannot suggest values lower than the minimum or higher than the maximum reported value per feature. PLS, in contrast, is a linear model that readily extrapolates. Consequently, PLS-based imputation can even suggest negative concentrations and therefore needs to be lower-bounded to zero by adding `forceZero = TRUE`. Otherwise, default parameters can be used.
For Random Forest-based imputation:
```
FT_fill <- getTable(xdata)
FT_fill_imputation <- mvImpWrap(MAT = FT_fill , method = "RF", rfMeth = "ranger", guess = "minVar")
write.csv(FT_fill_imputation, file = "project_mode_FT_fill_imputation.csv")
```
Similar for PLS-based imputation:
```
FT_fill <- getTable(xdata)
FT_fill_imputation <- mvImpWrap(MAT = FT_fill , method = "PLS", guess = "minVar", forceZero = TRUE)
write.csv(FT_fill_imputation, file = "project_mode_FT_fill_imputation.csv")
```
When this is done, you can move on to the intensity drift correction described in SOP "4. batchCorr Drift Correction".