
3. XCMS correspondence, hard filling and imputation

Correspondence

The final step in metabolomics preprocessing is correspondence, which matches detected chromatographic peaks between samples (and, depending on the settings, also within samples if they are adjacent).

The input of this step is the xcms object from Alignment. The method that performs correspondence in xcms is groupChromPeaks. We will use the peak density method to group chromatographic peaks. The algorithm groups chromatographic peaks based on the density of peaks along the retention time axis (defined by bw) within small slices along the m/z dimension (defined by binSize).
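The grouping idea can be sketched in base R with the generic density() function. This is only a toy illustration of the principle, not the actual xcms implementation: peaks at similar retention times across samples fall under the same maximum of the density estimate and would be grouped into one feature.

```r
# Toy base-R illustration of density-based grouping (not the actual xcms
# implementation): peaks at similar RTs across samples fall under the same
# maximum of the density estimate and would be grouped into one feature.
rt <- c(100.1, 100.4, 99.8,    # peaks from three samples -> candidate feature 1
        130.2, 130.6, 129.9)   # peaks from three samples -> candidate feature 2
d <- density(rt, bw = 2)       # bw ~ maximum expected RT deviation
# local maxima of the density estimate mark candidate feature positions
is_max <- diff(sign(diff(d$y))) == -2
feature_rts <- d$x[which(is_max) + 1]
length(feature_rts)            # two density maxima -> two features
```

With a too-large bw the two clusters would merge into a single maximum, which is exactly the failure mode discussed below.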

[from xcms tutorial] Upper panel: chromatogram for an m/z slice with multiple chromatographic peaks. Lower panel: identified chromatographic peaks at their retention time (x-axis) and sample index within the experiment (y-axis) for different values of the bw parameter. The black line represents the peak density estimate. Grouping of peaks (based on the provided settings) is indicated by grey rectangles.


The key parameters in correspondence are bw (the bandwidth in the RT domain) and minFraction (the minimum proportion of samples per sample group in which a peak must be present for it to be considered a feature). minFraction is more intuitive to set than bw, which needs more care and effort. Normally, minFraction is set to 0.4 to guarantee that most features of interest remain in the dataset. That means only chromatographic peaks present in at least 40% of the samples of a sample group (e.g. sQCs, ltQCs, real samples, etc.) are grouped into a feature. The sample group assignment is specified with the sampleGroups argument (set at the peak picking step). A much lower minFraction will include more noise and group-specific features (and is thus problematic for the subsequent preprocessing steps), while a much higher minFraction will result in higher quality data but is very conservative and risks losing information, especially features that may be present in specific sample groups only (which could even be perfect classifiers).
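The minFraction rule can be made concrete with a hypothetical base-R sketch (the real check happens inside groupChromPeaks): a peak group becomes a feature if it is detected in at least 40% of the samples of at least one sample group.

```r
# Hypothetical base-R sketch of the minFraction rule (the real check happens
# inside groupChromPeaks, not in user code like this).
detected <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)       # per-sample detection
group    <- c("QC", "QC", "QC", "sample", "sample", "sample")
frac_per_group <- tapply(detected, group, mean)            # detection rate per group
passes <- any(frac_per_group >= 0.4)                       # minFraction = 0.4
passes   # TRUE: kept, because the QC group reaches 2/3 >= 0.4
```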

bw is the bandwidth (standard deviation or half width at half maximum) of the Gaussian smoothing kernel applied to the peak density chromatogram, and can be roughly translated to the maximum expected RT deviation across samples (see the picture above) (REF). Below we explore how to estimate an appropriate bw for the data set. Please note that bw is also estimated earlier, in the IPO step. However, the IPO estimate turns out extremely conservative (often below 1) and is thus not suitable in practice. bw is instrument-dependent; as we are using a UPLC system, bw is normally observed in the range of 2-5 s (REF). A dataset-wise investigation is nevertheless highly recommended. Below we plot bw against the number of features to facilitate visualization of a valid bw value.

BWG <- seq(0.5, 5, by = 0.5)            # candidate bandwidths (s)
feature_number <- numeric(length(BWG))
for (i in seq_along(BWG)) {
  pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
                          minFraction = 0.4, bw = BWG[i], binSize = 0.015)
  xdata_BWG <- groupChromPeaks(xdata, param = pdp)
  feature_number[i] <- nrow(featureValues(xdata_BWG))
}

feature_number
jpeg("BWG_vs_feature_number.jpg")
plot(BWG, feature_number, xlab = "bw (s)", ylab = "Number of features")
dev.off()

An example plot:


There is often a "critical" point where the curve changes slope, indicating that the correspondence algorithm transitions from one regime (splitting the same feature into several) to another (merging different features into one). Since merging different features into one is detrimental to the data and irreversible, a slightly conservative bw value just before this critical point is recommended. Feature splitting can still be corrected by batchCorr in a later procedure (batchCorr Drift Correction). Please note that the bw in the correspondence is usually smaller than that used in Retention Time Alignment, because we can expect less deviation of chromatographic peaks between samples after the alignment. binSize should be set small enough to avoid grouping together peaks from different metabolites with similar m/z and retention time. Here we set binSize to 0.015; binSize can be optimized by IPO.
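Locating the critical point can also be done numerically rather than by eye. Below is a toy sketch: the feature counts are assumed stand-ins for the feature_number vector produced by the scan above, and the largest-slope-change heuristic is our own assumption, not part of xcms.

```r
# Toy sketch (assumed numbers, not xcms output): find the bw where the slope
# of the feature-number curve changes most sharply.
bw_grid <- seq(0.5, 5, by = 0.5)
n_feat  <- c(9000, 8400, 7900, 7500, 7200, 7050, 6980, 6600, 6100, 5500)
slope   <- diff(n_feat) / diff(bw_grid)
# the largest change in slope marks the transition between the two regimes
critical_idx <- which.max(abs(diff(slope))) + 1
bw_grid[critical_idx]   # choose a bw slightly below this value
```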

Now we can pass the optimized parameters to the function and perform the correspondence.

pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
                        minFraction = 0.4, bw = 2.4, binSize = 0.015)
xdata <- groupChromPeaks(xdata, param = pdp)
saveRDS(xdata, file = "xdata_correspondance.rds")

From here, we can extract the feature table from the xcms object.

#Define the function to extract peak table

getTable <- function(XObj) {
  table <- featureValues(XObj, value = "into")   # features x samples
  cat('\nConversion of', deparse(substitute(XObj)), 'into peak table')
  cat('\n', ncol(table), 'samples (rows of the returned table)')
  cat('\n', nrow(table), 'features (columns of the returned table)')
  perc <- paste0('(', signif(100 * sum(is.na(table)) / prod(dim(table)), 3), '% of data)')
  cat('\n', sum(is.na(table)), 'missing values', perc)
  NASamp <- apply(table, 2, function(x) sum(is.na(x))) / nrow(table)
  NAFeat <- apply(table, 1, function(x) sum(is.na(x))) / ncol(table)
  par(mfrow = c(1, 2))
  hist(NASamp, main = 'Missingness per sample', xlab = 'Proportion')
  hist(NAFeat, main = 'Missingness per feature', xlab = 'Proportion')
  par(mfrow = c(1, 1))
  featInfo <- featureDefinitions(XObj)
  rownames(table) <- paste(featInfo$mzmed, featInfo$rtmed, sep = '@')
  t(table)                                       # return samples x features
}

#Extract the feature table
FT_NOFill <- getTable(xdata)
write.csv(FT_NOFill, file = "project_mode_FT_NOFill_new.csv")

In SOP 1 it is written that we can evaluate the prefilter in this SOP (SOP 3) after correspondence. The evaluation can be performed by:

  • The number of features gained after correspondence. It should not be too low or too high, typically around 5000-8000. But remember there is room for variation; the instrument can have a good/bad day.
  • From peak picking we get the number of regions of interest (ROIs) and the number of peaks. They should not deviate too much from each other. If the number of ROIs is much higher than the number of peaks, we have probably captured noise in the ROIs and the prefilter should be increased. This information is obtained after peak picking, but after correspondence these two factors are good to evaluate together.

Hard filling

After correspondence, the feature matrix contains NA values for samples in which no chromatographic peak was detected in the feature's m/z-rt region. While in many cases there might indeed be no peak signal in the respective region, it might also be that there is signal but the peak detection algorithm failed to detect a chromatographic peak (e.g. because the signal was too low or too noisy). xcms provides the fillChromPeaks method to fill in intensity data for such missing values from the original files. The filled-in peaks are added to the chromPeaks matrix. Different options to define the m/z-rt region of the features are available. With the ChromPeakAreaParam() parameter object used below, the feature area is defined (by default) using the m/z and rt ranges of all of its detected chromatographic peaks: the lower m/z value of the area is the lower quartile (25% quantile) of the "mzmin" values of all peaks of the feature, the upper m/z value is the upper quartile (75% quantile) of the "mzmax" values, the lower rt value is the lower quartile (25% quantile) of the "rtmin" values, and the upper rt value is the upper quartile (75% quantile) of the "rtmax" values. This ensures that the signal is integrated from a feature-specific area.
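The quartile-based definition of the feature area can be reproduced in base R with quantile(). The peak boundaries below are toy values; xcms performs this computation internally on the chromPeaks matrix.

```r
# Base-R sketch of how ChromPeakAreaParam defines a feature's area from its
# detected peaks' boundaries (toy values; xcms does this internally).
peaks <- data.frame(
  mzmin = c(200.001, 200.002, 200.000),
  mzmax = c(200.010, 200.012, 200.011),
  rtmin = c(118, 119, 117),
  rtmax = c(124, 125, 123)
)
area <- c(
  mzmin = unname(quantile(peaks$mzmin, 0.25)),  # lower quartile of "mzmin"
  mzmax = unname(quantile(peaks$mzmax, 0.75)),  # upper quartile of "mzmax"
  rtmin = unname(quantile(peaks$rtmin, 0.25)),  # lower quartile of "rtmin"
  rtmax = unname(quantile(peaks$rtmax, 0.75))   # upper quartile of "rtmax"
)
area   # the m/z-rt box from which the missing signal is integrated
```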

xdata <- fillChromPeaks(xdata, param = ChromPeakAreaParam())
saveRDS(xdata, "xdata_filled.rds")
Missing_beforeFill<- apply(featureValues(xdata, filled = FALSE), MARGIN = 2,
      FUN = function(z) sum(is.na(z)))
Missing_afterFill<- apply(featureValues(xdata), MARGIN = 2,
      FUN = function(z) sum(is.na(z)))
Miss <-  cbind.data.frame(Missing_beforeFill, Missing_afterFill)
print(Miss)

Imputation

After the hard filling of missing peaks, some missing values likely remain in the feature matrix. We use the in-house package StatTools to perform the imputation with either the random forest (RF) algorithm or PLS. RF is arguably the better choice, but PLS is significantly faster, especially for very large data sets.

Importantly, RF cannot extrapolate beyond the training data, i.e. it cannot suggest values lower than the minimum or higher than the maximum reported per feature. PLS, however, is a linear model, which readily extrapolates. Consequently, PLS-based imputation can even suggest negative concentrations and therefore needs to be lower-bounded to zero by adding forceZero = TRUE. Otherwise, default parameters can be used.
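The effect of forceZero can be illustrated in one line of base R (toy values; this is the idea behind the option, not the mvImpWrap internals):

```r
# Illustration of the forceZero idea (not the mvImpWrap internals): linear
# imputation can produce negative intensities, which are clamped to zero.
imputed <- c(1520.3, -12.7, 480.0, -0.4)
pmax(imputed, 0)   # negative values are replaced by 0
```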

For Random Forest-based imputation:

FT_fill <- getTable(xdata)
FT_fill_imputation <- mvImpWrap(MAT = FT_fill , method = "RF", rfMeth = "ranger", guess = "minVar")
write.csv(FT_fill_imputation, file = "project_mode_FT_fill_imputation.csv")

Similar for PLS-based imputation:

FT_fill <- getTable(xdata)
FT_fill_imputation <- mvImpWrap(MAT = FT_fill , method = "PLS", guess = "minVar", forceZero = TRUE)
write.csv(FT_fill_imputation, file = "project_mode_FT_fill_imputation.csv")

When this is done, you can move on to intensity drift correction, described in SOP "4. batchCorr Drift Correction".