The final step in metabolomics preprocessing is correspondence, which matches detected chromatographic peaks between samples (and, depending on the settings, also between adjacent peaks within samples). The input of this step is the xcms object from Alignment. The method to perform correspondence in xcms is groupChromPeaks. We will use the peak density method to group chromatographic peaks. The algorithm combines chromatographic peaks depending on the density of peaks along the retention time axis (defined by bw) within small slices along the m/z dimension (defined by binSize).
[From the xcms tutorial] Upper panel: chromatogram for an m/z slice with multiple chromatographic peaks. Lower panel: identified chromatographic peaks at their retention time (x-axis) and sample index within the experiment (y-axis) for different values of the bw parameter. The black line represents the peak density estimate. Grouping of peaks (based on the provided settings) is indicated by grey rectangles.
The key parameters in correspondence are bw (the bandwidth in the RT domain) and minFraction (the minimum proportion of samples per sample group in which a peak must be present for it to be considered a feature). minFraction is more intuitive to set than bw, which needs more care and effort. Normally, minFraction is set to 0.4 to guarantee that most features of interest remain in the data set. That means only chromatographic peaks present in at least 40% of the samples of a sample group (e.g. sQCs, ltQCs, real samples, etc.) are grouped into a feature. The sample group assignment is specified with the sampleGroups argument (set at the peak picking step). A much lower minFraction will include more noise and group-specific features (problematic for the subsequent preprocessing steps), whereas a much higher minFraction will result in higher-quality data but will be very conservative and risk losing information, especially features that may be present in specific sample groups only (and could even potentially be a perfect classifier).
bw is the bandwidth (standard deviation, or half width at half maximum) of the Gaussian smoothing kernel applied to the peak density chromatogram, which can be roughly translated to the maximum expected RT deviation across samples (see the picture above) (REF). Below we explore how to estimate an appropriate bw for the data set. Please note that bw is also estimated earlier, in the IPO step. However, the IPO estimate tends to be extremely conservative (often below 1) and thus not suitable in practice. bw is instrument-dependent; as we are using a UPLC system, bw is normally observed in the range of 2-5 s (REF), but a data-set-wise investigation is highly recommended. Below we plot bw against the number of features to facilitate a better visualization of a valid bw value.
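One way to carry out this exploration is to run a trial correspondence over a grid of bw values and record the resulting number of features. A sketch, assuming the aligned object is called xdata and the sample group labels are stored in xdata$sample_group (adjust both names to your data set):

```r
library(xcms)

## Candidate bandwidths to test; for UPLC data the relevant range is
## roughly 0.5-10 s
bws <- seq(0.5, 10, by = 0.5)

## For each bw, perform a trial correspondence and count the features
nFeatures <- sapply(bws, function(b) {
  param <- PeakDensityParam(sampleGroups = xdata$sample_group,
                            bw = b,
                            minFraction = 0.4,
                            binSize = 0.015)
  nrow(featureDefinitions(groupChromPeaks(xdata, param = param)))
})

## Plot bw against the number of features; look for the "critical"
## point where the slope of the curve changes
plot(bws, nFeatures, type = "b",
     xlab = "bw (s)", ylab = "Number of features")
```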
An example plot:
There is often a "critical" point where the line changes pattern (i.e. slope), indicating that the correspondence algorithm transitions from one state (splitting the same feature into several) to another (merging different features into one). Merging different features into one is detrimental to the data and irreversible, so a slightly conservative bw value before this critical point is recommended. Feature splitting can be corrected by batchCorr in a later procedure (batchCorr Drift Correction). Please note that the bw used in correspondence is usually smaller than that used in Retention Time Alignment, because we can expect less deviation of chromatographic peaks across samples after the alignment. binSize should be set to a value small enough to avoid peaks from different metabolites with similar m/z and retention time being grouped together. Here we set binSize to 0.015. binSize can be optimized by IPO.
Now we can set the optimized parameters in the function and perform the correspondence.
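A sketch of the call, assuming a bw of 3 s was chosen from the exploration above and that the aligned object is called xdata with group labels in xdata$sample_group (both names and values are placeholders to adjust to your data set):

```r
library(xcms)

## Peak density correspondence with the chosen parameters.
## sampleGroups must match the group assignment used at peak picking
## (e.g. sQC, ltQC, sample).
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
                        bw = 3,            # chosen from the bw exploration
                        minFraction = 0.4, # present in >= 40% per group
                        binSize = 0.015)   # m/z slice width
xdata <- groupChromPeaks(xdata, param = pdp)
```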
From here, we can extract the feature table from the xcms object.
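A sketch of the extraction, assuming the object is called xdata:

```r
## Feature definitions: m/z and rt ranges plus the indices of the
## chromatographic peaks assigned to each feature
fd <- featureDefinitions(xdata)

## Feature abundance matrix: one row per feature, one column per
## sample; value = "into" (integrated peak intensity), NA where no
## chromatographic peak was detected for that sample
ft <- featureValues(xdata, value = "into")
```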
In SOP 1 it is written that we can evaluate the prefilter in this SOP (SOP 3) after correspondence. The evaluation can be performed by:
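One possible check, sketched under the assumption that the prefilter intensity threshold from SOP 1 was 100 (replace with the value actually used): inspect the intensities of the peaks that ended up in features and see how many would fall below a candidate threshold.

```r
## Indices of all chromatographic peaks assigned to a feature
pkIdx <- unique(unlist(featureDefinitions(xdata)$peakidx))
pks   <- chromPeaks(xdata)[pkIdx, ]

## Fraction of feature-forming peaks below a candidate prefilter
## intensity threshold (here 100 as an assumed example)
mean(pks[, "maxo"] < 100)

## Distribution of apex intensities on a log scale
hist(log10(pks[, "maxo"]), breaks = 50,
     xlab = "log10(maxo)", main = "")
```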
After correspondence, the feature matrix contains NA values for samples in which no chromatographic peak was detected in the feature’s m/z-rt region. While in many cases there might indeed be no peak signal in the respective region, it might also be that there is signal, but the peak detection algorithm failed to detect a chromatographic peak (e.g. because the signal was too low or too noisy). xcms provides the fillChromPeaks method to fill in intensity data for such missing values from the original files. The filled-in peaks are added to the chromPeaks matrix. Different options to define the m/z-rt region of the features are available. With the ChromPeakAreaParam() parameter object used below, the feature area is defined (by default) using the m/z and rt ranges of all of its (detected) chromatographic peaks: the lower m/z value of the area is the lower quartile (25% quantile) of the "mzmin" values of all peaks of the feature, the upper m/z value the upper quartile (75% quantile) of the "mzmax" values, the lower rt value the lower quartile of the "rtmin" values and the upper rt value the upper quartile of the "rtmax" values. This ensures that the signal is integrated from a feature-specific area.
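A sketch of the filling step, again assuming the object is called xdata:

```r
## Fill in signal for missing values using the feature area defined by
## the quartiles of the detected peaks' m/z and rt ranges
xdata <- fillChromPeaks(xdata, param = ChromPeakAreaParam())

## Re-extract the feature matrix; any remaining NAs are values for
## which no signal could be integrated
ft <- featureValues(xdata, value = "into")
sum(is.na(ft))  # number of values still missing
```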
After this hard filling of missing peaks, some missing values will likely remain in the feature matrix. We use the in-house package StatTools to perform the imputation by either the random forest (RF) algorithm or PLS. RF is arguably the better choice, but PLS is significantly faster, especially for very large data sets.
Importantly, RF cannot extrapolate beyond the training data, i.e. it cannot suggest values lower than the minimum or higher than the maximum reported per feature. PLS, in contrast, is a linear model, which readily extrapolates. Consequently, PLS-based imputation can even suggest negative concentrations and therefore needs to be lower-bounded to zero by setting forceZero=TRUE. Otherwise, default parameters can be used.
For Random Forest-based imputation:
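A hypothetical sketch; the function name imputeRF is an assumption, since StatTools is an in-house package — check its documentation for the actual imputation interface:

```r
library(StatTools)  # in-house package

## Hypothetical call: RF-based imputation of the remaining NAs in the
## feature matrix ft, with default parameters
ftImputed <- imputeRF(ft)
```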
Similar for PLS-based imputation:
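Again a hypothetical sketch; the function name imputePLS is an assumption to verify against the StatTools documentation, while forceZero=TRUE is the lower-bounding described above:

```r
library(StatTools)  # in-house package

## Hypothetical call: PLS-based imputation; forceZero = TRUE bounds
## the imputed values at zero, since PLS as a linear model can
## extrapolate to negative concentrations
ftImputed <- imputePLS(ft, forceZero = TRUE)
```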
When this is done, you can move on to the intensity drift correction described in SOP "4. batchCorr Drift Correction".