---
tags: SOP, WIP
---

# 4. batchCorr Drift Correction

## Objective

Modern high-resolution mass spectrometers are excellent instruments for untargeted molecular research, since they allow wide coverage of the analytical space, e.g. the plasma, urine or CSF metabolome. However, these instruments are also inherently unstable in their response over time, e.g. due to build-up of dirt in the interface of an LC-MS system or imperfect column regeneration in the gradient program. This lack of stability can affect the measured m/z, retention time and signal intensity, and several approaches have been developed to manage this situation.

In general, systematic signal deviations are larger between batches than within batches. The approach in batchCorr is therefore to separate between-batch and within-batch variability and manage them separately. The package consists of 3 main modules: i) between-batch correspondence/alignment; ii) within-batch intensity drift correction; and iii) between-batch normalization. The main rationale for the first two modules is to work on information aggregated from several samples (module i) or several features (module ii) to improve the signal-to-noise ratio and reduce the likelihood of various types of overfitting.

batchCorr is an R package built by Carl Brunius. For more information (including a tutorial, which may overlap with the information on this page) see [https://gitlab.com/CarlBrunius/batchCorr](https://gitlab.com/CarlBrunius/batchCorr)

### What to do when batchCorr isn't improving data enough?

In some situations the sQCs will have been compromised to the point where batchCorr can't do its job properly, and too many features may be removed for the data to remain usable. In a situation like that, there are two viable alternatives that can be tested:

- MetNormalizer (https://github.com/jaspershen/MetNormalizer)
- SERRF (https://pubs.acs.org/doi/10.1021/acs.analchem.8b05592, https://slfan2013.github.io/SERRF-online/)

## Installation

1.
Install the release version of ‘devtools’ from CRAN:

   ```
   install.packages("devtools")
   ```

2. Make sure that a working development environment has been properly installed.
   - Windows: Install Rtools (https://cran.r-project.org/bin/windows/Rtools/).
   - Mac: Install Xcode from the Mac App Store.
   - Linux: Install the R development package, usually called ‘r-devel’ or ‘r-base-dev’.
3. Install batchCorr from GitLab:

   ```
   library(devtools)
   install_git("https://gitlab.com/CarlBrunius/batchCorr.git")
   ```

## Preparation and input data

1. Sample data needs to be organized into a table-format dataframe (`peakTable`):
   > Rows: injections (samples)
   > Columns: variables (features)
2. Metadata needs to be organized into a second table-format dataframe (`metadata`):
   > Rows: injections (samples)
   > Columns: metadata variables, which should include:
   > sample name (`sampleName`), sample batch (`batch`), sample type (`grp`), injection number (`inj`)
3. Make sure that the rows in `peakTable` correspond to those in `metadata` and that `peakTable` doesn't contain any missing peaks.
   > One way to check this is with the following command:<br>
   > `identical(rownames(peakTable), metadata$sampleName)`

If you use the workflow from xcms to RamClust presented in this guide, the output file after xcms and imputation is already in a format suitable for batchCorr.

## Application

### One-batch example

1. Load the library and the example data

   ```
   library(batchCorr)
   data('OneBatchData')
   ```

2. To execute batchCorr for this single-batch dataset, use the following command:

   ```
   batchBCorr <- correctDrift(peakTable = B_PT,
                              injections = B_meta$inj,
                              sampleGroups = B_meta$grp,
                              QCID = 'QC',
                              modelNames = 'VVE',
                              G = 17:22)
   ```

3. The parameters set for the `correctDrift()` function are:
   * peakTable: The peak table dataframe (`B_PT`)
   * injections: The column of metadata containing the injection number for samples in sequence (`B_meta$inj`)
   * sampleGroups: The column of metadata containing sample type (e.g.
   "sample", "QC", "Ref"; `B_meta$grp`)
   * QCID: The name given by the user to the QC sample type (`QC`)
   * modelNames and G: Will vary depending on application (`VVE` & `17:22`)

batchCorr utilizes the 'mclust' package, which uses the geometrical constraints supplied by the user through the `modelNames` argument. When running the first round of batchCorr, trying all of them is usually a good idea, which can be done by exchanging `modelNames = 'VVE'` for `modelNames = c('VVV','VVE','VEV','VEE','VEI','VVI','VII')`.<br>
The `G` value is a range informing 'mclust' of how many clusters it should try to find. In the example, `G = 17:22` indicates that 'mclust' should try to find 17-22 different clusters. By instead using `G = seq(5, 35, by = 10)`, 'mclust' will try to find 5, 15, 25 or 35 clusters. This can be useful when trying all the models during the first run-through of batchCorr, since it saves a lot of computing time. We'll discuss the output at the very end of this SOP.

> [name=Anton] Failed to mention reference samples, but I have very limited experience with those so not even sure exactly how they work tbh :( - Having finished writing the multi-batch example perhaps reference samples are only useful for multi-batch stuff?

### Multi-batch example

1. Load the library and the example data

   ```
   library(batchCorr)
   data('ThreeBatchData')
   ```

2. This example dataset contains 3 objects:
   * `PTnofill`: the peak table with missing data (has *not* gone through imputation)
   * `PTfill`: the filled peak table with no missing data (has gone through imputation)
   * `meta`: information on batch (`B`, `F` and `H`), sample group (`QC` or `Ref`) and `inj` (injection number in sequence)
3.
Join the batches from the three datasets together

   ```
   peakIn <- peakInfo(PT = PTnofill, sep = '@', start = 3)
   alignBat <- alignBatches(peakInfo = peakIn,
                            PeakTabNoFill = PTnofill,
                            PeakTabFilled = PTfill,
                            batches = meta$batch,
                            sampleGroups = meta$grp,
                            selectGroup = 'QC')
   PT <- alignBat$PTalign
   ```

   > In the code above, the `peakInfo()` function extracts m/z and rt information about all the features from the peak table. `alignBatches()` harmonizes the peak tables from the batches prior to correction.

   > [name=Anton] Should we mention here how alignBatches does the harmonization and adjoining of batch PTs?

4. Break up the data into individual datasets based on batch membership and perform drift correction on each one. Remember, it is important that the batches are presented in the order in which they were run; batches are sometimes rerun, and in that case they will not be in the initial order.

   ```
   batchB <- getBatch(peakTable = PT, meta = meta,
                      batch = meta$batch, select = 'B')
   BCorr <- correctDrift(peakTable = batchB$peakTable,
                         injections = batchB$meta$inj,
                         sampleGroups = batchB$meta$grp,
                         QCID = 'QC',
                         G = seq(5, 35, by = 3),
                         modelNames = c('VVE', 'VEE'))

   batchF <- getBatch(peakTable = PT, meta = meta,
                      batch = meta$batch, select = 'F')
   FCorr <- correctDrift(peakTable = batchF$peakTable,
                         injections = batchF$meta$inj,
                         sampleGroups = batchF$meta$grp,
                         QCID = 'QC',
                         G = seq(5, 35, by = 3),
                         modelNames = c('VVE', 'VEE'))

   batchH <- getBatch(peakTable = PT, meta = meta,
                      batch = meta$batch, select = 'H')
   HCorr <- correctDrift(peakTable = batchH$peakTable,
                         injections = batchH$meta$inj,
                         sampleGroups = batchH$meta$grp,
                         QCID = 'QC',
                         G = seq(5, 35, by = 3),
                         modelNames = c('VVE', 'VEE'))
   ```

   > `getBatch()` is used to extract all information pertaining to a specific batch from the combined peak table and metadata. See the one-batch example for more information on how to use the `correctDrift()` function.

5.
Align the three sequence-corrected datasets with each other again

   ```
   mergedData <- mergeBatches(list(BCorr, FCorr, HCorr))
   normData <- normalizeBatches(peakTable = mergedData$peakTable,
                                batches = meta$batch,
                                sampleGroup = meta$grp,
                                refGroup = 'Ref',
                                population = 'sample')
   PTnorm <- normData$peakTable
   ```

   > `mergeBatches()` will merge the batches, retaining only features which had less than 30% CV in at least 50% of the batches. `normalizeBatches()` is only useful if a 'reference sample' (a QC not used for modelling and correcting drift) has been included. If it has, the function will check whether the reference samples pass certain criteria (Brunius et al. 2016, Metabolomics) and normalize by them. Batches in which the reference samples fail to pass the criteria are instead normalized by the population median (the population is specified by giving the `population` argument the correct sample type).

## Output and results

### Sequence drift correction

##### batchCorr object

Sequence drift correction for a single batch (or for each batch in a multi-batch experiment) will result in one batchCorr object per batch. This object contains a lot of information on the outcome of sequence drift correction:

- `$actionInfo` gives a summary of the drift-pattern clusters found in the analysis and the results of correction in terms of peaks filtered out
- `$testFeatsCorr` contains the drift-corrected data
- `$testFeatsFinal` contains the drift-corrected data that fulfilled the criterion of having < 30% CV in the QC injections

##### PDF report

Besides the batchCorr object, a summary will be printed in PDF format, containing a number of graphs.

- **Graph 1** shows the 'goodness of fit' (BIC) of the clustering for the different models and how many clusters each was able to successfully model.
  ![](https://i.imgur.com/9gRLfRy.png)
- **Graph 2** will usually be many pages long and contains a 2D image of the drift pattern of each cluster, where the top half is the data before correction and the bottom half is after correction.
  Each grey line represents a single feature in the QCs, and the black line is the average drift pattern for the entire cluster.
  ![](https://i.imgur.com/zLyMZ3e.png)
- **Graph 3** shows the distribution of feature CVs in the QCs before correction, depicted as a histogram.
  ![](https://i.imgur.com/A8e0LQ7.png)
- **Graph 4** shows the distribution of feature CVs in the QCs after correction, also depicted as a histogram.
  ![](https://i.imgur.com/2yBQPC0.png)
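
The < 30% CV criterion used throughout this SOP (for `$testFeatsFinal` and the histograms in Graphs 3-4) can be illustrated with a minimal sketch. This is not batchCorr code, just the underlying calculation, assuming a `peakTable` and `metadata` structured as described under 'Preparation and input data':

```
# Sketch (not part of batchCorr): per-feature CV in the QC injections,
# assuming injections as rows, features as columns, and a metadata
# column 'grp' that marks QC injections as 'QC'.
qcTable <- peakTable[metadata$grp == 'QC', ]

# CV per feature: standard deviation divided by mean, in percent
featureCV <- apply(qcTable, 2, function(x) 100 * sd(x) / mean(x))

# Features that would pass the < 30% CV criterion
passing <- names(featureCV)[featureCV < 30]

# Histogram analogous to Graphs 3/4
hist(featureCV, breaks = 50, xlab = 'CV (%)')
```

Comparing this histogram before and after running `correctDrift()` gives a quick sanity check that the correction actually reduced QC variability.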