1. XCMS Peak Picking

--- tags: SOP, WIP --- # 1. XCMS Peak Picking ## 1.1 Introduction ### 1.1.1 Objective The peak picking step involves a process in which chromatographic peaks are identified within each file. A chromatograhic peak can be defined as a signal from an ion in retention time dimension (different from a mass peak that represents the signal along the m/z dimension within a spectrum). Chromatographic peak detection aims to identify peaks along the retention time axis that represent the signal from individual compounds’ ions. ### 1.1.2 Data requirements Before you start with preprocessing, you must collect the raw data .d files from the instrument and convert them to .mzML files. For this you can use an MSconvert program such as Proteowizard. For downloading Proteowizard please see: [Download Proteowizard.](https://proteowizard.sourceforge.net/) You should have metadata, eg id, treatment, visit, age and so on in an Excel file, which you can match to your files collected from the instrument. XCMS is very sensitive to parametrization and parameters are also notoriously difficult to optimize. The objective should not be to obtain perfect parameters - A more realistic goal is rather to minimize errors. XCMS parameters for peak picking, retention time adjustment and correspondence can be optimized using the IPO package in R. However, the fitness function in this package is very greedy. And if IPO is left to optimize without constraints, it will be both VERY time consuming and result in poor parameter settings. ## 1.2 Manual parameter optimization As mentioned above, it is better to set certain parameters manually prior to optimization: - `noise` (the threshold value for identifying regions-of-interest) - `prefilter` (the threshold value for identifying peaks from regions-of-interest) - `snthresh` (signal-to-noise ratio) ### `noise` Use the function "specNoise" by xcms, [https://www.rdocumentation.org/packages/xcms/versions/1.48.0/topics/specNoise](https://) Function: specNoise(spec, gap = quantile(diff(spec[, "mz"]), 0.9)). A better algorithm is under development. ### `prefilter` This parameter includes two arguments (n,K). n indicates the minimum number of scan points needed in each peak; K indicates the intensity of each point where we consider regions of interests to be peaks. This should be adjusted to a suitable level to get a reasonable number of peaks. If its set too low there will be a lot of peaks which probably include noise, if its too high we will miss meaningful peaks. For reversed phase negative a reasonable amount of peaks is around ~5000-8000. For HILIC its lower, but how much lower is less known (we can discuss it?). Its important to remember that these numbers are indications but it can vary, depending on the condition of the instrument. ### `snthresh` This decides how big difference in intensity there has to be between noise and a peak, this value is set arbitrary, common values have been 6 or 10. For example, if there are a lot of peaks it may be an idea to increase snthresh, evaluated after correspondance (SOP 3) ## IPO-assisted parameter optimization Parameters normally optimized by IPO are: - minpeakwidth = Minimum peakwidth - maxpeakwidth = Maximum peakwidth - ppm = ROI are created by a maximum allowed distance within a tolerance m/z deviation - mzdiff = the minimum difference of m/z for peaks with overlapping retention time IPO will perform a response surface model optimization to identify optimal settings. Based on previous experience, we normally initiate an IPO search using these initial parameters For IPO you test `value_of_prefilter` (that is the K value mentioned above) manually and you will get an idea of which prefilters are to prefer. A criterion for a good prefilter is that the number of region of interest and the number of detected peaks should not deviate a lot from each other. You can test different values of prefilter using the pickpeaking function of XCMS: in this case, you need to set other parameters to be optimized as reasonable values for your data (Yes, you need to guess them at this step; you can get a reasonable value from publications or more easily from your colleagues who generated their data on the same instrument and same sample type.) Then you use the most appropriate `value_of_prefilter` you could find for peakpicking parameter optimization in IPO. Evaluation of the prefilter setting will be performed after correspondance, see [SOP 3](https://hackmd.io/jq2LbBIgTK-qrXzLrSCxXg) PROCEDURE: - For each batch, extract 5 QCs at random. - Perform IPO separately per batch - Look at the stability of the per-batch optimized parameters - Compare to expected values - If the parametes are stable across all batches, the mean values can then be applied to xcms peakpicking (i.e. chromatographic peak detection). ## Converting from IPO (XCMS2) to XCMS3 XCMS v3 uses a more consistent naming scheme of methods that follows established naming conventions. For example in this new version the parameter is called `correspondence` intead of `grouping`. However, IPO only works with XCMSv2, so its important that parameters are translated to XCMSv3. For example I think there is a difference in profStep/binSize IPO XCMS2/XCMS3 - Different naming of parameters for different version of xcms (2/3) XCMS2 | XCMS3 | | --------| --------| | profStep| Obiwarp binsize| | mzwid | pdParam binsize | **Some text to translate parameter names & give reference values for parameters for RP and HILIC** Code for peakpicking with IPO is presented below, for complete code, see data source X. IPO for retention time is presented in [SOP 2](https://hackmd.io/P4ujZIMHQgeVLIEmEECehA) ``` #Input MS data------------------------------------------------------------------------------ LCMS_data <- getFiles("/castor/project/proj/Elise/Negative mzML/") %>% cleanFiles(c("WASH","sQCcond","blank", "cond", "iterative")) sQCs <- LCMS_data %>% getRP() %>% getNEG() %>% getQC("sQC") Batches<- stri_extract_first_regex(sQCs$batch, "[0-9]+") sQCs_batch_specific <- sQCs[Batches==1,] #run IPO per batch sQCs_batch_names <- sQCs_batch_specific %>% extractNames() %>% sample(5) ## ----optimize_peak_picking----------------------------------------------------------------- peakpickingParameters <- getDefaultXcmsSetStartingParams('centWave') peakpickingParameters$min_peakwidth <- c(5, 9) peakpickingParameters$max_peakwidth <- c(20, 40) peakpickingParameters$noise <- 500 peakpickingParameters$ppm <- c(10, 30) peakpickingParameters$prefilter <- 3 peakpickingParameters$value_of_prefilter <-3000 peakpickingParameters$snthresh <- 10 time.xcmsSet <- system.time({ # measuring time, time to do optimization resultPeakpicking <- optimizeXcmsSet(files = sQCs_batch_names, params = peakpickingParameters, isotopeIdentification= "IPO", BPPARAM = MulticoreParam(12), subdir = "location", plot = TRUE) }) ``` By now you have got a feeling of which value of prefilter is to prefer, and parameters are optimized for that value. If it is a multibatch experiment, for each batch you collect the parameters optimized by IPO (minpeakwidth, maxpeakwidth, ppm, mzdiff) for the overall best prefilter, then you calculate the average, standard deviation and the coefficient of variation for each parameter. If the differences are not too big between bacthes it is okay to use the average of parameters for peak picking. If the average is too big then the settings should be set separately for the different groups. It is possible only one batch is deviating from the others, but it is also possible there are groups of batches deviating. If this is the case the peak peakpicking will be performed separately. After peak picking data will be coerced (`c(x,y)`). All the data will be just one object during further steps in XCMS. **Reference values for parameters** | | Reversed phase | HILIC | | -------- | -------------- | ----- | min_peakwidth | 5-10 | 5-10 | max_peakwidth | 20-40 | 20-70 | | ppm | 15-25 | 20-40 | | mzdiff | +-0.002 | +-0.02 | ## Actual peak picking - Put the actual script in Github ``` #Necessary packages library(xcms) # main algorithm for peak picking library(CMSITools,lib.loc = "/home/elise/R/x86_64-pc-linux-gnu-library/3.6/") # Carl brunius has written a script of how to sort out the data, eg function "getFiles" CMSITools: https://gitlab.com/CarlBrunius/CMSITools. You need an invitation from Carl to be able to access CMSITools library(tidyverse) # for function %>% library(StatTools) # For the imputation function, 'mvImp', script written by Carl Brunius, https://gitlab.com/CarlBrunius/StatTools library(BiocParallel) # parallell computing register(bpstart(MulticoreParam(12))) # shared memory computing ```` In the first step, files are collected. Data from negative/positive ionisation are preprocessed separately. ```` LCMS_data <- getFiles("/castor/project/proj/Elise/IBS 2/NEG_mzML_11-21/") %>% cleanFiles(c("WASH","sQCcond","blank", "cond", "iterative", "ibs1")) %>% getRP() %>% getNEG() #Define groups LCMS_data$group <- rep(NA,nrow(LCMS_data)) LCMS_data$group[LCMS_data$sample %>% grep("sQC",.)] <- "sQC" LCMS_data$group[LCMS_data$sample %>% grep("ltQC",.)] <- "ltQC" LCMS_data$group[-c(LCMS_data$sample %>% grep("sQC",.), LCMS_data$sample %>% grep("ltQC",.))] <- "sample" ```` Thereafter different type of samples are asigned a sample_name and a sample_group/batch/injection order, saved in the object "pd". 'sample_group' is the group/groups samples are assigned to. In some projects all samples will be considered one group. Then all samples are assigned to the group called for example 'sample'. For other projects it is beneficial to have several groups, for example if there are different treatments, A and B, then samples from trt 'A' are assigned the group 'A' and samples from treatment B are assigned group 'B'. Later on during retention time alignment and correspondance, the algorithm will run based on the sample_groups. For more infromation see SOP 2 and 3. To save which batch/injection a sample has can be good for plotting during retentiontime correction, see SOP 2. ```` pd <- data.frame( sample_name = sub(basename(LCMS_data_names), pattern = ".mzML", replacement = "", fixed = TRUE), sample_group = paste0(LCMS_data$group), batch = LCMS_data$batch, injection = LCMS_data$injection, stringsAsFactors = FALSE) ```` Rawdata is collected by the function 'readMSData'. ```` raw_data <- readMSData(files = LCMS_data_names, pdata = new("NAnnotatedDataFrame", pd), mode = "onDisk") %>% filterRt(.,rt = c(0, 630)) #filterRt depends on the chromatography ```` The actual peak picking (parameters set manually/by IPO). During peakpicking the algorithm will present how many regions of interest that is found (ROI) per sample, it also presents how many peaks were found. They should be harmonized, if there are much more ROI than found peaks, it can be reasonable to increase prefilter values. ```` cwp <- CentWaveParam(peakwidth = c(8.6, 58.4), noise = 500, ppm = 17.3, mzdiff = 0.00193, prefilter = c(3,3500), integrate = 1) xdata <- findChromPeaks(raw_data, param = cwp) saveRDS(xdata,"ibs_all_NEG_PP.rds") ```` Save data in RDS file to be able to take it up from here when necessary ```` xdata <- readRDS("name_of_your_file.rds") ````