5. Ramclust parameter optimization

--- tags: SOP, WIP --- #### Input data - Output file from Batchcorr - Xcms data generated after groupChromPeaks() (from alignment/correspondance) #### Output data - Final data matrix to be used for statistical analysis - List with information about each cluster. # 5. Ramclust parameter optimization RamClust is a tool to merge features belonging to the same metabolite. This is needed as during ionization in the LC, metabolites can get fragmented into smaller pieces, and we don't want information about the same metabolite appearing several times in our data, as this can lead to overfitted models later on. In addition to fragmentation, isotopes are also detected as separate features in the MS, these should also be bundled together in the cluster. Another indirect benefit of clustering fragments/features belonging to the same metabolite is that during identification a cluster can be an indication of a feature being a real feature, not only noise. ## Loading data First we have to load all the necessary data and libraries. The input to RAMClust is either an xcms object or a .csv file. As we are using batchCorr between xcms and RAMClust, we will need to use the .csv format, therefore we load the post-batchCorr data and convert it into a .csv file. We also load the xcms data obtained after peak grouping as this is needed to create the plots we use for final evaluation of the different parameters. We also need to load the plotClust.R, this should be found in the same repository as this file. ```{r load data} library(RAMClustR) library(xcms) source('plotClust.R') #If this script file is not in your current working directory (checked with getwd() function), write full filepath. load('RN_xcms_data_after_groupChromPeaks().rda') #found in SOP #2 XCMS RT Alignment xcmsData <- 'name_of_RN_xcmsdata' #usually named xdata load('File_from_batchCorr.rda') #found in SOP #4 batchCorr Drift Correction peakTab <- PTDataNorm$peakTable write.csv(peakTab,file='peakTable.csv') ``` ## Parameter set-up Next we need to define which parameters and what values of these we want to investigate, along with creating empty vectors for future storage and selecting 20 random samples to use for the plots. Here we chose values 0.3, 0.4 and 0.5 for sr and 0.5, 1, 1.5 and 2 for st. In our experience, the "best" parameters are found amongst these. ```{r Set-up} expDes=defineExperiment(force.skip = T) sr=c(.3,.4,.5) st=c(.5,1,1.5,2) maxt=c(5) par=expand.grid(st=st,sr=sr,maxt=maxt) str(par) nClust=nSing=sizeMax=sizeMed=sizeMean=numeric(nrow(par)) nFeat=list() nSamp <- nrow(peakTab) samps=sample(1:nSamp,20) #choosing 20 random samples that will be used during plotting, could be increased for further assurance (at heavy computational cost). ``` ## Plotting Here we create a for-loop that will run over all different parameter combinations that we created in the previous stage. We start by using the ramClust algorithm for the i:th parameter setting followed by plotting the clusters created by this setting using the plotClust function. Each loop generates a separate PDF which are used to manually assess whether a certain parameter setting is deemed good or not (This is the tricky part, but it gets easier the more you do it!). ```{r Plotting} for (i in 1:nrow(par)) { RRP=ramclustR(ms='peakTable.csv', st = par$st[i], sr=par$sr[i], maxt = par$maxt, timepos = 2, sampNameCol = 1, featdelim = '@', ExpDes = expDes) #Important to note that a csv file is needed as input! nClust[i]=length(RRP$cmpd) nSing[i]=RRP$nsing sizeMax[i]=max(RRP$nfeat) sizeMed[i]=median(RRP$nfeat) sizeMean[i]=mean(RRP$nfeat) nFeat[[i]]=RRP$nfeat pdf(file=paste0('clusts_par',i,'.pdf'),width=15,height=8) par(mfrow=c(4,5),mar=c(4,4,2,0)+.5) clusts=round(c(2:6,seq(7,max(RRP$featclus),length.out = 15))) for (c in clusts) { plotClust(ram = RRP,clustnr = c,xcmsData = xcmsData,samps = samps) } dev.off() } cbind(par,nClust,nSing,sizeMax,sizeMean,sizeMed) ``` ## Plot assesment When assesing whether a plot is good there are the following factors to take into account: 1. Does it look like all of the clusters only contain peaks from the same metabolite? Or does it look like two features belonging to different metabolites might have been merged into one? 2. How do the m/z and rt values look? For instance, if there are peaks where three peaks who all increase by 1 m/z in magnitude each step? Good, likely an isotope pattern and likely belonging to the same metabolite. 3. Peakshape. Do the peaks look like you're only modelling noise (especially in the latter clusters)? Could be that your prefilter setting in xcms peak picking was too low. If it's really aweful you should reconsider adjusting the prefilter (after consultation with someone more experienced). Example of output: ![](https://i.imgur.com/lFel2Gv.png) Quality of parameters should not be judged based on features eluting earlier than around ~40 second, this time span is not trustworthy due to ion suppression. Create a grid in e.g. excel where you set up columns as st and rows as sr to make a 3x4 matrix. In this grid, assign a '+' for each setting you think 'look good', and a '-' for each setting you think does not look good (if you're unsure, a third option such as "OK" could be used as an inbetween option). Followed by application of a 'mental smoothing' (see recorded video on RAMClust tutorial) where you try to extrapolate based on which parameter combinations you expect to be better than others (results should get worse the further to the bottom right you get). Once you've made this grid, applied the mental smoothing and have a general feel for the data, pick out the highest combination of values for st and sr that does not look like it messed up the feature merging. Now that you've hopefully found your best parameter settings for sr and st, you're ready to move on to use RAMClust to get your final data set ready! # Ramclust usage for final data acquisition Here, we go ahead and do the actual clustering of your data with your newly obtained values of sr and st. ## Clustering Here is where the actual clustering takes place. Fill in your optimized values of sr and st into these variables and go ahead and run the code. ```{r Clustering} st <- 'Value from RamClust Optimization' sr <- 'Value from RamClust Optimization' RC <- ramclustR(ms='peakTableRN.csv', st = st, sr=sr, maxt = 5, timepos = 2, sampNameCol = 1, featdelim = '@', ExpDes = expDes) ``` ## Sanity check and saving data We start out by checking how many single features we still have after the clustering. If this number is close to your starting number of variables, something has likely gone wrong. We also check how many clusters that were created. We can then go ahead and combine our data so that both the clusters and the single features are all in the same dataframe and then we save our file. ```{r Clustering} RC$nsing #check how many features does not belong to any clusters max(RC$featclus) #check how many clusters were created Clusts <- RC$SpecAbund colnames(Clusts) <- paste0("C", colnames(Clusts)) RNRam=cbind(Clusts,RC$MSdata[,RC$featclus==0]) colnames(RNRam) <- paste0('RN',colnames(RNRam)) RNRC <- RC save(RNRC,RNRam,file='RNRam.rda') ``` N.B. this has to be rerun for all different modes that you have data for (i.e RP, HN and HP) Now you're finally done with all the preprocessing steps and can move on to start doing statistics on your data! Your final data matrices will be available in the RPRam and RNRam objects (and HNRam and HPRam if you also have HILIC data). Further information about the clusters are found in the RPRC and RNRC objects, where you amongst other things find information about which features are part of the cluster as well as their intensities.