---
tags: Resources
---

# Action list & Progress

## Metabolomics data generation SOPs

- Example data - repository?
    - [x] Scout for potential dataset -> CMSI QC (Rui's data as backup (see below))
    - [ ] Check "QC" data (Anton)
    - [ ] Curate dataset into small/manageable dataset suitable for training purposes (Anton)
    - [ ] Make unified script for pre-processing of "QC" data (Anton)
- Overview / Pipeline / Workflow / Vocabulary (metabolite, annotation/identification etc)
    - [ ] 1st version (Elise & Rui)
    - [ ] Check (separate into the SOP tasks)
- SOP 1st version
    - [x] xcms - PP (Elise)
    - [x] xcms - RetAlign (Cecilia)
    - [x] xcms - corr (Rui)
    - [x] xcms - fillPeaks (Rui)
    - [x] imputation (Rui)
    - [x] RAMClust (Olle)
    - [x] batchCorr (Anton)
    - [x] MetID (Anton)
    - [ ] Report template (Anton)
    - [ ] Visualizations for sanity check
- Check 1
    - [x] xcms - PP (Rui)
    - [x] xcms - RetAlign (Elise --> Olle, double checked)
    - [x] xcms - corr (Elise)
    - [x] xcms - fillPeaks (Olle)
    - [x] imputation (Olle)
    - [x] RAMClust (Elise)
    - [x] batchCorr (Elise)
    - [ ] MetID ()
    - [ ] Report template ()
    - [ ] IPO vs XCMS3 translation table (Olle)
    - [ ] Harmonization (Olle)
- Source code
    - [ ] Source code 1st version (Sergiu)
    - [ ] Source code checking (All)
    - Expand per SOP
- Resources
    - [x] Video links (Calle)
    - [x] R packages (Calle)
    - [x] Bianca 1st version (Eddie / Cecilia)
    - [x] Bianca checking (Rui)
- Git
    - [ ] Presentation (Yan)
    - [ ] Git-ing HackMD (Elise)

## Other efforts

- Data stewardship
    - [ ] Original place for storage (chalmers offline; chalmers server; bianca)
    - [ ] Data format for metadata SOP/info
    - [ ] Codebook SOP/info
    - [ ] Codebook example
    - [ ] Managing Bianca projects SOP
    - [ ] Appointing data steward
- QC package
- MSID package

## Harmonization

To improve throughput and reproducibility, we need to implement uniform naming strategies.
Harmonization/standardization should be implemented for:

**Instrument file names in A_B_C_D_E_F format**

This naming strategy is important for CMSITools to properly extract all relevant information from the filename.

- A is the date, written as YYYY-MM-DD.
- B is the batch number and the week in which it was analyzed, written as BZZWXX (e.g. B02W43).
- C is for chromatography, either RP (Reversed Phase) or HILIC.
- D is for polarity, either POS or NEG.
- E is the sample identifier, marking the sample as either an sQC, ltQC, blank, cond (conditioning plasma) or a sample (named so that it can be traced back to the sample it belongs to). Each element here needs to have a unique name for every batch (e.g. sQC01, or 125c).
- F is the injection number, marking the order in which the samples in each batch were injected.

**Features in ABc@d format**

- A is either H or R for chromatography (**H**ILIC / **R**eversed phase)
- B is either P or N for polarity (**P**ositive / **N**egative)
- c is the recorded m/z
- @ is a separator between m/z and rt
- d is the recorded retention time (rt)

NB! To facilitate tracking of features between different modules in the pipeline, *m/z and rt should be given with full resolution*, i.e. not truncated to a certain number of decimals.

**RAMClusters in ABCn@d format**

- A is either H or R for chromatography (**H**ILIC / **R**eversed phase)
- B is either P or N for polarity (**P**ositive / **N**egative)
- C is for cluster
- n is the cluster number
- @ is a separator between the cluster (Cn) and rt
- d is the recorded retention time (rt)

**Other things**

@Hzwwav8nTEqDzIBprJEyaw @YhTVx2AwRfS78SzlzmHT2Q Please expand here as you see fit!
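
The three naming conventions above can be checked mechanically. Below is a minimal Python sketch of such a check; it is not CMSITools code, and the function names (`parse_filename`, `parse_id`) and the example names are hypothetical. It assumes that the file extension has already been removed, that ZZ and XX in BZZWXX are two digits each (as in B02W43), that the sample identifier contains no underscore, and that the injection number consists of digits only. m/z and rt are kept as strings so that they are never rounded or truncated.

```python
import re

# Hypothetical patterns derived from the naming rules above (not CMSITools code).
FILE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"        # A: date, YYYY-MM-DD
    r"_B(?P<batch>\d{2})W(?P<week>\d{2})"  # B: batch and week, BZZWXX (assumed 2+2 digits)
    r"_(?P<chrom>RP|HILIC)"                # C: chromatography
    r"_(?P<polarity>POS|NEG)"              # D: polarity
    r"_(?P<sample>[^_]+)"                  # E: sample identifier (assumed underscore-free)
    r"_(?P<injection>\d+)$"                # F: injection number (assumed digits only)
)

# Features: ABc@d, e.g. chromatography + polarity + m/z + "@" + rt
FEATURE_RE = re.compile(
    r"^(?P<chrom>[HR])(?P<polarity>[PN])"
    r"(?P<mz>\d+(?:\.\d+)?)@(?P<rt>\d+(?:\.\d+)?)$"
)

# RAMClusters: ABCn@d, e.g. chromatography + polarity + "C" + cluster number + "@" + rt
CLUSTER_RE = re.compile(
    r"^(?P<chrom>[HR])(?P<polarity>[PN])"
    r"C(?P<cluster>\d+)@(?P<rt>\d+(?:\.\d+)?)$"
)


def parse_filename(name):
    """Split an instrument file name (extension removed) into its six fields."""
    m = FILE_RE.match(name)
    if m is None:
        raise ValueError(f"not in A_B_C_D_E_F format: {name!r}")
    fields = m.groupdict()
    # Batch, week and injection order are numeric; everything else stays text.
    for key in ("batch", "week", "injection"):
        fields[key] = int(fields[key])
    return fields


def parse_id(name):
    """Classify a feature (ABc@d) or RAMCluster (ABCn@d) name and return its parts.

    m/z and rt are returned as strings to preserve their full resolution.
    """
    for kind, pattern in (("feature", FEATURE_RE), ("cluster", CLUSTER_RE)):
        m = pattern.match(name)
        if m is not None:
            return {"kind": kind, **m.groupdict()}
    raise ValueError(f"neither ABc@d nor ABCn@d: {name!r}")


if __name__ == "__main__":
    # Made-up names, for illustration only.
    print(parse_filename("2021-10-25_B02W43_RP_POS_sQC01_12"))
    print(parse_id("RP301.1412@245.873"))
    print(parse_id("HNC17@88.42"))
```

Names that do not follow the conventions raise a `ValueError`, so a check like this could be run over a whole batch before pre-processing to catch misnamed files early.
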
## Example data

We will use the CMSI "QC" data set for training and testing purposes.

As a backup, Rui suggests this data: https://www.ebi.ac.uk/metabolights/MTBLS1839/assays

## Specific for overview

- [ ] Graphical overview of algorithms
- [ ] (Graphical) overview of file structure (SOPs, source code, example data, other resources)
- [ ] Pipeline
- [ ] Workflow
- [ ] Vocabulary
- [ ] Harmonization (see above)
- [ ] Align with report template

## General structure for SOPs

- Introduction: What is the purpose of this step in the pipeline?
- Input data description
    - Objects / variables / parameters / file formats
    - Where does it come from (e.g. "converted from instrument raw data using ProteoWizard" or "Output from SOP X: Algorithm")
- Describe key elements (central functions) of the code / scripts
    - But not the actual script itself (reference the SRC with a link)
    - Describe key parameters and how they are obtained / optimized
- Output data description

## Source code

@tCElCkukTEqMlVK2VncDfQ Please write down what information is needed for each script and what the example files are.

Example scripts that are more or less ready for pipeline integration:

- BWAlignment.R
- PeakPicking_preBW.R

## Members

### Present

- Anton
- Elise
- Rui
- Olle
- Calle
- Yan (data stewardship)

### Past

- Eddie
- Cecilia
- Sergiu