---
tags: data
---

Author: Yingxiao Yan

Access data --> Explore data --> Prepare data --> Analyze & report data --> Export results

Data are usually not delivered all at once (people are inconsistent). "Round 1" refers to the first preprocessing of the data; "round 2" means that extra data were sent and the preprocessing was performed again. In each round, the workflow is documented per R script, specifying the input data and the output data of each script.

# MAX

## 0. Original data

## 0.1. Inclusion of data - qualitative description

| data type | brief description |
| -------- | -------- |
| Metabolomics baseline, 6 weeks, 12 weeks | |
| Dietary nutrients data baseline, 6 weeks, 12 weeks ??? | nutrients??? |
| Dietary 6 index data baseline, 6 weeks, 12 weeks | 6 dietary index data |
| Dietary FFQ data baseline, 6 weeks, 12 weeks | FFQ data |
| Dietary 24 h recall data baseline, 6 weeks, 12 weeks | 24 h recall |
| Microbiota data baseline, 6 weeks, 12 weeks (gfocpk) | genus, family, order, class, phylum and kingdom |
| Microbiota data baseline, 6 weeks, 12 weeks (genus) | genus |
| Covariate & outcomes baseline, 6 weeks, 12 weeks | intermediate risk markers |

## 0.2. All available data - quantitative description

| data document | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| MyFood24_Chalmers_20200424.xlsx | 54970 | 58 | will not use |
| MyFood24_Chalmers_20200424_MPB.xlsx | 54970 | 58 | will not use |
| myfood24_update_14.06.2021.csv | 53151 | 55 | USE |
| MAX_DiGuMet_clinical_and_anthropometric_data.xlsx | 1367 | 139 | USE |
| digumet_baseline_SYG_AFF.csv | 628 | 163 | USE, lifestyle data |
| digumet_6_and_12_SYG_AFF.csv | 707 | 163 | USE, lifestyle data |
| Diease data_ENG.xlsx | 628 | 163 | USE, disease data |
| AFF07_disease.xlsx | 628 | 10 | will not use, disease data |
| helbred_labels.xlsx | 145 | 3 | will not use, disease data |
| affoering_labels.xlsx | 44 | 6 | will not use, disease data |
| affoering_lab_2.xlsx | 16 | 12 | will not use, disease data |
| row_matched_peaks_v3_final.csv | 1362 | 4851 | USE, metabolomics data |
| row_matched_meta_v2_final.csv | 1362 | 6 | USE, metabolomics meta data |
| Genus_1_percent_ICC_05.xlsx | 1228 | 156 | USE, microbiota data, genus |
| All_phyla_genus_1_percent.xlsx | 1228 | 551 | USE, microbiota data, gfocpk |
| 2_Abundance_feces n444.sav | 1069 | 8 | will not use |

## 0.3. Inclusion of data (finally used) - quantitative description

| data document | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| myfood24_update_14.06.2021.csv | 53151 | 55 | USE |
| MAX_DiGuMet_clinical_and_anthropometric_data.xlsx | 1367 | 139 | USE |
| digumet_baseline_SYG_AFF.csv | 628 | 163 | USE, lifestyle data |
| digumet_6_and_12_SYG_AFF.csv | 707 | 163 | USE, lifestyle data |
| Diease data_ENG.xlsx | 628 | 163 | USE, disease data |
| row_matched_peaks_v3_final.csv | 1362 | 4851 | USE, metabolomics data |
| row_matched_meta_v2_final.csv | 1362 | 6 | USE, metabolomics meta data |
| Genus_1_percent_ICC_05.xlsx | 1228 | 156 | USE, microbiota data, genus |
| All_phyla_genus_1_percent.xlsx | 1228 | 551 | USE, microbiota data, gfocpk |
| 2_Abundance_feces n444.sav | 1069 | 8 | will not use |
| digumet_ffq_v1.sas7bdat | 1461 | 574 | FFQ dietary data to use |

## 1. Round 1 - R document workflow

The metabolomics data are in:
`/castor/project/proj/Yan_threecohort/MAX data/MAX_metabolomics data/`

The other data are in:
`/castor/project/proj/Yan_threecohort/MAX data/`

All the scripts are in:
`/castor/project/proj/Yan_threecohort/`

Workflow:
get_rawdata.R --> clean_data_run1.R --> clean_data_run2_viktor.R --> match_data.R --> preclean_analysis.R --> Analysis_1.R

All the intermediate and final processing results are in:
`/castor/project/proj/Yan_threecohort/`

### 2.1 get_rawdata_new.R

| data document | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| MyFood24_Chalmers_20200424.xlsx | 54970 | 58 | will not use |
| MyFood24_Chalmers_20200424_MPB.xlsx | 54970 | 58 | will not use |
| myfood24_update_14.06.2021.csv | 53151 | 55 | USE |
| MAX_DiGuMet_clinical_and_anthropometric_data.xlsx | 1367 | 139 | USE |
| digumet_baseline_SYG_AFF.csv | 628 | 163 | USE, lifestyle data |
| digumet_6_and_12_SYG_AFF.csv | 707 | 163 | USE, lifestyle data |
| Diease data_ENG.xlsx | 628 | 163 | USE, disease data |
| AFF07_disease.xlsx | 628 | 10 | will not use, disease data |
| helbred_labels.xlsx | 145 | 3 | will not use, disease data |
| affoering_labels.xlsx | 44 | 6 | will not use, disease data |
| affoering_lab_2.xlsx | 16 | 12 | will not use, disease data |
| row_matched_peaks_v3_final.csv | 1362 | 4851 | USE, metabolomics data |
| row_matched_meta_v2_final.csv | 1362 | 6 | USE, metabolomics meta data |
| Genus_1_percent_ICC_05.xlsx | 1228 | 156 | USE, microbiota data, genus |
| All_phyla_genus_1_percent.xlsx | 1228 | 551 | USE, microbiota data, gfocpk |
| 2_Abundance_feces n444.sav | 1069 | 8 | will not use |
| digumet_ffq_v1.sas7bdat | 1461 | 574 | FFQ dietary data to use |

The input data files are:
```
MyFood24_Chalmers_20200424.xlsx
MyFood24_Chalmers_20200424_MPB.xlsx
myfood24_update_14.06.2021.csv
MAX_DiGuMet_clinical_and_anthropometric_data.xlsx
digumet_baseline_SYG_AFF.csv
digumet_6_and_12_SYG_AFF.csv
Diease data_ENG.xlsx
AFF07_disease.xlsx
helbred_labels.xlsx
affoering_labels.xlsx
affoering_lab_2.xlsx
row_matched_peaks_v3_final.csv
row_matched_meta_v2_final.csv
Genus_1_percent_ICC_05.xlsx
All_phyla_genus_1_percent.xlsx
2_Abundance_feces n444.sav
```

The script is `get_rawdata.R`

The output is `rawdata_tobecleaned.rda`, which contains:

| dataframe | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| diet_data | 53151 | 55 | USE |
| MAX_DiGuMet_clinical_and_anthropometric_data | 1367 | 139 | USE |
| lifestyle_data_baseline | 628 | 163 | USE, lifestyle data |
| lifestyle_data_6_12 | 707 | 163 | USE, lifestyle data |
| Diease data | 628 | 163 | USE, disease data; it is the same as lifestyle_data_baseline |
| metabolomics | 1362 | 4851 | USE, metabolomics data |
| microbiota_data_small | 1228 | 156 | USE, microbiota data, genus |
| microbiota_data_large | 1228 | 551 | USE, microbiota data, gfocpk |
| diet_scores | 1069 | 8 | USE, 6 index scores: Nordic_dietscore, rMED_dietscore, PDI_index, hPDI_index, uPDI_index, proveg_index |
| FFQdata | 1461 | 574 | FFQ dietary data to use |

### 2.2 clean_data_new.R

The input data file is: `cleanedata_tobematched.rda`

The scripts are `clean_data_run1.R` and `clean_data_run2_viktor.R`

The output is `cleaneddata_tobematched.rda`, which contains:

| dataframe baseline | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| diet_data_baseline_grouped_sum | 3913 | 42 | USE |
| diet_data_baseline_shrink_sum_use | 625 | 41 | USE |
| MAX_DiGuMet_clinical_and_anthropometric_data_baseline_use | 628 | 23 | USE |
| Diease_data_baseline | 628 | 12 | USE, lifestyle data |
| metabolomics_baseline_use | 626 | 4855 | USE, metabolomics data |
| microbiota_data_small_baseline_use | 565 | 156 | USE, microbiota data, genus |
| microbiota_data_large_baseline_use | 565 | 551 | USE, microbiota data, gfocpk |
| diet_scores_baseline | 408 | 8 | USE |
| FFQdata_baseline_use | 628 | 574 | USE |

| dataframe six | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| diet_data_six_grouped_sum | 2504 | 42 | USE |
| diet_data_six_shrink_sum_use | 393 | 41 | USE |
| MAX_DiGuMet_clinical_and_anthropometric_data_six_use | 387 | 23 | USE |
| Diease_data_six | 375 | 12 | USE, lifestyle data |
| metabolomics_six_use | 385 | 4855 | USE, metabolomics data |
| microbiota_data_small_six_use | 352 | 156 | USE, microbiota data, genus |
| microbiota_data_large_six_use | 352 | 551 | USE, microbiota data, gfocpk |
| diet_scores_baseline | 354 | 8 | USE |
| FFQdata_six_use | 441 | 574 | USE |

| dataframe twelve | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| diet_data_twelve_grouped_sum | 2319 | 42 | USE |
| diet_data_twelve_shrink_sum_use | 362 | 41 | USE |
| MAX_DiGuMet_clinical_and_anthropometric_data_twelve_use | 352 | 23 | USE |
| Diease_data_twelve_use | 332 | 12 | USE, lifestyle data |
| metabolomics_twelve_use | 351 | 4855 | USE, metabolomics data |
| microbiota_data_small_twelve_use | 307 | 156 | USE, microbiota data, genus |
| microbiota_data_large_twelve_use | 307 | 551 | USE, microbiota data, gfocpk |
| diet_scores_baseline | 307 | 8 | USE |
| FFQdata_twelve_use | 392 | 574 | USE |

### 2.3 match_data

The input data file is: `cleaneddata_tobematched.rda`

The script is `match_data_new.R`

The output is `matchedata_tobeanalyzed.rda`, which contains:

| dataframe for variables baseline | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| diet_data_baseline_shrink_sum_use | 623 | 41 | USE |
| diet_data_baseline_grouped_sum_use | 3899 | 42 | USE |
| Disease_data_baseline_use | 626 | 12 | USE |
| MAX_DiGuMet_clinical_and_anthropometric_data_baseline_use | 626 | 23 | USE |
| microbiota_data_small_baseline_use | 565 | 156 | USE |
| microbiota_data_large_baseline_usee | 565 | 551 | USE |
| diebaselinescores_baseline_usee | 407 | 8 | USE |
| FFQdata_sbaseline_use | 312 | 574 | USE, 570 FFQ variables |

| dataframe for metabolomics baseline | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| metab_diet_baseline_use | 623 | 4855 | USE |
| metab_diet_baseline_grouped_use | 621 | 4855 | USE |
| metab_diseases_baseline_use | 626 | 4855 | USE |
| metab_clinical_baseline_use | 626 | 4855 | USE |
| metab_microsmall_baseline_use | 565 | 4855 | USE |
| metab_microlarge_baseline_use | 565 | 4855 | USE |
| metab_diet_scores_baseline_use | 407 | 4855 | USE |
| metab_FFQdata_baseline_use | 312 | 4855 | USE |

| dataframe for variables six | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| diet_data_six_shrink_sum_use | 380 | 41 | USE |
| diet_data_six_grouped_sum_use | 2416 | 42 | USE |
| Disease_data_six_use | 373 | 12 | USE |
| MAX_DiGuMet_clinical_and_anthropometric_data_six_use | 385 | 23 | USE |
| microbiota_data_small_six_use | 352 | 156 | USE |
| microbiota_data_large_six_use | 352 | 551 | USE |
| diet_scores_six_use | 352 | 8 | USE |
| FFQdata_six_use | 199 | 574 | USE, 570 FFQ variables |

| dataframe for metabolomics six | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| metab_diet_six_use | 380 | 4855 | USE |
| metab_diet_six_grouped_use | 380 | 4855 | USE |
| metab_diseases_six_use | 373 | 4855 | USE |
| metab_clinical_six_use | 385 | 4855 | USE |
| metab_microsmall_six_use | 352 | 4855 | USE |
| metab_microlarge_six_use | 352 | 4855 | USE |
| metab_diet_scores_six_use | 352 | 4855 | USE |
| metab_FFQdata_baseline_use | 199 | 4855 | USE |

| dataframe for variables twelve | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| diet_data_twelve_shrink_sum_use | 347 | 41 | USE |
| diet_data_twelve_grouped_sum_use | 2221 | 42 | USE |
| Disease_data_twelve_use | 331 | 12 | USE |
| MAX_DiGuMet_clinical_and_anthropometric_data_twelve_use | 351 | 23 | USE |
| microbiota_data_small_twelve_use | 307 | 156 | USE |
| microbiota_data_large_twelve_use | 307 | 551 | USE |
| diet_scores_six_use | 306 | 8 | USE |
| FFQdata_twelve_use | 170 | 574 | USE, 570 FFQ variables |

| dataframe for metabolomics twelve | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| metab_diet_twelve_use | 347 | 4855 | USE |
| metab_diet_twelve_grouped_use | 346 | 4855 | USE |
| metab_diseases_twelve_use | 331 | 4855 | USE |
| metab_clinical_twelve_use | 351 | 4855 | USE |
| metab_microsmall_twelve_use | 307 | 4855 | USE |
| metab_microlarge_twelve_use | 307 | 4855 | USE |
| metab_diet_scores_twelve_use | 306 | 4855 | USE |
| metab_FFQdata_twelve_use | 170 | 4855 | USE |

### 2.5 preclean_analysis

The input data file is: `matchedata_tobeanalyzed.rda`

The script is `preclean_analysis.R`

The output is `ready_for_analysis.rda`

Age and gender exist in the disease data. Visit date and gender exist in the clinical data. Age and gender exist in the microbiota data, in the metabolomics data, and in the FFQ data.

Gender was compared across these 5 datasets and age across the 4 datasets that contain it. We then decided to use the age and gender from the metabolomics dataset, since they have no NA values and the metabolomics data need to be matched with all the other data frames.

### 3.1 Analysis_1

The input data file is: `ready_for_analysis.rda`

The script is `Analysis_1.R`

The output is `Analysis_1_result.rda`
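
The age and gender comparison described under preclean_analysis could be sketched roughly as follows. This is only an illustration on toy data, not the content of `preclean_analysis.R`: the column names `id`, `gender`, and `age`, the helper `find_mismatches`, and the three small data frames are all assumptions made for the example.

```r
# Toy data frames standing in for three of the datasets.
# Column names (id, gender, age) are hypothetical.
metabolomics <- data.frame(id = c("a", "b", "c"),
                           gender = c("F", "M", "F"),
                           age = c(45, 52, 38),
                           stringsAsFactors = FALSE)
disease <- data.frame(id = c("a", "b", "c"),
                      gender = c("F", "M", NA),
                      age = c(45, NA, 38),
                      stringsAsFactors = FALSE)
microbiota <- data.frame(id = c("a", "c"),
                         gender = c("F", "F"),
                         age = c(45, 39),
                         stringsAsFactors = FALSE)

# Hypothetical helper: match `other` to `reference` on id and return
# the ids whose values of `variable` disagree. Pairs in which either
# value is NA are skipped rather than counted as mismatches.
find_mismatches <- function(reference, other, variable) {
  merged <- merge(reference[, c("id", variable)],
                  other[, c("id", variable)],
                  by = "id", suffixes = c("_ref", "_other"))
  ref_values <- merged[[paste0(variable, "_ref")]]
  other_values <- merged[[paste0(variable, "_other")]]
  both_present <- !is.na(ref_values) & !is.na(other_values)
  merged$id[both_present & ref_values != other_values]
}

others <- list(disease = disease, microbiota = microbiota)

# Ids that disagree with the metabolomics data, per dataset.
gender_mismatches <- lapply(others, find_mismatches,
                            reference = metabolomics, variable = "gender")
age_mismatches <- lapply(others, find_mismatches,
                         reference = metabolomics, variable = "age")

# Number of rows with a missing age or gender in each dataset.
na_counts <- sapply(c(list(metabolomics = metabolomics), others),
                    function(df) sum(is.na(df$gender) | is.na(df$age)))

print(gender_mismatches)
print(age_mismatches)
print(na_counts)
```

In this toy setup the metabolomics data frame has no missing age or gender, which mirrors the reason given above for taking both variables from the metabolomics dataset.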