BIODIVA data manipulation documentation

---
tags: data
---

Author: Yingxiao Yan

A website for dietary index calculation: https://github.com/jamesjiadazhan/dietaryindex/tree/main/R

Access data --> Explore data --> Prepare data --> Analyze & report data --> Export results

Data are usually not delivered all at the same time (people are inconsistent). "Round 1" means the preprocessing of the data the first time; "round 2" means that extra data were sent and the preprocessing was performed again. In each round, the workflow is written based on the R scripts, and the input and output data of each script are specified.

# BioDiva

## 0. Original data

## 0.1. Inclusion of data - qualitative description

|data type|brief description|
|--------|--------|
|Dietary data baseline|FFQ|
|Dietary data 10Y follow-up|FFQ|
|Pollutants baseline|Organochlorine compounds and PFAS|
|Pollutants 10Y follow-up|Organochlorine compounds and PFAS|
|Metabolomics baseline|HP,HN,RP,RN|
|Metabolomics 10Y follow-up|HP,HN,RP,RN|
|Covariate & outcomes baseline|diabetes, intermediate risk markers|
|Covariate & outcomes 10Y follow-up|diabetes, intermediate risk markers|

## 0.2. All available data - quantitative description

|data document|Nu.obs|Nu.vars|Additional comments|
|--------|--------|--------|--------|
|Biodiva_Metabolomics.Rdata|1408|24759|The processed metabolomics data with "Stringent processing"; including baseline and follow-up metabolomics data|
|Biodiva_Metabolomics.csv|1408|29244|The raw metabolomics data; including baseline and follow-up metabolomics data|
|BSD_logFFQNEW.rda|844|24879|The processed metabolomics data with "Stringent processing"; baseline metabolomics for case-control pairs|
|Diet_baseline.rda|844|446|Baseline dietary data for case-control pairs|
|FinalDietD.xlsx|1416|419|Baseline and follow-up dietary data|
|MetaDiet_2.rda|1006|543|????|
|OC: POPvar.rda|1736|22|no missing values, OC for baseline and follow-up (from Shi Lin)|
|OC: BioDiva881POP.xlsx|776|25|many missing values below LOD (from Anna-Sara)|
|OC: BioDiva886POP.xlsx|952|25|many missing values below LOD (from Anna-Sara)|
|PFAS: BioDiva881PFAS.xlsx|776|15|many missing values below LOD (from Anna-Sara)|
|RepeatedMetabolomcs.rda|748|24762|baseline and follow-up metabolomics for selected case-control pairs|
|repsampleData.Rata|748|177|baseline and follow-up covariates for selected case-control pairs|

## 0.3. Inclusion of data (finally used) - quantitative description

|data document|Nu.obs|Nu.vars|Additional comments|
|--------|--------|--------|--------|
|Biodiva_Metabolomics.Rdata|1408|24759|The processed metabolomics data with "Stringent processing"; including baseline and follow-up metabolomics data https://link.springer.com/content/pdf/10.1007/s00125-017-4521-y.pdf|
|FinalDietD.xlsx|1416|419|Baseline and follow-up dietary data|
|OC: BioDiva881POP.xlsx|776|25|many missing values below LOD (from Anna-Sara)|
|PFAS: BioDiva881PFAS.xlsx|776|15|many missing values below LOD (from Anna-Sara)|
|RepeatedMetabolomcs.rda|748|24762|baseline and follow-up metabolomics for selected case-control pairs|
|repsampleData.Rata|748|177|baseline and follow-up covariates for selected case-control pairs|

## 1. Round 1 - R document workflow

The metabolomics data is in: `/castor/project/proj/three_cohort_project/raw_data`

The other data group is in: `/castor/project/proj/three_cohort_project/raw_data`

All the scripts are in: `/castor/project/proj/three_cohort_project/data_checking`

Workflow: get_rawdata_new.R --> clean_data_new.R --> normalize_OC.R --> match_data_new.R --> Analysis_preclean_new.R --> Analysis_1.R

All the intermediate and final processing results are in: `/castor/project/proj/three_cohort_project/`

### 2.1 get_rawdata_new.R

Input data file is:

```
Biodiva_Metabolomics.Rdata
Biodiva_Metabolomics.csv
BSD_logFFQNEW.rda
Diet_baseline.rda
FinalDietD.xlsx
MetaDiet_Final2.rda
POPvar.rda
BioDiva881POP.xlsx
BioDiva886POP.xlsx
BioDiva881PFAS.xlsx
RepeatedMetabolomcs.rda
repsampleData.Rata
```

The script is `get_rawdata_new.R`

The output is `rawdata_tobecleaned.rda`, which contains:

|dataframe|Nu.obs|Nu.vars|Additional comments|
|--------|--------|--------|--------|
|Biodiva_Metabolomics_csv|1408|29244| |
|Biodiva_Metabolomics_rda|1408|24759| |
|BSD_logFFQNEW|844|24879| |
|Diet_baseline|844|446| |
|FinalDietD|1416|419| |
|MetaDiet_2|1006|543| |
|POPvar|1736|22| |
|OC_881|777|25| |
|OC_886|953|25| |
|PFAS_881|777|15| |
|RepeatedMetabolomcs|748|24762| |
|repINFO|748|177| |

|list|Nu.items|item_type|Additional comment|
|--------|--------|--------|--------|
|ovelapping_BioDiva_Metabolomics_observations|4|overlapping_object|check 3 metabolomics data observations|
|ovelapping_BioDiva_Metabolomics_variables|4|overlapping_object|check 3 metabolomics data variables|
|ovelapping_diet_observations|4|overlapping_object|check 3 diet data observations|
|ovelapping_diet_variables|4|overlapping_object|check 3 diet data variables|

|vector|length|type|Additional comments|
|--------|--------|--------|--------|
|BioDiva_Metabolomics_rda_HP_HN_RP_RN|24758|character|HP,HN,RP,RN status|
|BSD_logFFQNEW_HP_HN_RP_RN|24758|character|HP,HN,RP,RN status|

### 2.2 clean_data_new.R

Input data file is: `rawdata_tobecleaned.rda`

The script is `clean_data_new.R`

The output is `cleaneddata_tobematched.rda`, which contains:

|dataframe|Nu.obs|Nu.vars|Additional comments|
|--------|--------|--------|--------|
|Biodiva_Metabolomics_rda|1408|24759|USE|
|Biodiva_Metabolomics_csv|1408|29244|will not be used, unprocessed data|
|BSD_logFFQNEW|844|24879|will not be used, lack of obs|
|Diet_baseline_covariate|844|32|will not be used, lack of obs|
|Diet_baseline_dietdata|844|325|will not be used, lack of obs|
|FinalDietD|1416|33|USE, some variables removed|
|FinalDietD|1416|299|USE, some variables removed|
|Diet_additional|1006|60|Generated diet variables|
|OCvar|1736|22|will not be used, redundant OC obs and missing variables|
|OC_881_cleaned|775|24|USE, baseline and follow-up, missing obs removed|
|OC_886_cleaned|939|24|will not be used, do not have metabolomics data|
|PFAS_881_cleaned|776|15|USE, baseline and follow-up, missing obs removed|
|RepeatedMetabolomcs|748|24762|USE|
|repINFO_covariate|748|25|USE|
|repINFO_dietdata|748|131|USE|

|vector|length|type|Additional comments|
|--------|--------|--------|--------|
|new_csv columns|29244|character|The names of the variables for Biodiva_Metabolomics_csv with RN, RP, HP, HN|

### 2.3 normalize_OC (mandatory)

Users need to choose whether or not to lipid-normalize the pollutant data. The `lipid_normalizer2()` function does not substitute missing values, but does the following to achieve lipid-normalized OC:

It first separates the variables according to whether a variable has a certain percentage of values below LOD, generating `remainframe` (variables below that percentage) and `removeframe` (variables above that percentage).
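The separation step described above can be sketched roughly as follows. This is only an illustration and not the code of `lipid_normalizer2()`: the helper name `split_by_lod()`, the argument `max_below_lod` and the toy columns are hypothetical, and the sketch assumes that a value below LOD is stored as `NA`.

```r
# Hypothetical helper (not part of the project scripts): split pollutant
# variables by the share of values below LOD.
# Assumption: a value below LOD is stored as NA in the data frame.
split_by_lod <- function(df, max_below_lod = 0.5) {
  # Share of below-LOD (here: NA) values in each column
  share_below <- colMeans(is.na(df))
  keep <- share_below <= max_below_lod
  list(
    remainframe = df[, keep, drop = FALSE],   # share at or below the threshold
    removeframe = df[, !keep, drop = FALSE],  # share above the threshold
    share_below = share_below
  )
}

# Toy data for illustration only, not BioDiva values
toy <- data.frame(
  pcb_a = c(1.2, NA, 0.8, 1.1),
  pcb_b = c(NA, NA, NA, 0.4),
  pcb_c = c(0.3, 0.5, 0.2, 0.9)
)
res <- split_by_lod(toy, max_below_lod = 0.5)
names(res$remainframe)
names(res$removeframe)
```

With the toy data, `pcb_b` has three of four values below LOD and so ends up in `removeframe`, while the other two columns stay in `remainframe`.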
Regardless of whether a variable is in `remainframe` or `removeframe`, if a value is below LOD:

When `do_normalize=T`: the new value will be `LOD/2/mean(total_lipid,na.rm=T)`

When `do_normalize=F`: the new value will be `LOD/2`

Regardless of whether a variable is in `remainframe` or `removeframe`, if a value is above LOD:

When `do_normalize=T`: the new value will be `value/the corresponding total lipid value`

When `do_normalize=F`: the new value will be the same value

Input data file is: `cleaneddata_tobematched.rda`

The scripts are (run the function first): `lipid_normalizer2.R`, then `normalize_OC.R`

The output is `cleaneddata_tobematched.rda`, which additionally contains:

|dataframe|Nu.obs|Nu.vars|Additional comments|
|--------|--------|--------|--------|
|OC_881_cleaned_normalized|775|24|above LOD: lipid-normalized OC; below LOD: substituted with half of LOD|
|PFAS_881_cleaned_normalized|776|15|PFAS below LOD substituted with half of LOD|

### 2.4 match_data_new

Input data file is: `cleaneddata_tobematched.rda`

The script is `match_data_new.R`

The output is `matchedatatobeanalyzed.rda`, which contains:

|dataframe for variables|Nu.obs|Nu.vars|Additional comments|
|--------|--------|--------|--------|
|FinalDietD_covariate_metab_use|1408|33|USE, BL and repeated|
|FinalDietD_dietdata_metab_use|1408|299|USE, BL and repeated|
|Diet_baseline_covariate_metab_use|844|32|will not use|
|Diet_baseline_dietdata_metab_use|844|325|will not use|
|Diet_additional_metab_use|1006|60|USE|
|OCvar_metab_use|776|22|will not use, missing variables|
|OC881_metab_use|775|24|USE, baseline and repeated|
|PFAS881_metab_use|776|15|USE, BL and repeated|
|OC881_normalized_metab_use|775|24|USE, BL and repeated|
|PFAS881_normalized_metab_use|776|15|USE, BL and repeated|

|dataframe for metabolomics|Nu.obs|Nu.vars|Additional comments|
|--------|--------|--------|--------|
|metab_FinalDietD_covariate_use|1408|24759|USE|
|metab_FinalDietD_dietdata_use|1408|24759|USE|
|metab_Diet_baseline_covariate_use|844|24759|will not use|
|metab_Diet_baseline_dietdata_use|844|24759|will not use|
|metab_Diet_additional_use|1006|24759|USE|
|metab_OCvar_use|776|24759|will not use|
|metab_OC881_use|775|24759|USE|
|metab_PFAS881_use|776|24759|USE|
|metab_OC881_normalized_use|775|24759|USE|
|metab_PFAS881_normalized_use|776|24759|USE|
|RepeatedMetabolomics_use|748|24672|will not use|
|repINFO_covairate_use|748|25|USE, which pairs together|
|repINFO_dietdata_use|748|131|USE, which pairs together|

### 2.5 preclean_analysis

This step cleans up some variables (dealing with uneasy ...) and separates the variables into baseline and follow-up.

Input data file is: `matchedatatobeanalyzed.rda`

The script is `Analysis_preclean_new.R`

The output is `ready_for_analysis.rda`

|dataframe for variables (baseline)|Nu.obs|Nu.vars|Additional comments|
|--------|--------|--------|--------|
|FinalDietD_covariate_metab_use_BL|1034|33|USE, BL and repeated|
|FinalDietD_dietdata_metab_use_BL|1034|299|USE, BL and repeated|
|Diet_baseline_covariate_metab_use_BL|844|32|will not use|
|Diet_baseline_dietdata_metab_use_BL|844|325|will not use|
|Diet_additional_metab_use_BL|1006|60|USE|
|OC881_metab_use_BL|402|24|USE, baseline and repeated|
|PFAS881_metab_use_BL|402|15|USE, BL and repeated|
|OC881_normalized_metab_use_BL|402|24|USE, BL and repeated|
|PFAS881_normalized_metab_use_BL|402|15|USE, BL and repeated|
|repINFO_covariate_use_BL|374|25|will not use|
|repINFO_dietdata_use_BL|374|131|will not use|

|dataframe for variables (10Y)|Nu.obs|Nu.vars|Additional comments|
|--------|--------|--------|--------|
|FinalDietD_covariate_metab_use_10Y|374|33|USE, BL and repeated|
|FinalDietD_dietdata_metab_use_10Y|374|299|USE, BL and repeated|
|Diet_baseline_covariate_metab_use_10Y|374|32|will not use|
|Diet_baseline_dietdata_metab_use_10Y|374|325|will not use|
|Diet_additional_metab_use_10Y|374|60|USE|
|OC881_metab_use_10Y|374|24|USE, baseline and repeated|
|PFAS881_metab_use_10Y|374|15|USE, BL and repeated|
|OC881_normalized_metab_use_10Y|374|24|USE, BL and repeated|
|PFAS881_normalized_metab_use_10Y|374|15|USE, BL and repeated|
|repINFO_covariate_use_BL|374|25|will not use|
|repINFO_dietdata_use_BL|374|131|will not use|

|dataframe for metabolomics (baseline)|Nu.obs|Nu.vars|Additional comments|
|--------|--------|--------|--------|
|metab_FinalDietD_covariate_use_BL|1034|24759|USE|
|metab_FinalDietD_dietdata_use_BL|1034|24759|USE|
|metab_Diet_baseline_covariate_use_BL|844|24759|will not use|
|metab_Diet_baseline_dietdata_use_BL|844|24759|will not use|
|metab_Diet_additional_use_BL|1006|24759|USE|
|metab_OC881_use_BL|402|24759|USE|
|metab_PFAS881_use_BL|402|24759|USE|
|metab_OC881_normalized_use_BL|402|24759|USE|
|metab_PFAS881_normalized_use_BL|402|24759|USE|
|RepeatedMetabolomics_use_BL|374|24672|will not use|

|dataframe for metabolomics (10Y)|Nu.obs|Nu.vars|Additional comments|
|--------|--------|--------|--------|
|metab_FinalDietD_covariate_use_10Y|374|24759|USE|
|metab_FinalDietD_dietdata_use_10Y|374|24759|USE|
|metab_Diet_baseline_covariate_use_10Y|374|24759|will not use|
|metab_Diet_baseline_dietdata_use_10Y|374|24759|will not use|
|metab_Diet_additional_use_10Y|374|24759|USE|
|metab_OCvar_use_10Y|374|24759|will not use|
|metab_OC881_use_10Y|374|24759|USE|
|metab_PFAS881_use_10Y|374|24759|USE|
|metab_OC881_normalized_use_10Y|374|24759|USE|
|metab_PFAS881_normalized_use_10Y|374|24759|USE|
|RepeatedMetabolomics_use_10Y|374|24672|will not use|

## 3. Analysis

### 3.1 Analysis_1

Input data file is: `ready_for_analysis.rda`

The script is `Analysis_1.R`

The output is `Analysis_1_result.rda`
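
Before running an analysis script, it may help to confirm that the objects stored in an `.rda` file have the dimensions listed in the tables of this document. The sketch below is one possible way to do that and is not part of the project scripts: `check_rda_dims()` and its arguments are hypothetical names, and the demo writes a toy file to a temporary path instead of reading `ready_for_analysis.rda`.

```r
# Hypothetical check (not part of the project scripts): compare the dimensions
# of objects stored in an .rda file with the documented Nu.obs / Nu.vars.
check_rda_dims <- function(rda_path, expected) {
  env <- new.env()
  load(rda_path, envir = env)  # load into a private environment, not the workspace
  rows <- lapply(names(expected), function(nm) {
    found <- c(NA_integer_, NA_integer_)
    if (exists(nm, envir = env, inherits = FALSE)) {
      d <- dim(get(nm, envir = env))
      if (length(d) == 2) found <- d
    }
    data.frame(
      object = nm,
      n_obs  = found[1],
      n_vars = found[2],
      ok     = isTRUE(all(found == expected[[nm]]))  # FALSE if missing or mismatched
    )
  })
  do.call(rbind, rows)
}

# Demo with a toy file; the object holds zeros, only its shape matters here
demo_file <- tempfile(fileext = ".rda")
FinalDietD_covariate_metab_use_10Y <- as.data.frame(matrix(0, nrow = 374, ncol = 33))
save(FinalDietD_covariate_metab_use_10Y, file = demo_file)

expected <- list(
  FinalDietD_covariate_metab_use_10Y = c(374, 33),
  OC881_metab_use_10Y = c(374, 24)  # not saved in the demo file, so reported as not ok
)
report <- check_rda_dims(demo_file, expected)
print(report)
```

Loading into a separate environment keeps the check from overwriting objects that already exist in the current R session.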