---
tags: data
---

Author: Yingxiao Yan

Access data --> Explore data --> Prepare data --> Analyze & report data --> Export results

![image](https://hackmd.io/_uploads/HkHuxZ3gC.png)

# SIMPLER-Uppsala clinical (SIMPLER-UC)

## 1. Original data

All the metadata is on the local computer in C:\Users\yingxiao\Desktop\three cohort project\SMCC

## 1.1. Inclusion of data - qualitative description

| data type | brief description |
| -------- | -------- |
| Metabolomics | RP, RN, metadata |
| Clinical variables | |
| diseases | |
| drugs | |
| dxa | |
| family_history2008 | |
| clinical quest | |
| questionnaire_data | |
| pop pfas | |
| microbiota data | |

## 1.2. Inclusion of data - quantitative description of origin data

| Metabolomics data | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| SMCC_RN_featureCluster_meta_all_2020_JAN.csv | 3676 | 9 | Additional info on clustered features RN mode (cluster_names, mz, int, isogr, iso, charge, adduct, ppm, label) |
| SMCC_RN_cluster_representatives_meta_2020_JAN.csv | 823 | 2 | Info on clustered features RN mode (V1, mz) |
| SMCC_RN_metabolomics_final_PT_meta_clean.csv | 4906 | 8 | Injection meta data RN mode (Barcode, SIMPKEY, Year, Month, Date, Batch, Week, Injection order) |
| **SMCC_RN_metabolomics_final_PT_2020_JAN.csv** | 4906 | 3189 | Final peak table RN mode |
| SMCC_RP_featureCluster_meta_all_2019_DEC.csv | 2340 | 9 | Additional info on clustered features RP mode (cluster_names, mz, int, isogr, iso, charge, adduct, ppm, label) |
| SMCC_RP_cluster_representatives_meta_2019_DEC.csv | 562 | 2 | Info on clustered features RP mode (V1, mz) |
| SMCC_RP_metabolomics_final_PT_meta_v2.csv | 4982 | 9 | Injection meta data RP mode (Barcode, SIMPKEY, Year, Month, Date, Batch, Week, Injection order, superbatch) |
| **SMCC_RP_metabolomics_final_PT_2019_DEC.csv** | 4982 | 1698 | Final peak table RP mode |

#cluster_names: expressed as order of cluster@cluster retention time;
#mz: feature m/z value;
#int: feature relative intensity in cluster;
#isogr: isotope groups;
#iso: isotope identity;
#charge: charge of each feature;
#adduct: annotated adduct;
#ppm: annotation ppm;
#label: annotated fragments;

| other data group 1 | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| clinical_variables.csv | 12808 | 23 | see clinical_history.xlsx for metadata |
| diseases.csv | 2472 | 11 | ?? |
| drugs.csv | 12586 | 21 | ???? |
| dxa.csv | 5022 | 7 | ??? |
| family_history2008.csv | 12819 | 10 | ??? |
| pop.csv | 694 | 26 | OC |
| pfas.csv | 692 | 9 | PFAS |

**#clinicalvariables**
##waist: waist circumference (cm)
##hip: hip circumference (cm)
##sys1: Systolic BP, 1st measure (mmHg)
##diast1: Diastolic BP, 1st measure (mmHg)
##sys2: Systolic BP, 2nd measure (mmHg)
##diast2: Diastolic BP, 2nd measure (mmHg)
##pHDL_Chol: P-HDL_Cholesterol (mmol/L)
##pLDL_Chol: P-LDL_Cholesterol (mmol/L)
##pTrig: P-Triglycerides (mmol/L)
##pALAT: P-ALAT (µkat/L)
##pCRP: P-CRP (mg/L)
##pGlucose: P-Glucose (mmol/L)
##pCreat: P-Creatinine (µmol/L)
##sInsulin: S-Insulin (mE/L)
##INS_mU_l_: insulin (mU/L). Approx 4400? Fresh material, see mitchell 2018 paper
##IGFBP1: Insulin-like growth factor-binding protein 1 (ug/L). Frozen material, see mitchell 2018 paper

**#diseases**
##E11: ICD-10 type 2 diabetes
##I63: ICD-10 stroke
##I21: ICD-10 myocardial infarction
##K50: Crohn's disease
##K51: Ulcerative colitis
##BV: before visit
##AV: after visit

**#drugs**
##A10: DRUGS USED IN DIABETES
##C01: CARDIAC THERAPY: for the treatment of cardiovascular conditions.
No C05 or C06 data. C02, C03, C07, C08 and C09 are for hypertension.
##C02: ANTIHYPERTENSIVES
##C03: DIURETICS
##C04: PERIPHERAL VASODILATORS
##C07: BETA BLOCKING AGENTS
##C08: CALCIUM CHANNEL BLOCKERS
##C09: AGENTS ACTING ON THE RENIN-ANGIOTENSIN SYSTEM
##C10: LIPID MODIFYING AGENTS
##J01: antibacterials for systemic use
##BV: before visit
##AV: after visit

**#dxa (ask Calle)** see mitchell 2018 paper
##fett_total
##lean_total
##fett_android
##fett_gynoid
##height
##weight

**#family_history2008**
##For mother, father, sibling and relatives, DK is don't know
##C_fhx01: breast cancer
##C_fhx02: colon cancer
##C_fhx03: prostate cancer
##C_fhx04: other cancer
##C_fhx05: heart attack before age 60
##C_fhx06: rheumatoid arthritis
##C_fhx07: psoriasis
##C_fhx08: diabetes
##C_fhx09: hypertension

**#OC in pg/L**
##PeCB:
##HCB:
##alfaHCH:
##betaHCH:
##gammaHCH:
##Oxychlordane:
##Transnonachlor:
##ppDDT:
##ppDDE:
##PCB28, PCB52, PCB74, PCB99, PCB101, PCB118, PCB138, PCB153, PCB156, PCB170, PCB180, PCB183, PCB187
##BDE47, BDE99, BDE153

**#PFAS in ug/L**
##PFHpA:
##PFHxS:
##PFOA:
##PFNA:
##PFOS:
##PFDA:
##PFUnDA:
##PFDoA:

| other data group 2 | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| clinical_quest.csv | 12792 | 239 | 231 dietary variables, 5 other variables |
| questionnaire_data.csv | 12819 | 85 | 81 questionnaire variables, 4 other variables |

Cycle 1 = first visit (i.e. SMCC Uppsala 2003-2009), cycle 2 = second visit (i.e. SMCC Uppsala 2015-2023)

**#clinical quest data**
##simpkey: unique
##sex
##age
##plats: Stockholm 107, Uppsala 5320, Västerås 7365
##place: clinical 12401, Home 391
##dateofbirth
##cycle: S, U1A, U1B, U1C, U1D, UH, V
U1A = ffq1997 (462)
U1B = ffq 2004 V1 (4348)
U1C = ffq 2004 V2 (148)
U1D = 2009 lifestyle (58)
UH = sampled at home in Uppsala (304)
S = sampled in Stockholm (107)
V1 = Västerås cycle 1 (7365)
##qyear: which year they answered the questionnaire.

Please refer to the subcohort variable list to check what each food variable means.
The original variables are in the format g_food001, g_bev02, g_alc03.
The derived food variables are in
There are many NAs in the variable list. The NA values need to be removed when running the MUVR model.

**#questionnaire data**
##A stands for 1987
##B stands for 1997
##C stands for 2008 health
##D stands for 2009 lifestyle
##E stands for 2019 health
##F stands for 2019 lifestyle
##simpkey: unique
##birthdate (1914-1952)
##deathdate (2005-2023)
##visitdate (2003-2019)

| Variables | Type (should be) | Explanation |
| -------- | -------- | -------- |
| A_age | numeric | Age when answering questionnaire (Q87=SMC Baseline, Q97=COSM Baseline) |
| B_age | numeric | Age when answering questionnaire (Q87=SMC Baseline, Q97=COSM Baseline) |
| C_age | numeric | Age when answering questionnaire (Q87=SMC Baseline, Q97=COSM Baseline) |
| D_age | numeric | Age when answering questionnaire (Q87=SMC Baseline, Q97=COSM Baseline) |
| E_age | numeric | Age when answering questionnaire (Q87=SMC Baseline, Q97=COSM Baseline) |
| C_health | factor | How is your health? 1=Very good, 2=Good, 3=Neither good nor bad, 4=Bad, 5=Very bad |
| E_health | factor | How is your health? 1=Very good, 2=Good, 3=Neither good nor bad, 4=Bad, 5=Very bad |
| B_der_height | numeric | B_height2: Height, or if 1997 value is missing the 1987 value will be given |
| B_der_weight | numeric | B_weight1: Weight 1997, or if 1997 value is missing the 1987 value will be given |
| B_der_bmi | numeric | Weight/(Height)^2: BMI |
| C_bp1 | factor | If no BP medication: Has your blood pressure been measured in the past 3 years? 1=No, 2=Yes |
| C_bp2 | factor | If BP measured - what result? 1=Too low, 2=Normal, 3=Slightly elevated, 4=Markedly elevated |
| C_waist | numeric | Waist (cm) |
| C_hip | numeric | Hip (cm) |
| A_eat01 | factor | Type of diet. 1=Omnivorous, 2=Only lactovegetarian (no meat, fish or egg), 3=Mostly lactovegetarian, sometimes eats fish and eggs, 4=Vegan, 5=Other |
| D_eat02 | factor | What is your main type of diet? 1=Mixed, 2=Vegetarian, 3=Vegan |
| A_educ1 | factor | Education, 1, 2, 3=Primary school (≤ 9 years), 4=High school (10-12 years), 5=College/University (≥12 years), 6=Other |
| B_der_educ1 | factor | B_educ2-B_educ7: Highest education level 1=Primary school <=9 years, 2=High school 10-12 years, 3=University >12 years |
| B_smok01_u | factor | Ever smoked cigarettes regularly 1=Yes, 0=No |
| B_smok03_n | factor | Still smoking, 1=Yes, 0=No |
| B_der_smok01 | factor | B_smok01: Ever smoker 1=Yes, 0=No |
| B_der_smok03 | factor | B_smok03: Smoking status 1=Yes, 0=No |
| D_smok01_u | factor | Ever smoked cigarettes regularly. 1=No, 2=Yes, currently, 3=Yes, but I stopped |
| B_der_smok_u | factor | B_smok01+B_smok03: Ever smoker/status 1=Current, 2=Ex, 3=Never |
| B_der_alc09_u | factor | ???? It is B_der_alc14_u in my variable list. B_alc14_u+B_alc15_u: Ever drinker 0=Never, 1=Ex, 2=Current |
| B_der_act25 | numeric | Current total activity score (MET*hours/d) |
| D_act20 | factor | Exercise, this year 1=Almost never, 2=<1 hour/week, 3=1 hour/week, 4=2-3 hours/week, 5=4-5 hours/week, 6=>5 hours/week |
| B_med06_u | factor | Ever used cortisone tablets, 1=No, 2=Yes |
| C_med08_u | factor | Ever used cortisone tablets 1=No, 2=Yes |
| C_med09_x | factor | Diabetes, treatment. 1=Insulin, 2=Tablets, 3=Dietary advice, 9=More than one |
| C_med010_u | factor | Have you used antibiotics during the past 10 years? 1=No, 2=Yes |
| C_fhx05_no | factor | Relatives diagnosed with heart attack before age 60: no. 1=No |
| C_fhx05_fm | factor | Heart attack before age 60: mother. 1=Yes |
| C_fhx05_ff | factor | Heart attack before age 60: father. 1=Yes |
| C_fhx05_fs | factor | Heart attack before age 60: sibling. 1=Yes |
| C_fhx05_dk | factor | Relatives diagnosed with heart attack before age 60: don't know. 1=Don't know |
| C_fhx08_no | factor | Relatives diagnosed with diabetes: no. 1=No |
| C_fhx08_fm | factor | Diabetes: mother. 1=Yes |
| C_fhx08_ff | factor | Diabetes: father. 1=Yes |
| C_fhx08_fs | factor | Diabetes: sibling. 1=Yes |
| C_fhx08_dk | factor | Relatives diagnosed with diabetes: don't know. 1=Don't know |
| C_fhx09_no | factor | Relatives diagnosed with hypertension: no. 1=No |
| C_fhx09_fm | factor | Hypertension: mother. 1=Yes |
| C_fhx09_ff | factor | Hypertension: father. 1=Yes |
| C_fhx09_fs | factor | Hypertension: sibling. 1=Yes |
| C_fhx09_dk | factor | Relatives diagnosed with hypertension: don't know. 1=Don't know |
| B_der_diag02_x | factor | B_diag02_x+B_diag02_y: Diag high blood pressure 1=Yes, 0=No |
| B_der_diag03_x | factor | B_diag03_x+B_diag03_y: Diag high cholesterol 1=Yes, 0=No |
| B_der_diag06_x | factor | B_diag06_x+B_diag06_y: Diag heart attack 1=Yes, 0=No |
| B_der_diag07_x | factor | B_diag07_x+B_diag07_y: Diag stroke 1=Yes, 0=No |
| B_der_diag08_x | factor | B_diag08_x+B_diag08_y: Diag diabetes 1=Yes, 0=No |
| B_diag02_x | factor | Ever diagnosed with: Hypertension 1=Yes, 0=No |
| B_diag02_a | numeric | Hypertension: at what age |
| B_diag02_y1 | numeric | Hypertension: what year |
| B_diag03_x | factor | Ever diagnosed with: High cholesterol 1=Yes, 0=No |
| B_diag03_a | numeric | High cholesterol: at what age |
| B_diag03_y1 | numeric | High cholesterol: what year |
| B_diag05_x | factor | Ever diagnosed with: Angina pectoris 1=Yes, 0=No |
| B_diag05_a | numeric | Angina pectoris: at what age |
| B_diag05_y1 | numeric | Angina pectoris: what year |
| B_diag06_x | factor | Ever diagnosed with: Heart attack 1=Yes, 0=No |
| B_diag06_a | numeric | Heart attack: at what age |
| B_diag06_y1 | numeric | Heart attack: what year |
| B_diag07_x | factor | Ever diagnosed with: Stroke 1=Yes, 0=No |
| B_diag07_a | numeric | Stroke: at what age |
| B_diag07_y1 | numeric | Stroke: what year |
| B_diag08_x | factor | Ever diagnosed with: Diabetes 1=Yes, 0=No |
| B_diag08_a | numeric | Diabetes: at what age |
| B_diag08_y1 | numeric | Diabetes: what year |
| C_diag02_x | factor | Ever diagnosed with: Hypertension, 1=Yes, 0=No |
| C_diag03_x | factor | Ever diagnosed with: High cholesterol, 1=Yes, 0=No |
| C_diag05_x | factor | Ever diagnosed with: Angina pectoris, 1=Yes, 0=No |
| C_diag16_x | factor | Heart failure 1=Yes |
| C_diag08_x | factor | Ever diagnosed with: Diabetes, 1=Yes, 0=No |
| C_diag08_a | numeric | Diabetes: at what age |
| E_diag02_x | factor | Ever diagnosed with: Hypertension 1=Yes, 0=No |
| E_diag03_x | factor | Ever diagnosed with: High cholesterol 1=Yes, 0=No |
| E_diag05_x | factor | Ever diagnosed with: Angina pectoris 1=Yes, 0=No |
| E_diag16_x | factor | Heart failure 1=Yes |
| E_diag08_x | factor | Ever diagnosed with: Diabetes 1=Yes, 0=No |
| E_diag08_a | numeric | Diabetes: at what age |

| other data group 3 - microbiota | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| simpler_metagenomics_metaphlanalpha_diversity_v2.0.tsv | 12792 | 239 | 231 dietary variables, 8 other variables |
| simpler_metagenomics_metaphlan_bray_curtis_dissimilarities_v2.0.tsv | 12819 | 85 | 81 questionnaire variables, 4 other variables |
| simpler_metagenomics_metaphlan_bray_curtis_dissimilarities_v2.0.tsv | 12792 | 239 | 231 dietary variables, 8 other variables |
| technical_variables.tsv | 12819 | 85 | 81 questionnaire variables, 4 other variables |

## 2. round 1 - R document work flow

The metabolomics data is in: `/castor/project/proj/Omicsdataleverans/metabolomics_chalmers/to_simp2023018/`

The other data group 1 is in `/castor/project/proj/Dataleverans/`

All the scripts are in `/castor/project/proj/Yan_threecohort/`

workflow: get_rawdata.R --> clean_data.R --> normalize_OC.R --> microbiota_explore, foodvariable_explore.R --> match_data.R --> confounder_imputation_VC.R --> confounder_build_VC.R --> preclean_analysis_VC.R --> outcome_r_f_s_VC.R --> exposure_outcome_r_f_s_VC.R --> after_O_E_features_selected_VC.R

All the intermediate and final processing results are in: `/castor/project/proj/Yan_threecohort/data_preprocessing_results/`

### 2.1 get_rawdata

Input data file is:

```
SMCC_RN_featureCluster_meta_all_2020_JAN.csv
SMCC_RN_cluster_representatives_meta_2020_JAN.csv
SMCC_RN_metabolomics_final_PT_meta_clean.csv
SMCC_RN_metabolomics_final_PT_2020_JAN.csv
SMCC_RP_featureCluster_meta_all_2019_DEC.csv
SMCC_RP_cluster_representatives_meta_2019_DEC.csv
SMCC_RP_metabolomics_final_PT_meta_v2.csv
SMCC_RP_metabolomics_final_PT_2019_DEC.csv
clinical_variables.csv
diseases.csv
drugs.csv
dxa.csv
family_history2008.csv
pop.csv
pfas.csv
clinical_quest.csv
questionnaire_data.csv
```

The script is `get_rawdata.R`

The output is `rawdata_tobecleaned.rda` which contains

| dataframe | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| clinical_variables | 12808 | 23 | |
| diseases | 2317 | 7 | |
| drugs | 11020 | 19 | |
| dxa | 5022 | 7 | |
| family_history2008 | 12819 | 10 | |
| pop | 694 | 26 | |
| pfas | 692 | 9 | |
| clinical_quest | 12792 | 239 | |
| questionnaire_data | 12815 | 85 | |
| SMCC_RN_featureCluster_meta_all_2020_JAN | 3679 | 9 | |
| SMCC_RN_cluster_representatives_meta_2020_JAN | 823 | 2 | |
| SMCC_RN_metabolomics_final_PT_meta_clean | 4906 | 8 | |
| **SMCC_RN_metabolomics_final_PT_2020_JAN** | 4906 | 3189 | |
| SMCC_RP_featureCluster_meta_all_2019_DEC | 2340 | 9 | |
| SMCC_RP_cluster_representatives_meta_2019_DEC | 562 | 2 | |
| SMCC_RP_metabolomics_final_PT_meta_v2 | 4982 | 9 | |
| **SMCC_RP_metabolomics_final_PT_2019_DEC** | 4982 | 1698 | |

### 2.2 clean_data

Input data file is: `rawdata_tobecleaned.rda`

The script is `clean_data.R`

The output is `cleanedata_tobematched.rda` which contains

| dataframe | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| clinical_variables | 12808 | 25 | |
| diseases | 2317 | 16 | |
| drugs | 11020 | 19 | |
| dxa | 5022 | 7 | |
| family_history2008 | 12819 | 10 | |
| pop | 694 | 26 | |
| pfas | 692 | 9 | |
| clinical_quest | 12792 | 239 | 231 food variables and 13 other variables |
| questionnaire_data | 12815 | 85 | 81 covariate variables, 4 other variables |
| original_dietvar | 12792 | 186 | original variables |
| derived_dietvar | 12792 | 61 | derived variables |
| questionnaire_data_whole | 12819 | 73 | selected covariate variables |
| questionnaire_data_lean | 12819 | 43 | redundant variables removed or integrated |
| SMCC_RN_featureCluster_meta_all_2020_JAN.csv | 3679 | 9 | |
| SMCC_RN_cluster_representatives_meta_2020_JAN.csv | 823 | 2 | |
| SMCC_RN_metabolomics_final_PT_meta_clean.csv | 4906 | 8 | |
| **SMCC_RN_metabolomics_final_PT_2020_JAN.csv** | 4906 | 3189 | |
| SMCC_RP_featureCluster_meta_all_2019_DEC.csv | 2340 | 9 | |
| SMCC_RP_cluster_representatives_meta_2019_DEC.csv | 562 | 2 | |
| SMCC_RP_metabolomics_final_PT_meta_v2.csv | 4982 | 9 | |
| **SMCC_RP_metabolomics_final_PT_2019_DEC.csv** | 4982 | 1698 | |

| list | Nu.items | item_type | Additional comment |
| -------- | -------- | -------- | -------- |
| clinical_variables_check | 3 | mix | |
| diseases_check | 3 | mix | |
| drugs_check | 3 | mix | |
| dxa_check | 3 | mix | |
| family_history2008_check | 3 | mix | |
| pop_check | 3 | mix | |
| pfas_check | 3 | mix | |
| clinical_quest_check | 3 | mix | |
| original_dietvar_check | 3 | mix | |
| derived_dietvar_check | 3 | mix | |
| questionnaire_check | 3 | mix | |
| questionnaire_whole_check | 3 | mix | |
| questionnaire_lean_check | 3 | mix | |

## 2.3 pop microbiota and food variables explore

### 2.3.1 normalize_OC (mandatory)

The users need to choose whether or not to lipid-normalize the pollutant data.

![image](https://hackmd.io/_uploads/HJMQKcxDa.png)

![image](https://hackmd.io/_uploads/SJr4tcew6.png)

The `lipid_normalizer3()` function does not substitute missing values, but does the following to obtain lipid-normalized OC:

It first separates the variables by the percentage of their values that are below the LOD, generating `remainframe` (variables below a certain percentage) and `removeframe` (variables above that percentage).

Regardless of whether a variable is in `remainframe` or `removeframe`, if a value is below the LOD:
When `do_normalize=T`: the new value is `LOD/2/mean(total_lipid,na.rm=T)`
When `do_normalize=F`: the new value is `LOD/2`

Regardless of whether a variable is in `remainframe` or `removeframe`, if a value is above the LOD:
When `do_normalize=T`: the new value is `value/the corresponding total lipid value`
When `do_normalize=F`: the value is kept unchanged

The argument `checkafter=T` replaces `value/the corresponding total lipid value` with `LOD/2/mean(total_lipid,na.rm=T)` if the former is smaller than the latter.
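
The per-value rule described above can be sketched in R as follows. This is a minimal illustration only: `normalize_values()` and the example numbers are hypothetical, it is not the actual `lipid_normalizer3()` code, it does not build `remainframe` or `removeframe`, and it assumes that "below LOD" means strictly smaller than the LOD.

```r
# Hypothetical sketch of the per-value rule; NOT the author's lipid_normalizer3().
normalize_values <- function(value, lod, total_lipid,
                             do_normalize = TRUE, checkafter = FALSE) {
  # Assumption: "below LOD" means strictly smaller than the LOD.
  below <- !is.na(value) & value < lod

  # Replacement used for values below the LOD.
  floor_value <- if (do_normalize) {
    lod / 2 / mean(total_lipid, na.rm = TRUE)
  } else {
    lod / 2
  }

  # Values at or above the LOD: divide by the matching total lipid, or keep as is.
  out <- if (do_normalize) value / total_lipid else value
  out[below] <- floor_value

  # Optional check: lift normalized values that fall under the replacement value.
  if (do_normalize && checkafter) {
    too_small <- !is.na(out) & out < floor_value
    out[too_small] <- floor_value
  }

  out  # missing inputs stay NA; nothing is imputed
}

# Made-up example values for one pollutant and the matching total lipid.
pcb153      <- c(0.2, 5, NA, 1.2)
total_lipid <- c(4, 2, 3, 90)

normalize_values(pcb153, lod = 1, total_lipid = total_lipid)
normalize_values(pcb153, lod = 1, total_lipid = total_lipid, checkafter = TRUE)
normalize_values(pcb153, lod = 1, total_lipid = total_lipid, do_normalize = FALSE)
```

In this example the last observation is above the LOD but has a very high total lipid value, so its normalized value only changes when `checkafter = TRUE`. The missing observation stays `NA` in all three calls.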

Input data file is: `cleanedata_tobematched.rda`

The script is (run the function first) `lipid_normalizer3.R` `normalize_OC.R`

The output is `cleanedata_tobematched.rda` which additionally contains:

| dataframe | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| pop_normalized | 694 | 26 | OC above LOD lipid-normalized, below LOD substituted by half of LOD |
| pfas_normalized | 692 | 9 | PFAS below LOD substituted by half of LOD |

### 2.3.2 microbiota_explore

This step first removes near-zero-variance variables, then excludes variables with zero prevalence > 2/3.

Input data file is:
`simpler_metagenomics_metaphlanalpha_diversity_v2.0.tsv`
`simpler_metagenomics_metaphlan_bray_curtis_dissimilarities_v2.0.tsv`
`simpler_metagenomics_metaphlan_bray_curtis_dissimilarities_v2.0.tsv`
`technical_variables.tsv`

| dataframe | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| alpha diversity.tsv | 6150 | 3 | |
| beta_diversity.tsv | 6150 | 6151 | |
| metaphlan_abundance.tsv | 6150 | 7852 | |
| technical_variables.tsv | 6150 | 23 | |

The script is `microbiota_explore.R`

The output is `micro_precessed.rda`

| dataframe | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| micro_variables_summarize | 12792 | 239 | |
| micro_all | 6150 | 7885 | |
| micro_all_nonzv | 6150 | 1753 | |
| micro_lessmissing | 6150 | 976 | including simpkey, alpha diversity variables |

| list | Nu.items | item_type | Additional comment |
| -------- | -------- | -------- | -------- |
| micro_all_check | 3 | mix | all microbiota variables |
| micro_all_nonzv_check | 3 | mix | all microbiota variables that do not have near zero variance |
| micro_abundance_list | 12819 | 85 | k__, p__, c__, o__, f__, g__, s__, t__ by group |

### 2.3.3 foodvariable_explore

This step first removes near-zero-variance variables, then excludes variables with NA values > 2/3.

Input data file is: `cleanedata_tobematched.rda`

The script is `foodvariable_explore.R`

The output is `food_variable_result.rda`

| dataframe | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| original_dietvar | 12792 | 143 | 8 of them are other variables |
| original_dietvar_nonzv | 12792 | 175 | 8 of them are other variables |
| original_dietvar_all | 12792 | 191 | 8 of them are other variables |
| derived_dietvar | 12792 | 56 | |
| derived_dietvar_nonzv | 12792 | 56 | |
| derived_dietvar_all | 12792 | 56 | |

| list | Nu.items | item_type | Additional comment |
| -------- | -------- | -------- | -------- |
| original_dietvar_all_check | 3 | mix | all original diet_var without near zero variances |
| derived_dietvar_all_check | 3 | mix | all derived diet_var without near zero variances |
| original_dietvar_nonzv_check | 3 | mix | all original variables |
| derived_dietva_nonzv_check | 3 | mix | all derived dietary variables |
| questionnaire_data_whole_check | 3 | mix | questionnaire data check |

### 2.4 match_data

Input data file is: `cleanedata_tobematched.rda` `micro_precessed.rda` `food_variable_result.rda`

The script is `match_data.R`

The output is `matcheddata_tobeanalyzed.rda`

| dataframe for variables | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| clinical_variables_match_metabolomics | 4898 | 25 | |
| diseases_match_metabolomics | 654 | 16 | |
| drugs_match_metabolomics | 4160 | 19 | |
| dxa_match_metabolomics | 4888 | 7 | |
| family_history2008_match_metabolomics | 4898 | 10 | |
| pop_match_metabolomics | 681 | 26 | |
| pfas_match_metabolomics | 679 | 9 | |
| pop_normalized_match_metabolomics | 681 | 26 | |
| pfas_normalized_match_metabolomics | 679 | 9 | |
| original_dietvar_match_metabolomics | 4882 | 186 | |
| derived_dietvar_match_metabolomics | 4882 | 61 | |
| questionnaire_data_whole_match_metabolomics | 4898 | 73 | |
| questionnaire_data_lean_match_metabolomics | 4898 | 43 | |

| dataframe for metabolomics | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| metabolomics | 4898 | 1698 | |
| metabolomics_match_clinical_variables | 4898 | 1698 | |
| metabolomics_match_diseases | 654 | 1698 | |
| metabolomics_match_drugs | 4160 | 1698 | |
| metabolomics_match_dxa | 4888 | 1698 | |
| metabolomics_match_family_history2008 | 4898 | 1698 | |
| metabolomics_match_pop | 681 | 1698 | |
| metabolomics_match_pfas | 679 | 1698 | |
| metabolomics_match_pop_normalized | 681 | 1698 | |
| metabolomics_match_pfas_normalized | 679 | 1698 | |
| metabolomics_match_derived_dietvar | 4882 | 1698 | |
| metabolomics_match_original_dietvar | 4882 | 1698 | |
| metabolomics_match_questionnaire_data_whole | 4898 | 1698 | |
| metabolomics_match_questionnaire_data_lean | 4898 | 1698 | |

### 2.5 impute confounders

Input data file is: `cleanedata_tobematched.rda` `matcheddata_tobeanalyzed_TessaID.rda`

The script is `confounder_imputation.R`

The output is `confounder_raw_nomiss_imputed.rda`

| dataframe | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| confouder_raw | 12781 | 9 | Only rows with matched simpkeys for all confounder variables are used |
| confounder_raw_extend | 12819 | 9 | Includes rows that do not have matched simpkeys for some confounders, filled with NA |
| confounder_nomiss | 10260 | 9 | Rows with any missing values removed |
| confounder_imputed | | | The imputed result from confounder_raw_extend |

Input data file is: `confounder_raw_nomiss_imputed.rda`

The script is `confounder_build_TessaID.R`

The output is `confounders_frame_TessaID.rda`

| dataframe | Nu.obs | Nu.vars | Additional comments |
| -------- | -------- | -------- | -------- |
| confouder_MI_TessaID | | | confounder with NA |
| confouder_imputed_MI_TessaID | | | imputed result |
| confouder_stroke_TessaID | | | confounder with NA |
| confouder_imputed_stroke_TessaID | | | imputed result |

### 2.6 preclean_analysis

This step cleans up some variables (including ones that are awkward to handle) and separates variables into baseline and follow-up.

Input data file is: `matcheddata_tobeanalyzed.rda` `confounder_raw_nomiss_imputed.rda`

The script is `preclean_analysis.R`

The output is `ready_for_analysis.rda`

Currently, the script runs MUVR models to select variables that are associated with T2D, MI and stroke.
Only individuals without CVD history are used for the MUVR models for MI and stroke. Only individuals without T2D history are used for the MUVR models for T2D.

The output contains MUVR_RF_mod, MUVR_PLS_mod, MUVR_EN_mod and MUVR_EN_adj_mod (adjusting for age). Each item is a list of three (T2D, MI, stroke). Each sublist is a list of 5 MUVR model results, each obtained by subsampling the controls to the same number as cases.

### 2.7 Outcome_r_f_s_

Input data file is from 2.6: `ready_for_analysis.rda`

The script is `outcome_r_f_s_.R`

It checks BER and nVar, and which variables are selected in 3 out of 5 runs with different subsampling.

The output is `outcome_related_features_result.rda`

![image](https://hackmd.io/_uploads/SyBppqSAa.png)

The results intended for use could be `T2D_variable_RF_mid`, `MI_variable_RF_mid`, `stroke_variable_RF_mid`.

### 2.8 Exposure_Outcome_r_f_s_

Input data file is: `outcome_related_features_result.rda` `matcheddata_tobeanalyzed.rda`

The script is `exposure_outcome_r_f_s.R`

The output is `exposure_outcome_related_features_result.rda`

### 2.9 after_E_O_features_selected

Input data file is: `exposure_outcome_related_features_result.rda` `matcheddata_tobeanalyzed.rda`

The script is `after_E_O_features_selected.R`

The output is:

## 3. SIMPLER-VC data cleaning

### 3.1 get_rawdata_VC

### 3.2 clean_data_VC

### 3.3 normalized_OCmatch_data_VC

### 3.4 microbiota_explore

### 3.4.1 food_variable_explore

### 3.5 match_data

### 3.6 preclean_Analysis

### 3.7 Outcome_r_f_s_

## 4. match omics

### 4.1 match selected outcome-related metabolomics features in simpler-UC with simpler-VC

Check for overlap and get selected features

### 4.2 linear models to check SIMPLER-UC features in SIMPLER-VC, and vice versa

### 4.3 same as Exposure_Outcome_r_f_s_ in SIMPLER-UC

### 4.4 match selected outcome-related metabolomics with simpler-VC with other cohort

Check for overlap and get selected

### 4.5 linear models to check SIMPLER-UC features in other cohorts