# GLACIAL experiments on curated PPMI data
By Arya Anantula
## Experiments
### Experiment 1 - BMI, bjlot, hvlt_discrimination, hvlt_immediaterecall, hvlt_retention, upsit, lexical
* seed = 999 (seed for random number generation)
* learning rate = 3e-4 (used by Adam optimizer)
* steps = 2000 (number of epochs)
* reps = 4 (how many times cross-validation is repeated)
* num of rows in data = 8435
* num of unique subjects = 1162
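As a rough illustration of how these hyperparameters would be wired up (this is not GLACIAL's actual entry point, whose argument names may differ; only the values come from the notes above, everything else is a placeholder):

```python
import numpy as np
import torch

SEED = 999            # seed for random number generation
LEARNING_RATE = 3e-4  # learning rate used by the Adam optimizer
STEPS = 2000          # number of training epochs
REPS = 4              # number of cross-validation repeats

# Fix random number generation for reproducibility
np.random.seed(SEED)
torch.manual_seed(SEED)

# Stand-in model: GLACIAL's actual network is defined in its codebase;
# this only shows how the learning rate feeds the Adam optimizer.
model = torch.nn.Linear(15, 15)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```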
**S1 Graph:**

**S2 Graph: (after orienting bidirectional edges)**

**S3 Graph: (after removing indirect edges)**

**Loss Graph:**

### Experiment 2 - HVLTRDLY, lns, SDMTOTAL, VLTANIM, MSEADLG, HVLTFPRL, HVLTREC
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 3 - ess, rem, gds, stai, quip, hy, NHY
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 4 - scopa, pigd, NP1COG, NP1DPRS, NP1ANXS, NP1APAT, NP1FATG
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 5 - abeta, tau, ptau, asyn, urate, nfl_serum, LEDD
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 6 - NP1HALL, NP1DDS, MODBNT, TMT_A, TMT_B, total_di_18_1_BMP, total_di_22_6_BMP
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 7 - nfl_csf, lowput_ratio, DATSCAN_CAUDATE_L, DATSCAN_CAUDATE_R, con_caudate, ips_caudate, mean_caudate
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 8 - DATSCAN_PUTAMEN_L, DATSCAN_PUTAMEN_R, con_putamen, ips_putamen, mean_putamen, con_striatum, ips_striatum, mean_striatum
*Note: this experiment's batch has 16 variables (the 8 shared scores plus the 8 listed above) instead of the usual 15.*
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

## Feature selection
I started by looking up each variable/feature name in the PPMI Data Dictionary and Code List, which explains what each variable measures and what its values mean; [the PPMI Overview Hack](https://hackmd.io/XG3ITA8iR2S8A2M-0O6GfA) was also helpful for understanding the variables. The first variables/features in the data are categorical (sex, race, ethnicity, handedness, etc.), and we decided to focus on numerical features rather than categorical ones when running experiments. Below are the numerical features we considered, grouped by how complete and varied their values are (a short sketch after these lists shows how the groupings can be checked programmatically):
**Numerical features:**
- BMI
- bjlot
- hvlt_discrimination
- hvlt_immediaterecall
- hvlt_retention
- HVLTRDLY
- lns
- SDMTOTAL
- VLTANIM
- MSEADLG
- ess
- rem
- gds
- stai
- scopa
- pigd
- abeta
- tau
- ptau
- asyn
- urate
- nfl_serum
**Numerical features with few values:**
- HVLTFPRL
- HVLTREC
- quip
- hy
- NHY
- NP1COG
- NP1DPRS
- NP1ANXS
- NP1APAT
- NP1FATG
**Numerical features with many of one value:**
- LEDD (many zero values)
- NP1HALL (mostly zeros)
- NP1DDS (mostly zeros)
**Numerical features with many missing values:**
- upsit
- lexical
- MODBNT
- TMT_A
- TMT_B
- total_di_18_1_BMP
- total_di_22_6_BMP
- nfl_csf
- lowput_ratio (half missing)
- DATSCAN_CAUDATE_L (half missing)
- DATSCAN_CAUDATE_R (half missing)
- con_caudate (half missing)
- ips_caudate (half missing)
- mean_caudate (half missing)
- DATSCAN_PUTAMEN_L (half missing)
- DATSCAN_PUTAMEN_R (half missing)
- con_putamen (half missing)
- ips_putamen (half missing)
- mean_putamen (half missing)
- con_striatum (half missing)
- ips_striatum (half missing)
- mean_striatum (half missing)
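As a rough check on these groupings (this sketch is illustrative and not part of the original analysis; the choice of example features and the use of pandas here are assumptions), the missing-value fraction, number of distinct values, and zero fraction per feature can be computed directly from the curated CSV:

```python
import pandas as pd

# Load the curated PPMI cut (same file used in the formatting script below)
df = pd.read_csv('data/PPMI_Curated_Data_Cut_Public_20230612_rev.csv')

# One example feature from each grouping above
features = ['BMI', 'HVLTREC', 'LEDD', 'mean_putamen']

for feat in features:
    col = pd.to_numeric(df[feat], errors='coerce')  # '.' placeholders become NaN
    missing_frac = col.isna().mean()                # fraction of missing values
    n_unique = col.nunique()                        # number of distinct observed values
    zero_frac = (col == 0).mean()                   # fraction of zeros over all rows
    print(f"{feat}: {missing_frac:.0%} missing, {n_unique} unique, {zero_frac:.0%} zeros")
```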
Each experiment includes only a subset of these features: 15 variables per batch. 8 of them are the same in every batch (**updrs1_score, updrs2_score, updrs3_score, updrs3_score_on, updrs4_score, updrs_totscore, updrs_totscore_on, and moca**), and the remaining 7 change between experiments and are drawn from the lists of numerical features above.
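A minimal sketch of this batching scheme, with the rotating groups taken from the experiment headings above (only the first three experiments are shown; the helper function is hypothetical and not part of the GLACIAL codebase):

```python
# 8 scores included in every batch
SHARED = [
    "updrs1_score", "updrs2_score", "updrs3_score", "updrs3_score_on",
    "updrs4_score", "updrs_totscore", "updrs_totscore_on", "moca",
]

# Rotating features per experiment (experiments 4-8 follow the same pattern,
# matching the experiment headings above)
ROTATING = {
    1: ["BMI", "bjlot", "hvlt_discrimination", "hvlt_immediaterecall",
        "hvlt_retention", "upsit", "lexical"],
    2: ["HVLTRDLY", "lns", "SDMTOTAL", "VLTANIM", "MSEADLG", "HVLTFPRL", "HVLTREC"],
    3: ["ess", "rem", "gds", "stai", "quip", "hy", "NHY"],
}

def batch_for(experiment):
    """Return the full variable batch (shared scores + rotating features)."""
    return SHARED + ROTATING[experiment]

print(batch_for(1))  # the 15 column names used for experiment 1
```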
## Formatting data for GLACIAL
I used the following code to convert the curated PPMI data into a formatted CSV that the GLACIAL codebase can read (with the column ordering it expects), along with a JSON file listing the features for the model run:
```python
import pandas as pd
import numpy as np
import json

# Path of the formatted CSV that GLACIAL will read
csv_name = "data/data1.csv"

# Read the curated PPMI data cut into a DataFrame
df = pd.read_csv('data/PPMI_Curated_Data_Cut_Public_20230612_rev.csv')

# Keep only patients with at least min_records visits recorded
min_records = 4
filtered_df = df.groupby('PATNO').filter(lambda x: len(x) >= min_records)

# Sort by patient id, then by visit year
sorted_df = filtered_df.sort_values(by=['PATNO', 'YEAR'], ascending=[True, True])

# 8 scores shared by every batch, plus the 7 rotating variables for this experiment
columns = ["updrs1_score", "updrs2_score", "updrs3_score", "updrs3_score_on",
           "updrs4_score", "updrs_totscore", "updrs_totscore_on", "moca"]
variables = ["BMI", "bjlot", "hvlt_discrimination", "hvlt_immediaterecall",
             "hvlt_retention", "upsit", "lexical"]  # length = 7 (variable batch size = 15)
columns.extend(variables)

# Create a dictionary that matches the desired JSON structure
data = {
    "csv": csv_name,
    "feats": columns,
}

# Append age, subject id, and timepoint, then rename them to the column names GLACIAL expects
final_columns = columns + ["age_at_visit", "PATNO", "YEAR"]
result_df = sorted_df[final_columns].copy()  # copy to avoid SettingWithCopyWarning on rename/replace
result_df.rename(columns={'age_at_visit': 'AGE', 'PATNO': 'RID', 'YEAR': 'Year'}, inplace=True)
result_df.replace('.', np.nan, inplace=True)
result_df.to_csv(csv_name, index=False)

# Write the feature-list dictionary to a JSON file
json_file_name = 'data/data.json'
with open(json_file_name, 'w') as json_file:
    json.dump(data, json_file, indent=4)
```
The resulting CSV file has the column ordering GLACIAL expects: all the selected variables first, then the individual's age (AGE), the individual's id (RID), and the timepoint (Year).
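As a quick sanity check (not part of the original script), the formatted file can be read back to confirm the column order and the renamed AGE/RID/Year columns:

```python
import pandas as pd

check_df = pd.read_csv("data/data1.csv")
print(list(check_df.columns))  # variables first, then AGE, RID, Year
assert list(check_df.columns)[-3:] == ["AGE", "RID", "Year"]
print(check_df.head())
```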