# GLACIAL experiments on curated PPMI data
By Arya Anantula
## Experiments
### Experiment 1 - BMI, bjlot, hvlt_discrimination, hvlt_immediaterecall, hvlt_retention, upsit, lexical
* seed = 999 (seed for random number generation)
* learning rate = 3e-4 (used by Adam optimizer)
* steps = 2000 (number of epochs)
* reps = 4 (how many times cross-validation is repeated)
* num of rows in data = 8435
* num of unique subjects = 1162
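As a rough illustration of how these hyperparameters would be wired up (this is not GLACIAL's actual entry point, whose argument names may differ; only the values come from the notes above, everything else is a placeholder):

```python
import numpy as np
import torch

SEED = 999            # seed for random number generation
LEARNING_RATE = 3e-4  # learning rate used by the Adam optimizer
STEPS = 2000          # number of training epochs
REPS = 4              # number of cross-validation repeats

# Fix random number generation for reproducibility
np.random.seed(SEED)
torch.manual_seed(SEED)

# Stand-in model: GLACIAL's actual network is defined in its codebase;
# this only shows how the learning rate feeds the Adam optimizer.
model = torch.nn.Linear(15, 15)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```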
**S1 Graph:**

**S2 Graph: (after orienting bidirectional edges)**

**S3 Graph: (after removing indirect edges)**

**Loss Graph:**

### Experiment 2 - HVLTRDLY, lns, SDMTOTAL, VLTANIM, MSEADLG, HVLTFPRL, HVLTREC
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 3 - ess, rem, gds, stai, quip, hy, NHY
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 4 - scopa, pigd, NP1COG, NP1DPRS, NP1ANXS, NP1APAT, NP1FATG
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 5 - abeta, tau, ptau, asyn, urate, nfl_serum, LEDD
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 6 - NP1HALL, NP1DDS, MODBNT, TMT_A, TMT_B, total_di_18_1_BMP, total_di_22_6_BMP
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 7 - nfl_csf, lowput_ratio, DATSCAN_CAUDATE_L, DATSCAN_CAUDATE_R, con_caudate, ips_caudate, mean_caudate
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

### Experiment 8 - DATSCAN_PUTAMEN_L, DATSCAN_PUTAMEN_R, con_putamen, ips_putamen, mean_putamen, con_striatum, ips_striatum, mean_striatum
*Note: this experiment's batch has 16 variables (the 8 shared scores plus the 8 listed above) instead of the usual 15.*
* seed = 999
* learning rate = 3e-4
* steps = 2000
* reps = 4
* num of rows in data = 8435
* num of unique subjects = 1162
**S1 Graph:**

**S2 Graph:**

**S3 Graph:**

**Loss Graph:**

## Feature selection
I started by looking up each variable/feature name in the PPMI Data Dictionary and Code List, which explains what each variable measures and what its values mean; [the PPMI Overview Hack](https://hackmd.io/XG3ITA8iR2S8A2M-0O6GfA) was also helpful for understanding the variables. The first variables/features in the data are categorical (sex, race, ethnicity, handedness, etc.), and we decided to focus on numerical features rather than categorical ones when running experiments. Below are the numerical features we considered, grouped by how complete and varied their values are (a short sketch after these lists shows how the groupings can be checked programmatically):
**Numerical features:**
- BMI
- bjlot
- hvlt_discrimination
- hvlt_immediaterecall
- hvlt_retention
- HVLTRDLY
- lns
- SDMTOTAL
- VLTANIM
- MSEADLG
- ess
- rem
- gds
- stai
- scopa
- pigd
- abeta
- tau
- ptau
- asyn
- urate
- nfl_serum
**Numerical features with few values:**
- HVLTFPRL
- HVLTREC
- quip
- hy
- NHY
- NP1COG
- NP1DPRS
- NP1ANXS
- NP1APAT
- NP1FATG
**Numerical features with many of one value:**
- LEDD (many zero values)
- NP1HALL (mostly zeros)
- NP1DDS (mostly zeros)
**Numerical features with many missing values:**
- upsit
- lexical
- MODBNT
- TMT_A
- TMT_B
- total_di_18_1_BMP
- total_di_22_6_BMP
- nfl_csf
- lowput_ratio (half missing)
- DATSCAN_CAUDATE_L (half missing)
- DATSCAN_CAUDATE_R (half missing)
- con_caudate (half missing)
- ips_caudate (half missing)
- mean_caudate (half missing)
- DATSCAN_PUTAMEN_L (half missing)
- DATSCAN_PUTAMEN_R (half missing)
- con_putamen (half missing)
- ips_putamen (half missing)
- mean_putamen (half missing)
- con_striatum (half missing)
- ips_striatum (half missing)
- mean_striatum (half missing)
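As a rough check on these groupings (this sketch is illustrative and not part of the original analysis; the choice of example features and the use of pandas here are assumptions), the missing-value fraction, number of distinct values, and zero fraction per feature can be computed directly from the curated CSV:

```python
import pandas as pd

# Load the curated PPMI cut (same file used in the formatting script below)
df = pd.read_csv('data/PPMI_Curated_Data_Cut_Public_20230612_rev.csv')

# One example feature from each grouping above
features = ['BMI', 'HVLTREC', 'LEDD', 'mean_putamen']

for feat in features:
    col = pd.to_numeric(df[feat], errors='coerce')  # '.' placeholders become NaN
    missing_frac = col.isna().mean()                # fraction of missing values
    n_unique = col.nunique()                        # number of distinct observed values
    zero_frac = (col == 0).mean()                   # fraction of zeros over all rows
    print(f"{feat}: {missing_frac:.0%} missing, {n_unique} unique, {zero_frac:.0%} zeros")
```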
Each experiment includes only a subset of these features: 15 variables per batch. 8 of them are the same in every batch (**updrs1_score, updrs2_score, updrs3_score, updrs3_score_on, updrs4_score, updrs_totscore, updrs_totscore_on, and moca**), and the remaining 7 change between experiments and are drawn from the lists of numerical features above.
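A minimal sketch of this batching scheme, with the rotating groups taken from the experiment headings above (only the first three experiments are shown; the helper function is hypothetical and not part of the GLACIAL codebase):

```python
# 8 scores included in every batch
SHARED = [
    "updrs1_score", "updrs2_score", "updrs3_score", "updrs3_score_on",
    "updrs4_score", "updrs_totscore", "updrs_totscore_on", "moca",
]

# Rotating features per experiment (experiments 4-8 follow the same pattern,
# matching the experiment headings above)
ROTATING = {
    1: ["BMI", "bjlot", "hvlt_discrimination", "hvlt_immediaterecall",
        "hvlt_retention", "upsit", "lexical"],
    2: ["HVLTRDLY", "lns", "SDMTOTAL", "VLTANIM", "MSEADLG", "HVLTFPRL", "HVLTREC"],
    3: ["ess", "rem", "gds", "stai", "quip", "hy", "NHY"],
}

def batch_for(experiment):
    """Return the full variable batch (shared scores + rotating features)."""
    return SHARED + ROTATING[experiment]

print(batch_for(1))  # the 15 column names used for experiment 1
```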
## Formatting data for GLACIAL
I used the following code to convert the curated PPMI data into a formatted CSV that the GLACIAL codebase can read (with the column ordering it expects), along with a JSON file listing the features for the model run:
```python
import pandas as pd
import numpy as np
import json

# Path of the formatted CSV that GLACIAL will read
csv_name = "data/data1.csv"

# Read the curated PPMI data cut into a DataFrame
df = pd.read_csv('data/PPMI_Curated_Data_Cut_Public_20230612_rev.csv')

# Keep only patients with at least min_records visits recorded
min_records = 4
filtered_df = df.groupby('PATNO').filter(lambda x: len(x) >= min_records)

# Sort by patient id, then by visit year
sorted_df = filtered_df.sort_values(by=['PATNO', 'YEAR'], ascending=[True, True])

# 8 scores shared by every batch, plus the 7 rotating variables for this experiment
columns = ["updrs1_score", "updrs2_score", "updrs3_score", "updrs3_score_on",
           "updrs4_score", "updrs_totscore", "updrs_totscore_on", "moca"]
variables = ["BMI", "bjlot", "hvlt_discrimination", "hvlt_immediaterecall",
             "hvlt_retention", "upsit", "lexical"]  # length = 7 (variable batch size = 15)
columns.extend(variables)

# Create a dictionary that matches the desired JSON structure
data = {
    "csv": csv_name,
    "feats": columns,
}

# Append age, subject id, and timepoint, then rename them to the column names GLACIAL expects
final_columns = columns + ["age_at_visit", "PATNO", "YEAR"]
result_df = sorted_df[final_columns].copy()  # copy to avoid SettingWithCopyWarning on rename/replace
result_df.rename(columns={'age_at_visit': 'AGE', 'PATNO': 'RID', 'YEAR': 'Year'}, inplace=True)
result_df.replace('.', np.nan, inplace=True)
result_df.to_csv(csv_name, index=False)

# Write the feature-list dictionary to a JSON file
json_file_name = 'data/data.json'
with open(json_file_name, 'w') as json_file:
    json.dump(data, json_file, indent=4)
```
The resulting CSV file has the column ordering GLACIAL expects: all the selected variables first, then the individual's age (AGE), the individual's id (RID), and the timepoint (Year).
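As a quick sanity check (not part of the original script), the formatted file can be read back to confirm the column order and the renamed AGE/RID/Year columns:

```python
import pandas as pd

check_df = pd.read_csv("data/data1.csv")
print(list(check_df.columns))  # variables first, then AGE, RID, Year
assert list(check_df.columns)[-3:] == ["AGE", "RID", "Year"]
print(check_df.head())
```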