Old vs. new processor

# Old vs. new processor

The processor diff between the old and new versions is as follows.

:::warning
1. We want the inputs to be numeric variables, so avoid string values in the variables during data processing wherever possible.
2. Comments about NaN can be ignored.
3. **The goal is a single dataframe containing (X, y): merge IV, SV, MV, and DV, and remove any variable that depends on y (for example, when y = recurrence, remove early/late recurrence).**
:::

## SV
### delete
pass
### add
pass
### comment
- The `age` variable has a missing value at row index 135.

## MV
### delete
- `df["stop_AED_number"] = raw["停藥前\nAED"]`
### add
- `df["taper_AED_number"] = raw["開始減藥時AED\nNumber"]`
### comment
- The stop AED number can be calculated from each entry `s` of `raw["停藥前\nAED"]` as `(len(s) + 2) // 5`, assuming each entry lists three-letter AED codes joined by a two-character separator.
- The variables `['ZNS', 'ZNS_dosage', 'PER', 'PER_dosage', 'PBB', 'PBB_dosage', 'GPB', 'GPB_dosage', 'PGB', 'PGB_dosage', 'stop_AED_owns']` contain missing values.

## SV
### delete
pass
### add
- `df["etiology"] = raw["etiology\n1.structural\n2.Genetic\n3.Infections\n4.Metabolic\n5.Immune\n6.Unknown"] - 1`
- `df["etiology_known"] = raw["Etiology \nClassification\nunknown = 0\n其他 = 1"]`
### comment
- The variables 'presumed_epileptic_focus_lr' and 'presumed_epileptic_focus_lr' must be vectorized if you want to use them.
- The variable `df["last_time_onset"] = raw["最後一次發作"]` must be vectorized if you want to use it.
- The variable `df["status"] = raw["Status\n 住院"]` contains both "X" and "x", which need to be handled.

## DV
### delete
- `df["last_time_appointment"] = raw["最後回診日期"]`
- `df[f"btEEG_{eeg}"] = (raw["減藥前EEG\n1.routine\n2.3hrs\n3.3D VEEG"] == (idx + 1)) * 1`
- `df[f"stEEG_{eeg}_{d}"] = (raw[f"停藥後EEG\n1.routine 2.3hrs 3.3D VEEG\n{d}"] == (idx + 1)) * 1`
### add
- `df["MRI_findings"] = raw["1.Malformations of cortical development\n2.Vascular\n3.Hippocampal sclerosis\n4.Hypoxic-Ischemic\n5.Traumatic brain injury\n6.Tumors\n7.ADEM/inflammatory brain lesions 8. Other"].apply(lambda x: x-1 if type(x) == int else x)`
### comment
- The EEG types for stop treatment and before taper are not used.
- The variable "stEEG_IIED_1Y-1.5Y" has the same value in every row, so it should be removed.
- The variable "recurrence_days_after_stop" contains many NaN values.
- The variables `['abnormal_MRI', 'abnormal_MRI_epilepsy', 'MRI_findings', 'btEEG_date', 'start_treat_date', 'start_taper_date', 'stop_date', 'stop_age', 'recurrence_date', 'recurrence_EEG_discharge']` contain strings and need to be processed.
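
The third point in the warning, together with the comments about constant and string-valued columns, could be sketched roughly as below. This is only an illustration on toy data, not the project's processor: `build_xy`, the toy values, and the column names `early_recurrence` and `late_recurrence` are hypothetical, and mapping anything other than "X"/"x" in `status` to 0 is an assumption.

```python
import pandas as pd


def build_xy(parts, target, dependent_cols=()):
    """Merge processed blocks column-wise and split them into (X, y).

    Hypothetical helper: `parts` are the already-processed dataframes
    (e.g. IV, SV, MV, DV) sharing the same row index.
    """
    merged = pd.concat(parts, axis=1)
    # Keep only the first occurrence of any duplicated column name.
    merged = merged.loc[:, ~merged.columns.duplicated()]

    y = merged[target]
    # Remove the target and every variable that depends on it.
    X = merged.drop(columns=[target, *dependent_cols], errors="ignore")

    # Drop columns whose value is identical in every row.
    constant = [c for c in X.columns if X[c].nunique(dropna=False) <= 1]
    X = X.drop(columns=constant)

    # Report the columns that still hold non-numeric values.
    non_numeric = X.select_dtypes(exclude="number").columns.tolist()
    return X, y, non_numeric


# Toy stand-ins for the processed blocks (values are made up).
sv = pd.DataFrame({"age": [10, 12, 9], "status": ["X", "x", None]})
# Assumption: "X" and "x" both mean 1, everything else (including missing) means 0.
sv["status"] = sv["status"].str.lower().eq("x").astype(int)

mv = pd.DataFrame({"taper_AED_number": [2, 1, 3]})

dv = pd.DataFrame({
    "recurrence": [1, 0, 1],
    "early_recurrence": [1, 0, 0],   # hypothetical name, depends on y
    "late_recurrence": [0, 0, 1],    # hypothetical name, depends on y
    "stEEG_IIED_1Y-1.5Y": [0, 0, 0], # constant, so it gets dropped
    "stop_date": ["2020-01-01", "2020-02-01", "2020-03-01"],
})

X, y, non_numeric = build_xy(
    [sv, mv, dv],
    target="recurrence",
    dependent_cols=["early_recurrence", "late_recurrence"],
)
print(list(X.columns))
print(non_numeric)
```

Returning `non_numeric` rather than silently dropping those columns leaves the decision of how to encode dates and other strings to whoever uses the helper.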