# 6/5 Meeting Notes

[TOC]

# Refactor the Project's Structure in OneDrive

To version both the data and the code, I've copied the data from ==/output_feature== to ==/preprocessed_data/0530/data== and the corresponding code from ==/code== to ==/preprocessed_data/0530/code==.

# Medical Project

- Update [hackmd](https://hackmd.io/uBNwAOV_Qq-KopmMy-tzUQ?view)
- Data Leakage Problem?
  - `'MENTHLTH'`: Now thinking about your **mental health**, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
  - `'DECIDE'`: Because of a physical, **mental, or emotional condition**, do you have serious difficulty concentrating, remembering, or making decisions?
  - `'POORHLTH'`: During the past 30 days, for about how many days did **poor physical or mental health** keep you from doing your usual activities, such as self-care, work, or recreation?
  - `'DIFFALON'`: Because of a **physical, mental, or emotional condition**, do you have difficulty doing errands alone such as visiting a doctor's office or shopping?
  - `'_MENT14D'`: 3-level **not good mental health** status: 0 days, 1-13 days, 14-30 days
  - Question
    - Do these 5 variables simply give away the answer?
    - Most of them are questions a doctor would ask during diagnosis.
  - Decision
    - The 5 questions can still be kept.
    - It still depends on the usage scenario:
      - If the questionnaire only has 5 questions, we would probably include only two of them.
      - If we want to find factors that doctors cannot find, these 5 questions would probably be left out.

  > Sam?

- The number of samples per target class
  - 2.0: 324035
  - 1.0: 75820
  - 7.0: 1476
  - 9.0: 621
  - 0.0: 6
  - Summary
    - Drop classes 7, 9, and 0.
    - The final output is the probability of having the condition.

## Verify the AI Canvas

- The AI Canvas for the health project has been updated and needs some discussion. (Check OneDrive -> Health project -> AI Canvas -> file)
  - Confirming causal relationships
  - Selecting the top-ranked features by feature importance
  - Thinking from the data problem statement to the solution
  - Other
    - Automated generation workflow

## Experiment Results and Findings

- Log loss interpretation?
  - Ask Adams

## TODOs

- Ask Adams about the detailed features of Decanter AI, such as performance and metrics (especially log loss / misclassification), and how to convince people who have no prior knowledge
- Revise the Integration part of the AI canvas
- Confirm the directions that currently need to be corrected
  - Drop the classes with target value '7', '9', '0'
  - Use sklearn `train_test_split` to split the data in proportion to the target values, then upload the result to OneDrive
  - Run simple experiments with Decanter AI:
    1. Use all features
    2. Remove the dominant features: ['MENTHLTH','_MENT14D','DECIDE','POORHLTH']
    3. Additionally remove the feature that is not dominant but is related to mental health: ['DIFFALON']
  - Summarize the experiment results and report them to everyone
- If there is spare time, tidy up the original data-cleaning code and upload it
- Topics for next time:
  - Model selection: consider iterative approaches similar to gradient boosting
  - Defining and interpreting metrics
  - Building a feature that outputs the probability of having the condition

# House Project

## Verify the AI Canvas (Allen has updated the canvas)

- Value Proposition
- Customer / Target User
- Data
- Skills
- Output
- Integration
- Stakeholder
- Cost
- Revenue

[Question] Pre-sale houses vs. completed houses
- We can do both.
- Train them separately first.
- After selecting the important variables, if the important variables overlap, we can train them together.

[Question] What are our steps for designing the AI canvas?

[Discussion] Usage scenario: should we use only data that the general public can access?
> Cobra?

> Discussion result: we looked at the feature importance of the GBDT trained on clean_data_train and found that it is fairly consistent with the information currently available on house-listing websites, except for the City_Land_Usage (urban land-use zoning) column.

[Discussion] Check the target customer of our project
> Very wealthy people do not care whether the prediction is accurate; what they care about differs from the general public. If our main audience is not luxury-home buyers, can we focus on first estimating the transactions other than luxury homes well?
> [name=elichen]

## Experiment Results and Findings

:::info
Dataset information
- clean_data_test.csv : clean_data_train.csv = 11591 : 1183605 ~= 1 : 100
- clean_data_future_test.csv : clean_data_future_train.csv = 10636 : 203751 ~= 1 : 20
:::

- Eli Chen
  - Only surveyed the future_price data (pre-sale house data) this week
  - [The Jupyter Notebook for cleaning the data](https://github.com/JIElite/Taiwan-house-pricing/blob/main/notebook/EDA_future_price.ipynb)
  - [The Python code for building a baseline model](https://github.com/JIElite/Taiwan-house-pricing/blob/main/train_baseline.py)
  - [The Python code for building a simple model with feature cleaning (Decision Tree)](https://github.com/JIElite/Taiwan-house-pricing/blob/main/train_DST.py)
  - [The Python code for building a simple model with feature cleaning (Random Forest)](https://github.com/JIElite/Taiwan-house-pricing/blob/main/train_RF.py)
  - [The Jupyter Notebook for adversarial validation](https://github.com/JIElite/Taiwan-house-pricing/blob/main/notebook/Adversarial_Validation_on_Future_Price_Data.ipynb)

## Decide the Prediction Target

- Price per case
- Price per Ping
  - Minor issue: on house-price websites, this number is not shown when the property has an add-on structure or a parking space.

## Decide the Metric for Measuring Model Performance

- Eli Chen
  - **MAEs**: computed separately per floor-area (ping) bucket, so there are multiple scores. If we need to combine them into a single score, we can use the distribution of floor areas people buy (from our survey) as weights. That distribution may be bimodal: demand is lower for small floor areas, while large floor areas are favored by wealthy buyers.
  - **MAPE**: gives only one score, expressing the absolute error as a ratio of the actual value. Houses of different floor areas may implicitly carry information about whether they are high-end or not, so dividing by area to predict Price_Per_Ping mixes that information together. That is why I thought of MAPE. WMAPE is also an option, but it involves weights, so it is less intuitive to interpret.

## Define the Acceptable Model for this Project (based on the metric), and Why?

## TODOs

Discuss it and add new cards in Trello.

# Project Scheduling and Management

- The resources for this project
  - Working hours
    - Allen: busy 5/27 ~ 6/12; preparing for the oral defense after 6/12, will make time depending on the situation
    - Sam: busy in early June
    - Cobra: 6/6 - 6/13 busy with final exams
    - Martin: probably the same as usual?
    - Eli: 6/6 - 6/13 busy with interviews; will look into the medical project when free. That week will mainly be discussion and investigation. Replies to messages after 8 pm every day.
    - Wade: ?
- The milestone
  - Prepare the presentation

# An Introduction to MLflow (Optional)

- What kinds of problems can we solve with MLflow?
- How to set up my environment
  - Installation
  - The destination of our MLflow server
- Cautions
  - DO NOT log datasets to the server
  - DO NOT log the performance per training step (it would effectively DDoS the server, and the JavaScript front end cannot render that much data promptly)
- How to use?
  - `set_tracking_uri`
  - `set_experiment`
  - Start a tracking run
    - Set up `run_name`
  - Log the information
  - End the tracking run
- Questions?
- Troubleshooting
  - Ubuntu, macOS -> call Eli
  - Windows -> ?? / call Eli

# Ad Hoc Motions

- Schedule a separate meeting with Adams
  - How do we use Decanter AI?
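    - Decanter AI lets us choose among various metrics to judge a model. Can we specify a particular metric at training time?
    - In general, when judging a model, how should we choose which metric to rely on?
    - Is there a fixed rule for interpreting each metric (e.g. AUC > 0.5 means better than random)?
    - Clients typically know nothing about model metrics. After training, how do you convince people who have no prior knowledge?
  - Ask for help reviewing the AI Canvas
    - Is our current direction correct?
  - Report our current project progress
    - House Project
      - With Total_price and Unit_price as targets, we tested various algorithms on the combined pre-sale and completed-house data. LGBM currently performs best, but the errors on the validation and test sets differ noticeably (1,000,000 vs. 100,000), and we are thinking about how to proceed.
      - We currently use MAEs, MAPE, and R-squared to judge the model and would like to confirm which metric is the better choice.
      - When Adams worked on house-price prediction before, how were the results evaluated?
      - In real use, we cannot ask users to fill in all 100+ fields. Our current approach is to take the variables with the highest feature importance (perhaps the top 20) as X for prediction, but this sacrifices model accuracy. Is there a better approach?
      - Should the final presentation be delivered as an interactive web page?
    - Medical Project
      - We found that ['MENTHLTH','_MENT14D','DECIDE','POORHLTH'] are questions that doctors generally ask. If we include these variables, is there a risk of giving away the answer in advance?
      - After discussion, there are currently a few options:
        - Still feed these variables into the prediction
        - Split into two stages: the first stage uses only ['MENTHLTH','_MENT14D','DECIDE','POORHLTH'] to handle 90% of the clients, and the remaining 10% are predicted with the other variables
      - As with the House Project, we cannot ask clients to answer a questionnaire with this many items. How should we pick the more important questions as the model's input variables?
  - Ask how the final results presentation will be run
    - Will there be a chance to exchange ideas with other people?
    - Are there any other events before the mentor program ends? (We need to know in advance to avoid schedule conflicts.)
- Other models
  - Regression
  - SVM
  - Tree-based
  - LGBM
- Follow-up house-price tests
  - Martin prepares the data and outputs it for everyone
  - Eli uses that data to run each model and uploads the results to MLflow
  - Eli defines the MAEs
  - Eli checks adversarial validation (with the 'Month' column removed)
  - Martin runs the LGBM model

## Presentation Groups

- House: Martin, Allen, Cobra
- Medical: Sam, Eli, 步緯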
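
Note on the House Project metrics discussed above (MAEs per floor-area bucket, a weighted single score, MAPE, WMAPE): the following is only a minimal pure-Python sketch to make the definitions concrete, not the code used in the project. All function names, the bucket edges, the bucket weights, and the sample numbers are hypothetical. The WMAPE here is assumed to be the weighted form `sum(w * |y - y_hat|) / sum(w * |y|)`; other definitions exist.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (assumes no actual value is 0)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def wmape(y_true, y_pred, weights=None):
    """Weighted MAPE: sum(w * |error|) / sum(w * |actual|); weights default to 1."""
    if weights is None:
        weights = [1.0] * len(y_true)
    num = sum(w * abs(t - p) for w, t, p in zip(weights, y_true, y_pred))
    den = sum(w * abs(t) for w, t in zip(weights, y_true))
    return num / den


def bucket_maes(y_true, y_pred, sizes, edges):
    """One MAE per size bucket; bucket index = number of edges the size reaches."""
    grouped = defaultdict(lambda: ([], []))
    for t, p, s in zip(y_true, y_pred, sizes):
        idx = sum(s >= e for e in edges)
        grouped[idx][0].append(t)
        grouped[idx][1].append(p)
    return {idx: mae(ts, ps) for idx, (ts, ps) in sorted(grouped.items())}


def weighted_mae(maes_by_bucket, bucket_weights):
    """Combine per-bucket MAEs into one score using (survey-based) bucket weights."""
    total = sum(bucket_weights[b] for b in maes_by_bucket)
    return sum(maes_by_bucket[b] * bucket_weights[b] for b in maes_by_bucket) / total


# Made-up numbers purely for illustration (not project data).
y_true = [100, 200, 400, 800]   # actual prices
y_pred = [110, 180, 400, 700]   # predicted prices
sizes = [10, 20, 40, 80]        # floor area in ping
edges = [15, 50]                # hypothetical bucket boundaries: <15, 15-50, >=50
weights = {0: 0.2, 1: 0.5, 2: 0.3}  # hypothetical share of buyers per bucket

per_bucket = bucket_maes(y_true, y_pred, sizes, edges)
single_score = weighted_mae(per_bucket, weights)
print(per_bucket, single_score, mape(y_true, y_pred), wmape(y_true, y_pred))
```

The per-bucket MAEs keep small and large properties from being averaged together, while the weighted score gives one number whose meaning depends entirely on the chosen weights, which is the interpretability trade-off raised in the notes.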