---
tags: 110-1
---

# 110-1 Machine Learning

## Course Info
- [Course website](https://www.csie.ntu.edu.tw/~htlin/course/ml21fall/)
- [Course stream](https://www.csie.ntu.edu.tw/~htlin/course/ml21fall/screencast.php)
- [Classmate collaborative note](https://hackmd.io/@-TyNLpH6RM-50upth1_LeQ/r1vwwozVK)

## Final Project
- Approaches
    - Model
        - MLP
        - Support vector machine
        - Logistic regression
    - Classifier
        - Softmax
        - [XGBoost](https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC5-2%E8%AC%9B-kaggle%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92%E7%AB%B6%E8%B3%BD%E7%A5%9E%E5%99%A8xgboost%E4%BB%8B%E7%B4%B9-1c8f55cffcc)
- csv
    - demographics.csv: 770/6163 rows entirely missing
        - gender
            - impute with whichever value is less frequent
        - age
            - impute with the mean of the <30 / >65 groups
        - *under30
        - married
            - impute with whichever value is less frequent
        - senior citizen
        - dependents
            - impute from Number of dependents first
        - Number of dependents
    - location.csv
        - remove
            - country
            - state
        - city
            - drop
        - zip code
            - drop
        - Lat Long
            - *Latitude
                - drop
            - *Longitude
                - drop
    - population.csv
        - Zip Code
            - drop
        - Population
            - impute with the median
    - satisfaction.csv
        - Satisfaction Score
            - missing: 770/6163
    - services.csv
        - Referred a Friend
            - impute from Number of Referrals first
        - Number of Referrals
            - drop after imputation
        - Tenure in Months
            - impute with the median
        - Offer
            - drop
        - Phone Service
            - if subscribed, service count +1; if missing, +0
        - Internet Service
            - if subscribed, service count +1; if missing, +0
        - Online Security
            - if subscribed, service count +1; if missing, +0
        - Online Backup
            - if subscribed, service count +1; if missing, +0
        - Device Protection Plan
            - if subscribed, service count +1; if missing, +0
        - Premium Tech Support
            - if subscribed, service count +1; if missing, +0
        - Streaming TV
            - if subscribed, service count +1; if missing, +0
        - Streaming Movies
            - if subscribed, service count +1; if missing, +0
        - Streaming Music
            - if subscribed, service count +1; if missing, +0
        - Unlimited Data
            - if subscribed, service count +1; if missing, +0
        - Contract
            - **Important**
        - Paperless Billing
            - if subscribed, service count +1; if missing, +0
        - Payment Method
            - drop
        - Monthly Charge
            - impute with the median
        - Total Charges
            - drop
        - Total Refunds
            - impute with 0
        - Total Extra Data Charges
            - drop
        - Total Long Distance Charges
            - drop
        - Total Revenue
            - drop
    - status.csv: label
- preprocessing
    - merge location and population into one table
    - sort rows by customer ID in lexicographic order

### Task Split
- csv
    - demographics.csv: 770/6163 rows entirely missing
        - gender
        - age
        - under30
        - married
        - senior citizen
        - dependents
        - Number of dependents
    - location.csv
        - city
        - zip code
        - Lat Long
    - population.csv
        - Zip Code
        - Population
    - satisfaction.csv
        - Satisfaction Score
    - =============================== Gary
    - services.csv
        - Referred a Friend
        - Number of Referrals
        - Tenure in Months
        - Offer
        - Multiple lines
        - Phone Service
        - Internet type
        - Internet Service
        - Online Security
        - Online Backup
        - Average monthly long distance charges
        - Average monthly GB download
        - Device Protection Plan
        - ================================ Johnny
        - Premium Tech Support
        - Streaming TV
        - Streaming Movies
        - Streaming Music
        - Unlimited Data
        - Contract
        - Paperless Billing
        - Payment Method
        - Monthly Charge
        - Total Charges
        - Total Refunds
        - Total Extra Data Charges
        - Total Long Distance Charges
        - Total Revenue
        - ================================ Edge

### Results
- csv
    - demographics.csv: 770/6163 rows entirely missing
        - ~~gender~~
        - age
        - under30
        - married
        - senior citizen
        - dependents
        - ~~Number of dependents~~
    - location.csv
        - ~~city~~
        - zip code
        - Lat Long
    - population.csv
        - Zip Code
        - ~~Population~~
    - satisfaction.csv
        - Satisfaction Score

### Next meeting
- Some features are entirely absent from the test set; what to do? Fill them all with -1
- Training data imbalance (see the oversampling sketch at the end of this section):
    - ![](https://i.imgur.com/rgTDUhF.png)
    - Advice from 陳奕嘉
        - https://imbalanced-learn.org/stable/install.html
        - ![](https://i.imgur.com/rGfW04Q.png)

#### 12/29
- Finish the KNN imputation first
- Check what the distribution looks like after oversampling
- Check which classes are classified poorly
- Impute the labeled training data first, then combine it with the test data and impute both together

#### 12/30
- Try one-hot encoding

#### 1/2
- Data imputation history
    - A more raw version (dropping service/satisfaction), imputing with the mean
    - The original imputation method (following the original distribution), inspecting the plots; some columns were also dropped
    - Only impute the values we are certain about; impute the uncertain ones with sklearn's `IterativeImputer`
- Methods
    - SVM
    - AdaBoost
    - Random Forest
- TODO
    - 2-phase prediction (see the sketch at the end of this section)
        - First predict yes or no
        - Then, for the yes cases, predict which churn class they belong to
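A hedged sklearn sketch of the 2-phase prediction idea from the TODO above; the estimator choice, function names, and the `"No Churn"` label value are assumptions for illustration, not our final code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_two_phase(X, y, no_churn="No Churn"):
    """Phase 1: churn yes/no. Phase 2: churn category, fit on churned rows only."""
    y = np.asarray(y)
    is_churn = y != no_churn
    binary_clf = RandomForestClassifier(random_state=0).fit(X, is_churn)
    category_clf = RandomForestClassifier(random_state=0).fit(X[is_churn], y[is_churn])
    return binary_clf, category_clf

def predict_two_phase(binary_clf, category_clf, X, no_churn="No Churn"):
    pred = np.full(len(X), no_churn, dtype=object)
    churn_mask = binary_clf.predict(X).astype(bool)
    if churn_mask.any():  # only route predicted churners to phase 2
        pred[churn_mask] = category_clf.predict(X[churn_mask])
    return pred
```

The split lets phase 1 absorb the dominant "No Churn" class, so phase 2 only has to discriminate among the rarer, more balanced churn categories.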
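And a minimal imbalanced-learn sketch for the oversampling question above; the toy dataset is a stand-in for our training set, not the project data.

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for the project's training set.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4,
                           weights=[0.8, 0.15, 0.05], random_state=0)
print("before:", Counter(y))

# RandomOverSampler duplicates minority-class rows until every class matches
# the majority; SMOTE (same package) would synthesize new points instead.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```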
## Report
> [name=Edge] Setting aside whether the final report will be in Chinese or English, I'll start it off in English.

- Impute data according to each attribute's data distribution under its churn label (Edge):
    - Purpose
        - Based on the assumption that the distribution of the churn label is influenced by the distributions of the other attributes (we back up this assumption below by analyzing the attributes' data distributions under each churn label), we decided to impute data using the data distribution under the corresponding churn label. Since this method depends on the churn label, it can only be applied to the training data. For the testing data, we use each attribute's original data distribution to impute its missing values.
    - Justification
        - For simplicity, we roughly divided the churn classes into two: "No churn", and every other churn reason (abbreviated as "Churn").
        - We plotted the proportions of each attribute's data distribution under this modified churn class, as illustrated in the figures below. For clarity, we show only 2 figures to back up our reasoning. Some attributes' distributions in the training data are clearly influenced by the modified churn label, which suggests we may impute data according to an attribute's distribution conditioned on the modified churn label.
        - ![](https://i.imgur.com/KwbaaJP.png)
            - This figure shows the Yes/No ratio of the "Premium Tech Support" attribute under each customer's churn label; if the ratio were close to 50/50, the churn label would have little impact on that attribute.
        - ![](https://i.imgur.com/XRVmmyq.png)
            - This figure shows the distribution of the "Monthly Charge" attribute under each customer's churn label; if the distributions under the two labels were nearly identical, the churn label would have little impact on that attribute.
        - The other reason for using this imputation technique is that each customer is an independent person: their choices are made independently of other customers', based on their personal conditions. By using the churn label, we obtain slightly more precise guesses for the missing attributes, while the imputed data still conforms to the overall distribution of that attribute. In other words, we are essentially using the concept of likelihood to impute our data.
    - Training (see the sketch after this list)
        - For each attribute in the training dataset, we use the attribute's data distribution under the modified churn label to impute its missing values, which ensures that the imputed training distribution under the modified churn label stays the same as the original one.
        - In detail, because each customer's churn label is contained in the training dataset, we can take the missing attribute's distribution under that customer's modified churn label and randomly draw a value from that distribution to fill the missing entry.
    - Testing
        - For each attribute in the testing dataset, because we cannot obtain the customers' churn labels, we can only impute from the attribute's distribution without any churn-label information; the imputed data still conforms to the original data distribution.
    - Pros
        - Because we only consider the attribute's distribution under the given churn label and draw imputed values at random, the execution speed is acceptable.
        - This imputation technique models the independence of each customer without letting the imputed values stray from the original data distribution.
    - Cons
        - This imputation technique does not take the other attributes' information into consideration. Although we argued that each person is independent, we observed that some attributes still have a large impact on other attributes. In other words, we did not model the relationships between attributes: although the churn label is indeed influenced by certain attributes, other attributes may in turn drive those attributes and thereby lead to a certain churn label, so the churn label is not their only source of influence.
        - As a result, although the imputed data distribution conforms to the original one, the relationships between attributes may be wrong for individual customers, which may lower the performance of our prediction models, especially our tree-structured classifiers.
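A minimal pandas sketch of the procedure described above; the DataFrame names, the `Churn` column, and its label values are illustrative assumptions, not our actual project code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def impute_from_distribution(values: pd.Series, rng) -> pd.Series:
    """Fill NaNs by sampling from the empirical distribution of observed values."""
    observed = values.dropna()
    probs = observed.value_counts(normalize=True)  # empirical distribution
    out = values.copy()
    missing = out.isna()
    out[missing] = rng.choice(probs.index, size=missing.sum(), p=probs.values)
    return out

def impute_train(train: pd.DataFrame, label_col: str = "Churn") -> pd.DataFrame:
    """Training data: sample each attribute conditioned on the modified churn label."""
    out = train.copy()
    for col in out.columns.drop(label_col):
        out[col] = out.groupby(label_col)[col].transform(
            lambda v: impute_from_distribution(v, rng))
    return out

def impute_test(test: pd.DataFrame) -> pd.DataFrame:
    """Testing data: no label available, so sample the unconditional distribution."""
    return test.apply(lambda v: impute_from_distribution(v, rng))
```

Sampling (rather than filling with the mode or mean) is what keeps the imputed per-label distributions matching the originals, at the cost of run-to-run randomness; fixing the generator seed keeps results reproducible.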
## Outline
### 9/23
- Introduction
- handout0
- handout1
- Assign hw0, due 11/11

### 9/30

### 10/7
- Assign hw1, due 10/21 13:00

### 10/21
- Assign hw2, due 11/11 13:00

### 11/4
- Assign hw3, due 11/25 13:00

## Lecture Note
### 01_handout
- Page 41: since the cosine of the angle between $\frac{w_f}{\|w_f\|}$ and $\frac{w_T}{\|w_T\|}$, i.e. their inner product $\frac{w_f^\intercal w_T}{\|w_f\|\,\|w_T\|}$, cannot grow greater than 1 (both vectors are normalized, if you look closely), the inequality is written in this form
- A Simple Hypothesis Set: the "Perceptron"
    - $x = (x_1, x_2, \dots, x_d)$: features of customer
    - approve if $\sum_{i=1}^d w_ix_i > \text{threshold}$
    - deny if $\sum_{i=1}^d w_ix_i < \text{threshold}$
    - $y \in \{+1(\text{good}), -1(\text{bad})\}$; 0 ignored
    - $h(x) = \text{sign}\left(\left(\sum_{i=1}^d w_ix_i\right) - \text{threshold}\right)$
    - Rewrite $-\text{threshold}$ as $(-\text{threshold}) \cdot (+1)$ and absorb it into the sum as $w_0 x_0$
    - Finally $h(x) = \text{sign}\left(\sum_{i=0}^d w_ix_i\right) = \text{sign}(w^\intercal x)$
- Perceptron Learning Algorithm (PLA): for $t = 0, 1, \dots$ (see the sketch below)
    1. find a **mistake** of $w_t$ called $(x_{n(t)}, y_{n(t)})$
    $$\text{sign}(w_t^\intercal x_{n(t)}) \ne y_{n(t)}$$
    2. correct the mistake by
    $$w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}$$
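A compact NumPy sketch of PLA under the usual linear-separability assumption; the function name and toy interface are mine, not the handout's.

```python
import numpy as np

def pla(X, y, max_iters=10_000):
    """X: (N, d) features, y: (N,) labels in {+1, -1}. Returns learned weights."""
    X = np.hstack([np.ones((len(X), 1)), X])  # x0 = 1 absorbs the threshold as w0
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        mistakes = np.flatnonzero(np.sign(X @ w) != y)
        if mistakes.size == 0:       # no mistakes left: data separated, halt
            break
        n = mistakes[0]              # pick a mistake (x_n, y_n)
        w = w + y[n] * X[n]          # w_{t+1} <- w_t + y_n x_n
    return w
```

On linearly separable data the mistake-correction loop provably halts; `max_iters` just guards the sketch against non-separable input.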
### 03_handout
![](https://i.imgur.com/QIHtJof.png)
- Hoeffding's inequality
    - $N$: sample size (large)
    - $\epsilon > 0$
    - "$\nu=\mu$" is **probably approximately correct** (PAC)
    $$\mathbb{P}[|\nu-\mu|>\epsilon]\leq 2e^{-2\epsilon^2N}$$
    - $M$: cardinality of the hypothesis set
    $$\mathbb{P}[|E_{in}(g)-E_{out}(g)|>\epsilon]\leq 2Me^{-2\epsilon^2N}$$

### 04_handout
- VC Bound
![](https://i.imgur.com/rf9Oo0F.png)
$$
\mathbb{P}_D[|E_{in}(g) - E_{out}(g)| > \epsilon] \leq \mathbb{P}_D[\exists h \in \mathbb{H} \text{ s.t. } |E_{in}(h)-E_{out}(h)| > \epsilon] \leq 4m_{\mathbb{H}}(2N)e^{-\frac{1}{8}\epsilon^2N} \leq 4(2N)^{k-1} e^{-\frac{1}{8}\epsilon^2N}
$$
- Looseness of VC Bound
    - Theory: $N \approx 10000\,d_{vc}$
    - Practice: $N \approx 10\,d_{vc}$
    $$
    \mathbb{P}_D[|E_{in}(g) - E_{out}(g)| > \epsilon] \leq 4(2N)^{d_{vc}} e^{-\frac{1}{8}\epsilon^2N}
    $$
- VC dimension with finite set
![](https://i.imgur.com/gucYsL7.jpg)
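A quick numeric check of the polynomial bound above (the parameter values are just an illustration): with $d_{vc} = 3$ and $\epsilon = 0.1$, the bound only drops below $\delta = 0.1$ around $N \approx 29300$, matching "theory: $N \approx 10000\,d_{vc}$", while $N \approx 10\,d_{vc}$ leaves it astronomically large.

```python
import math

def vc_bound(N, d_vc=3, eps=0.1):
    """delta = 4 (2N)^{d_vc} e^{-(1/8) eps^2 N}."""
    return 4 * (2 * N) ** d_vc * math.exp(-(eps ** 2) * N / 8)

for N in (30, 3000, 29300):
    print(f"N = {N:>5}: bound = {vc_bound(N):.3g}")
# N =    30: bound = 8.32e+05   (vacuous at N ~ 10 d_vc)
# N =  3000: bound = 2.03e+10   (the polynomial term still dominates)
# N = 29300: bound = 0.0999     (finally below delta = 0.1)
```

The bound even grows before the exponential term takes over, which is exactly why it is so loose compared to the $N \approx 10\,d_{vc}$ seen in practice.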