# Experiment Log
## 3/20 Trying various parameters to see how the accuracy changes
While checking the accuracy, I suddenly realized that I was quite unfamiliar with how accuracy is actually evaluated.
* confusion matrix:

Instead of computing only the overall accuracy, it also shows how the predictions break down for each class. In our problem, one virus makes up most of the data; if the model simply predicted everything as that class it would still get a high accuracy, even though it has actually learned nothing. sklearn provides this built in:
```
from sklearn.metrics import confusion_matrix

actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
# for binary labels, ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tn, fp, fn, tp)
```
Tried it out in practice.
### Paper: CLOUDS: A Decision Tree Classifier for Large Datasets (Khaled Alsabti)
Gini index: when the decision tree searches for a split point, it computes, for each class in the resulting partitions, the fraction of records belonging to that class, squares those fractions, and sums them; the split where this sum is largest is the best split point (equivalently, the Gini impurity $1-\sum_k p_k^2$ is smallest).
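A minimal sketch of that computation (the helper names are mine, not from the paper): score each side of a candidate split by its sum of squared class proportions; weighting the two sides by their sizes is the usual convention, though the note above only mentions the sum of squares.
```
from collections import Counter

def gini_score(labels):
    """Sum of squared class proportions; larger means purer."""
    n = len(labels)
    return sum((c / n) ** 2 for c in Counter(labels).values())

def split_score(left_labels, right_labels):
    """Size-weighted score of a candidate split; the best split maximizes this."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_score(left_labels) \
         + (len(right_labels) / n) * gini_score(right_labels)

# a split that separates the two classes perfectly scores 1.0
print(split_score([0, 0, 0], [1, 1]))   # 1.0
print(split_score([0, 1, 0], [1, 0]))   # mixed split scores lower (~0.53)
```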
* To estimate the splitting point for numeric attributes, two methods are proposed:
1. Sampling the splitting points:
It divides the data into q intervals, each interval holding roughly the same number of records, computes the gini index at each interval boundary, and takes the boundary with the minimum gini value as the split point (see the sketch after this list).
2. Sampling the Splitting points with Estimation:
First, it does the same as method 1 to obtain $gini^{min}$, and then it computes for every interval an estimated gini $gini^{est}$ (the gini estimated for an interval, used as a threshold). Each interval's $gini^{est}$ is compared with $gini^{min}$: intervals whose $gini^{est}$ cannot improve on $gini^{min}$ are eliminated, while the surviving ("alive") intervals are evaluated over the entire dataset to choose the best split.
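A rough sketch of method 1 on a single numeric attribute, assuming equal-frequency intervals and a hand-picked q; method 2 would then only rescan the intervals whose estimate can still beat this $gini^{min}$.
```
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum(p^2) of one partition."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_sampled(values, labels, q=10):
    """Method 1: cut the sorted values into q equal-frequency intervals
    and evaluate the gini only at the interval boundaries."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    n = len(values)
    best = None
    for i in range(1, q):
        b = n * i // q                      # boundary position of interval i
        g = (b / n) * gini_impurity(labels[:b]) + ((n - b) / n) * gini_impurity(labels[b:])
        if best is None or g < best[0]:
            best = (g, values[b - 1])       # current gini^min and its split value
    return best

vals = np.random.rand(1000)
labs = (vals > 0.6).astype(int)             # toy data with a clean split near 0.6
print(best_split_sampled(vals, labs, q=20))
```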
## 3/23 Trying various parameters and understanding the algorithms
### Paper: Communication and Memory Efficient Parallel Decision Tree Construction
* They aim to develop a new algorithm that constructs decision trees more efficiently.
#### Two major challenges in decision tree construction
1. Finding the split point takes a lot of time, because the candidate splits have to be evaluated over all numerical attributes.
2. When the data is partitioned for a new node, a large amount of data has to be read and written, which also takes a lot of time.
#### The flaw of SLIQ
First, SLIQ sorts all attribute values. Second, for each attribute value it records the id of the corresponding record. It also maintains a class list that records, for every record, its class label and the node it currently belongs to. Because this class list has to be maintained, SLIQ does not scale well: the class list is accessed randomly and frequently, so it must stay in main memory. (Why does it need to be accessed randomly and frequently?)
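To make those structures concrete, here is a toy layout (my own naming, not SLIQ's actual code): each attribute list is pre-sorted by value and carries the record id, and the class list is indexed by record id.
```
# toy records: (record_id, income, age, label)
records = [(0, 75000, 40, "yes"), (1, 50000, 30, "no"), (2, 60000, 35, "yes")]

# one attribute list per attribute, pre-sorted by value, carrying the record id
income_list = sorted((income, rid) for rid, income, _, _ in records)
age_list    = sorted((age, rid)    for rid, _, age, _ in records)

# class list: indexed by record id, holds the label and the node the record
# currently belongs to; it is looked up by id while an attribute list is scanned
class_list = {rid: {"label": label, "node": "root"} for rid, _, _, label in records}

for value, rid in income_list:   # ids come out in value order, not id order
    print(value, rid, class_list[rid])
```
Scanning a sorted attribute list visits the record ids out of order, which seems to be why the class list lookups end up random and frequent.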
#### Experiments
1. accuracy 0.941
2. accuracy 0.940
3. accuracy 0.932
4. accuracy 0.943
Observed that once the number of trees is above 1000 and the number of leaves is above 32, the accuracy stays around 0.93–0.94 no matter how the parameters are changed. Perhaps, with the features currently provided, trees of this size have already reached their limit, so next I want to raise feature_fraction and see whether that gives better performance.
accuracy 0.943
accuracy 0.933
After adjusting feature_fraction to 0.6, there does not seem to be any better performance.
#### To try next
Could try sampling a larger fraction of the data.
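The log never names the library, but parameters like feature_fraction, leaves, and the unbalance flag read like LightGBM, so here is a hedged sketch of the kind of settings being swept; every value is a placeholder, and bagging_fraction is my guess at how "sampling a larger fraction" would be expressed.
```
import lightgbm as lgb

params = {
    "objective": "multiclass",   # assumption: multi-class virus labels
    "num_class": 4,              # placeholder class count
    "num_leaves": 32,            # "leaves above 32" from the runs above
    "feature_fraction": 0.6,     # fraction of features considered per tree
    "bagging_fraction": 0.8,     # fraction of rows sampled per iteration
    "bagging_freq": 1,           # re-sample every iteration
    "is_unbalance": False,       # the unbalance flag toggled later in the log
}
# train_data would be an lgb.Dataset built from the real features and labels:
# model = lgb.train(params, train_data, num_boost_round=1000)
```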
## 3/26 Trying a larger sampling fraction and continuing the paper read on 3/23
### Paper
#### Rainforest
This framework proposes a new way to record the data needed to determine the splitting criterion: the AVC set.
It looks like this:
| Income | Rank | Buy mobile |
| ------ | --------- | ---------- |
| 75000 | Professor | yes |
| 75000 | Professor | yes |
| 50000 | Lecturer | no |
AVC set 1 (Income):
| Income | yes | no |
| ------ | --- | --- |
| 75000 | 2 | 0 |
| 50000 | 0 | 1 |
AVC set 2 (Rank):
| Rank | yes | no |
| --------- | --- | --- |
| Professor | 2 | 0 |
| Lecturer | 0 | 1 |
To construct these tables, the memory needed is only (number of distinct values) × (number of distinct classes), which is much less than SLIQ requires. Although an AVC set does not keep enough information to reconstruct the original dataset, it is enough to decide the split criterion.
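A minimal sketch of building those AVC sets with plain dictionary counting (the toy rows are the table above; the helper itself is mine, not RainForest's API):
```
from collections import defaultdict

# the toy table above: (Income, Rank, Buy mobile)
rows = [(75000, "Professor", "yes"),
        (75000, "Professor", "yes"),
        (50000, "Lecturer",  "no")]

def avc_set(rows, attr_index, class_index=2):
    """Count class occurrences per distinct attribute value; memory is
    (#distinct values) x (#classes), not the size of the whole table."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row[attr_index]][row[class_index]] += 1
    return {value: dict(c) for value, c in counts.items()}

print(avc_set(rows, 0))   # {75000: {'yes': 2}, 50000: {'no': 1}}
print(avc_set(rows, 1))   # {'Professor': {'yes': 2}, 'Lecturer': {'no': 1}}
```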
#### Algorithm proposed by the paper: SPIES
First, it maintains three different kinds of AVC groups:
1. Small AVC group: contains the categorical attributes and the numerical attributes with a small number of distinct values.
2. Concise AVC group: contains the numerical attributes, but their ranges are divided into equal-width intervals; the number of intervals is an important parameter of the algorithm.
3. Partial AVC group: computed only for a subset of the concise AVC group, namely the ranges of the intervals that could contain the split point.
Second, the algorithm steps:
1. It constructs estimated small and concise AVC groups from a sample of the data, finds the interval with the highest gain, and prunes the other intervals that are unlikely to contain the split point.
2. It uses all of the data to construct the three kinds of AVC groups and finds the split point.
3. It then has to verify the split point, because the candidate intervals were obtained by sampling; a false pruning can be detected once the complete AVC sets are constructed.
4. If the check fails, it reconstructs the partial AVC group; otherwise it has found the split point.
In this way, sampling and pruning save a large amount of both time and storage.
Third, parallel computation:
We can partition the data into many pieces so that each thread computes its own AVC sets; at the end, the per-thread AVC sets are simply added together. This makes the construction very easy to parallelize.
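A tiny sketch of that merging step, assuming each worker returns its own per-(value, class) counts; Counter objects add element-wise, which is exactly the "just add up each AVC set" part:
```
from collections import Counter

# per-partition AVC counts for one attribute: (value, class) -> count
partition_1 = Counter({(75000, "yes"): 1, (50000, "no"): 1})
partition_2 = Counter({(75000, "yes"): 1})

# each thread builds its counts independently; merging is element-wise addition
merged = partition_1 + partition_2
print(merged)   # Counter({(75000, 'yes'): 2, (50000, 'no'): 1})
```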
### Experiments
accuracy 0.936
It looks like adjusting the sampling fraction does not increase the accuracy either; some other parameters probably need to change.
Adjusted the unbalance parameter to false.
accuracy 0.936
No change, so I printed out the prediction results for each class.

Found that even for the classes with little data there are almost no mistakes; the errors are still mainly when predicting class 1, the same situation as before.
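One way to pull out per-class results like that (a sketch with sklearn; the arrays here are placeholders for the real labels and predictions):
```
from sklearn.metrics import classification_report, confusion_matrix

# placeholders: y_true / y_pred would be the real labels and model predictions
y_true = [0, 1, 1, 2, 2, 2, 1, 0]
y_pred = [0, 1, 2, 2, 2, 2, 1, 0]

print(confusion_matrix(y_true, y_pred))       # per-class error counts
print(classification_report(y_true, y_pred))  # precision / recall / F1 per class
```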
## 3/31 Paper: McRank: Learning to Rank Using Multiple Classification and Gradient Boosting
### Ranking Problem
In this problem, a user types a query and the system tries to return relevant information, as Google does. Because users want the response as soon as possible, the system usually does not examine every document: a small model first picks some relevant documents, and another model then ranks them.
In this paper, the boosting trees are built using the idea of bins. They only bin the regions where data actually exist, because the boosting tree never considers regions that contain no data. At first they set a small bin length and a maximum number of bins; if the data cannot fit into that many bins, they simply increase the bin length.
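A rough sketch of that bin-growing idea, following my reading of the description (the names and the doubling step are my own; the paper only says the bin length is increased): start from a small bin length and a cap on the number of bins, keep only the bins that actually contain data, and enlarge the bin length until the occupied bins fit under the cap.
```
import numpy as np

def adaptive_bins(values, init_width=0.01, max_bins=256):
    """Grow the bin width until the non-empty bins fit under max_bins."""
    width = init_width
    while True:
        # only bins that actually contain data are kept
        occupied = np.unique(np.floor(values / width).astype(int))
        if len(occupied) <= max_bins:
            return width, occupied
        width *= 2   # assumption: the paper only says the bin length is increased

values = np.random.exponential(scale=5.0, size=10_000)
width, bins = adaptive_bins(values)
print(width, len(bins))
```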