
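# Data Analysis Notes

## Basics

### Types of data

- Nominal
  - Key point: categories only, with no inherent order
  - Examples
    - Gender: male, female
    - Blood type: A, B, O, AB
    - Nationality: Taiwan, United States, China, Japan
  - Analysis methods
    - Frequency distribution, mode
- Ordinal
  - Key point: values can be ranked
  - Examples
    - Student grades: excellent, good, average, poor
    - Competition ranking: first, second, third, fourth
    - Service rating: excellent, very good, fairly good, average, poor
  - Analysis methods
    - Frequency distribution, mode, median, range (a mean is only meaningful if the ranks are assumed to be evenly spaced)
- Interval
  - Key point: equal spacing between values; addition and subtraction are meaningful, multiplication and division are not
  - Examples
    - Temperature: 20 °C, 25 °C, 30 °C
    - Time of day: 1:00, 2:00, 3:00 (a duration such as 1 hour has a true zero and is ratio-scaled)
  - Analysis methods
    - Frequency distribution
    - Mode, median, or mean
    - Range, standard deviation, and variance
- Ratio
  - Key point: all arithmetic operations (addition, subtraction, multiplication, division) are meaningful
  - Examples
    - Height: 170 cm, 180 cm, 190 cm
    - Weight: 60 kg, 70 kg, 80 kg
  - Analysis methods
    - Everything available for interval data, plus matrix operations

### Distance computation

| | Minkowski | Mahalanobis | Cosine | Jaccard |
| -------- | -------- | -------- | -------- | -------- |
| Notes | Ignores correlation between dimensions; computed across multiple dimensions | Accounts for correlation; can identify the more important dimensions | Considers the angle between two vectors; suited to documents and text data | Token matching that ignores whitespace; suited to measuring similarity between sets |

- [Reference: types of measurement scales (nominal, ordinal, interval, ratio)](https://medium.com/marketingdatascience/%E5%B0%BA%E5%BA%A6%E7%9A%84%E9%A1%9E%E5%9E%8B-%E5%90%8D%E7%9B%AE%E5%B0%BA%E5%BA%A6-%E9%A0%86%E5%BA%8F%E5%B0%BA%E5%BA%A6-%E5%8D%80%E9%96%93%E5%B0%BA%E5%BA%A6-%E6%AF%94%E4%BE%8B%E5%B0%BA%E5%BA%A6-d567f93b5104)

## Classification

### Different types of classification model

1. Decision tree
2. Naive Bayesian classifier
3. NN

### Decision tree overview

#### Method

1. Greedy
2. Divide and conquer
3. Uses entropy to decide how to split the records
4. Attributes are categorical

#### Pros

1. Easy to implement
2. Fast
3. The resulting structure is easy to interpret
4. Works well on simple datasets

#### Cons

1. Not suited to correlated attributes (features that are correlated with each other)
2. Prone to overfitting

#### Calculation

![image](https://hackmd.io/_uploads/HJaomkjIT.png =500x)

#### Improvements

1. Prepruning / postpruning
2. Random Forest = Bagging + Decision Tree
   - Construct multiple decision trees
   - Each tree uses a random subset of the features
   - Majority vote (aggregate)
3. Boosting
   - Similar to bagging, but puts more emphasis on learning from mistakes to improve overall performance. The weights of the records the previous classifier got wrong are increased, so the next classifier trains harder on them and learns the characteristics of the misclassified data.

### Cost-sensitive measures

![Screenshot 2023-12-16 2.04.05 PM](https://hackmd.io/_uploads/rJjkR25Up.png)

- For example, when deciding whether someone is infected, recall is the appropriate measure because it tells us what proportion of the truly infected cases were caught.

### Naive Bayes overview

![Screenshot 2023-12-16 2.29.28 PM](https://hackmd.io/_uploads/rJp0mp5UT.png)

#### Pros

1. Easy to implement

#### Cons

1. Assumes that all attributes are independent of each other (given the class); it does not apply well otherwise
   - E.g. Symptoms: fever, cough
   - E.g. Disease: lung cancer, diabetes
   - A belief network can be used instead
     - A Bayesian belief network allows conditional independence to be defined among subsets of the variables

## Clustering

### Different types of clustering model

1. Partitioning algorithms
   - K-means
2. Hierarchical algorithms
3. Density-based
4. Grid-based
5. Model-based

### K-means overview

#### Pros

- Simple to understand and easy to implement.
- Fast, and can handle large datasets.

#### Cons

- Easily gets stuck in a local optimum.
- Depends on the initial assignment of clusters.
- k is a hyperparameter that has to be tuned.
- Struggles with unevenly distributed data.
- [Three shortcomings of K-means](https://ben-do.github.io/2016/08/20/Three-Shortcomings-of-K-means/)

#### Suitable when

1. The dataset is large and the number of features is not too high.
2. The cluster structure of the dataset is fairly clear.

### K-means vs. K-medoids

| | K-means | K-medoids |
| -------- | -------- | -------- |
| Center | A virtual point | An actual sample |
| How the new center is chosen | Mean of the samples in the cluster | Sample with the smallest total distance to the others in the cluster |

#### Pros

- When the data contain noise and outliers, K-medoids is more robust than K-means.

#### Cons

- K-medoids performs well on small datasets but does not scale to large ones, because computing the medoids takes O(n^2) time.
- [Reference article](https://blog.csdn.net/databatman/article/details/50445561)

### Hierarchical clustering

- There are two main categories: agglomerative and divisive hierarchical clustering.
- [Reference link](https://medium.com/ai-academy-taiwan/clustering-method-4-ed927a5b4377)

#### Agglomerative hierarchical clustering

- Min (single linkage)
  - Pros
    - Can handle non-elliptical shapes
  - Cons
    - Sensitive to noise and outliers
- Max (complete linkage)
  - Pros
    - Less susceptible to noise and outliers
  - Cons
    - Tends to break large clusters
    - Biased towards globular clusters
- Average
  - Pros
    - Less susceptible to noise and outliers
  - Cons
    - Biased towards globular clusters
- Ward's (these pros and cons came from Google's Bard)
  - Pros
    - Gives fairly stable clusterings and is less affected by noise.
    - Produces compact clusters.
  - Cons
    - Comparatively expensive to compute.

### Density-based clustering

#### DBSCAN

[A reference that really makes it click](https://axk51013.medium.com/%E4%B8%8D%E8%A6%81%E5%86%8D%E7%94%A8k-means-%E8%B6%85%E5%AF%A6%E7%94%A8%E5%88%86%E7%BE%A4%E6%B3%95dbscan%E8%A9%B3%E8%A7%A3-a33fa287c0e)
[If it still does not make sense, watch this video](https://www.youtube.com/watch?v=RDZUdRSDOok)

- Key points
  - DBSCAN is a density-based method: what it cares about most is the density of the data.
  - DBSCAN handles noise automatically.
  - DBSCAN decides the final number of clusters on its own, based on the data.
- Hyperparameters
  - DBSCAN has only two hyperparameters in practice:
    1. eps: how large a neighborhood to search around each point
    2. min_samples: how many points a neighborhood must contain to count as dense
- In practice
  - Tune according to the amount of noise
    - After doing EDA we usually have a rough idea of what proportion of the data is noise, so when DBSCAN flags too many or too few points as noise we can go back and adjust the hyperparameters.

#### OPTICS

[My understanding is that it is much like DBSCAN](https://jianjiesun.medium.com/dl-ml%E7%AD%86%E8%A8%98-%E5%85%AB-clustering-algorithm-optics-207279452c90)
[If that is unclear, this one is also worth reading](https://medium.com/ai-academy-taiwan/clustering-method-6-d0c207daced6)

- I do not fully understand the following two statements:
  - When the reachability distance (ReachDist) is smaller than the density radius ε, the point is close to the previous cluster and is treated as belonging to it.
  - When the reachability distance is larger than ε, there are two cases. If the core distance is smaller than ε, a new cluster is started: the point meets the condition for forming a cluster but is too far from the other clusters. If the core distance is larger than ε, the point does not meet the cluster condition and is too far from the other clusters, so it is treated as noise.
- Much like DBSCAN, but less sensitive to the hyperparameters
  1. Shares DBSCAN's properties (no need to specify the number of clusters, works for clusters of arbitrary shape, can identify noise points)
  2. Compared with DBSCAN, it is less sensitive to the choice of parameters

#### DENCLUE

[This article covers the general idea](https://tomohiroliu22.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E5%AD%B8%E7%BF%92%E7%AD%86%E8%A8%98%E7%B3%BB%E5%88%97-88-%E5%9F%BA%E6%96%BC%E5%AF%86%E5%BA%A6%E8%81%9A%E9%A1%9E-density-based-clustering-6d1a726e435b)

- Pros:
  - Introducing kernel density gives a precise definition of density, so it has a stronger theoretical basis than DBSCAN.
- Cons:
  - The precise definition of density raises the algorithmic complexity. For every point of the grid over the domain, all samples have to be traversed to compute the total kernel density. The grid size determines the cost: a coarse grid means less computation and lower precision; a fine grid means more computation and higher precision.

#### CLIQUE

### Cluster evaluation measures

- External index: measures how well the cluster labels match externally supplied class labels, e.g. entropy.
  - External indices need external class information and are typically used in supervised settings.
- Internal index: measures the goodness of the clustering structure without reference to external information, e.g. sum of squared error (SSE).
  - Internal indices are unsupervised, need no external class information, and are mainly used in unsupervised settings.
- Relative index: used to compare two different clusterings or clusters; sometimes called a criterion rather than an index.

#### Internal measures

- Cluster cohesion: measures how similar the members of a cluster are to each other. The smaller the value, the higher the cohesion and the more tightly related the members.
  - Cohesion is usually measured by the within-cluster sum of squared error (SSE). SSE takes the squared distance from each data point to the center of its own cluster and sums them all. A smaller SSE means the members are closer to the cluster center, differ less from each other, and the cohesion is higher.
- Cluster separation: measures how well a cluster is distinguished from the other clusters. The larger the value, the higher the separation and the more distinct the clusters.
  - Separation is usually measured by the between-cluster sum of squares. It takes the squared distance from each data point to the centers of all clusters other than its own and sums them all. A larger value means the data points are farther from the centers of the clusters they do not belong to, the clusters differ more, and the separation is higher.

### Outlier detection

- [This article introduces the statistical approaches](https://www.freecodecamp.org/chinese/news/how-to-detect-outliers-in-machine-learning/)

## Text Mining

### Basic concepts in NLP

1. Lexical analysis (words)
2. Syntactic analysis (sentence structure)
3. Semantic analysis (meaning)
4. Inference

### NLP techniques

1. POS tagging (labels such as Noun, Adj, Verb, ...)
   ![Screenshot 2023-12-16 3.34.37 PM](https://hackmd.io/_uploads/BkiGXA9L6.png)
2. Stemmer (stem extraction)
   ![Screenshot 2023-12-16 3.35.02 PM](https://hackmd.io/_uploads/BJMVQ0cI6.png)
3. Vector space model (bag of words)
4. Term frequency and inverse document frequency
   - Normalize the term frequency
   - Scale down the coordinates of terms that occur in many documents (e.g. the, a, of, to, and, is, are, ...)

### Word2Vec

#### Continuous Bag-of-Words model

1. Predicts the target word from its context words
2. Uses a neural network to learn the weights of the word vectors

- Cons
  - Cannot capture rare words well

#### Skip-gram model

1. Helps capture rare words

- Pros
  - Can capture rare words
  - Captures the semantic similarity between words

### RNN

1. Designed for building a language model
2. All the inputs are dependent on each other

#### Pros

1. Can process input of any length
2. Model size does not increase for longer input
3. The same weights are applied at every step

#### Cons

1. Recurrent computation is slow
2. Difficult to access information from many steps back

### Transformer

1. Self-attention

#### Pre-training tasks

1. Masked LM (cloze task)
   - Select 15% of all tokens for masking
   - 80% of the time: replace the selected word with the [MASK] token
   - 10% of the time: replace the selected word with a random word
   - 10% of the time: keep the selected word unchanged
2. Next Sentence Prediction

![Screenshot 2023-12-16 4.41.10 PM](https://hackmd.io/_uploads/H1m3MkoI6.png)

### Topic mining

1. Input:
   - A collection of text documents
   - The number of topics
2. Output:
   - k topics
   - Coverage of the topics in each document

#### Cons

1. Not generic:
   - Cannot represent complicated topics
2. Cannot capture variation in vocabulary
3. Word sense ambiguity (the same word with different meanings)

## Association Rules

### Apriori Algorithm

- [See this video](https://www.youtube.com/watch?v=43CMKRHdH30)

### FP-Growth Algorithm

- [This video explains it very clearly](https://www.youtube.com/watch?v=7oGz4PCp9jI)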
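The notes only link to videos for Apriori, so here is a rough sketch of its level-wise idea as a complement: count frequent single items, join frequent (k-1)-itemsets into k-item candidates, prune any candidate with an infrequent subset, then count the survivors. The function name `apriori`, the absolute `min_support` count, and the toy `baskets` are all assumptions made for illustration, not something taken from the linked material.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Count step: keep candidates that meet the support threshold.
        current = {}
        for cand in candidates:
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_support:
                current[cand] = count
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy transactions, made up for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

From the resulting counts, the confidence of a rule such as beer → diapers is the support of {beer, diapers} divided by the support of {beer}.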
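Going back to the decision tree section: the notes say entropy is used to decide how to split the records, with categorical attributes. A minimal sketch of that step might look like the following, where `entropy`, `information_gain`, and the four toy `rows` are hypothetical names and data chosen for illustration. The attribute with the highest information gain would be the greedy choice for the next split.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    # Group the target labels by the value of the categorical attribute.
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g)
                    for g in groups.values())
    return base - remainder


# Hypothetical toy records, made up for this example.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
gain_outlook = information_gain(rows, "outlook", "play")
gain_windy = information_gain(rows, "windy", "play")
best_attribute = max(["outlook", "windy"],
                     key=lambda a: information_gain(rows, a, "play"))
```

In this toy data `outlook` separates the classes perfectly while `windy` tells us nothing, so a greedy tree would split on `outlook` first.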
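For the internal measures in the clustering section, the two quantities can be computed directly from their descriptions in these notes: cohesion as the sum of squared distances from each point to its own cluster center, and separation as the sum of squared distances from each point to every other cluster's center. This is only a sketch under those definitions; the helper names and the two toy clusterings are assumptions for illustration, and other textbooks define separation differently.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n
                 for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cohesion_sse(clusters):
    """Sum of squared distances from each point to its own cluster center."""
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(sq_dist(p, c) for p in pts)
    return total


def separation(clusters):
    """Sum of squared distances from each point to the other clusters' centers."""
    centers = [centroid(pts) for pts in clusters]
    total = 0.0
    for i, pts in enumerate(clusters):
        for p in pts:
            total += sum(sq_dist(p, c)
                         for j, c in enumerate(centers) if j != i)
    return total


# Two hypothetical groupings of the same four points.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
mixed = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

tight_scores = (cohesion_sse(tight), separation(tight))
mixed_scores = (cohesion_sse(mixed), separation(mixed))
```

The `tight` grouping should come out with a lower SSE and a higher separation than `mixed`, which matches the reading that smaller cohesion values and larger separation values indicate a better clustering.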
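The Masked LM recipe from the Transformer section (15% of tokens selected, then an 80/10/10 split between `[MASK]`, a random word, and the unchanged word) can be sketched as below. Selecting each token independently with probability 0.15 is a simplifying assumption of this sketch, so the selected share is only approximately 15%; `mask_tokens`, `vocab`, and the synthetic token list are hypothetical and for illustration only.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (inputs, labels) for a masked-LM style training example.

    labels[i] is the original token where a prediction is required,
    and None everywhere else.
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)  # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # 80%: mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random word
            else:
                inputs.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels


# Hypothetical vocabulary and token sequence, made up for this example.
vocab = ["cat", "dog", "sat", "mat", "ran", "the", "on"]
rng = random.Random(0)
tokens = [rng.choice(vocab) for _ in range(10000)]
inputs, labels = mask_tokens(tokens, vocab, rng)

selected = [i for i, lab in enumerate(labels) if lab is not None]
masked = [i for i in selected if inputs[i] == "[MASK]"]
```

Keeping some selected tokens unchanged or swapped for random words means the model cannot rely on `[MASK]` always marking the positions it has to predict.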