資料探勘 -- 彭文志

--- robots: index, follow tags: NCTU, CS, 共筆 description: 交大資工課程學習筆記 lang: zh-tw dir: ltr breaks: true disqus: calee GA: UA-100433652-1 --- 資料探勘 -- 彭文志 ====== ## Syllabus - 彭文志 - email: wcpeng@cs.nctu.edu.tw - 書: [Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison-Wesley](http://www-users.cs.umn.edu/~kumar/dmbook/index.php) - 4-5 hw - One data analysis project - team: 3-4 people - Datasets - PM 2.5 - Tai-Power - others - [WSDM 2018](http://www.wsdm-conference.org/2018/) - [Yelp Challenge](https://www.yelp.com/dataset/challenge) - NCTU DM workshop ## Summary - Data base - [Data wherehouse](https://zh.wikipedia.org/wiki/%E8%B3%87%E6%96%99%E5%80%89%E5%84%B2) - AI - ML - ... - Material - Introduction - association rules - classification - clustering - time series data analysis - sq - Knowledge Discovery - Nontrivial process - Database - Relational - Transactional - Spatial - Time series - Multimedia - Graph - Task - Prediction - Description - Knowledge - Association - Classification - Clustering - Time series - Regression - Performance - Efficiency - Effectiveness ## Course ### Data ![](https://i.imgur.com/C26uX38.png =260x) - Attribute - 分類 - Nominal - Ordinal 順序 ex. rank, grade - Interval 區間 - Ratio 比例 - Discrete 分散 - Continuous 連續 - Data Set - type - Record - Graph - Order - Structure Data - 可以把 data 訂出 table 與 schema - Dimensionality (多維) - ex. Curse of Dimensionality - Sparsity (稀疏) - Resolution (解析度) - Unstructure Data - 像是文件，只是 data object，不太能描述出 schema - 儲存方式(?) - Record Data - Data Matrix: 用 matrix 存 - Document Data: 存詞 - Transactioin Data - Graph Data: 用圖存 (ex. [PageRank algo](https://zh.wikipedia.org/wiki/%E4%BD%A9%E5%A5%87%E6%8E%92%E5%90%8D)) - Ordered Data - Sequences of transaction - 通聯記錄 - ![](https://i.imgur.com/KRswP6T.png =260x) - Genomic sequences - ![](https://i.imgur.com/PCBF2qI.png =260x) - [Spatio-Temporal Data](https://www.ibm.com/developerworks/cn/analytics/library/ba-cn-spatio-temporal-prediction-application/index.html) - Data Quality - 如何判斷資料品質 - 遇到不好的 data quality，該如何處理 - noise / outliers - 可能是要替掉，但也可能是應該分析的異常 - duplication - missing - 消掉這個 data object - 補上資料 - 去除段落 - 把區段放大 - Data Prepossing - Aggregation (聚合) - Data reduction 把不要的 attribute or object 去除掉 - Change of scale: ex. 把城市放大成國家 - More “stable” data - Sampling (取樣) - 可以作初步分析 - 策略 - random - without replacement - 被抓出來就不要放回去 - with replacement - stratified sampling - 把 data 先分組，在從每組裡抽樣 - Sample size - Curse of Dimensionality - [參考](https://hackmd.io/OwBghgxgnFCMAcBaWAWARhRKAm3iKmDCWxQGYJgUAzNANmoFYwg=#feature-selection) - Dimensionality Reduction - 避免 curse of dimension - ex. PCA, SVD, [auto encoding](https://en.wikipedia.org/wiki/Autoencoder) - Feature Subset Selection - Brute-force: 都試試看就好啦 XD - embeded: 用 ML train 一下 - filter - Wrapper - Feature Creating - 用現有的 feature，創造出新的 feature - 會比較乾淨、有用 - Similarity and Dissimilarity - Proximity 看像的部分 - distance funcetion 看不像的部分 - 聽說 similarity + dissimilarity 一起看可能會比較慘ㄟ - Distance - [參考](https://hackmd.io/OwBghgxgnFCMAcBaWAWARhRKAm3iKmDCWxQGYJgUAzNANmoFYwg=#ch5-similarity-based-learning) - 距離需要符合條件 1. $d(p, q) ≥ 0$ for all p, q: Positive definiteness 2. $d(p, q) = d(q, p)$ only if $q = p$ 3. $d(p, q) = d(q, p)$ for all p, q: Symmetry 4. $d(p, r) ≤ d(p, q) + d(q, r)$: Triangle Inequality ### Association Association Rule Mining - 名詞 - Item - Itemset - k-itenset - Support - 出現 - Support count ($\sigma$) - 計算出現數量 - Frequent Itemset - itemset 的 support >= min support - Association Rule - X -> Y - Rule Evaluation Metrics - Support - Itemset 可以出現 - $s = \dfrac{\sigma()}{|全部數|}$ - Confidence - 指出現在 rule 的機率 - ex. {a, b} -> {c} => support(a,b,c) / support(a,b) - 條件機率 - 若發生 X 狀態之下，Y 出現的機率 - Strong rule - $>=$ min support & $>=$ min confidence - Support 代表可性度(出現次數夠多)，confidence 比較像規則 - 方法 - Brute-force - 列舉所有 rules (排組 + DP) - 計算數量 - 刪掉不符合的 - Candidates Generation - 先找 frequent itemset - 先找出所有的 itemset - 透過 lattice 找出所有 itemset - Apriori Algo - 如果 lattice 上面的出現次數已經不夠多了，下面就不用走下去了 - 先將 1 item 列出來 - 刪掉 count 不夠的 - join 成 2 item - subset testing (如果這個 node 的媽媽中，有人被刪掉了，那這個人就可以被刪掉了) - 重複 - 挑戰 - Multiple scans of transaction database - Huge number of candidates - Tedious workload of support counting for candidates - 再檢查 min confidence - 用 hash 處理 - Redundant Rules - 在 {a,d}=>{c,e,f,g} 是已知 rule 條件下，哪些是可以確定直接是 rule 的？ - {a,d}=>{c,e,f} - {a}=>{c,e,f,g} - {a,d,c}=>{e,f,g} - {a}=>{d,c,e,f,g} - Improvement of Apriori Algo - 想法 - 減少 database scans - 減少候選人 - 促進候選人的 support count - 方法 - DHP - 用 hash 先篩一輪，把不可能是 frequence itemset 的刪掉 - 這樣在 scan 的時候，也相對需要 scan 的數量會比較少 - Partition - 將 dataset 切塊 - 每個區塊都可以放入 memory - 針對每一個 part 執行 Apriori，存進 Li ＊ - C = Li union - 重新在 scan 一次避免只是 partition 的頻率很高) - 用到 2 次 disk scan - Sampling - 針對 sample database 作 Apriori - Sampling 完之後的存入 Potentially - Border - 把 sample 裡覺得不是 frequence 的，拿出來再 check 一次(拿的 item 是有技巧的，不是所有不再 freq 理的 item 都要拿) - 遞迴掃下去 - 用的是 smaller support (放低標準) ＊ - 最好的 disk scan 是 1，最差是 2 > 是否所有的 freq 都可以找到？ - FP-Growth - 想法 - 不用 Candidation generation 的方法 - 減少 Disk Scan (DB scan) - 數量越多的，越可能出現在 freq item set - 作法 - 先對數量作排序 - 建 FP-tree - ![](https://i.imgur.com/9GvCl7s.png =250x) - FP-SubTree - 最後在檢查 confidence - [參考](https://kknews.cc/zh-tw/tech/j6bxakl.html) - 其他 - Multiple-Level Association Rules - 產品通常有相互關係 - ![](https://i.imgur.com/jEGR9Cx.png =250x) - Frequent closed itemset - ![](https://i.imgur.com/LHUphCO.png =250x) - Quantitative Association Rules - 不同維度(多個)的 association - 若連續，可以作 bining - ![](https://i.imgur.com/0OsokQy.png =250x) - Static(等寬) 切法 - ![](https://i.imgur.com/JABkjlY.png =250x) - 分布成空間中的點，在等距離切割 - data cube - ![](https://i.imgur.com/xFTmQOL.png =250x) - Dynicmic - 可能在 result 轉折處 - 可能在分布方式 (下圖第三種方法) - ![](https://i.imgur.com/eDLYPvf.png =260x) - clustering - ARCS (Association Rule Clustering System) - 先作 binner，用 binner 的切割做出 Association Rule，這時候 rule 已經出來了，在做 clustering - 作法 1. binning 2. frequent predicateset 3. clustering 4. optimize - ![](https://i.imgur.com/yKwvUdF.png =260x) - Mining Quantitative Association Rules in Large Relational Tables - 在生 rule 的同時，會回去修改 binning 的範圍 - ex. 如果這個 bin 太小了(< min support)，可能可以把兩個 bin merge 起來 - Correlation Analysis - Corr(A,B) = P(AUB)/(P(A)P(B)) - 有時候只算完 association rule 還不夠，需要在做一些分析 - Association rules weight - 每個 association rules 的權重不一樣 - MOTIF - Time Series Motif Discovery - 尋找時序間重複的 pattern - 可能要切 windows - 可能 y 軸要先 match 一個 function - [Tool](https://github.com/GrammarViz2/grammarviz2_src) - discord - 跟別人最不像的 ### Clustering 分群 - 三重點 - dist - 如何定義 distance function - algo - 用哪一種演算法 - analy - 分析分群結果好壞 - 影響分群效果 - 分群數量 - 希望可以抵抗 outliner - 想法 - center-based - density-based - 算法 - partition - 分成 k 群 - k-means - 作法 - 先隨便找 k 個中心 - 將所有點區分出與他們最近的中心 (分群) - 計算每個群的真的平均 - 將真的平均作為新的中心 - 一直做到沒有新的變動 - 複雜度: - t: 迭代次數 - n: 點數 - k: 分群數 - 多數時候 k, t << n - O(tkn) - 討論 - 受 noice 影響 - 受到 data size / densities 影響 - 受一開始初始 k 個點影響 => 多作幾次 - 找不出不規則形狀的分群結果 - 如果已經有分群的 label 了(監督式學習)，就不要用 k-means - 或者多切幾份在做 merge - Bisection K-means - 每次只切一刀 - 作法 - 每次找 cost 最高的群切割 - k-medoids algo. (PAM) - algo. - 任選 n 個點 (從原來的點中) - 把其他點劃分到離他們比較進的點群 - 計算 total cost - 每次選不同點當機制準，最後看是哪幾個點當基準產生最小的 total cost - 對抗 noise 比較強 - 對於數量級很大的資料，效率低 - => sampling 加速 (CLARA) - hierarchy - 分群成樹狀結構 - Agglomerative - bottom up - Algo. (disjoin set) - 把每個物件先都把自己當一群 - 每次把認兩個最相近的*群*合併 - 如何算*群*的距離？ - average link: 把 n\*m 個組合配起來算，加總 - centroid: 找中點 - single link: 找兩個群中，距離最遠的 - complete link: 距離最近的 - 實作 - proximity matrix - birge - (R-Tree) - Cure - Divisive - top down - density-base - DB Scan - 畫圓 - Optics - 自己找參數，讚 - Denclue - Clique - graph-base - SNN - model-base - 評估效果 - Outlier Analysis - 跟別人不一樣 ### Web Mining - 觀察 - Web 很巨大 - 複雜 - 動態 - 99% 無用 - 種類 - Web Content - ex. Search Engine - Text Mining - NLP - Classification / Clustering - Similarity search - term association - keyword - Web Traversal Robots - Crawler - 個人化 - 分群 - Multilevel WebDB - Layer 0: Web - Layer 1: time / URL / keywords / ... - Layer 2: summary / classification / clustering - Structure - Hyper Link - 建圖 - 角色 - hub - authorityies - a(p) = sum h(q) - Usage - Log ### Page Rank - Page Rank Matrix - $PR(A) = (\dfrac{PR(T_1)}{C(T_1)} + ... + \dfrac{PR(T_n)}{C(T_n)})$ - Dead ends - 節點只有入度，沒有出度 - 他們的 PR 會產生 $\dfrac1 0$ 的狀況 - 直接把這些節點遞迴的刪掉 - Topic-Sensitive PageRank - 加入 interesting Vector - $v_i = \beta M v_{i-1} + \dfrac{(1-\beta)e_s}{|S|}$ - $S = \{B, D\}$: 代表 B, D 都是使用者喜歡的 - $\beta$: 參數比 - $M$: 原來的 PageRank Matrix - $e_s$: 如果 S 裡有的就是 1 ，沒有的就是 0 所產生的 Vector - ex. ![](https://i.imgur.com/RkK50Is.png =260x) - Link Spam - 垃圾連結，但是技術性騙過 PageRank - 網頁分成三種 - Inaccessible pages - spammer 不會影響的 site - Accessible pages - 不是 spammer 的 site，但會被 spammer 影響 - Own pages - spammer 的 site - ex. ![](https://i.imgur.com/iHtdgGv.png =350x) - 利用 TrustRank 與 Spam mass 解決這個問題 - Trusk Rank - 利用 Topic Sensitive PR - 缺點：Spam 可以輕易的建出 link 到 good topic 上 - Spam Mass - $Spam Mass = \dfrac{r-t}{r}$ - r: Page Rank - t: Trust Rank - Spam Mass 越小 or 負的，代表越不可能是 Spam - Spam Mass 越接近 1，代表越可能是 Spam ### Sequential Patterns - sequential data - ... - in / out - input: sequential data, minimum support - output: all subsequence - 提取 candidate - 笨方法 - Candidate 1-subsequences - Candidate 2-subsequences - Candidate 3-subsequences - ... - Generalized Sequential Pattern (GSP) - raw data: - ![](https://i.imgur.com/eSsI6WS.png =250x) - Sort phase - 仿照 tuple 的 sort，從前面往後排 - ![](https://i.imgur.com/7ijXAZ6.png =250x) - Litemset phase - 找出 large item (出現的次數超過 minSup) - ![](https://i.imgur.com/ZQTYU3F.png =250x) - 作 map to 是為了之後查找方便 - Transformation phase - 把 origin 作一些轉換 - 同時如果不是 frequent large item，就把他刪掉 - ![](https://i.imgur.com/1RXIGyI.png =250x) - Sequence phase - Apriori-like algorithm - candidate generation - ![](https://i.imgur.com/DX9dF4x.png =250x) - ![](https://i.imgur.com/BfU8SPU.png =260x) - Maximal phase - Sliding Window ## HW ### HW0 - 正規化 - 距離 - offset - 振幅放大 - linear trend (找到趨勢線相減) - noice (smoothing function) - 分群 - [DTW](https://en.wikipedia.org/wiki/Dynamic_time_warping): 左右被移動了 - fastDTW - 降低維度 ### HW1 - 定義自己的 transaction - 離散化 binning - 用別人的 package 記得標出處 ### Final - 3-4 人 - 自己找 dataset - Proposal (8 page slide) - Title - Background, purpose, challenge - Related work - Problem define - Expected result - Reference

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.