C.A.Lee
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Versions and GitHub Sync Note Insights Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       owned this note    owned this note      
    Published Linked with GitHub
    Subscribed
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    Subscribe
    --- robots: index, follow tags: NCTU, CS, 共筆 description: 交大資工課程學習筆記 lang: zh-tw dir: ltr breaks: true disqus: calee GA: UA-100433652-1 --- 資料探勘 -- 彭文志 ====== ## Syllabus - 彭文志 - email: wcpeng@cs.nctu.edu.tw - 書: [Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison-Wesley](http://www-users.cs.umn.edu/~kumar/dmbook/index.php) - 4-5 hw - One data analysis project - team: 3-4 people - Datasets - PM 2.5 - Tai-Power - others - [WSDM 2018](http://www.wsdm-conference.org/2018/) - [Yelp Challenge](https://www.yelp.com/dataset/challenge) - NCTU DM workshop ## Summary - Data base - [Data wherehouse](https://zh.wikipedia.org/wiki/%E8%B3%87%E6%96%99%E5%80%89%E5%84%B2) - AI - ML - ... - Material - Introduction - association rules - classification - clustering - time series data analysis - sq - Knowledge Discovery - Nontrivial process - Database - Relational - Transactional - Spatial - Time series - Multimedia - Graph - Task - Prediction - Description - Knowledge - Association - Classification - Clustering - Time series - Regression - Performance - Efficiency - Effectiveness ## Course ### Data ![](https://i.imgur.com/C26uX38.png =260x) - Attribute - 分類 - Nominal - Ordinal 順序 ex. rank, grade - Interval 區間 - Ratio 比例 - Discrete 分散 - Continuous 連續 - Data Set - type - Record - Graph - Order - Structure Data - 可以把 data 訂出 table 與 schema - Dimensionality (多維) - ex. Curse of Dimensionality - Sparsity (稀疏) - Resolution (解析度) - Unstructure Data - 像是文件,只是 data object,不太能描述出 schema - 儲存方式(?) - Record Data - Data Matrix: 用 matrix 存 - Document Data: 存詞 - Transactioin Data - Graph Data: 用圖存 (ex. [PageRank algo](https://zh.wikipedia.org/wiki/%E4%BD%A9%E5%A5%87%E6%8E%92%E5%90%8D)) - Ordered Data - Sequences of transaction - 通聯記錄 - ![](https://i.imgur.com/KRswP6T.png =260x) - Genomic sequences - ![](https://i.imgur.com/PCBF2qI.png =260x) - [Spatio-Temporal Data](https://www.ibm.com/developerworks/cn/analytics/library/ba-cn-spatio-temporal-prediction-application/index.html) - Data Quality - 如何判斷資料品質 - 遇到不好的 data quality,該如何處理 - noise / outliers - 可能是要替掉,但也可能是應該分析的異常 - duplication - missing - 消掉這個 data object - 補上資料 - 去除段落 - 把區段放大 - Data Prepossing - Aggregation (聚合) - Data reduction 把不要的 attribute or object 去除掉 - Change of scale: ex. 把城市放大成國家 - More “stable” data - Sampling (取樣) - 可以作初步分析 - 策略 - random - without replacement - 被抓出來就不要放回去 - with replacement - stratified sampling - 把 data 先分組,在從每組裡抽樣 - Sample size - Curse of Dimensionality - [參考](https://hackmd.io/OwBghgxgnFCMAcBaWAWARhRKAm3iKmDCWxQGYJgUAzNANmoFYwg=#feature-selection) - Dimensionality Reduction - 避免 curse of dimension - ex. PCA, SVD, [auto encoding](https://en.wikipedia.org/wiki/Autoencoder) - Feature Subset Selection - Brute-force: 都試試看就好啦 XD - embeded: 用 ML train 一下 - filter - Wrapper - Feature Creating - 用現有的 feature,創造出新的 feature - 會比較乾淨、有用 - Similarity and Dissimilarity - Proximity 看像的部分 - distance funcetion 看不像的部分 - 聽說 similarity + dissimilarity 一起看可能會比較慘ㄟ - Distance - [參考](https://hackmd.io/OwBghgxgnFCMAcBaWAWARhRKAm3iKmDCWxQGYJgUAzNANmoFYwg=#ch5-similarity-based-learning) - 距離需要符合條件 1. $d(p, q) ≥ 0$ for all p, q: Positive definiteness 2. $d(p, q) = d(q, p)$ only if $q = p$ 3. $d(p, q) = d(q, p)$ for all p, q: Symmetry 4. $d(p, r) ≤ d(p, q) + d(q, r)$: Triangle Inequality ### Association Association Rule Mining - 名詞 - Item - Itemset - k-itenset - Support - 出現 - Support count ($\sigma$) - 計算出現數量 - Frequent Itemset - itemset 的 support >= min support - Association Rule - X -> Y - Rule Evaluation Metrics - Support - Itemset 可以出現 - $s = \dfrac{\sigma()}{|全部數|}$ - Confidence - 指出現在 rule 的機率 - ex. {a, b} -> {c} => support(a,b,c) / support(a,b) - 條件機率 - 若發生 X 狀態之下,Y 出現的機率 - Strong rule - $>=$ min support & $>=$ min confidence - Support 代表可性度(出現次數夠多),confidence 比較像規則 - 方法 - Brute-force - 列舉所有 rules (排組 + DP) - 計算數量 - 刪掉不符合的 - Candidates Generation - 先找 frequent itemset - 先找出所有的 itemset - 透過 lattice 找出所有 itemset - Apriori Algo - 如果 lattice 上面的出現次數已經不夠多了,下面就不用走下去了 - 先將 1 item 列出來 - 刪掉 count 不夠的 - join 成 2 item - subset testing (如果這個 node 的媽媽中,有人被刪掉了,那這個人就可以被刪掉了) - 重複 - 挑戰 - Multiple scans of transaction database - Huge number of candidates - Tedious workload of support counting for candidates - 再檢查 min confidence - 用 hash 處理 - Redundant Rules - 在 {a,d}=>{c,e,f,g} 是已知 rule 條件下,哪些是可以確定直接是 rule 的? - {a,d}=>{c,e,f} - {a}=>{c,e,f,g} - {a,d,c}=>{e,f,g} - {a}=>{d,c,e,f,g} - Improvement of Apriori Algo - 想法 - 減少 database scans - 減少候選人 - 促進候選人的 support count - 方法 - DHP - 用 hash 先篩一輪,把不可能是 frequence itemset 的刪掉 - 這樣在 scan 的時候,也相對需要 scan 的數量會比較少 - Partition - 將 dataset 切塊 - 每個區塊都可以放入 memory - 針對每一個 part 執行 Apriori,存進 Li * - C = Li union - 重新在 scan 一次避免只是 partition 的頻率很高) - 用到 2 次 disk scan - Sampling - 針對 sample database 作 Apriori - Sampling 完之後的存入 Potentially - Border - 把 sample 裡覺得不是 frequence 的,拿出來再 check 一次(拿的 item 是有技巧的,不是所有不再 freq 理的 item 都要拿) - 遞迴掃下去 - 用的是 smaller support (放低標準) * - 最好的 disk scan 是 1,最差是 2 > 是否所有的 freq 都可以找到? - FP-Growth - 想法 - 不用 Candidation generation 的方法 - 減少 Disk Scan (DB scan) - 數量越多的,越可能出現在 freq item set - 作法 - 先對數量作排序 - 建 FP-tree - ![](https://i.imgur.com/9GvCl7s.png =250x) - FP-SubTree - 最後在檢查 confidence - [參考](https://kknews.cc/zh-tw/tech/j6bxakl.html) - 其他 - Multiple-Level Association Rules - 產品通常有相互關係 - ![](https://i.imgur.com/jEGR9Cx.png =250x) - Frequent closed itemset - ![](https://i.imgur.com/LHUphCO.png =250x) - Quantitative Association Rules - 不同維度(多個)的 association - 若連續,可以作 bining - ![](https://i.imgur.com/0OsokQy.png =250x) - Static(等寬) 切法 - ![](https://i.imgur.com/JABkjlY.png =250x) - 分布成空間中的點,在等距離切割 - data cube - ![](https://i.imgur.com/xFTmQOL.png =250x) - Dynicmic - 可能在 result 轉折處 - 可能在分布方式 (下圖第三種方法) - ![](https://i.imgur.com/eDLYPvf.png =260x) - clustering - ARCS (Association Rule Clustering System) - 先作 binner,用 binner 的切割做出 Association Rule,這時候 rule 已經出來了,在做 clustering - 作法 1. binning 2. frequent predicateset 3. clustering 4. optimize - ![](https://i.imgur.com/yKwvUdF.png =260x) - Mining Quantitative Association Rules in Large Relational Tables - 在生 rule 的同時,會回去修改 binning 的範圍 - ex. 如果這個 bin 太小了(< min support),可能可以把兩個 bin merge 起來 - Correlation Analysis - Corr(A,B) = P(AUB)/(P(A)P(B)) - 有時候只算完 association rule 還不夠,需要在做一些分析 - Association rules weight - 每個 association rules 的權重不一樣 - MOTIF - Time Series Motif Discovery - 尋找時序間重複的 pattern - 可能要切 windows - 可能 y 軸要先 match 一個 function - [Tool](https://github.com/GrammarViz2/grammarviz2_src) - discord - 跟別人最不像的 ### Clustering 分群 - 三重點 - dist - 如何定義 distance function - algo - 用哪一種演算法 - analy - 分析分群結果好壞 - 影響分群效果 - 分群數量 - 希望可以抵抗 outliner - 想法 - center-based - density-based - 算法 - partition - 分成 k 群 - k-means - 作法 - 先隨便找 k 個 中心 - 將所有點區分出與他們最近的中心 (分群) - 計算每個群的真的平均 - 將真的平均作為新的中心 - 一直做到沒有新的變動 - 複雜度: - t: 迭代次數 - n: 點數 - k: 分群數 - 多數時候 k, t << n - O(tkn) - 討論 - 受 noice 影響 - 受到 data size / densities 影響 - 受一開始初始 k 個點影響 => 多作幾次 - 找不出不規則形狀的分群結果 - 如果已經有分群的 label 了(監督式學習),就不要用 k-means - 或者多切幾份在做 merge - Bisection K-means - 每次只切一刀 - 作法 - 每次找 cost 最高的群切割 - k-medoids algo. (PAM) - algo. - 任選 n 個點 (從原來的點中) - 把其他點劃分到離他們比較進的點群 - 計算 total cost - 每次選不同點當機制準,最後看是哪幾個點當基準產生最小的 total cost - 對抗 noise 比較強 - 對於數量級很大的資料,效率低 - => sampling 加速 (CLARA) - hierarchy - 分群成樹狀結構 - Agglomerative - bottom up - Algo. (disjoin set) - 把每個物件先都把自己當一群 - 每次 把認兩個最相近的*群*合併 - 如何算*群*的距離? - average link: 把 n\*m 個組合配起來算,加總 - centroid: 找中點 - single link: 找兩個群中,距離最遠的 - complete link: 距離最近的 - 實作 - proximity matrix - birge - (R-Tree) - Cure - Divisive - top down - density-base - DB Scan - 畫圓 - Optics - 自己找參數,讚 - Denclue - Clique - graph-base - SNN - model-base - 評估效果 - Outlier Analysis - 跟別人不一樣 ### Web Mining - 觀察 - Web 很巨大 - 複雜 - 動態 - 99% 無用 - 種類 - Web Content - ex. Search Engine - Text Mining - NLP - Classification / Clustering - Similarity search - term association - keyword - Web Traversal Robots - Crawler - 個人化 - 分群 - Multilevel WebDB - Layer 0: Web - Layer 1: time / URL / keywords / ... - Layer 2: summary / classification / clustering - Structure - Hyper Link - 建圖 - 角色 - hub - authorityies - a(p) = sum h(q) - Usage - Log ### Page Rank - Page Rank Matrix - $PR(A) = (\dfrac{PR(T_1)}{C(T_1)} + ... + \dfrac{PR(T_n)}{C(T_n)})$ - Dead ends - 節點只有入度,沒有出度 - 他們的 PR 會產生 $\dfrac1 0$ 的狀況 - 直接把這些節點遞迴的刪掉 - Topic-Sensitive PageRank - 加入 interesting Vector - $v_i = \beta M v_{i-1} + \dfrac{(1-\beta)e_s}{|S|}$ - $S = \{B, D\}$: 代表 B, D 都是使用者喜歡的 - $\beta$: 參數比 - $M$: 原來的 PageRank Matrix - $e_s$: 如果 S 裡有的就是 1 ,沒有的就是 0 所產生的 Vector - ex. ![](https://i.imgur.com/RkK50Is.png =260x) - Link Spam - 垃圾連結,但是技術性騙過 PageRank - 網頁分成三種 - Inaccessible pages - spammer 不會影響的 site - Accessible pages - 不是 spammer 的 site,但會被 spammer 影響 - Own pages - spammer 的 site - ex. ![](https://i.imgur.com/iHtdgGv.png =350x) - 利用 TrustRank 與 Spam mass 解決這個問題 - Trusk Rank - 利用 Topic Sensitive PR - 缺點:Spam 可以輕易的建出 link 到 good topic 上 - Spam Mass - $Spam Mass = \dfrac{r-t}{r}$ - r: Page Rank - t: Trust Rank - Spam Mass 越小 or 負的,代表越不可能是 Spam - Spam Mass 越接近 1,代表越可能是 Spam ### Sequential Patterns - sequential data - ... - in / out - input: sequential data, minimum support - output: all subsequence - 提取 candidate - 笨方法 - Candidate 1-subsequences - Candidate 2-subsequences - Candidate 3-subsequences - ... - Generalized Sequential Pattern (GSP) - raw data: - ![](https://i.imgur.com/eSsI6WS.png =250x) - Sort phase - 仿照 tuple 的 sort,從前面往後排 - ![](https://i.imgur.com/7ijXAZ6.png =250x) - Litemset phase - 找出 large item (出現的次數超過 minSup) - ![](https://i.imgur.com/ZQTYU3F.png =250x) - 作 map to 是為了之後查找方便 - Transformation phase - 把 origin 作一些轉換 - 同時如果不是 frequent large item,就把他刪掉 - ![](https://i.imgur.com/1RXIGyI.png =250x) - Sequence phase - Apriori-like algorithm - candidate generation - ![](https://i.imgur.com/DX9dF4x.png =250x) - ![](https://i.imgur.com/BfU8SPU.png =260x) - Maximal phase - Sliding Window ## HW ### HW0 - 正規化 - 距離 - offset - 振幅放大 - linear trend (找到趨勢線相減) - noice (smoothing function) - 分群 - [DTW](https://en.wikipedia.org/wiki/Dynamic_time_warping): 左右被移動了 - fastDTW - 降低維度 ### HW1 - 定義自己的 transaction - 離散化 binning - 用別人的 package 記得標出處 ### Final - 3-4 人 - 自己找 dataset - Proposal (8 page slide) - Title - Background, purpose, challenge - Related work - Problem define - Expected result - Reference

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully