mikechantw
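# Day 25, 28-31

###### tags: `cupoy`, `ML100`

:::info
+ 2022-03-17 19:30
:::

[TOC]

## D25: Categorical features - mean encoding

### Key points

+ When a categorical feature is clearly related to the target => which encoding should be used?
+ When mean encoding is used => what problems arise?
+ How to correct them => the problems that mean encoding introduces

+ Extra clue: if a categorical feature appears to be significantly correlated with the target value, how should it be encoded?
![](https://i.imgur.com/1rRwjuD.png)
+ Mean encoding: replace the original categorical feature with the mean of the target value
*Some models use mean encoding as the default encoding for categorical features
![](https://i.imgur.com/CyuNFhF.png)
+ Smoothing
If a category has very few samples and they happen to be extreme values, the resulting mean can be far off
![](https://i.imgur.com/NsfObXw.png)
+ Summary
  + When the category mean is unreliable, we lean toward the overall mean
  + When the category mean is reliable, we lean toward the category mean
  + The number of records decides the compromise between the two
+ Smoothing formula and notes
Smoothed mean encoding
![](https://i.imgur.com/ZHGsaIj.png)
*The adjustment factor controls the degree of smoothing and is tuned according to the total number of samples
+ Note: mean encoding overfits easily
Mean encoding is intuitive and powerful, but in practice it overfits easily, even with smoothing. Confirm that it is suitable before using it by comparing cross-validation scores with and without it.
+ Mean encoding: data processing / feature engineering for high-cardinality qualitative (categorical) features
  + In practice, mean encoding can help when a feature is clearly meaningful but has a very large number of categories (the "high cardinality" mentioned here). The biggest drawback is that it overfits extremely easily, so different smoothing methods are needed.
  + The adjustment-factor method used in the course is only one option. This material also introduces another, more complex smoothing method for reference.

## Day28 - Combining numeric features

### Key points

+ When data and model capacity are limited, bring in reasonable domain knowledge to combine numeric features
+ Combinations can express different meanings:
  + Start position + end position: direction, distance
  + Several vertex positions: perimeter, area
  + Start time + end time: elapsed time
  + Length x width: ≈ area (for example, petal length and petal width in the iris dataset)
+ Statistics across several features:
  + Useful when the features all belong to one broad category (for example, all are lengths)
  + Mean, standard deviation, ...
  + Dimensionality reduction also counts as one of these
+ Why can combining features improve results?
  + ==Because the amount of data and the number of model parameters are limited==, and features are not completely independent of each other
  + Meaningful combinations help the model learn the **important** parts faster
  + For example, when examining how much variance the features explain, the first few important features usually cover a large share of the meaning (recall the earlier section on feature importance)
  + ![](https://i.imgur.com/2nXMMv1.png)

### Supplement: case study, sound event classification

+ The horizontal axis is time and the vertical axis is frequency. The figure is the spectrogram of a sound, stored as a two-dimensional matrix
+ ![](https://i.imgur.com/TMWhqzi.png)
+ Bring in reasonable prior knowledge
  + "Different sound classes have different components at different frequencies"
    + Average each frequency over time
    + To consider: use the linear (raw) values or the logarithmic (decibel) values?
    + This captures the frequency characteristics of the sound to some degree
  + "Some abrupt sound classes change a lot in frequency"
    + For the operation above, use the standard deviation instead of the mean to capture the rate of change

### Closing remarks

+ Introducing reasonable assumptions can greatly reduce the cost of the model learning the corresponding content
+ Any processing step, however, loses some information
+ For example:
+ ![](https://i.imgur.com/EfqlJiY.png)

## Day29 - Feature combination: categorical with numeric

### Key points

+ Similar to mean encoding, the category mean can replace the insurance type as its encoding. Because this is closer to describing a property of the category, other statistics can be used as well.

1. Which operations can group-by encoding use?
For example median, mode, max, min, count, and so on
2. What is the main difference between group-by encoding and the earlier mean encoding?
2_1. Mean encoding describes the relationship between a feature and the target value; group-by encoding describes the relationship between features.
2_2. Group-by encoding does not use the target value, so it does not overfit easily and is used far more often than mean encoding.
Source: https://github.com/jshuang0520/2nd-ML100Days
![](https://i.imgur.com/PSMIEK6.png)
3. Common questions about group-by encoding
Q1: When is group-by encoding needed?
Ans: As with combining numeric features, first pick strong features using domain knowledge or feature importance, then combine them into stronger features. If both features are numeric, use feature combination; if one of them is categorical, use group-by encoding.
Q2: For group-by encoding, how should the statistic (mean / max / count / ...) be chosen?
Ans: Choose by domain knowledge, or try everything and then select by feature importance.
Q3: Try everything? Won't that produce useless features?
Ans: In machine learning it is better to have too many features than too few. In the past, when non-tree models dominated, similar features were added sparingly to avoid collinearity. Today the strongest models are tree-based, so build every feature that might help.

## Day30

### Key points

+ Features need to be added and removed appropriately to improve accuracy and reduce computation time
+ Adding features: feature combination (Day 28), group-by encoding (Day 29)
+ Removing features: feature selection (Day 30)
> Feature selection is an important problem in feature engineering. Its goal is to find the optimal feature subset. Feature selection removes irrelevant or redundant features, which reduces the number of features, improves model accuracy, and shortens run time. In addition, selecting the truly relevant features simplifies the model and helps explain how the data was generated. A common saying is ``data and features determine the upper limit of machine learning, while models and algorithms only approach that limit``

![](https://i.imgur.com/DJmXvjJ.png)

### Three categories of feature selection

+ Filter: choose a statistic and a threshold, then remove features that fall below the threshold
+ Wrapper: add or remove features step by step according to an objective function
+ Embedded: fit a machine learning model, then remove features whose fitted coefficients fall below a threshold

### Methods provided in the course

+ Filter: correlation coefficient filter
+ Embedded: L1 (Lasso) embedded method, GBDT (gradient boosted decision trees) embedded method
+ Correlation coefficient filter
> ![](https://i.imgur.com/gmj13Sm.png)
> By default, redder means more positively correlated and bluer means more negatively correlated, so the features to remove are the lighter-colored ones in the red box: set a correlation threshold and remove features whose absolute correlation falls below it
+ Lasso (L1) embedded method
> In Lasso regression, adjusting the regularization strength naturally drives some feature coefficients to 0, so the features removed are those with a coefficient of 0. No separate threshold is needed, but the regularization strength must be tuned
> ![](https://i.imgur.com/ayHuoVB.png)
+ GBDT (gradient boosted decision trees) embedded method
> ![](https://i.imgur.com/Sye1PYC.png)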
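
The correlation coefficient filter from Day30 can be sketched with the standard library alone. This is a minimal illustration, not the course's code: the function names `pearson` and `correlation_filter`, the toy columns, and the threshold of 0.1 are all assumptions made up for the example. It keeps a feature only when the absolute value of its Pearson correlation with the target reaches the threshold.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (norm_x * norm_y)


def correlation_filter(features, target, threshold=0.1):
    """Return the names of features whose |correlation| with the target >= threshold."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]


# Hypothetical toy data: "area" rises with the target, "distance" falls with it,
# and "noise" is almost unrelated.
features = {
    "area": [50, 60, 80, 100, 120],
    "noise": [2, 1, 3, 1, 2],
    "distance": [9, 8, 6, 4, 2],
}
target = [10, 12, 16, 20, 24]

kept = correlation_filter(features, target, threshold=0.1)
print(kept)
```

Because the filter uses the absolute value, a strongly negative feature such as `distance` survives along with the strongly positive one.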
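
Returning to the smoothing idea in D25, the compromise between the category mean and the overall mean can be sketched as follows. The exact formula in the course is only available as an image, so the weighting used here, `(n * category_mean + m * global_mean) / (n + m)`, is an assumption: it is one common form of smoothing, with `m` playing the role of the adjustment factor. The function name and the toy data are also hypothetical.

```python
from collections import defaultdict


def smoothed_mean_encoding(categories, targets, m=10.0):
    """Map each category to a blend of its own target mean and the global mean.

    With n records in a category, the encoded value is
    (sum of targets in the category + m * global_mean) / (n + m),
    so small categories are pulled toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    return {
        category: (sums[category] + m * global_mean) / (counts[category] + m)
        for category in counts
    }


# Hypothetical data: category "B" has a single, extreme record.
cats = ["A", "A", "A", "A", "B"]
ys = [1.0, 0.0, 1.0, 0.0, 1.0]

raw = smoothed_mean_encoding(cats, ys, m=0.0)       # plain mean encoding
smooth = smoothed_mean_encoding(cats, ys, m=2.0)    # smoothed version
print(raw, smooth)
```

With `m=0` the result is plain mean encoding; with a larger `m`, the single-record category `B` moves from its raw mean toward the overall mean, which matches the summary in D25. Smoothing alone does not remove the overfitting risk, so comparing cross-validation scores is still needed.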
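
The group-by encoding from Day29 can be sketched in the same way. The helper name `group_by_encoding`, the column names `plan` and `premium`, and the sample rows are hypothetical; in practice a data-frame library would usually do this step. For each row, the sketch attaches statistics of a numeric column computed over all rows sharing the same category, without touching the target value.

```python
from collections import defaultdict
from statistics import mean, median


def group_by_encoding(rows, key, value):
    """Add per-category statistics of `value` to every row, grouped by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])

    # One dictionary of statistics per category.
    stats = {
        category: {
            "mean": mean(values),
            "median": median(values),
            "max": max(values),
            "min": min(values),
            "count": len(values),
        }
        for category, values in groups.items()
    }

    # Copy each row and append the statistics of its own category.
    encoded = []
    for row in rows:
        new_row = dict(row)
        for name, stat in stats[row[key]].items():
            new_row[f"{value}_{name}_by_{key}"] = stat
        encoded.append(new_row)
    return encoded


# Hypothetical insurance records: a categorical plan and a numeric premium.
rows = [
    {"plan": "car", "premium": 100.0},
    {"plan": "car", "premium": 300.0},
    {"plan": "home", "premium": 50.0},
]
encoded = group_by_encoding(rows, key="plan", value="premium")
print(encoded[0])
```

Which of the generated columns to keep can then be decided by domain knowledge or by feature importance, as described in Q2.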
