LBear
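###### tags: `期末專題`

# Home Credit Default Risk Dataset Analysis

![](https://i.imgur.com/K4XCxhv.jpg)

[toc]

## Project Report Outline

1. Research motivation
2. Research purpose
3. Research methods & workflow
4. Planned division of labor
5. Expected results
6. Expected difficulties & possible solutions
7. Tools
8. References

### Research Motivation

Demand for home purchases keeps growing, whether for investment, asset allocation, or personal residence. Prices, however, are beyond what most buyers can pay outright, so most transactions are financed with loans.
For a credit company, home loans involve large loan amounts and long repayment periods, and both are risks the company has to bear.

### Research Purpose

How can a credit company make effective use of customers' past credit data to predict their future repayment ability, accepting customers who can repay and rejecting those who cannot, so that it avoids unnecessary bad debts and maximizes its profit?

### Research Methods & Workflow

- Data source: Kaggle Home Credit Default Risk
- Environment: Hadoop cluster, Linux
- Data cleaning: R or Python to drop columns and handle missing values, or web scraping to add data
- Modeling: machine learning methods (XGBoost, random forest)
- Visualization: D3.js, Tableau (tentative)

### Planned Division of Labor

- Environment setup: 文軍皓 (Lead) + rest of the team
- Data preparation: 嚴浩祖 (Lead) + rest of the team
- Modeling: 林冠穎 (Lead) + rest of the team
- Visualization: 張詠茹 (Lead) + rest of the team

### Expected Results

Use each customer's data to predict the risk that the customer will default in the future.

### Expected Difficulties & Possible Solutions

1. The data is open data, so we are not very familiar with it (spend more time understanding the data).
2. The data is limited, which constrains the analysis (look for ways to add columns, web scraping, etc.).
3. Column descriptions are insufficient (consult external data from related industries).
4. We lack domain knowledge (ask professionals or experienced people).
5. We are not familiar with pyspark and similar tools (improve on our own).
6. The visualization approach is undecided (discussion should settle it).

### Tools

- Languages: R, Python, MySQL
- Environment: Linux, Hadoop, Hive, Zeppelin
- Visualization: D3.js, Tableau

## File Descriptions

### application_test.csv (test set) / application_train.csv (training set)

The main training and test data, with information about each loan application at Home Credit. Each loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with TARGET, where 0 means the loan was repaid and 1 means the loan was not repaid.

:::info
- Days_employed has extreme values
- The columns in the train table indicating whether each kind of phone number was provided could be summed
- region rating: income could be used to judge whether a rating is high or low (compare the residence rating against other data)
- Is Days_Registration related to the loan date?
- Can property ownership explain why the housing features are null?
- Housing features may show whether the property is in a densely or sparsely populated area
- Years_B* (property-related, normalized)
- The many normalized columns could be analyzed to infer whether the customer is wealthy
- Highly correlated: the EXT trio, DAYS_BIRTH
:::

### bureau.csv (supplementary file)

Data about the customer's previous credits from other financial institutions. Each previous credit has its own row, but one loan in the application data can have multiple previous credits.

:::info
- Credit_Active is closed but Credit_day_overdue is not 0
- Credit_Active is Active but Days_EndDate_FACT still has a value
:::

:::danger
Use the ID to check the situation in Bureau_balance.
For the second issue, the matching IDs in Bureau_balance are all null; we suggest adding a new column to record this.
:::

### bureau_balance.csv (supplementary file)

Monthly data about the previous credits in the bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit.

:::info
- X: status unknown
- 1 ~ 5: severity (5 means the debt has already been sold or the property mortgaged)
- C: paid off
- 0: before the due date, paid on time
- The status counts can be aggregated by ID and added to bureau
:::

:::danger
The criterion that distinguishes C from 0 is unknown, and some records have further data after a C.
:::

### credit_card_balance.csv (supplementary file)

Monthly data about the previous credit cards customers have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have multiple rows.

:::info
- To be added
:::

### installments_payments.csv (supplementary file)

Payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment.

:::info
- To be added
:::

### POS_CASH_balance.csv (supplementary file)

Monthly data about the previous point-of-sale or cash loans customers have had with Home Credit. Each row is one month of a previous point-of-sale or cash loan, and a single previous loan can have multiple rows.

:::info
- To be added
:::

### previous_application.csv (supplementary file)

Previous Home Credit loan applications of customers who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.

:::info
- To be added
:::

![](https://i.imgur.com/Wz7I756.png)

## Progress

### Data Exploration and Analysis

[Github](https://github.com/stansuo/BDSE12-Group3/tree/master/notebooks/homecdt_eda)

### Data Cleaning

1. For categorical data, convert nulls and "Na" to the same blank value, and add a column flagging the value as blank.
2. For numeric data, fill with 0 and add a column flagging the value as blank. A non-zero fill value is fine where it fits better, but record the reason.
3. Anomalous values must also be converted to the same representation as nulls, such as the time value 365243.
4. Find unreasonable data, such as the implausible active loan statuses in bureau whose statuses in the bureau balance table turn out to be all X. These can only be found by validating the plausibility of the column data. Convert them to the same representation as nulls and flag them in a new column. Please record how every column is cleaned.
5. Everyone else, please offer ==warm assistance== when we don't know how to code something. Thanks~

### Feature Engineering

Add new feature columns based on the columns each member considers important. Apply one-hot encoding to categorical columns, and create new columns of basic statistics (max, min, var, std, etc.) for other suitable columns.

[github](https://github.com/stansuo/BDSE12-Group3/tree/master/notebooks/homecdt_fteng)

### Modeling and Validation

Models are built with XGBoost, LightGBM, and Random Forest. For LightGBM, the boosting type was set to gbdt, goss, and dart in turn, and goss was noticeably faster than the others. Without a GPU, XGBoost is quite slow, roughly 20 times slower than LightGBM. With a GPU, cross-validation cannot continue because GPU memory is not released, and so far the only workaround is to skip cross-validation. Random Forest's predictions are clearly worse than those of the other two models. Hyperparameters are currently tuned with Bayesian optimization; details and the underlying principles can be found in the links below.

:::warning
Difficulties encountered:
- XGBoost can run on the GPU, but cross-validation fails because GPU memory is not released
- LightGBM cannot run on the GPU yet
- There are signs of overfitting
:::

[Github](https://github.com/stansuo/BDSE12-Group3/tree/master/notebooks/homecdt_model)
[Model building](https://hackmd.io/@LBear/Sk83b33kL)

### Visualization

## References

- [Kaggle home page](https://www.kaggle.com/)
- [Kaggle: Home Credit Default Risk competition page](https://www.kaggle.com/c/home-credit-default-risk)
- [Kaggle: Home Credit Default Risk data exploration and visualization (part 1)](https://www.twblogs.net/a/5b8127e82b71772165ab4d2e)
- [Kaggle: Home Credit Default Risk data exploration and visualization (part 2)](https://www.cnblogs.com/mtcnn/p/9411598.html)
- [Detailed walkthrough: complete EDA and feature importance](https://www.kaggle.com/codename007/home-credit-complete-eda-feature-importance)
- [1st Kaggle](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction#Data)
- [Kaggle discussion: More domain knowledge from former Home Credit analyst](https://www.kaggle.com/c/home-credit-default-risk/discussion/63032)
- [Model-related material](https://www.itread01.com/content/1545255739.html)
- [kaggle LighGBM](https://www.kaggle.com/ogrellier/lighgbm-with-selected-features)
- [LightGBM bayes opt](https://www.kaggle.com/tilii7/olivier-lightgbm-parameters-by-bayesian-opt/code)
- [Catboost](https://kknews.cc/zh-tw/code/ejrk4pr.html)
- [How to Handle Imbalanced Classes in Machine Learning](https://elitedatascience.com/imbalanced-classes)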
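
The first three data-cleaning rules in this note (unify categorical blanks, fill numeric blanks with 0, treat 365243 as missing, and always add a flag column) can be sketched roughly as below. This is only an illustration in plain Python, not the team's actual notebook code: the helper names `clean_numeric`, `clean_categorical`, and `clean_rows`, the `_MISSING` suffix, and the column `SOME_CATEGORY` are all hypothetical, and rows are assumed to be simple dictionaries.

```python
import math
from typing import Dict, List, Optional, Tuple

# Placeholder value called out in the cleaning notes as an anomalous "time".
SENTINEL_DAYS = 365243


def clean_numeric(value: Optional[float], fill: float = 0.0) -> Tuple[float, int]:
    """Return (cleaned value, missing flag) for one numeric cell."""
    is_nan = isinstance(value, float) and math.isnan(value)
    if value is None or is_nan or value == SENTINEL_DAYS:
        return fill, 1
    return value, 0


def clean_categorical(value: Optional[str]) -> Tuple[str, int]:
    """Map None, empty strings, and 'NA'-like text to one blank value."""
    if value is None or value.strip().upper() in {"", "NA", "NAN"}:
        return "", 1
    return value, 0


def clean_rows(rows: List[Dict], numeric_cols: List[str],
               categorical_cols: List[str]) -> List[Dict]:
    """Clean every row and add a hypothetical <column>_MISSING flag per column."""
    cleaned = []
    for row in rows:
        out = dict(row)  # copy so the input rows are left untouched
        for col in numeric_cols:
            out[col], out[col + "_MISSING"] = clean_numeric(row.get(col))
        for col in categorical_cols:
            out[col], out[col + "_MISSING"] = clean_categorical(row.get(col))
        cleaned.append(out)
    return cleaned


if __name__ == "__main__":
    sample = [
        {"DAYS_EMPLOYED": 365243, "SOME_CATEGORY": "Na"},
        {"DAYS_EMPLOYED": -1200, "SOME_CATEGORY": "A"},
    ]
    for row in clean_rows(sample, ["DAYS_EMPLOYED"], ["SOME_CATEGORY"]):
        print(row)
```

Keeping the flag in a separate column means a later model can still tell a filled 0 apart from a genuine 0, which is the reason the cleaning rules ask for the extra column.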
