111 Emerging Technology: AI
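---
title: template - Unit Lesson Plan Mini Template
tags: Emerging Technology, Data Collection
---

# Unit 2: Data Collection and Data Preprocessing

> [Anderson's Iris data set](https://zh.wikipedia.org/zh-tw/%E5%AE%89%E5%BE%B7%E6%A3%AE%E9%B8%A2%E5%B0%BE%E8%8A%B1%E5%8D%89%E6%95%B0%E6%8D%AE%E9%9B%86) contains 150 samples, all belonging to three species of the genus Iris: Iris setosa, Iris versicolor, and Iris virginica. (Adapted from [Wikipedia](https://zh.wikipedia.org/zh-tw/%E5%AE%89%E5%BE%B7%E6%A3%AE%E9%B8%A2%E5%B0%BE%E8%8A%B1%E5%8D%89%E6%95%B0%E6%8D%AE%E9%9B%86))

> With four features of these flowers (the length and width of the sepals and of the petals), we can work out criteria for classifying irises ([example](https://upload.wikimedia.org/wikipedia/commons/thumb/5/56/Iris_dataset_scatterplot.svg/440px-Iris_dataset_scatterplot.svg.png)).

![Kosaciec szczecinkowaty Iris setosa.jpg](https://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg =240x300)![commons.wikimedia.org/wiki/File:Iris_versicolor_FWS.jpg](https://upload.wikimedia.org/wikipedia/commons/c/cd/Iris_versicolor_FWS.jpg =300x240) ![https://commons.wikimedia.org/wiki/File:Iris_virginica.jpg](https://upload.wikimedia.org/wikipedia/commons/9/9f/Iris_virginica.jpg =300x240)

---

## Methods of Data Collection

> Xiao Tian wants to know the class average for the midterm exam, so he asks every classmate for their exam paper and records all the scores to compute the result.

What is described above is a data collection process. If we scale it up to the whole school, district, county or city, or country, the difficulty rises accordingly. If the goal is only to learn the concepts of artificial intelligence, going to such lengths to collect data is too much work. Is there an easier way?

Common ways to collect data for learning artificial intelligence include:

* Collect the data you want yourself
* Write a web crawler, or use an API to obtain the data
* Use data provided by open data websites or by software packages
* Pay for data that someone else has already collected

> In Chinese, "data" is rendered as 資料 and also as 數據 (roughly "numerical data"), because data is very often expressed as numbers.

----

### Supplement: Getting Data from Websites

With the rise of AI, some websites have started to offer "datasets" that spare beginners the effort of collecting data. For example:

* [Kaggle](https://www.kaggle.com/) - provides the large amounts of training data needed for learning AI.
* [OpenML](https://www.openml.org/) - an open database offering datasets collected from around the world for machine learning use.
* [Scikit-learn](https://scikit-learn.org/stable/) - a machine learning package for Python that also bundles several datasets.

---

## Data Preprocessing

> Xiao Tian collected the whole class's Chinese quiz scores, but some classmates recorded theirs inaccurately or have scores missing, which makes computing the class average troublesome... If it were you, how would you handle the missing data?

| Seat No. | Quiz 1 | Quiz 2 | Quiz 3 | Quiz 4 | Quiz 5 | Average |
|---| -------- | -------- | -------- | -------- | -------- | -------- |
|1|100|65|55|63|82|**73.0**|
|2|53|-|84|72|99|**?**|
|3|97|-|76|-|77|**?**|
|...|...|...|...|...|...|...|
|35|57|80|-|80|55|**?**|

When we collect data, small mishaps occasionally leave gaps in what we gather. *Data preprocessing*, as used here, means applying some method to make up for the missing data.

In general, missing data is handled with one of the following methods:

1. **[Fill]**: substitute the mean, median, a quantile, the mode, a random value, and so on
    * Results are mediocre, because this amounts to adding noise by hand.
2. **[Delete]**: remove the incomplete data outright
    * Only usable when the dataset is large.
3. **[Estimate]**: build a predictive model from the other variables to compute the missing variable
    * For example, use interpolation to compute the missing value.

----

### pandas Basics

pandas is a Python package commonly used to work with tabular (matrix-like) data. In AI work with Python, pandas is often used to handle all kinds of data.

Once the data is loaded, we use the pandas package to process it:

```python
import pandas as pd  # load pandas

grades = [
    ["王**", 95, 67],
    ["林**", 80, 60],
    ["陳**", 85, 77]
]
# Convert to a pandas DataFrame and print it
print(pd.DataFrame(grades))
```

```
     0   1   2
0  王**  95  67
1  林**  80  60
2  陳**  85  77
```

----

### pandas Data Types

pandas has two data types for holding data, **Series** and **DataFrame**:

* Series - stores **one-dimensional** data; each element has its own label.
* DataFrame - stores **two-dimensional** data; each row and each column has its own label.

The iris dataset used in this exercise is two-dimensional, so it must be converted to a DataFrame.

The fragment below loads the data and then randomly knocks out values to simulate missing data.
For the complete program, see [Programming Practice](#Programming-Practice).

```python
from sklearn.datasets import load_iris  # used to load the dataset
import pandas as pd
import numpy as np
import random

# Load the dataset and convert it to a pandas DataFrame
iris = load_iris()
data = pd.DataFrame(iris['data'], columns=iris['feature_names'])
for i in range(data.size):  # randomly create missing values (about 30%)
    if random.random() < 0.3:
        data.iat[i // 4, i % 4] = np.nan
data.info()  # print a summary
```

Result (*your numbers may differ, because values are removed at random*):

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  101 non-null    float64
 1   sepal width (cm)   109 non-null    float64
 2   petal length (cm)  104 non-null    float64
 3   petal width (cm)   116 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB
```

----

### Commonly used features of `pandas.DataFrame`

For details, see the [official documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame); only the items relevant to the exercise are covered here:

```python
# Load the dataset and convert it to a pandas DataFrame
iris = load_iris()
data = pd.DataFrame(iris['data'], columns=iris['feature_names'])
```

* `pandas.DataFrame(data, columns)`: converts two-dimensional data into a DataFrame. It accepts five parameters; here we use only two, `data` and `columns`:
    * `data` is the two-dimensional data, in this exercise the iris measurements
    * `columns` is the list of column names, in this exercise the iris features

----

### Handling missing data [filling values with pandas]

From the demonstration above, we pick out the statements that deal with missing values:

```python
# Handle the missing data
d1 = data.fillna(0)                      # [Fill]
d2 = data.dropna()                       # [Delete]
d3 = data.interpolate(method='linear')   # [Estimate] (linear)
```

Here is how pandas supports each of the methods for handling missing data listed earlier:

1. **[Fill]**: substitute the mean, median, a quantile, the mode, a random value, and so on
    * Results are mediocre, because this amounts to adding noise by hand.
    * Use the `DataFrame.fillna(x)` method to fill each gap with the value `x`
2. **[Delete]**: remove the incomplete data outright
    * Only usable when the dataset is large
    * The `DataFrame.dropna()` method drops every row that has a missing value
3. **[Estimate]**: build a predictive model from the other variables to compute the missing variable
    * If the other variables are unrelated to the missing variable, the prediction is meaningless;
    * if the prediction is quite accurate, that in turn suggests the variable is redundant and need not be included in the model
    * The `DataFrame.interpolate(method=<method>)` method fills the gaps using the interpolation method you specify
        - `linear`: ignore the index and treat the values as equally spaced
        - `index`: interpolate using the numerical differences between index values
        - `time`: interpolate in proportion to the time differences in a time-based index
        - [(more...)](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html)

---

## Exercises

1. [Concept] Xiao Ye is studying machine learning and wants to collect PM2.5 air quality data for the Taipei and New Taipei area. Which data collection method would you recommend to Xiao Ye?
2. [Concept] Mr. Gao, who owns a cram school, hired many part-time workers to interview high school students at random on the street about their cram school experience. After the questionnaires came back, some fields turned out to be incomplete, for example missing values in the "weekly cram school hours" field. If he wants to repair them, which method is the least appropriate?
3. [Concept] When using pandas to store data to be processed, one-dimensional data is best stored in a _____; two-dimensional data is best stored in a _____.

## Programming Practice

We use [Anderson's Iris data set](https://zh.wikipedia.org/zh-tw/安德森鸢尾花卉数据集), which is bundled with Scikit-learn, as the example. The demonstration program below is adapted from the example on the Scikit-learn [official website](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html):

Copy it into your Python editor and run it.

```python
# Load the dataset
from sklearn.datasets import load_iris  # used to load the dataset
import pandas as pd                     # import the pandas package
import numpy as np                      # import the numpy package
import random                           # import random, used to simulate data loss

# Load the dataset and convert it to a pandas DataFrame
iris = load_iris()
data = pd.DataFrame(iris['data'], columns=iris['feature_names'])
for i in range(data.size):  # randomly create missing values (about 30%)
    if random.random() < 0.3:
        data.iat[i // 4, i % 4] = np.nan
data.info()

# Handle the missing data
d1 = data.fillna(0)
d2 = data.dropna()
d3 = data.interpolate(method='linear')
```

To see what each result (`d1`, `d2`, `d3`) looks like, add the following at the very bottom:

```python
print(d1)  # show the result of data.fillna; change to d2 or d3 to see the others
```

### Programming questions:

4. Store a Python 2D list with pandas
```python
import pandas as pd
data = [
    [5,2,7],
    [8,1,4],
    [3,9,6]
]
data_in_pandas = _4._(data)
```
```
data_in_pandas = pd.DataFrame(data)
```

5. Fill in the missing quiz scores: use pandas to replace each missing quiz score with 0
```python
import numpy as np
import pandas as pd
grade = pd.DataFrame([
    [1, 69, 98, np.nan, 81, 97],
    [2, 74, 81, 86, 77, 84],
    [3, 84, 70, 87, 73, np.nan],
    [4, 97, np.nan, 94, 99, 62],
    [5, 53, np.nan, 100, 75, 76]
])
### Start here ###
new_g = _5._
print(new_g)  # show the corrected scores
```

6. Practice importing the scikit-learn package, loading the wine dataset, and converting it to a pandas DataFrame.
```python
import pandas as pd
from sklearn.datasets import load_wine
### Start here ###
wine = load_wine()
wine_in_pd = pd._6.a_(wine['data'],columns=_6.b_)
print(wine_in_pd)  # show all the data
```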
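
### Supplement: filling with the column mean

The [Fill] method described in this unit lists the mean as one possible substitute value, while the demonstration program fills every gap with 0. The following is a minimal sketch of mean filling, not part of the original lesson program. The `scores` table and its column names (`quiz1`, `quiz2`, `quiz3`) are made up for illustration; the only pandas calls used are `DataFrame.mean()` and `DataFrame.fillna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical quiz scores for three students; NaN marks a missing score.
scores = pd.DataFrame(
    {
        "quiz1": [100.0, 53.0, 97.0],
        "quiz2": [65.0, np.nan, 75.0],
        "quiz3": [55.0, 84.0, np.nan],
    }
)

# Mean of each column, computed from the non-missing values only.
column_means = scores.mean()

# fillna accepts a Series keyed by column name, so each gap
# receives the mean of its own column. The original table is unchanged.
filled = scores.fillna(column_means)
print(filled)
```

Filling with the column mean keeps each column's average unchanged, which may suit the class-average scenario better than filling with 0, since zeros would pull the average down. Whether that trade-off is appropriate depends on why the values are missing.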
