YH Hsu
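### [Index of AI / ML Study Notes](https://hackmd.io/@YungHuiHsu/BySsb5dfp)
#### [Deeplearning.ai GenAI/LLM Course Series Notes](https://learn.deeplearning.ai/)
##### GenAI
- [Large Language Models with Semantic Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [LangChain for LLM Application Development](https://hackmd.io/1r4pzdfFRwOIRrhtF9iFKQ)
- [Finetuning Large Language Models](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)
##### RAG
- [Building and Evaluating Advanced RAG](https://hackmd.io/@YungHuiHsu/rkqGpCDca)
- [Preprocessing Unstructured Data for LLM Applications](https://hackmd.io/@YungHuiHsu/BJDAbgpgR)
    - [Normalizing the Content](https://hackmd.io/a6pABFa5RyCk6uitgJ9qEA?both)
    - [Metadata Extraction and Chunking](https://hackmd.io/@YungHuiHsu/HyJhA80lA)
    - [Preprocessing PDFs and Images](https://hackmd.io/@YungHuiHsu/SkJUlPCeA)
    - [Extracting Tables](https://hackmd.io/@YungHuiHsu/HJEE5rkbC)
    - [Build Your Own RAG Bot]()

---

# [Preprocessing Unstructured Data for LLM Applications](https://www.deeplearning.ai/short-courses/preprocessing-unstructured-data-for-llm-applications/)

## [Introduction](https://learn.deeplearning.ai/courses/preprocessing-unstructured-data-for-llm-applications/lesson/1/introduction)

### Course Highlights

* How to preprocess data for LLM application development and handle different file types
* How to extract and normalize a variety of documents into a common JSON format, and enrich them with metadata to improve search results
* Document image analysis techniques, including layout detection and vision transformers, for extracting and understanding PDFs, images, and tables
* How to build a RAG bot that can read different document types such as PDF, PowerPoint, and Markdown files

---

## Overview of LLM Data Preprocessing

![image](https://hackmd.io/_uploads/BJaxKeae0.png =600x)
> [2023.08。unstructured.io/。How to Build an End-to-End RAG Pipeline with Unstructured’s API](https://unstructured.io/blog/how-to-build-an-end-to-end-rag-pipeline-with-unstructured-s-api)

### Data Preprocessing and LLMs

![image](https://hackmd.io/_uploads/r1lqlZTxR.png =600x)

- **Retrieval Augmented Generation (RAG)**: A technique for grounding LLM responses on validated external information.
- **Contextual Integration**
    - RAG apps load context into a database, then retrieve content to insert into a prompt.

### Preprocessing Outputs

- **Document Content**
  Text content from the documents. Used for keyword or similarity search in RAG apps.
- **Document Elements**
  The basic building blocks of a document. Useful for various RAG tasks, such as filtering and chunking.
    * Title
    * Narrative Text
    * List Item
    * Table
    * Image
- **Element Metadata**
  Additional information about an element.
  :pencil2: Useful for ==filtering in hybrid search== and for identifying the source of a response.
    * Filename
    * Filetype
    * Page Number
    * Section
    * Keywords
    * Summary

### Challenges in Data Preprocessing

- **Content Cues**
    - Different document types (for example, images versus Markdown documents) expose different information and structural markers, and these must be identified per type
- **Standardization Need**
    - Different document types (Word documents, PDFs, and so on) must be converted to a common format before processing so that they can be handled and analyzed uniformly
- **Extraction Variability**
    - Extraction methods depend on a document's format and structure; extracting data from a table and extracting content from an academic article may call for different techniques
- **Metadata Insight**
    - Extracting metadata (author, publication date, and so on) often requires analyzing the specific structure of a document to find where that metadata lives

## Normalizing the Content

### Normalizing Diverse Documents

- **Format Diversity**
    - Documents come in many electronic formats, each with its own structure and presentation (PDF, Word, EPUB, Markdown, and so on)
- **Common Format**
    - Before documents in different formats are processed, they are converted to one standardized format so that they can be handled and analyzed uniformly
    - The common format identifies and extracts the elements that documents share, such as titles and body text, so that documents from different sources are treated identically in later stages
- **Normalization Benefit**
  A normalized format lets any document be processed the same way, regardless of its source format
    * Filter out unwanted elements, such as headers and footers
    * Chunk document elements into sections
- **Reduced Processing Cost**
  The initial processing of a document is the most expensive part of the whole pipeline
    - Downstream tasks such as chunking are cheap operations on the normalized output
    - This makes it possible to experiment with different chunking techniques without reprocessing the documents

### Data Serialization

#### **Data Serialization**

:::info
In computer science, data serialization is the process of converting a data structure or object into a sequence of bytes so that it can be stored in or transmitted through a file, memory, or a network. Serialized data can later be deserialized, that is, the byte sequence is restored to the original data structure or object.

#### Main functions and benefits:
1. **Persistence**: Serialization lets data be saved to disk or another storage medium, so it can be read again after the program has exited.
2. **Communication**: Serialization lets data be exchanged and transmitted between nodes in different computer systems or networks, which supports distributed computing applications.
3. **Data Exchange**: Serialization lets different applications or platforms exchange data, even when they are written in different programming languages.

#### Implementations:
- **Binary Format**: Data is converted to binary code. This format is usually more compact and efficient, but hard to read.
- **Textual Format**: Formats such as XML and JSON are readable and easy for people to understand and edit, but usually take more space than binary formats.

#### Example applications:
- **Web API communication**: Web applications commonly serialize data as JSON or XML to exchange it over the Internet.
- **Application configuration**: Many applications serialize their configuration options to a configuration file (such as an .ini or .cfg file) that is loaded at startup.
:::

- **Serialization Benefits**:
    - Allows the results of document preprocessing to be reused later
    - Advantages of JSON
        - A common, well-understood structure, and the standard HTTP response format
        - Usable from many programming languages
            - JSON is a lightweight data-interchange format that is widely supported in languages such as Python, Java, and JavaScript
        - Can be converted to JSONL for streaming use cases
            - JSONL (JSON Lines) stores multiple JSON objects one per line, which suits streaming data processing and real-time data loading

![image](https://hackmd.io/_uploads/rJ0XbMaeC.png)

### HTML Page Extraction

![image](https://hackmd.io/_uploads/BJouWM6eR.png =400x)

- **LLM Relevance**
  For a large language model to reflect current language use and knowledge, its training data needs to be updated regularly. Including the latest web content helps the model better understand currently popular topics and changes in language.
- **HTML Understanding**
  HTML is the foundation of web pages. Its structured elements define how information is presented on a page, using element tags such as `<h1>` for titles and `<p>` for text.
- **Data Extraction and Categorization**
  By analyzing HTML elements, useful information can be extracted from a web page and organized into structured data. This includes identifying titles, paragraphs, lists, and so on, and categorizing that content for later data analysis or applications.
  This kind of structured extraction is a key technique in information retrieval, content management, and data analysis.

:pencil: Isn't this exactly what a web crawler does? Could an LLM be used to parse web pages?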
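
To make the idea of tag-based extraction and JSON/JSONL serialization concrete, here is a minimal sketch that uses only the Python standard library. It is not how the `unstructured` library works internally; the `TAG_TO_TYPE` mapping, the `ElementExtractor` class, the helper names, and the sample HTML are all assumptions made up for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to element categories. The category
# names echo the ones used in these notes; the mapping itself is an assumption.
TAG_TO_TYPE = {"h1": "Title", "h2": "Title", "p": "NarrativeText", "li": "ListItem"}


class ElementExtractor(HTMLParser):
    """Collects the text of selected tags as typed element dictionaries."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._current_type = None  # category of the tag currently being read
        self._buffer = []          # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_TYPE:
            self._current_type = TAG_TO_TYPE[tag]
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested inline tags such as <b> is kept as well.
        if self._current_type is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._current_type is not None and TAG_TO_TYPE.get(tag) == self._current_type:
            text = "".join(self._buffer).strip()
            if text:
                self.elements.append({"type": self._current_type, "text": text})
            self._current_type = None


def html_to_elements(html):
    """Parse an HTML string into a list of {"type", "text"} dictionaries."""
    parser = ElementExtractor()
    parser.feed(html)
    parser.close()
    return parser.elements


def to_jsonl(elements):
    """Serialize elements as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(el, ensure_ascii=False) for el in elements)


sample_html = """
<html><body>
<h1>Release notes</h1>
<p>This page lists the <b>latest</b> changes.</p>
<ul><li>Faster parsing</li><li>New export format</li></ul>
</body></html>
"""

elements = html_to_elements(sample_html)
jsonl_text = to_jsonl(elements)
print(jsonl_text)

# Deserializing line by line restores the original list of dictionaries.
restored = [json.loads(line) for line in jsonl_text.splitlines()]
```

Because each line of the JSONL text is an independent JSON object, a consumer can process elements one at a time without loading the whole file, which is the streaming use case mentioned above.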
#### code

- [lesson/3/normalizing-the-content](https://learn.deeplearning.ai/courses/preprocessing-unstructured-data-for-llm-applications/lesson/3/normalizing-the-content)
- Environment setup

```python
from IPython.display import JSON

import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
from unstructured.staging.base import dict_to_elements, elements_to_json

# Helper from the lesson notebook that supplies the API key and URL
from Utils import Utils
utils = Utils()

DLAI_API_KEY = utils.get_dlai_api_key()
DLAI_API_URL = utils.get_dlai_url()

# Client for the hosted partitioning API (used for the PDF example below)
s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)
```

```python
filename = "example_files/medium_blog.html"
elements = partition_html(filename=filename)
```

![image](https://hackmd.io/_uploads/Sym0gbCx0.png)

```python
from unstructured.partition.html import partition_html

filename = "example_files/medium_blog.html"
elements = partition_html(filename=filename)

# Convert each element to a dictionary, then serialize a small slice as JSON
element_dict = [el.to_dict() for el in elements]
example_output = json.dumps(element_dict[11:15], indent=2)
print(example_output)
```

```text
{
  "type": "Title",
  "element_id": "29887a5ff9846ccc23327565a07e17fa",
  "text": "Share",
  "metadata": {
    "category_depth": 0,
    "last_modified": "2024-03-30T04:25:39",
    "page_number": 1,
    "languages": [
      "eng"
    ],
    "file_directory": "example_files",
    "filename": "medium_blog.html",
    "filetype": "text/html"
  }
...
```

```python
JSON(example_output)
```

![image](https://hackmd.io/_uploads/rkbVbZAeA.png =400x)

### PowerPoint Extraction

![image](https://hackmd.io/_uploads/Hyx5bZRlA.png =400x)
![image](https://hackmd.io/_uploads/BJV9kbCgC.png =400x)

```python
from unstructured.partition.pptx import partition_pptx

filename = "example_files/msft_openai.pptx"
elements = partition_pptx(filename=filename)

# Same normalized output as for HTML: a list of element dictionaries
element_dict = [el.to_dict() for el in elements]
JSON(json.dumps(element_dict[:], indent=2))
```

![image](https://hackmd.io/_uploads/SJf6ZZRxC.png =400x)

### PDF

:warning: Model-based; requires an API key

![image](https://hackmd.io/_uploads/SkdxoZCgC.png =400x)

```python
from IPython.display import Image
Image(filename="images/cot_paper.png", height=600, width=600)
```

![image](https://hackmd.io/_uploads/SJyLs-0gR.png)

```python
filename = "example_files/CoT.pdf"

# Read the PDF as bytes and wrap it for upload to the API
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy='hi_res',
    pdf_infer_table_structure=True,
    languages=["eng"],
)

try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements[:3], indent=2))
except SDKError as e:
    print(e)
```

```text
{
  "type": "Title",
  "element_id": "bff1fd0ec25e78f1224ad7309a1e79c4",
  "text": "B All Experimental Results",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 1,
    "filename": "CoT.pdf"
  }
},
{
  "type": "NarrativeText",
  "element_id": "ebf8dfb149bcbbd8c4b7f9a7046900a9",
  "text": "This section contains tables for experimental results for varying models and model sizes, on all benchmarks, for standard prompting vs. chain-of-thought prompting.",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 1,
    "parent_id": "bff1fd0ec25e78f1224ad7309a1e79c4",
    "filename": "CoT.pdf"
  }
},
```

![image](https://hackmd.io/_uploads/SkIvKMClR.png)

---

## Resources

#### [https://unstructured.io/api-key-free](https://unstructured.io/api-key-free)

Get started in minutes with our free API hosted by Unstructured. Usage is capped at 1,000 pages per month.
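
As a closing sketch of the normalization benefits noted earlier (filtering unwanted elements and chunking by section), the plain-Python example below works on element dictionaries shaped like the JSON outputs in these notes. It is a hand-rolled stand-in, not the chunking that the `unstructured` library provides; the function names `drop_types` and `group_by_title`, the `"Header"` and `"Footer"` type labels used for filtering, and the sample data are assumptions for illustration.

```python
def drop_types(elements, unwanted=("Header", "Footer")):
    """Return only the elements whose type is not in `unwanted`."""
    return [el for el in elements if el["type"] not in unwanted]


def group_by_title(elements):
    """Start a new section at every Title element and collect the text after it."""
    sections = []
    for el in elements:
        if el["type"] == "Title":
            sections.append({"title": el["text"], "texts": []})
        else:
            if not sections:
                # Text that appears before the first title gets an untitled section.
                sections.append({"title": None, "texts": []})
            sections[-1]["texts"].append(el["text"])
    return sections


# Hypothetical sample data in the same shape as the normalized output.
sample = [
    {"type": "Header", "text": "Confidential draft"},
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "Why preprocessing matters."},
    {"type": "Title", "text": "Method"},
    {"type": "NarrativeText", "text": "Partition the file."},
    {"type": "ListItem", "text": "Then chunk it."},
    {"type": "Footer", "text": "Page 1"},
]

kept = drop_types(sample)
sections = group_by_title(kept)
for section in sections:
    print(section["title"], len(section["texts"]))
```

Both steps run on the already-normalized dictionaries, so a different grouping rule can be tried without partitioning the source documents again.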
