2023.08. unstructured.io/. How to Build an End-to-End RAG Pipeline with Unstructured’s API
Retrieval Augmented Generation (RAG): A technique for grounding LLM responses on validated external information.
Contextual Integration
Document Elements
The basic building blocks of a document.
Useful for various RAG tasks, such as filtering and chunking.
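As a rough illustration of how element types support filtering and chunking, the sketch below works on plain dictionaries shaped like the output of `el.to_dict()`. The sample data and the helper names `filter_by_type` and `split_on_titles` are hypothetical and are not part of the Unstructured library.

```python
# Hypothetical element dicts, shaped like the output of `el.to_dict()`.
elements = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "RAG grounds LLM answers in external data."},
    {"type": "Title", "text": "Share"},
    {"type": "NarrativeText", "text": "Elements carry a type and metadata."},
]


def filter_by_type(elements, wanted):
    """Keep only the elements whose type is in `wanted`."""
    return [el for el in elements if el["type"] in wanted]


def split_on_titles(elements):
    """Start a new chunk at every Title element (a simplified illustration)."""
    chunks = []
    for el in elements:
        if el["type"] == "Title" or not chunks:
            chunks.append([])
        chunks[-1].append(el["text"])
    return chunks


narrative = filter_by_type(elements, {"NarrativeText"})
chunks = split_on_titles(elements)
```

Here `narrative` holds only the body text, while `chunks` groups each title with the text that follows it.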
Content Cues
Standardization Need
Extraction Variability
Metadata Insight
Format Diversity
Common Format
Normalization Benefit
A normalized format lets any document be processed in the same way, regardless of its source format.
Reduced Processing Cost
The initial step of reprocessing a document is the most expensive part of the whole process.
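The benefit of a common format can be sketched as follows: once elements from different file types share one shape, a single downstream function can handle all of them. The sample elements and the helper `count_types` are hypothetical, and the text values are made up for illustration.

```python
from collections import Counter

# Hypothetical normalized elements from two different source formats.
html_elements = [
    {"type": "Title", "text": "Share", "metadata": {"filetype": "text/html"}},
    {"type": "NarrativeText", "text": "Body text.", "metadata": {"filetype": "text/html"}},
]
pptx_elements = [
    {
        "type": "Title",
        "text": "Slide heading",
        "metadata": {
            "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
        },
    },
]


def count_types(elements):
    """Count elements per type; works the same for any source format."""
    return Counter(el["type"] for el in elements)


html_counts = count_types(html_elements)
pptx_counts = count_types(pptx_elements)
combined_counts = count_types(html_elements + pptx_elements)
```

The same function runs unchanged on HTML-derived and PPTX-derived elements, which is the point of normalizing first.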
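In computer science, data serialization is the process of converting a data structure or object into a sequence of bytes so that it can be stored or transmitted through a file, memory, or a network. Serialized data can be deserialized when needed, that is, the byte sequence is restored to the original data structure or object.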
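A minimal sketch of that round trip, using Python's standard `json` module on a hypothetical element dictionary (the field values are illustrative):

```python
import json

# Hypothetical element, shaped like one entry of the partition output.
element = {
    "type": "Title",
    "text": "Share",
    "metadata": {"page_number": 1, "languages": ["eng"]},
}

# Serialize: Python object -> JSON text -> bytes, ready to store or send.
serialized = json.dumps(element).encode("utf-8")

# Deserialize: bytes -> JSON text -> Python object.
restored = json.loads(serialized.decode("utf-8"))

round_trip_ok = restored == element
```

Because this element contains only strings, integers, lists, and dictionaries, the restored object compares equal to the original.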
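Serialization Benefits:
Advantages of JSON
LLM Relevance
For large language models to reflect current language use and knowledge, their training data needs to be updated regularly. Including recent web content can help a model better understand currently popular topics and changes in language.
HTML Understanding
HTML is the foundation of web pages. Its structured elements help define how information is presented on a page, using element tags such as <h1> for titles and <p> for text.
Data Extraction and Categorization
By analyzing HTML elements, useful information can be extracted from a web page and organized into structured data. This includes identifying titles, paragraphs, lists, and so on, and categorizing that content for later data analysis or applications.
This kind of structured extraction is a key technique in fields such as information retrieval, content management, and data analysis.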
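To make the idea of tag-based extraction concrete, here is a small sketch built on Python's standard `html.parser`. It is not how `partition_html` is implemented. The tag-to-category mapping `TAG_TO_TYPE` and the class `ElementExtractor` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed mapping from HTML tags to element categories (illustrative only).
TAG_TO_TYPE = {"h1": "Title", "p": "NarrativeText", "li": "ListItem"}


class ElementExtractor(HTMLParser):
    """Collect the text of selected tags as categorized elements."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._current = None  # category of the tag currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_TYPE:
            self._current = TAG_TO_TYPE[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._current is not None and TAG_TO_TYPE.get(tag) == self._current:
            text = "".join(self._buffer).strip()
            if text:
                self.elements.append({"type": self._current, "text": text})
            self._current = None


page = "<h1>Share</h1><p>First paragraph.</p><ul><li>One</li><li>Two</li></ul>"
parser = ElementExtractor()
parser.feed(page)
parser.close()
extracted = parser.elements
```

Each entry in `extracted` pairs a category with the text found inside the matching tag. Nested tags and malformed markup are not handled in this sketch.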
from IPython.display import JSON
import json
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
from unstructured.staging.base import dict_to_elements, elements_to_json
from Utils import Utils
utils = Utils()
DLAI_API_KEY = utils.get_dlai_api_key()
DLAI_API_URL = utils.get_dlai_url()
s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)
filename = "example_files/medium_blog.html"
elements = partition_html(filename=filename)
element_dict = [el.to_dict() for el in elements]
example_output = json.dumps(element_dict[11:15], indent=2)
print(example_output)
{
  "type": "Title",
  "element_id": "29887a5ff9846ccc23327565a07e17fa",
  "text": "Share",
  "metadata": {
    "category_depth": 0,
    "last_modified": "2024-03-30T04:25:39",
    "page_number": 1,
    "languages": [
      "eng"
    ],
    "file_directory": "example_files",
    "filename": "medium_blog.html",
    "filetype": "text/html"
  }
...
JSON(example_output)
from unstructured.partition.pptx import partition_pptx
filename = "example_files/msft_openai.pptx"
elements = partition_pptx(filename=filename)
element_dict = [el.to_dict() for el in elements]
JSON(json.dumps(element_dict, indent=2))
from IPython.display import Image
Image(filename="images/cot_paper.png", height=600, width=600)
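filename = "example_files/CoT.pdf"
# Read the PDF as bytes and wrap it for upload to the API.
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

# Request high-resolution partitioning with table structure inference.
req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    pdf_infer_table_structure=True,
    languages=["eng"],
)
try:
    resp = s.general.partition(req)
    # Show the first three elements returned by the API.
    print(json.dumps(resp.elements[:3], indent=2))
except SDKError as e:
    print(e)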
{
  "type": "Title",
  "element_id": "bff1fd0ec25e78f1224ad7309a1e79c4",
  "text": "B All Experimental Results",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 1,
    "filename": "CoT.pdf"
  }
},
{
  "type": "NarrativeText",
  "element_id": "ebf8dfb149bcbbd8c4b7f9a7046900a9",
  "text": "This section contains tables for experimental results for varying models and model sizes, on all benchmarks, for standard prompting vs. chain-of-thought prompting.",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 1,
    "parent_id": "bff1fd0ec25e78f1224ad7309a1e79c4",
    "filename": "CoT.pdf"
  }
},
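In the output above, the `parent_id` of the NarrativeText element matches the `element_id` of the Title element. One possible use of that link is to group body text under its title, as sketched below. The helper `group_by_parent` is hypothetical, and the sample dictionaries are trimmed copies of the two elements shown, with the narrative text abbreviated.

```python
from collections import defaultdict

# Trimmed copies of the two elements from the output (text abbreviated).
elements = [
    {
        "type": "Title",
        "element_id": "bff1fd0ec25e78f1224ad7309a1e79c4",
        "text": "B All Experimental Results",
        "metadata": {"page_number": 1},
    },
    {
        "type": "NarrativeText",
        "element_id": "ebf8dfb149bcbbd8c4b7f9a7046900a9",
        "text": "This section contains tables for experimental results.",
        "metadata": {
            "page_number": 1,
            "parent_id": "bff1fd0ec25e78f1224ad7309a1e79c4",
        },
    },
]


def group_by_parent(elements):
    """Map each title's text to the texts of elements that point to it."""
    titles = {
        el["element_id"]: el["text"] for el in elements if el["type"] == "Title"
    }
    sections = defaultdict(list)
    for el in elements:
        parent_id = el["metadata"].get("parent_id")
        if parent_id in titles:
            sections[titles[parent_id]].append(el["text"])
    return dict(sections)


sections = group_by_parent(elements)
```

Elements without a `parent_id`, or whose parent is not a Title in the list, are left out of `sections` in this sketch.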
Get started in minutes with our free API hosted by Unstructured. Usage is capped at 1,000 pages per month.