# llama_index 的 文字切割處理 RAG 在文檔的切割會影響查詢時找到的文字片段以及語言模型的回覆品質, 因此對於各種不同的切割方法最好可以瞭解實際切割細節, 避免誤用不適當的切割方式。 ## SentenceSplitter llama_index 預設會以 [`SentanceSplitter`](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#sentencesplitter) 以完整語句為單位切割文件為片段, 它預設會以 1024 個 token 為界切割片段, 每個片段的開頭重疊上一個片段的 200 個 token 的內容。不過實際上的運作還有許多細節要注意, 以下就分別說明。 ### 切割用的參數 建立 `SentenceSplitter` 物件時, 可以一併設定以下參數: ```python chunk_size=1024, # 切片 token 數限制 chunk_overlap=200, # 切片開頭與前一片段尾端的重複 token 數 paragraph_separator='\n\n\n', # 段落的分界 secondary_chunking_regex='[^,.;。?!]+[,.;。?!]?' # 單一句子的樣式 separator=' ', # 最小切割的分界字元 ``` 也可以在建立後透過屬性修改。 不過要注意的是, 實際上每一切片還會放置**詮釋資料 (metadata)**, 單一切片的 token 數量要扣除詮釋資料的 token 數, 才是真正可以放置內容的 token 數量。 使用 `SimpleDirectoryReader` 讀取資料夾內的所有檔案時, 詮釋資料的內容是 "filepath: " 加上文件檔案的完整路徑;但若是直接提供檔名只讀取特定檔案時, 詮釋資料的內容是 "filepath: " 加上傳入的檔案路徑, 不會轉成完整路徑。我測試時所有檔案都放在程式檔相同位置的 data 資料夾下。 ### 切割步驟 實際切割的步驟會分成分解與合併: - 分解 - 如果整份文件的 token 數小於限制的 token 數, 就不分解。 - 嘗試以 `paragraph_separator` 屬性設定的字串 (預設就是 ''\n\n\n') 分割段落。 - 如果分割出的段落超過 token 數限制, 就採用建立`SentenceSplitter` 物件時可傳入的 `chunking_tokennizer_fn` 參數指定的函式切割單句, 未指定時預設採用透過 [nltk](https://www.nltk.org/api/nltk.tokenize.PunktSentenceTokenizer.html) 模組內的預訓練模型將段落切割成單一句子的 `split_by_sentence_tokenizer` 函式。 - 如果切割出的單一句子還是超過 token 數限制, 就用 `secondary_chunking_regex` 屬性設定的規則運算式再切割成短句子。 - 如果短句子還是超過 token 數限制, 就使用 `separator` 屬性的設定再細部切割。 - 如果還是超過 token 數限制, 就以字元為單位切割。 最後切割出來的每一個小片段就稱為 **split**。這樣的切割方法可以盡量保留完整的段落, 或至少是完整的句子, 讓切割出來的片段可以呈現完整的文意, 只有在不得以的時候, 才會把句子再切割。 - 合併 - 每一個 split 跟下一個 split 合併, 直到再合併會超過 token 數限制為止。合併結果就稱為一個 chunk。 - 第二個 chunk 開始, 會從上一個 chunk 的最後一個 split 開始, 往回合併 split, 一直到超過 chunk_overlap 屬性的設定數量為止, 合併結果會放在新的 chunk 開頭, 也就是除了第一個 chunk 外, 每個 chunk 都會重複前一個 chunk 尾端的內容。如果前一個 chunk 的最後一個 split 就已經超過 chunk_overlap 的 token 數限制, 就完全不會有重複內容。 - 每一個新的 chunk 除了重複前一個 chunk 的部分以外, 都至少會把下一個 split 加入, 不論是否會超過 token 限制數。 這樣合併完的結果, 就會是盡可能以語意完整的位置切割出最長的片段, 而且每個片段都開頭都涵蓋前一個片段的尾端內容, 保留前後文的脈絡。 最後會將切割出來的 chunk 包裝為 `TextNode` 類別的物件, 除了文字片段以外, 還包含有詮釋資料、文字片段在原始文章中的起、迄點等等。 ### 測試文章 本文以 `SimpleDirectoryReader` 讀取 data 資料夾下的 test.txt 文字檔為例, 這個文字檔的內容如下: ```python= texts=[""" Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. Notes [1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting. """] ``` 各別非空白行包含的 token 數量如下: - 3:68 - 5:118 - 15:1 - 17:44 ### 實測--預設設定 首先我們以底下的程式透過 `SentenceSplitter` 的預設設定來切割前面測試文件: ```python from llama_index.core import SimpleDirectoryReader from llama_index.core.node_parser import SentenceSplitter documents = SimpleDirectoryReader( # input_files=['./data/ch.txt'] input_files=['./data/test.txt'] ).load_data() splitter = SentenceSplitter.from_defaults() nodes = splitter.get_nodes_from_documents(documents) for i, node in enumerate(nodes): print(f'[節點{i}]:{node.text}') ``` 由於預設的 token 數限制是 1024, 而詮釋資料的 token 數是 23, 因此實際切片的 token 限制數量是 1017。不過因為整個文件的全部內容 token 數量不到 1017, 所以實際切割的結果就是整份文件就是一個切片, 執行結果如下: ``` [節點0]:Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. Notes [1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting. ``` ### 實測--調整設定 為了看出切片的效果, 我們把 `SentenceSplitter` 的 `chunk_size` 以及 `chunk_overlap` 調小, 這裡要注意兩件事: 1. `chunk_size` 不能調得比詮釋資料的 token 數少, 這很容易理解, 不然光是放詮釋資料就把切片塞滿了。如果這樣做, 會看到以下的錯誤訊息: ``` ValueError: Metadata length (23) is longer than chunk size (10). Consider increasing the chunk size or decreasing the size of your metadata to avoid this. ``` 1. 切片實際可容納的 token 數不要小於 50, 這樣可能無法涵蓋具有完整內容的切片, 如果這樣做, 會看到以下的警告訊息: ``` Metadata length (23) is close to chunk size (70). Resulting chunks are less than 50 tokens. Consider increasing the chunk size or decreasing the size of your metadata to avoid this. ``` #### chunk_size=80, chunk_overlap=80 調整過參數的程式如下: ```python from llama_index.core import SimpleDirectoryReader from llama_index.core.node_parser import SentenceSplitter documents = SimpleDirectoryReader( # input_files=['./data/ch.txt'] input_files=['./data/test.txt'] ).load_data() splitter = SentenceSplitter.from_defaults( chunk_size=80, chunk_overlap=80 ) nodes = splitter.get_nodes_from_documents(documents) for i, node in enumerate(nodes): print(f'[節點{i}]:{node.text}') ``` 首先是分解, 第一步是以 '\n\n\n' 切割段落, 由於是以連續三個換行作為切割字串, 所以原本單純空一行的連續段落會被切在一起: ```python 0: '\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.' 1: '' 2: '' 3: '\n\nNotes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.' 4: '' ``` 切好之後, 除了第 0 段外, 每一段都還會把分割字串 '\n\n\n' 加回開頭, 最後結果為: ```python 0: '\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.' 1: '\n\n\n' 2: '\n\n\n' 3: '\n\n\n\n\nNotes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.' 4: '\n\n\n' ``` 其中第 0 段 token 數 187, 超過限制, 所以會使用 nltk 模組切割成句子: ```python 0: '\n\nBefore college the two main things I worked on, outside of school, were writing and programming. ' 1: "I didn't write essays. " 2: 'I wrote what beginning writers were supposed to write then, and probably still are: short stories. ' 3: 'My stories were awful. ' 4: 'They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\n' 5: 'The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." ' 6: 'This was in 9th grade, so I was 13 or 14. ' 7: "The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. " 8: "It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights." ``` 其餘段落都沒有超過限制, 最後會把個別片段都包裝成內含文字片段與 token 數量的 `_Split` 物件, 總共 0~12 共 13 個, 其中的 is_sentence 表示是以段落或是單句切割的結果: ```python 00: _Split(text='\n\nBefore college the two main things I worked on, outside of school, were writing and programming. ', is_sentence=True, token_size=21) 01: _Split(text="I didn't write essays. ", is_sentence=True, token_size=7) 02: _Split(text='I wrote what beginning writers were supposed to write then, and probably still are: short stories. ', is_sentence=True, token_size=20) 03: _Split(text='My stories were awful. ', is_sentence=True, token_size=6) 04: _Split(text='They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\n', is_sentence=True, token_size=19) 05: _Split(text='The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." ', is_sentence=True, token_size=28) 06: _Split(text='This was in 9th grade, so I was 13 or 14. ', is_sentence=True, token_size=18) 07: _Split(text="The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. ", is_sentence=True, token_size=34) 08: _Split(text="It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.", is_sentence=True, token_size=41) 09: _Split(text='\n\n\n', is_sentence=True, token_size=1) 10: _Split(text='\n\n\n', is_sentence=True, token_size=1) 11: _Split(text='\n\n\n\n\nNotes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.', is_sentence=True, token_size=47) 12: _Split(text='\n\n\n', is_sentence=True, token_size=1) ``` 分解完就要進行合併, 扣除詮釋資料的 token 數, 實際有效的 token 數為 73, 目前各個 split 的 token 數如下: |序號|token 數| |---|---| |0|21| |1|7| |2|20| |3|6| |4|19| |5|28| |6|18| |7|34| |8|41| |9|1| |10|1| |11|47| |12|1| 所以: 1. split 0,1,2,3,4 合併 token 數剛好 73, 若再併入 split 5 就會超過 73。由 split 0,1,2,3,4 組成 chunk 0。 2. 接著組下一個 chunk, 由於 chunk_overlap 是 80, 因此可以重疊整個 chunk 0, 先加入 split 5, 不過這樣 token 數已達 101, 已經超過 token 數限制, 所以不會再加入其它 split, 因此 split 0,1,2,3,4,5 組成 chunk 1。 3. 接著組 chunk 2, 不能完整成疊整個 chunk 1, 所以捨棄 split 0, 重疊 split 1~5 的 80 個 token, 再加入 split 6, token 數達 98。 4. 組 chunk 3 時, 不能完整重疊 chunk 2, 捨棄 split 1,2, 只重複 split 3~6, token 數 71, 再加入 split 7, token 數 105。 6. 之後的狀況都類似。 最後合併後的 chunk 結構如下: |split 序號|token 數|chunk 序號|token 數|split 組成| |---|---|---|---|---| |0|21| |1|7| |2|20| |3|6| |4|19|0|73|0,1,2,3,4| |5|28|1|101|(0,1,2,3,4),5| |6|18|2|98|(1,2,3,4,5),6| |7|34|3|105|(3,4,5,6),7 |8|41|4|121|(5,6,7),8| |9|1|5|76|(7,8),9| |10|1|6|77|(7,8,9),10| |11|47|7|124|(7,8,9,10),11| |12|1|8|50|(9,10,11),12| 你可以看到每一個 chunk 都會重複前一個 chunk 的若干內容, 上表中以圓括號表示重複的 split。以下就是合併後的結果: ```python 0: "\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\n" 1: '\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." ' 2: 'I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. ' 3: 'My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. ' 4: 'The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.' 5: "The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\n\n" 6: "The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\n\n\n\n\n" 7: "The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\n\n\n\n\n\n\n\n\n\nNotes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting." 8: '\n\n\n\n\n\n\n\n\n\n\nNotes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.\n\n\n' ``` 合併之後還會進行後置處理, 移除 chunk 開頭及尾端的空白類字元, 然後移除沒有任何內容的 chunk, 最後結果如下: ```python 0: "Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep." 1: 'Before college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing."' 2: 'I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14.' 3: 'My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it.' 4: 'The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.' 5: "The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights." 6: "The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights." 7: "The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\n\n\n\n\n\n\n\n\n\nNotes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting." 8: 'Notes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.' ``` #### chunk_size=80, chunk_overlap=20 把重複的 token 數限制減少, 完全不會影響切割結果, 但合併時就會變成這樣: |split 序號|token 數|chunk 序號|token 數|原始 split 組成| |---|---|---|---|---| |0|21| |1|7| |2|20| |3|6| |4|19|0|54|0,1,2,3,4| |5|28| |6|18|1|53|(4),5,6| |7|34|2|52|(6),7 |8|41| |9|1| |10|1|3|43|8,9,10| |11|47| |12|1|4|50|(9,10),11,12| 要特別留意的是 chunk 3 因為上一個 chunk 的最後一個 split 超過限制的重複 token 數, 所以完全無法納入重疊, 因此不含前一個 chunk 的內容。以下就是實際合併後的結果: ```python 0: "Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep." 1: 'They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14.' 2: "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it." 3: "It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights." 4: 'Notes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.' ``` 從這個例子也可以看到, 如果限制重疊的 token 數太小, 就可能會造成根本不會涵蓋前一個 chunk 尾段內容的結果, 這會導致 RAG 在擷取內容時, 可能會取到不完整文意的片段。 另外, 建立 `SentenceSplitte` 時雖然可以指定 `include_metadata=False`, 但這個參數只會影響最後包裝好的 `TextNode` 物件是否包含詮釋資料, 每個片段實際 token 數限制都還是會扣除詮釋資料的 token 數。 ### 中文文字的切割 接著我們以中文文字測試, 測試的檔案為 ch.txt, 同樣放在 data 資料夾下, 以下是檔案內容: ```python r'''不過要注意的是, 實際上每一切片還會放置詮釋資料 (metadata), 單一切片的 token 數量要扣除詮釋資料的 token 數, 才是真正可以放置內容的 token 數量。以本例來說, 詮釋資料的內容是 "filepath: " 加上文件檔案的完整路徑, 我測試時是放在 "C:\Users\meebo\code\python\debug_llama_index\data\test.txt", token 數量為 23, 所以若是預設的 1024 個 token, 減去 23 就是 1001 個 token。''' ``` 程式碼也修改讀取的文件: ```python input_files=['./data/ch.txt'] # input_files=['./data/test.txt'] ``` 由於檔案內容沒有 '\n\n\n', 所以無法分割段落, 而且 nltk 模組的訓練並沒有包含中文詞彙, 所以也無法分割單句, 因此實際切割文字是使用規則樣式, 以中英文的逗號、據點、問號、驚嘆號為分割界線, 切割出來的 split 如下: ```python 00: '不過要注意的是,' 01: ' 實際上每一切片還會放置詮釋資料 (metadata),' 02: ' 單一切片的 token 數量要扣除詮釋資料的 token 數,' 03: ' 才是真正可以放置內容的 token 數量。' 04: '以本例來說,' 05: ' 詮釋資料的內容是 "filepath: " 加上文件檔案的完整路徑,' 06: ' 我測試時是放在 "C:\\Users\\meebo\\code\\python\\debug_llama_index\\data\\test.' 07: 'txt",' 08: ' token 數量為 23,' 09: ' 所以若是預設的 1024 個 token,' 10: ' 減去 23 就是 1001 個 token。' ``` 再包裝成 `_Split` 物件: ```python 00: _Split(text='不過要注意的是,', is_sentence=False, token_size=8) 01: _Split(text=' 實際上每一切片還會放置詮釋資料 (metadata),', is_sentence=False, token_size=26) 02: _Split(text=' 單一切片的 token 數量要扣除詮釋資料的 token 數,', is_sentence=False, token_size=28) 03: _Split(text=' 才是真正可以放置內容的 token 數量。', is_sentence=False, token_size=17) 04: _Split(text='以本例來說,', is_sentence=False, token_size=8) 05: _Split(text=' 詮釋資料的內容是 "filepath: " 加上文件檔案的完整路徑,', is_sentence=False, token_size=31) 06: _Split(text=' 我測試時是放在 "C:\\Users\\meebo\\code\\python\\debug_llama_index\\data\\test.', is_sentence=False, token_size=31) 07: _Split(text='txt",', is_sentence=False, token_size=2) 08: _Split(text=' token 數量為 23,', is_sentence=False, token_size=10) 09: _Split(text=' 所以若是預設的 1024 個 token,', is_sentence=False, token_size=19) 10: _Split(text=' 減去 23 就是 1001 個 token。', is_sentence=False, token_size=16) ``` 注意到 `is_sentence` 屬性為 `False`, 表示這不是以段落或是單句切割的結果, 也就是每個 split 可能會斷在不是完整句子結束的地方。這會在合併的時候產生影響, 以段落或是單句切割所得 (也就是 `is_sentence` 為 `True`) 的 split 即使併入 chunk 後會超過 token 數限制, 也會合併, 但 `sentence` 為 `False` 的 split, 只會在併入後不會超過 token 數限制時才會併入 chunk, 因此永遠不會切出大於 token 數限制的 chunk。 最後合併的順序如下: |split 序號|token 數|chunk 序號|token 數|原始 split 組成| |---|---|---|---|---| |0|8| |1|26| |2|28|0|62|0,1,2 |3|17| |4|8| |5|31|1|56|3,4,5| |6|31| |7|2| |8|10| |9|19|2|62|6,7,8,9 |10|16|3|35|(9),10| 這是最後合併的結果: ```python 0: '不過要注意的是, 實際上每一切片還會放置詮釋資料 (metadata), 單一切片的 token 數量要扣除詮釋資料的 token 數,' 1: '才是真正可以放置內容的 token 數量。以本例來說, 詮釋資料的內容是 "filepath: " 加上文件檔案的完整路徑,' 2: '我測試時是放在 "C:\\Users\\meebo\\code\\python\\debug_llama_index\\data\\test.txt", token 數量為 23, 所以若是預設的 1024 個 token,' 3: '所以若是預設的 1024 個 token, 減去 23 就是 1001 個 token。' ``` ## SentenceWindowNodeParser [SentenceWindowNodeParser](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#sentencewindownodeparser) 也是以完整語句為單位, 不過它不是以 token 數量為限制, 而是每一個片段都只包含單一語句, 查詢時就是以問題跟個別片段內單一語句的相似度來比較, 但實際送給語言模型時, 會連同前後指定數量的片段內容一起送出, 也就是查詢時相對小範圍的精準度, 但回覆時提供較豐富的脈絡給語言模型參考。 我們仍舊以之前的測試檔案為例, 以底下的程式碼觀察結果: ```python from llama_index.core import SimpleDirectoryReader from llama_index.core.node_parser import SentenceWindowNodeParser documents = SimpleDirectoryReader( # input_files=['./data/ch.txt'] input_files=['./data/test.txt'] ).load_data() splitter = SentenceWindowNodeParser(window_size=3) nodes = splitter.get_nodes_from_documents(documents) for i, node in enumerate(nodes): print(f'[節點{i}]:{node.text}') print(f'[Window]:{node.metadata["window"]}') ``` 其中建立 `SentenceWindowNodeParser` 時的參數 `WindowSize` 就是指送給模型時要包含前後的片段數。最後得到的節點中, `metadata` 裡面鍵 'window' 的值就是包含前後片段的內容。實際執行結果如下: ``` [節點0]: Before college the two main things I worked on, outside of school, were writing and programming. [Window]: Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. [節點1]:I didn't write essays. [Window]: Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. [節點2]:I wrote what beginning writers were supposed to write then, and probably still are: short stories. [Window]: Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." [節點3]:My stories were awful. [Window]: Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. [節點4]:They had hardly any plot, just characters with strong feelings, which I imagined made them deep. [Window]:I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. [節點5]:The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." [Window]:I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. [節點6]:This was in 9th grade, so I was 13 or 14. [Window]:My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. Notes [1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. [節點7]:The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. [Window]:They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. Notes [1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting. [節點8]:It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. [Window]:The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. Notes [1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting. [節點9]:Notes [1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. [Window]:This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. Notes [1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting. [節點10]:I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting. [Window]:The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. Notes [1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting. ``` 你可以看到 node 0 的 window 就是 0,(1,2,3) 片段的內容, 也就是 node 0 自己的內容加上 `window_size` 指定的後面 3 個片段的內容。以此類推, node 1 的 window 包含 (0),1,(2,3,4) 片段的內容、node 2 的 window 包含 (0,1),2,(3,4,5) 片段的內容、node 3 的 window 包含 (0,1,2),3,(4,5,6) 片段的內容、...、node 10 的 window 包含 (7,8,9),10 片段的內容。 ### 預設的切割方式無法處理中文 如果改用之前的中文檔案測試, 你會發現切割結果就只有一個 node: ``` [節點0]:不過要注意的是, 實際上每一切片還會放置詮釋資料 (metadata), 單一切片的 token 數 量要扣除詮釋資料的 token 數, 才是真正可以放置內容的 token 數量。以本例來說, 詮釋資料的內容是 "filepath: " 加上文件檔案的完整路徑, 我測試時是放在 "C:\Users\meebo\code\python\debug_llama_index\data\test.txt", token 數量為 23, 所以若是預設的 1024 個 token, 減去 23 就是 1001 個 token。 [Window]:不過要注意的是, 實際上每一切片還會放置詮釋資料 (metadata), 單一切片的 token 數量要扣除詮釋資料的 token 數, 才是真正可以放置內容的 token 數量。以本例來說, 詮釋資料的內容是 "filepath: " 加上文件檔案的完整路徑, 我測試時是放在 "C:\Users\meebo\code\python\debug_llama_index\data\test.txt", token 數量為 23, 所以若是預設的 1024 個 token, 減去 23 就是 1001 個 token。 ``` 這是因為預設切割語句的是透過 [nltk](https://www.nltk.org/api/nltk.tokenize.PunktSentenceTokenizer.html) 模組的 `split_by_sentence_tokenizer` 函式, 這個函式不懂中文, 所以切不出語句。我們可以透過 `sentence_splitter` 參數或屬性替換自訂的函式, 例如以下就改用以中文全形句點切割語句: ```python from llama_index.core import SimpleDirectoryReader from llama_index.core.node_parser import SentenceWindowNodeParser documents = SimpleDirectoryReader( input_files=['./data/ch.txt'] # input_files=['./data/test.txt'] ).load_data() def period_splitter(text): splits = text.split('。') if len(splits) > 1: # 把句子結尾的句號加回去 splits = ( [split + '。' for split in splits[:-1]] + splits[-1:] # 最後一個句子不加句號 ) splits = [split for split in splits if split] return splits splitter = SentenceWindowNodeParser( window_size=3, sentence_splitter=period_splitter # 用自定義的句子分割器 ) nodes = splitter.get_nodes_from_documents(documents) for i, node in enumerate(nodes): print(f'[節點{i}]:{node.text}') print(f'[Window]:{node.metadata["window"]}') ``` `period_splitter` 函式很簡單, 就是以全形中文句點切割, 切割後會把中文句點加回去, 並且移除空的字串。以下就是執行結果: ``` [節點0]:不過要注意的是, 實際上每一切片還會放置詮釋資料 (metadata), 單一切片的 token 數量要扣除詮釋資料的 token 數, 才是真正可以放置內容的 token 數量。 [Window]:不過要注意的是, 實際上每一切片還會放置詮釋資料 (metadata), 單一切片的 token 數量要扣除詮釋資料的 token 數, 才是真正可以放置內容的 token 數量。 以本例來說, 詮釋資料的內容是 "filepath: " 加上文件檔案的完整路徑, 我測試時是放在 "C:\Users\meebo\code\python\debug_llama_index\data\test.txt", token 數量為 23, 所以若是預設的 1024 個 token, 減去 23 就是 1001 個 token。 [節點1]:以本例來說, 詮釋資料的內容是 "filepath: " 加上文件檔案的完整路徑, 我測試時是放在 "C:\Users\meebo\code\python\debug_llama_index\data\test.txt", token 數量為 23, 所以若是預設的 1024 個 token, 減去 23 就是 1001 個 token。 [Window]:不過要注意的是, 實際上每一切片還會放置詮釋資料 (metadata), 單一切片的 token 數量要扣除詮釋資料的 token 數, 才是真正可以放置內容的 token 數量。 以本例來說, 詮釋資料的內容是 "filepath: " 加上文件檔案的完整路徑, 我測試時是放在 "C:\Users\meebo\code\python\debug_llama_index\data\test.txt", token 數量為 23, 所以若是預設的 1024 個 token, 減去 23 就是 1001 個 token。 ``` 由於檔案內只有兩個中文句點, 所以可以得到兩個節點, 兩個節點的 window 都樣, 包含兩個節點的內容。