# A Cracked Engineer's Guide to Chunking

## Introduction: The Deceptive Complexity of Chunking

When I began my first retrieval-augmented generation (RAG) project, I naively thought that splitting text into manageable chunks would be trivial. How hard could it be to break down a document into smaller pieces? As it turns out--incredibly hard. This is the story of my multi-year journey through a labyrinth of chunking strategies in relentless pursuit of better text segmentation.

Every AI engineer eventually comes to realize that chunking is the unsung hero of many language model applications. It's the foundation upon which we build our retrieval systems, the silent workhorse that can make or break the performance of our RAG and search pipelines. Yet, for something so crucial, it's astounding how little attention it often receives.

In this deep dive, I'm going to share my personal experiences, the strategies I've tried, the pitfalls I've encountered, and the insights I've gained. This isn't just a technical guide; it's a narrative of trial and error, of the constant push and pull between theoretical ideals and practical realities. So, buckle up, fellow AI devs, as we embark on this chunking odyssey.

## The Chunking Challenge: Why It's Harder Than You Think

Before we delve (derogatory) into the various strategies and their intricacies, let's take a moment to appreciate why chunking is such a challenging problem. On the surface, it seems straightforward: take a long piece of text and break it into smaller, manageable pieces. But as with many things in AI, the moment we try it in practice, it turns out to be much harder than we expected.

### The Balancing Act

Effective chunking is a delicate balance between several competing factors:

- **Semantic Coherence**: Ideally, each chunk should contain a complete thought or idea. Splitting mid-sentence or breaking up closely related concepts can lead to loss of context and meaning. We also want to create chunks that are readable by both LLMs and humans.
- **Size Constraints**: LLMs have limited context windows. We need chunks that are small enough to fit within these constraints but large enough to contain meaningful information.
- **Retrieval Efficiency**: The way we chunk text directly impacts how effectively we can retrieve relevant information later. Too-large chunks might contain irrelevant information, while too-small chunks might miss important context.
- **Computational Efficiency**: In both production and development environments, chunking needs to be fast. We can't afford to spend minutes processing a single document when we're dealing with large datasets. We also want to be able to re-chunk our dataset in a reasonable time and run experiments/evals quickly!
- **Consistency**: For many applications, it's crucial that chunking is deterministic and consistent. If we chunk the same text twice, we should get the same result.

### Maintaining Context

One of the most challenging aspects of chunking is preserving context. Natural language is full of references, both explicit and implicit. A pronoun in one sentence might refer to a subject introduced several sentences earlier. An acronym might be defined at the beginning of a paragraph and used throughout the rest of the text. How do we ensure that these connections aren't lost when we split the text?

### Not All Documents Are Created Equal

Another complicating factor is the sheer diversity of text we might encounter.
A chunking strategy that works well for academic papers might fall flat when applied to financial statements. Legal documents, creative writing, technical manuals--each type of text presents its own unique challenges. The file type also matters. We can have PDFs, slides, code, business-specific formats--the list goes on.

As I've grappled with these issues, I've come to realize that there's no one-size-fits-all solution. Instead, we need a toolkit of strategies, each with its own strengths and weaknesses. In the following sections, I'll walk you through the evolution of my chunking approaches, from naive beginnings to more sophisticated techniques. But if you don't want to read this entire article, here's the TL;DR: start with the simplest solution, and only move to more complex strategies until something works.

## The Evolution of Chunking: From Naive to Nuanced

Picking a chunking strategy can be extremely challenging--after all, check out this array of different segmentation techniques:

![f8d7bcc0-c5c8-4b85-91e3-7dd821461b53](https://hackmd.io/_uploads/S1AT-LB31e.png)

But by the end of this ebook, you'll be able to choose the right strategy for your project yourself.

### The Naive Approach: Fixed-Size Chunking (Bad!)

When I first started working with LLMs and RAG systems, I did what many beginners do: I reached for the simplest possible solution. I split the text into chunks of a fixed number of tokens or characters. It seemed logical--if I needed chunks of 512 tokens to fit my model's context window (which was much smaller at the time), why not just split the text every 512 tokens?

Here's what that looked like:

```python
def naive_chunker(text, chunk_size=512):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
```

Simple, right? Well, as it turns out, simplicity comes at a cost. Here's what I quickly discovered:

- **Semantic Butchery**: This method ruthlessly cuts through sentences, paragraphs, and even words. The resulting chunks were often nonsensical, starting and ending mid-thought.
- **Context Loss and Retrieval Issues**: Important context was frequently split across chunks. If a chunk ends with "The main reasons for the observed behavior are" and the next chunk starts with "increased pressure, reduced stability, and higher temperatures," it becomes challenging for retrieval systems to maintain coherence. This leads to retrieval issues where chunks starting or ending with partial sentences are often irrelevant or misleading.

It quickly became clear that while this approach was computationally efficient, it was a disaster for maintaining the semantic integrity of the text. It was time to level up.

### Structure (Heuristic)-Based Methods: A Step in the Right Direction

My next stop on the chunking journey was to explore structure-based methods. These approaches use predefined rules or patterns to make more intelligent splitting decisions. One popular tool in this category is the RecursiveCharacterTextSplitter from LangChain. Here's a basic implementation:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]
)
chunks = splitter.split_text(long_text)
```

This method was a significant improvement over the naive approach. It tries to split at natural boundaries like paragraph breaks or sentence endings. The chunk_overlap parameter also helps maintain some context between chunks.
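
To get a feel for how the separator fallback and the overlap interact, here's a small, self-contained sketch. The sample text and the tiny sizes are made up purely for illustration, so the splits themselves aren't meaningful:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Toy document: two short paragraphs separated by a blank line
sample_text = (
    "Chunking is the process of splitting text into smaller pieces. "
    "Good chunks preserve complete thoughts.\n\n"
    "Retrieval systems then embed each chunk and search over them. "
    "Chunk boundaries directly affect what the LLM eventually sees."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=120,    # deliberately tiny so the fallback behavior is visible
    chunk_overlap=30,  # repeats trailing context at the start of the next chunk when a paragraph is split further
    separators=["\n\n", "\n", ".", " ", ""],
)

# The splitter tries "\n\n" first, then falls back to finer separators
for i, chunk in enumerate(splitter.split_text(sample_text)):
    print(f"chunk {i} ({len(chunk)} chars): {chunk!r}")
```
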

However, I soon discovered that while this method worked reasonably well for well-structured text, it still had limitations:

- **Inconsistent Chunk Sizes**: Depending on the structure of the text, you could end up with wildly varying chunk sizes. Some might be near the maximum, while others could be very small.
- **Semantic Blindness**: While it respects some structural elements, it doesn't understand the meaning of the text. It might still split in the middle of a complex idea if that idea spans multiple paragraphs.
- **Parameter Tuning**: Finding the right balance of chunk_size, chunk_overlap, and separators often required a lot of trial and error.

Despite these limitations, structure-based methods like this are often a good starting point for many applications. They're relatively fast, easy to implement, and produce decent results for well-structured text. They are also by far the most common approach for chunking text. I'd experiment specifically with the separators that work best with your data, as the defaults of `["\n\n", "\n", " ", ""]` might not be ideal for your text. It's common practice to use a larger chunk size of 300+ tokens and a small overlap.

A good resource is chunkviz.com, a tool created by one of the earliest proponents (or possibly the creator) of semantic chunking, Greg Kamradt.

![image](https://hackmd.io/_uploads/HkvBSHJz1l.png)

In this example we are splitting with the RecursiveCharacterTextSplitter and a chunk size of 589. For this dataset, the tool makes it easy to see that a chunk size of 500-1000 works best.

It's also worth noting that there are specific splitters created for coding languages, and notably Markdown. If you are splitting HTML or PDFs, a great strategy is to first turn the document into Markdown, and then split the Markdown! Because Markdown has so many natural boundaries to split on, we can develop very good heuristics.

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=60,
    chunk_overlap=0
)
```

LlamaIndex and Langchain each have their own splitters, which could give you different results. Here's a broad overview of some of the splitters available in both libraries:

### **Langchain Splitters:**

1. **CharacterTextSplitter** - Splits by characters
2. **RecursiveCharacterTextSplitter** - Splits by characters recursively for better chunk boundaries
3. **HTMLHeaderTextSplitter** - Splits based on HTML headers and elements
4. **CodeTextSplitter** - Splits code with support for multiple programming languages
5. **RecursiveJsonSplitter** - Splits JSON objects into smaller, manageable chunks
6. **SemanticChunker** - Splits text based on semantic similarity
7. **TokenTextSplitter** - Splits based on token count, helpful for LLMs with token limits

### **LlamaIndex Splitters:**

1. **SimpleFileNodeParser** - General-purpose file parser for various file formats
2. **HTMLNodeParser** - Parses and splits HTML by tags
3. **JSONNodeParser** - Splits JSON documents into chunks by traversing nested structures
4. **MarkdownNodeParser** - Parses and splits Markdown files
5. **CodeSplitter** - Splits code based on the specified programming language and line count
6. **SentenceSplitter** - Splits text while preserving sentence boundaries
7. **SentenceWindowNodeParser** - Splits by sentence with neighboring window sentences included in metadata. Great for small-to-big embedding, for when you want to embed smaller chunks but show larger context to the LLM.
8. **SemanticSplitterNodeParser** - Dynamically splits based on semantic similarity using embeddings
9. **TokenTextSplitter** - Splits based on raw token counts
10. **HierarchicalNodeParser** - Splits nodes into hierarchical structures for various chunk sizes

These are regularly updated and renamed, so this list may be out of date by the time you read it. But the point is, there are many different structure-based segmentation strategies. The best one can depend on the specific file format and content of your document.

Interestingly, AI21 implemented a very robust splitter for HTML, and when they released Wordtune Read, a platform that breaks articles into chunks and summarizes each chunk, they managed to monetize their simple HTML splitter through an API! However, I was very sad to hear that this splitter is no longer supported. Jina AI now maintains a simple segmentation API: https://jina.ai/segmenter/, which might make sense for some applications if you don't want to build your own segmentation microservice (but do you even need a segmentation microservice? Probably not.)

### Semantic-Aware Chunking: The Quest for Meaning

As I continued to push for better chunking results, I realized I needed a method that could understand the semantic content of the text. This led me to explore semantic-aware chunking techniques. One approach I came across was Greg Kamradt's semantic chunking algorithm, which uses embedding models to detect semantic shifts in the text. Here's a simplified version of how it works:

1. Split the text into sentences. This can be done with any of the methods above.
2. Generate embeddings for each sentence
3. Calculate the cosine similarity between adjacent sentences
4. Identify 'breakpoints' where the similarity drops below a threshold
5. Create chunks based on these breakpoints

```python
from sklearn.metrics.pairwise import cosine_similarity
from model2vec import StaticModel
import pysbd


def semantic_chunker(text, model, threshold=0.1, max_chunk_size=1000, min_chunk_size=50):
    # Split into sentences with a rule-based sentence boundary detector
    sentences = pysbd.Segmenter(language="en", clean=False).segment(text)

    # Generate embeddings for every sentence
    embeddings = model.encode(sentences)

    # Full pairwise similarity matrix; we only use adjacent entries
    similarities = cosine_similarity(embeddings)

    chunks = []
    distances = []
    split_points = []
    current_chunk = []
    current_size = 0  # size of the current chunk in characters

    for i, sentence in enumerate(sentences):
        if i > 0:
            similarity = similarities[i - 1][i]
            distances.append(similarity)
            # Split when similarity drops below the threshold (and the chunk is
            # already big enough), or when the chunk would exceed the max size
            if (similarity < threshold and current_size > min_chunk_size) or current_size + len(sentence) > max_chunk_size:
                chunks.append(' '.join(current_chunk))
                current_chunk = []
                current_size = 0
                split_points.append(i)
        current_chunk.append(sentence)
        current_size += len(sentence)

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks, distances, split_points


# Usage
model = StaticModel.from_pretrained("minishlab/potion-base-8M")
chunks, distances, split_points = semantic_chunker(long_text, model)
for chunk in chunks:
    print(chunk)
    print('---')

# Visualize distances and where chunks are split
import matplotlib.pyplot as plt

plt.plot(distances)
plt.title('Sentence Similarities')
plt.xlabel('Sentence Index')
plt.ylabel('Cosine Similarity')

# Highlight split points
for split_point in split_points:
    plt.axvline(x=split_point - 1, color='red', linestyle='--')  # adjust index for plotting
plt.show()
```

![state of the union](https://hackmd.io/_uploads/rynT0NyM1l.png)

Here we chunked the State of the Union address--and got very good results! Notice how evenly we split the text once we found the right hyperparameters for threshold, min_chunk_size, and max_chunk_size. The min_chunk_size is very important! Without it we would have made many more splits, resulting in many useless tiny chunks. Chunking the State of the Union was also pretty fast--we were able to do it in 0.46 seconds, because pysbd is rule-based and very fast, and our embeddings were generated very quickly.

![image](https://hackmd.io/_uploads/SJAhqHJzyg.png)

This approach was a significant step forward, producing chunks that were much more semantically coherent. However, it also introduced new challenges:

- **Computational Overhead**: Generating embeddings for each sentence is computationally expensive, especially for large documents. But with a small local model, this isn't really a bottleneck in most cases.
- **Model Dependence**: The quality of the chunks could, in principle, depend heavily on the quality of the embedding model used. I haven't tested this enough to say for sure, but I suspect the effect is small.
- **Threshold Tuning**: Finding the right similarity threshold can be tricky and may need to be adjusted for different types of text.

When splitting text for vector search, your choice of splitter is crucial. While RecursiveCharacterTextSplitter is common, NLP-based approaches often yield better results. I initially used NLTK's sentence tokenizer, which produced smarter splits than character-based methods. However, it had two drawbacks: it was difficult to reverse the splitting, and it occasionally created illogical breaks. After testing alternatives, I found pysbd to be superior. It's essentially an improved take on NLTK's sentence tokenizer that works consistently across my document collection and integrates seamlessly with spaCy.

![image](https://hackmd.io/_uploads/rJY63G1z1x.png)

However, depending on the text you are splitting, it might make sense to use a Markdown splitter, some other splitter that is more specific to your use case, or even a RecursiveCharacterTextSplitter with a very small chunk size.

Another decision I made with this semantic chunking function was to use a very small model: minishlab/potion-base-8M, a compact Model2Vec embedding model with just 8 million parameters. Models in this range (2-8M parameters) are lightweight, fast, and cost-effective to run. There's little evidence to suggest that the chunk boundaries they produce are significantly worse than those from larger models. This choice accelerates our chunking process by orders of magnitude and removes reliance on external providers. By leveraging Model2Vec, we achieve a streamlined, efficient solution that maintains quality while enabling rapid, inexpensive, and fully independent embedding generation. Langchain's documentation recommends OpenAI embeddings, but I'm not confident that a large embedding model will create significantly better boundaries than a very tiny one for this task. And the potion model can embed millions of chunks in minutes on a CPU, for free!

A common piece of advice in semantic chunking is to use the same embedding model for both chunking and final embeddings, as this is often said to improve retrieval performance. For instance, if you use OpenAI embeddings for your search, the recommendation is to use the same embeddings for chunking. However, I haven't seen sufficient evidence to back this up. In practice, we can still achieve good results by embedding with a different model!
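
To make that split concrete, here's a minimal sketch that reuses the `semantic_chunker` defined above: the tiny potion model only decides where the chunk boundaries go, while a separate model (all-MiniLM-L6-v2 here, purely as an example) produces the embeddings that actually get indexed:

```python
from model2vec import StaticModel
from sentence_transformers import SentenceTransformer

# Tiny, CPU-friendly model that only decides where chunk boundaries go
boundary_model = StaticModel.from_pretrained("minishlab/potion-base-8M")
chunks, _, _ = semantic_chunker(long_text, boundary_model)

# A different (larger) model produces the embeddings that are actually indexed
index_model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = index_model.encode(chunks)

print(len(chunks), chunk_embeddings.shape)
```
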

## Advanced Chunking Strategies: Pushing the Boundaries

As I continued to explore and experiment, I found myself pushing into more advanced territory. Two approaches, in particular, caught my attention: clustering-based methods and LLM-based chunking.

### Clustering-Based Chunking: Dynamic Programming for Semantics

Clustering-based chunking uses unsupervised learning to group text by meaning. The ClusterSemanticChunker splits text into sentences, generates embeddings (e.g., with minishlab/potion-base-8M), and builds a similarity grid. Dynamic programming then optimizes chunks by maximizing intra-group similarity within a size limit, preserving sequence. It's fast with small models and captures semantic shifts well (great for papers or reports), but it scales poorly with long texts and may yield uneven chunks. Tuning the max size is key. Check out [this](https://github.com/brandonstarxel/chunking_evaluation/blob/main/chunking_evaluation/chunking/cluster_semantic_chunker.py) implementation if you want to see how it works.

### LLM-Based Chunking: The Double-Edged Sword

As large language models became more powerful and accessible, I couldn't resist the temptation to try using them for chunking. The idea is simple: why not leverage the language understanding capabilities of LLMs to identify optimal split points in the text?

Let's imagine how this would work:

![6de636b8-a548-45b3-bf3d-a5378af8e3af](https://hackmd.io/_uploads/rJqzNUr2kx.png)

Here's a basic implementation of LLM-based chunking, adapted from the Chroma Technical Report implementation:

```python
import re

import pysbd
from openai import OpenAI

# Set your OpenAI API key
OPENAI_API_KEY = "OPENAI_API_KEY"
client = OpenAI(api_key=OPENAI_API_KEY)

# Sample text to process
text = "Your sample text goes here."


def segment_text(text):
    """
    Segments text into sentences using pysbd.
    """
    # Initialize the pysbd segmenter
    segmenter = pysbd.Segmenter(language="en", clean=False)
    sentences = segmenter.segment(text)
    return sentences


def get_merge_indices(sentences):
    """
    Sends segmented sentences to the LLM to get merge indices.
    """
    # Insert artifacts <|start_chunk_X|> and <|end_chunk_X|> around each sentence
    chunked_text = ''
    for idx, sentence in enumerate(sentences):
        chunked_text += f"<|start_chunk_{idx}|>{sentence}<|end_chunk_{idx}|>"

    # Prepare the prompt for the LLM
    prompt = f"""
You are an assistant specialized in merging text into thematically consistent sections. The text has been divided into chunks, each marked with <|start_chunk_X|> and <|end_chunk_X|> tags, where X is the chunk number. Your task is to identify the points where merges should occur, such that consecutive chunks of similar themes stay together.

Respond with a list of chunk IDs where you believe a merge should be made. For example, if chunks 1 and 2 belong together but chunk 3 starts a new topic, you would suggest a merge after chunk 2. Your response should be in the form: 'merge_after: 2, 5'. Do not create chunks that are too small to have any meaning. Keep code snippets together.

CHUNKED_TEXT: {chunked_text}

Respond only with the IDs of the chunks where you believe a merge should occur. YOU MUST RESPOND WITH AT LEAST ONE MERGE. THESE MERGES MUST BE IN ASCENDING ORDER STARTING FROM 0.
""" # Call the OpenAI API to get the merge indices response = client.chat.completions.create( model='gpt-4o', messages=[ { 'role': 'user', 'content': prompt } ], temperature=0.0, ) # Extract the merge indices from the response response_text = response.choices[0].message.content.strip() print("LLM Response:", response_text) # Use regular expression to extract numbers after 'merge_after:' match = re.search(r'merge_after:\s*(.*)', response_text) if match: numbers = match.group(1) merge_indices = [int(num.strip()) for num in numbers.split(',')] else: print("No merge indices found.") merge_indices = [] return merge_indices def merge_chunks(sentences, merge_indices): """ Merges chunks based on the provided indices. """ chunks = [] current_chunk = '' for idx, sentence in enumerate(sentences): current_chunk += sentence if idx in split_indices: chunks.append(current_chunk) current_chunk = '' if current_chunk: chunks.append(current_chunk) return chunks # Step 1: Segment the text sentences = segment_text(text2) # Step 2: Get merge indices from the LLM merge_indices = get_merge_indices(sentences) # Step 3: Merge the segments based on the LLM's response merged_chunks = merge_chunks(sentences, merge_indices) # Print the merged chunks for i, chunk in enumerate(merged_chunks): print(f"\nMerged Chunk {i+1}:\n---------\n{chunk}\n") ``` ![image](https://hackmd.io/_uploads/Bk4sFrJMJx.png) The results were... interesting. On one hand, the LLM-based approach produced some of the most semantically coherent chunks I've seen. It was able to understand context, maintain thematic consistency, and even handle complex linguistic structures that tripped up other methods, while requiring basically zero hyperperameter tuning (the hyperperameter is the prompt). However, it also introduced a whole new set of challenges: 1. **Speed**: LLM-based chunking is slow. Really slow. Each API call takes time, and for long documents, you might need multiple calls. Of course, this can be parallelized with async calls, so you can chunk millions of documents in minutes (with high enough API limits). But each individual call will take at least a few seconds. 2. **Cost**: Using an LLM for chunking can get expensive quickly, especially for large datasets. 3. **Consistency**: The non-deterministic nature of LLM outputs means you might get different chunks each time you run the algorithm on the same text. This can be mitigated by setting the temperature to 0. 4. **Prompt Engineering**: The quality of the chunks depends heavily on how well you craft the prompt. It's yet another hyperparameter to tune. 5. **Token Limits**: Most LLM APIs have token limits for both input and output, which can make it challenging to chunk very long documents. Luckily we can just split our document into reasonable chunks before sending it to the LLM, but having to chunk just so we can chunk is annoying! In practice, I found that while LLM-based chunking can produce excellent results for small, sensitive datasets where quality is paramount, it's often not suitable for large-scale applications due to its speed and cost, as semantic chunking methods can offer similar performance with far less compute, especially if using small embedding models. But if chunking is a one-off task, and your dataset is small, this is a great option. Why not use an LLM? You won't have to scratch your head trying to set semantic chunking parameters, and you can focus on the rest of the pipeline. 

The best evaluation I've seen for different chunking strategies was the Chroma Technical Report: Evaluating Chunking Strategies for Retrieval. They found that semantic chunking techniques can often match an LLM. They also did some magic to compare chunking methods, creating a unique methodology and dataset in the process. If you are really interested in chunking, I recommend reading the whole thing and checking out the [code](https://github.com/brandonstarxel/chunking_evaluation).

![image](https://hackmd.io/_uploads/B1XfXryGJe.png)

## The Pragmatic Approach: Choosing the Right Tool for the Job

After all this experimentation, what have I learned? The key takeaway is that there's no one-size-fits-all solution to chunking. The best approach depends on your specific use case, the nature of your data, and your computational resources.

Here's my general recommendation for approaching chunking:

1. **Start Simple**: Begin with a basic structure-based method like `RecursiveCharacterTextSplitter`. It's fast, easy to implement, and often good enough for many applications. Still, take your hyperparameters very seriously, and confirm that you are getting good chunks visually!
2. **Evaluate and Iterate**: Test your chunking results. Look at the chunks, try them in your retrieval system, and see how they perform. If you're not satisfied, move on to more advanced methods.
3. **Consider Semantic Methods**: If maintaining context is crucial and you have the computational resources, try a semantic-aware or clustering-based approach.
4. **Use LLM-Based Chunking Sparingly**: Reserve LLM-based chunking for small, high-value datasets where quality is paramount and speed/cost are less of a concern.
5. **Combine Methods**: Don't be afraid to mix and match. For example, you might use a structure-based method for initial splitting, then use a semantic method to refine the chunks.

Remember, chunking is just one part of the pipeline. Often, improvements in other areas (like hybrid search or re-ranking) can have a bigger impact on overall performance than squeezing out the last bit of quality from your chunking algorithm.

## Chonkie

![34061849-beaf-41b3-b120-74cde09023e8](https://hackmd.io/_uploads/BynftIS2Jx.png)

One frustrating aspect of chunking is that the functionality is fragmented across several libraries, including Aurelio, LlamaIndex, and LangChain. LlamaIndex and LangChain, in particular, have become bloated and overly complicated, making them annoying to use for chunking. These libraries also aren't always the best-maintained or optimized for speed, contributing further to inefficiencies.

There's a library called Chonkie, which has recently grown to over 2,000 stars. It's extremely simple, has minimal dependencies, and excels at one thing: chunking. This is the library I typically use whenever advanced chunking techniques aren't necessary. I'm actually the first open-source contributor to it, and I believe it does a lot of things right! Its minimalistic design ensures it's both very fast and easy to use. If your goal is simply to get tasks done efficiently, Chonkie is the library I'd recommend. It offers multiple chunking strategies, from structure-based to semantic--and it is FAST.

![76b17e9d-64da-4a64-ac8f-ea6e47c8d681](https://hackmd.io/_uploads/BkFzFUBnJe.png)
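
To give a feel for the API, here's a minimal sketch using Chonkie's basic token chunker. The exact parameters and chunk fields are assumptions on my part (modeled on the LateChunker example later in this guide), so check the Chonkie docs for the current interface:

```python
from chonkie import TokenChunker

# Assumed interface; verify parameter names against the current Chonkie docs
chunker = TokenChunker(chunk_size=512, chunk_overlap=50)

# Chunkers are callable; each chunk carries its text and token count
chunks = chunker("Your long document text goes here...")
for chunk in chunks:
    print(chunk.token_count, chunk.text[:60])
```
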

### Simplifying RAG with KDB.AI + Chonkie: A Complete Tutorial

Retrieval-Augmented Generation (RAG) needs to be simple, efficient, and scalable. Combining **KDB.AI** and **Chonkie** achieves exactly this by providing excellent defaults, fast performance, and easy integration. Here's a step-by-step tutorial to get started quickly:

### 1. Getting Your KDB.AI API Key

First, sign up for a free-tier account at [KDB.AI](https://kdb.ai):

- Go to [kdb.ai](https://kdb.ai)
- Sign up and log in to your account
- Navigate to your dashboard, find your API key, and copy it for later use

### 2. Install Dependencies

Open a Python notebook and run:

```bash
pip install "chonkie[st]" kdbai-client sentence-transformers pandas requests
```

### 3. Set Up LateChunker

Late chunking is a chunking strategy that takes advantage of long-context embedding models to optimize document chunking. Rather than immediately splitting a document into fixed chunks and embedding each one in isolation, LateChunker first runs the entire document through the embedding model's context window, so that every token's embedding carries document-wide context. It then partitions these embeddings into meaningful chunks along sentence or token boundaries, ensuring semantic coherence and maximizing context utilization. This approach helps maintain contextual integrity across chunks, which is especially beneficial when embedding long or complex documents.

![image](https://hackmd.io/_uploads/HkhHgud2kx.png)

(Source: https://jina.ai/news/late-chunking-in-long-context-embedding-models/)

Initialize Chonkie's LateChunker in Python:

```python
from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="all-MiniLM-L6-v2",
    mode="sentence",
    chunk_size=512,
    min_sentences_per_chunk=1,
    min_characters_per_sentence=12,
)
```

### 4. Set Up KDB.AI as Your Vector Store

Create a KDB.AI session and database:

```python
import kdbai_client as kdbai

session = kdbai.Session(
    api_key="your_api_key",
    endpoint="your_endpoint"  # Find this in your KDB.AI dashboard
)

db = session.create_database("documents")

schema = [
    {"name": "sentences", "type": "str"},
    {"name": "vectors", "type": "float64s"},
]

indexes = [{
    'type': 'hnsw',
    'name': 'hnsw_index',
    'column': 'vectors',
    'params': {'dims': 384, 'metric': "L2"},
}]

table = db.create_table(
    table="chunks",
    schema=schema,
    indexes=indexes
)
```

### 5. Chunk and Embed Your Text

Use Paul Graham's essays for a practical example:

```python
import requests
import pandas as pd

urls = ["www.paulgraham.com/wealth.html", "www.paulgraham.com/start.html"]
texts = [requests.get('http://r.jina.ai/' + url).text for url in urls]

batch_chunks = chunker(texts)
chunks = [chunk for batch in batch_chunks for chunk in batch]

embeddings_df = pd.DataFrame({
    "vectors": [chunk.embedding.tolist() for chunk in chunks],
    "sentences": [chunk.text for chunk in chunks]
})

# Upload embeddings to KDB.AI
table.insert(embeddings_df)
```

### 6. Query Your Embeddings

Test the retrieval system:

```python
import sentence_transformers

search_query = "to get rich do this"
search_embedding = sentence_transformers.SentenceTransformer("all-MiniLM-L6-v2").encode(search_query)

results = table.search(vectors={'hnsw_index': [search_embedding]}, n=3)
for result in results[0]['sentences']:
    print(result)
```

### 7. Cleanup

Once finished, you can drop your database to save resources:

```python
db.drop()
```

### Why Choose KDB.AI + Chonkie?

- **Easy Setup**: Straightforward API and sensible defaults.
- **High Performance**: Generate embeddings in minutes on standard hardware.
- **Scalable**: Easily handles everything from small experiments to large-scale production.
- **Contextually Rich**: Late chunking significantly reduces retrieval inaccuracies.

Using KDB.AI with Chonkie means spending less time tuning infrastructure and more time enhancing your RAG systems.

## The Future of Chunking: What Lies Ahead

As we look to the future, I see several exciting developments on the horizon:

1. **Specialized Chunking Models**: I believe we'll see the development of smaller, specialized models trained specifically for text segmentation tasks. These could offer a middle ground between structure-based methods and full LLM-based chunking.
2. **Adaptive Chunking**: Future chunking algorithms might adapt their strategy based on the content and structure of the text, automatically choosing the best approach for each document.
3. **Multimodal Chunking**: As we deal with more diverse data types, we'll need chunking strategies that can handle not just text, but also images, audio, and video.
4. **Improved Evaluation Metrics**: We need better ways to evaluate chunking quality. I expect to see the development of more sophisticated metrics that consider both semantic coherence and retrieval performance.
5. **Integration with Retrieval Systems**: Chunking might become more tightly integrated with retrieval systems, with the chunking strategy adapting based on feedback from retrieval performance.

## Conclusion: Don't Overcomplicate Chunking

Chunking's deceptive simplicity hooked me--it's a small task with big complexity, mirroring AI's core struggles: efficiency vs. meaning, simplicity vs. depth. It's not flashy, but it's the bedrock of model performance, especially for LLMs and retrieval. I've learned it's worth mastering. My advice: start simple--don't overengineer unless it fails you.

Happy chunking!