# Building a Privacy-First Document QA System with Gaia and Qdrant

Here's an example project that turns your PDFs into a RAG pipeline so you can query them locally using a local Gaia node.

## What and Why?

Many of us work with PDFs daily: technical documentation, research papers, legal documents, and more. While tools like ChatGPT can help you understand these documents, they require uploading potentially sensitive information to external servers. Additionally, the responses aren't always grounded in the source material, which can lead to hallucinations.

## How?

Gaia PDF RAG addresses these challenges by combining several powerful technologies:

1. Local LLM processing using Gaia nodes
2. Efficient vector search with Qdrant
3. Smart reranking using cross-encoders
4. A privacy-first architecture

Let's dive into how it works and how you can use it.

## Code Overview

### 1. Document Processing

The first step is processing PDF documents into manageable chunks. Here's how we do it:

```python
def process_document(uploaded_file: UploadedFile) -> List[Document]:
    """Process uploaded PDF file into text chunks."""
    temp_file = tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False)
    temp_file.write(uploaded_file.read())
    temp_file.close()

    loader = PyMuPDFLoader(temp_file.name)
    docs = loader.load()
    os.unlink(temp_file.name)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=100,
        separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    )
    return text_splitter.split_documents(docs)
```

This code:

- Handles PDF uploads
- Splits documents into semantic chunks
- Preserves context through overlap
- Cleans up temporary files
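To build intuition for what `chunk_size=400` and `chunk_overlap=100` mean, here is a simplified, standalone sketch of overlap chunking. It is not LangChain's actual implementation (`RecursiveCharacterTextSplitter` additionally tries to break on the listed separators rather than at fixed offsets); it only illustrates how adjacent chunks share a window of text so context survives the split.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 100) -> list[str]:
    """Naive fixed-window chunker with overlap (illustration only).

    Each chunk starts (chunk_size - overlap) characters after the
    previous one, so consecutive chunks share `overlap` characters.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("abcdefghij" * 100)  # 1000-character sample
# The last 100 characters of one chunk equal the first 100 of the next,
# so a sentence cut at a chunk boundary still appears whole somewhere.
```

The real splitter behaves similarly but prefers to cut at paragraph, sentence, or word boundaries, which is why the `separators` list in `process_document` is ordered from coarsest to finest.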
### 2. Vector Storage with Qdrant

We use Qdrant for efficient vector storage and retrieval:

```python
def init_collection(client: QdrantClient):
    """Initialize Qdrant collection if it doesn't exist or has wrong dimensions."""
    try:
        collection_info = client.get_collection(COLLECTION_NAME)
        current_size = collection_info.config.params.vectors.size
        if current_size != VECTOR_SIZE:
            client.delete_collection(COLLECTION_NAME)
            raise Exception("Collection deleted due to dimension mismatch")
    except Exception:
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
        )
```

This ensures:

- Proper vector dimensions
- Cosine similarity search
- Efficient storage and retrieval

### 3. Smart Reranking

One key innovation is the use of cross-encoders for reranking:

```python
def re_rank_cross_encoders(prompt: str, documents: List[str]) -> Tuple[str, List[int]]:
    """Re-rank documents using cross-encoder model."""
    encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ranks = encoder.rank(prompt, documents, top_k=3)

    relevant_text = ""
    relevant_text_ids = []
    for rank in ranks:
        relevant_text += documents[rank["corpus_id"]]
        relevant_text_ids.append(rank["corpus_id"])
    return relevant_text, relevant_text_ids
```

This improves accuracy by:

- Re-scoring candidate passages
- Considering the full context
- Filtering out irrelevant results

### 4. Integration with Gaia

The local LLM integration happens through the Gaia node:

```python
def call_gaia_llm(context: str, prompt: str):
    """Call local Gaia node for chat completion."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context: {context}\nQuestion: {prompt}"},
    ]
    response = requests.post(
        f"{GAIA_NODE_URL}/chat/completions",
        json={"messages": messages, "stream": True},
        stream=True,
    )
    # The caller consumes the streamed response incrementally
    return response
```

## Results and Benefits

The combination of these technologies provides several advantages:

1. **Privacy**: All processing happens locally
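Since `call_gaia_llm` requests a streamed response, the caller still has to turn the server-sent-event lines into text. Here is a sketch of that parsing step, assuming the Gaia node emits OpenAI-style `data: {...}` chunks (the format its `/chat/completions` endpoint is modeled on); the sample payloads below are illustrative, not captured output.

```python
import json

def extract_stream_tokens(lines) -> list[str]:
    """Collect content tokens from OpenAI-style SSE lines.

    Each event line looks like `data: {"choices": [{"delta": {...}}]}`;
    the stream ends with the sentinel `data: [DONE]`.
    """
    tokens = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            tokens.append(delta["content"])
    return tokens

# Illustrative sample of what a streamed reply might look like:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]
print("".join(extract_stream_tokens(sample)))  # prints "Hello world"
```

In the app itself you would feed it the live response, e.g. `extract_stream_tokens(response.iter_lines(decode_unicode=True))`, or yield tokens one at a time so Streamlit can render the answer as it arrives.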
2. **Accuracy**: Cross-encoder reranking ensures relevant results
3. **Speed**: Local processing means fast responses
4. **Cost**: No API fees or usage limits
5. **Flexibility**: Easy to customize and extend

## Getting Started

Want to try it yourself? Here's how:

1. Set up your environment:

   ```bash
   git clone https://github.com/harishkotra/gaia-pdf-rag.git
   cd gaia-pdf-rag
   python -m venv venv
   source venv/bin/activate
   pip install -r requirements.txt
   ```

2. Start the required services:

   ```bash
   # Start Gaia node
   gaianet init
   gaianet start

   # Start Qdrant
   docker run -d -p 6333:6333 qdrant/qdrant
   ```

3. Run the application:

   ```bash
   streamlit run app.py
   ```

## Future Developments

This project is just the beginning. Future plans include:

- Multi-document support
- Additional file formats
- Custom embedding models
- Enhanced reranking strategies
- Document summarization

## Contribute

Gaia PDF RAG demonstrates that we can have powerful AI capabilities without compromising on privacy. By leveraging local LLMs, efficient vector search, and smart reranking, we can build tools that are both powerful and privacy-respecting.

The project is open source and welcomes contributions. Check it out on [GitHub](https://github.com/harishkotra/gaia-pdf-rag) and give it a try!

### Credits

Inspired by [this example](https://github.com/yankeexe/llm-rag-with-reranker-demo).