# Sprint 2 UserStory

![image](https://hackmd.io/_uploads/ByEHCuvtA.png)

## Preprocessing
>[name=Fausto, David, Lance, Richard]

### Description
Enable the RAG system to support various document types (xls, pdf, txt, doc, csv, opendoc).

#### Input: The files in the /data/ folder
#### Output: The chunks to be uploaded into the vectorstore.

### DOD
- Build a document loader for different document types
- (Optional) Build a preprocessor for specific document types if needed (e.g. xls to markdown)
- Chunks maximize the token limit, i.e. chunks contain as much information as they can while still being divided along contextual boundaries (by sheet or page).

### Demo
- Integrate the preprocessor into the RAG system
- The number of chunks is the same as or similar to the number of original documents.

#### Tasks
- Create a chunker for any filetype
- Write a function to prevent metadata conflicts, i.e. return "cleaned" documents. :construction_worker:

---

## Summary
>[name=Fausto]

### Description
Generate summaries to allow the LLM to access generalized information about the documents.

#### Input: The files in the /data/ folder
#### Output: Summaries for each file with one chunk per summary

### DOD
- Create a summary for each document
- The summaries are saved in a JSON/txt file for future reference.

### Demo
- Allow the retriever to upload the chunks into a 'summaries' collection.

---

## Retrieval
>[name=David, Aaron]

### Description

#### Input: The files in the /data/ folder and summaries from the summary component
#### Output: A JSON object containing the summaries for each file.

### DOD
- The retriever can stop itself if the files in the /data/ folder have not changed.
- The retriever exists with multiple collections for different data structures, e.g. one folder -> one collection in the DB
- The retriever can be invoked by an LLM with metadata attached.

### Demo
- Data can be retrieved from the database according to the query.
- (Optional): The LLM can select between different types of retrievers, based on different collections.
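
One possible way to let the retriever stop itself when the /data/ folder is unchanged is to compare content hashes against a manifest saved by the previous run. The sketch below is only an illustration under stated assumptions: the function names `snapshot` and `changed_files` and the manifest file are hypothetical and not part of the existing pipeline, and the manifest is assumed to live outside the data folder so that it is not hashed itself.

```python
import hashlib
import json
from pathlib import Path


def snapshot(data_dir):
    """Map each file's path (relative to data_dir) to the SHA-256 of its contents."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir, manifest_path):
    """Return the sorted paths that were added, modified, or deleted since the
    last call, then save the current snapshot as the new manifest.

    Assumption: manifest_path is outside data_dir (hypothetical layout).
    """
    manifest = Path(manifest_path)
    previous = json.loads(manifest.read_text()) if manifest.exists() else {}
    current = snapshot(data_dir)
    changed = sorted(
        path
        for path in previous.keys() | current.keys()
        if previous.get(path) != current.get(path)
    )
    manifest.write_text(json.dumps(current, indent=2))
    return changed
```

An empty result means preprocessing and upload can be skipped for this run. A non-empty result lists exactly the files that would need re-processing, which could also serve an incremental update that only touches new or modified files.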

#### Tasks
- Create a file changes checker to avoid re-runs of pre-processing on unchanged data :construction_worker:
- Check for collection existence; if it does not exist, create a new collection based on the folder.
- Upload the data for each folder after the collection check.
- Troubleshooter can route between different types of retriever.
- (Optional) Create a file changes function that will only update new files

---

## Memory
>[name=Yihsin]

### Description
By adding a memory layer to RAG, enable continuous improvement of the system based on user feedback or evaluation results.

### DOD
- Survey mem0 memory reduction and storage method
- Test whether the "correction" prompt works
- Implement query-context retriever
- Implement memory reduction and storage

### Demo
- Live demo of POC
- Integrate memory into RAG pipeline

### Tasks
- Survey Milvus vector search :heavy_check_mark:
- Test current feedback system prototype :heavy_check_mark:
- Implement memory add, update, get and delete functions :construction_worker:
- Refine query vector retrieval precision, determine memory retrieval threshold
- Refine memory deduction and update prompts
- Integrate feedback system with simple RAG
- Make conversational demo for memory system
- Use method from Adrian to refine user method
- Think about different use cases: different feedback types? (on PEGAAi) different collections for different users

---

## Evaluation
>[name=Debby, Zhiting]

### Description
Build an evaluation pipeline and provide evaluation reports for other parts of the system. For more details, refer to [this notebook](https://hackmd.io/@manufacturing-ai-answer/B1DxLaxtA/edit).

### DOD
- Test validation of evaluation methods
- Build/refine evaluation pipeline

### Demo
- Evaluation reports with scores and current parameters.
- (Optional) Charts reflecting improvement/decline across time, as a reference for choosing better parameters.

### Tasks
- Create QA pairs based on question types
- Validate current evaluation method on different question types
- Validate current RAG system
- Survey how to spot RAG pipeline weaknesses
- Provide evaluation report

---

## Guardrails
>[name=Yuqing]

### Description
Survey [Guardrails](https://www.guardrailsai.com/docs) to add input/output guards to our RAG pipeline, enabling detection of unrelated questions or harmful answers.

### DOD
- Survey Guardrails, see whether it's suitable for our system
- Guardrails experiments and implementation

### Demo
- Live demo of POC

### Tasks
- Survey Guardrails
- Survey using an LLM to spot harmful/unrelated questions (prompt engineering)
- Test Guardrails performance
- Integrate Guardrails or a query checking mechanism into the RAG pipeline

# Other general tasks
- Build documentation

---

# Multi Agent
>[name=Interns]

### Tasks
- Survey different agent system integration methods: opengpt, crewai, genai @Lance
- Integrate tools @Lance
    - Build RAG tool @Fausto, David
    - Build memory tool @Yihsin
    - Build web search tool
- Design user scenario and new features @Aaron
