# Sprint 2 User Stories

## Preprocessing
>[name=Fausto, David, Lance, Richard]
### Description
Enable the RAG system to support various document types (xls, pdf, txt, doc, csv, opendoc).
#### Input: The files in the /data/ folder
#### Output: The chunks to be uploaded into the vectorstore.
### DOD
- Build document loader for different document types
- (Optional) Build a preprocessor for specific document types where needed (e.g. xls to markdown)
- Chunks make full use of the token limit, i.e. each chunk contains as much information as it can while still being divided along contextual boundaries (by sheet or page).
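
One way to read the last DOD item is greedy packing: whole pages or sheets are appended to the current chunk until the next one would exceed the token limit. The following is a minimal sketch under assumptions, not the project's chunker: `pack_units`, `count_tokens` and `max_tokens` are hypothetical names, and whitespace-separated words stand in for real token counts.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Rough token estimate (assumption): whitespace-separated words."""
    return len(text.split())


def pack_units(
    units: List[str],
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily merge whole units (pages or sheets) into chunks.

    A unit is never split, so a single unit larger than max_tokens
    becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for unit in units:
        n = counter(unit)
        # Close the current chunk if adding this unit would overflow it.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(unit)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A real tokenizer can be passed in through `counter` so that the limit matches the embedding model in use.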
### Demo
- Integrate preprocessor to RAG system
- The number of chunks is the same as, or similar to, the number of original documents.
### Tasks
- Create a chunker for any filetype
- Write a function to prevent metadata conflicts, i.e. return "cleaned" documents. :construction_worker:
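
One possible interpretation of "cleaned" documents is that every chunk carries the same metadata keys with scalar values, so loaders for different file types cannot produce conflicting schemas. The sketch below assumes this reading. `Doc`, `DEFAULT_METADATA` and `clean_documents` are hypothetical names, and `Doc` is only a stand-in for whatever document class the loaders return.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List

# Hypothetical shared schema: key -> default value used when a loader omits it.
DEFAULT_METADATA: Dict[str, Any] = {"source": "", "file_type": "", "page": 0}


@dataclass
class Doc:
    """Stand-in for the loaders' document class (assumption)."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def clean_documents(docs: Iterable[Doc]) -> List[Doc]:
    """Return copies of docs whose metadata follows DEFAULT_METADATA exactly."""
    cleaned = []
    for doc in docs:
        meta = {}
        for key, default in DEFAULT_METADATA.items():
            value = doc.metadata.get(key)
            if value is None:
                value = default
            # Coerce to the default's type so every chunk has the same schema.
            try:
                meta[key] = type(default)(value)
            except (TypeError, ValueError):
                meta[key] = default
        # Keys outside the schema are dropped; the input docs are not mutated.
        cleaned.append(Doc(text=doc.text, metadata=meta))
    return cleaned
```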
---
## Summary
>[name=Fausto]
### Description
Generate summaries to allow the LLM to access generalized information about the documents.
#### Input: The files in the /data/ folder
#### Output: Summaries for each file with one chunk per summary
### DOD
- Create a summary for each document
- The summaries are saved in a JSON/txt file for future reference.
### Demo
- Allow the retriever to upload the chunks into a 'summaries' collection.
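
The DOD above could be sketched as follows: one summary per file, saved as a JSON object keyed by file name. This is an illustration under assumptions: `summarize_folder` is a hypothetical name, `summarize` is a placeholder for the LLM call, and files are read as plain text, whereas real xls or pdf files would first go through the document loaders.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def summarize_folder(
    data_dir: Path,
    summarize: Callable[[str], str],
    out_path: Path,
) -> Dict[str, str]:
    """Summarize every file in data_dir and save the result as JSON.

    `summarize` is a placeholder for the LLM call (assumption); it takes
    the file text and returns a summary string.
    """
    summaries: Dict[str, str] = {}
    for path in sorted(data_dir.iterdir()):
        if path.is_file():
            # Plain-text read only; binary formats need a loader first.
            text = path.read_text(encoding="utf-8", errors="ignore")
            summaries[path.name] = summarize(text)
    out_path.write_text(
        json.dumps(summaries, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return summaries
```

Each value in the saved file can then be uploaded as a single chunk to the 'summaries' collection.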
---
## Retrieval
>[name=David, Aaron]
### Description
#### Input: The files in the /data/ folder and the summaries from the Summary component
#### Output: A JSON object containing the summaries for each file.
### DOD
- The retriever skips re-processing when the files in the /data/ folder have not changed.
- The retriever supports multiple collections for different data structures, e.g. one folder -> one collection in the DB
- The retriever can be invoked by an LLM with metadata attached.
### Demo
- Data can be retrieved from the database, according to the query.
- (Optional): The LLM can select between different types of retrievers, based on different collections.
### Tasks
- Create a file changes checker to avoid re-runs of pre-processing on unchanged data :construction_worker:
- Check whether the collection exists; if not, create a new collection based on the folder.
- Upload the data for each folder after collection check.
- Troubleshooter can route between different types of retriever.
- (Optional) Create a file changes function that will only update new files
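
The file changes checker from the tasks above could be sketched with content hashes and a small manifest file. This is one possible approach, not a decided design: `changed_files`, `hash_file` and the JSON manifest are hypothetical, the manifest is assumed to live outside the /data/ folder, and deleted files are not reported.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def hash_file(path: Path) -> str:
    """SHA-256 of the file contents, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_files(data_dir: Path, manifest_path: Path) -> List[Path]:
    """Return new or modified files and refresh the manifest.

    The manifest (a hypothetical JSON file mapping relative path -> hash)
    records the state seen on the previous run.
    """
    old: Dict[str, str] = {}
    if manifest_path.exists():
        old = json.loads(manifest_path.read_text(encoding="utf-8"))

    new: Dict[str, str] = {}
    changed: List[Path] = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        key = path.relative_to(data_dir).as_posix()
        new[key] = hash_file(path)
        if old.get(key) != new[key]:
            changed.append(path)

    manifest_path.write_text(json.dumps(new, indent=2), encoding="utf-8")
    return changed
```

An empty return value means pre-processing can be skipped entirely; a non-empty one lists the only files that need to be re-chunked and uploaded.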
---
## Memory
>[name=Yihsin]
### Description
Add a memory layer to the RAG system to enable continuous improvement based on user feedback or evaluation results.
### DOD
- Survey mem0 memory reduction and storage method
- Test whether the "correction" prompt works
- Implement query-context retriever
- Implement memory reduction and storage
### Demo
- Live demo of POC
- Integrate memory into RAG pipeline
### Tasks
- Survey Milvus vector search :heavy_check_mark:
- Test current feedback system prototype :heavy_check_mark:
- Implement memory add, update, get and delete functions :construction_worker:
- Refine query vector retrieval precision, determine memory retrieval threshold
- Refine memory deduction and update prompts
- Integrate feedback system with simple RAG
- Make conversational demo for memory system
- use method from Adrian to refine user method
- Think about different use cases: different feedback types? (on PEGAAi) Different collections for different users?
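
The add, update, get and delete functions and the retrieval threshold from the tasks above could be prototyped in memory before touching the vector database. The sketch below is an assumption-laden illustration: `MemoryStore` is a hypothetical class, not mem0's or Milvus's API, and embedding vectors are supplied by the caller.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Memory:
    text: str
    vector: List[float]


class MemoryStore:
    """In-memory stand-in for the vector database (assumption)."""

    def __init__(self) -> None:
        self._items: Dict[int, Memory] = {}
        self._next_id = 0

    def add(self, text: str, vector: Sequence[float]) -> int:
        memory_id = self._next_id
        self._items[memory_id] = Memory(text, list(vector))
        self._next_id += 1
        return memory_id

    def update(self, memory_id: int, text: str, vector: Sequence[float]) -> None:
        if memory_id not in self._items:
            raise KeyError(memory_id)
        self._items[memory_id] = Memory(text, list(vector))

    def get(self, memory_id: int) -> Optional[Memory]:
        return self._items.get(memory_id)

    def delete(self, memory_id: int) -> None:
        self._items.pop(memory_id, None)

    def search(
        self, query: Sequence[float], threshold: float
    ) -> List[Tuple[int, float]]:
        """Return (id, score) pairs scoring at least `threshold`, best first."""
        hits = [(i, cosine(query, m.vector)) for i, m in self._items.items()]
        return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])
```

Running the same queries at several `threshold` values against labelled feedback is one way to pick the memory retrieval threshold.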
---
## Evaluation
>[name=Debby, Zhiting]
### Description
Build an evaluation pipeline and provide evaluation reports for the other parts of the system. For more details, refer to [this notebook](https://hackmd.io/@manufacturing-ai-answer/B1DxLaxtA/edit).
### DOD
- Test validation of evaluation methods
- Build/refine evaluation pipeline
### Demo
- Evaluation reports with scores and current parameters.
- (Optional) Charts showing improvement/decline across time, as a reference for choosing better parameters.
### Tasks
- create QA pairs based on question types
- validate current evaluation method on different question types
- validate current RAG system
- survey on how to spot RAG pipeline weakness
- provide evaluation report
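
An evaluation report with scores and current parameters, broken down by question type, might be assembled as in the sketch below. The field names `question_type` and `score`, the function `build_report` and the report layout are all hypothetical. How each score is computed is left to the evaluation method under validation.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List


def build_report(
    results: List[Dict[str, Any]], params: Dict[str, Any]
) -> Dict[str, Any]:
    """Aggregate per-question scores into a report.

    Each result is assumed to look like
    {"question_type": str, "score": float}; the field names are hypothetical.
    """
    by_type: Dict[str, List[float]] = defaultdict(list)
    for row in results:
        by_type[row["question_type"]].append(float(row["score"]))

    all_scores = [s for scores in by_type.values() for s in scores]
    return {
        "parameters": dict(params),  # e.g. chunk size, top-k, model name
        "n_questions": len(results),
        "overall_score": mean(all_scores) if all_scores else None,
        "score_by_type": {t: mean(s) for t, s in sorted(by_type.items())},
    }
```

Saving one such report per run, together with its parameters, gives the data needed for the optional charts over time.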
---
## Guardrails
>[name=Yuqing]
### Description
Survey [Guardrails](https://www.guardrailsai.com/docs) to add input/output guard to our RAG pipeline, enabling detection of unrelated questions or harmful answers.
### DOD
- Survey Guardrails to see whether it's suitable for our system
- Guardrails experiments and implementation
### Demo
- Live demo of POC
### Tasks
- survey Guardrails
- survey using an LLM to spot harmful/unrelated questions (prompt engineering)
- test Guardrails performance
- integrate Guardrails or query checking mechanism into RAG pipeline
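
Whichever checking mechanism is chosen, the integration point could look like the sketch below: an input guard before the RAG call and an output guard after it. All names here are hypothetical, and the three callables are placeholders that could wrap a Guardrails validator, an LLM prompt, or a simple rule. No Guardrails API is assumed.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that question."


def guarded_answer(
    query: str,
    is_query_allowed: Callable[[str], bool],
    answer_fn: Callable[[str], str],
    is_answer_safe: Callable[[str], bool],
) -> str:
    """Run the RAG pipeline only when both guards pass.

    The three callables are placeholders (assumptions) for the real
    input check, RAG pipeline, and output check.
    """
    if not is_query_allowed(query):
        return REFUSAL  # input guard: unrelated or harmful question
    answer = answer_fn(query)
    if not is_answer_safe(answer):
        return REFUSAL  # output guard: harmful answer
    return answer
```

Because the RAG call is skipped when the input guard fails, unrelated questions cost only one check.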
# Other general tasks
- Build documentation
---
# Multi Agent
>[name=Interns]
### Tasks
- survey different agent system integration methods: opengpt, crewai, genai @Lance
- integrate tools @Lance
- build RAG tool @Fausto, David
- build memory tool @Yihsin
- build web search tool
- design user scenario and new features @Aaron