# Neurobagel query-tool-ai

## Project Setup

- Clone the repo

  `git clone https://github.com/neurobagel/query-tool-ai.git`

- Create and activate a virtual environment

  `python3 -m venv venv`

  `source venv/bin/activate`

- Set up pre-commit (flake8, black, mypy)

  `pre-commit install`

- Complete the installations

  `pip install -r requirements.txt`

## Milestone 1 - Parsing the user prompt

This task is completed by leveraging LLMs. One major issue with LLMs is **hallucination**.

#### google/flan-t5-xxl

```python
chain = LLMChain(
    llm=HuggingFaceHub(
        repo_id="google/flan-t5-xxl", model_kwargs={"temperature": 0.01}
    ),
    prompt=prompt,
)
```

1. Extracting all parameters together

   - prompt: How many female subjects older than 50 with a Parkinson's diagnosis?

   ![image](https://github.com/Raya679/Healthcare-Chatbot/assets/113240231/e7ae683f-60f9-4696-b91e-d979f6a9d85e)

   **Issue** - The model assumes values that are not mentioned in the prompt, such as imaging_sessions and phenotypic_sessions.

2. Extracting values one by one

   - extract_age.py

     ![image](https://github.com/Raya679/Healthcare-Chatbot/assets/113240231/a6fd1306-3889-4a6a-99d3-760e54eae5b9)

   - extract_sex.py

     ![image](https://github.com/Raya679/Healthcare-Chatbot/assets/113240231/3aff87fd-f6a1-4a7e-b2be-0f3448ebd950)

   - extract_sessions

     ![image](https://github.com/Raya679/Healthcare-Chatbot/assets/113240231/a0505380-eeeb-4cfe-8442-93462a63d078)

   **Issue** - The model extracts the correct values, but the google/flan-t5-xxl model on Hugging Face has a rate limit, and it performed poorly on categorical values such as diagnosis, assessment tool, healthy control, and image modality.

#### llama-2

- Although it is a larger model, it was not able to provide accurate results.
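The llama-2 and gemma snippets below call an `extract_output` helper whose implementation is not shown in this document. A minimal sketch of what such a helper might look like (the behavior — pulling the first JSON object out of the raw model response — is an assumption, not the project's actual implementation):

```python
import json
import re
from typing import Optional


def extract_output(output: str) -> Optional[dict]:
    """Pull a JSON object out of a raw LLM response.

    Chat models often wrap their answer in prose, so we grab the outermost
    {...} span and try to parse it; return None if nothing parses.
    """
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

Note the greedy `.*`: with a single JSON object in the response this is fine, but a production helper would need stricter parsing.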
```python
llm = ChatOllama(model="llama2")
chain = LLMChain(llm=llm, prompt=prompt)
output = chain.run(user_query)
return extract_output(output)
```

**Example of LLM Response -**

![image](https://github.com/Raya679/Healthcare-Chatbot/assets/113240231/2026f9fe-2de6-4381-b944-ea9d3447b97f)

#### gemma

- This LLM is somewhat better than llama-2, but the hallucination issue still persists to some extent.

```python
llm = ChatOllama(model="gemma")
chain = LLMChain(llm=llm, prompt=prompt)
output = chain.run(user_query)
return extract_output(output)
```

**Examples of LLM Response -**

![image](https://github.com/Raya679/Healthcare-Chatbot/assets/113240231/9ea0108e-7103-4c00-a963-9eb4c49b8d37)
![image](https://github.com/Raya679/Healthcare-Chatbot/assets/113240231/f8bea6d3-8f26-43a5-93ed-03cc63b5855a)
![image](https://github.com/Raya679/Healthcare-Chatbot/assets/113240231/1f31e7b0-8492-4aef-ab6a-fa3153a877f9)
![image](https://github.com/Raya679/Healthcare-Chatbot/assets/113240231/dc8de042-29f9-4d6d-a71b-9b1580be63b4)
![image](https://github.com/Raya679/Healthcare-Chatbot/assets/113240231/07d8f0be-5ec7-439e-8ac5-8337ddda5400)

#### mistral

- So far this LLM has proved to be the best at extracting all the parameter values from the user query at once.
- A Pydantic model defines the schema for information extraction.

```python
class Parameters(BaseModel):
    """
    Parameters for information extraction.
    """

    max_age: Optional[str] = Field(description="maximum age if specified", default=None)
    min_age: Optional[str] = Field(description="minimum age if specified", default=None)
    sex: Optional[str] = Field(description="sex", default=None)
    diagnosis: Optional[str] = Field(description="diagnosis", default=None)
    is_control: Optional[bool] = Field(description="healthy control subjects", default=None)
    min_num_imaging_sessions: Optional[str] = Field(description="minimum number of imaging sessions", default=None)
    min_num_phenotypic_sessions: Optional[str] = Field(description="minimum number of phenotypic sessions", default=None)
    assessment: Optional[str] = Field(description="assessment tool used or assessed with", default=None)
    image_modal: Optional[str] = Field(description="image modality", default=None)
```

**Examples of LLM Response** -

![image](https://github.com/Raya679/Healthcare-Chatbot/assets/113240231/5be2f93a-16cc-4759-a830-83f79fc28d44)