# Annotation Tool AI Documentation

## Categorization

## Parsing

### Introduction

The codebase is designed to handle and annotate TSV data by converting it into JSON format. It leverages the Pydantic library to enforce data structures, ensuring consistent and valid data throughout the annotation process. The main components of the code are the data structures for the various annotation categories, the annotation of the categories themselves, and the creation of a JSON file containing the annotations.

The scope of this milestone was to create annotations for the entities available in the current Neurobagel data model (i.e., ParticipantID, SessionID, Age, Sex, Diagnosis, and Assessment Tool).

### Assumptions and Requirements

Since we have separated the categorization and parsing steps, some assumptions have been made about the format of the LLM response for the different data model entities. The goal was to produce correct annotations with a minimum of information. Thus, the following LLM responses are assumed for the different entities:

- Participant ID: `{"TermURL":"nb:ParticipantID"}`
- Session ID: `{"TermURL":"nb:Session"}`
- Age:

  ```json
  {
      "TermURL": "nb:Age",
      "Format": "europeanDecimalValue"
  }
  ```

- Sex:

  ```json
  {
      "TermURL": "nb:Sex",
      "Levels": {
          "M": "male",
          "F": "female"
      }
  }
  ```

- Diagnosis:

  ```json
  {
      "TermURL": "nb:Diagnosis",
      "Levels": {
          "MDD": "Major depressive disorder",
          "CTRL": "healthy control"
      }
  }
  ```

- Assessment Tool:

  ```json
  {
      "TermURL": "nb:AssessmentTool",
      "AssessmentTool": "future events structured interview"
  }
  ```

Currently, the code depends on:

- `typing` for type hints on the defined functions (required for mypy)
- `pandas` for TSV file handling
- `json` for JSON handling
- `pydantic` for data validation and modularized parsing

### Representing Data Model Entities

Conceptually, the desired JSON output has a common base for all entities (i.e., the `IsAbout` section), but depending on the entity being handled, different additional fields are present. For instance, the example below demonstrates that `participant_id` contains the additional field `Identifies`, while `age` contains the additional fields `Transformation` and `MissingValues` but no `Identifies` entry.

```json
{
    "participant_id": {
        "Description": "A participant ID",
        "Annotations": {
            "IsAbout": {
                "Label": "Subject Unique Identifier",
                "TermURL": "nb:ParticipantID"
            },
            "Identifies": "participant"
        }
    },
    "age": {
        "Annotations": {
            "IsAbout": {
                "Label": "Age",
                "TermURL": "nb:Age"
            },
            "Transformation": {
                "Label": "integer value",
                "TermURL": "nb:FromInt"
            },
            "MissingValues": []
        },
        "Description": "The age of the participant at data acquisition",
        "Unit": "years"
    }
}
```

To keep the code readable, a separate class was defined for each entity to be annotated. The common base of the desired output, the `IsAbout` section, contains two strings (a `TermURL` and a `Label`) and serves as the initial data structure for all entities in the data model. The `IsAbout` base model is then specialized for each data model entity (e.g., `IsAboutParticipant`, `IsAboutSession`, etc.), adding the entity-specific (static) label.

```python
from pydantic import BaseModel, Field


class IsAboutBase(BaseModel):
    Label: str
    TermURL: str


class IsAboutParticipant(IsAboutBase):
    Label: str = Field(default="Subject Unique Identifier")
    TermURL: str
```
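The remaining entity classes follow the same pattern. As a hedged sketch (the class names match the `Annotations` union shown below, but the default labels other than `Age` and `Subject Unique Identifier` are illustrative assumptions, not taken from the codebase), they could look like this:

```python
# Sketch of the remaining IsAbout subclasses; default labels marked
# "assumed" are illustrative and not taken from the codebase.
class IsAboutSession(IsAboutBase):
    Label: str = Field(default="Session Identifier")  # assumed label
    TermURL: str


class IsAboutAge(IsAboutBase):
    Label: str = Field(default="Age")  # matches the JSON example above
    TermURL: str


class IsAboutSex(IsAboutBase):
    Label: str = Field(default="Sex")  # assumed label
    TermURL: str


class IsAboutGroup(IsAboutBase):
    Label: str = Field(default="Diagnosis")  # assumed label
    TermURL: str


class IsAboutAssessmentTool(IsAboutBase):
    Label: str = Field(default="Assessment Tool")  # assumed label
    TermURL: str
```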
The `IsAbout` section is encapsulated in the `Annotations` section, which can additionally contain different optional fields. To implement this, another data structure has been defined that contains the specific `IsAbout` section (depending on the `TermURL` entry returned by the LLM) and all the optional fields.

```python
from typing import Dict, Optional, Union


class Annotations(BaseModel):
    IsAbout: Union[
        IsAboutParticipant,
        IsAboutSex,
        IsAboutAge,
        IsAboutSession,
        IsAboutGroup,
        IsAboutAssessmentTool,
    ]
    Identifies: Optional[str] = None
    Levels: Optional[Dict[str, Dict[str, str]]] = None
    Transformation: Optional[Dict[str, str]] = None
    IsPartOf: Optional[Dict[str, str]] = None
```

As a final step in creating the JSON output format, fields such as `Description` and, in the case of categorical variables, the (optional) `Levels` present in the TSV file are added. To implement this, another data structure was introduced that contains these fields and the `Annotations` (which in turn contains the `IsAbout`).

```python
class TSVAnnotations(BaseModel):
    Description: str
    Levels: Optional[Dict[str, str]] = None
    Annotations: Annotations
```

Based on the data structures described above, annotations can be composed for each entity. The diagram below is a graphical representation of the data structures and how they are used for the final TSV annotation; the pink boxes represent fields that depend on the LLM response.

```mermaid
%%{init: {'theme': 'forest', "flowchart" : { "curve" : "basis" } } }%%
flowchart LR
subgraph TSV-Annotations
    Description([Description:\n set for each entity])
    Levels-Description([Levels-Description:\n used in Sex and Diagnosis, responded by \n the LLM, mapped to the pre-defined terms \nand used for annotation in Levels-Explanation])
    subgraph Annotations
        subgraph Identifies
            identifies([used for ParticipantID \nand SessionID])
        end
        subgraph Levels-Explanation
            levels-explanation([used to provide a TermURL and \nLabel for the Elements of \nLevels-Description])
        end
        subgraph Transformation
            transformation([used for Age, \nresponded by the LLM \nand used for annotation.])
        end
        subgraph IsPartOf
            ispartof([used for AssessmentTool,\n provides TermURL and Label\n for the Assessment Tool.])
        end
        subgraph IsAbout
            isabout([TermURL responded by \n the LLM categorization \n serves as controller \nfor further annotation])
        end
    end
end
style isabout fill:#f542bc
style transformation fill:#f542bc
style Levels-Description fill:#f542bc
style ispartof fill:#f542bc
```
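To make this composition concrete, the following is a minimal sketch of how the models above combine into a complete annotation for an `age` column. It assumes Pydantic v2 and that the classes defined above are in scope; the field values are taken from the earlier JSON example.

```python
# Minimal composition sketch; assumes the models above are in scope and
# Pydantic v2 is installed. Values mirror the earlier "age" JSON example.
annotation = TSVAnnotations(
    Description="The age of the participant at data acquisition",
    Annotations=Annotations(
        IsAbout=IsAboutAge(TermURL="nb:Age"),
        Transformation={"Label": "integer value", "TermURL": "nb:FromInt"},
    ),
)

# Optional fields that were never set (e.g., Identifies) are dropped.
print(annotation.model_dump_json(exclude_none=True, indent=4))
```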
### Functions

Creating the desired JSON output requires several steps, implemented by the following functions:

| Function | Purpose | Parameters |
| -------- | ------- | ---------- |
| `convert_tsv_to_dict` | Extracts the original column names (and their contents, for the LLM queries) from the TSV file. This serves as a preparation step for the query passed to the LLM, as well as for the creation of the "raw" JSON file. | Input: <br> `tsv_file: str` <br><br> Output: <br> `column_strings: Dict[str, str]` |
| `tsv_to_json` | Initializes a JSON file with the columns of the TSV file as keys and empty strings as values. | Input: <br> `tsv_file: str` <br> `json_file: str` <br><br> Output: <br> `None` |
| LLM categorization | See [here](https://hackmd.io/QymmEdIoTk-2g7-JNMq2lA#LLM-Utilisation--Annotation-Tool-AI-Documentation). | |
| `process_parsed_output` | Decides which handler function to call based on the `TermURL` of the LLM response. | Input: <br> `llm_output: Dict[str, Union[str, Dict[str, str], None]]` <br> `levels_mapping: Mapping[str, Dict[str, str]]` <br><br> Output: <br> `Union[str, TSVAnnotations]` |
| `handle_participant` <br> `handle_age` <br> `handle_categorical` <br> `handle_session` <br> `handle_assessmentTool` | Create the specific annotation instances, ensuring that each annotated column contains only the fields required for it. | Input (ParticipantID, Session, Age): <br> `llm_response: Dict[str, Any]` <br><br> Input (Sex, Diagnosis, Assessment Tool): <br> `llm_response: Dict[str, Any]` <br> `mapping: Mapping[str, Dict[str, str]]` <br><br> Output: <br> `TSVAnnotations` |
| `load_levels_mapping` <br> `load_assessmenttool_mapping` | Helper functions that provide the mappings (i.e., the TermURLs corresponding to a specific label such as "Male" or "Alexia"). `load_levels_mapping` is used for diagnosis and sex; `load_assessmenttool_mapping` is used for the assessment tool. Two functions are needed because the structures of the source files (`diagnosisTerms.json` and `toolTerms.json`) differ slightly. | Input: <br> `mapping_file: str` <br><br> Output: <br> `levels_mapping \| assessmenttool_mapping: Mapping[str, Dict[str, str]]` |
| `update_json_file` | Updates the "raw" JSON file with the processed data under the specific key (i.e., the original column name). | Input: <br> `data: Union[str, TSVAnnotations]` <br> `filename: str` <br> `target_key: str` <br><br> Output: <br> `None` |

## Main Script

The main script below demonstrates the complete process of annotating each column of the original TSV file.

```python
if __name__ == "__main__":
    file_path = "participants.tsv"
    json_file = "output.json"

    columns_dict = convert_tsv_to_dict(file_path)
    tsv_to_json(file_path, json_file)

    # Create output for each column
    for key, value in columns_dict.items():
        print("Processing column:", key)
        try:
            # Invoke the chain with the input data; the column information
            # is inserted into the prompt template
            input_dict = {key: value}
            llm_response = llm_invocation(input_dict)
        except Exception as e:
            print("Error processing column:", key)
            print("Error message:", e)
            continue  # skip this column so llm_response is never undefined
        result = process_parsed_output(llm_response)
        print(result)
        update_json_file(result, json_file, key)
```
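For reference, below is a minimal sketch of the I/O helpers used in the main script. The signatures follow the Functions table above, but the bodies are illustrative assumptions rather than the repository's actual implementation; `TSVAnnotations` is the Pydantic model defined earlier, and Pydantic v2 is assumed for serialization.

```python
import json
from typing import Dict, Union

import pandas as pd


def convert_tsv_to_dict(tsv_file: str) -> Dict[str, str]:
    # Map each column name to its contents joined into a single string,
    # so the column can be embedded in the LLM prompt. Sketch only.
    df = pd.read_csv(tsv_file, sep="\t")
    return {col: " ".join(df[col].astype(str)) for col in df.columns}


def tsv_to_json(tsv_file: str, json_file: str) -> None:
    # Initialize the "raw" JSON file: one key per TSV column, empty values.
    df = pd.read_csv(tsv_file, sep="\t")
    with open(json_file, "w") as f:
        json.dump({col: "" for col in df.columns}, f, indent=4)


def update_json_file(
    data: Union[str, TSVAnnotations], filename: str, target_key: str
) -> None:
    # Replace the placeholder under the original column name with the
    # processed annotation, serializing Pydantic models along the way.
    with open(filename, "r") as f:
        contents = json.load(f)
    contents[target_key] = (
        data if isinstance(data, str) else data.model_dump(exclude_none=True)
    )
    with open(filename, "w") as f:
        json.dump(contents, f, indent=4)
```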