# MedCAT v2

###### tags: `research` `ideas`

The idea is to change MedCAT linking so that it uses BERT instead of Word2Vec to compute context embeddings. To some extent we have taken the idea from SapBERT, which trains a BERT model (via Self-Alignment Pretraining) for linking. The input data for SapBERT is tuples of `(<concept_name>, <concept_cui>)`:

```
(diabetes, C0011849),
(dm, C0011849),
(cancer, C0006826),
...
```
Example 1.

During training these tuples are converted to triplets: one tuple is taken as the anchor (e.g. the first one in the example above), a search is performed to find another tuple with the same CUI (the positive match), and another search finds a tuple with a different CUI (the negative match). The triplet is then `(diabetes, dm, cancer)` with roles (anchor, positive, negative). The training objective is to make the BERT embedding of the positive (`dm`) similar to the anchor (`diabetes`) while making the negative (`cancer`) less similar to the anchor. The overall goal is for all synonyms of one concept (disease) to be similar to each other while being different from all other concept names in our ontology.

The problem with SapBERT is that it can only link names to concepts and it ignores the context in which a name was found. In other words, SapBERT cannot disambiguate names: the same name is always linked to the same concept regardless of the context in which it was found.

To solve this problem we merge the MedCAT approach and SapBERT. Given a dataset like MIMIC, we first use MedCAT to find all mentions of e.g. SNOMED concepts that do not need disambiguation - for example, an exact match of the name `Malignant Neoplasms` never needs disambiguation as it only (always) links to one concept. Once MedCAT runs through MIMIC it will find millions of annotations for concepts that do not need disambiguation.
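The triplet construction described above can be sketched as follows. The tuples and the `make_triplet` helper are illustrative only (not SapBERT's actual code), and we assume every CUI has at least two names so a positive always exists:

```python
import random

# Illustrative (name, CUI) tuples, as in Example 1.
tuples = [
    ("diabetes", "C0011849"),
    ("dm", "C0011849"),
    ("cancer", "C0006826"),
    ("malignant neoplasm", "C0006826"),
]

def make_triplet(anchor_idx, tuples, rng=random):
    """Build one (anchor, positive, negative) triplet for the given anchor."""
    name, cui = tuples[anchor_idx]
    # Positive: a different name with the same CUI (assumes one exists).
    positives = [n for n, c in tuples if c == cui and n != name]
    # Negative: any name with a different CUI.
    negatives = [n for n, c in tuples if c != cui]
    return (name, rng.choice(positives), rng.choice(negatives))

triplet = make_triplet(0, tuples)
print(triplet)
```

In practice many triplets are sampled per anchor (often with hard-negative mining), but the structure is the same.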
We now form the input in the same way as SapBERT, but instead of providing just the name we also provide the context in which the concept was found:

```
(...the patient was diagnosed with <diabetes mellitus> a hundred years ago..., C0011849),
(...patient's medical history: <diabetes mellitus>..., C0011849),
(...that the patient has <Malignant Neoplasms> but I'm not sure..., C0006826)
```
Example 2.

It is immediately obvious that with this approach we will only ever find full concept names. Abbreviations like `dm` for `diabetes` will never be found, because they are ambiguous and MedCAT does not know what to link them to - at this stage we are looking for 100% positive/true examples, so there is no disambiguation; we only use MedCAT to find what is certainly correct. There is a quick fix for this: since we use SNOMED/UMLS, we already know many synonyms for `diabetes`, so we modify the tuples in *Example 2* and replace the detected fully qualified names with synonyms in e.g. 50% of cases:

```
(...the patient was diagnosed with <dm> a hundred years ago..., C0011849),
(...patient's medical history: <diabetes mellitus>..., C0011849),
(...that the patient has <Malignant Neoplasms> but I'm not sure..., C0006826)
```
Example 3. After replacement.

Now we build triplets in the same way as in SapBERT and run the training, slightly modifying the last layer of the model by adding a max/avg pool layer over all the tokens that make up a concept. The pooled vector then goes into a standard FC layer, and the rest is the same as in standard BERT.

Notes:
- [Maybe] It is important to always use all concepts for training, as training one concept improves the others.
- Jointly doing MLM and the above similarity learning will probably work better.
- Compare only on concepts that have received training.
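The 50% synonym-replacement step (Example 2 to Example 3) can be sketched like this. The synonym table and the `<...>` span format are assumptions for illustration; real synonyms come from SNOMED/UMLS:

```python
import random

# Illustrative synonym table; in practice this comes from SNOMED/UMLS.
synonyms = {"C0011849": ["diabetes mellitus", "dm", "diabetes"]}

def replace_with_synonym(context, cui, p=0.5, rng=random):
    """In a fraction p of cases, replace the detected name (assumed to be
    wrapped in <...>) with a random synonym of the same CUI."""
    start = context.index("<") + 1
    end = context.index(">")
    if cui in synonyms and rng.random() < p:
        context = context[:start] + rng.choice(synonyms[cui]) + context[end:]
    return context

ctx = "...the patient was diagnosed with <diabetes mellitus> a hundred years ago..."
print(replace_with_synonym(ctx, "C0011849"))
```

Applied over the whole annotated corpus, this reintroduces ambiguous short forms like `dm`, but always paired with the correct CUI.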
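The max/avg pooling over a concept's token vectors amounts to the following. This is a pure-Python sketch with made-up 3-dimensional "hidden states"; a real implementation would operate on BERT's last-layer outputs (e.g. 768-dimensional) in PyTorch, and feed the pooled vector into the FC layer:

```python
def mean_pool(vectors):
    """Element-wise mean over a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def max_pool(vectors):
    """Element-wise max over a list of equal-length vectors."""
    return [max(col) for col in zip(*vectors)]

# Hidden states of the tokens that make up one detected concept (illustrative).
concept_tokens = [
    [0.1, 0.4, -0.2],
    [0.3, 0.0, 0.5],
    [0.2, 0.2, 0.1],
]
print(mean_pool(concept_tokens))
print(max_pool(concept_tokens))
```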
It does not make much sense to do linking on concepts that did not see even one example in the train set.

---

All:
- [ ] Read the SapBERT papers (two of them) [here](https://github.com/cambridgeltl/sapbert)
- [ ] Read about Hugging Face datasets and have a look at a MedCAT [example](https://github.com/CogStack/MedCAT/blob/master/notebooks/BERT%20for%20NER.ipynb)
- [ ] Documentation of [PyTorch Metric Learning](https://arxiv.org/pdf/2008.09164.pdf)

Leo:
- [x] Read [MedCAT](https://arxiv.org/abs/2010.01165), focus on Methods
- [x] Have a look at the MedCAT tutorials (tutorial folder in the repo)
- [ ] Create a training dataset for all concepts from MedMentions
- [ ] Make a training dataset for all concepts from MIMIC

Linglong:
- [x] Reproduce SapBERT itself and get familiar with the model framework
- [ ] Recreate SapBERT results on MedMentions (MM) -- modify the last output layer to get the representation (mean/max pool of the concept's last-layer tokens)
- [ ] Get scores for diseases only on MM

> The Hugging Face tokenizer produces subword representations (checked), which means a sentence is tokenized into at least as many tokens as it has words. It is probably better to switch the training-data pairs to word level for the context as well as the index (a solution to this has been designed). In addition, the max token length may need to be extended, since we add more context words.

Z:
- [x] Update MedCAT so that the length limit can be set strictly and also on non-preferred names
- [x] Share the UMLS CDB with Leo (and Linglong, just in case)
- [x] Send around the MedCAT BERT (de-id) notebook with dataset examples
- [ ] PC resources

Backlog:
- [ ] Train MedCAT v2 on MedMentions alone and check performance
- [ ] Train the model and test once more
- [ ] Fall back when a concept did not receive enough training examples
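Regarding the subword note above: mapping a detected concept's character span to the subword tokens that cover it (so we know which tokens to pool) can be sketched as follows. The token offsets are written by hand here, but they mimic what a Hugging Face fast tokenizer returns with `return_offsets_mapping=True`; the subword split itself is illustrative:

```python
text = "history of diabetes mellitus"

# (token, (char_start, char_end)) -- hand-made illustration of subword offsets.
offsets = [("history", (0, 7)), ("of", (8, 10)),
           ("diabete", (11, 18)), ("##s", (18, 19)),
           ("mell", (20, 24)), ("##itus", (24, 28))]

def tokens_in_span(offsets, start, end):
    """Indices of all subword tokens fully contained in [start, end)."""
    return [i for i, (_, (s, e)) in enumerate(offsets) if s >= start and e <= end]

concept_span = (11, 28)  # character span of "diabetes mellitus"
idx = tokens_in_span(offsets, *concept_span)
print(idx)
```

These indices are exactly the tokens the pooling layer would aggregate for that concept mention.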