###### tags: `MSc-Project` `Teaching`

# Honghan Wu's MSc Project Proposals of 2022-2023

- Primary supervisor: Dr Honghan Wu
- Informal discussion: honghan.wu@ucl.ac.uk
- Research lab: https://knowlab.github.io/

## [KG-EMD] Benchmarking distributed representations for Chemistry knowledge graph

Knowledge graphs (KGs) are a novel paradigm for representing and integrating data from highly heterogeneous sources, and have been shown to have huge potential in facilitating AI-driven approaches for deriving new knowledge. In the chemistry domain, ChEBI is a database and ontology of chemical entities of biological interest containing a wide range of manually curated data items [1-2]. ChEBI is widely used for many different purposes, including drug target identification [3] and gene studies [4].

Link prediction is a field in graph-based machine learning that aims to predict novel relationships between entities. When applied to biomedical knowledge graphs, link prediction can be a versatile and powerful method of hypothesis generation, e.g. for drug discovery (Abbas et al., 2021). A vast array of machine learning link prediction algorithms has emerged over the last decade.

In this project, you will implement and compare different off-the-shelf embedding models [5-7] and our own chemistry domain embeddings on the task of link prediction for ChEBI.

This is a collaboration with a start-up company from Norway, [IRIS.AI](http://iris.ai/), meaning you will be supported by colleagues from [IRIS.AI](http://iris.ai/) and other members of the KnowLab group.

### Reference

[1] Degtyarenko, Kirill, et al. "ChEBI: a database and ontology for chemical entities of biological interest." Nucleic acids research 36.suppl_1 (2007): D344-D350.

[2] Hastings, Janna, et al. "ChEBI in 2016: Improved services and an expanding collection of metabolites." Nucleic acids research 44.D1 (2016): D1214-D1219.

[3] Gao, Yu-Fei, et al. "Prediction of drugs target groups based on ChEBI ontology."
BioMed research international 2013 (2013).

[4] Bettembourg, Charles, Christian Diot, and Olivier Dameron. "Optimal threshold determination for interpreting semantic similarity and particularity: application to the comparison of gene sets and metabolic pathways using GO and ChEBI." PloS one 10.7 (2015): e0133579.

[5] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[6] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

[7] Chen, Qingyu, Yifan Peng, and Zhiyong Lu. "BioSentVec: creating sentence embeddings for biomedical texts." 2019 IEEE International Conference on Healthcare Informatics (ICHI). IEEE, 2019.

## [NHS-KG] Derive a knowledge graph from NHS A-Z website

NHS A-Z (https://www.nhs.uk/conditions/) contains professionally curated knowledge for many conditions, which typically includes disease descriptions, symptoms, diagnosis, medicine, lifestyle, regular check-ups, comorbid health problems, and help and support. For example, for type 2 diabetes, see https://www.nhs.uk/conditions/type-2-diabetes/.

While there exist many biomedical ontologies, much of the knowledge from NHS A-Z is not formally represented. Such knowledge is highly valuable for supporting tasks like clinical decision making, self-management, and pretraining transferable machine learning models. The project is to utilise information retrieval and natural language processing to derive a knowledge graph from this website.

## [NHS-LM] Pretrain a language model on NHS A-Z corpus

NHS A-Z (https://www.nhs.uk/conditions/) contains professionally curated knowledge for many conditions, which typically includes disease descriptions, symptoms, diagnosis, medicine, lifestyle, regular check-ups, comorbid health problems, and help and support.
For example, for type 2 diabetes, see https://www.nhs.uk/conditions/type-2-diabetes/.

In recent years, many pretrained large language models have been published to facilitate transfer learning (i.e., to support downstream tasks like document classification and named entity recognition). However, no effort has been put into making use of the NHS corpus for transfer learning. The project is to pretrain, further train or fine-tune language models on this valuable resource and showcase their utility in a few clinical NLP benchmarking tasks.

## [GP-NLP] Understand reasons for patient attendance at General Practice using natural language processing and machine learning

General practice clinical staff use free text (unstructured data) to record reasons for attending a general practice appointment. This often differs from diagnosis data, which is coded and therefore structured. A systematic and accurate understanding of the reasons for patient attendance would allow better resource management and more efficient health care.

In this project, you will have access to large-scale GP appointment data (>200k) from Frimley NHS Foundation Trust, with suitable computational resources available. You will devise and apply NLP models (such as RNN/Attention/Transformer based models) and machine learning models (such as clustering algorithms) to address the above research question.

## [CVD-ML] Reproducible machine learning based risk prediction models for cardiovascular diseases

Risk prediction and classification play an important role in managing many diseases. Machine learning has been widely applied to this task, for example for heart diseases [1]. Compared to 'traditional' regression based methods, ML methods are expected to address many challenges in risk prediction [2]. Reproducible ML models have been shown to be very valuable resources for realising high-performing and potentially personalised predictions [3].
However, it is difficult to obtain reproducible ML models for particular cohorts; for example, none of the models reviewed in [4] were reproducible.

To address this challenge, this project aims to conduct a large-scale survey of a general disease area, cardiovascular diseases, and identify reproducible ML risk prediction models. In particular, we will also collect metadata such as demographic and clinical characteristics of derivation cohorts, which have been shown to be extremely valuable for model ensembles [3]. Technically, this project will be conducted in a semi-automated manner combining **information retrieval, natural language processing and systematic review**.

**This project could be offered to one or two students.**

### Reference

[1] Khan, Younas, et al. "Machine learning techniques for heart disease datasets: a survey." Proceedings of the 2019 11th International Conference on Machine Learning and Computing. 2019.

[2] Goldstein, Benjamin A., Ann Marie Navar, and Rickey E. Carter. "Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges." European heart journal 38.23 (2017): 1805-1814.

[3] Wu, Honghan, et al. "Ensemble learning for poor prognosis predictions: A case study on SARS-CoV-2." Journal of the American Medical Informatics Association 28.4 (2021): 791-800.

[4] Wang, Minhong, et al. "Artificial intelligence models for predicting cardiovascular diseases in people with type 2 diabetes: A systematic review." Intelligence-Based Medicine (2022): 100072.

---

# 2021-2022 Projects

## Graph Neural Network Methods for Predicting Adverse Events of COVID-19 Drugs

Most drugs for treating COVID-19 are repurposed, meaning they were designed for treating other diseases but found to be effective in managing COVID-19.
While clinical trials have proven their effectiveness for the new use, it is not clear what adverse reactions these drugs might cause when used at scale across the whole population, which certainly has a much more diverse background than trial cohorts in terms of age group, ethnicity, underlying conditions and polypharmacy (being on other drugs).

Our group has been using network analysis and knowledge graph technologies for predicting adverse drug reactions. We have collated a large knowledge graph and established a computational pipeline for evaluating different algorithms. This project is to

- (1) implement state-of-the-art graph neural network algorithms in the pipeline;
- (2) improve existing algorithms or devise new ones to improve performance in predicting adverse events associated with repurposed COVID-19 drugs;
- (3) potentially evaluate new findings in the CVD-COVID-19 dataset.

## Multi-document automated coding of clinical notes with deep learning methods

Automated coding is the task of assigning clinical notes codes from a large classification system or ontology, e.g. the International Classification of Diseases (ICD). The task is usually formalised as a document classification problem and based on a single type of document (discharge summaries). Real-world coding is, however, based on multiple types of documents (e.g. it also requires radiology reports and other past information about the patient).

The key challenges lie in (i) the representation of multiple, long, semi-structured documents using deep neural network approaches, (ii) better language model pre-training methods for automated coding, and (iii) low-frequency labels, potentially addressed by integrating knowledge-based reasoning.

This project aims to investigate and improve advanced deep learning methods, e.g. large pre-trained language models (BERT) and convolutional and recurrent neural networks, with multiple types of documents, and potentially knowledge (rules, label relations), for automated ICD coding.
The coded results will also be analysed by selected diseases.

### Requirements

- Background in (health) data science, natural language processing (NLP), machine learning (ML), or related
- Experience or strong interest in NLP or ML projects
- Working knowledge of Python, PyTorch or TensorFlow, and other NLP/ML packages

## Knowledge Based Systems in Safety Critical Applications: a Literature Review and Associated Demonstration system(s)

The standard approach to applying Machine Learning / Neural Networks is to divide the dataset into two parts, where the first is used to train the ruleset and the second is used to test the inferred classifier. Provided the test results satisfy some criteria, the ruleset is then often used to classify or process instances from further datasets, which we will refer to as application datasets. What (objective) criteria are applied to determine whether the results with these application datasets are acceptable? This is a particularly sensitive matter for KBSs used in safety-critical applications.

The first part of this project will be a literature review of KBSs used in (medical) safety-critical applications, noting particularly the criteria applied to determine a) whether a particular ruleset should be applied to a particular dataset and b) whether the results proposed by the KBS are acceptable.

The second part of the project will be to incorporate some validation procedures into KBS(s) which are being applied to (medical) safety-critical applications. We have experience of using (representative) sets of expert-defined task-and-solution pairs (see References); in this approach, a KBS which solves all these tasks satisfactorily is considered to be an acceptable KBS. It is suggested that this approach might be investigated, together with various others, possibly some statistical, that are identified in the literature review.
Additionally, in some domains, rather than providing a very long list of acceptable instances, it might be more efficient to use a generalised rule which captures all the acceptable instances. A trivial example from the domain of triangles: rather than specifying a vast number of triangles, it would be far better to specify that the three angles of any triangle must sum to 180 degrees. In medicine, one might state that a patient's systolic blood pressure >= the patient's diastolic blood pressure.

### REFERENCES

- Craw, S & Sleeman, D. 1990. Automating the Refinement of Knowledge-Based Systems. In Proceedings of ECCAI-90. Luigia Aiello (Ed). London: Pitman, pp 167-172.
- Ferrucci, D, et al. (2010), "Building Watson: An Overview of the DeepQA Project", AI Magazine (AI Magazine) 31 (3).

---

# Projects of last years

## The implications of (Machine Learning) training data biases for clinical decision making.

A recent report showed that decision-makers at healthcare organisations are confident that AI will improve medicine, but roughly half of them think it will produce fatal errors. This project will use real-world datasets, initially for ICU patients, to assess in particular how biases in training data could lead to different outcomes if AI is used in clinical decision making.

**Supervisors:** Dr Honghan Wu, Prof Derek Sleeman

https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002689

https://www.bmj.com/content/368/bmj.m689

---

## Natural language processing for the automated extraction of immunohistochemical profiles from pathology reports

Diagnostic cellular pathologists make use of specific immunohistochemical markers when analysing pathological specimens to enable disease entity classification, provide prognostic information and guide targeted therapy. This information is textually recorded in pathology reports in a non-standardised manner.
This project seeks to develop a natural language processing tool for the automated extraction of these data to enable the generation of structured records of immunohistochemical profiles across large numbers of reports. Aggregated results produced using this approach will provide important real-world data on immunohistochemical profiles that will be of practical use for diagnostic pathologists and may provide novel clinical-pathological insights.

**Supervisors:** Dr Honghan Wu and Dr Adam P. Levine

---

## Automated medical coding of rare diseases from clinical notes

In short, this project is to use NLP to analyse discharge summaries to identify patients with rare diseases. The context is that rare diseases are not very well coded (in structured data like ICD-10 codes), and during the COVID-19 pandemic those patients are likely to be vulnerable but not included in the government's shielding list. Automatically identifying them could therefore potentially save their lives.

The dataset we are going to use is MIMIC III [1]. We have applied an NLP model (SemEHR [2]) to all the discharge summaries in MIMIC. The work will focus on using phenotypes (NLP results) to infer who had which rare disease using rare disease models [3].

We (KnowLab) have 2 postdocs working on related projects, so you won't be working alone.

### Reference

[1] https://mimic.physionet.org/

[2] https://academic.oup.com/jamia/article/25/5/530/4817428

[3] https://www.genomicsengland.co.uk/about-genomics-england/the-100000-genomes-project/information-for-gmc-staff/rare-disease-documents/rare-disease-eligibility-criteria/