---
title: "Using LLM models to decipher auto-immune diseases"
#date: 30 mai 2024
citeproc: true
bibliography: ["MachineLearning.bib", "Bio2Mlib.bib"]
geometry: margin=2cm
---
In the past 10 years, machine learning (ML) has gone from a research technology to a tool that is applied to everyday tasks and is present in nearly every device [@consens_transformers_2023].
It first reached the general public through image analysis applications such as face recognition, and is now used to generate images or videos of nearly anything.
In the healthcare field, ML can help in applications such as interpreting radiology images in cancer or predicting the best treatment protocols in precision medicine. These applications rely on different kinds of ML algorithms, but all work by analyzing thousands of features, such as patient attributes, treatment context or medical literature, depending on the context [@davenport_potential_2019].
The difficulty in machine learning approaches is the generation of models, which requires enormous datasets, numerous algorithms and huge computational power. In biology, moreover, the bottleneck for any ML approach is finding the data to train the models and formatting, organizing and standardizing them at a very large scale [@magnano_approachable_2022].
Until now, the solution has been to reduce or simplify the data used to train the models [see @consens_transformers_2023 for review].
Recently, new algorithms based on large language models (LLMs), such as LLMs with an attention map mechanism, have been developed to go further in understanding complex phenomena by revealing features and relationships that the human brain cannot perceive.
In the last few years, for example, LLMs have made it possible to infer the function of a protein from its primary sequence alone and, with the same data, to pinpoint amino acids that are important for the function, such as those in the active site [@buton_predicting_2023].
Beyond deciphering the primary sequence of proteins, work on DNA has already begun [@dalla-torre_nucleotide_2023; @ji_dnabert_2021; @consens_transformers_2023]. Some models can already identify promoter sequences or splice events and functionally classify genetic variants.
Even though some of these new models are freely available in biology, they are dedicated to specific tasks and cannot be applied to all fields of medical analysis. They usually combine genomics data with only one other kind of omics data, and are tailored to a very specific question [see @consens_transformers_2023].
In the context of Immun4Cure, it is crucial to develop models suited to more complex analyses, because auto-immune diseases are heavily multifactorial and more often than not lack genetic or biological markers.
The goal of this project is to develop these models and use them
to answer questions related to diagnosis and prognosis.
Immun4Cure plans to generate a large amount of data from different kinds of omics (mainly genomics, transcriptomics, and proteomics) associated with patients' medical information (metadata), and we will use these data to train our LLM models in order to tackle challenging multi-omics analyses. However, it will take time to acquire enough data for meta-analysis.
In the first step, we will use genomics and transcriptomics data because they are easy to aggregate. Meanwhile, public data are available, but the main difficulty is the quality and accessibility of their metadata. We have identified public projects with enough data and metadata, although the metadata are of variable quality, such as project PRJEB85597, with 468 samples on systemic lupus erythematosus.
This kind of data will be used to challenge the training process and evaluate different models.
The field being in its infancy, the first phase of the project will be devoted to setting up the methodology and acquiring new know-how. We will combine the complementary skills of each partner to address several challenges:
* A major effort will be the transformation of raw data into a standard format that can be used by algorithms. This first step will be greatly facilitated by our experience in data mining of large RNA-Seq datasets for ML [@silva_k-mer_2023 and Silva's thesis], and by the fact that we have a huge in-house collection of indexed RNA-Seq datasets.
* We will then test existing models (DNABERT, Nucleotide Transformer, scGPT, etc.) and fine-tune them with our data to rapidly get an idea of the challenges lying ahead [@ji_dnabert_2021; @cui_scgpt_2023; @dalla-torre_nucleotide_2023]. In a second step, we will develop an LLM specifically adapted to the multi-omics analysis of the auto-immune diseases studied in the Immun4Cure IHU.
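
As an illustration of what a standard, algorithm-ready format could look like for sequence data, the sketch below splits a nucleotide sequence into overlapping k-mers and maps them to integer ids, in the spirit of the k-mer tokens used by models such as DNABERT. It is a minimal example under stated assumptions, not the project's pipeline: the function names, the `[UNK]` token and the default `k = 6` are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, List


def kmer_tokenize(sequence: str, k: int = 6) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocabulary(sequences: List[str], k: int = 6) -> Dict[str, int]:
    """Map each observed k-mer to an integer id; id 0 is kept for unknown k-mers."""
    counts = Counter(tok for s in sequences for tok in kmer_tokenize(s, k))
    vocab = {"[UNK]": 0}  # hypothetical placeholder token
    for token in sorted(counts):
        vocab[token] = len(vocab)
    return vocab


def encode(sequence: str, vocab: Dict[str, int], k: int = 6) -> List[int]:
    """Turn a sequence into the list of ids a model would take as input."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in kmer_tokenize(sequence, k)]


tokens = kmer_tokenize("ACGTAC", k=3)
vocab = build_vocabulary(["ACGTAC"], k=3)
ids = encode("ACGTTT", vocab, k=3)  # k-mers absent from the vocabulary map to id 0
```

In practice, a pretrained model comes with its own tokenizer and vocabulary, so a hand-built vocabulary like this one would only serve for exploratory tests.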
In a second phase, the model will make it possible to identify specific markers to classify patients for diagnosis and prognosis. Because of its ability to "see" patterns not visible to the human eye, it is also expected to help answer new questions raised by clinicians and biologists.
We have put together 3 complementary teams to carry out this project:
* The Bio2M team will bring its expertise in transcriptomic analysis and large-scale projects. It will also be in charge of the transcriptomic analysis in the Immun4Cure IHU. It also has machine learning skills, as confirmed by a recent publication [@silva_k-mer_2023].
* The Medical Genetics department team will be responsible for the clinical data processing and genomic data analysis.
* The QuantaCell team will bring machine learning skills, in particular in tools such as the transformers used in LLMs.
* In addition to the 3 local partners of this project, we have a partnership with Yann Le Cunff's team at INRIA, Rennes, which works on the same questions at the amino acid level.
## References
::: {#refs}
:::
---
# Call for proposals "Immun4Cure Doctoral Fellowships" 2024
## PARTNER n°1
* **NAME/First name**: Jérôme Reboul
* **Position**: PharmD, PhD, CR Inserm
* **Research unit and/or Medical team**: U1183, Equipe 2 – Groupe Bio2M
* **Activity description in 10 lines max**:
The bioinformatics group includes biologists and bioinformaticians specialized in text-based algorithms who focus on designing new tools and structures for RNA-Seq analysis. We develop software and data structures for RNA-Seq data analysis (such as Gk-Arrays, CRAC, CracTools, ChimCT). We create new k-mer-based strategies capable of organizing reads to quickly respond to specific queries, such as the following pipelines: De-Kupl, Kmerator Suite and KmerExploR. Our current developments seek to establish a novel framework for large-scale OMICS data analysis in human health (https://transipedia.fr/). As a first proof of concept, we are attempting to establish a complete encyclopedia of k-mers as signatures of abnormal transcripts in the context of acute myeloid leukemia (AML). Establishing such signatures should make it possible to explore better diagnostic and prognostic models for new therapeutic strategies.
## PARTNER n°2
* **NAME/First name**: Kevin Yauy
* **Position**: CCA, MD in Medical Genetics, PhD in Machine Learning
* **Research unit and/or Medical team**: U1183, Equipe 3 – Groupe D. Geneviève and Service de Génétique Médicale, Génétique Moléculaire et Cytogénomique, CHU de Montpellier
* **Activity description in 10 lines max**: The Medical Genetics department includes clinicians, biologists and bioinformaticians focusing on rare disease diagnostics. In particular, we are a reference center for auto-inflammatory diseases and develop Data Science and Machine Learning approaches for genomic analysis (MPA [@yauy_b3gat3-related_2018], Genome Alert! [@yauy_genome_2022]) and clinical data processing (ClinFly [@gauthier_assessing_2023], PhenoGenius [@yauy_learning_2022]).
## PARTNER n°3
* **NAME/First name**: Victor Racine
* **Position**: PhD, Director of QuantaCell Company
* **Research unit and/or Medical team**: IRMB, QuantaCell Company
* **Activity description in 10 lines max**:
Our team consists of highly skilled experts with backgrounds in biomedical research, computer science, and data analysis. We bring together these diverse areas of expertise to offer a comprehensive and tailored service that meets the specific needs of our clients. With our deep understanding of both the biological and computational aspects of image analysis and artificial intelligence, we are able to deliver innovative solutions that accelerate research and drive scientific discovery.
* **PhD DIRECTORS**: Jérôme Reboul and Kevin Yauy
* **PhD HOME TEAM**:
CV of the PhD director to be attached to the file
* **ACRONYM**: AI4AI: Artificial Intelligence for Auto-Immune Diseases
* **TITLE**: Using LLM models to decipher auto-immune diseases
* **KEY WORDS 5 max**: Machine Learning, Large Language Model, multi-OMICS, Autoimmune Diseases