# Model Zoo
## What are the major problems?
## What are the corresponding datasets?
### Microbiology
Open Source Repository: NCBI Microbiome Central - A collection of databases and tools designed to support the study of microbiomes.
Space Experiment Data: NASA GeneLab - Provides datasets from numerous space biology experiments. Some of these experiments have focused on the effects of space on microbial organisms, including bacteria and fungi.
### Cell and Molecular Biology
Open Source Repository: GEO (Gene Expression Omnibus) - A public functional genomics data repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data.
Space Experiment Data: NASA GeneLab - Contains datasets from various cell biology experiments conducted in space. For instance, studies on human cells to understand the impact of microgravity on cellular function.
### Plant Biology
Open Source Repository: TAIR (The Arabidopsis Information Resource) - Provides a comprehensive collection of data and information on the genetics and molecular biology of the plant Arabidopsis thaliana.
Space Experiment Data: NASA GeneLab - Includes experiments that investigate the effects of spaceflight on different plant species. For example, how plants grow in microgravity or how space radiation affects plant genetics.
### Animal Biology
Open Source Repository: Ensembl - Offers high-quality genome-wide sequence and annotation data for vertebrates and key model organisms.
Space Experiment Data: NASA GeneLab - Houses datasets from experiments on various animals, like rodents, sent to space. These studies can range from understanding bone density loss in microgravity to more complex behavioral studies.
### Developmental, Reproductive and Evolutionary Biology
Open Source Repository: EvoDevoJ (Evolution & Development Journal) - While not a database in the traditional sense, this is a leading journal in the field of evolutionary developmental biology, and many articles provide supplemental data.
Space Experiment Data: NASA GeneLab - While it may not have a vast collection in this specific field, there are some datasets that explore how microgravity affects development, reproduction, and potentially evolutionary trajectories. For instance, studies might investigate how animals develop in space from embryo to maturity.
Microbiology
* Human Microbiome Project (HMP) Data: A comprehensive resource that has sequences of microbial genomes found in the human body.
* IMG/M: The Integrated Microbial Genomes & Microbiomes system offers tools for the analysis of microbial community genomes.

Cell and Molecular Biology
* The Cancer Genome Atlas (TCGA): Detailed genomic information for over 30 types of cancer.
* Gene Expression Omnibus (GEO): A public functional genomics data repository supporting MIAME-compliant data submissions.

Plant Biology
* The 1001 Genomes Project for Arabidopsis thaliana: Sequencing of over 1000 different strains of the model plant Arabidopsis.
* Plant PhenomeNET: A dataset connecting phenotypic effect with gene function in plants.

Animal Biology
* Mouse Genome Informatics (MGI): A comprehensive database on the genetics and genomics of the laboratory mouse.
* Zebrafish Model Organism Database (ZFIN): Provides integrated access to curated zebrafish genetic and genomic data.

Developmental, Reproductive and Evolutionary Biology
* FaceBase: Datasets aimed at studying craniofacial development and disorders.
* TreeBASE: A repository of phylogenetic information, specifically user-submitted phylogenetic trees and the data used to generate them.
* Bgee: A database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types such as RNA-seq, microarrays, and in situ hybridization.
## How do users use a pre-trained model?
### GeneLab data and corresponding Earth-trained models
RNA-Seq data from plants to study gene expression changes in space:
DeepCount: A deep learning model for predicting gene expression levels based on sequence information.
D-GEX: Uses deep learning to infer the expression of target genes from a small set of measured landmark genes.
Transformer models like BERT have been adapted to the life sciences: BioBERT for biomedical text, and tools like BioTransformers for biological sequences. Though they aren't pretrained on RNA-Seq data per se, the sequence-oriented variants can be fine-tuned on such data.
Microbial gene expression data to study microbial behavior in space:
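DeepMAsED: A deep learning-based method for detecting misassembled contigs in metagenome assemblies. It is a quality-control tool for microbial sequence data rather than a differential expression method.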
DRAGON: A deep learning model that can predict gene expression levels from the gene's regulatory region sequence.
Again, Transformer models adapted for biological sequences could be fine-tuned on microbial gene expression datasets.
Animal protein expression data to study protein synthesis changes in microgravity:
DeepProfile: Uses autoencoders to learn embeddings of gene expression profiles, which can be used for various downstream tasks.
DeepAffinity: Predicts compound-protein affinity using recurrent and convolutional neural networks.
AlphaFold: Though it's a model for protein structure prediction, it shows how effectively deep learning models can be used for protein-related tasks. Fine-tuning a model like AlphaFold on protein expression data might provide meaningful embeddings or predictions.
So, take using
### Other image data and corresponding pre-trained models
1. Microscopy of Cellular Structures: Observing cells in space can reveal how microgravity affects cellular structure and function. For instance, observing changes in the cytoskeleton of cells can provide insights into how cells sense and adapt to microgravity.
2. Bone Densitometry: Astronauts in space undergo bone density loss. Imaging the bone over time using densitometry can help in understanding the rate of bone degradation and the efficacy of countermeasures.
3. MRI Scans of Astronauts' Brains: Some studies have indicated changes in astronauts' brain structures after prolonged spaceflight. MRI scans can help in mapping these changes and understanding their implications.
4. Optical Coherence Tomography (OCT) for Eye Health: Extended space missions can affect eye health. OCT provides detailed images of the retina, helping in monitoring the health of an astronaut's eyes over time.
5. Biofilm Formation: Microorganisms in space have been observed to form biofilms differently than on Earth. Observing these structures can help understand microbial behavior in space.
6. Plant Growth Patterns: Microgravity affects how plants grow. Imaging the growth patterns can provide insights into plant behavior in space, crucial for potential long-term space missions where plants might be used for food and oxygen.

Pre-trained models that can be fine-tuned on these kinds of images:
* Convolutional Neural Networks (CNNs):
    * VGG (VGG16, VGG19): Well suited to basic image classification tasks and can be fine-tuned for specific space biology imaging data.
    * ResNet (ResNet50, ResNet101): Deeper architectures that can capture more complex patterns in images.
    * InceptionV3: Known for its efficiency and high performance in image classification.
    * U-Net: Particularly useful for segmentation tasks, such as segmenting specific cellular structures in microscopy images.
### An example
1. Goal & Hypothesis:
The space biologist aims to decipher how specific plant genes react to the microgravity conditions in space. She hypothesizes that certain genes play a pivotal role in plant adaptation to space and may be responsible for observed changes in growth or health.
2. Data Collection:
She begins with the Arabidopsis thaliana datasets OSD-427 and OSD-480 from NASA GeneLab which have RNA-Seq data of the plant in microgravity. She also has her own RNA-Seq data from a similar experiment she conducted recently.
3. Pre-trained Model Exploration:
Browsing the model zoo, she identifies a promising model from 2022 named scBERT, designed for single-cell RNA-Seq data. The model has been pre-trained on a vast array of Earth-based RNA-Seq datasets, making it adept at capturing the nuances of gene expression data.
4. Data Preprocessing:
Before using scBERT, she pre-processes the RNA-Seq data: she aligns and quantifies the reads, normalizes the gene expression values, and handles missing data.
5. Transfer Learning with scBERT:
She loads the scBERT model and fine-tunes it using her space-based RNA-Seq datasets: The model is trained on OSD-427, OSD-480, and her experiment data.
During training, she adjusts the model's parameters slightly to adapt its knowledge to the specifics of microgravity-based gene expressions.
6. Results & Interpretation:
Once training is complete, she uses the fine-tuned scBERT model to:
* Identify genes that have significantly altered expression in space.
* Understand the potential biological pathways impacted by these genes.
* Determine if any of these genes are associated with stress responses, growth patterns, or other vital processes in the plant.
7. Contribution:
The scientist uploads her fine-tuned model to the model zoo.
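
Steps 4 and 6 can be illustrated without any model at all. The following is a minimal sketch, not scBERT's API and not the biologist's actual pipeline: it assumes raw read counts per gene are already available, normalizes them to counts per million (CPM), and flags genes by log2 fold change between flight and ground samples. The gene names, the counts, and the threshold of 1.0 are made up for illustration.

```python
import math
from statistics import mean


def counts_per_million(counts):
    """Scale one sample's raw read counts so that they sum to one million."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_fold_change(flight_samples, ground_samples, pseudocount=1.0):
    """Per-gene log2 ratio of mean CPM in flight versus ground samples."""
    flight_cpm = [counts_per_million(s) for s in flight_samples]
    ground_cpm = [counts_per_million(s) for s in ground_samples]
    result = {}
    for gene in flight_cpm[0]:
        f = mean(s[gene] for s in flight_cpm)
        g = mean(s[gene] for s in ground_cpm)
        # The pseudocount avoids division by zero for unexpressed genes.
        result[gene] = math.log2((f + pseudocount) / (g + pseudocount))
    return result


# Hypothetical toy counts, not real GeneLab data.
flight = [
    {"GENE_A": 400, "GENE_B": 100, "GENE_C": 500},
    {"GENE_A": 400, "GENE_B": 100, "GENE_C": 500},
]
ground = [
    {"GENE_A": 100, "GENE_B": 100, "GENE_C": 800},
    {"GENE_A": 100, "GENE_B": 100, "GENE_C": 800},
]

lfc = log2_fold_change(flight, ground)
# Illustrative cutoff: at least a two-fold change in either direction.
altered = sorted(gene for gene, value in lfc.items() if abs(value) >= 1.0)
print(altered)
```

A real analysis would use replicate-aware statistics from a dedicated differential expression tool; the fold-change cutoff here only shows where such a step sits in the workflow.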
## Aim
1. To **design a comprehensive database** of publicly available biomedical datasets that could be used to pretrain different models for a “model zoo,” and
2. To determine relevant publicly available space biology datasets that could then be used to refine the models to investigate specific space biology questions.
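## Website/Database Design
1. To make the resources easy for scientists to use: APIs for Developers: Provide APIs for programmatic access, making it easier for developers and platforms to integrate and utilize the models and datasets.
2. Each model needs a description of how to use it, plus links to similar models (with tags for easy filtering).
3. Datasets should also be categorized: space or ground, imaging or genomic, species, sex (?!)
4. On preprocessing (explain in the presentation how each dataset is preprocessed):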
* Normalization: Ensure datasets have consistent scales or distributions, especially when combining them.
* Tokenization and Encoding for Genomic Data: Convert genomic sequences into a format suitable for ML (e.g., A=0, T=1, G=2, C=3).
* Data Augmentation for Imaging Data: Generate new training examples by applying various transformations (rotations, zooming, etc.).
* Handling Missing Values: Ensure that missing data points are either imputed or removed based on the dataset's nature.
* Feature Extraction: For complex datasets like imaging or time series, extract essential features to reduce dimensionality.
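
A few of the preprocessing steps listed above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the project's pipeline: the function names are hypothetical, min-max scaling stands in for whichever normalization a dataset needs, mean imputation is only one way to handle missing values, and augmentation is limited to 90-degree rotations of a small 2-D image.

```python
from statistics import mean

# Integer codes taken from the example mapping in the list above.
BASE_TO_INDEX = {"A": 0, "T": 1, "G": 2, "C": 3}


def encode_sequence(seq):
    """Map each base to its integer code; unknown bases (e.g. N) become -1."""
    return [BASE_TO_INDEX.get(base, -1) for base in seq.upper()]


def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def rotate90(image):
    """Rotate a 2-D image (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the image followed by its 90, 180 and 270 degree rotations."""
    views = [image]
    for _ in range(3):
        views.append(rotate90(views[-1]))
    return views


encoded = encode_sequence("ATGCN")
scaled = min_max_normalize(impute_missing([2.0, None, 6.0]))
views = augment([[1, 2], [3, 4]])
print(encoded, scaled, len(views))
```

Whether missing values are imputed or removed depends on the dataset, as noted above; `impute_missing` raises an error if every value is missing, which is a reasonable signal to drop that feature instead.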
## Topics in Space Biology
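1. Data we may use:
* Imaging: microscopy images, mouse micro-CT images, etc. (prioritize these for the ML demo, since we have experience with them)
* Genomic: genetic data (use these for the ML demo if time allows). Ex: Datasets such as The 1000 Genomes Project, NCBI's GenBank, and GWAS Catalog provide genetic and genomic data.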
3. [Space biology roadmap...](https://www.nasa.gov/wp-content/uploads/2015/03/16-03-23_sb_plan.pdf) (2016-2025), which mainly covers the following:
4. NASA missions: Free Flyer? ISS Space Biology?
5. A more detailed imaging question in space biology: Determining the dynamics and roles of various cellular organelles within cells is essential to understanding how the larger organism reacts and responds to microgravity. Dr. Rojas-Pierce and her group (http://dx.doi.org/10.4161/psb.29783 ; PubMed PMID: 25482812 , Dec-2014) are seeking to define the contribution of vacuolar and cytoskeletal dynamics to amyloplast sedimentation and gravitropic responses in shoots. Using an agravitropic mutant, the team has recently reported that the impaired vacuole formation is the result of a mutation in a vacuolar trafficking protein, resulting in multiple organelles instead of a large central vacuole. This protein has also been shown to regulate gravitropism and protein trafficking to the vacuole. Using a series of fluorescence microscopy techniques, the team demonstrated that the diffuse vacuoles are independent compartments and not connected to adjacent vacuoles, and that vacuole fusion in plants depends on phosphoinositides (Zheng, et al., 2014: PMID 25482812).
2. [paper](https://www.sciencedirect.com/science/article/pii/S221455242200102X?via%3Dihub) "Transfer learning as an AI-based solution to address limited datasets in space medicine" describes an approach we can use to handle limited space datasets. From the paper: "For a concrete example of the concept of transfer learning, consider the task for non-invasive detection of anemia in spaceflight with retinal images. With terrestrial training and target data, a deep learning prediction model to detect anemia with retinal images has recently been successfully developed (Mitani et al., 2020). However, in the case of space-derived training and target data, the prediction model results are likely to degrade. Transfer learning is needed when there is a limited supply of target training data, which is encountered in astronaut datasets (Weiss et al., 2016). Using existing, large datasets (terrestrial) which is related to the target domain of interest (space), presents a useful application of transfer learning (E Waisberg et al., 2022). In this example, transfer learning allows a neural network to be a viable option with an astronaut dataset that was previously considered to be too small (Fig. 1)."
2. A model pretrained on RNA sequencing data can be used as a base model for any space biology RNA sequencing dataset in the OSDR.
We will pick a few of the following topics to demo.
Genomic Responses in Microgravity: Spaceflight induces changes at the genomic level in various organisms. By leveraging models pretrained on vast genomic datasets, researchers can refine them using spaceflight-specific datasets to understand these unique responses better.
Bone Density and Muscle Atrophy: The prolonged stay in space causes astronauts to experience bone density loss and muscle atrophy. Transfer learning from datasets related to osteoporosis or muscular diseases can provide insights into space-specific conditions.
Immune System Changes: The immune system undergoes changes in space. Transfer learning can assist in analyzing immune response based on vast immunological datasets to highlight the space-specific deviations.
Radiation Damage: High radiation in space poses significant risks. Models trained on radiation effects on biological systems on Earth can be adapted using datasets from organisms exposed to space radiation.
Cardiovascular Alterations: Microgravity affects cardiovascular systems. By using models pretrained on cardiovascular datasets from Earth, transfer learning can help analyze space-specific changes.
Visual Impairment: Some astronauts experience visual impairments related to intracranial pressure. Transfer learning from datasets on related eye conditions can aid in understanding this space-specific phenomenon.
Microbial Behavior: The behavior of microbes (like bacteria and fungi) can change in space environments, impacting spacecraft's health and cleanliness. Transfer learning can assist in predicting microbial behavior using models trained on vast microbial datasets.
Plant Growth and Behavior: In the bid to grow food in space, understanding how plants respond to microgravity is essential. Transfer learning from extensive plant biology datasets can be pivotal in predicting and optimizing plant growth in space.
Neurological Effects: Space travel can influence the nervous system and cognitive functions. Transfer learning from neurobiology datasets can offer insights into these alterations.
Psychological and Behavioral Analysis: Prolonged space missions can affect astronauts' mental health. Transfer learning can help analyze psychological datasets to predict and address potential behavioral issues during extended spaceflights.
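
The transfer learning pattern that runs through the topics above (pretrain on a large terrestrial dataset, then fine-tune on a small space dataset) can be shown with a toy model. This is a sketch under loud assumptions: a logistic regression trained by gradient descent stands in for a deep network, both datasets are synthetic, and the "space" data simply uses a shifted decision boundary. All names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Probability of class 1; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def mean_log_loss(weights, data):
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(weights, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def train(data, weights, epochs, lr):
    """Full-batch gradient descent on the logistic loss, starting from `weights`."""
    weights = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in data:
            err = predict(weights, x) - y
            grad[0] += err
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad)]
    return weights


def make_data(rng, n, shift):
    """Synthetic two-feature samples; the label is 1 when x0 + x1 > shift."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 1 if x[0] + x[1] > shift else 0))
    return data


rng = random.Random(0)
terrestrial = make_data(rng, 500, shift=0.0)  # large source dataset
space = make_data(rng, 20, shift=0.3)         # small, shifted target dataset

# Pretrain from scratch on the source data, then continue on the target data.
pretrained = train(terrestrial, [0.0, 0.0, 0.0], epochs=200, lr=0.5)
fine_tuned = train(space, pretrained, epochs=50, lr=0.5)

loss_before = mean_log_loss(pretrained, space)
loss_after = mean_log_loss(fine_tuned, space)
print(f"target loss before fine-tuning: {loss_before:.3f}")
print(f"target loss after fine-tuning:  {loss_after:.3f}")
```

The key design point is that `fine_tuned` starts from `pretrained` rather than from zeros, so the small target dataset only has to adjust an already useful model. With a deep network, the same idea usually means loading pretrained weights and updating some or all layers at a low learning rate.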
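## How do users use the model zoo?
Explore and search: Dr. Smith is working on a space biology project studying the effect of microgravity on plant genes. She needs a model to help classify gene patterns. She visits the space biology model zoo and searches for models related to plant genetics.
View model details: She finds a model named "Plant Genome Sequence Classifier - Microgravity Effects". The model card explains that the model was trained on thousands of plant genome sequences and has been optimized to recognize patterns associated with microgravity effects.
Download/import the model: Dr. Smith downloads the model's weights and configuration files. The model zoo provides direct download links, as well as code snippets in popular programming languages to simplify the import process.
Fine-tune: Although the model looks promising, Dr. Smith has data collected from her own specific experiments. She decides to fine-tune the model on her own data to make sure it fits her experimental conditions.
Deploy and use: After fine-tuning, she deploys the model in her genome analysis pipeline. The model successfully helps her classify gene patterns, speeding up her research.
Feedback and contribution: A few months later, Dr. Smith improves the model further. She contributes her version back to the model zoo, along with notes on its improved performance on specific plant varieties.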
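
The search and contribution steps in the workflow above could be backed by something as simple as a tagged registry of model cards. The sketch below is one possible shape, not an existing API: `ModelCard`, `ModelZoo`, the model names, and the example.org URL are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCard:
    name: str
    description: str
    tags: set = field(default_factory=set)
    weights_url: str = ""  # hypothetical download link


class ModelZoo:
    def __init__(self):
        self._cards = {}

    def contribute(self, card):
        """Add a model card, replacing any existing card with the same name."""
        self._cards[card.name] = card

    def search(self, *tags):
        """Return cards that carry every requested tag, sorted by name."""
        wanted = {t.lower() for t in tags}
        hits = [
            card for card in self._cards.values()
            if wanted <= {t.lower() for t in card.tags}
        ]
        return sorted(hits, key=lambda card: card.name)


zoo = ModelZoo()
zoo.contribute(ModelCard(
    name="plant-genome-classifier-microgravity",
    description="Classifies plant genome sequence patterns.",
    tags={"plant", "genomics", "microgravity"},
    weights_url="https://example.org/weights/plant-genome-classifier",
))
zoo.contribute(ModelCard(
    name="rodent-bone-ct-segmenter",
    description="Segments bone in rodent micro-CT images.",
    tags={"animal", "imaging"},
))

# Explore and search: find models related to plant genetics.
hits = zoo.search("Plant", "genomics")
print([card.name for card in hits])

# Feedback and contribution: upload a fine-tuned variant under a new name.
base = hits[0]
zoo.contribute(ModelCard(
    name=base.name + "-smith-v2",
    description=base.description + " Fine-tuned on one lab's own data.",
    tags=base.tags | {"fine-tuned"},
))
plant_models = [card.name for card in zoo.search("plant")]
```

Storing contributions under a new name keeps the original card intact, so users can compare the base model with its fine-tuned descendants.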
## What should we put in the model zoo?
### Models
Models trained on the TCGA (The Cancer Genome Atlas) could be a good starting point for genomic tasks.
For imaging tasks, models trained on large-scale medical imaging datasets, such as those from the RSNA (Radiological Society of North America), could be adapted.
DeepBind:
Description: This deep learning model predicts DNA and RNA binding specificities for different proteins.
Potential Application: Understanding protein-DNA/RNA interactions in space biology, especially under microgravity conditions, to study gene regulation.
AlphaFold:
Description: Developed by DeepMind, AlphaFold predicts the 3D structures of proteins based on their amino acid sequences.
Potential Application: Predicting the structural changes of proteins that may be induced under space conditions, which can offer insights into functional alterations.
EpiDeep:
Description: This model predicts epigenomic features, such as histone modifications, from DNA sequences.
Potential Application: Understanding the epigenetic landscape of organisms in space and how it might differ from terrestrial conditions.
ResNet for Microscopy:
Description: Residual networks (ResNets) that are fine-tuned for high-content microscopy images.
Potential Application: Analyzing morphological changes in cells or tissues during spaceflight, including aspects like cell shape, organelle health, and cell-cell interactions.
seq2seq for DNA Sequences:
Description: Sequence-to-sequence models that can predict, for instance, potential coding sequences within DNA.
Potential Application: Discovering new genes or regulatory elements that become active in space environments.
### Collection of Publicly Available Biomedical Datasets
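What we can do is compile the Earth datasets from the major public platforms, organize them by file format, and add tags to them (like in Notion?). The NASA data should be compiled as well. For space data, we also need to list the space-specific conditions and turn them into filters. For example, Environmental Conditions: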
* Microgravity
* High Radiation
* Vacuum Exposure
* Earth
Images: Datasets like NIH's Chest X-ray dataset, The Cancer Imaging Archive (TCIA), and Dermatology Image databases can serve as a rich source for image-based diagnostics.
Genomics: Datasets such as The 1000 Genomes Project, NCBI's GenBank, and GWAS Catalog provide genetic and genomic data.
Proteomics and Metabolomics: The Human Protein Atlas and Metabolomics Workbench are great starting points.
Electronic Health Records (EHR): MIMIC-III and eICU are examples of databases containing EHR data.
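
The tagging and filtering idea for this collection can be sketched as a small catalog with facet filters. This is an assumption-laden illustration rather than a proposed schema: the field names are hypothetical, the spaceflight entry is a made-up placeholder, and the tags assigned to the two real datasets are simplified for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    source: str             # hosting platform
    origin: str             # "space" or "ground"
    data_type: str          # "imaging", "genomics", ...
    species: str
    environment: frozenset  # e.g. {"Microgravity", "High Radiation"}


def filter_datasets(entries, **facets):
    """Keep the entries that match every given facet.

    `environment` matches when the entry includes the requested condition;
    every other facet must be equal to the requested value.
    """
    result = []
    for entry in entries:
        matches = True
        for key, wanted in facets.items():
            value = getattr(entry, key)
            if key == "environment":
                matches = wanted in value
            else:
                matches = value == wanted
            if not matches:
                break
        if matches:
            result.append(entry)
    return result


catalog = [
    DatasetEntry("NIH Chest X-ray", "NIH", "ground", "imaging",
                 "Homo sapiens", frozenset({"Earth"})),
    DatasetEntry("1000 Genomes Project", "IGSR", "ground", "genomics",
                 "Homo sapiens", frozenset({"Earth"})),
    # Hypothetical placeholder entry, not a real repository record.
    DatasetEntry("example-spaceflight-rnaseq", "NASA OSDR", "space", "genomics",
                 "Arabidopsis thaliana",
                 frozenset({"Microgravity", "High Radiation"})),
]

space_genomics = filter_datasets(catalog, origin="space", data_type="genomics")
microgravity = filter_datasets(catalog, environment="Microgravity")
ground_imaging = filter_datasets(catalog, origin="ground", data_type="imaging")
print([entry.name for entry in space_genomics])
```

Keeping the environmental conditions as a set lets one dataset carry several conditions at once, which a single-valued field could not express.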
### Relevant Publicly Available Space Biology Datasets
NASA's GeneLab: Houses spaceflight and spaceflight relevant data which can be leveraged for understanding the impact of space on organisms.
NASA Open Science Data Repository (OSDR): As mentioned, this can be used for RNA sequencing datasets and other biological data types.
## Related papers
https://www.nature.com/articles/s41586-023-06139-9