# DNA

## References

### Final Project
+ 🔗 [**Group 14 Proposal**](https://docs.google.com/document/d/1CGB8yD2HIMu5pKVhnQbqwxZgSlFjz6UVuXEDs7j6V8w/edit?usp=drivesdk)
+ 🔗 [**Group 14 Slide**](https://www.canva.com/design/DAG8PjtrCS0/RfcEWhAMs-PhxHxIbJ51rg/edit?utm_content=DAG8PjtrCS0&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton)
+ 🔗 [**Group 14 Paper**](https://1drv.ms/w/c/b5bbad118354394f/IQD0jzM7nasiQYrBN7y6e06mAVT0SG7AThqkeodO1MgNWTQ?e=zjBnrj)
+ 🔗 [**IEEE Paper Format**](https://www.scribbr.com/ieee/ieee-paper-format/)
+ 🔗 [**GitHub - Experiments (RogelioKG)**](https://github.com/RogelioKG/test-dna)
+ 🔗 [**GitHub - Experiments (Jay Wu)**](https://github.com/Jay910066/ML_final)

### Paper
+ 🔗 [**Nature Methods - Nucleotide Transformer: building and evaluating robust foundation models for human genomics**](https://www.nature.com/articles/s41592-024-02523-z)
+ 🔗 [**GitHub - Nucleotide Transformer**](https://github.com/instadeepai/nucleotide-transformer)
+ 🔗 [**GitHub - nucleotide_transformer.md**](https://github.com/instadeepai/nucleotide-transformer/blob/main/docs/nucleotide_transformer.md)
+ 🔗 [**Bioinformatics - DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome**](https://academic.oup.com/bioinformatics/article/37/15/2112/6128680)
+ 🔗 [**GitHub - DNABERT**](https://github.com/jerryji1993/DNABERT)

## Goals
+ Model: Nucleotide Transformer
+ Dataset: [DNA Sequence Prediction](https://www.kaggle.com/datasets/harshvardhan21/dna-sequence-prediction/code)

## Note
+ Biology refresher (review)
    1. DNA is transcribed into mRNA, which is then read three bases at a time (3-mers) during translation.
    2. Each mRNA codon (a 3-mer) is recognized by the anticodon of a matching tRNA (wobble base pairing complicates this, but we set it aside here).
    3. Each tRNA carries one specific amino acid.
    4. So the 3-mer composition of a DNA segment influences which amino acids it tends to produce.
    5. This in turn indirectly determines which amino acid chain (protein) the sequence is more likely to yield.
    6. Many proteins are enzymes with biological activity, so the protein lets us infer the biological function the sequence may ultimately have.
    7. This is why 3-mers are indeed an important feature for inferring the label.

## Features

| Column | Description | Type |
| --- | --- | --- |
| `Index` | Row index | Numerical |
| `NCBIGeneID` | NCBI Gene database ID | Numerical |
| `Symbol` | Gene symbol | Text |
| `Description` | Gene description | Text |
| `GeneGroupMethod` | Gene grouping method | Categorical |
| `NucleotideSequence` | **[Key]** Nucleotide sequence | Text |

## Labels

| Column | Description | Type |
| --- | --- | --- |
| `GeneType` | **[Key]** Gene function | Categorical |

| GeneType | Share |
| --- | --- |
| `PSEUDO` | 45.64% |
| `BIOLOGICAL_REGION` | 31.84% |
| `ncRNA` | 10.74% |
| `snoRNA` | 4.86% |
| `PROTEIN_CODING` | 2.21% |
| `tRNA` | 1.78% |
| `OTHER` | 1.60% |
| `rRNA` | 0.86% |
| `snRNA` | 0.46% |
| `scRNA` | 0.01% |
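
The 3-mer feature from the Note section could be computed from `NucleotideSequence` roughly as follows. This is a minimal sketch, not the method used in the linked experiment repositories: the helper names `kmer_counts` and `kmer_frequencies` are hypothetical, and it assumes overlapping windows (stride 1) because the reading frame of a sequence is not given. Windows containing anything other than A/C/G/T (for example `N`) are ignored.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int = 3) -> dict[str, int]:
    """Count overlapping k-mers (stride 1) in a nucleotide sequence.

    Hypothetical helper for illustration. The result always has 4**k keys
    in a fixed lexicographic order, so every sequence maps to a vector of
    the same length.
    """
    seq = sequence.upper()
    # Fixed vocabulary: AAA, AAC, ..., TTT for k = 3
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of width k over the sequence
    windows = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows with ambiguous bases are not in the vocabulary and are dropped
    return {kmer: windows.get(kmer, 0) for kmer in vocab}


def kmer_frequencies(sequence: str, k: int = 3) -> list[float]:
    """Normalize k-mer counts so sequences of different lengths are comparable."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    return [c / total if total else 0.0 for c in counts.values()]


if __name__ == "__main__":
    features = kmer_frequencies("ATGCGATG")
    print(len(features))  # 64 entries, one per possible 3-mer
```

The resulting 64-dimensional vector could serve as a simple baseline input to a classifier over `GeneType`, alongside the Nucleotide Transformer experiments.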