---
tags: BMMB554-23
---
# Class projects
## Logistics
We will have **individual** projects after all. Here are the logistical details.
1. Pick a project from the list below. If you want, you can propose your own project as well.
2. Fill in this poll to indicate:
   - Which project you select
   - Time you can meet with me to discuss the details
## Projects
### 1kGP + T2T
To demonstrate the scalability of public computational infrastructure, we will analyze the full set of [3,202](https://www.sciencedirect.com/science/article/pii/S0092867422009916) samples against the latest human genome [T2T reference](https://www.science.org/doi/10.1126/science.abj6987) using the [DeepVariant caller](https://www.nature.com/articles/nbt.4235). However, before we can begin the full-scale analysis, we need to perform a feasibility study by running DeepVariant (DV) and FreeBayes (FB) on several samples and comparing the results. All necessary tools are already integrated into Galaxy.
:::info
You will be provided with a set of Illumina reads for two 1kGP samples and will need to perform (1) QC, (2) mapping, (3) filtering, (4) variant calling with DV and FB, and (5) result tabulation.
:::
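
As an illustration of step (5), here is a minimal Python sketch of one possible way to tabulate agreement between the two callers. It assumes the DV and FB calls have been exported as plain-text VCF lines for the same sample; the function names `variant_keys` and `tabulate_concordance` are hypothetical and not part of Galaxy or either caller. It also assumes the two call sets use the same variant representation, since unnormalized indels can make identical variants look different.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines.

    Header lines (starting with '#') and blank lines are skipped.
    Multi-allelic records are split into one key per ALT allele.
    """
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def tabulate_concordance(dv_lines, fb_lines):
    """Count variants shared by both callers and unique to each."""
    dv = variant_keys(dv_lines)
    fb = variant_keys(fb_lines)
    return {
        "shared": len(dv & fb),
        "dv_only": len(dv - fb),
        "fb_only": len(fb - dv),
    }


if __name__ == "__main__":
    # Toy records for illustration only; these are not real calls.
    dv_calls = ["#CHROM\tPOS\tID\tREF\tALT", "chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT"]
    fb_calls = ["chr1\t100\t.\tA\tG", "chr1\t300\t.\tG\tA,C"]
    print(tabulate_concordance(dv_calls, fb_calls))
```

The same counts can be broken down further (for example by SNV versus indel) once the basic table looks sensible.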
### Structure prediction with AlphaFold
Pick a biologically interesting protein and try to solve its structure using [AlphaFold](https://www.nature.com/articles/s41586-021-03819-2). The key to the success of this project is to pick an *interesting* protein and justify your choice. AlphaFold is integrated into Galaxy and should be easy to run.
:::info
You will be provided with instructions on how to run AlphaFold and interpret its results. You will need to think carefully about your choice of an interesting protein.
:::
### Coverage analysis of duplicated genes
We have identified a duplication of a critical transcription factor gene in the Philippine flying lemur ([*Cynocephalus volans*](https://en.wikipedia.org/wiki/Philippine_flying_lemur)). To validate this result, we need to analyze read coverage at and around the two loci.
:::info
You will be given a set of PacBio HiFi reads used for genome assembly. You will need to (1) map the reads back to the assembly, (2) compute per-base coverage, and (3) plot the coverage across the two loci and surrounding regions.
:::
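
To make step (2) concrete, here is a minimal Python sketch of how per-base coverage can be computed from alignment intervals. It assumes the mapped reads have already been reduced to 0-based, half-open `(start, end)` intervals on a single contig; the function names `per_base_coverage` and `windowed_mean` are hypothetical. In practice a dedicated tool such as `samtools depth` would typically produce these numbers directly from the BAM file, so treat this only as an illustration of the idea.

```python
from itertools import accumulate


def per_base_coverage(alignments, region_start, region_end):
    """Per-base read depth over [region_start, region_end).

    Uses a difference array: +1 where an alignment starts, -1 where it
    ends, then a running sum gives the depth at every position.
    """
    length = region_end - region_start
    diff = [0] * (length + 1)
    for start, end in alignments:
        # Clip each alignment to the region of interest.
        s = max(start, region_start) - region_start
        e = min(end, region_end) - region_start
        if s < e:
            diff[s] += 1
            diff[e] -= 1
    return list(accumulate(diff[:-1]))


def windowed_mean(coverage, window):
    """Mean depth in consecutive windows; the last window may be shorter."""
    means = []
    for i in range(0, len(coverage), window):
        chunk = coverage[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means


if __name__ == "__main__":
    # Toy intervals for illustration only.
    reads = [(0, 5), (3, 8), (7, 12)]
    depth = per_base_coverage(reads, 0, 10)
    print(depth)
    print(windowed_mean(depth, 5))
```

If the duplication is real, both copies and their flanking regions would be expected to show depth similar to the rest of the assembly, whereas a collapsed or artifactual duplication would tend to show a clear shift in depth. Averaging over windows makes such shifts easier to see in a plot.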
### Estimation of mutational target size
Given a mutation rate estimated from an experimental evolution study, we need to simulate mutation accumulation in parallel lines to identify genes that mutate in parallel "more than by chance".
:::info
You will be provided with a (suboptimal) R script that needs to be refactored to make the simulation time-efficient.
:::
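
To illustrate the kind of simulation involved, here is a minimal Python sketch of one possible null model. It is not the provided R script. It assumes that each mutation lands in a gene with probability proportional to gene length (a stand-in for mutational target size) and that every line accumulates the same number of mutations; the function names, parameters, and toy values are all hypothetical.

```python
import random
from collections import Counter


def simulate_parallel_hits(gene_lengths, mutations_per_line, n_lines, n_reps, seed=0):
    """Null distribution of the most widely shared mutated gene.

    For each replicate, every line receives `mutations_per_line` mutations
    drawn with probability proportional to gene length. The value recorded
    is the largest number of lines in which any single gene was hit.
    """
    rng = random.Random(seed)
    genes = list(gene_lengths)
    weights = [gene_lengths[g] for g in genes]
    max_shared = []
    for _ in range(n_reps):
        lines_hit = Counter()
        for _ in range(n_lines):
            # A gene counts once per line, however many times it was hit.
            hit = set(rng.choices(genes, weights=weights, k=mutations_per_line))
            lines_hit.update(hit)
        max_shared.append(max(lines_hit.values()))
    return max_shared


def empirical_p_value(null_values, observed):
    """Fraction of simulated replicates at least as extreme as observed."""
    return sum(v >= observed for v in null_values) / len(null_values)


if __name__ == "__main__":
    # Toy genome for illustration only: 200 genes of 1 kb each.
    toy_genes = {f"gene{i}": 1000 for i in range(200)}
    null = simulate_parallel_hits(toy_genes, mutations_per_line=5, n_lines=10, n_reps=1000)
    print(empirical_p_value(null, observed=4))
```

A gene observed to be mutated in more lines than almost all simulated replicates is a candidate for mutating in parallel "more than by chance" under this simplified model.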
### Pick a project of your choice from GTN
Pick any tutorial from [GTN](https://training.galaxyproject.org/) that is close to the analyses you are doing and use it to interpret your own data.
## Project distribution across class (Spring 2023)
:::warning
Project meetings will be conducted **over Zoom**.
:::
### Pick a project of your choice from GTN:
:watch: 4/18/23 3:30 PM
Andrew Sugarman
Jeong Han
Venitha Bernard
Abigail Sequeira
### Estimation of mutational target size:
:watch: 4/18/23 4:00 PM
Samantha Seibel
TQ Smtih
### 1kGP + T2T:
:watch: 4/19/23 4:45 PM
Daniela Betancurt (?)
### Structure prediction with AlphaFold:
:watch: 4/21/23 3:45 PM
Yao Tu
Mengzhu Tang
Megan Nitchman
Srijana Adhikari
------
code: `exejcdk`