# Cohort 5 Interns mini-projects

## 1. Mining Arbovirus sequences from East Africa

**Assignees:** Denis, Sharon and Rodney

### Background

Arboviruses are viruses that are transmitted by arthropod vectors. Those of interest here include Chikungunya, dengue, West Nile, Yellow Fever, and Zika. The goal is to mine available data from GenBank and EBI/ENA, starting with sequences generated from East Africa and then expanding to the rest of the world.

There are two steps: data mining and analysis. The analysis step will be limited to East Africa, but the mining step also aims to contribute to the [PubSeq2](http://gemini.thebird.nl/gemini/blog/pubseq/doc/plan.gmi) project.

"The goal for PubSeq2 is to create and maintain a *pangenome* of all human pathogens with their *metadata* - as fetched from public resources (that means no data that lacks a license for sharing data). Fortunately there are some great public resources, including NCBI and EBI. The PubSeq2 project aims to bring people together to create a useful resource."

The code that fetches GenBank data [is available](https://github.com/pubseq/bh20-seq-resource/tree/master/workflows/pull-data/genbank) as starter code for the project. The data is to be published as RDF, which makes it machine-readable and opens it up to machine learning applications. An approach similar to that applied by [GeneCup](https://genecup.org/), which takes gene names and links them to a [PubMed abstract search](https://pubmed.ncbi.nlm.nih.gov/35285473/), can be used to build a pangenome of pathogens and identify all variants. The aim is to combine these data into a machine learning project.

### Aim

This project aims to fetch arbovirus data from East Africa and analyze trends by mapping the sequence data and research interest.

### Tasks

1. Create a roadmap for the mini-project.
2. Understand the PubSeq2 project and the source code for fetching the data.
3. Fetch the data from GenBank and ENA, including the corresponding metadata.
4. Perform exploratory analysis to identify some research questions that can be addressed.
   a) Can you identify transmission and occurrence patterns?
   b) Can you identify questions that can be addressed using machine learning?
5. Document your work clearly on GitHub using wikis and GitHub pages.
6. Document the papers you are reading, with a link to each paper and a sentence or two on why you included it.

## 2. Breast Cancer ONT and WES data analysis

**Assignees:** Audrey and Maxwell

### Background

Knowledge of a patient's BRCA1/2 mutation status is crucial for clinical management and family testing. Next-generation sequencing (NGS) has superseded Sanger sequencing, but large DNA rearrangements are not reliably detected with NGS. Nanopore sequencing is a novel technique that uses molecular pores to read long DNA fragments (kilobases). This study aims to evaluate the combined use of whole-exome sequencing (WES) and nanopore sequencing to comprehensively screen for BRCA1/2 germline mutations.

Three breast cancer patients had targeted sequencing for the BRCA1/2 genes. Two of these patients have previous WES data for comparison.

### Aim

Evaluate the combined use of whole-exome sequencing (WES) and nanopore sequencing to comprehensively screen for BRCA1/2 germline mutations.

### Tasks

1. Create a roadmap for the mini-project.
2. For both Nanopore and WES data:
   1. Perform quality analysis and cleaning of the raw data.
   2. Map reads to the reference genome and call variants.
   3. Prioritize, annotate, filter and classify variants.
   4. Identify known and novel mutations.
3. Evaluate and compare the utility of WES and ONT in screening for BRCA1/2 germline mutations.
4. Document your work clearly on GitHub using wikis and GitHub pages.
5. Document the papers you are reading, with a link to each paper and a sentence or two on why you included it.
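
As a starting point for the WES versus ONT comparison in the breast cancer project, the sketch below shows one possible way to compare two variant call sets once both have been called against the same reference. It is a minimal illustration, not a prescribed pipeline: the `Variant` type, the function names `parse_vcf_lines` and `compare_call_sets`, and the toy records (with made-up positions) are all assumptions introduced here. It matches variants by exact chromosome, position, REF and ALT, so it assumes indels have already been normalized (for example with `bcftools norm`), and it would not capture large rearrangements reported as structural variants.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical key used to decide whether two calls are the same variant."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_lines(lines: Iterable[str]) -> Set[Variant]:
    """Collect one Variant per ALT allele from tab-separated VCF text lines."""
    variants: Set[Variant] = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites list ALTs with commas
            if allele != ".":
                variants.add(Variant(chrom, int(pos), ref, allele))
    return variants


def compare_call_sets(wes: Set[Variant], ont: Set[Variant]) -> Dict[str, Set[Variant]]:
    """Split calls into those seen by both platforms and those unique to each."""
    return {
        "shared": wes & ont,
        "wes_only": wes - ont,
        "ont_only": ont - wes,
    }


# Toy records with made-up positions, for illustration only (not real BRCA1/2 calls).
WES_TOY = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t1000\t.\tA\tG\t50\tPASS\t.",
    "chr13\t2000\t.\tC\tT\t60\tPASS\t.",
]
ONT_TOY = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t1000\t.\tA\tG\t30\tPASS\t.",
    "chr13\t3000\t.\tG\tGA,GT\t25\tPASS\t.",
]

result = compare_call_sets(parse_vcf_lines(WES_TOY), parse_vcf_lines(ONT_TOY))
summary = {name: len(calls) for name, calls in result.items()}
print(summary)  # {'shared': 1, 'wes_only': 1, 'ont_only': 2}
```

In practice the input lines would come from the VCF files produced in the variant calling step, and the per-platform counts could be reported per patient for the two patients who have both WES and ONT data.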