# Interns Cohort 4 Mini-projects

The goal of the cohort 4 mini-projects is to ensure that learners acquire the skills to build standardized pipelines for bioinformatics analysis and to build machine learning models for insect identification. The projects focusing on reproducibility will implement **Nextflow** as the workflow language and **Docker or Singularity** containers.

## 1. Metagenomics (Targeted Amplicon Diversity Analysis) Workflow

Assignees: Nelly and Collins

### Aim

The aim of this mini-project is to standardize the 16S ribosomal RNA analysis pipeline. The pipeline should implement both the DADA2 and QIIME workflows. It should be written in Nextflow DSL2 syntax using either Docker or Singularity containers. The container should be built in either the Quay registry or Docker Hub.

### Tasks

1. Familiarize yourself with 16S rRNA analysis
    - Go through 16S rRNA documentation online. An example:
        - https://h3abionet.github.io/H3ABionet-SOPs/16s-rRNA-1-0.html
    - Go through these pipelines
        - https://github.com/h3abionet/TADA
        - https://github.com/mbbu/16S_Accreditation
2. Containerize the tools needed for analysis
    - Create a Dockerfile or Singularity definition file
    - Upload your Dockerfile to the Quay registry or Docker Hub via GitHub
    - Build the container
3. Write the 16S pipeline in Nextflow DSL2
4. Document your work clearly on GitHub using wikis and GitHub Pages.

**N/B:**

- Use the test dataset from either of the listed pipelines.
- Demonstrate collaborative research skills, informative visualization, and report writing.
- We look forward to seeing a pipeline that emulates the following:
    - https://github.com/biocorecrg/master_of_pores
    - https://github.com/nanoporetech/pipeline-transcriptome-de/tree/paired_dge_dtu

## 2. Variant Calling workflow

Assignees: Allan and Joyce

### Aim

The focus of this mini-project is to standardize an existing pathogen variant calling pipeline: https://github.com/mbbu/variant-calling-pipeline/tree/main/GATK.
The pipeline should be written in Nextflow using the DSL2 syntax. Docker or Singularity should be used for containerization, and the containers should come from the Quay registry or Docker Hub.

### Tasks

1. Understand how to perform variant calling analysis.
    - Read online documentation. An example:
        - https://h3abionet.github.io/H3ABionet-SOPs/Variant-Calling-1-0.html
    - Understand how the scripts were written
        - https://github.com/mbbu/variant-calling-pipeline/tree/main/GATK
    - An example of a variant calling pipeline in Nextflow
        - https://github.com/CRG-CNAG/CalliNGS-NF
2. Containerization
    - Write a Dockerfile or Singularity definition file
    - The trick is to build it locally and confirm that it runs before building it in a registry.
    - Upload the Dockerfile to either Quay or Docker Hub and build the container.
3. Write a Nextflow pipeline using DSL2 syntax
4. Document your work clearly on GitHub using wikis and GitHub Pages.

**N/B:**

- You have to clearly demonstrate collaborative research skills, informative visualization, and report writing.
- We look forward to seeing a pipeline that emulates the following:
    - https://github.com/biocorecrg/master_of_pores
    - https://github.com/nanoporetech/pipeline-transcriptome-de/tree/paired_dge_dtu

## 3. Insect Identification using Machine learning

Assignees: Caro and JB

### Background

The International Centre of Insect Physiology and Ecology (icipe) is a leading institution in insect research in Africa and the world. There are several reasons why icipe focuses on insect research: insects are a source of food and feed, they are the most diverse and abundant forms of life on earth, and they are crop pests and disease vectors. There is a need to harness the potential of insects for food and feed and to develop strategies for their control. All of this starts with identifying the insects, a role taken up by well-trained entomologists.
However, entomologists are few and not always available, while insect varieties are many, which raises the need for automated identification techniques. Machine learning, especially deep learning, has become a go-to tool for image identification and classification.

### Tasks

In this project, you are expected to:

1. Review some of the machine learning algorithms that can be developed for insect identification
2. Identify open datasets that can be used for training the algorithms
3. Curate the data available at icipe to make them machine learning ready
4. Develop a proof-of-concept model for the identification of one insect (this can be identified from further consultations with entomologists)
5. Provide recommendations on how this can be adopted, including dissemination platforms
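
One part of making image data "machine learning ready" (task 3) is usually a reproducible train/validation/test split that keeps every class represented in each subset. The sketch below is only an illustration of that idea using the Python standard library. The function name `stratified_split`, the split fractions, the folder layout, and the labels `class_a` and `class_b` are all assumptions made for the example, not part of the icipe data or of any agreed project design.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_frac=0.15, test_frac=0.15, seed=42):
    """Split (image_path, label) pairs into train/val/test, label by label.

    Hypothetical helper: the fractions and the seed are illustrative defaults.
    """
    # Group image paths by their class label.
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        paths = sorted(by_label[label])  # sort so input order does not matter
        rng.shuffle(paths)
        n_val = int(len(paths) * val_frac)
        n_test = int(len(paths) * test_frac)
        splits["val"] += [(p, label) for p in paths[:n_val]]
        splits["test"] += [(p, label) for p in paths[n_val:n_val + n_test]]
        splits["train"] += [(p, label) for p in paths[n_val + n_test:]]
    return splits


# Made-up example data: 20 image paths for each of two placeholder labels.
samples = [(f"images/{label}/{i:03d}.jpg", label)
           for label in ("class_a", "class_b") for i in range(20)]
splits = stratified_split(samples)
print({name: len(items) for name, items in splits.items()})
```

Splitting per label rather than over the whole list keeps rare insect classes from vanishing out of the validation or test sets, and the fixed seed means collaborators working on the same file list get the same split.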