1. BOSS WORKSHOP NOV 1-5
:::info
- **Location:** Virtual
- **Zoom Link:** https://us02web.zoom.us/j/81086617166?pwd=VWcvbVdsYjFmWWJkTm01U01KSkJodz09
- **Date:** Nov 1-5, 2021 9:00 am EAT (GMT+3)
- **Pre workshop survey:** https://forms.gle/LgWuvnihQGgzArjh6
- **Post workshop survey:** https://forms.gle/QDG9yc6pYx8jtEEY6
:::
:::info
## November 5
***Topics***
- Introduction to Git/GitHub - Caleb Kibet
- Introduction to Galaxy - Peter van Heusden
**Participants Roll call**
1. Mbu'u Mbanwi Cyrille, UYI, Cameroon
2. Winnie Mumbi
3. Joseph Gisaina; University of EMBU
4. Meshack Wadegu; KEMRI
5. Audrey Oronda
6. Beatrice Oduor, University of Nairobi
7. Martha Nginya, BHKi
8. Oscar Mwaura ;Pwani university
9. munduku Benoni;jcrc; Uganda
10. Okoko Irene; @okoko_mkavi
11. Faith Ndung'u;JKUAT; @Njoki_Ndung'u
12. Dativa pereus, National Institute for Medical Research-(NIMR Tanzania), University of Nairobi
13. Gladys Rotich; ICIPE
14. Sharon Watiri; KEMRI
15. Ahmed Abbi Abdille; PAUSTI @ahmedabbi
16. Stephen Tavasi ; Masinde Muliro University
17. Michael Ambutsi; MMUST
### Notes
***Git and GitHub***
- Git is a version control system that keeps track of work and facilitates collaboration on projects
- GitHub is a web-based service for version control used to host Git repositories. It's also a social networking site for developers.
- GitHub can also be used to host websites for events, organizations, and study groups, as well as GitBooks. It can also be used to host code documentation, CVs, to-do lists, and open educational resources (OERs).
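A minimal sketch of the basic Git cycle described above (initialise a repository, stage a file, commit, inspect the history). The scratch directory, file name, and identity values are placeholders and not part of the workshop material. Pushing to GitHub is shown only as a comment because the remote URL is hypothetical.

```shell
# Work in a throwaway directory so nothing existing is touched (placeholder location).
repo="$(mktemp -d)"
cd "$repo"

# Create an empty repository and set an identity for this repository only.
git init -q
git config user.name "Workshop Participant"      # placeholder identity
git config user.email "participant@example.org"  # placeholder identity

# Create a file, stage it, and record a snapshot of it.
echo "# BOSS workshop notes" > README.md
git add README.md
git commit -q -m "Add README"

# Inspect the history: one line per commit.
git log --oneline

# To publish on GitHub you would add a remote and push (hypothetical URL):
#   git remote add origin https://github.com/<user>/<repo>.git
#   git push -u origin <branch>
```

Each later change follows the same edit, `git add`, `git commit` cycle, which is what lets Git keep track of the work over time.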
***Galaxy***
- Galaxy is a web-based bioinformatics environment that is available through public servers
- Galaxy aims to make bioinformatics accessible by providing a user-friendly interface and making analysis workflows reusable. It follows open-source and FAIR principles.
- Register on https://usegalaxy.org/ or https://usegalaxy.eu/
- Build workflows and add them to Galaxy to make analyses reproducible
- Short introduction to Galaxy: https://training.galaxyproject.org/training-material/topics/introduction/tutorials/galaxy-intro-short/tutorial.html
- Galaxy Project YouTube channel: https://www.youtube.com/c/GalaxyProject/playlists
- Demonstration workflow: https://usegalaxy.eu/u/pvanheus/w/my-first-analysis-workflow
- Example of a published Galaxy workflow: https://workflowhub.eu/workflows/155
- Galaxy 101 tutorial: https://training.galaxyproject.org/training-material/topics/introduction/tutorials/galaxy-intro-101/tutorial.html
**Questions**
:::
:::info
## November 4
***Topics***
- Sequence alignment - Sonal Henson
- Sequence assembly - James Otieno
**Participants Roll call**
Name/email/affiliation
1. Joseph Gisaina Amwoma; University of EMBU
2. Winnie Mumbi
3. Benjamin Opot; USAMRU-K
4. Kirunda Jeremy Menya; Joint Clinical Research Centre; @Kirunda_J
5. Stephen Tavasi ; mmust
6. Gladys Rotich; ICIPE
7. Audrey Oronda
8. Meshack Wadegu;KEMRI
9. Oscar Mwaura; Pwani University
10. Mbu'u Mbanwi Cyrille
11. Boakye Emmanuel; Kwame Nkrumah University Of Science And Technology,Ghana; @thescientistgh
12. Beatrice Oduor, University of Nairobi
13. Edwin Mwakio; USAMRU-K
14. munduku Benoni; jcrc; Uganda
15. Sharon Watiri;KEMRI
16. Michael Ambutsi; MMUST
17. Kimani Ndung'u, KALRO,FCRC-Njoro, Kenya
18. Dativa pereus, National Institute for Medical Research-(NIMR Tanzania), University of Nairobi
**Questions**
### Notes
Miniconda download page: https://docs.conda.io/en/latest/miniconda.html
Conda installation:
wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
conda update conda
For example, to install SPAdes:
conda install -c bioconda spades
:::
:::info
## November 3
***Topics***
- Quality control and assessment - Shaun Aron
- Practical session - QC - Shaun Aron
- Scientific writing - Joy Owango
**Participants Roll call**
1. Sharon Watiri
2. Benjamin Opot; USAMRU-k
3. Winnie Mumbi; @ShizlerMumbi
4. Edwin Mwakio;USAMRU-K
5. Kirunda Jeremy Menya; Joint Clinical Research Centre; @Kirunda_J
6. Joseph Gisaina; University Embu
7. Karim Mtengai,Copperbelt University
8. Meshack Wadegu, KEMRI
9. Ahmed Abbi Abdille, @ahmedabbi
10. Stephen Tavasi ; Masinde Muliro University
11. Beatrice Oduor, UON
12. Audrey Oronda
13. Ambutsi Michael; MMUST; @Ambutsi2
14. Mbu'u Mbanwi Cyrille, UYI, Cameroon
15. Gladys Rotich; ICIPE; @jerono7_gladys
16. Oscar Mwaura; Pwani University @OscarMwaura2
17. Faith Ndung'u, JKUAT, @Njoki_Ndung'u
18. Phyllys Langat, KEMRI,@phyllismemo
19. Kimani Ndung'u, KALRO,FCRC-Njoro, Kenya
20. Okoko Irene; @okoko_mkavi
21. Boakye Emmanuel; Kwame Nkrumah University Of Science And Technology,Ghana; @thescientistgh
22. munduku benoni; jcrc; uganda
23. Dativa pereus, National Institute for Medical Research-(NIMR Tanzania), University of Nairobi
**Link to course website**
- https://canvas.instructure.com/courses/3735217/modules
**Links to data**
- wget https://hpc.ilri.cgiar.org/~mkofia/NGS_QC/SRR6319976.zip
- wget https://hpc.ilri.cgiar.org/~mkofia/NGS_QC/SRR957824.zip
**Link to register for the next session**
- https://bit.ly/3EGw6Zc
### Notes
**Quality control and assessment**
- Errors can be introduced at various steps of a sequencing experiment, making quality control an important step before performing downstream analysis
- NGS read quality typically starts off high and drops off towards the end of the reads
- FastQC is a common tool for generating quality metrics for raw reads because it is easy to use and quite intuitive
- MultiQC can combine the FastQC results from multiple FASTQ files into a single HTML report
- A FastQC run report:
**Scientific writing**
**Questions**
:::
:::
:::info
## November 2
***Topics***
- Advanced Linux, Awk and Sed - Sumir Panji
**Participants Roll call**
1. Mbu'u Mbanwi Cyrille, University of Yaounde I, Biotechnology Centre, Yaounde, Cameroon
2. Benjamin Opot; USAMRU-K
3. Joseph Gisaina; University of Embu; Kenya
4. Audrey Oronda
5. Oscar Mwaura; Pwani University;Kenya @OscarMwaura2
6. Boakye Emmanuel; Kwame Nkrumah University Of Science And Technology,Ghana; @thescientistgh
7. Okoko Irene; @okoko_mkavi
8. Musundi Sebastian. JKUAT, Kenya
9. Winnie Mumbi; @ShizlerMumbi
10. Edwin Mwakio;USAMRU-K
11. Gladys Rotich; ICIPE; @jerono7_gladys
12. Kirunda Jeremy Menya, Joint Clinical Research Centre, Uganda; @Kirunda_J
13. Beatrice Oduor, UON, @Betty_Oduor
14. Faith Ndung'u, JKUAT, @Njoki_Ndung'u
15. Ahmed Abbi Abdille, PAUSTI, ahmedabbi
16. Kimani Ndung'u; KALRO,FCRC-Njoro, Kenya
17. munduku benoni; JCRC; uganda
18. Stephen Tavasi ; Masinde Muliro University
19. Michael Ambutsi; MMUST; @Ambutsi2
20. Karim Mtengai;Copperbelt University @Kmtengai
21. Meshack Wadegu, KEMRI
22. Phyllys Langat, KEMRI,@phyllismemo
23. Dativa pereus, National Institute for Medical Research-(NIMR Tanzania), University of Nairobi
**Links to data**
wget https://raw.githubusercontent.com/eanbit-rt/IntroductoryLinux-2019/master/Data/exercises.fasta
wget https://raw.githubusercontent.com/eanbit-rt/IntroductoryLinux-2019/master/Data/sequences.fasta
wget https://raw.githubusercontent.com/eanbit-rt/IntroductoryLinux-2019/master/Data/genes.gff
wget https://raw.githubusercontent.com/bioinformatics-hub-ke/Boss-workshops/main/exercises.bed
**Questions**
:::
:::info
## November 1
***Topics***
- Intro to sequencing technologies - Martha Luka
- Data file formats - Talk - Shaun Aron
- Introduction to Unix - Sumir Panji
**Participants Roll call**
***Name; Institution; Twitter handle***
1. Benjamin Opot; USAMRU-K
2. Winnie Mumbi; @ShizlerMumbi
3. Gladys Rotich; ICIPE; @jerono7_gladys
4. Stephen Tavasi; Masinde Muliro University Of Sci & Techn.
5. Sharon Watiri
6. Susan Watitwa
7. Faith Ndung'u, JKUAT @Njoki_Ndung'u
8. Audrey Oronda
9. Mbu'u Mbanwi Cyrille, University of Yaounde I, Cameroon
10. Edwin Mwakio; USAMRU-K
11. Felix Maingi, @Felix_Wine, Jomo Kenyatta University of Agriculture and Technology
12. Boakye Emmanuel; Kwame Nkrumah University Of Science And Technology,Ghana; @thescientistgh
13. Michael Ambutsi,MMUST, @Ambutsi2
14. munduku Benoni; joint clinical research center, Uganda; @benon_munduku
15. Okoko Irene; @okoko_mkavi
17. Beatrice Oduor,@Betty_Oduor,University of Nairobi
18. Kirunda Jeremy Menya; Joint Clinical Research Centre, Uganda; @Kirunda_J
19. Joseph Gisaina; University of EMBU;@isain254
20. Ssegawa Abdulkarim;Makerere University,uganda;@karimkarsha
21. Kimani Ndung'u; KALRO,FCRC-Njoro, Kenya;
22. Ahmed Abbi Abdille; Pan African University; Somalia
23. Karim Mtengai Copperbelt University,Zambia @Kmtengai
25. Musundi Sebastian, JKUAT, Kenya
26. Meshack Wadegu, KEMRI
27. Phyllys Langat,KEMRI,@phyllismemo
28. Dativa pereus, National Institute for Medical Research-(NIMR Tanzania), University of Nairobi
### Notes
**Introduction to sequencing technologies**
- The first DNA sequencing method was published in 1977 by Frederick Sanger and was gel-based
- The speed of sequencing has increased over the years due to advancements in technology
- First generation sequencing was used to infer the identity of a sequence, while second generation sequencing enables massively parallel sequencing
- Third generation sequencing increases the length of the sequenced fragments, which improves the quality of the output.
- Automated Sanger sequencing uses fluorescently labelled ddNTPs; an automatic detector reads the labels and produces a chromatogram from which individual bases are called. Sanger sequencing is expensive for long reads.
- 2nd generation sequencing:
    - Ion Torrent is based on the detection of hydrogen ions released during the polymerization of DNA. The sequencing is real time and is effective for small labs with modest sequencing needs.
    - ABI/SOLiD stands for sequencing by oligonucleotide ligation and detection. Only probes that are complementary to the template bind and are therefore detected. It is much slower than other methods. It is good for DNA methylation/epigenetics studies.
    - Illumina sequencing is a sequencing-by-synthesis method: the genomic DNA is sheared into small fragments, adapters are added to build the library, and unique indices are added so that pooled samples can be told apart. It offers high base-calling accuracy, is cost effective to run, can produce deep coverage for metagenomics and de novo assembly, gives high-quality pairwise alignments, and produces short reads for long genomes.
- 3rd generation sequencing:
    - PacBio is a long-read sequencer. It is good for epigenetics because native DNA is sequenced, so methylation marks are retained. It has applications in whole genome sequencing and targeted sequencing.
    - Oxford Nanopore sequencing can be used for native DNA sequencing without a PCR step and is applicable to short or long reads. It is portable and the most cost effective, but it has a high error rate.
**Quality control and assessment**
- Raw read data generated by sequencing platforms is passed through tools that assess the quality of the reads to determine whether or not they need to be trimmed.
- Raw sequence data is stored in the FASTQ format, which is derived from the FASTA format and encodes the quality of each base as an ASCII character.
- QC and trimming: FastQC is a tool that takes FASTQ files as input and checks the quality of the reads, the presence of adapters, etc. FastQC generates a report.
- Alignment vs assembly: de novo assembly merges overlapping DNA fragments into contigs and is applicable when no reference genome is available. Alignment is conducted when a reference genome is available.
- The reference genome is first indexed before proceeding with alignment. When mapping with a tool such as BWA, the information from the indexing step is used to efficiently map the reads to the reference genome.
- IGV is an alignment viewer.
- The output of a read alignment is a SAM (Sequence Alignment/Map) file, which is human readable
- BAM (Binary Alignment/Map) is the binary version of SAM and hence not human readable. Tools lik
- CRAM files are a more compressed alternative to BAM and the most space-efficient way to store alignments. They are reference based.
- Variant calling: the process by which we identify variants from sequence data
- VCF (Variant Call Format) files are the output of variant calling; they store the variant data together with additional information on quality, annotations, etc.
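To make the FASTQ quality encoding mentioned above concrete, here is a small sketch that decodes a quality string into Phred scores. It assumes the Phred+33 offset (score = ASCII code minus 33) used by current Sanger/Illumina-style FASTQ files. The four-line record is made up for illustration and does not come from the workshop data.

```shell
# A made-up four-line FASTQ record: header, bases, separator, qualities.
fastq_record='@read1
ACGT
+
II5!'

# The quality string is the 4th line of the record.
qualities="$(printf '%s\n' "$fastq_record" | sed -n '4p')"

# Convert each quality character to a Phred score.
# Assumption: Phred+33 encoding, so score = ASCII code - 33.
scores=""
i=1
while [ "$i" -le "${#qualities}" ]; do
    char="$(printf '%s' "$qualities" | cut -c "$i")"
    code="$(printf '%d' "'$char")"   # a leading quote makes printf return the character code
    scores="$scores $((code - 33))"
    i=$((i + 1))
done
scores="${scores# }"                 # drop the leading space

echo "$scores"   # prints: 40 40 20 0
```

A score of 40 corresponds to an estimated error probability of 1 in 10,000, 20 to 1 in 100, and 0 to a base call with no confidence, which is the kind of information FastQC summarises per position.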
**Introduction to Unix**
- Unix is efficient for working with large data sets
- Most tools can access biological data that is in text format through Unix
- Linux commands can be combined in many ways using pipes
- You can navigate through the terminal using relative or absolute paths. An absolute path specifies the location from the root directory, whereas a relative path is relative to the current directory.
- cd, ls, pwd, mkdir and rmdir are some of the most important commands in Unix.
- Linux is case sensitive, and using tab completion is advised. Use underscores instead of spaces between words in a file name
- File permissions are set separately for the owner/user, the group, and others. The chmod command is used to change the permissions of a file or directory. You can use symbols or octal digits with the chmod command.
- You can edit the content of a file using text editors such as nano, emacs, gedit, vim, etc.
- Commands for viewing the content of a file include cat, more, less, head and tail.
- You can use wc to get the basic counts (lines, words and bytes) of a file
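The commands above can be tried safely in a scratch directory. This is a minimal sketch; the directory, file name, and sequences are invented for illustration and are not the workshop exercise files.

```shell
# Work in a scratch directory (placeholder location).
workdir="$(mktemp -d)"
cd "$workdir"
mkdir raw_data                      # underscore instead of a space in the name

# Create a small FASTA file with three made-up sequences.
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > raw_data/example_sequences.fasta

# The same file reached by a relative path and by an absolute path.
head -n 2 raw_data/example_sequences.fasta
head -n 2 "$workdir/raw_data/example_sequences.fasta"

# wc gives basic counts: here, the number of lines.
wc -l < raw_data/example_sequences.fasta

# A pipe passes the output of one command to the next:
# keep only header lines, count them, and strip any padding spaces.
n_seqs="$(grep '>' raw_data/example_sequences.fasta | wc -l | tr -d ' ')"
echo "sequences: $n_seqs"           # prints: sequences: 3

# chmod with octal digits: 640 = owner read+write, group read, others nothing.
chmod 640 raw_data/example_sequences.fasta
# The same permissions written with symbols.
chmod u=rw,g=r,o= raw_data/example_sequences.fasta
ls -l raw_data/example_sequences.fasta
```

In the octal form each digit is the sum of read (4), write (2) and execute (1) for the owner, the group, and others, in that order.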
**Introduction to high performance computing**
- An HPC cluster is a collection of many computers, called nodes, connected via a fast interconnect
- The difference between a personal computer and a cluster node is in the quantity, quality and power of the components.
:::info
**Questions**
1. How would the error rates in ONT affect genetic diversity studies?
2. Is it a must to sort SAM files prior to variant calling?
3. Is a sequence read a continuation of the one before it?
:::