###### tags: `BOT3404` # SESSION 2: HIGH-THROUGHPUT DNA SEQUENCE FILES [TOC] ## Bioinformatic Research in Biology > Bioinformatics is the study of biology using the tools of computer science. Bioinformatics enables the study of proteins, genes and genomes using computer algorithms and computer databases. According to a more general National Institutes of Health definition (see PDF below), bioinformatics is "research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, analyze, or visualize such data." The related discipline of computational biology is "the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems." SOURCE: [Kennedy Krieger Institute](https://www.kennedykrieger.org/research/centers-labs-cores/bioinformatics) Bioinformatics is inherintly associated to biology research. However, the use of bioinformatics is not limited to the subjects covered in this tutorial. While it has prevalent use in genomics and genetics projects, computational skills utilized in this workshop can be applied to research in nearly all discliplines within biology. Not only is processing data from wet lab part of bioinformatics, but the visualization of data is a major aspect of bioinformatics work. ## Short Reads Sequencing When we sequence short reads, we mean that the genomic code (DNA) has been fragmented into small chunks, called 'reads'. We do this because we are unable to sequence very long stretches of DNA, and could potentially lose data. In the wetlab procedure, DNA extraction materials are fragmented with an enzyme called fragmentase. The seqments are amplified using PCR, and then paired together in the library preparation step. This is why data rendered from sequencing will have two files for one sample. One is the forward read, and one is the reverse read. ![SOURCE: https://thesequencingcenter.com/knowledge-base/what-are-paired-end-reads/](https://i.imgur.com/fR9x2vr.jpg) SOURCE: [The Sequencing Center](https://thesequencingcenter.com/knowledge-base/what-are-paired-end-reads/) ## Opening and Viewing a FASTQ files Last week, you calculated the number of lines and the sizes of two FASTQ files, which contain the raw output of a DNA sequencer. The sequences are _paired_ which means that the first sequence in the `R1.paired.fastq` file correpsonds to the same DNA fragment as the first sequence in the `R2.paired.fastq` file. The files you worked with last week were already _trimmed_ meaning that poor quality DNA sequences were removed from the file. This week, you will work with _untrimmed_ files and use statistics about the sequences to decide on quality control for the sequences. A FASTQ file is also a text format, but contains the raw or modified output of DNA sequencers, such as the Illumina system. Each sequence in a FASTQ file is known as a "read" and is represented by four lines: 1. The identifier for the read 2. The DNA sequence 3. A placeholder line, usually just a `+` character 4. [Quality scores represented by a single character.](https://en.wikipedia.org/wiki/Phred_quality_score) There are untrimmed sequence files in `/scratch/bot3404/reads`. You will be assigned to a specific pair of files starting with `ABRONIAseq`. The files will end with `.fastq.gz` which means the files have been compressed to save space on the computer. **Copy the two sequence files you have been assigned to your directory on SPHAGNUM**: `cp /scratch/bot3404/reads/Abronia/ABRONIAseq999.R1.fastq.gz /scratch/bot3404/yourname/` ### `zcat` You cannot normally use `cat` to view compressed files, (give it a try to see some crazy looking output!) but you can use `zcat` instead: `zcat filename.fastq.gz` **Use `zcat` to view one of your compressed `fastq.gz` files**. There will be a lot of output! You can use Control-C to cancel. You can pipe the output of `zcat` to other programs. For example, you can get the number of lines in the file: `zcat filename.fastq.gz | wc -l` You can also use `less` to view the sequence files: `zcat filename.fastq.gz | less` **Questions** 1. How does the number of lines using `wc -l` correspond with the number of sequences in the file? 2. Are all the sequences the same length? 3. The image below has the conversion between ASCII characters and quality scores (Q-scores) for DNA sequences. Look through some of your sequences - are the scores high (above 30) or low (below 20)? ![](https://www.omixon.com/wp-content/uploads/2013/06/illumina_fastq_coding.png) ## Using FastQC to Evaluate Quality of Reads FastQC is a tool used to perform quality control functions on untrimmed sequence data. It will also provide a set of summary statistics that allow us to identify problems in the data before continuing with the pipeline. FastQC creates an HTML report which can be opened outside of your terminal. You can read more about the function of FastQC on the [Babraham Bioinformatics](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) site. Here are links to [bad](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) and [good](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) outputs. FastQC accepts multiple input fastp files at once. This is a good opportunity to utilize the 'wildcard' `*` syntax. Using a star in place of a filename, or part of a filename, will tell the terminal that you want to run all samples that match the remaining descriptors. For example, `*.fasta` will refer to all files that have the extention .fasta. Additionally, SampleNumber*.fasta will refer to all files that start with 'SampleNumber' and end with the .fasta extension. Using a wildcard will be very helpful when you have a large dataset. To identify potential arguments for `fastqc`, you can use the `-h` argument: ``` (base) bot3404$ fastqc -h fastqc -h FastQC - A high throughput sequence QC analysis tool SYNOPSIS fastqc seqfile1 seqfile2 .. seqfileN fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN DESCRIPTION FastQC reads a set of sequence files and produces from each one a quality control report consisting of a number of different modules, each one of which will help to identify a different potential type of problem in your data. If no files to process are specified on the command line then the program will start as an interactive graphical application. If files are provided on the command line then the program will run with no user interaction required. In this mode it is suitable for inclusion into a standardised analysis pipeline. The options for the program as as follows: -h --help Print this help file and exit -v --version Print the version of the program and exit -o --outdir Create all output files in the specified output directory. Please note that this directory must exist as the program will not create it. If this option is not set then the output file for each sequence file is created in the same directory as the sequence file which was processed. --casava Files come from raw casava output. Files in the same sample group (differing only by the group number) will be analysed as a set rather than individually. Sequences with the filter flag set in the header will be excluded from the analysis. Files must have the same names given to them by casava (including being gzipped and ending with .gz) otherwise they won't be grouped together correctly. --nano Files come from nanopore sequences and are in fast5 format. In this mode you can pass in directories to process and the program will take in all fast5 files within those directories and produce a single output file from the sequences found in all files. --nofilter If running with --casava then don't remove read flagged by casava as poor quality when performing the QC analysis. --extract If set then the zipped output file will be uncompressed in the same directory after it has been created. By default this option will be set if fastqc is run in non-interactive mode. -j --java Provides the full path to the java binary you want to use to launch fastqc. If not supplied then java is assumed to be in your path. --noextract Do not uncompress the output file after creating it. You should set this option if you do not wish to uncompress the output when running in non-interactive mode. --nogroup Disable grouping of bases for reads >50bp. All reports will show data for every base in the read. WARNING: Using this option will cause fastqc to crash and burn if you use it on really long reads, and your plots may end up a ridiculous size. You have been warned! --min_length Sets an artificial lower limit on the length of the sequence to be shown in the report. As long as you set this to a value greater or equal to your longest read length then this will be the sequence length used to create your read groups. This can be useful for making directly comaparable statistics from datasets with somewhat variable read lengths. -f --format Bypasses the normal sequence file format detection and forces the program to use the specified format. Valid formats are bam,sam,bam_mapped,sam_mapped and fastq -t --threads Specifies the number of files which can be processed simultaneously. Each thread will be allocated 250MB of memory so you shouldn't run more threads than your available memory will cope with, and not more than 6 threads on a 32 bit machine -c Specifies a non-default file which contains the list of --contaminants contaminants to screen overrepresented sequences against. The file must contain sets of named contaminants in the form name[tab]sequence. Lines prefixed with a hash will be ignored. -a Specifies a non-default file which contains the list of --adapters adapter sequences which will be explicity searched against the library. The file must contain sets of named adapters in the form name[tab]sequence. Lines prefixed with a hash will be ignored. -l Specifies a non-default file which contains a set of criteria --limits which will be used to determine the warn/error limits for the various modules. This file can also be used to selectively remove some modules from the output all together. The format needs to mirror the default limits.txt file found in the Configuration folder. -k --kmers Specifies the length of Kmer to look for in the Kmer content module. Specified Kmer length must be between 2 and 10. Default length is 7 if not specified. -q --quiet Supress all progress messages on stdout and only report errors. -d --dir Selects a directory to be used for temporary files written when generating report images. Defaults to system temp directory if not specified. BUGS Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk or in www.bioinformatics.babraham.ac.uk/bugzilla/ ``` Below are some sample command line arguments: > `fastqc [-o output dir] [-f fastq|bam|sam] seqfile1 .. seqfileN` ### `fastqc activity` *Try writing a line of code below that prompts FastQC to create an output file.* ``` ``` Use CyberDuck to download the FastQC output file. To connect to the SPHAGNUM computer in CyberDuck you will need to open an `sftp` connection, and specify the server as `sphagnum.ttu.edu` with `bot3404` as the username: ![](https://i.imgur.com/oKAN8mS.png) and view it in HTML. **Questions** 1. How many reads are in the files? 2. How many total base-pairs (DNA letters) are in the files? 3. What is the distribution of read lengths? 4. Are there any warnings in FASTQC? ## Flags and Helper Documents If you find yourself stuck or confused while writing a line of code while using a program, try using the helper printout. For example, FASTQC has a helper document that explains arguments needed for running the program. Access it using `fastqc -h`. Some programs may use `man program` or `program --help` for their help documents. It is also worth looking for the program's webpage to view recent documentation -- the developer may also have set up a help forum or e-mail list to get assistance. Other good sources of information for UNIX systems is [StackOverflow](https://stackoverflow.com/). More specific help for bioinformatics can be found using [biostars](https://www.biostars.org/), [seqanswers](http://seqanswers.com/), or the [bioinformatics reddit forum](https://www.reddit.com/r/bioinformatics/). When in doubt, it is frequently useful to use a Google search with the name of the program and the error message. Often, these searches will take you directly to a post on one of the sites mentioned above! ## Trimming Data We will use [FastP](https://github.com/OpenGene/fastp) to trim down reads before pairing them. We trim reads for many reasons, such as removing the adapter sequences from a read (adapter sequences are critical to the library prep stage, but cause problems in bioinformatics), or to trim out the low quality, short, sections of reads. We want to keep only high quality and longer reads to input into the pipeline. > fastp supports both single-end (SE) and paired-end (PE) input/output. >* for SE data, you only have to specify read1 input by -i or --in1, and specify read1 output by -o or --out1. >* for PE data, you should also specify read2 input by -I or --in2, and specify read2 output by -O or --out2. >* if you don't specify the output file names, no output files will be written, but the QC will still be done for both data before and after filtering. >* the output will be gzip-compressed if its file name ends with .gz `fastp -i [untrimmed.forward].fq -I [untrimmed.reverse].fq -o [trimmedoutput.forward].fq -O [trimmedoutput.reverse].fq` SOURCE: [OpenGene/ FastP Github Page](https://github.com/OpenGene/fastp) In addition to the simple usage of FastP, we also need to add additional arguments for the type of data we have. The instructors will inform you of additional arguments needed based on the data set you use. ### `fastp` activity *Try adapting the simple usage code for your dataset.* ``` ``` ## Comparing trimmed and untrimmed output Previously, we ran FastQC on the untrimmed data. ### Final Activity *Write a command to run FastQC on the trimmed reads* ``` ``` What are the differences between the two outputs?