###### tags: `BOT3404`
# SESSION 3: HybPiper
[TOC]
## 3.1 Running HybPiper
The [HybPiper](https://github.com/mossmatters/HybPiper/blob/master/README.md) bioinformatics pipeline is designed to assemble target capture data, which is when extracted DNA is probed for genes regions of interest. Hybpiper is composed of numerous python scripts that use bioinformatics tools (like BLAST, EXONERATE, and GNU Parallel), to make assembling sequences easier for the user.
HybPiper runs the following tools in this order:
1. Assigns high-throughput reads to target genes using BLASTx or BWA
2. Reads are distributed into directories, and then assembled using SPAdes
3. Creates output FASTA files; one containing CDS portion of the sample for each target region, and the other containing the translated protein sequence.
### Using Conda
It can be a challenge to maintain programs on a remote system, especially if the programs are only distributed as source code and need to be compiled on each system separately. In Bioinformatics, often different workflows require different versions of the same software, and these can also be very difficult to maintain.
This week we will introduce another solution: `conda`. [From the Conda website](https://docs.conda.io/en/latest/):
> Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
On SPHAGNUM there is an environment already made that contains all of the dependencies to run HybPiper:
```
conda activate hybpiper
```
Note that your command prompt has changed with `(hybpiper)` at the beginning to indicate what environment you are in. If you ever need to get out of the envirnoment, use `conda deactivate`.
### Accessing your Target File
The reads you have been using were generated with the [Angiosperms353 targeted sequencing kit](https://arborbiosci.com/genomics/targeted-sequencing/mybaits/mybaits-expert/mybaits-expert-angiosperms-353/), so we will use a file that contains sequences from `Abronia`.
The sequences are located at: `/scratch/joh97948/Abronia/angiosperms353.abronia.fasta`
#### Action 3.1
*Use the space below to show how you will move the target file into your working directory.*
```
```
### Running HybPiper
With these two files, we are able to run the first HybPiper command, `hybpiper assemble`. The arguments for this command are as follows:
> `hybpiper assemble -t [Target File].fasta -r [Sample File].fastq --prefix [Output filename] --bwa --cpu N --targetfile_ambiguity_codes NMRWSYKVHD`
`assemble` is the python code which will run this portion of HybPiper.
The target file (`-t`) is the file we have located and moved into your working directory from before.
The sample file fastq file(s) (`-r`) indicates which sample we will be running. You should use the wildcard shortcut we used previously, and have the python script run the forward and reverse reads simultaneously. *Hint: try using the wildcard after the R in your filename.*
`--bwa` should be used when nucleotides are used in the target files or input files.
`--prefix` will control the name of the output files.
`--cpu` specifies the number of processors used by HybPiper. To better share the computer during class, this should be set to `4`
`--targetfile_ambiguity_codes` lets HybPiper know that this is a target file that contains [DNA ambiguity codes](https://droog.gs.washington.edu/parc/images/iupac.html).
#### Action 3.2
*Try running `hybpiper assemble` by yourself, with the correct filenames.*
```
```
### HybPiper Output
The output of HybPiper is a directory named with the `--prefix` flag. This directory contains a set of standard subdirectories, one per gene. Within each gene, there are other directories organizing outputs from the different stages of HybPiper.
For example, to find the sequences recovered for gene `4471` for a sample with the directory name `SampleName`, you would look in:
`SampleName/4471/SampleName/sequences/FNA/4471.FNA`
#### Action 3.3
Inside the output directory, the file `genes_with_seqs.txt` contains a list of all the genes with recovered sequences. How many genes were recovered for your sample?
Use CyberDuck to access the HybPiper output directory and navigate the directories within each gene. Does your sample have a sequence recovered for the following genes:
* `4471`
* `6969`
* `7628`
### Supercontigs
A supercontig is a ordered set of contigs, creating a portion of the genome. A singular contig is a continuous length of sequence that we are very sure the order of bases is. A supercontig is a larger portion of the genome reconstructed using these smalled contigs, creating a sequence with a few gaps. To recover the supercontigs we will need to re-run `hybpiper assemble` but now with some additional flags:
```
--run_intronerate --no-blast --no-distribute --no-assemble
```
The `--run_intronerate` flag recovers regions flanking the targeted exons, while the various `--no-` commands skip the earlier parts of the workflow.
#### Action 3.4
The output of this command will add a directory `intron` to each gene recovered. Inside this directory is a file with the suffix `_supercontig.fasta`
Identify one of the genes with a recovered sequence and download both the `FNA` file and `supercontig.fasta` file. For example, you would find sequences for gene `4471` in:
`SampleName/4471/SampleName/sequences/FNA/4471.FNA`
and
`SampleName/4471/SampleName/sequences/intron/4471_supercontig.fasta`
Open both in a text editor: **how different are the lengths of the sequences?**
### Getting Sequence Lengths
To get a quick visual summary of our data thus far, we can use `hybpiper get_seq_lengths`.
`hybpiper get_seq_lengths [Hybpiper output files].fasta [namelist].txt dna > [Seq Lengths output filename].txt`
### Getting HybPiper Stats
Getting a summary of statistics for our assembled reads is easy using the `hybpiper stats`:
```
usage: hybpiper stats [-h] [--seq_lengths_filename SEQ_LENGTHS_FILENAME] [--stats_filename STATS_FILENAME] targetfile {dna,DNA,aa,AA} {gene,GENE,supercontig,SUPERCONTIG} namelist
hybpiper stats: error: the following arguments are required: targetfile, targetfile_sequence_type, sequence_type, namelist
```
Therefore we need to specify the targetfile (`mega353.fasta`), the sequence type for the target file (`dna`), the sequence type to summarize (`gene`) and a namelist that contains the
Use `nano` to create a file called `namelist.txt` that contains the directory name for the sample generated by HybPiper (specified using `--prefix`).
#### Action 3.5
*Replace the example arguments with your own, and run the `hybpiper stats` command.*
```
```
`hybpiper stats` will generate two files, `seq_lengths.tsv` and `hybpiper_stats.tsv`
The first line of `seq_lengths.tsv` has the names of each gene. The second line has the length of the target gene, averaged over each "source" for that gene. The rest of the lines are the length of the sequence recovered by HybPiper for each gene. If there was no sequence for a gene, a 0 is entered.
We can view the output file by opening it with `less`. The first line of the file has all the target gene names, and the second line has the length of the target gene. The lines following will have each sample per line, showing the length of the sequence recovered for that particular gene. If there were no reads mapped to a particular gene, you will see a 0 in that sample line.
**How many genes are missing sequences?**
The `hybpiper_stats.tsv` contains a number of useful statistics about HybPiper output.
This script will summarize target enrichment and gene recovery efficiency for a set of samples. The output is a text file with one sample per line and the following statistics:
>
> * Number of reads
> * Number of reads on target
> * Percent reads on target
> * Number of genes with reads
> * Number of genes with contigs
> * Number of genes with sequences
> * Number of genes with sequences > 25% of the target length
> * Number of genes with sequences > 50% of the target length
> * Number of genes with sequences > 75% of the target length
> * Number of genes with sequences > 150% of the target length
> * Number of genes with paralog warnings
### Making a Heatmap
A heatmap can be a good visualization of target recovery with multiple samples; each column is a gene, each row is a sample, and the shading in each cell represents the percentage of the target gene length recovered:

Since this command will only make sense if multiple samples are run through HybPiper, you will need to compare `seq_lengths.tsv` run from multiple samples.
First, move your HybPiper output directory to the common location:
```
mv HybPiperOutput /scratch/bot3404/hybpiper_output
```
Then, modify the `namelist.txt` file and add the name of your output directory:
`nano /scratch/bot3404/hybpiper_output/namelist.txt`
Once the class has a lot of samples in this directory, your instructor will run `hybpiper stats` for the class.
#### Action 3.6
Construct a `hybpiper recovery_heatmap` command for the given file locations.
```
```
How does your sample compare to other samples run in the class?