---
# System prepended metadata

title: 'SESSION 3: HybPiper'
tags: [BOT3404]

---

###### tags: `BOT3404`

# SESSION 3: HybPiper
[TOC]


## 3.1 Running HybPiper
The [HybPiper](https://github.com/mossmatters/HybPiper/blob/master/README.md) bioinformatics pipeline is designed to assemble target capture data, which is when extracted DNA is probed for genes regions of interest. Hybpiper is composed of numerous python scripts that use bioinformatics tools (like BLAST, EXONERATE, and GNU Parallel), to make assembling sequences easier for the user.

HybPiper runs the following tools in this order:
1. Assigns high-throughput reads to target genes using BLASTx or BWA
2. Reads are distributed into directories, and then assembled using SPAdes
3. Creates output FASTA files; one containing CDS portion of the sample for each target region, and the other containing the translated protein sequence.


### New Conda Environment

You will first need to download HybPiper and other program dependencies. We can do this with Conda as we did in a previous session:

```
conda create -n hybpiper hybpiper
```

Once the installation has completed, activate the new environment:


```
conda activate hybpiper
```

Note that your command prompt has changed with `(hybpiper)` at the beginning to indicate what environment you are in. If you ever need to get out of the envirnoment, use `conda deactivate`.

### Setting your working directory

Hybpiper will generate a lot of output, so you will want to work in your **scratch** directory. Go to that directory now:

```
cd /lustre/scratch/yourusername
```

make a new directory for today's HybPiper analysis:

```
mkdir my_hybpiper
```

```
cd my_hybpiper
```

### Accessing your Target File

The reads you have been using were generated with the [Angiosperms353 targeted sequencing kit](https://arborbiosci.com/genomics/targeted-sequencing/mybaits/mybaits-expert/mybaits-expert-angiosperms-353/), so we will use a file that contains sequences from several different angiosperm species. This file is known within HybPiper as the **Target File**

The sequences are located at: `/home/joh97948/mega353.fasta`


#### Action 3.1
 *Use the space below to show how you will move the target file into your working directory.*
```

```

### Interactive session

You will need to run HybPiper in an interactive session on HPCC. To get a computing node with 8 available processors:

```
interactive -p nocona -c 8
```

You will need to re-activate the HybPiper Conda environment:

```
conda activate hybpiper
```


### Running HybPiper

With these two files, we are able to run the first HybPiper command, `hybpiper assemble`. The arguments for this command are as follows (**do not run this, it's just an example**):

> `hybpiper assemble -t_dna [Target File].fasta -r readfile1.fastq readfile2.fastq --prefix [Output filename] --bwa --cpu N`
 
`assemble` is the python code which will run this portion of HybPiper.
 
The target file (`-t`) is the file we have located and moved into your working directory from before.

The sample file fastq file(s) (`-r`) indicates which sample we will be running. You should use the wildcard shortcut we used previously, and have HybPiper run the forward and reverse reads simultaneously. *Hint: try using the wildcard after the R in your filename.*

 `--bwa` should be used when nucleotides are used in the target files or input files.
 
 `--prefix` will control the name of the output files. For your files, it is probably best to use the first part of the read file name (e.g. `GUMOseq429`).
 
`--cpu` specifies the number of processors used by HybPiper. To better share the computer during class, this should be set to `8`
 
`--targetfile_ambiguity_codes` lets HybPiper know that this is a target file that contains [DNA ambiguity codes](https://droog.gs.washington.edu/parc/images/iupac.html).

#### Action 3.2
*Try running `hybpiper assemble` by yourself, with the correct filenames.*
```

```

### HybPiper Output

The output of HybPiper is a directory named with the `--prefix` flag. This directory contains a set of standard subdirectories, one per gene. Within each gene, there are other directories organizing outputs from the different stages of HybPiper. 

For example, to find the sequences recovered for gene `4471` for a sample with the directory name `SampleName`, you would look in: 

`SampleName/4471/SampleName/sequences/FNA/4471.FNA`

You will need to exchange `SampleName` for whatever you put after `--prefix` in the command above.

#### Action 3.3

Inside the output directory, the file `genes_with_seqs.txt` contains a list of all the genes with recovered sequences. How many genes were recovered for your sample?

Use CyberDuck to access the HybPiper output directory and navigate the directories within each gene. Does your sample have a sequence recovered for the following genes:
* `4471`
* `6969`
* `7628`


### Supercontigs
A supercontig is a ordered set of contigs, creating a portion of the genome. A singular contig is a continuous length of sequence that we are very sure the order of bases is. A supercontig is a larger portion of the genome reconstructed using these smaller contigs, creating a sequence with a few gaps. 

The supercontig contains portions of exon (coding DNA) that are likely to be slowly evolving, and intron (non-coding DNA) that are likely to accumulate more mutations. Therefore, if your project is focused on deep phylogenetics, you may focus on the exons only. But if your project is focused on within-species (e.g. biogeography) or even species complex phylogenetics, adding the supercontigs will probably help resolve differences among individuals.


#### Action 3.4

The output of this `hybpiper assemble` will add a directory `intron` to each gene recovered. Inside this directory is a file with the suffix `_supercontig.fasta`

Identify one of the genes with a recovered sequence and download both the `FNA` file and `supercontig.fasta` file. For example, you would find sequences for gene `4471` in:

`SampleName/4471/SampleName/sequences/FNA/4471.FNA` 

and

`SampleName/4471/SampleName/sequences/intron/4471_supercontig.fasta`

Download the two files and open both in a text editor: **how different are the lengths of the sequences?**


### Getting HybPiper Stats

Getting a summary of statistics for our assembled reads is easy using the `hybpiper stats`:

```
usage: hybpiper stats [-h] [--seq_lengths_filename SEQ_LENGTHS_FILENAME] [--stats_filename STATS_FILENAME] targetfile {dna,DNA,aa,AA} {gene,GENE,supercontig,SUPERCONTIG} namelist
```

Therefore we need to specify the targetfile (`mega353.fasta`), the sequence type for the target file (`dna`), the sequence type to summarize (`gene`) and a namelist that contains the name of your sample.

Use `cat` to create a file called `namelist.txt` that contains the directory name for the sample generated by HybPiper (specified using `--prefix`).

```
cat > namelist.txt
```
Type the name of your sample (e.g. `GUMOseqNNN' ) and then use Ctrl+C to exit out. You now have a text file ``namelist.txt` in your current directory that contains one line, the name of your sample.


#### Action 3.5
*Replace the example arguments with your own, and run the `hybpiper stats` command.*
```

```

`hybpiper stats` will generate two files, `seq_lengths.tsv` and `hybpiper_stats.tsv`

The first line of `seq_lengths.tsv` has the names of each gene. The second line has the length of the target gene, averaged over each "source" for that gene. The rest of the lines are the length of the sequence recovered by HybPiper for each gene. If there was no sequence for a gene, a 0 is entered.

We can view the output file by opening it with `less`. The first line of the file has all the target gene names, and the second line has the length of the target gene. The lines following will have each sample per line, showing the length of the sequence recovered for that particular gene. If there were no reads mapped to a particular gene, you will see a 0 in that sample line. 

**How many genes are missing sequences?**


The `hybpiper_stats.tsv` contains a number of useful statistics about HybPiper output.


This script will summarize target enrichment and gene recovery efficiency for a set of samples. The output is a text file with one sample per line and the following statistics:
> 
> * Number of reads
> * Number of reads on target
> * Percent reads on target
> * Number of genes with reads
> * Number of genes with contigs
> * Number of genes with sequences
> * Number of genes with sequences > 25% of the target length
> * Number of genes with sequences > 50% of the target length
> * Number of genes with sequences > 75% of the target length
> * Number of genes with sequences > 150% of the target length
> * Number of genes with paralog warnings

### Making a Heatmap

A heatmap can be a good visualization of target recovery with multiple samples; each column is a gene, each row is a sample, and the shading in each cell represents the percentage of the target gene length recovered:

![](https://github.com/mossmatters/HybPiper/wiki/img/test_dataset_recovery_heatmap.png)

Since this command will only make sense if multiple samples are run through HybPiper, you will need to compare `seq_lengths.tsv` run from multiple samples.

First, move your HybPiper output directory to the common location:

```
mv HybPiperOutput /scratch/bot3404/hybpiper_output
```

Then, modify the `namelist.txt` file and add the name of your output directory:

`nano /scratch/bot3404/hybpiper_output/namelist.txt`

Once the class has a lot of samples in this directory, your instructor will run `hybpiper stats` for the class.


#### Action 3.6

Construct a `hybpiper recovery_heatmap` command for the given file locations.

```


```

How does your sample compare to other samples run in the class?