Welcome to the HybPiper tutorial!

###### tags: `hybpiper_wiki` # Welcome to the HybPiper tutorial! The purpose of this tutorial is to help familiarize you with the format of the input you need and output you should expect from running HybPiper. The tutorial uses a test dataset that is a subset of real Illumina data from an enriched library. This tutorial assumes that you have some experience executing programs from the command line on a UNIX-like system. If you would like to learn more about the command line, or need a refresher, you can find a [command line tutorial here](https://github.com/mossmatters/introToCmdLine/blob/master/introToCmdLine.pdf). ## Test dataset **[Click this link](https://github.com/mossmatters/HybPiper/raw/develop/test_dataset/test_reads.fastq.tar.gz) to download the fastq files for the test dataset.** **MJ**: to make this tutorial work with the conda installation we need to make sure the tar includes the other files, not just reads. **`test_reads_fastq.tar.gz`** contains paired reads from nine samples chosen from the initial HybPiper manuscript. It includes six "ingroup" samples (genus *Artocarpus*) and three outgroup samples. Each sample has a pair of `*.fastq` files, representing the forward and reverse reads, generated on an Illumina MiSeq 2x300 platform. **Note**: If you cloned this git repository and do not have the "large file storage" add-on to `git`, the `test_reads_fastq.tar.gz` file in your repository will **NOT** be a real compressed archive. Please download the reads from the link above! **`test_targets.fasta`** is a file containing the full coding sequence from 13 target genes based on the _Artocarpus_ probe set described in the HybPiper manuscript. There are two "sources" of sequence for each target: *Artocarpus* (sequences from a draft genome in the target group) and *Morus* (a reference genome in the same family as *Artocarpus*). For example, both of these sequences represent `gene002`: ``` >Artocarpus-gene002 ATGATGAAGCAGGACGCCACCAACGGCGGCGAGAACTGGGTGCGCGTGTGCGACACGTGC CGCTCGGCGGCGTGCACGGTTTACTGCCGCGCCGACTCGGCTTACCTCTGCGCCGGATGC >Morus-gene002 ATGATGAAGGAGGACACAAACGGGGGCAACTCCAGCAAGAACTGGGCGCGCGTGTGTGAC ACGTGCCGTTCCGCGGCGTGCGCGGTGTACTGCCGTGCCGACTCGGCGTACCTTTGCGCG ``` Having multiple sources for each gene increases the likelihood that reads will map to the targets during the first phase. HybPiper then chooses the version with the best overall mapping score to serve as a reference for extracting coding sequences. **`namelist.txt`** A file containing the list of sample names, one per line. While not required, this file will help run the main HybPiper script and post-processing scripts on multiple samples. **`run_tests.sh`** This shell script contains a few of the examples we will cover in this tutorial. The script can be run from within the `test_dataset` directory simply by: ``` ./run_tests.sh ``` ***FixMe:*** _Update this bash script to reflect the new Hypiper 2.0 commands_ All of the commands in the bash script will be covered in more detail in the following sections. ## Running HybPiper Uncompress the read files using the `tar` utility command: ``` tar -zxvf test_reads_fastq.tar.gz ``` This will generate 18 `*.fastq` files in your current directory, representing the forward and reverse Illumina reads for nine samples. The main command of HybPiper is `hybpiper assemble`. HybPiper needs sequencing reads and a file containing the target coding sequences, in either amino-acid or nucleotides format. HybPiper is run separately for each sample, and generates a directory that contains all of the output files, organized by gene. Run this command from the `test_dataset' directory: ``` hybpiper assemble -t_dna test_targets.fasta -r NZ281_R*_test.fastq --prefix NZ281 --bwa ``` Using the wild card (asterisk) saves some typing and instructs HybPiper to use both the R1 (forward) and R2 (reverse) read files. The `--prefix` flag will be the name of the directory generated by HybPiper, as well as the identifier for all sequences generated. The `--bwa` flag is required when the target file contains nucleotide sequences. HybPiper will generate coding sequences and translated proteins from the sequencing reads in three phases: 1. **Read Mapping**. BLASTx or DIAMOND is used if the targets are amino-acid sequences, and BWA is used if the targets are nucleotide sequences. Sequencing reads that map to each gene are sorted into separate gene directories. Liberal parameters for mapping quality are used to ensure the maximum number of reads are used for contig assembly. 2. **Contig assembly**. Each gene is assembled separately from the pool of reads identified in the first step. Assembly is conducted using SPAdes, which automatically detects the best k-mer values to use. If one or more k-mer values fails (usually due to lack of read depth), HybPiper re-runs SPAdes with smaller k-mer values. 3. **Coding Sequence Extraction** First, HybPiper uses Exonerate to align the contigs to the appropriate target sequence, and coding sequences (i.e. exons) are extracted. Next, the coding sequences are sorted according to their alignment position. If there is one sequence that represents the entire aligned portion, it is chosen. Otherwise, a set of selection criteria are applied, including length of alignment to the target locus, percent identity between the coding sequence and the target, and depth of coverage. Briefly, if two coding sequences have non-overlapping alignments to the reference, they are combined into a "stitched-contig". If two contigs have similar alignments to the target sequence, the contig with the longer alignment is chosen. If two contigs have identical alignment positions, but one contig has a much greater depth of coverage (10x more by default), it is chosen. If they both have similar depth, the contig with the greater percent identity to the target is chosen. ***FixMe:*** _Update paragraph above, chat with Matt. I'm not sure what 'similar alignments' means. Currently alignments with identical coordinates are selected between via similarity but not depth - fix._ --- ## Running multiple samples **`hybpiper assemble`** Although HybPiper is set up to run on each sample separately, if the input files are organized and named appropriately, it is easy to set up and run HybPiper on multiple samples consecutively. Here, we will employ a "while loop" to get the names of samples from the file `namelist.txt`, and use that name as a "variable" to access the names of read files and set the `--prefix` flag. From the `test_dataset` directory, run the command: ``` while read name; do hybpiper assemble -t test_targets.fa -r $name*.fastq --prefix $name --bwa; done < namelist.txt ``` It should only take a few minutes to run through every sample. The "while loop" syntax and the `namelist.txt` file will also be used for other post-processing commands later in the tutorial. ## Summary statistics **`hybpiper stats`** This command will summarize target enrichment and gene recovery efficiency for a set of samples. The output is a two tables in tab-separated-values (`*.tsv`) format. These are: 1) `seq_lengths.tsv`. This table lists the length of coding sequence recovered by HybPiper for each gene and sample. The first line of `seq_lengths.tsv` has the column header 'Species' followed by the names of each gene. The second line has the column header 'MeanLength' followed by the length of the target gene, averaged over each "source" for that gene. The rest of the lines are the length of the sequence recovered by HybPiper for each gene. If there was no sequence for a gene, a 0 is present. 2) `hybpiper_stats.tsv`. This table lists one sample per line and the following statistics: | ColumnName | Description | | -------- | -------- | | Name | Sample name corresponding to the parameter `--prefix` or generated from read files. | | NumReads | Total number of input reads in the `*.fastq` files provided. | | ReadsMapped | Total number of input reads that mapped to sequences in the target file. | | PctOnTarget | Percentage of reads in the input `*.fastq` files that mapped to sequences in the target file. | | GenesMapped | Number of genes in the target file that had reads mapped to their representative sequence(s). | | GenesWithContigs | From genes with mapped reads, the number of genes for which SPAdes assembled contigs. | | GenesWithSeqs | From genes with SPAdes contigs, the number of genes that had coding sequences extracted after Exonerate searches of contigs vs the target file reference sequence. | | GenesAt25pct | Number of genes with sequences > 25% of the mean target length. | | GenesAt50pct | Number of genes with sequences > 50% of the mean target length. | | GenesAt75pct | Number of genes with sequences > 75% of the mean target length. | | GenesAt150pct | Number of genes with sequences > 150% of the mean target length.| | ParalogWarningsLong | Number of genes that had more than one coding sequence extracted from different contigs by Exonerate searches, each covering more than 75% (default) the length of the selected target file reference sequence. | | ParalogWarningsDepth | Number of genes where the coverage depth of coding sequences extracted by Exonerate was >1 for 75% (default) the length of the selected target file reference sequence. Coding sequences can be of any length. | | GenesWithoutStitchedContigs | Number of genes with sequence derived from a single SPAdes contig. | | GenesWithStitchedContigs | Number of genes with sequence derived from multiple SPAdes contigs. | | GenesWithStitchedContigsSkipped | Number of genes for which stitched contig creation was skipped (i.e. if the flag `--no_stitched_contig` was provided to the command `hybpiper assemble`). | **Example Command Line:** ``` hybpiper stats -t_dna test_targets.fa gene namelist.txt ``` ## Visualizing results **`hybpiper recovery_heatmap`** A heatmap is one way to get a quick glance of the overall success of HybPiper in recovering coding sequences. To generate the heatmap, we can use the `seq_lengths.tsv` produced by the previous `hybpiper stats` command. Supply this file to the `hybpiper recovery_heatmap` command using: ``` hybpiper recovery_heatmap seq_lengths.tsv ``` This will plot a heatmap representing HybPiper's success at recovering genes at each locus, across all the samples. An image file in `*.png` formated called `heatmap.png` will be produced: ![](img/test_dataset_recovery_heatmap.png) ***FixMe:*** _Insert png heatmap image here._ Each column is a gene, and each row is a sample. The darkness of shading in each cell represents the length of the sequence recovered, expressed as a percentage of the length of the target gene. In the test dataset, most genes worked very well, but it is easy to see at a glance that some samples did not work well (NZ874) and some genes did not work as well (gene022). ## Retrieving sequences **`hybpiper retrieve_sequences`** This command fetches the sequences recovered for the same gene from multiple samples and generates an unaligned multi-FASTA file for each gene. Alternatively, it can be used to recover sequences from a single sample in a single multi-FASTA file. In this tutorial we will carry out the former task. Make sure all your sample runs are in the same directory. The command will retrieve all unique gene names from the target file used in the run of the pipeline. **Example command Line:** ``` hybpiper retrieve_sequences dna -t_dna test_targets.fasta --sample_names namelist.txt ``` You must specify whether you want the protein (aa) or nucleotide (dna) sequences. If you ran Intronerate on these samples by providing the flag `--run_intronerate` to the `hybpiper assemble` command, you can also specify "supercontig" or "intron" to recover those sequences instead. By default, the command will output unaligned fasta files, one per gene, to the current directory. Alternativelty, you can specify an output directory by adding the parameter `--fasta_dir <your_output_directory_name>`. ## Introns and Paralogs We have prepared more in-depth discussions about [extracting introns](Introns) or [diagnosing putative paralogs](Paralogs) on separate pages.