
###### tags: `hybpiper_wiki`

# Results and output files

[TOC]

## 1.0 `hybpiper assemble`

### Base Directory

The name of the base directory is specified by supplying the parameter `--prefix` to the `hybpiper assemble` command. If `--prefix` is not provided, it is generated from the read file names.

- The master target file (e.g. **`target_file.fasta`**).
- A BLAST (**`<target_file_name>.psq`**, etc.), DIAMOND (**`<target_file_name>.dmnd`**) or BWA database (**`<target_file_name>.amb`**, etc.).
- A BLAST/DIAMOND (**`<prefix>.blastx`**) or BWA (**`<prefix>.bam`**) mapping results file.
- A directory for every gene with BLAST/DIAMOND or BWA hits, e.g. **`gene001`**, **`gene002`**, etc.
- **`target_tallies.txt`**. A text file summarizing the chosen target reference sequences for the sample run.
- **`spades_genelist.txt`**. A text file listing all genes with mapped reads.
- **`exonerate_genelist.txt`**. A text file listing all genes with assembled SPAdes contigs.
- **`genes_with_seqs.txt`**. A text file listing all genes for which a coding sequence was extracted via Exonerate.
- **`<prefix>_genes_with_long_paralog_warnings.txt`**. A text file listing all genes which had multiple long-length sequences from different SPAdes contigs (putative paralogs).
- **`<prefix>_genes_with_paralog_warnings_by_contig_depth.csv`**. A comma-separated-values file listing all genes that had a SPAdes contig depth >1 for at least 75% (default) of the length of the reference target file sequence.
- **`<prefix>_genes_with_stitched_contig.csv`**. A comma-separated-values file with details on whether a stitched contig was created for a given gene.
- **`<prefix>_genes_derived_from_putative_chimera_stitched_contig.csv`**. A comma-separated-values file listing all genes that might be derived from a chimeric stitched contig (i.e. comprising multiple paralogs).
- **`<prefix>_hybpiper_assemble_<date_time>.log`**. A text log file containing many details regarding the pipeline run for the sample.
- **`spades.log`**. A text log file containing the concatenated output of the SPAdes assembler for initial SPAdes assemblies for all genes.
- **`failed_spades.txt`**. A text file listing all genes that had a failed initial SPAdes assembly.
- **`redo_spades_commands.txt`**. A text file containing commands to re-run SPAdes for genes with a failed initial assembly.
- **`spades_redo.log`**. A text log file containing the concatenated output of the SPAdes assembler for SPAdes re-runs.
- **`spades_duds.txt`**. A text file listing all genes with failed SPAdes re-runs.
- **`total_input_reads_paired.txt`**. A text file containing the number of paired-end reads (if supplied) in the input read files.
- **`total_input_reads_single.txt`**. A text file containing the number of single-end reads (if supplied) in the input read files.
- **`total_input_reads_unpaired.txt`**. A text file containing the number of unpaired reads (if supplied) in the input read files.

### Base Directory -> Gene Directory

The gene directories will be named according to the unique gene names present in the target file used for the run.

- **`<gene_name>_interleaved.fasta`**. A fasta file containing all reads provided using the `--readfiles` or `-r` parameter that mapped to any target sequence for this gene. In cases where only one read of a read pair mapped, both R1 and R2 reads are included in this file. If paired-end read files were used as input, this fasta file is in interleaved format; note that this file will have the suffix `interleaved.fasta` even if you provide single-end reads.
- **`<gene_name>_merged.fastq`**. A fastq file of merged reads from paired-end input. This file will only be present if the flag `--merged` is used with the `hybpiper assemble` command and paired-end reads are provided.
- **`<gene_name>_unmerged.fastq`**. A fastq file of paired-end reads that could not be merged, in interleaved format.
  This file will only be present if the flag `--merged` is used with the `hybpiper assemble` command and paired-end reads are provided.
- **`<gene_name>_unpaired.fasta`**. A fasta file containing all reads provided using the `--unpaired` parameter that mapped to any target sequence for this gene.
- **`<gene_name>_contigs.fasta`**. The contigs assembled from the input reads using SPAdes.
- **`<gene_name>_target.fasta`**. A fasta file with the amino-acid sequence of the 'best' reference target for the given gene/sample.
- **`<gene_name>_<date_time>.log`**. The log file produced by the `exonerate_hits.py` module for the given gene/sample. This will only be present if the flag `--keep_intermediate_files` was provided to the command `hybpiper assemble`; default behaviour is to delete the log file after it has been re-logged to the main sample logfile in the base directory.
- **`<sample_name>`**. A directory of Exonerate results; the directory has the same name as the sample. See below for details.
- **`<gene_name>_spades`**. The directory produced by the SPAdes assembler for the given gene/sample. See below for details.

### Base Directory -> Gene Directory -> SPAdes directory

The SPAdes assembly directory is produced by the SPAdes assembler; in this case it will have a prefix corresponding to the given gene name, i.e. `<gene_name>_spades`. This directory will **only be present** if the flag `--keep_intermediate_files` was provided to the command `hybpiper assemble`; default behaviour is to delete the directory after processing. It contains standard SPAdes output files and folders as described [here](https://github.com/ablab/spades#spades-output).

### Base Directory -> Gene Directory -> Exonerate Directory

The Exonerate directory will have the same name as the base directory (i.e. the sample name), and contains output files and folders produced by the `exonerate_hits.py` module.

- **`exonerate_results.fasta`**.
  The output of the initial Exonerate search of the target protein against the SPAdes contigs. This file contains both the Exonerate alignments and the fasta sequence for the extracted coding region.
- **`exonerate_stats.tsv`**. A table in tab-separated-values format, containing information on all SPAdes contigs with Exonerate hits against the 'best' reference target sequence.
- **`exonerate_hits_trimmed.FAA`**. A fasta file containing amino-acid sequences of one or more Exonerate hits used to create the output gene sequence.
- **`exonerate_hits_trimmed.FNA`**. A fasta file containing nucleotide sequences of one or more Exonerate hits used to create the output gene sequence.
- **`genes_with_stitched_contig.csv`**. A file in comma-separated-values format, providing details on whether the given gene/sample sequence was derived from a stitched contig.
- **`paralog_warning_long.txt`**. A text file produced if the given gene/sample had 'long' paralog warnings, listing the corresponding SPAdes contigs along with Exonerate hit details.
- **`paralog_warning_by_contig_depth.txt`**. A text file detailing whether the given gene/sample has a paralog warning produced by sequence depth across the reference target sequence after Exonerate searches.
- **`chimera_test_stitched_contig.fasta`**. A fasta file containing a stitched contig nucleotide sequence, used for read mapping during the chimera test.
- **`chimera_test_stitched_contig.sam`**. A mapping file in Sequence Alignment Map (SAM) format, produced by mapping paired-end reads against the `chimera_test_stitched_contig.fasta` sequence.
- **`putative_chimeric_stitched_contig.csv`**. A file in comma-separated-values format, produced if a stitched contig for the given gene/sample appears to be chimeric. Lists the sample name, gene name, and chimera warning details.
- **`chimera_test_diagnostic_reads.sam`**. A headless mapping file in Sequence Alignment Map (SAM) format, produced by filtering the `chimera_test_stitched_contig.sam` file to retain read pairs diagnostic for a chimeric stitched contig.
- **`sequences`**. A directory containing subdirectories with recovered sequences. See below for details.
- **`intronerate`**. A directory containing intron and supercontig processing results. See below for details.
- **`paralogs`**. A directory containing paralog sequence results, if present. See below for details.

### Base Directory -> Gene Directory -> Exonerate Directory -> Sequences Directory

The directory **`sequences`** contains subdirectories containing fasta files with recovered sequences, as follows:

- **`FAA`**. A directory containing the fasta file `<gene_name>.FAA` with the recovered gene sequence in amino-acids.
- **`FNA`**. A directory containing the fasta file `<gene_name>.FNA` with the recovered gene sequence in nucleotides.
- **`intron`**. A directory containing the fasta files `<gene_name>_introns.fasta` and `<gene_name>_supercontig.fasta`. These files contain the recovered intron sequence and the recovered supercontig sequence (the latter containing both introns and exons) for the gene/sample. This directory will **only be present** if the flag `--run_intronerate` was provided to the command `hybpiper assemble`.

### Base Directory -> Gene Directory -> Exonerate Directory -> Paralogs Directory

The directory **`paralogs`** contains the fasta file `<gene_name>_paralogs.fasta` with paralog sequences, if recovered for the gene/sample.

### Base Directory -> Gene Directory -> Exonerate Directory -> Intronerate Directory

The directory **`intronerate`** will only be present if the flag `--run_intronerate` was provided to the command `hybpiper assemble`. It contains output files produced by Intronerate (the process used to recover introns and supercontigs, if present for the gene/sample).

- **`intronerate_query_stripped.fasta`**. A fasta file containing the recovered gene sequence in amino-acid format, with any 'X' characters removed. Used as a query in Exonerate searches to generate a gff file.
- **`<gene_name>_supercontig_without_Ns.fasta`**. A fasta file containing a supercontig (i.e. exons and introns) for the given gene/sample. Used as a target in Exonerate searches to generate a gff file.
- **`<gene_name>_intronerate_supercontig_individual_contig_hits.fasta`**. A fasta file containing the individual SPAdes contigs used to create the supercontig sequence.
- **`<gene_name>_intronerate_fasta_and_gff.txt`**. A text file containing both Exonerate search alignment and gff details.
- **`intronerate.gff`**. The gff details only, extracted from the `<gene_name>_intronerate_fasta_and_gff.txt` file.

## 2.0 `hybpiper stats`

### Parent Directory

The parent directory contains one or more Base directories corresponding to the output of `hybpiper assemble` for each sample. The descriptions below assume that the command `hybpiper stats` has been run from the parent directory.

- **`seq_lengths.tsv`**. A table in tab-separated-values format, containing the lengths of each recovered gene sequence for each sample, along with the mean sequence length for each gene within the target file. The name of this file can be changed using the parameter `--seq_lengths_filename <filename>`.
- **`hybpiper_stats.tsv`**. A table in tab-separated-values format, containing statistics on the HybPiper run. The name of this file can be changed using the parameter `--stats_filename <filename>`.

## 3.0 `hybpiper retrieve_sequences`

### Parent Directory

The parent directory contains one or more Base directories corresponding to the output of `hybpiper assemble` for each sample. The descriptions below assume that the command `hybpiper retrieve_sequences` has been run from the parent directory.

- **`<gene_name>.FNA`**.
  A fasta file containing the recovered gene sequence from each sample in nucleotides (if parameter `dna` was supplied). A fasta file will be produced for each gene.
- **`<gene_name>.FAA`**. A fasta file containing the recovered gene sequence from each sample in amino-acids (if parameter `aa` was supplied). A fasta file will be produced for each gene.
- **`<gene_name>_introns.fasta`**. A fasta file containing the recovered gene intron sequence from each sample in nucleotides (if parameter `intron` was supplied). A fasta file will be produced for each gene.
- **`<gene_name>_supercontig.fasta`**. A fasta file containing the recovered gene supercontig sequence (exons and introns) from each sample in nucleotides (if parameter `supercontig` was supplied). A fasta file will be produced for each gene.

### Optional sequence directory

If the parameter `--fasta_dir <directory_name>` is provided, the directory will be created and the fasta files described above will be placed within it, rather than in the parent directory.

## 4.0 `hybpiper paralog_retriever`

### Parent Directory

The parent directory contains one or more Base directories corresponding to the output of `hybpiper assemble` for each sample. The descriptions below assume that the command `hybpiper paralog_retriever` has been run from the parent directory.

- **`paralog_heatmap.png`**. A heatmap image file in `*.png` format, depicting...
- **`paralog_report.tsv`**. A table in tab-separated-values format, containing...
- **`paralogs_above_threshold_report.txt`**. A text file containing...
- **`paralogs_all`**. A directory containing...
- **`paralogs_no_chimeras`**. A directory containing...

## 5.0 `hybpiper recovery_heatmap`

### Parent Directory

The parent directory contains one or more Base directories corresponding to the output of `hybpiper assemble` for each sample. The descriptions below assume that the command `hybpiper recovery_heatmap` has been run from the parent directory.

- **`recovery_heatmap.png`**.
  A heatmap image file in `*.png` format, depicting the length of the recovered sequence for each gene and each sample, relative to the mean length of the gene sequence references in the target file. The name of this file can be changed using the parameter `--heatmap_filename <filename>`. The format of the file can be changed using the parameter `--heatmap_filetype {png,pdf,eps,tiff,svg}`.

## 6.0 `hybpiper check_dependencies`

No output files are produced by this command. Results are printed to the terminal screen.

## 7.0 `hybpiper check_targetfile`

No output files are produced by this command. Results are printed to the terminal screen.
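
As an illustration of the `hybpiper assemble` directory layout described in section 1.0 (Base Directory -> Gene Directory -> Exonerate Directory -> Sequences Directory), the following Python sketch collects the recovered nucleotide sequence file for each gene of one sample. It is not part of HybPiper: the function name `recovered_fna_paths` is hypothetical, and the sketch assumes that the base directory name is the sample name and that every subdirectory of the base directory is a gene directory.

```python
from pathlib import Path


def recovered_fna_paths(base_dir):
    """Map gene name -> path of its recovered nucleotide fasta file, if present.

    Hypothetical helper (not part of HybPiper). It relies only on the layout
    <base>/<gene_name>/<sample_name>/sequences/FNA/<gene_name>.FNA
    and assumes the sample name equals the base directory name.
    """
    base = Path(base_dir).resolve()
    sample = base.name  # the Exonerate directory shares the base directory's name
    found = {}
    # Assumption: every subdirectory of the base directory is a gene directory.
    for gene_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        gene = gene_dir.name
        fna = gene_dir / sample / "sequences" / "FNA" / f"{gene}.FNA"
        if fna.is_file():
            found[gene] = fna
    return found
```

Calling `recovered_fna_paths("sampleA")` from the parent directory would return a dictionary keyed by gene name; genes without a recovered sequence are simply absent from it.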