###### tags: `hybpiper_wiki`

# Introns

**`hybpiper assemble --run_intronerate`**

If the flag `--run_intronerate` is provided when running the command `hybpiper assemble`, the pipeline will attempt to identify introns (if present), and will also produce a 'supercontig' sequence for each gene/sample. These are defined as:

- **`supercontig`**: A sequence containing all assembled SPAdes contigs with a unique alignment to the reference target file sequence, concatenated into one sequence. The `supercontig` sequence can contain both exon AND intron sequences. See note 1 below.
- **`introns`**: Sequences in the `supercontig` that Exonerate annotates as 'intron'. See note 2 below.

**Example command:**

```
hybpiper assemble -t test_targets.fasta -r NZ281_R*_test.fastq --prefix NZ281 --bwa --run_intronerate
```

Alternatively, if you have already run the `hybpiper assemble` command without the `--run_intronerate` flag, you can use the following command to run only the final stage of the pipeline (including intron and supercontig recovery):

```
hybpiper assemble -t test_targets.fasta -r NZ281_R*_test.fastq --prefix NZ281 --bwa --run_intronerate --no-blast --no-distribute --no-assemble
```

**NOTES**:

1) The supercontig can contain multiple SPAdes contigs that have been concatenated. Ideally these will contain only genuine exon and intron (or intergenic) sequences. However, the sequence might also partly comprise mis-assembled contigs. While it may be difficult to tell whether the sequence is "real" from a single sample, I recommend using `--run_intronerate` on several samples. Then, extract the supercontig sequences with `hybpiper retrieve_sequences` and align them. Sequences that appear in only one sample are probably from mis-assembled contigs and may be trimmed, for example using the program trimAl.

2) Intron sequences recovered by using the `--run_intronerate` flag correspond to regions in the supercontig that Exonerate has annotated as 'intron'.
Because the target file contains exon-only sequences, each Exonerate hit begins and ends with an exon, so only intron sequences that lie between identified exons will be recovered. That is, sequence upstream of the first exon and downstream of the last exon will not be annotated, and will not appear in the `intronerate.gff` file found within each gene directory. These unannotated regions might be exonic (e.g. if the target protein used in Exonerate searches was shorter than the exonic regions assembled in SPAdes contigs) or intronic (but not detected, for the reason described above).
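
To illustrate which parts of a supercontig end up annotated, here is a minimal sketch that lists the 'intron' features in a GFF file and reports the flanking regions that carry no annotation. It rests on several assumptions that are not confirmed by the text above: that the file is a standard nine-column, tab-separated GFF, that its coordinates are 1-based and relative to the supercontig, and that you supply the supercontig length yourself. The function name `summarise_introns` and the output labels are hypothetical and are not part of HybPiper.

```shell
# Summarise intron features and unannotated flanks from a GFF file.
# Assumptions: nine tab-separated columns; column 3 = feature type,
# columns 4 and 5 = 1-based inclusive start and end on the supercontig.
# Usage: summarise_introns <gff_file> <supercontig_length>
summarise_introns() {
    local gff="$1" seq_len="$2"
    awk -F'\t' -v len="$seq_len" '
        /^#/ || NF < 9 { next }                  # skip comment and malformed lines
        {
            s = $4 + 0; e = $5 + 0               # force numeric comparison
            if (n == 0 || s < min) min = s       # leftmost annotated position
            if (e > max) max = e                 # rightmost annotated position
            n++
        }
        $3 == "intron" { printf "intron\t%d\t%d\t%d\n", s, e, e - s + 1 }
        END {
            if (n == 0) exit                     # nothing annotated at all
            if (min > 1)   printf "unannotated_5prime\t1\t%d\t%d\n", min - 1, min - 1
            if (max < len) printf "unannotated_3prime\t%d\t%d\t%d\n", max + 1, len, len - max
        }
    ' "$gff"
}
```

Each output line gives a label, start, end, and length. A call might look like `summarise_introns intronerate.gff 1500`, where the file location and the length `1500` are placeholders for your own gene directory and supercontig. The reported flanks are exactly the regions that, as described in note 2, may be either exonic or intronic.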