###### tags: `Nanopore Sequencing`
# Prokaryotic Genome Annotation Pipeline (PGAP) - Annotation with NCBI Pipeline
## Create the Input files
### 1. FASTA file
Each sequence in the file must have a definition line that begins with `>` followed by a unique identifier (SeqID), e.g. `>contig_1` or `>contig_2`.

Example: `nord1.fasta`, which contains:
1. Chromosome
>contig_1 Escherichia coli nord1, complete genome [completeness=complete] [topology=circular] [location=chromosome]
ATTTGATGCCTGGCAGTTCCCTACTCTCGCATGGGGAGACCCCACACTACCATCGGCGCTACGGCGTTTCACTTCTGAGTTCGGCATGGGGTCAGGTGGGACCACCGCGCTAAGGCCGCCAGGCAAATTCTGTTTCATCAGACCGCTTCTGCGTTCTGATTTAATCTGTATCAGGCTGAAAATCTTCTCTCATCCGCCAAAACATCTTCGGCGTTGTAAGGTTAAGCCTCACGGTTCATTAGTACCGGTTAGCTCAACGCATCGCTGCGCTTACACACCCGGCCTATCAACGTCGTCGTCTTCAACGTTCCTTCAGGACTCTCAGGGAGTCAGGGAGAACTCATCTCGGGGCAAGTTTCGTGCTTAGATGCTTTCAGCACTTATCTCTTCCGCATTTAGCTACCGGGCAGTGCCATTGGCATGACAACCCGAACACCAGTGATGCGTCCACTCCGGTCCTCTCGTACTAGGAGCAGCCCCCTCAGTTCTCCAGCGCCCACGGCAGATAGGGACCGAACTGTCTCACGACGTTCTAAACCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCTACTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATATGAACTCTTGGGCGGTATCAGCCTGTTATCCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCCTTCCATTCAGAACCACC.....
2. Plasmids
>contig_2 Escherichia coli nord1, complete genome [completeness=complete] [topology=circular] [location=plasmid] [plasmid-name=pNORD1_2]
ATTTGATGCCTGGCAGTTCCCTACTCTCGCATGGGGAGACCCCACACTACCATCGGCGCTACGGCGTTTCACTTCTGAGTTCGGCATGGGGTCAGGTGGGACCACCGCGCTAAGGCCGCCAGGCAAATTCTGTTTCATCAGACCGCTTCTGCGTTCTGATTTAATCTGTATCAGGCTGAAAATCTTCTCTCATCCGCCAAAACATCTTCGGCGTTGTAAGGTTAAGCCTCACGGTTCATTAGTACCGGTTAGCTCAACGCATCGCTGCGCTTACACACCCGGCCTATCAACGTCGTCGTCTTCAACGTTCCTTCAGGACTCTCAGGGAGTCAGGGAGAACTCATCTCGGGGCAAGTTTCGTGCTTAGATGCTTTCAGCACTTATCTCTTCCGCATTTAGCTACCGGGCAGTGCCATTGGCATGACAACCCGAACACCAGTGATGCGTCCACTCCGGTCCTCTCGTACTAGGAGCAGCCCCCTCAGTTCTCCAGCGCCCACGGCAGATAGGGACCGAACTGTCTCACGACGTTCTAAACCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCTACTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATATGAACTCTTGGGCGGTATCAGCCTGTTATCCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCCTTCCATTCAGAACCACC....
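
The definition-line rules above can be checked before a run. The following is a minimal sketch in Python, not part of PGAP: the function name `check_fasta_deflines` is hypothetical, and it assumes that the SeqID is the text between `>` and the first whitespace character.

```python
def check_fasta_deflines(lines):
    """Return a list of problems found in the definition lines of a FASTA file.

    Assumption: the SeqID is everything between '>' and the first whitespace.
    """
    problems = []
    seen = set()
    have_defline = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith(">"):
            have_defline = True
            header = line[1:]
            if not header or header[0].isspace():
                problems.append(f"line {number}: definition line has no SeqID")
                continue
            seq_id = header.split()[0]
            if seq_id in seen:
                problems.append(f"line {number}: duplicate SeqID {seq_id!r}")
            seen.add(seq_id)
        elif not have_defline:
            problems.append(f"line {number}: sequence data before any '>' line")
    return problems


# Shortened stand-in for nord1.fasta (sequences truncated for illustration).
example = [
    ">contig_1 Escherichia coli nord1 [topology=circular] [location=chromosome]",
    "ATTTGATGCC",
    ">contig_2 Escherichia coli nord1 [topology=circular] [location=plasmid]",
    "ATTTGATGCC",
]
print(check_fasta_deflines(example))
```

For a real file, the lines can be passed directly from an open file handle, e.g. `check_fasta_deflines(open("nord1.fasta"))`; an empty list means no problem was found by these two checks.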
### 2. YAML files
Both `*.yaml` files must follow the structures shown below. An example of each file is available on the server at:
```
$:/mnt/data_1/pgap_ncbi/test_genomes-2020-09-24.build4894/MG37
```
#### a) input.yaml
```
fasta:
    class: File
    location: nord1.fasta
submol:
    class: File
    location: submol.yaml
```
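
Because `input.yaml` only differs between runs in the two file names, it can be generated rather than edited by hand. A minimal sketch, assuming the file names contain no characters that would need YAML quoting; the helper `render_input_yaml` is hypothetical and not part of PGAP:

```python
def render_input_yaml(fasta_name, submol_name="submol.yaml"):
    """Build the text of input.yaml for the given file names.

    Assumption: both names are plain file names that need no YAML quoting.
    """
    return (
        "fasta:\n"
        "    class: File\n"
        f"    location: {fasta_name}\n"
        "submol:\n"
        "    class: File\n"
        f"    location: {submol_name}\n"
    )


print(render_input_yaml("nord1.fasta"), end="")
```

The returned text can be written next to the FASTA file, for example with `pathlib.Path("input.yaml").write_text(...)`.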
#### b) submol.yaml
```
topology: circular
comment: 'There is no really a biologist Arnold Schwarzenegger'
consortium: 'SkyNet consortium'
sra:
    - accession: 'ERR123456789'
tp_assembly: true
organism:
    genus_species: 'Escherichia coli'
    strain: 'Nord1'
contact_info:
    last_name: 'Braun'
    first_name: 'Sascha'
    email: 'jane_doe@gmail.com'
    organization: 'Institute of Klebsiella foobarensis research'
    department: 'Department of Using NCBI'
    phone: '301-555-0245'
    street: '1234 Main St'
    city: 'Docker'
    postal_code: '12345'
    country: 'Lappland'
authors:
    - author:
        first_name: 'Arnold'
        last_name: 'Schwarzenegger'
        middle_initial: 'T'
    - author:
        first_name: 'Linda'
        last_name: 'Hamilton'
bioproject: 'PRJNA9999999'
biosample: 'SAMN99999999'
# -- Locus tag prefix - optional. Limited to 9 letters
locus_tag_prefix: 'nord01'
publications:
    - publication:
        pmid: 16397293
        title: 'Discrete CHARMm of Klebsiella foobarensis.'
        status: published  # this is enum: controlled vocabulary
        authors:
            - author:
                first_name: 'Martin'
                last_name: 'Reinicke'
            - author:
                first_name: 'Sascha'
                last_name: 'Braun'
                middle_initial: 'D'
```
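
The comment in the example limits `locus_tag_prefix` to 9 letters. A minimal sketch of a pre-flight check for that one rule, taking the comment at face value; the function `locus_tag_prefix_problems` is hypothetical, it reads the entry with a simple regular expression instead of a YAML parser, and any further rules NCBI may apply are not checked:

```python
import re


def locus_tag_prefix_problems(submol_text, max_length=9):
    """Check the optional locus_tag_prefix entry in the text of submol.yaml.

    Assumption: the entry sits on one top-level line, as in the example above.
    """
    match = re.search(r"^locus_tag_prefix:[ \t]*(.*)$", submol_text, flags=re.MULTILINE)
    if match is None:
        return []  # the entry is optional
    # Drop a trailing comment, surrounding spaces, and quotes.
    value = match.group(1).split("#", 1)[0].strip().strip("'\"")
    problems = []
    if not value:
        problems.append("locus_tag_prefix is empty")
    elif len(value) > max_length:
        problems.append(
            f"locus_tag_prefix {value!r} is longer than {max_length} characters"
        )
    return problems


print(locus_tag_prefix_problems("locus_tag_prefix: 'nord01'\n"))
```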
## Start PGAP
1. Copy all required files (FASTA, input.yaml, submol.yaml) into one folder, e.g.:
```
$:/mnt/data_1/pgap_ncbi/dresi_ref/nord1
```
2. Navigate to the PGAP folder:
```
$:/mnt/data_1/pgap_ncbi
```
3. Start `./pgap.py`:
```
$:./pgap.py -r -o dresi_ref/nord1/nord1_results dresi_ref/nord1/input.yaml
-o  output directory (created automatically)
-r  report to NCBI whenever the pipeline is run
```
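
Step 1 above (collecting the three input files in one folder) can also be scripted. A minimal sketch using only the Python standard library; the helper `stage_run_folder` is hypothetical and not part of PGAP:

```python
import shutil
from pathlib import Path


def stage_run_folder(fasta, input_yaml, submol_yaml, run_dir):
    """Copy the three input files into run_dir and return the new paths."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    staged = []
    for source in (fasta, input_yaml, submol_yaml):
        source = Path(source)
        if not source.is_file():
            raise FileNotFoundError(f"missing input file: {source}")
        staged.append(Path(shutil.copy2(source, run_dir / source.name)))
    return staged
```

A call such as `stage_run_folder("nord1.fasta", "input.yaml", "submol.yaml", "/mnt/data_1/pgap_ncbi/dresi_ref/nord1")` would then reproduce the folder used in the example.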
Useful options
```
-r, --report-usage-true Report to NCBI whenever the pipeline is run.
-n, --report-usage-false Do not report to NCBI.
-o path, --output path Output directory to be created, which may include a full path.
--no-internet Disable internet access for all programs in pipeline.
--taxcheck Also calculate the Average Nucleotide Identity to type assemblies
--taxcheck-only Only calculate the Average Nucleotide Identity to type assemblies, do not run PGAP
```
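
When several genomes are annotated in a row, the command line can be assembled programmatically from the options listed above. A minimal sketch; the function `build_pgap_command` and its parameter names are hypothetical, and only flags that appear in the list above are used:

```python
def build_pgap_command(input_yaml, output_dir, report_usage=True,
                       taxcheck=False, no_internet=False):
    """Assemble the pgap.py argument list from the options shown above."""
    command = ["./pgap.py", "-r" if report_usage else "-n", "-o", output_dir]
    if taxcheck:
        command.append("--taxcheck")
    if no_internet:
        command.append("--no-internet")
    command.append(input_yaml)  # the input.yaml path goes last, as in step 3
    return command


command = build_pgap_command(
    "dresi_ref/nord1/input.yaml",
    "dresi_ref/nord1/nord1_results",
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, cwd="/mnt/data_1/pgap_ncbi", check=True)` so that the pipeline starts from the PGAP folder, as in step 2.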