###### tags: `Nanopore Sequencing`

# Prokaryotic Genome Annotation Pipeline (PGAP) - Annotation with NCBI Pipeline

## Create the Input files

### 1. FASTA file

Each sequence in the file must have a definition line that begins with '>' followed by a unique identifier (SeqID), e.g. `>contig_1` or `>contig_2`.

Example: nord1.fasta, containing:

1. Chromosome

>contig_1 Escherichia coli nord1, complete genome [completeness=complete] [topology=circular] [location=chromosome]
ATTTGATGCCTGGCAGTTCCCTACTCTCGCATGGGGAGACCCCACACTACCATCGGCGCTACGGCGTTTCACTTCTGAGTTCGGCATGGGGTCAGGTGGGACCACCGCGCTAAGGCCGCCAGGCAAATTCTGTTTCATCAGACCGCTTCTGCGTTCTGATTTAATCTGTATCAGGCTGAAAATCTTCTCTCATCCGCCAAAACATCTTCGGCGTTGTAAGGTTAAGCCTCACGGTTCATTAGTACCGGTTAGCTCAACGCATCGCTGCGCTTACACACCCGGCCTATCAACGTCGTCGTCTTCAACGTTCCTTCAGGACTCTCAGGGAGTCAGGGAGAACTCATCTCGGGGCAAGTTTCGTGCTTAGATGCTTTCAGCACTTATCTCTTCCGCATTTAGCTACCGGGCAGTGCCATTGGCATGACAACCCGAACACCAGTGATGCGTCCACTCCGGTCCTCTCGTACTAGGAGCAGCCCCCTCAGTTCTCCAGCGCCCACGGCAGATAGGGACCGAACTGTCTCACGACGTTCTAAACCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCTACTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATATGAACTCTTGGGCGGTATCAGCCTGTTATCCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCCTTCCATTCAGAACCACC.....

2. Plasmids

>contig_2 Escherichia coli nord1, complete genome [completeness=complete] [topology=circular] [location=plasmid] [plasmid-name = pNORD1_2]
ATTTGATGCCTGGCAGTTCCCTACTCTCGCATGGGGAGACCCCACACTACCATCGGCGCTACGGCGTTTCACTTCTGAGTTCGGCATGGGGTCAGGTGGGACCACCGCGCTAAGGCCGCCAGGCAAATTCTGTTTCATCAGACCGCTTCTGCGTTCTGATTTAATCTGTATCAGGCTGAAAATCTTCTCTCATCCGCCAAAACATCTTCGGCGTTGTAAGGTTAAGCCTCACGGTTCATTAGTACCGGTTAGCTCAACGCATCGCTGCGCTTACACACCCGGCCTATCAACGTCGTCGTCTTCAACGTTCCTTCAGGACTCTCAGGGAGTCAGGGAGAACTCATCTCGGGGCAAGTTTCGTGCTTAGATGCTTTCAGCACTTATCTCTTCCGCATTTAGCTACCGGGCAGTGCCATTGGCATGACAACCCGAACACCAGTGATGCGTCCACTCCGGTCCTCTCGTACTAGGAGCAGCCCCCTCAGTTCTCCAGCGCCCACGGCAGATAGGGACCGAACTGTCTCACGACGTTCTAAACCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCTACTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATATGAACTCTTGGGCGGTATCAGCCTGTTATCCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCCTTCCATTCAGAACCACC....

### 2. YAML files

Both `*.yaml` files must follow the structures shown below. An example of each file is available on the server at:

```
$:/mnt/data_1/pgap_ncbi/test_genomes-2020-09-24.build4894/MG37
```

#### a) input.yaml

```
fasta:
    class: File
    location: nord1.fasta
submol:
    class: File
    location: submol.yaml
```

#### b) submol.yaml

```
topology: circular
comment: 'There is no really a biologist Arnold Schwarzenegger'
consortium: 'SkyNet consortium'
sra:
    - accession: 'ERR123456789'
tp_assembly: true
organism:
    genus_species: 'Escherichia coli'
    strain: 'Nord1'
contact_info:
    last_name: 'Braun'
    first_name: 'Sascha'
    email: 'jane_doe@gmail.com'
    organization: 'Institute of Klebsiella foobarensis research'
    department: 'Department of Using NCBI'
    phone: '301-555-0245'
    street: '1234 Main St'
    city: 'Docker'
    postal_code: '12345'
    country: 'Lappland'
authors:
    - author:
        first_name: 'Arnold'
        last_name: 'Schwarzenegger'
        middle_initial: 'T'
    - author:
        first_name: 'Linda'
        last_name: 'Hamilton'
bioproject: 'PRJNA9999999'
biosample: 'SAMN99999999'
# -- Locus tag prefix - optional. Limited to 9 letters
locus_tag_prefix: 'nord01'
publications:
    - publication:
        pmid: 16397293
        title: 'Discrete CHARMm of Klebsiella foobarensis.'
        status: published  # this is enum: controlled vocabulary
        authors:
            - author:
                first_name: 'Martin'
                last_name: 'Reinicke'
            - author:
                first_name: 'Sascha'
                last_name: 'Braun'
                middle_initial: 'D'
```

## Start PGAP

1. Copy all required files (FASTA, input.yaml, submol.yaml) into one folder, e.g.:

```
$:/mnt/data_1/pgap_ncbi/dresi_ref/nord1
```

2. Navigate to the PGAP folder:

```
$:/mnt/data_1/pgap_ncbi
```

3. Start ./pgap.py:

```
$:./pgap.py -r -o dresi_ref/nord1/nord1_results dresi_ref/nord1/input.yaml

-o  output directory (created automatically)
-r  Report to NCBI whenever the pipeline is run
```

Useful options

```
-r, --report-usage-true
        Report to NCBI whenever the pipeline is run.
-n, --report-usage-false
        Do not report to NCBI.
-o path, --output path
        Output directory to be created, which may include a full path.
--no-internet
        Disable internet access for all programs in pipeline.
--taxcheck
        Also calculate the Average Nucleotide Identity to type assemblies
--taxcheck-only
        Only calculate the Average Nucleotide Identity to type assemblies, do not run PGAP
```
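
Before starting a run, it can help to confirm that the FASTA file meets the requirement stated under "Create the Input files": every sequence has a definition line beginning with '>' and a unique SeqID. The following is a minimal sketch of such a pre-flight check, not part of PGAP itself. The function name `check_fasta_deflines` is hypothetical, and the sketch assumes that the SeqID is the first whitespace-delimited token after '>'.

```python
from collections import Counter
from typing import Iterable, List


def check_fasta_deflines(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found in the FASTA definition lines.

    Assumption: the SeqID is the first whitespace-delimited token after '>'.
    """
    problems: List[str] = []
    seq_ids: List[str] = []
    seen_defline = False
    reported_orphan = False

    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith(">"):
            seen_defline = True
            fields = line[1:].split()
            if not fields:
                problems.append(f"line {number}: definition line has no SeqID")
            else:
                seq_ids.append(fields[0])
        elif not seen_defline and not reported_orphan:
            # Sequence data that is not preceded by any definition line.
            problems.append(
                f"line {number}: sequence data before the first '>' definition line"
            )
            reported_orphan = True

    # Every SeqID must occur exactly once in the file.
    for seq_id, count in Counter(seq_ids).items():
        if count > 1:
            problems.append(f"SeqID '{seq_id}' is used {count} times")

    return problems
```

The function accepts any iterable of lines, so an open file handle for `nord1.fasta` can be passed directly. An empty list means no problems were detected by these two checks; it does not guarantee that PGAP will accept the file.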