## C3. Gene annotation with braker
Once we have softmasked our genomes, we can proceed with gene annotation. BRAKER3 allows fully automated training of the gene prediction tools GeneMark-ES/ET/EP/ETP and AUGUSTUS from RNA-Seq and/or protein homology information, and integrates the extrinsic evidence from RNA-Seq and protein homology information into the prediction.
In contrast to other available methods that rely on protein homology information, BRAKER2 reaches high gene prediction accuracy even in the absence of the annotation of very closely related species and in the absence of RNA-Seq data.
### Software requirements
* **Conda environments**
* **braker3 3.0.8** (https://anaconda.org/bioconda/braker3)
* **Programs**
* **Genemark-ETP 1.0** (https://github.com/gatech-genemark/GeneMark-ETP). We need the GeneMark-ETP 1.0 software (http://topaz.gatech.edu/Genemark/license_download.cgi). You will need to register, download the software and the key to your computer, and then upload both to your cluster. Store the key in your home folder on the cluster. Then decompress both files and proceed with the installation. Make sure Perl and Python3 are installed beforehand.
````
scp /your/home/folder/gmetp_linux_64.tar.gz igomez@login.spc.cica.es:/home/igomez/software/gmetp_linux_64.tar.gz
scp /your/home/folder/gm_key_64.gz igomez@login.spc.cica.es:/home/igomez/gm_key_64.gz
ssh igomez@login.spc.cica.es
gunzip ~/gm_key_64.gz
cd software
tar -xvzf gmetp_linux_64.tar.gz
````
* **ProtHint** (https://github.com/gatech-genemark/ProtHint)
````
cd ~/software
git clone https://github.com/gatech-genemark/ProtHint.git
````
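Before going further, the Perl and Python3 requirement mentioned above can be checked from the command line. The following is a minimal sketch, not part of the original workflow; the helper name `missing_commands` is hypothetical, and it only tests that the interpreters are on your `PATH`, not their versions.

```shell
# Print the name of every given command that is NOT on PATH.
# Returns 1 if at least one command is missing, 0 otherwise.
missing_commands() {
    local status=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            status=1
        fi
    done
    return "$status"
}

# Check the two interpreters named in the requirements above.
if missing=$(missing_commands perl python3); then
    echo "perl and python3 are on PATH"
else
    echo "Missing from PATH:" $missing
fi
```

Run it on the cluster login node, and again inside the conda environment, since the two may expose different interpreters.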
Running the braker.pl command requires many dependencies and a very long computational time, much longer than anything else we have run in this tutorial (at least 3-4 days per genome). Because BRAKER installation is tricky due to the many dependencies across multiple manually installed programs, it is strongly encouraged to run a test with a small dataset first. We will use the example proposed in the BRAKER documentation (https://github.com/Gaius-Augustus/BRAKER#example-data), where we will annotate 1,000,000 nucleotides of *Arabidopsis thaliana* using protein data from OrthoDB, particularly the database for Brassicales ``brassicales_odb10``. The data is available at https://github.com/Gaius-Augustus/BRAKER/tree/master/example.
We will perform the test in a folder called ``test``.
````
mkdir -p ~/genome_assembly/C3_braker/test
cd ~/genome_assembly/C3_braker/test
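# Use the raw file URLs; the /blob/ URLs return an HTML page instead of the FASTA file
wget https://raw.githubusercontent.com/Gaius-Augustus/BRAKER/master/example/genome.fa
wget https://raw.githubusercontent.com/Gaius-Augustus/BRAKER/master/example/proteins.fa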
````
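It is worth confirming that the downloaded files really are FASTA, because a download from a GitHub web page URL can silently save an HTML page under the expected file name. A minimal sketch, assuming only that a FASTA file starts with a `>` header line; the helper name `looks_like_fasta` is hypothetical.

```shell
# Return 0 if the first non-blank line of the file starts with '>' (FASTA header).
# Returns 1 for missing files, empty files, and anything else (e.g. HTML).
looks_like_fasta() {
    local first
    first=$(grep -m1 -v '^[[:space:]]*$' "$1" 2>/dev/null) || return 1
    case $first in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Check the two example files in the current directory.
for f in genome.fa proteins.fa; do
    if looks_like_fasta "$f"; then
        echo "$f: looks like FASTA"
    else
        echo "$f: not FASTA (missing, empty, or an HTML page?)"
    fi
done
```

If a file fails the check, inspect it with `head genome.fa` before launching the job.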
Then, we will launch a test job. This test should take less than an hour to finish.
``nano ~/genome_assembly/C3_braker/test/C3_braker_test.sbatch``
This is the test script:
````
#!/usr/bin/bash
#SBATCH --job-name=C3_braker_test
#SBATCH --output=C3_braker_test_output
#SBATCH --error=C3_braker_test_error
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=240G
#SBATCH --partition=standard
#SBATCH --time=6:00:00
###Set dependencies directories and export them
GENEMARK_PATH=/home/igomez/software/gmetp_linux_64/bin
PROTHINT_PATH=/home/igomez/software/gmetp_linux_64/bin/gmes/ProtHint/bin
###Set conda source and activate environment
source ~/anaconda3/etc/profile.d/conda.sh
conda activate braker
export GENEMARK_PATH PROTHINT_PATH
###Set output directories
OUTDIR="/home/igomez/genome_assembly/C3_braker/test"
cd "$OUTDIR"
#Run braker
braker.pl --genome=genome.fa --prot_seq=proteins.fa --threads=32 --species=arabidopsis_thatiana --busco_lineage=brassicales_odb10
conda deactivate
### SOFTWARE VERSIONS
# BRAKER version 3.0.8
# genemark_ETP
# ProtHint v2.6
````
When it finishes, check the error file ``C3_braker_test_error``. Then go to the ``braker`` folder and check the ``braker.log`` file, the ``errors`` folder and all the files within it. Your output folder should look like this:
````
cd ~/genome_assembly/C3_braker/test
du -h *
880K Augustus
20M bbc/genemark/brassicales_odb10_hmmsearch_output
21M bbc/genemark
20M bbc/augustus/brassicales_odb10_hmmsearch_output
21M bbc/augustus
21M bbc/braker/brassicales_odb10_hmmsearch_output
21M bbc/braker
61M bbc
4.0K best_by_compleasm.log
128K braker.aa
376K braker.codingseq
396K braker.gtf
76K braker.log
24K errors
2.6M GeneMark-EP
2.3M GeneMark-ES
4.0K genome_header.map
428K hintsfile.gff
140K prothint.gff
384K species/arabidopsis_thatiana
388K species
4.0K what-to-cite.txt
````
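The checks described above (the ``errors`` folder, ``braker.log`` and the prediction files) can be partly scripted. This is a rough sketch under stated assumptions: the helper names are hypothetical, and a non-empty file in ``errors`` does not necessarily mean the run failed, since some tools write harmless messages there.

```shell
# List non-empty regular files under a directory, one per line, sorted.
nonempty_files() {
    find "$1" -type f -size +0 2>/dev/null | sort
}

# Count FASTA records (lines starting with '>') in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example, run from the folder that contains the BRAKER output:
#   nonempty_files errors            # files worth opening and reading
#   grep -i -n 'error' braker.log    # log lines that mention an error
#   count_fasta_records braker.aa    # number of predicted protein sequences
```

A ``braker.aa`` with zero records, or a missing ``braker.gtf``, is a strong hint that something went wrong even when the Slurm error file is empty.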
Pay attention to any error messages. Sometimes BRAKER cannot find some scripts or executables. Here is an example:
````
cat ~/genome_assembly/C3_braker/test/C3_braker_test_error
ERROR: augustus binary not found or not executable.
````
Make sure it exists in your cluster using ``find``:
````
find /home/igomez -name augustus
/home/igomez/augustus/bin/augustus
````
Then, just add the path to the Slurm script: ``export AUGUSTUS_BIN_PATH=~/augustus/bin``
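The locate-and-export step can also be done in one go. The sketch below is only an illustration: `find_bin_dir` is a hypothetical helper, it considers regular executable files only (so a binary that is a symlink would be skipped), and it picks the first match in sorted order, which may not be the installation you want if several exist.

```shell
# Print the directory holding the first executable regular file called NAME under ROOT.
# Usage: find_bin_dir ROOT NAME ; returns 1 if nothing is found.
find_bin_dir() {
    local root=$1 name=$2 hit
    hit=$(find "$root" -type f -name "$name" -perm -u+x 2>/dev/null | sort | head -n 1)
    [ -n "$hit" ] || return 1
    dirname "$hit"
}

# Example for the Slurm script (searching a whole home folder can be slow):
#   if dir=$(find_bin_dir "$HOME" augustus); then
#       export AUGUSTUS_BIN_PATH=$dir
#   fi
```

Hard-coding the path once you know it, as in the ``export`` line above, is simpler and faster for production runs.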
Sometimes installations are incomplete and you will have to go to the GitHub repositories and copy the missing scripts directly into the ``bin`` or ``scripts`` folder inside the software directories or within the conda environment folder.