Install braker

## C3. Gene annotation with braker

Once we have softmasked our genomes, we can proceed with gene annotation. Braker3 allows for fully automated training of the gene prediction tools GeneMark-ES/ET/EP/ETP (R14, R15, R17, F1) and AUGUSTUS from RNA-Seq and/or protein homology information, and integrates the extrinsic evidence from RNA-Seq and protein homology information into the prediction. In contrast to other available methods that rely on protein homology information, BRAKER2 reaches high gene prediction accuracy even in the absence of the annotation of very closely related species and in the absence of RNA-Seq data.

### Software requirements

* **Conda environments**
  * **braker3 3.0.8** (https://anaconda.org/bioconda/braker3)
* **Programs**
  * **Genemark-ETP 1.0** (https://github.com/gatech-genemark/GeneMark-ETP). We need the Genemark-ETP 1.0 software (http://topaz.gatech.edu/Genemark/license_download.cgi). You will need to register, download the software and the key to your computer, and then upload both to your cluster. Store the key in your home folder on the cluster. Then decompress both files and proceed with the installation. Make sure Perl and Python3 are installed beforehand.

    ````
    # On your local machine: upload the software and the key to the cluster
    scp /your/home/folder/gmetp_linux_64.tar.gz igomez@login.spc.cica.es:/home/igomez/software/gmetp_linux_64.tar.gz
    scp /your/home/folder/gm_key_64.gz igomez@login.spc.cica.es:/home/igomez/gm_key_64.gz

    # On the cluster: decompress the key and unpack the software
    ssh igomez@login.spc.cica.es
    gunzip ~/gm_key_64.gz
    cd ~/software
    tar -xvzf gmetp_linux_64.tar.gz
    ````

  * **ProtHint** (https://github.com/gatech-genemark/ProtHint)

    ````
    cd ~/software
    git clone https://github.com/gatech-genemark/ProtHint.git
    ````

Running the braker.pl command requires many dependencies and a very long computational time, much longer than anything else we have run in this tutorial (at least 3-4 days per genome).
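Because a missing dependency can waste a long run, a short pre-flight check may help. The sketch below only illustrates the idea: `check_cmds` is a hypothetical helper, not part of braker, and the command names in the usage comment are assumptions based on the requirements listed above.

```shell
# check_cmds: hypothetical helper (not part of braker).
# Prints every command that cannot be found in PATH and
# returns 1 if at least one of them is missing, 0 otherwise.
check_cmds() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "MISSING: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Example (run it inside the activated conda environment):
#   check_cmds perl python3 braker.pl
```

A check like this only confirms that the commands are reachable, not that they are the right versions or fully installed.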
As braker installation is tricky due to the many dependencies across multiple manually installed software packages, it is strongly encouraged to run a test with a small dataset first. We will use the example proposed in the braker documentation (https://github.com/Gaius-Augustus/BRAKER#example-data), where we will annotate 1,000,000 nucleotides of *Arabidopsis thaliana* using protein data from OrthoDB, particularly the database for Brassicales, ``brassicales_odb10``. The data is available at https://github.com/Gaius-Augustus/BRAKER/tree/master/example. We will perform the test in a folder called ``test``.

````
mkdir -p ~/genome_assembly/C3_braker/test
cd ~/genome_assembly/C3_braker/test
# Use the raw file URLs: the /blob/ URLs return an HTML page, not the FASTA file
wget https://raw.githubusercontent.com/Gaius-Augustus/BRAKER/master/example/genome.fa
wget https://raw.githubusercontent.com/Gaius-Augustus/BRAKER/master/example/proteins.fa
````

Then, we will launch a test job. This test should take less than an hour to finish.

``nano ~/genome_assembly/C3_braker/test/C3_braker_test.sbatch``

This is the test script:

````
#!/usr/bin/bash
#SBATCH --job-name=C3_braker_test
#SBATCH --output=C3_braker_test_output
#SBATCH --error=C3_braker_test_error
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=240G
#SBATCH --partition=standard
#SBATCH --time=6:00:00

###Set dependencies directories and export them
GENEMARK_PATH=/home/igomez/software/gmetp_linux_64/bin
PROTHINT_PATH=/home/igomez/software/gmetp_linux_64/bin/gmes/ProtHint/bin

###Set conda source and activate environment
source ~/anaconda3/etc/profile.d/conda.sh
conda activate braker
export GENEMARK_PATH PROTHINT_PATH

###Set output directories
OUTDIR="/home/igomez/genome_assembly/C3_braker/test"
cd "$OUTDIR"

#Run braker
braker.pl --genome=genome.fa --prot_seq=proteins.fa --threads=32 --species=arabidopsis_thatiana --busco_lineage=brassicales_odb10

conda deactivate

### SOFTWARE VERSIONS
# BRAKER version 3.0.8
# genemark_ETP
# ProtHint v2.6
````

When it finishes, check the error file ``C3_braker_test_error``.
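Reading the error file by eye works for a small test, but a quick filter can be handy. The following is a minimal sketch, assuming that problems show up as lines containing words such as "error" or "not found"; `scan_log` is a hypothetical helper name and the search patterns are a guess, not an official list of braker messages.

```shell
# scan_log FILE: hypothetical helper.
# Prints the lines of FILE that mention an error (case-insensitive),
# prefixed with their line numbers. Returns 0 if any line matched
# and 1 if nothing matched (this is grep's own exit status).
scan_log() {
    grep -n -i -E 'error|not found|not executable' "$1"
}

# Example:
#   scan_log ~/genome_assembly/C3_braker/test/C3_braker_test_error
```

An empty result does not prove that the run succeeded, so the log files still deserve a look.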
Then go to the ``braker`` folder and check the ``braker.log`` file, the ``errors`` folder and all the files within it. Your output folder should look like this:

````
cd ~/genome_assembly/C3_braker/test
du -h *
880K    Augustus
20M     bbc/genemark/brassicales_odb10_hmmsearch_output
21M     bbc/genemark
20M     bbc/augustus/brassicales_odb10_hmmsearch_output
21M     bbc/augustus
21M     bbc/braker/brassicales_odb10_hmmsearch_output
21M     bbc/braker
61M     bbc
4.0K    best_by_compleasm.log
128K    braker.aa
376K    braker.codingseq
396K    braker.gtf
76K     braker.log
24K     errors
2.6M    GeneMark-EP
2.3M    GeneMark-ES
4.0K    genome_header.map
428K    hintsfile.gff
140K    prothint.gff
384K    species/arabidopsis_thatiana
388K    species
4.0K    what-to-cite.txt
````

Pay attention to any error messages. Sometimes braker does not find some scripts or executables. Here is an example:

````
cat ~/genome_assembly/C3_braker/test/C3_braker_test_error
ERROR: augustus binary not found or not executable.
````

Make sure the executable exists in your cluster using ``find``:

````
find /home/igomez -name augustus
/home/igomez/augustus/bin
````

Then, just add the path to the Slurm script: ``export AUGUSTUS_BIN_PATH=~/augustus/bin``

Sometimes installations are incomplete and you will have to go to the GitHub repositories and copy the scripts directly into the ``bin`` or ``scripts`` folder inside the software directories or within the conda environment folder.
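The find-then-export step can also be scripted. This is a sketch under stated assumptions: `find_tool_dir` is a hypothetical helper, it only considers regular files that are executable by their owner, and if several copies exist it simply takes the first one that ``find`` reports.

```shell
# find_tool_dir ROOT NAME: hypothetical helper.
# Searches ROOT for a regular, owner-executable file called NAME and
# prints the directory that contains the first hit.
# Returns 1 if no such file is found.
find_tool_dir() {
    local hit
    hit=$(find "$1" -type f -name "$2" -perm -u+x 2>/dev/null | head -n 1)
    [ -n "$hit" ] || return 1
    dirname "$hit"
}

# Example for the Slurm script:
#   AUGUSTUS_BIN_PATH=$(find_tool_dir "$HOME" augustus) && export AUGUSTUS_BIN_PATH
```

Searching a whole home folder can be slow on a shared file system, so passing a narrower ROOT (for example the conda environment folder) is usually preferable.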