# Genome assembly tutorial
## B2. Scaffolding with 'yaHS' and creation of HiC contact maps with 'juicer'
Now that we have aligned the Hi-C reads to our genome assembly, we can proceed with scaffolding. We use a program called 'yaHS' (Zhou et al., 2022).
When the scaffolding is complete we will create a **contact matrix**, that we will represent in a heatmap using the software 'Juicer' (Durand et al., 2016). In this contact matrix we quantify the number of contacts between the different regions in the genome assembly. This heatmap will be explored and modified in the next section.
### Software requirements
* Conda environments
* **ref_map** (I am including only the packages we need from this environment)
* **samtools** 1.20: Likely, it is already installed (https://anaconda.org/bioconda/samtools)
* **yahs** 1.2a.2 (https://anaconda.org/bioconda/yahs)
* **java (.jar) scripts**: Make sure you have a working Java Runtime Environment (JRE) in your cluster. Run java -version:
`````
java -version
openjdk version "22.0.1" 2024-04-16
OpenJDK Runtime Environment (build 22.0.1+8-16)
OpenJDK 64-Bit Server VM (build 22.0.1+8-16, mixed mode, sharing)
`````
If nothing appears it can mean that it was not installed or that it is not included in your $PATH in ``.bashrc``(it may be a hidden file, un this case run ``ls -a ~``).
If the latter is your case you have to edit ``.bashrc`` to include the path to java, writing the following commands in the file:
````
export JAVA_HOME=~/pa/jdk-22.0.1
export PATH=$JAVA_HOME/bin:$PATH
````
Then run this ``source ~/.bashrc``, to apply the changes.
If you do not java installed, you have to install it, You can do it even without being admin. In this link there are free versions of java for linux, windows and macOS: https://jdk.java.net/22/. Follow the instruction on the page for the installation.
Then, we can access the scripts:
* **juicer_tools.jar** from aidenlab: juicer (https://github.com/aidenlab/juicer). It should be installed in software folder using ``git clone``. Then you will have to download the script of the function. In order for it to be easier to call we will soft-link it to another file. Make sure you activate ref_map conda environment before using the executables, as it contains its dependencies.
````
cd ~/software
git clone https://github.com/theaidenlab/juicer.git
wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar
ln -s juicer_tools.1.9.9_jcuda.0.8.jar juicer_tools.jar
````
* **Juicebox**: It will be installed in your computer, not in the cluster. Download an addecuate version to your Operating System (https://github.com/aidenlab/Juicebox/wiki/Download). If you have Linux in your computer you will need to have Java installed (https://www.java.com/es/download/), and use of of the Java versions.
These script will be run in a folder called ``B2_juicer_before_mc``,as it before the manual curation step with juicebox.
````
mkdir -p ~/genome_assembly/B2_juicer_before_mc
cd ~/genome_assembly/B2_juicer_before_mc
````
### Run the script
First, create the script file:
``nano B2_juicer_bf_mc.sbatch``
Then, we will add the commands whithin it. This script should take 1-2 hours to run per sample. The ``juicer_tools.jar`` commands requires a lot of CPU memmory, at least 100-120 GB.
````
#!/bin/bash
#SBATCH --job-name=B2_juicer_bf_mc
#SBATCH --export=ALL
#SBATCH --output=B2_juicer_bf_mc_output
#SBATCH --error=B2_juicer_bf_mc_error
#SBATCH --mem=120G
#SBATCH --cpus-per-task=16
#SBATCH --time=23:30:00
### Load directories
INDIR=/home/igomez/genome_assembly/B1_hiC_mapping
OUTDIR=/home/igomez/genome_assembly/B2_juicer_before_mc
SOFTWAREDIR=/home/igomez/software
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt"
### Create function to parallelize over samples
juicer_bf_mc() {
SAMPLE=$1
## Establish working directories
mkdir -p $OUTDIR/${SAMPLE}
##Load conda source and activate environments
source ~/anaconda3/etc/profile.d/conda.sh
conda activate ref_map
## Sort .BAM file (the one with hiC mappings) by read name
BAM_UNSORTED=$INDIR/${SAMPLE}/repdir/${SAMPLE}_final.bam
samtools sort $BAM_UNSORTED -o $INDIR/${SAMPLE}/repdir/${SAMPLE}_final_sorted.bam -O BAM -n -@ 16
# Create a variable for the sorted .BAM file
BAM=$INDIR/${SAMPLE}/repdir/${SAMPLE}_final_sorted.bam
echo "BAM sorting done for ${SAMPLE}"
## Procede with scaffolding with yahs. Always use -q 0 to set the minimum
## read mapping quality to 0
REF=/home/igomez/genome_assembly/A2_purge_dups/${SAMPLE}/purged.fa
yahs $REF $BAM -o $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out -e GATC -e GANTC --no-mem-check -q 0
echo "scaffolding with yahs done for ${SAMPLE}"
### Generate HiC contact maps:
## First, convert the yahs output to a juicer input. Here as well, also use -q 0!.
#Check if you have the indexing (.fai) of your genome assembly. This will be used to sort the juicer output.
(juicer pre -q 0 $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out.bin $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out_scaffolds_final.agp $REF.fai | \
LC_ALL=C sort -k2,2d -k6,6d -T $OUTDIR/${SAMPLE} --parallel=16 -S120G | \
awk 'NF' > $OUTDIR/${SAMPLE}/alignments_sorted.txt.part) && \
(mv $OUTDIR/${SAMPLE}/alignments_sorted.txt.part $OUTDIR/${SAMPLE}/alignments_sorted.txt)
echo "transformed yahs to juicer for ${SAMPLE}"
## Then index the final scaffold file and prepare a chromosome (scaffold) size file, where we indicate the size of scaffolds
samtools faidx $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out_scaffolds_final.fa
cut -f1-2 $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out_scaffolds_final.fa.fai > $OUTDIR/${SAMPLE}/${SAMPLE}_scaffolds.chrom.sizes
echo "scaffold size file done for ${SAMPLE}"
## Now, the hiC contact matrix is created, considering the size of the scaffolds
## indicated in a chrom.sizes file
(java -jar -Xmx120G $SOFTWAREDIR/juicer/CPU/common/juicer_tools.jar pre $OUTDIR/${SAMPLE}/alignments_sorted.txt \
$OUTDIR/${SAMPLE}/out.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_scaffolds.chrom.sizes) && \
(mv $OUTDIR/${SAMPLE}/out.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_out.hic)
echo "hiC contact map created for ${SAMPLE}"
## Repeat the creation of hiC contact matrix but generating an adapted output format
## compatible with "Juicebox", specifying the mode (-a)
juicer pre -q 0 -a -o $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out.bin \
$OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out_scaffolds_final.agp ${REF}.fai 2> $OUTDIR/${SAMPLE}/tmp_juicer_pre_JBAT.log
cat $OUTDIR/${SAMPLE}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2- > $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.chrom.sizes
(java -jar -Xmx120G $SOFTWAREDIR/juicer/CPU/common/juicer_tools.jar pre $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.txt $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.chrom.sizes) && (mv $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.hic)
echo "hiC contact map compatible with Juicebox created for ${SAMPLE}"
conda deactivate
}
### Parallelize over individuals
export INDIR OUTDIR SOFTWAREDIR SAMPLE_DIR
export -f juicer_bf_mc
parallel 'juicer_bf_mc {1}' :::: $SAMPLE_DIR
### SOFTWARE VERSIONS
# samtools 1.20
# yahs 1.2a.2
# juicer tools 1.22.01
````
### Output and interpretation
We are going to check the success of this step by looking at the Hi-C contact map using 'Juicebox'.
We have already done the scafolding using yaHS. This software while being very useful can make some mistakes, like assembling together scaffolds that do not belong together. This is why a manual curation step is necessary, where we look at the Hi-C contacts and modify the assembly when it is necessary. This is not an exact science, different experts may make different decissions for the same assembly.
We will look at a heatmap of the Hi-C contact matrix generated by 'Juicer' using the software 'Juicebox'. Over the Hi-C contact map we will overlap the ``assembly`` file, a vector that groups the different scaffolds in 'chromosomes' (they are not necessarily chromosomes in the biological sense, it is just a higher level of the assembly). This software allows you to modify the assembly by grouping scaffolds into 'chromosomes', separating chromosomes, cutting whithin the scaffolds and inverting them.
To learn more about this program and how to use it go to: https://github.com/aidenlab/Juicebox/wiki.
We will show our manual curation of *C.rostrata* genome. Manual curation is not an exact science, different experts may differ on the steps of manual curation.
Firstly, we will download the hiC map (``C_rostrata_JBAT.hic``) and assembly files (``C_rostrata_JBAT.assembly``) in our computer.
````
scp igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_rostrata/C_rostrata_JBAT.hic your/home/folder
scp igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_rostrata/C_rostrata_JBAT.assembly your/home/folder
scp igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_helodes/C_helodes_JBAT.hic your/home/folder
scp igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_helodes/C_helodes_JBAT.assembly your/home/folder
````
Now, open **Juicebox** and import the ``C_rostrata_JBAT.hic`` file. To do that, go to the **File** menu and click on **Open**. Go to the location where you saved your file and select it. This is a heatmap that represents the contacts between each position in the genome with the rest. The redder the point between the two positions the more contacts there are between them. The color scale can be changed in the **Color range** bar. We can zoom in and out in the **Show** window. The squares represent a group of positions that have a lot pf contacts between them, probably because they are located in the same chromosomes.
<img src="https://hackmd.io/_uploads/BkrgB7J6R.png" alt="rostrata_1" width="1000px">
Then, import de ``C_rostrata_JBAT.assembly`` file by clicking on **Import Assembly** on the **Assembly** menu. Then go to **Chromosome**, change selection from *All* to *Assembly* and click on the circling arrows. This layer represents the positions that are supposed to belong to the same chromosome. The blue squares represent superscaffolds, which are more less equivalent to chromosomes, and the green one the scaffolds. In our assembly, the majority of chromosomes are composed by one scaffold. This is a indication that our assembly is quite contiguous already. Only the superscaffold 16 is composed by more than one scaffold.
<img src="https://hackmd.io/_uploads/S1cFUQyTR.png" alt="rostrata_2" width="1000px">
This software only modifies permanently the ``.assembly`` file, not the ``.hic`` map. If we have made some changes, then we close and open Juicebox the next day, these changes are only applied to the squares in the ``.assembly`` layer. Therefore, the Hi-C map and the squares in the assembly will not overlap and continuing with manual curation is impossible. Before starting to make changes in the assembly consider that you cannot close the program during the manual curation process, unless you want to repeat all the changes. Manual curation normally takes less than 1 hour, but sometimes in complicated genomes it can take several days.
It is recommended to look through all the scaffolds before to check if there are clear misassemblies. We did not see any one clear enough. You will see some white areas, they normally represent repetitive regions. Some of this white areas are in the extremes of the chromosomes, they represent telomeric regions . They are really different to assemble, so the fact that in most chromosomes we dont see them, does not mean that they do not exist. Other repetitive regions are not located in the chromosome ends, some of them represent the centromeres. *Carex* species have holocentric chromosomes so there can be multiple parts of the chromsome with centromeric sequences. In some holocentric chromsomes we do not see clear centromeric regions Other white areas are other types pf repetitive regions. Below you see the hi-C map of chromome 10: we can see telomeric regions in the lower border, and multiple repetitive regions (possibly centromeres) throughout the chromosome.
<img src="https://hackmd.io/_uploads/Hy60y4kaA.png" alt="rostrata_3" width="500px">
The scaffolds are ordered by size, in the bottom-left corner we find the smaller ones. They normally correspond to mithocondrial or plastidial DNA, other contaminations and sequences who were not assembled, many of them are repetitive regions as they are the most difficult to assemble. we will focus on this area.
<img src="https://hackmd.io/_uploads/rJgYWEkaC.png" alt="rostrata_4" width="500px">
The sequences that do not have contacts with the rest of the genome will be moved to the end. To select a scaffold (``Shift``+ Click) and them with the righ bottom we open menu we click on ``Move to debris``. We will move to debris scaffolds 40, 41, 43 and 49. The chromosome we move to debris are normally either empty or highly repetitive.
Then, we have chromsome 42 which is neither repetitive nor empty. We can see that it has contacts with scaffold 28, so we will procede to move it to this superscaffold. To move it we select and then we choose where we want to move it. Normally, we will try to move it to the area were it fits best in the superscaffold. However, we will have to cut the scaffold (to cut you have to zoom in on the selected chromsome). Cutting is recomended on white areas, as there are not white areas near the high contact area between the two scaffolds the most conservative thing to do is leave the chromosome 42 next to the 28, and not cut it, as it is shown in the image below. In the reescafolding in the next step this two scaffolds will be reanalyzed and integrated.
<img src="https://hackmd.io/_uploads/H1W7izg60.png" alt="rostrata_6" width="500px">
After eliminating the empty ones and moving scaffold 42, we see how we form a superscaffold, uniting 9 scaffolds (38, 39, 43, 45, 46, 47, 48 and 50).
<img src="https://hackmd.io/_uploads/B1mK0fgTR.png" alt="rostrata_7" width="500px">
There are probably other changes we could make, but these are the main ones. This assembly was really good before manual curation, so manual curation was minimal. However, in other assemblies this step is much longer and major changes are necessary. In more complicated genomes, there can be more disagreements between experts about how far to go and whoch changes to make in manual curation. Manual curation can take almost as much time as you want, especially on little scaffolds. This small changes have little impact on the assembly, but thet still consume time. Therefore, it is important to know when to stop. A good practice in manual curation is to take notes of every change you make (or at least how many changes you do) in order for it to be replicable.
Once we are satisfied with our hi-C map, we will save the changes. We will export a new ``.assembly`` with the changes we made. To do that click on ``Èxport assembly`` on the ``Assembly`` menu and save it in a known location in your computer. The file will be named ``C_rostrata.JBAT.review.assembly``. If you want to check that al the changes open a new session in Juicebox with the old ``.hic`` map and the ``C_rostrata.JBAT.review.assembly`` file. They shold not overlap, reflecting the manual curation changes.
<div style="display: flex; justify-content: space-between; width: 100%; margin: 0 auto;">
<div style="width: 49%; text-align: center;">
<img src="https://hackmd.io/_uploads/Bk7IVSea0.png"
alt="chromosome 28"
width="100%" height="80%" style="display: block;">
<p style="font-size: 0.8em; font-weight: bold;">chromosome 28</p>
</div>
<div style="width: 49%; text-align: center;">
<img src="https://hackmd.io/_uploads/HyqbEBlpR.png"
alt="small scaffolds"
width="100%" height="80%" style="display: block;">
<p style="font-size: 0.8em; font-weight: bold;">small scaffolds</p>
</div>
</div>
## B3. Modifying our assembly according to manual curation
The modification on Juicebox does not modify automatically our files containing the assembly sequences (.fa). So, once we have finished manual curation we have to apply those changes to the FASTA file (``C_rostrata_yahs_out_scaffolds_final.fa``) using ``juicer`` again. Then, we will create another hiC map to check if the changes were correctly applied. Although it is possible to do more manual curation in this second hiC map repaeating the post-curation juicer step, it is not recommended as it is difficult to track the changes, therefore making the assembly unreplicable.
### Software requirements
* **Conda environments**
* ref_map (I am including only the packages we need from this environment)
* yahs 1.2a.2 (https://anaconda.org/bioconda/yahs)
* **juicer_tools.jar**
* **Juicebox**
These script will be run in a folder called ``B3_juicer_after_mc``,as it is after the manual curation step with juicebox.
````
mkdir -p ~/genome_assembly/B3_juicer_after_mc
cd ~/genome_assembly/B3_juicer_after_mc
````
Before running the script we have to upload the modified assembly to the cluster.
````
scp /your/home/folder/C_rostrata_JBAT.review.assembly igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_rostrata
scp /your/home/folder/C_helodes_JBAT.review.assembly igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_helodes
````
### Run script
First, create the script file:
``nano B3_juicer_after_mc.sbatch``
Then, we will add the commands whithin it. This script should take 2-3 hours to run per sample. The ``juicer_tools.jar`` commands requires a lot of CPU memmory, at least 100-120 GB
````
#!/bin/bash
#SBATCH --job-name=B3_af_mc
#SBATCH --export=ALL
#SBATCH --output=B3_af_mc_output
#SBATCH --error=B3_af_mc_error
#SBATCH --mem=120G
#SBATCH --cpus-per-task=16
#SBATCH --time=23:30:00
### Load directories
INDIR=/home/igomez/genome_assembly/B2_juicer_before_mc
OUTDIR=/home/igomez/genome_assembly/B3_juicer_after_mc
SOFTWAREDIR=/home/igomez/software
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt"
### Create function to parallelize over samples
hic_af_mc() {
SAMPLE=$1
## Create output directories
mkdir -p $OUTDIR/${SAMPLE}
cd $OUTDIR/${SAMPLE}
##Load conda source
source ~/anaconda3/etc/profile.d/conda.sh
##Estalish path to reference genome
REF=/home/igomez/genome_assembly/A2_purge_dups/${SAMPLE}/purged.fa
## Modify the file with the genome assembly according to the (agp)
## manually curated .assembly file.
conda activate ref_map
juicer post -o $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT $INDIR/${SAMPLE}/${SAMPLE}_JBAT.review.assembly $INDIR/${SAMPLE}/${SAMPLE}_JBAT.liftover.agp ${REF}
echo "juicer post mc step done for ${SAMPLE}"
## Then, recreate a final hic contact map comptaible with Juicebox using the
reviewed version of
## the assembly file.
juicer pre -q 0 -a -o $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final $INDIR/${SAMPLE}/${SAMPLE}_yahs_out.bin $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.FINAL.agp ${REF}.fai 2> $OUTDIR/${SAMPLE}/tmp_juicer_pre_JBAT.log
cat $OUTDIR/${SAMPLE}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2- > $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.chrom.sizes
(java -jar -Xmx120G $SOFTWAREDIR/juicer/CPU/common/juicer_tools.jar pre $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.txt $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.chrom.sizes) && (mv $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.hic)
conda deactivate
}
### Parallelize over individuals
export INDIR OUTDIR SOFTWAREDIR SAMPLE_DIR
export -f hic_af_mc
parallel 'hic_af_mc {1}' :::: $SAMPLE_DIR
### SOFTWARE VERSIONS
# yahs 1.2a.2
# juicer tools 1.22.01
````
### Output and interpretation
We will check the hi-C contact map and assembly to check if the changes made by manual curation were implemented in the assembly.
First, download the hi-C map (``C_rostrata_JBAT_final.hic``) and the assembly file (``C_rostrata_JBAT_final.assembly``) in your computer.
````
scp igomez@login.spc.cica.es:~/genome_assembly/B3_juicer_after_mc/C_rostrata/C_rostrata_JBAT_final.hic /your/home/folder
scp igomez@login.spc.cica.es:~/genome_assembly/B3_juicer_after_mc/C_rostrata/C_rostrata_JBAT_final.assembly /your/home/folder
scp igomez@login.spc.cica.es:~/genome_assembly/B3_juicer_after_mc/C_helodes/C_helodes_JBAT_final.hic /your/home/folder
scp igomez@login.spc.cica.es:~/genome_assembly/B3_juicer_after_mc/C_helodes/C_helodes_JBAT_final.assembly /your/home/folder
````
Then, open them in Juicebox. We will check if the changes made were implemented. In our case all the changes were made in the smallest scaffols (scaffold_), so we will look at them.
As we see, the changes are implemented so we can continue forward.
## B4. Decontamination
Most organisms do not live in sterile environmentsa, so extracted DNA may be contaminated with foreign DNA from associated microbiota and endosymbionts, like mithocondrial and plastidial DNA. In addition, laboratory reagents and procedures can also introduce foreign DNA. Eliminating those is necessary for a successful assembly.
In model organism, contaminants are well-known. Therefore, contaminant sequences are identified by comparaing them to a contaminant database. However, in *de novo* assembly in non-model organisms contaminants are not known. ``Tiara`` software proposes a solution: using *machine learning* to create a decision tree to classify scaffolds in general categories: **eukarya**, **organelle**, **bacterya**, **archaea** and **unknown** depending on general DNA characteristics like length, GC content, sequecing coverage and percentage of scaffold covered in DNA alignment. Then, a further analysis is done for the scaffols classified as **organelle** to classify them into: **plastid**, **mithocondria** and **unknown**. Then, ``BlobToolKit`` separates the assembly ``.fa``, into different ``.fa`` files according to the ``tiara`` classification.
### Software requirements
* Conda environments
* minimap_samtools
* minimap2 2.28 (https://anaconda.org/bioconda/minimap2)
* samtools 1.20 (https://anaconda.org/bioconda/samtools)
* tiara: Install it in a conda environment with python 3.8
* tiara 1.0.3 (https://anaconda.org/conda-forge/tiara
``mamba create -y -n tiara -c conda-forge python=3.8 tiara``
* btk
* blobtk 0.5.3. (https://anaconda.org/tolkit/blobtk)
* BlobToolKit 4.3.11 (https://blobtoolkit.genomehubs.org/install/). Install it using python as it is indicated in the link.
````
conda create -y -n btk -c conda-forge python=3.9
conda activate btk
pip install "blobtoolkit[full]"
conda deactivate
````
This script will be run in a folder called ``B4_decontaminate``.This script will take several hours to run.
````
mkdir -p ~/genome_assembly/B4_decontaminate
cd ~/genome_assembly/B4_decontaminate
````
### Run the script
First, create the script file:
``nano B4_decontaminate.sbatch``
Then, we will add the commands whithin it. This script should take several hours to run per sample.
````
#!/bin/bash
#SBATCH --job-name=B4_decontaminate
#SBATCH --export=ALL
#SBATCH --output=B4_decontaminate_output
#SBATCH --error=B4_decontaminate_error
#SBATCH --mem=60G
#SBATCH --cpus-per-task=16
#SBATCH --time=23:30:00
##Load directories
INDIR=/home/igomez/genome_assembly/B3_juicer_after_mc
OUTDIR=/home/igomez/genome_assembly/B4_decontaminate
SOFTWAREDIR=/home/igomez/software
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt"
## Run the Blobtools pipeline in parallel for each sample
blobs_func() {
source ~/anaconda3/etc/profile.d/conda.sh
SAMPLE=$1
REF_GEN=$INDIR/${SAMPLE}/${SAMPLE}_JBAT.FINAL.fa
READS=/home/igomez/genome_assembly/raw_data/PacBio/${SAMPLE}/hifi.fastq
##Create output folder for each sample
mkdir -p $OUTDIR/${SAMPLE}
cd $OUTDIR/${SAMPLE}
### Create yaml file for the sample
touch ${SAMPLE}.yaml
echo -e "assembly:\n alias: ${SAMPLE}\n record_type: contig\ntaxon:\n name: ${SAMPLE}" > ${SAMPLE}.yaml
echo "yaml file created"
## Create coverage file (adding directly the bam does not work for some reason)
conda activate minimap_samtools
minimap2 -ax map-pb -t 16 $REF_GEN $READS | samtools sort -@16 -O BAM -o ${SAMPLE}_coverage.bam
samtools coverage ${SAMPLE}_coverage.bam > ${SAMPLE}_coverage.txt
conda deactivate
echo "coverage file created"
## Use tiara to quickly check if a contig is eukaryote, bacteria or organelle
conda activate tiara
tiara -i $REF_GEN -o ${SAMPLE}.tiara -t 16 --pr --tf all -m 1000
conda deactivate
echo "tiara run"
## Adding input files to blobtoolkit dataset
conda activate btk
mkdir -p $OUTDIR/${SAMPLE}/btk
blobtools create --fasta $REF_GEN --meta ${SAMPLE}.yaml --taxdump ../taxdump/ --replace ./btk
blobtools add --text ${SAMPLE}_coverage.txt --text-header --text-cols '#rname=identifier,meandepth=reads_cov' --replace ./btk
blobtools add --key plot.y=reads_cov --replace ./btk
blobtools add --text ${SAMPLE}.tiara --text-delimiter "\t" --text-header --text-cols "sequence_id=identifier,class_fst_stage=tiara" --replace ./btk
blobtools add --key plot.cat=tiara --replace ./btk
echo "bloobtools run"
## View the dataset on blobtool viewer:: call from python3
export RUST_BACKTRACE=full
# Create a summary table on the dataset
blobtools filter --table $OUTDIR/${SAMPLE}/${SAMPLE}_unfiltered_summary.tsv --table-fields gc,length,reads_cov,tiara ./btk
conda deactivate
### The decontaminated assembly is eukarya_${SAMPLE}_JBAT.FINAL.fa
### Keep mitochondrial genome and bacterial genomes separated
}
### Parallelize over individuals
export INDIR OUTDIR SOFTWAREDIR SAMPLE_DIR
export -f blobs_func
parallel 'blobs_func {1}' :::: $SAMPLE_DIR
### SOFTWARE VERSIONS
# samtools 1.20
# minimap2 2.28
# blobtoolkit v4.3.0
# blobtk 0.5.3
# tiara 1.0.3
````
### Output and interpretation
There are multiple tables and files created from this folder. We will focus on the ``C_rostrata_unfiltered_summary.tsv`` table, as it is the easier to interpret. The table has the following columns:
* **identifiers:** The number of the scaffold identified. These are the same names we see in the Juicebox ``.hic`` map, so you can compare the results of the decontamination analysis with the Hi-C map.
* **gc** Proportion of GC in the scaffold. Scaffolds classified as *eukarya* tend to have similar proportions of GC, around
* **length** Length of the scaffold measured in base pairs. Organelle and bacterya scaffolds tend to be short.
* **reads_cov** Coverage of each scaffold in the PacBio sequences.
* **tiara** Classification of the scaffold according to tiara: **eukarya**, **bacterya**, **organelle** and **unknown**. If there are many **unknown** or **bacterya** scaffolds, it means there were high levels of contamination during DNA extraction and/or sequenciation. In the worst of cases, one would have to repeat sequenciation and DNA extraction. If they are mostly in small scaffolds, the contamination is separated from the chromsomes which is a good sign. **Organelle DNA** is normally expected and rarely a problem if it is not assembled with the chromsomes. If you want more information about which type of organelle the scaffolds come from, check the ``C_rostrata.tiara`` file.
````
cat C_rostrata_unfiltered_summary.tsv
index identifiers gc length reads_cov tiara
0 scaffold_1 0.3407 20519163 68.8836 eukarya
1 scaffold_2 0.3536 18889618 73.6254 eukarya
2 scaffold_3 0.3424 16979481 71.2241 eukarya
3 scaffold_4 0.3483 15431245 68.0746 eukarya
4 scaffold_5 0.3378 15144400 68.996 eukarya
5 scaffold_6 0.3495 15086000 73.9218 eukarya
6 scaffold_7 0.3394 14748687 69.1416 eukarya
7 scaffold_8 0.3401 14602151 69.2029 eukarya
8 scaffold_9 0.3393 14208447 68.8214 eukarya
9 scaffold_10 0.3571 13909784 75.0702 eukarya
10 scaffold_11 0.3388 13873836 68.3305 eukarya
11 scaffold_12 0.3388 13702659 70.0808 eukarya
12 scaffold_13 0.3436 11161153 68.8056 eukarya
13 scaffold_14 0.34 10647859 68.486 eukarya
14 scaffold_15 0.3388 10645412 70.5592 eukarya
15 scaffold_16 0.3405 10612874 67.7808 eukarya
16 scaffold_17 0.3401 10446310 68.8682 eukarya
17 scaffold_18 0.3439 9570554 69.5611 eukarya
18 scaffold_19 0.3388 9534215 69.4222 eukarya
19 scaffold_20 0.3383 9347619 70.2492 eukarya
20 scaffold_21 0.3399 8844035 69.4575 eukarya
21 scaffold_22 0.3368 8824609 69.1734 eukarya
22 scaffold_23 0.3391 8716604 69.5628 eukarya
23 scaffold_24 0.3371 8676010 69.4134 eukarya
24 scaffold_25 0.3409 8179128 68.8671 eukarya
25 scaffold_26 0.3399 8062943 69.8144 eukarya
26 scaffold_27 0.34 7823859 69.3813 eukarya
27 scaffold_28 0.3351 7748284 69.2471 eukarya
28 scaffold_29 0.3351 6519814 69.3641 eukarya
29 scaffold_30 0.3385 6290409 69.3558 eukarya
30 scaffold_31 0.334 6094630 73.821 eukarya
31 scaffold_32 0.3375 5587025 70.7281 eukarya
32 scaffold_33 0.3364 5472348 66.2694 eukarya
33 scaffold_34 0.3366 5152814 66.9996 eukarya
34 scaffold_35 0.3331 4910038 68.9357 eukarya
35 scaffold_36 0.3298 74523 439.511 organelle
36 scaffold_37 0.6268 70145 3.12213 bacteria
37 scaffold_38 0.191 65092 2.65747 organelle
38 scaffold_39 0.1966 61652 1.65685 organelle
39 scaffold_40 0.1906 61222 2.33374 organelle
40 scaffold_41 0.3392 59135 85.36 eukarya
41 scaffold_42 0.4701 58810 20.9611 eukarya
42 scaffold_43 0.1916 58246 2.53853 organelle
43 scaffold_44 0.2055 58209 2.32397 unknown
44 scaffold_45 0.4216 58180 322.099 organelle
45 scaffold_46 0.6278 58150 1.60504 bacteria
46 scaffold_47 0.5025 57737 1.87627 eukarya
47 scaffold_48 0.2004 57082 1.41439 unknown
48 scaffold_49 0.5489 55609 395.976 unknown
49 scaffold_50 0.2204 52069 2.42939 unknown
50 scaffold_51 0.3873 51883 49.6402 eukarya
51 scaffold_52 0.2022 50878 1.69987 organelle
52 scaffold_53 0.2016 50437 2.07457 organelle
53 scaffold_54 0.3205 50095 3042.3 organelle
54 scaffold_55 0.2079 48092 2.59794 organelle
55 scaffold_56 0.2031 48059 2.0556 organelle
56 scaffold_57 0.3156 46241 1644.23 organelle
57 scaffold_58 0.1918 45266 2.93282 organelle
58 scaffold_59 0.4193 44433 226.836 organelle
59 scaffold_60 0.1917 44402 2.9094 organelle
60 scaffold_61 0.3278 43641 734.054 organelle
61 scaffold_62 0.206 43028 2.27094 organelle
62 scaffold_63 0.2035 42061 2.66377 organelle
63 scaffold_64 0.3873 41885 2.22347 bacteria
64 scaffold_65 0.3428 40959 4045.39 organelle
65 scaffold_66 0.2017 40118 1.76457 organelle
66 scaffold_67 0.3131 39074 959.684 organelle
67 scaffold_68 0.1807 38619 2.07333 organelle
68 scaffold_69 0.2024 37946 2.48611 organelle
69 scaffold_70 0.1967 37592 1.88889 organelle
70 scaffold_71 0.3353 37243 3222.14 organelle
71 scaffold_72 0.2003 36881 1.83496 organelle
72 scaffold_73 0.1981 36445 2.36373 organelle
73 scaffold_74 0.191 36405 2.11092 organelle
74 scaffold_75 0.4711 36136 1.51146 eukarya
75 scaffold_76 0.2122 35878 2.11698 organelle
76 scaffold_77 0.5212 35258 1.66893 eukarya
77 scaffold_78 0.195 35248 3.15768 organelle
78 scaffold_79 0.4137 35135 540.227 organelle
79 scaffold_80 0.1916 35060 1.61834 organelle
80 scaffold_81 0.5061 34973 2.04218 eukarya
81 scaffold_82 0.4149 34253 328.986 organelle
82 scaffold_83 0.2052 30879 2.12011 unknown
83 scaffold_84 0.3858 28719 5870.67 organelle
84 scaffold_85 0.3888 27991 4190.74 organelle
85 scaffold_86 0.3926 26371 780.334 organelle
86 scaffold_87 0.3471 25543 79.4332 eukarya
87 scaffold_88 0.1918 25157 2.37866 organelle
88 scaffold_89 0.558 25069 925.192 unknown
89 scaffold_90 0.2998 22585 126.543 eukarya
90 scaffold_91 0.3398 22307 3786.25 organelle
91 scaffold_92 0.308 20145 1758.45 organelle
92 scaffold_93 0.5509 251759 129.548 eukarya
93 scaffold_94 0.3836 214937 71.5338 eukarya
94 scaffold_95 0.3843 139078 26.7298 organelle
95 scaffold_96 0.339 101866 2631.44 organelle
96 scaffold_97 0.4181 138986 486.046 organelle
97 scaffold_98 0.4019 1175730 1254.63 organelle
````
Then Blobtools have created ``.fa`` files containing a certain type of DNA according to the ``tiara`` classification: ``mitochondrion_C_rostrata_JBAT.FINAL.fa`` (which contains mithocondrial DNA), ``plastid_C_rostrata_JBAT.FINAL.fa``(which contains plastidial DNA), ``bacteria_C_rostrata_JBAT.FINAL.fa``(which contains bacterial DNA) ,``unknown_C_rostrata_JBAT.FINAL.fa`` (which contains DNA from unknown source) and ``eukarya_C_rostrata_JBAT.FINAL.fa``, which contains eukaryotic DNA, the most important one for us and the one we will use in further analysis.
## B5. Quality Control with telomere analysis of the scaffolded and decontaminated assembly
We will repeat the completedness and contiuguity analysis we performed for step A3. In this step we will evaluate the success in scaffolding, comparing the genome contiguity with A3_QC results. We expect to see a higher contigüity with fewer and longer DNA molecules.Finally, we will evaluate whether the assembly achieves the highest possible level of contiguity—telomere-to-telomere—using a tool that identifies telomeric repeats.
### Software requirements
* Conda environments:
* compleasm:
* compleasm 0.2.6 (https://anaconda.org/bioconda/compleasm)
* gfastats:
* gfastats 1.3.6 (https://anaconda.org/bioconda/gfastats)
* merqury
* merqury 1.3 (https://anaconda.org/bioconda/merqury)
* tidk
* tidk 0.2.41 (https://anaconda.org/bioconda/tidk)
* Perl scripts (I normally save them in a software folder):
* contig_length_fasta.pl (https://github.com/ishengtsai/perlscripts_jit/blob/master/contig_length_fasta.pl)
* R packages
* ggplot2 (https://r-charts.com/es/ggplot2/)
This script will be run in the folder ``B5_QC``. This scripts is quite fast to run, less than an hour.
````
mkdir -p ~/genome_assembly/B5_QC
cd ~/genome_assembly/B5_QC
````
### Run Script
First, create the script file:
nano B4_decontaminate.sbatch
Then, we will add the commands whithin it. This script should take some minutes to run per sample.
````
#!/bin/bash
#SBATCH --job-name=B5_QC
#SBATCH --export=ALL
#SBATCH --output=B5_QC_output
#SBATCH --error=B5_QC_error
#SBATCH --mem=60G
#SBATCH --cpus-per-task=16
#SBATCH --time=6:00:00
#set source for conda
source ~/anaconda3/etc/profile.d/conda.sh
### Set directories
OUTDIR="/home/igomez/genome_assembly/B5_QC"
INDIR="/home/igomez/genome_assembly/B4_decontaminate"
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt"
for SAMPLE in $(cat $SAMPLE_DIR | cut -f1)
do
REF_EUR=$INDIR/${SAMPLE}/eukarya_${SAMPLE}_JBAT.FINAL.fa
mkdir -p $OUTDIR/${SAMPLE}
cd $OUTDIR/${SAMPLE}
## Find telomeres
conda activate tidk
tidk find -c Poales -d $OUTDIR/${SAMPLE} -o ${SAMPLE}_telomeres --log $REF_EUR
conda deactivate
echo "telomeric analysis done for ${SAMPLE}"
## First test: genome completedness with compleasm
conda activate compleasm
compleasm run -a $REF_EUR -o $OUTDIR/${SAMPLE}/${SAMPLE}_compleasm -l poales_odb10 -t 16
conda deactivate
echo "compleasm done for ${SAMPLE}"
## Second test: Genome contiguity with Gfastats: Contigs and scaffolds length
conda activate gfastats
gfastats $REF_EUR -j 16 > $OUTDIR/${SAMPLE}/${SAMPLE}_gfastats.txt
conda deactivate
echo "gfastats done for ${SAMPLE}"
## Third test: General scaffold lenth distribution with perl script.
perl /home/igomez/software/contig_length_fasta.pl $REF_EUR > $OUTDIR/${SAMPLE}/${SAMPLE}_scaffold_length.txt
echo "contig list complete for ${SAMPLE}"
## Forth test: Repeatedness with merqury
#create directory to run merqury
mkdir -p $OUTDIR/${SAMPLE}/merqury
cd $OUTDIR/${SAMPLE}/merqury
#The meryl reads is calculated from the PacBio reads so it is the same for all the merqury analysis
#in the assembly
cp -r ~/genome_assembly/A3_QC/${SAMPLE}/${SAMPLE}.meryl .
# Set path to to the alternate assembly
REF_ALT=~/genome_assembly/A2_purge_dups/${SAMPLE}/alternate/purged_alt.fa
#Run merqury
conda activate merqury
merqury.sh ../${SAMPLE}.meryl $REF_EUR $REF_ALT B3_merqury_${SAMPLE}
conda deactivate
## Fifth test: Find telomeres
conda activate tidk
tidk find -c Poales -d $OUTDIR/${SAMPLE} -o ${SAMPLE}_telomeres --log $REF_EUR
conda deactivate
echo "telomeric analysis done for ${SAMPLE}"
done
### SOFTWARE VERSIONS
# compleasm 0.2.6
# contig_length_fasta.pl https://github.com/ishengtsai/perlscripts_jit/blob/master/contig_length_fasta.pl
# gfastat 1.3.6
# merqury 1.3
# tidk 0.2.41
````
The plot produced by tidk software, is not very clear, so using the ``.tsv`` file created by the ``tidk plot`` we can create a better plot using the ``ggplot2``. Therefore, download the ``.tsv`` in your computer ( ``scp usuario@login.spc.cica.es:~/genome_assembly/B5_QC/C_rostrata/C_rostrata_telomeres.tsv /your/home/folder/gmetp_linux_64.tar.gz usuario@login.spc.cica.es:~/``
) and run the following script on Rstudio:
Run this script in RStudio in your computer.
````
###Create a function to plot telomeres
plot_all_telomere_counts <- function(df, chr){
df_filt <- filter(df, id == chr)
total <- ggplot(data = df_filt, aes(x=window, y=total_count)) +
geom_bar(stat='identity', col='blue') + theme_classic() + ggtitle(label=chr) +
labs(x='Position (Mb)', y='Count of TTAGG per window') +
theme_classic() +
theme(panel.background = element_rect(fill = 'transparent'), #transparent panel bg
plot.background = element_rect(fill = 'transparent', color = NA), #transparent plot bg
panel.grid.major = element_blank(), #remove major gridlines
panel.grid.minor = element_blank(), #remove minor gridlines
legend.background = element_rect(fill = 'transparent'), #transparent legend bg
legend.box.background = element_rect(fill = 'transparent')) #transparent legend panel
return(total)
}
plot_all_chr <- function(df){
total <- ggplot(data = df, aes(x=window, y=total_count)) +
geom_bar(stat='identity', col='blue') + theme_classic() +
labs(x='Position (Mb)', y='Count of TTAGG per window') + facet_wrap(~id) +
ylim(0,200) + # to prevent outliers driving the scale
theme_classic() +
theme(panel.background = element_rect(fill = 'transparent'), #transparent panel bg
plot.background = element_rect(fill = 'transparent', color = NA), #transparent plot bg
panel.grid.major = element_blank(), #remove major gridlines
panel.grid.minor = element_blank(), #remove minor gridlines
legend.background = element_rect(fill = 'transparent'), #transparent legend bg
legend.box.background = element_rect(fill = 'transparent')) #transparent legend panel
return(total)
}
#-------------------------------------------------------------------------------------------
###Then we run the function
library(ggplot2)
setwd("your/path/way/to/tsv_file")
df <- read.csv('${SAMPLE}_.tsv', sep='\t')
df$total_count <- df$forward_repeat_number + df$reverse_repeat_number
df$window <- df$window/1000000 # convert to Mb
X3311_telo_plot <- plot_all_chr(df)
X3311_telo_plot
````
## B6. Eliminate small scaffolds
### Software requirements
* Conda environments
* seqkit
* seqkit 2.8.2 (https://anaconda.org/bioconda/seqkit)
````
mkdir -p B6_clean_assemblies
cd B6_clean_assemblies
````
### Run script
````
#!/bin/bash
#SBATCH --job-name=B6_clean_assemblies
#SBATCH --export=ALL
#SBATCH --output=B6_clean_assemblies_output
#SBATCH --error=B6_clean_assemblies_error
#SBATCH --mem=60G
#SBATCH --cpus-per-task=16
#SBATCH --time=23:30:00
### Set directories
INDIR=~/genome_assembly/B4_decontaminate
OUTDIR=~/genome_assembly/B6_clean_assemblies
SAMPLE_DIR=~/genome_assembly/raw_data/samples.txt
for SAMPLE in $(cat $SAMPLE_DIR | cut -f1)
do
mkdir -p $OUTDIR/${SAMPLE}
cd $OUTDIR/${SAMPLE}
source ~/anaconda3/etc/profile.d/conda.sh
conda activate seqkit
MINLEN=4000000 #maybe modify it according to contig length list, hiC and telomeres.
seqkit seq --threads 16 --min-len $MINLEN $INDIR/${SAMPLE}/eukarya_${SAMPLE}_JBAT.FINAL.fa > $OUTDIR/${SAMPLE}/${SAMPLE}.fasta
conda deactivate
echo "clean assemblies generated without small, umapped scaffolds for ${SAMPLE}"
perl ~/software/contig_length_fasta.pl ${SAMPLE}.fasta > clean_assembly_${SAMPLE}_scaffold_length.txt
conda activate gfastats
gfastats ${SAMPLE}.fasta -j 16 > clean_assembly_${SAMPLE}_gfastats.txt
conda deactivate
echo "trimming test done for ${SAMPLE}"
done
````
### Output
````
````
## C. Annotation and Synteny graphs
## C1. Soft masking
### Software requirements
* Conda environments
* Repeatmasker
* repeatmasker 4.1.5(https://anaconda.org/bioconda/repeatmasker)
### Run script
````
#!/usr/bin/bash
#SBATCH --job-name=C1_repeatmask
#SBATCH --output=C1_repeatmask_output
#SBATCH --error=C1_repeatmask_error
#SBATCH --nodes=1 # Using one node
#SBATCH --cpus-per-task=24
#SBATCH --mem=120G
#SBATCH --partition=standard
#SBATCH --time=24:00:00
### Set directories
INDIR=/home/igomez/genome_assembly/B6_clean_assemblies
OUTDIR=/home/igomez/genome_assembly/C1_repeatmask
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt"
### Run repeatMasker in parallel
rmasker() {
SAMPLE=$1
REF_trim=$INDIR/${SAMPLE}/${SAMPLE}.fasta
#set source for conda and activate environment
source ~/anaconda3/etc/profile.d/conda.sh
conda activate repeatmasker
#establish working directories
mkdir -p $OUTDIR/${SAMPLE}
cd $OUTDIR/${SAMPLE}
RepeatMasker -pa 4 -e rmblast -species ${SAMPLE} -dir $OUTDIR/${SAMPLE} -xsmall -html -gff $REF_trim
# -pa runs in parallel (pa 4 uses 16 cores in total in combination with rmblast)
# -e specifies the search engine (rmblast = ncbi is the one used by EarlGrey)
# -nolow does not mask low complexity regions
# -s slow but more sensitive, -q fast but less sensitive
# IMPORTANT if using RepeatMasker before gene prediction, I should not mask low complexity regions
# because sometimes codons from genes would be masked
# -xsmall for softmasking (repeats in lowercase) rather than changing them to Ns
# -html and -gff create additional output files
conda deactivate
}
### Parallelize over individuals
export INDIR OUTDIR SAMPLE_DIR
export -f rmasker
parallel 'rmasker {1}' :::: $SAMPLE_DIR
### SOFTWARE VERSIONS
# repeatmasker 4.1.5.
````
## C3. Gene annotation with braker
Once we have softmasked our genomes, we can procede with gene annotation. Braker3 allows for fully automated training of the gene prediction tools GeneMark-ES/ET/EP/ETP R14, R15, R17, F1 and AUGUSTUS from RNA-Seq and/or protein homology information, and that integrates the extrinsic evidence from RNA-Seq and protein homology information into the prediction.
In contrast to other available methods that rely on protein homology information, BRAKER2 reaches high gene prediction accuracy even in the absence of the annotation of very closely related species and in the absence of RNA-Seq data.
### Software requirements
* **Conda environments**
* **braker3 3.0.8** (https://anaconda.org/bioconda/braker3)
* **Programs**
* **Genemark-ETP 1.0** (https://github.com/gatech-genemark/GeneMark-ETP). We need the Genemark-ETP 1.0 software (http://topaz.gatech.edu/Genemark/license_download.cgi). You will need to register and then download the software and the key in your computer to then upload it to your cluster. Store the .key in your home folder in the cluster. Then, decompress them and procede with the installation. Make sure Perl and Python3 is installed. Make sure Perl and Python3 is previously installed.
````
scp /your/home/folder/gmetp_linux_64.tar.gz igomez@login.spc.cica.es:/home/igomez/software/gmetp_linux_64.tar.gz
scp /your/home/folder/gm_key_64.gz igomez@login.spc.cica.es:/home/igomez/gm_key_64.gz
ssh igomez@login.spc.cica.es
gunzip ~/gm_key_64.gz
cd software
tar -xvzf gmetp_linux_64.tar.gz
````
* **ProtHint** (https://github.com/gatech-genemark/ProtHint)
````
cd ~/software
git clone https://github.com/gatech-genemark/ProtHint.git
````
Running the braker.pl command requires many dependencies and a very long computational time, much longer than anything we have run in this tutorial (3-4 days at least per genome). As braker installation is tricky due to the many different dependencies accross multiple manually installed softwares, it is strongly encouraged to run a test with a small dataset. We will use the example proposed in the braker documentation (https://github.com/Gaius-Augustus/BRAKER#example-data), where we will annotate 1,000,000 nucleotides of *Arabidopsis tathiana* using protein data from OrthoDB, particurlarly the database for Brassicales ``brassicales_odb10``. The data is available in https://github.com/Gaius-Augustus/BRAKER/tree/master/example.
We will perform the test in a folder called ``test``.
````
mkdir -p ~/genome_assembly/C3_braker/test
cd ~/genome_assembly/C3_braker/test
wget https://github.com/Gaius-Augustus/BRAKER/blob/master/example/genome.fa
wget https://github.com/Gaius-Augustus/BRAKER/blob/master/example/proteins.fa
````
Then, we will launch a test job. This test should take less than an hour to finish.
``nano ~/genome_assembly/C3_braker/test/C3_braker_test.sbatch``
This is the test script:
````
#!/usr/bin/bash
#SBATCH --job-name=C3_braker_test
#SBATCH --output=C3_braker_test_output
#SBATCH --error=C3_braker_test_error
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=240G
#SBATCH --partition=standard
#SBATCH --time=6:00:00
###Set dependencies directories and export them
GENEMARK_PATH=/home/igomez/software/gmetp_linux_64/bin
PROTHINT_PATH=/home/igomez/software/gmetp_linux_64/bin/gmes/ProtHint/bin
###Set conda source and activate environment
source ~/anaconda3/etc/profile.d/conda.sh
conda activate braker
export GENEMARK_PATH PROTHINT_PATH
###Set output directories
OUTDIR="/home/igomez/genome_assembly/C3_braker/test"
cd $OUTDIR
#Run braker
braker.pl --genome=genome.fa --prot_seq=proteins.fa --threads=32 --species=arabidopsis_thatiana --busco_lineage=brassicales_odb10
conda deactivate
### SOFTWARE VERSIONS
# BRAKER version 3.0.8
# genemark_ETP
# ProtHint v2.6
````
When it finishes check the error file ``C3_braker_test_error``. Then go to the ``braker`` folder and check the ``braker.log`` file, the error folder and all the files whithin it. Your output folder should look like this:
````
cd ~/genome_assembly/C3_braker/test
du -h *
880K Augustus
20M bbc/genemark/brassicales_odb10_hmmsearch_output
21M bbc/genemark
20M bbc/augustus/brassicales_odb10_hmmsearch_output
21M bbc/augustus
21M bbc/braker/brassicales_odb10_hmmsearch_output
21M bbc/braker
61M bbc
4.0K best_by_compleasm.log
128K braker.aa
376K braker.codingseq
396K braker.gtf
76K braker.log
24K errors
2.6M GeneMark-EP
2.3M GeneMark-ES
4.0K genome_header.map
428K hintsfile.gff
140K prothint.gff
384K species/arabidopsis_thatiana
388K species
4.0K what-to-cite.txt
````
Pay attention to any error messages sometimes. Sometimes it does not find some scripts or executables. I'll show an example:
````
cat ~/genome_assembly/C3_braker/test/C3_braker_test_error
ERROR: augustus binary not found or not executable.
````
Make sure it exists in your cluster using ``find``:
````
find home/igomez -name augustus
/home/igomez/augustus/bin
````
Then, just add the path to the slurm script:``export AUGUSTUS_BIN_PATH=~/augustus/bin``
Sometimes installations are incomplete and you will have to go to GitHub repositories and copy the scripts directly in ``bin``or ``scripts`` folder inside the software directories or whithin the conda environment folder.
### Run script
````
#!/usr/bin/bash
#SBATCH --job-name=C2_braker
#SBATCH --output=C2_braker_output
#SBATCH --error=C2b_braker.error
#SBATCH --nodes=1
#SBATCH --cpus-per-task=64
#SBATCH --mem=180G
#SBATCH --partition=standard
#SBATCH --time=120:00:00
###Set directories
OUTDIR="/home/igomez/genome_assembly_helodes_thesis/C2_braker"
export GENEMARK_PATH=/home/igomez/software/gmetp_linux_64/bin
export PROTHINT_PATH=/home/igomez/software/gmetp_linux_64/bin/gmes/ProtHint/bin
###Create braker function
REF=~/genome_assembly_helodes_thesis/C1b_repeatmasker/1JMC18/1JMC18.fasta.masked
PROT=~/genome_assembly_helodes_thesis/C2a_braker/poales_odb10/refseq_db.faa
#set directories
mkdir -p $OUTDIR/1JMC18
cd $OUTDIR/1JMC18
#set source for conda and activate environment
source ~/anaconda3/etc/profile.d/conda.sh
conda activate braker
#Run braker
braker.pl --genome=$REF --prot_seq=$PROT --species=1JMC18_3 --threads=64 --busco_lineage=poales_odb10 --gff3
conda deactivate
### SOFTWARE VERSIONS
# BRAKER version 3.0.8
# genemark_etp (install whithin braker env)
# ProtHint v2.6 (install with git clone outside environment)
````