Genome assembly tutorial_2

# Genome assembly tutorial ## B2. Scaffolding with 'yaHS' and creation of HiC contact maps with 'juicer' Now that we have aligned the Hi-C reads to our genome assembly, we can proceed with scaffolding. We use a program called 'yaHS' (Zhou et al., 2022). When the scaffolding is complete we will create a **contact matrix**, that we will represent in a heatmap using the software 'Juicer' (Durand et al., 2016). In this contact matrix we quantify the number of contacts between the different regions in the genome assembly. This heatmap will be explored and modified in the next section. ### Software requirements * Conda environments * **ref_map** (I am including only the packages we need from this environment) * **samtools** 1.20: Likely, it is already installed (https://anaconda.org/bioconda/samtools) * **yahs** 1.2a.2 (https://anaconda.org/bioconda/yahs) * **java (.jar) scripts**: Make sure you have a working Java Runtime Environment (JRE) in your cluster. Run java -version: ````` java -version openjdk version "22.0.1" 2024-04-16 OpenJDK Runtime Environment (build 22.0.1+8-16) OpenJDK 64-Bit Server VM (build 22.0.1+8-16, mixed mode, sharing) ````` If nothing appears it can mean that it was not installed or that it is not included in your $PATH in ``.bashrc``(it may be a hidden file, un this case run ``ls -a ~``). If the latter is your case you have to edit ``.bashrc`` to include the path to java, writing the following commands in the file: ```` export JAVA_HOME=~/pa/jdk-22.0.1 export PATH=$JAVA_HOME/bin:$PATH ```` Then run this ``source ~/.bashrc``, to apply the changes. If you do not java installed, you have to install it, You can do it even without being admin. In this link there are free versions of java for linux, windows and macOS: https://jdk.java.net/22/. Follow the instruction on the page for the installation. Then, we can access the scripts: * **juicer_tools.jar** from aidenlab: juicer (https://github.com/aidenlab/juicer). It should be installed in software folder using ``git clone``. Then you will have to download the script of the function. In order for it to be easier to call we will soft-link it to another file. Make sure you activate ref_map conda environment before using the executables, as it contains its dependencies. ```` cd ~/software git clone https://github.com/theaidenlab/juicer.git wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar ln -s juicer_tools.1.9.9_jcuda.0.8.jar juicer_tools.jar ```` * **Juicebox**: It will be installed in your computer, not in the cluster. Download an addecuate version to your Operating System (https://github.com/aidenlab/Juicebox/wiki/Download). If you have Linux in your computer you will need to have Java installed (https://www.java.com/es/download/), and use of of the Java versions. These script will be run in a folder called ``B2_juicer_before_mc``,as it before the manual curation step with juicebox. ```` mkdir -p ~/genome_assembly/B2_juicer_before_mc cd ~/genome_assembly/B2_juicer_before_mc ```` ### Run the script First, create the script file: ``nano B2_juicer_bf_mc.sbatch`` Then, we will add the commands whithin it. This script should take 1-2 hours to run per sample. The ``juicer_tools.jar`` commands requires a lot of CPU memmory, at least 100-120 GB. ```` #!/bin/bash #SBATCH --job-name=B2_juicer_bf_mc #SBATCH --export=ALL #SBATCH --output=B2_juicer_bf_mc_output #SBATCH --error=B2_juicer_bf_mc_error #SBATCH --mem=120G #SBATCH --cpus-per-task=16 #SBATCH --time=23:30:00 ### Load directories INDIR=/home/igomez/genome_assembly/B1_hiC_mapping OUTDIR=/home/igomez/genome_assembly/B2_juicer_before_mc SOFTWAREDIR=/home/igomez/software SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt" ### Create function to parallelize over samples juicer_bf_mc() { SAMPLE=$1 ## Establish working directories mkdir -p $OUTDIR/${SAMPLE} ##Load conda source and activate environments source ~/anaconda3/etc/profile.d/conda.sh conda activate ref_map ## Sort .BAM file (the one with hiC mappings) by read name BAM_UNSORTED=$INDIR/${SAMPLE}/repdir/${SAMPLE}_final.bam samtools sort $BAM_UNSORTED -o $INDIR/${SAMPLE}/repdir/${SAMPLE}_final_sorted.bam -O BAM -n -@ 16 # Create a variable for the sorted .BAM file BAM=$INDIR/${SAMPLE}/repdir/${SAMPLE}_final_sorted.bam echo "BAM sorting done for ${SAMPLE}" ## Procede with scaffolding with yahs. Always use -q 0 to set the minimum ## read mapping quality to 0 REF=/home/igomez/genome_assembly/A2_purge_dups/${SAMPLE}/purged.fa yahs $REF $BAM -o $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out -e GATC -e GANTC --no-mem-check -q 0 echo "scaffolding with yahs done for ${SAMPLE}" ### Generate HiC contact maps: ## First, convert the yahs output to a juicer input. Here as well, also use -q 0!. #Check if you have the indexing (.fai) of your genome assembly. This will be used to sort the juicer output. (juicer pre -q 0 $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out.bin $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out_scaffolds_final.agp $REF.fai | \ LC_ALL=C sort -k2,2d -k6,6d -T $OUTDIR/${SAMPLE} --parallel=16 -S120G | \ awk 'NF' > $OUTDIR/${SAMPLE}/alignments_sorted.txt.part) && \ (mv $OUTDIR/${SAMPLE}/alignments_sorted.txt.part $OUTDIR/${SAMPLE}/alignments_sorted.txt) echo "transformed yahs to juicer for ${SAMPLE}" ## Then index the final scaffold file and prepare a chromosome (scaffold) size file, where we indicate the size of scaffolds samtools faidx $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out_scaffolds_final.fa cut -f1-2 $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out_scaffolds_final.fa.fai > $OUTDIR/${SAMPLE}/${SAMPLE}_scaffolds.chrom.sizes echo "scaffold size file done for ${SAMPLE}" ## Now, the hiC contact matrix is created, considering the size of the scaffolds ## indicated in a chrom.sizes file (java -jar -Xmx120G $SOFTWAREDIR/juicer/CPU/common/juicer_tools.jar pre $OUTDIR/${SAMPLE}/alignments_sorted.txt \ $OUTDIR/${SAMPLE}/out.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_scaffolds.chrom.sizes) && \ (mv $OUTDIR/${SAMPLE}/out.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_out.hic) echo "hiC contact map created for ${SAMPLE}" ## Repeat the creation of hiC contact matrix but generating an adapted output format ## compatible with "Juicebox", specifying the mode (-a) juicer pre -q 0 -a -o $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out.bin \ $OUTDIR/${SAMPLE}/${SAMPLE}_yahs_out_scaffolds_final.agp ${REF}.fai 2> $OUTDIR/${SAMPLE}/tmp_juicer_pre_JBAT.log cat $OUTDIR/${SAMPLE}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2- > $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.chrom.sizes (java -jar -Xmx120G $SOFTWAREDIR/juicer/CPU/common/juicer_tools.jar pre $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.txt $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.chrom.sizes) && (mv $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.hic) echo "hiC contact map compatible with Juicebox created for ${SAMPLE}" conda deactivate } ### Parallelize over individuals export INDIR OUTDIR SOFTWAREDIR SAMPLE_DIR export -f juicer_bf_mc parallel 'juicer_bf_mc {1}' :::: $SAMPLE_DIR ### SOFTWARE VERSIONS # samtools 1.20 # yahs 1.2a.2 # juicer tools 1.22.01 ```` ### Output and interpretation We are going to check the success of this step by looking at the Hi-C contact map using 'Juicebox'. We have already done the scafolding using yaHS. This software while being very useful can make some mistakes, like assembling together scaffolds that do not belong together. This is why a manual curation step is necessary, where we look at the Hi-C contacts and modify the assembly when it is necessary. This is not an exact science, different experts may make different decissions for the same assembly. We will look at a heatmap of the Hi-C contact matrix generated by 'Juicer' using the software 'Juicebox'. Over the Hi-C contact map we will overlap the ``assembly`` file, a vector that groups the different scaffolds in 'chromosomes' (they are not necessarily chromosomes in the biological sense, it is just a higher level of the assembly). This software allows you to modify the assembly by grouping scaffolds into 'chromosomes', separating chromosomes, cutting whithin the scaffolds and inverting them. To learn more about this program and how to use it go to: https://github.com/aidenlab/Juicebox/wiki. We will show our manual curation of *C.rostrata* genome. Manual curation is not an exact science, different experts may differ on the steps of manual curation. Firstly, we will download the hiC map (``C_rostrata_JBAT.hic``) and assembly files (``C_rostrata_JBAT.assembly``) in our computer. ```` scp igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_rostrata/C_rostrata_JBAT.hic your/home/folder scp igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_rostrata/C_rostrata_JBAT.assembly your/home/folder scp igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_helodes/C_helodes_JBAT.hic your/home/folder scp igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_helodes/C_helodes_JBAT.assembly your/home/folder ```` Now, open **Juicebox** and import the ``C_rostrata_JBAT.hic`` file. To do that, go to the **File** menu and click on **Open**. Go to the location where you saved your file and select it. This is a heatmap that represents the contacts between each position in the genome with the rest. The redder the point between the two positions the more contacts there are between them. The color scale can be changed in the **Color range** bar. We can zoom in and out in the **Show** window. The squares represent a group of positions that have a lot pf contacts between them, probably because they are located in the same chromosomes. <img src="https://hackmd.io/_uploads/BkrgB7J6R.png" alt="rostrata_1" width="1000px"> Then, import de ``C_rostrata_JBAT.assembly`` file by clicking on **Import Assembly** on the **Assembly** menu. Then go to **Chromosome**, change selection from *All* to *Assembly* and click on the circling arrows. This layer represents the positions that are supposed to belong to the same chromosome. The blue squares represent superscaffolds, which are more less equivalent to chromosomes, and the green one the scaffolds. In our assembly, the majority of chromosomes are composed by one scaffold. This is a indication that our assembly is quite contiguous already. Only the superscaffold 16 is composed by more than one scaffold. <img src="https://hackmd.io/_uploads/S1cFUQyTR.png" alt="rostrata_2" width="1000px"> This software only modifies permanently the ``.assembly`` file, not the ``.hic`` map. If we have made some changes, then we close and open Juicebox the next day, these changes are only applied to the squares in the ``.assembly`` layer. Therefore, the Hi-C map and the squares in the assembly will not overlap and continuing with manual curation is impossible. Before starting to make changes in the assembly consider that you cannot close the program during the manual curation process, unless you want to repeat all the changes. Manual curation normally takes less than 1 hour, but sometimes in complicated genomes it can take several days. It is recommended to look through all the scaffolds before to check if there are clear misassemblies. We did not see any one clear enough. You will see some white areas, they normally represent repetitive regions. Some of this white areas are in the extremes of the chromosomes, they represent telomeric regions . They are really different to assemble, so the fact that in most chromosomes we dont see them, does not mean that they do not exist. Other repetitive regions are not located in the chromosome ends, some of them represent the centromeres. *Carex* species have holocentric chromosomes so there can be multiple parts of the chromsome with centromeric sequences. In some holocentric chromsomes we do not see clear centromeric regions Other white areas are other types pf repetitive regions. Below you see the hi-C map of chromome 10: we can see telomeric regions in the lower border, and multiple repetitive regions (possibly centromeres) throughout the chromosome. <img src="https://hackmd.io/_uploads/Hy60y4kaA.png" alt="rostrata_3" width="500px"> The scaffolds are ordered by size, in the bottom-left corner we find the smaller ones. They normally correspond to mithocondrial or plastidial DNA, other contaminations and sequences who were not assembled, many of them are repetitive regions as they are the most difficult to assemble. we will focus on this area. <img src="https://hackmd.io/_uploads/rJgYWEkaC.png" alt="rostrata_4" width="500px"> The sequences that do not have contacts with the rest of the genome will be moved to the end. To select a scaffold (``Shift``+ Click) and them with the righ bottom we open menu we click on ``Move to debris``. We will move to debris scaffolds 40, 41, 43 and 49. The chromosome we move to debris are normally either empty or highly repetitive. Then, we have chromsome 42 which is neither repetitive nor empty. We can see that it has contacts with scaffold 28, so we will procede to move it to this superscaffold. To move it we select and then we choose where we want to move it. Normally, we will try to move it to the area were it fits best in the superscaffold. However, we will have to cut the scaffold (to cut you have to zoom in on the selected chromsome). Cutting is recomended on white areas, as there are not white areas near the high contact area between the two scaffolds the most conservative thing to do is leave the chromosome 42 next to the 28, and not cut it, as it is shown in the image below. In the reescafolding in the next step this two scaffolds will be reanalyzed and integrated. <img src="https://hackmd.io/_uploads/H1W7izg60.png" alt="rostrata_6" width="500px"> After eliminating the empty ones and moving scaffold 42, we see how we form a superscaffold, uniting 9 scaffolds (38, 39, 43, 45, 46, 47, 48 and 50). <img src="https://hackmd.io/_uploads/B1mK0fgTR.png" alt="rostrata_7" width="500px"> There are probably other changes we could make, but these are the main ones. This assembly was really good before manual curation, so manual curation was minimal. However, in other assemblies this step is much longer and major changes are necessary. In more complicated genomes, there can be more disagreements between experts about how far to go and whoch changes to make in manual curation. Manual curation can take almost as much time as you want, especially on little scaffolds. This small changes have little impact on the assembly, but thet still consume time. Therefore, it is important to know when to stop. A good practice in manual curation is to take notes of every change you make (or at least how many changes you do) in order for it to be replicable. Once we are satisfied with our hi-C map, we will save the changes. We will export a new ``.assembly`` with the changes we made. To do that click on ``Èxport assembly`` on the ``Assembly`` menu and save it in a known location in your computer. The file will be named ``C_rostrata.JBAT.review.assembly``. If you want to check that al the changes open a new session in Juicebox with the old ``.hic`` map and the ``C_rostrata.JBAT.review.assembly`` file. They shold not overlap, reflecting the manual curation changes. <div style="display: flex; justify-content: space-between; width: 100%; margin: 0 auto;"> <div style="width: 49%; text-align: center;"> <img src="https://hackmd.io/_uploads/Bk7IVSea0.png" alt="chromosome 28" width="100%" height="80%" style="display: block;"> <p style="font-size: 0.8em; font-weight: bold;">chromosome 28</p> </div> <div style="width: 49%; text-align: center;"> <img src="https://hackmd.io/_uploads/HyqbEBlpR.png" alt="small scaffolds" width="100%" height="80%" style="display: block;"> <p style="font-size: 0.8em; font-weight: bold;">small scaffolds</p> </div> </div> ## B3. Modifying our assembly according to manual curation The modification on Juicebox does not modify automatically our files containing the assembly sequences (.fa). So, once we have finished manual curation we have to apply those changes to the FASTA file (``C_rostrata_yahs_out_scaffolds_final.fa``) using ``juicer`` again. Then, we will create another hiC map to check if the changes were correctly applied. Although it is possible to do more manual curation in this second hiC map repaeating the post-curation juicer step, it is not recommended as it is difficult to track the changes, therefore making the assembly unreplicable. ### Software requirements * **Conda environments** * ref_map (I am including only the packages we need from this environment) * yahs 1.2a.2 (https://anaconda.org/bioconda/yahs) * **juicer_tools.jar** * **Juicebox** These script will be run in a folder called ``B3_juicer_after_mc``,as it is after the manual curation step with juicebox. ```` mkdir -p ~/genome_assembly/B3_juicer_after_mc cd ~/genome_assembly/B3_juicer_after_mc ```` Before running the script we have to upload the modified assembly to the cluster. ```` scp /your/home/folder/C_rostrata_JBAT.review.assembly igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_rostrata scp /your/home/folder/C_helodes_JBAT.review.assembly igomez@login.spc.cica.es:~/genome_assembly/B2_juicer_before_mc/C_helodes ```` ### Run script First, create the script file: ``nano B3_juicer_after_mc.sbatch`` Then, we will add the commands whithin it. This script should take 2-3 hours to run per sample. The ``juicer_tools.jar`` commands requires a lot of CPU memmory, at least 100-120 GB ```` #!/bin/bash #SBATCH --job-name=B3_af_mc #SBATCH --export=ALL #SBATCH --output=B3_af_mc_output #SBATCH --error=B3_af_mc_error #SBATCH --mem=120G #SBATCH --cpus-per-task=16 #SBATCH --time=23:30:00 ### Load directories INDIR=/home/igomez/genome_assembly/B2_juicer_before_mc OUTDIR=/home/igomez/genome_assembly/B3_juicer_after_mc SOFTWAREDIR=/home/igomez/software SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt" ### Create function to parallelize over samples hic_af_mc() { SAMPLE=$1 ## Create output directories mkdir -p $OUTDIR/${SAMPLE} cd $OUTDIR/${SAMPLE} ##Load conda source source ~/anaconda3/etc/profile.d/conda.sh ##Estalish path to reference genome REF=/home/igomez/genome_assembly/A2_purge_dups/${SAMPLE}/purged.fa ## Modify the file with the genome assembly according to the (agp) ## manually curated .assembly file. conda activate ref_map juicer post -o $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT $INDIR/${SAMPLE}/${SAMPLE}_JBAT.review.assembly $INDIR/${SAMPLE}/${SAMPLE}_JBAT.liftover.agp ${REF} echo "juicer post mc step done for ${SAMPLE}" ## Then, recreate a final hic contact map comptaible with Juicebox using the reviewed version of ## the assembly file. juicer pre -q 0 -a -o $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final $INDIR/${SAMPLE}/${SAMPLE}_yahs_out.bin $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT.FINAL.agp ${REF}.fai 2> $OUTDIR/${SAMPLE}/tmp_juicer_pre_JBAT.log cat $OUTDIR/${SAMPLE}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2- > $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.chrom.sizes (java -jar -Xmx120G $SOFTWAREDIR/juicer/CPU/common/juicer_tools.jar pre $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.txt $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.chrom.sizes) && (mv $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.hic.part $OUTDIR/${SAMPLE}/${SAMPLE}_JBAT_final.hic) conda deactivate } ### Parallelize over individuals export INDIR OUTDIR SOFTWAREDIR SAMPLE_DIR export -f hic_af_mc parallel 'hic_af_mc {1}' :::: $SAMPLE_DIR ### SOFTWARE VERSIONS # yahs 1.2a.2 # juicer tools 1.22.01 ```` ### Output and interpretation We will check the hi-C contact map and assembly to check if the changes made by manual curation were implemented in the assembly. First, download the hi-C map (``C_rostrata_JBAT_final.hic``) and the assembly file (``C_rostrata_JBAT_final.assembly``) in your computer. ```` scp igomez@login.spc.cica.es:~/genome_assembly/B3_juicer_after_mc/C_rostrata/C_rostrata_JBAT_final.hic /your/home/folder scp igomez@login.spc.cica.es:~/genome_assembly/B3_juicer_after_mc/C_rostrata/C_rostrata_JBAT_final.assembly /your/home/folder scp igomez@login.spc.cica.es:~/genome_assembly/B3_juicer_after_mc/C_helodes/C_helodes_JBAT_final.hic /your/home/folder scp igomez@login.spc.cica.es:~/genome_assembly/B3_juicer_after_mc/C_helodes/C_helodes_JBAT_final.assembly /your/home/folder ```` Then, open them in Juicebox. We will check if the changes made were implemented. In our case all the changes were made in the smallest scaffols (scaffold_), so we will look at them. As we see, the changes are implemented so we can continue forward. ## B4. Decontamination Most organisms do not live in sterile environmentsa, so extracted DNA may be contaminated with foreign DNA from associated microbiota and endosymbionts, like mithocondrial and plastidial DNA. In addition, laboratory reagents and procedures can also introduce foreign DNA. Eliminating those is necessary for a successful assembly. In model organism, contaminants are well-known. Therefore, contaminant sequences are identified by comparaing them to a contaminant database. However, in *de novo* assembly in non-model organisms contaminants are not known. ``Tiara`` software proposes a solution: using *machine learning* to create a decision tree to classify scaffolds in general categories: **eukarya**, **organelle**, **bacterya**, **archaea** and **unknown** depending on general DNA characteristics like length, GC content, sequecing coverage and percentage of scaffold covered in DNA alignment. Then, a further analysis is done for the scaffols classified as **organelle** to classify them into: **plastid**, **mithocondria** and **unknown**. Then, ``BlobToolKit`` separates the assembly ``.fa``, into different ``.fa`` files according to the ``tiara`` classification. ### Software requirements * Conda environments * minimap_samtools * minimap2 2.28 (https://anaconda.org/bioconda/minimap2) * samtools 1.20 (https://anaconda.org/bioconda/samtools) * tiara: Install it in a conda environment with python 3.8 * tiara 1.0.3 (https://anaconda.org/conda-forge/tiara ``mamba create -y -n tiara -c conda-forge python=3.8 tiara`` * btk * blobtk 0.5.3. (https://anaconda.org/tolkit/blobtk) * BlobToolKit 4.3.11 (https://blobtoolkit.genomehubs.org/install/). Install it using python as it is indicated in the link. ```` conda create -y -n btk -c conda-forge python=3.9 conda activate btk pip install "blobtoolkit[full]" conda deactivate ```` This script will be run in a folder called ``B4_decontaminate``.This script will take several hours to run. ```` mkdir -p ~/genome_assembly/B4_decontaminate cd ~/genome_assembly/B4_decontaminate ```` ### Run the script First, create the script file: ``nano B4_decontaminate.sbatch`` Then, we will add the commands whithin it. This script should take several hours to run per sample. ```` #!/bin/bash #SBATCH --job-name=B4_decontaminate #SBATCH --export=ALL #SBATCH --output=B4_decontaminate_output #SBATCH --error=B4_decontaminate_error #SBATCH --mem=60G #SBATCH --cpus-per-task=16 #SBATCH --time=23:30:00 ##Load directories INDIR=/home/igomez/genome_assembly/B3_juicer_after_mc OUTDIR=/home/igomez/genome_assembly/B4_decontaminate SOFTWAREDIR=/home/igomez/software SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt" ## Run the Blobtools pipeline in parallel for each sample blobs_func() { source ~/anaconda3/etc/profile.d/conda.sh SAMPLE=$1 REF_GEN=$INDIR/${SAMPLE}/${SAMPLE}_JBAT.FINAL.fa READS=/home/igomez/genome_assembly/raw_data/PacBio/${SAMPLE}/hifi.fastq ##Create output folder for each sample mkdir -p $OUTDIR/${SAMPLE} cd $OUTDIR/${SAMPLE} ### Create yaml file for the sample touch ${SAMPLE}.yaml echo -e "assembly:\n alias: ${SAMPLE}\n record_type: contig\ntaxon:\n name: ${SAMPLE}" > ${SAMPLE}.yaml echo "yaml file created" ## Create coverage file (adding directly the bam does not work for some reason) conda activate minimap_samtools minimap2 -ax map-pb -t 16 $REF_GEN $READS | samtools sort -@16 -O BAM -o ${SAMPLE}_coverage.bam samtools coverage ${SAMPLE}_coverage.bam > ${SAMPLE}_coverage.txt conda deactivate echo "coverage file created" ## Use tiara to quickly check if a contig is eukaryote, bacteria or organelle conda activate tiara tiara -i $REF_GEN -o ${SAMPLE}.tiara -t 16 --pr --tf all -m 1000 conda deactivate echo "tiara run" ## Adding input files to blobtoolkit dataset conda activate btk mkdir -p $OUTDIR/${SAMPLE}/btk blobtools create --fasta $REF_GEN --meta ${SAMPLE}.yaml --taxdump ../taxdump/ --replace ./btk blobtools add --text ${SAMPLE}_coverage.txt --text-header --text-cols '#rname=identifier,meandepth=reads_cov' --replace ./btk blobtools add --key plot.y=reads_cov --replace ./btk blobtools add --text ${SAMPLE}.tiara --text-delimiter "\t" --text-header --text-cols "sequence_id=identifier,class_fst_stage=tiara" --replace ./btk blobtools add --key plot.cat=tiara --replace ./btk echo "bloobtools run" ## View the dataset on blobtool viewer:: call from python3 export RUST_BACKTRACE=full # Create a summary table on the dataset blobtools filter --table $OUTDIR/${SAMPLE}/${SAMPLE}_unfiltered_summary.tsv --table-fields gc,length,reads_cov,tiara ./btk conda deactivate ### The decontaminated assembly is eukarya_${SAMPLE}_JBAT.FINAL.fa ### Keep mitochondrial genome and bacterial genomes separated } ### Parallelize over individuals export INDIR OUTDIR SOFTWAREDIR SAMPLE_DIR export -f blobs_func parallel 'blobs_func {1}' :::: $SAMPLE_DIR ### SOFTWARE VERSIONS # samtools 1.20 # minimap2 2.28 # blobtoolkit v4.3.0 # blobtk 0.5.3 # tiara 1.0.3 ```` ### Output and interpretation There are multiple tables and files created from this folder. We will focus on the ``C_rostrata_unfiltered_summary.tsv`` table, as it is the easier to interpret. The table has the following columns: * **identifiers:** The number of the scaffold identified. These are the same names we see in the Juicebox ``.hic`` map, so you can compare the results of the decontamination analysis with the Hi-C map. * **gc** Proportion of GC in the scaffold. Scaffolds classified as *eukarya* tend to have similar proportions of GC, around * **length** Length of the scaffold measured in base pairs. Organelle and bacterya scaffolds tend to be short. * **reads_cov** Coverage of each scaffold in the PacBio sequences. * **tiara** Classification of the scaffold according to tiara: **eukarya**, **bacterya**, **organelle** and **unknown**. If there are many **unknown** or **bacterya** scaffolds, it means there were high levels of contamination during DNA extraction and/or sequenciation. In the worst of cases, one would have to repeat sequenciation and DNA extraction. If they are mostly in small scaffolds, the contamination is separated from the chromsomes which is a good sign. **Organelle DNA** is normally expected and rarely a problem if it is not assembled with the chromsomes. If you want more information about which type of organelle the scaffolds come from, check the ``C_rostrata.tiara`` file. ```` cat C_rostrata_unfiltered_summary.tsv index identifiers gc length reads_cov tiara 0 scaffold_1 0.3407 20519163 68.8836 eukarya 1 scaffold_2 0.3536 18889618 73.6254 eukarya 2 scaffold_3 0.3424 16979481 71.2241 eukarya 3 scaffold_4 0.3483 15431245 68.0746 eukarya 4 scaffold_5 0.3378 15144400 68.996 eukarya 5 scaffold_6 0.3495 15086000 73.9218 eukarya 6 scaffold_7 0.3394 14748687 69.1416 eukarya 7 scaffold_8 0.3401 14602151 69.2029 eukarya 8 scaffold_9 0.3393 14208447 68.8214 eukarya 9 scaffold_10 0.3571 13909784 75.0702 eukarya 10 scaffold_11 0.3388 13873836 68.3305 eukarya 11 scaffold_12 0.3388 13702659 70.0808 eukarya 12 scaffold_13 0.3436 11161153 68.8056 eukarya 13 scaffold_14 0.34 10647859 68.486 eukarya 14 scaffold_15 0.3388 10645412 70.5592 eukarya 15 scaffold_16 0.3405 10612874 67.7808 eukarya 16 scaffold_17 0.3401 10446310 68.8682 eukarya 17 scaffold_18 0.3439 9570554 69.5611 eukarya 18 scaffold_19 0.3388 9534215 69.4222 eukarya 19 scaffold_20 0.3383 9347619 70.2492 eukarya 20 scaffold_21 0.3399 8844035 69.4575 eukarya 21 scaffold_22 0.3368 8824609 69.1734 eukarya 22 scaffold_23 0.3391 8716604 69.5628 eukarya 23 scaffold_24 0.3371 8676010 69.4134 eukarya 24 scaffold_25 0.3409 8179128 68.8671 eukarya 25 scaffold_26 0.3399 8062943 69.8144 eukarya 26 scaffold_27 0.34 7823859 69.3813 eukarya 27 scaffold_28 0.3351 7748284 69.2471 eukarya 28 scaffold_29 0.3351 6519814 69.3641 eukarya 29 scaffold_30 0.3385 6290409 69.3558 eukarya 30 scaffold_31 0.334 6094630 73.821 eukarya 31 scaffold_32 0.3375 5587025 70.7281 eukarya 32 scaffold_33 0.3364 5472348 66.2694 eukarya 33 scaffold_34 0.3366 5152814 66.9996 eukarya 34 scaffold_35 0.3331 4910038 68.9357 eukarya 35 scaffold_36 0.3298 74523 439.511 organelle 36 scaffold_37 0.6268 70145 3.12213 bacteria 37 scaffold_38 0.191 65092 2.65747 organelle 38 scaffold_39 0.1966 61652 1.65685 organelle 39 scaffold_40 0.1906 61222 2.33374 organelle 40 scaffold_41 0.3392 59135 85.36 eukarya 41 scaffold_42 0.4701 58810 20.9611 eukarya 42 scaffold_43 0.1916 58246 2.53853 organelle 43 scaffold_44 0.2055 58209 2.32397 unknown 44 scaffold_45 0.4216 58180 322.099 organelle 45 scaffold_46 0.6278 58150 1.60504 bacteria 46 scaffold_47 0.5025 57737 1.87627 eukarya 47 scaffold_48 0.2004 57082 1.41439 unknown 48 scaffold_49 0.5489 55609 395.976 unknown 49 scaffold_50 0.2204 52069 2.42939 unknown 50 scaffold_51 0.3873 51883 49.6402 eukarya 51 scaffold_52 0.2022 50878 1.69987 organelle 52 scaffold_53 0.2016 50437 2.07457 organelle 53 scaffold_54 0.3205 50095 3042.3 organelle 54 scaffold_55 0.2079 48092 2.59794 organelle 55 scaffold_56 0.2031 48059 2.0556 organelle 56 scaffold_57 0.3156 46241 1644.23 organelle 57 scaffold_58 0.1918 45266 2.93282 organelle 58 scaffold_59 0.4193 44433 226.836 organelle 59 scaffold_60 0.1917 44402 2.9094 organelle 60 scaffold_61 0.3278 43641 734.054 organelle 61 scaffold_62 0.206 43028 2.27094 organelle 62 scaffold_63 0.2035 42061 2.66377 organelle 63 scaffold_64 0.3873 41885 2.22347 bacteria 64 scaffold_65 0.3428 40959 4045.39 organelle 65 scaffold_66 0.2017 40118 1.76457 organelle 66 scaffold_67 0.3131 39074 959.684 organelle 67 scaffold_68 0.1807 38619 2.07333 organelle 68 scaffold_69 0.2024 37946 2.48611 organelle 69 scaffold_70 0.1967 37592 1.88889 organelle 70 scaffold_71 0.3353 37243 3222.14 organelle 71 scaffold_72 0.2003 36881 1.83496 organelle 72 scaffold_73 0.1981 36445 2.36373 organelle 73 scaffold_74 0.191 36405 2.11092 organelle 74 scaffold_75 0.4711 36136 1.51146 eukarya 75 scaffold_76 0.2122 35878 2.11698 organelle 76 scaffold_77 0.5212 35258 1.66893 eukarya 77 scaffold_78 0.195 35248 3.15768 organelle 78 scaffold_79 0.4137 35135 540.227 organelle 79 scaffold_80 0.1916 35060 1.61834 organelle 80 scaffold_81 0.5061 34973 2.04218 eukarya 81 scaffold_82 0.4149 34253 328.986 organelle 82 scaffold_83 0.2052 30879 2.12011 unknown 83 scaffold_84 0.3858 28719 5870.67 organelle 84 scaffold_85 0.3888 27991 4190.74 organelle 85 scaffold_86 0.3926 26371 780.334 organelle 86 scaffold_87 0.3471 25543 79.4332 eukarya 87 scaffold_88 0.1918 25157 2.37866 organelle 88 scaffold_89 0.558 25069 925.192 unknown 89 scaffold_90 0.2998 22585 126.543 eukarya 90 scaffold_91 0.3398 22307 3786.25 organelle 91 scaffold_92 0.308 20145 1758.45 organelle 92 scaffold_93 0.5509 251759 129.548 eukarya 93 scaffold_94 0.3836 214937 71.5338 eukarya 94 scaffold_95 0.3843 139078 26.7298 organelle 95 scaffold_96 0.339 101866 2631.44 organelle 96 scaffold_97 0.4181 138986 486.046 organelle 97 scaffold_98 0.4019 1175730 1254.63 organelle ```` Then Blobtools have created ``.fa`` files containing a certain type of DNA according to the ``tiara`` classification: ``mitochondrion_C_rostrata_JBAT.FINAL.fa`` (which contains mithocondrial DNA), ``plastid_C_rostrata_JBAT.FINAL.fa``(which contains plastidial DNA), ``bacteria_C_rostrata_JBAT.FINAL.fa``(which contains bacterial DNA) ,``unknown_C_rostrata_JBAT.FINAL.fa`` (which contains DNA from unknown source) and ``eukarya_C_rostrata_JBAT.FINAL.fa``, which contains eukaryotic DNA, the most important one for us and the one we will use in further analysis. ## B5. Quality Control with telomere analysis of the scaffolded and decontaminated assembly We will repeat the completedness and contiuguity analysis we performed for step A3. In this step we will evaluate the success in scaffolding, comparing the genome contiguity with A3_QC results. We expect to see a higher contigüity with fewer and longer DNA molecules.Finally, we will evaluate whether the assembly achieves the highest possible level of contiguity—telomere-to-telomere—using a tool that identifies telomeric repeats. ### Software requirements * Conda environments: * compleasm: * compleasm 0.2.6 (https://anaconda.org/bioconda/compleasm) * gfastats: * gfastats 1.3.6 (https://anaconda.org/bioconda/gfastats) * merqury * merqury 1.3 (https://anaconda.org/bioconda/merqury) * tidk * tidk 0.2.41 (https://anaconda.org/bioconda/tidk) * Perl scripts (I normally save them in a software folder): * contig_length_fasta.pl (https://github.com/ishengtsai/perlscripts_jit/blob/master/contig_length_fasta.pl) * R packages * ggplot2 (https://r-charts.com/es/ggplot2/) This script will be run in the folder ``B5_QC``. This scripts is quite fast to run, less than an hour. ```` mkdir -p ~/genome_assembly/B5_QC cd ~/genome_assembly/B5_QC ```` ### Run Script First, create the script file: nano B4_decontaminate.sbatch Then, we will add the commands whithin it. This script should take some minutes to run per sample. ```` #!/bin/bash #SBATCH --job-name=B5_QC #SBATCH --export=ALL #SBATCH --output=B5_QC_output #SBATCH --error=B5_QC_error #SBATCH --mem=60G #SBATCH --cpus-per-task=16 #SBATCH --time=6:00:00 #set source for conda source ~/anaconda3/etc/profile.d/conda.sh ### Set directories OUTDIR="/home/igomez/genome_assembly/B5_QC" INDIR="/home/igomez/genome_assembly/B4_decontaminate" SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt" for SAMPLE in $(cat $SAMPLE_DIR | cut -f1) do REF_EUR=$INDIR/${SAMPLE}/eukarya_${SAMPLE}_JBAT.FINAL.fa mkdir -p $OUTDIR/${SAMPLE} cd $OUTDIR/${SAMPLE} ## Find telomeres conda activate tidk tidk find -c Poales -d $OUTDIR/${SAMPLE} -o ${SAMPLE}_telomeres --log $REF_EUR conda deactivate echo "telomeric analysis done for ${SAMPLE}" ## First test: genome completedness with compleasm conda activate compleasm compleasm run -a $REF_EUR -o $OUTDIR/${SAMPLE}/${SAMPLE}_compleasm -l poales_odb10 -t 16 conda deactivate echo "compleasm done for ${SAMPLE}" ## Second test: Genome contiguity with Gfastats: Contigs and scaffolds length conda activate gfastats gfastats $REF_EUR -j 16 > $OUTDIR/${SAMPLE}/${SAMPLE}_gfastats.txt conda deactivate echo "gfastats done for ${SAMPLE}" ## Third test: General scaffold lenth distribution with perl script. perl /home/igomez/software/contig_length_fasta.pl $REF_EUR > $OUTDIR/${SAMPLE}/${SAMPLE}_scaffold_length.txt echo "contig list complete for ${SAMPLE}" ## Forth test: Repeatedness with merqury #create directory to run merqury mkdir -p $OUTDIR/${SAMPLE}/merqury cd $OUTDIR/${SAMPLE}/merqury #The meryl reads is calculated from the PacBio reads so it is the same for all the merqury analysis #in the assembly cp -r ~/genome_assembly/A3_QC/${SAMPLE}/${SAMPLE}.meryl . # Set path to to the alternate assembly REF_ALT=~/genome_assembly/A2_purge_dups/${SAMPLE}/alternate/purged_alt.fa #Run merqury conda activate merqury merqury.sh ../${SAMPLE}.meryl $REF_EUR $REF_ALT B3_merqury_${SAMPLE} conda deactivate ## Fifth test: Find telomeres conda activate tidk tidk find -c Poales -d $OUTDIR/${SAMPLE} -o ${SAMPLE}_telomeres --log $REF_EUR conda deactivate echo "telomeric analysis done for ${SAMPLE}" done ### SOFTWARE VERSIONS # compleasm 0.2.6 # contig_length_fasta.pl https://github.com/ishengtsai/perlscripts_jit/blob/master/contig_length_fasta.pl # gfastat 1.3.6 # merqury 1.3 # tidk 0.2.41 ```` The plot produced by tidk software, is not very clear, so using the ``.tsv`` file created by the ``tidk plot`` we can create a better plot using the ``ggplot2``. Therefore, download the ``.tsv`` in your computer ( ``scp usuario@login.spc.cica.es:~/genome_assembly/B5_QC/C_rostrata/C_rostrata_telomeres.tsv /your/home/folder/gmetp_linux_64.tar.gz usuario@login.spc.cica.es:~/`` ) and run the following script on Rstudio: Run this script in RStudio in your computer. ```` ###Create a function to plot telomeres plot_all_telomere_counts <- function(df, chr){ df_filt <- filter(df, id == chr) total <- ggplot(data = df_filt, aes(x=window, y=total_count)) + geom_bar(stat='identity', col='blue') + theme_classic() + ggtitle(label=chr) + labs(x='Position (Mb)', y='Count of TTAGG per window') + theme_classic() + theme(panel.background = element_rect(fill = 'transparent'), #transparent panel bg plot.background = element_rect(fill = 'transparent', color = NA), #transparent plot bg panel.grid.major = element_blank(), #remove major gridlines panel.grid.minor = element_blank(), #remove minor gridlines legend.background = element_rect(fill = 'transparent'), #transparent legend bg legend.box.background = element_rect(fill = 'transparent')) #transparent legend panel return(total) } plot_all_chr <- function(df){ total <- ggplot(data = df, aes(x=window, y=total_count)) + geom_bar(stat='identity', col='blue') + theme_classic() + labs(x='Position (Mb)', y='Count of TTAGG per window') + facet_wrap(~id) + ylim(0,200) + # to prevent outliers driving the scale theme_classic() + theme(panel.background = element_rect(fill = 'transparent'), #transparent panel bg plot.background = element_rect(fill = 'transparent', color = NA), #transparent plot bg panel.grid.major = element_blank(), #remove major gridlines panel.grid.minor = element_blank(), #remove minor gridlines legend.background = element_rect(fill = 'transparent'), #transparent legend bg legend.box.background = element_rect(fill = 'transparent')) #transparent legend panel return(total) } #------------------------------------------------------------------------------------------- ###Then we run the function library(ggplot2) setwd("your/path/way/to/tsv_file") df <- read.csv('${SAMPLE}_.tsv', sep='\t') df$total_count <- df$forward_repeat_number + df$reverse_repeat_number df$window <- df$window/1000000 # convert to Mb X3311_telo_plot <- plot_all_chr(df) X3311_telo_plot ```` ## B6. Eliminate small scaffolds ### Software requirements * Conda environments * seqkit * seqkit 2.8.2 (https://anaconda.org/bioconda/seqkit) ```` mkdir -p B6_clean_assemblies cd B6_clean_assemblies ```` ### Run script ```` #!/bin/bash #SBATCH --job-name=B6_clean_assemblies #SBATCH --export=ALL #SBATCH --output=B6_clean_assemblies_output #SBATCH --error=B6_clean_assemblies_error #SBATCH --mem=60G #SBATCH --cpus-per-task=16 #SBATCH --time=23:30:00 ### Set directories INDIR=~/genome_assembly/B4_decontaminate OUTDIR=~/genome_assembly/B6_clean_assemblies SAMPLE_DIR=~/genome_assembly/raw_data/samples.txt for SAMPLE in $(cat $SAMPLE_DIR | cut -f1) do mkdir -p $OUTDIR/${SAMPLE} cd $OUTDIR/${SAMPLE} source ~/anaconda3/etc/profile.d/conda.sh conda activate seqkit MINLEN=4000000 #maybe modify it according to contig length list, hiC and telomeres. seqkit seq --threads 16 --min-len $MINLEN $INDIR/${SAMPLE}/eukarya_${SAMPLE}_JBAT.FINAL.fa > $OUTDIR/${SAMPLE}/${SAMPLE}.fasta conda deactivate echo "clean assemblies generated without small, umapped scaffolds for ${SAMPLE}" perl ~/software/contig_length_fasta.pl ${SAMPLE}.fasta > clean_assembly_${SAMPLE}_scaffold_length.txt conda activate gfastats gfastats ${SAMPLE}.fasta -j 16 > clean_assembly_${SAMPLE}_gfastats.txt conda deactivate echo "trimming test done for ${SAMPLE}" done ```` ### Output ```` ```` ## C. Annotation and Synteny graphs ## C1. Soft masking ### Software requirements * Conda environments * Repeatmasker * repeatmasker 4.1.5(https://anaconda.org/bioconda/repeatmasker) ### Run script ```` #!/usr/bin/bash #SBATCH --job-name=C1_repeatmask #SBATCH --output=C1_repeatmask_output #SBATCH --error=C1_repeatmask_error #SBATCH --nodes=1 # Using one node #SBATCH --cpus-per-task=24 #SBATCH --mem=120G #SBATCH --partition=standard #SBATCH --time=24:00:00 ### Set directories INDIR=/home/igomez/genome_assembly/B6_clean_assemblies OUTDIR=/home/igomez/genome_assembly/C1_repeatmask SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt" ### Run repeatMasker in parallel rmasker() { SAMPLE=$1 REF_trim=$INDIR/${SAMPLE}/${SAMPLE}.fasta #set source for conda and activate environment source ~/anaconda3/etc/profile.d/conda.sh conda activate repeatmasker #establish working directories mkdir -p $OUTDIR/${SAMPLE} cd $OUTDIR/${SAMPLE} RepeatMasker -pa 4 -e rmblast -species ${SAMPLE} -dir $OUTDIR/${SAMPLE} -xsmall -html -gff $REF_trim # -pa runs in parallel (pa 4 uses 16 cores in total in combination with rmblast) # -e specifies the search engine (rmblast = ncbi is the one used by EarlGrey) # -nolow does not mask low complexity regions # -s slow but more sensitive, -q fast but less sensitive # IMPORTANT if using RepeatMasker before gene prediction, I should not mask low complexity regions # because sometimes codons from genes would be masked # -xsmall for softmasking (repeats in lowercase) rather than changing them to Ns # -html and -gff create additional output files conda deactivate } ### Parallelize over individuals export INDIR OUTDIR SAMPLE_DIR export -f rmasker parallel 'rmasker {1}' :::: $SAMPLE_DIR ### SOFTWARE VERSIONS # repeatmasker 4.1.5. ```` ## C3. Gene annotation with braker Once we have softmasked our genomes, we can procede with gene annotation. Braker3 allows for fully automated training of the gene prediction tools GeneMark-ES/ET/EP/ETP R14, R15, R17, F1 and AUGUSTUS from RNA-Seq and/or protein homology information, and that integrates the extrinsic evidence from RNA-Seq and protein homology information into the prediction. In contrast to other available methods that rely on protein homology information, BRAKER2 reaches high gene prediction accuracy even in the absence of the annotation of very closely related species and in the absence of RNA-Seq data. ### Software requirements * **Conda environments** * **braker3 3.0.8** (https://anaconda.org/bioconda/braker3) * **Programs** * **Genemark-ETP 1.0** (https://github.com/gatech-genemark/GeneMark-ETP). We need the Genemark-ETP 1.0 software (http://topaz.gatech.edu/Genemark/license_download.cgi). You will need to register and then download the software and the key in your computer to then upload it to your cluster. Store the .key in your home folder in the cluster. Then, decompress them and procede with the installation. Make sure Perl and Python3 is installed. Make sure Perl and Python3 is previously installed. ```` scp /your/home/folder/gmetp_linux_64.tar.gz igomez@login.spc.cica.es:/home/igomez/software/gmetp_linux_64.tar.gz scp /your/home/folder/gm_key_64.gz igomez@login.spc.cica.es:/home/igomez/gm_key_64.gz ssh igomez@login.spc.cica.es gunzip ~/gm_key_64.gz cd software tar -xvzf gmetp_linux_64.tar.gz ```` * **ProtHint** (https://github.com/gatech-genemark/ProtHint) ```` cd ~/software git clone https://github.com/gatech-genemark/ProtHint.git ```` Running the braker.pl command requires many dependencies and a very long computational time, much longer than anything we have run in this tutorial (3-4 days at least per genome). As braker installation is tricky due to the many different dependencies accross multiple manually installed softwares, it is strongly encouraged to run a test with a small dataset. We will use the example proposed in the braker documentation (https://github.com/Gaius-Augustus/BRAKER#example-data), where we will annotate 1,000,000 nucleotides of *Arabidopsis tathiana* using protein data from OrthoDB, particurlarly the database for Brassicales ``brassicales_odb10``. The data is available in https://github.com/Gaius-Augustus/BRAKER/tree/master/example. We will perform the test in a folder called ``test``. ```` mkdir -p ~/genome_assembly/C3_braker/test cd ~/genome_assembly/C3_braker/test wget https://github.com/Gaius-Augustus/BRAKER/blob/master/example/genome.fa wget https://github.com/Gaius-Augustus/BRAKER/blob/master/example/proteins.fa ```` Then, we will launch a test job. This test should take less than an hour to finish. ``nano ~/genome_assembly/C3_braker/test/C3_braker_test.sbatch`` This is the test script: ```` #!/usr/bin/bash #SBATCH --job-name=C3_braker_test #SBATCH --output=C3_braker_test_output #SBATCH --error=C3_braker_test_error #SBATCH --nodes=1 #SBATCH --cpus-per-task=32 #SBATCH --mem=240G #SBATCH --partition=standard #SBATCH --time=6:00:00 ###Set dependencies directories and export them GENEMARK_PATH=/home/igomez/software/gmetp_linux_64/bin PROTHINT_PATH=/home/igomez/software/gmetp_linux_64/bin/gmes/ProtHint/bin ###Set conda source and activate environment source ~/anaconda3/etc/profile.d/conda.sh conda activate braker export GENEMARK_PATH PROTHINT_PATH ###Set output directories OUTDIR="/home/igomez/genome_assembly/C3_braker/test" cd $OUTDIR #Run braker braker.pl --genome=genome.fa --prot_seq=proteins.fa --threads=32 --species=arabidopsis_thatiana --busco_lineage=brassicales_odb10 conda deactivate ### SOFTWARE VERSIONS # BRAKER version 3.0.8 # genemark_ETP # ProtHint v2.6 ```` When it finishes check the error file ``C3_braker_test_error``. Then go to the ``braker`` folder and check the ``braker.log`` file, the error folder and all the files whithin it. Your output folder should look like this: ```` cd ~/genome_assembly/C3_braker/test du -h * 880K Augustus 20M bbc/genemark/brassicales_odb10_hmmsearch_output 21M bbc/genemark 20M bbc/augustus/brassicales_odb10_hmmsearch_output 21M bbc/augustus 21M bbc/braker/brassicales_odb10_hmmsearch_output 21M bbc/braker 61M bbc 4.0K best_by_compleasm.log 128K braker.aa 376K braker.codingseq 396K braker.gtf 76K braker.log 24K errors 2.6M GeneMark-EP 2.3M GeneMark-ES 4.0K genome_header.map 428K hintsfile.gff 140K prothint.gff 384K species/arabidopsis_thatiana 388K species 4.0K what-to-cite.txt ```` Pay attention to any error messages sometimes. Sometimes it does not find some scripts or executables. I'll show an example: ```` cat ~/genome_assembly/C3_braker/test/C3_braker_test_error ERROR: augustus binary not found or not executable. ```` Make sure it exists in your cluster using ``find``: ```` find home/igomez -name augustus /home/igomez/augustus/bin ```` Then, just add the path to the slurm script:``export AUGUSTUS_BIN_PATH=~/augustus/bin`` Sometimes installations are incomplete and you will have to go to GitHub repositories and copy the scripts directly in ``bin``or ``scripts`` folder inside the software directories or whithin the conda environment folder. ### Run script ```` #!/usr/bin/bash #SBATCH --job-name=C2_braker #SBATCH --output=C2_braker_output #SBATCH --error=C2b_braker.error #SBATCH --nodes=1 #SBATCH --cpus-per-task=64 #SBATCH --mem=180G #SBATCH --partition=standard #SBATCH --time=120:00:00 ###Set directories OUTDIR="/home/igomez/genome_assembly_helodes_thesis/C2_braker" export GENEMARK_PATH=/home/igomez/software/gmetp_linux_64/bin export PROTHINT_PATH=/home/igomez/software/gmetp_linux_64/bin/gmes/ProtHint/bin ###Create braker function REF=~/genome_assembly_helodes_thesis/C1b_repeatmasker/1JMC18/1JMC18.fasta.masked PROT=~/genome_assembly_helodes_thesis/C2a_braker/poales_odb10/refseq_db.faa #set directories mkdir -p $OUTDIR/1JMC18 cd $OUTDIR/1JMC18 #set source for conda and activate environment source ~/anaconda3/etc/profile.d/conda.sh conda activate braker #Run braker braker.pl --genome=$REF --prot_seq=$PROT --species=1JMC18_3 --threads=64 --busco_lineage=poales_odb10 --gff3 conda deactivate ### SOFTWARE VERSIONS # BRAKER version 3.0.8 # genemark_etp (install whithin braker env) # ProtHint v2.6 (install with git clone outside environment) ````