# Genome assembly from PacBio and Hi-C data

## Introduction

We will assemble two genomes at the same time throughout this tutorial. This methodology will allow you to assemble multiple genomes more efficiently and consistently. In particular, we will assemble two *Carex* species: *Carex rostrata* Stokes and *Carex helodes* Link. These species have small genomes, so they require less computational resources and less time to assemble. They also have different levels of heterozygosity, allowing us to see what genome assembly looks like at different levels of heterozygosity.

In bioinformatics it is very important to be consistent when naming folders and files. Before starting, we will choose the names of our samples: ``C_rostrata`` and ``C_helodes``.

## Structure of this manual

This manual explains all of the steps necessary to perform genome assembly, gene annotation and synteny analysis on one, two or more samples, using HiFi long reads and Hi-C paired-end short reads. Each step in genome assembly is explained in a separate section of the tutorial. The sections correspond to the four main parts of this tutorial: **Long reads quality control (QC)**, **De novo assembly and duplicate purging (A)**, **Scaffolding and polishing of the genome assembly (B)** and **Gene annotation and Synteny Analysis (C)**. Each section has four main parts: a short introduction, a list of the software you need, the script you will launch as a *slurm* job, and an explanation of the expected output and how to interpret it.

Most of the analysis will be performed on a cluster whose operating system is Linux. Create a folder inside the cluster, where we will perform all our analysis.
``mkdir -p ~/genome_assembly``

Then move to this directory:

``cd ~/genome_assembly``

Most of the analyses will be submitted as jobs to the **SLURM (Simple Linux Utility for Resource Management)** workload manager, as they normally require multiple hours to run and large computational resources. In order to run a command through **SLURM** you need to create a file with the extension ``.sbatch`` containing the command preceded by a proper header. Below you can see an example of a job ready to be launched.

Create a ``practice`` folder to practice launching **SLURM** jobs:

````
mkdir -p ~/genome_assembly/practice
cd ~/genome_assembly/practice
````

``nano print_hello.sbatch``

````
#!/bin/bash
#SBATCH --job-name=print_hello        # Job name
#SBATCH --output=print_hello_output   # Standard output (use %j to insert job ID)
#SBATCH --error=print_hello_error     # Standard error (use %j to insert job ID)
#SBATCH --ntasks=1                    # Number of tasks (cores)
#SBATCH --nodes=1                     # Number of nodes
#SBATCH --time=1:00:00                # Time limit (1 hour)
#SBATCH --mem=1G                      # Memory limit (1 GB)
#SBATCH --cpus-per-task=1             # Number of CPU cores per task

echo "Hello"
````

We can launch it with ``sbatch print_hello.sbatch``. Then, using ``squeue``, we can check how long the job has been running and whether it is still running. Here you can see which node it was assigned to and whether it is running (R) or pending (PD), which means all the nodes with the characteristics needed to run your job are occupied and you will need to wait for other jobs to end.

````
squeue
JOBID  PARTITION         NAME    USER  ST  TIME  NODES  NODELIST(REASON)
 9848   standard  print_hello  igomez   R  0:02      1  node3011
````

When it ends it will appear like this:

````
squeue
JOBID  PARTITION         NAME    USER  ST  TIME  NODES  NODELIST(REASON)
````

Then we will proceed to check the error file for any errors. We will also check the output file or the output folder, depending on the script.
````
cat print_hello_error
cat print_hello_output
Hello
````

In each section of the tutorial we will launch a **SLURM** job. Furthermore, as we will assemble two or more genomes at the same time, we will **parallelize** or **loop** the commands. Normally, parallelization is reserved for the more time- and resource-consuming jobs.

First, in order to parallelize the commands, we need a ``.txt`` file that contains the names of the samples to assemble. It will be stored within the ``genome_assembly`` folder, inside another folder called ``raw_data``. This file will be used to parallelize the commands in every step.

````
cd ~/genome_assembly
mkdir -p raw_data
cd raw_data
nano samples.txt
````

Then write within the file:

````
C_rostrata
C_helodes
````

Once we have this file we will launch a job **looping** over a simple printing command:

``cd ~/genome_assembly/practice``

``nano print_loop.sbatch``

Copy this into the sbatch file:

````
#!/bin/bash
#SBATCH --job-name=print_loop        # Job name
#SBATCH --output=print_loop_output   # Standard output (use %j to insert job ID)
#SBATCH --error=print_loop_error     # Standard error (use %j to insert job ID)
#SBATCH --ntasks=1                   # Number of tasks (cores)
#SBATCH --nodes=1                    # Number of nodes
#SBATCH --time=1:00:00               # Time limit (1 hour)
#SBATCH --mem=1G                     # Memory limit (1 GB)
#SBATCH --cpus-per-task=1            # Number of CPU cores per task

SAMPLE_DIR=~/genome_assembly/raw_data/samples.txt

for SAMPLE in $(cat $SAMPLE_DIR | cut -f1)
do
  echo "printing ${SAMPLE}"
done
````

Then we launch it:

``sbatch print_loop.sbatch``

Once it has finished, we check the output:

````
cat ~/genome_assembly/practice/print_loop_output
printing C_rostrata
printing C_helodes
````

Now we will do the same with GNU ``parallel``:

``nano print_parallel.sbatch``

Copy this into the ``.sbatch`` file:

````
#!/bin/bash
#SBATCH --job-name=print_parallel        # Job name
#SBATCH --output=print_parallel_output   # Standard output (use %j to insert job ID)
#SBATCH --error=print_parallel_error     # Standard error (use %j to insert job ID)
#SBATCH --ntasks=1                       # Number of tasks (cores)
#SBATCH --nodes=1                        # Number of nodes
#SBATCH --time=1:00:00                   # Time limit (1 hour)
#SBATCH --mem=1G                         # Memory limit (1 GB)
#SBATCH --cpus-per-task=1                # Number of CPU cores per task

SAMPLE_DIR=~/genome_assembly/raw_data/samples.txt

printing(){
  SAMPLE=$1
  echo "printing ${SAMPLE}"
}

export SAMPLE_DIR
export -f printing
parallel 'printing {1}' :::: $SAMPLE_DIR
````

Then, launch the slurm job:

``sbatch print_parallel.sbatch``

Finally, check the output:

````
cat print_parallel_output
printing C_rostrata
printing C_helodes
````

These jobs were just for practicing job launching, so once you have finished you can delete the ``practice`` folder:

``rm -r ~/genome_assembly/practice``

## Transfer the data to the cluster

The sequence data for *Carex rostrata* will be downloaded from the Darwin Tree of Life project (https://portal.darwintreeoflife.org/data/); where the *Carex helodes* genome data will be deposited is not yet decided. We will store the data within the ``raw_data`` folder. We will create one folder for each type of data, ``PacBio`` and ``hiC``, and then, within each of them, another folder for each species, following the names we have previously chosen.

````
cd ~/genome_assembly/raw_data
mkdir PacBio hiC
mkdir PacBio/C_rostrata PacBio/C_helodes
mkdir hiC/C_rostrata hiC/C_helodes
````

The data can be distributed across more than one file; you should check the structure on the website. You can download the three *Carex rostrata* PacBio files and the two Hi-C files (one per read direction) directly from the Darwin Tree of Life portal using the following commands.
As it will take some time, you can launch a slurm job to download the data:

``nano downloading.sbatch``

````
#!/usr/bin/bash
#SBATCH --job-name=downloading
#SBATCH --output=downloading_output
#SBATCH --error=downloading_error
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --mem=40G
#SBATCH --partition=standard
#SBATCH --time=12:00:00

wget -O /home/igomez/genome_assembly/raw_data/PacBio/C_rostrata/C_rostrata_hifi_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR122/000/ERR12205300/ERR12205300.fastq.gz
wget -O /home/igomez/genome_assembly/raw_data/PacBio/C_rostrata/C_rostrata_hifi_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR122/001/ERR12205301/ERR12205301.fastq.gz
wget -O /home/igomez/genome_assembly/raw_data/PacBio/C_rostrata/C_rostrata_hifi_3.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR122/002/ERR12205302/ERR12205302.fastq.gz
wget -O /home/igomez/genome_assembly/raw_data/hiC/C_rostrata/C_rostrata_hiC_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR122/034/ERR12245634/ERR12245634_1.fastq.gz
wget -O /home/igomez/genome_assembly/raw_data/hiC/C_rostrata/C_rostrata_hiC_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR122/034/ERR12245634/ERR12245634_2.fastq.gz
````

Then merge the PacBio files (if necessary) into a single file and decompress everything. This can take some time, so it is better to do it with a script: ``nano decompressing.sbatch``
````
#!/usr/bin/bash
#SBATCH --job-name=decompressing
#SBATCH --output=decompressing_output
#SBATCH --error=decompressing_error
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --mem=40G
#SBATCH --partition=standard
#SBATCH --time=12:00:00

cd /home/igomez/genome_assembly/raw_data/PacBio/C_rostrata
# Merge and decompress the fastq files
zcat *.fastq.gz > C_rostrata_hifi.fastq

cd /home/igomez/genome_assembly/raw_data/hiC/C_rostrata
gunzip C_rostrata_hiC_1.fastq.gz
gunzip C_rostrata_hiC_2.fastq.gz

cd /home/igomez/genome_assembly/raw_data/PacBio/C_helodes
gunzip C_helodes_hifi.fastq.gz

cd /home/igomez/genome_assembly/raw_data/hiC/C_helodes
gunzip C_helodes_hiC_1.fastq.gz
gunzip C_helodes_hiC_2.fastq.gz
````

If you have not done so already, create the ``samples.txt`` file inside the ``raw_data`` folder by typing ``nano samples.txt`` and writing the names of the species folders, in this case ``C_helodes`` and ``C_rostrata``:

```
C_helodes
C_rostrata
```

## QC. Quality control of PacBio reads

In this section, we will explain how to evaluate the quality of our PacBio reads and start to evaluate key genome features such as genome size and complexity.

### Software requirements

- Conda environments:
  - kmers:
    - kmer-jellyfish 2.3.1 (https://anaconda.org/bioconda/kmer-jellyfish)
  - nanoplot:
    - nanoplot 1.42.0 (https://anaconda.org/bioconda/nanoplot)
  - smudgeplot:
    - smudgeplot 0.2.5 (https://anaconda.org/bioconda/smudgeplot)
- R packages:
  - findGSE (https://github.com/schneebergerlab/findGSE)

These analyses will be stored in a folder named ``qc`` inside ``raw_data/PacBio``.

````
cd /home/igomez/genome_assembly/raw_data/PacBio
mkdir qc
````

## QC1. NanoPlot

NanoPlot is used to visualize and assess the quality of long-read sequencing data by generating plots of read lengths, quality scores and other metrics. In the early stages of genome assembly, this tool is crucial for identifying potential issues with the sequenced data, helping to prevent assembly errors caused by poor-quality reads.
By assessing metrics like read length distribution and quality scores, it ensures that the input data is reliable and suitable for accurate assembly.

### Software requirements

* Conda environments:
  * nanoplot:
    * nanoplot 1.42.0 (https://anaconda.org/bioconda/nanoplot)

### Run the script

Create the sbatch file by typing ``nano qc_raw_nanoplot.sbatch``, then copy, paste and edit (if necessary) the following code.

````
#!/usr/bin/bash
#SBATCH --job-name=qc_hifi_reads
#SBATCH --output=qc_hifi_reads_output
#SBATCH --error=qc_hifi_reads_error
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --cpus-per-task=4
#SBATCH --time=6:00:00

## Set the number of threads to use based on the allocated CPUs
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

## Define variables
INDIR="/home/igomez/genome_assembly/raw_data/PacBio"
OUTDIR="/home/igomez/genome_assembly/raw_data/PacBio/qc"
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt"

### Create a function to run NanoPlot
nanoplot_func () {
  SAMPLE=$1
  # Create a directory for each sample
  mkdir -p $OUTDIR/${SAMPLE}
  ## Load and activate the necessary conda environment
  source ~/anaconda3/etc/profile.d/conda.sh
  conda activate nanoplot
  # Run NanoPlot
  NanoPlot -t $SLURM_CPUS_PER_TASK --fastq $INDIR/${SAMPLE}/${SAMPLE}_hifi.fastq -o $OUTDIR/${SAMPLE} -p ${SAMPLE}_hifi --loglength --plots dot -f jpeg
  conda deactivate
  echo "nanoplot ${SAMPLE} done"
}

### Parallelize over samples
export INDIR OUTDIR SAMPLE_DIR
export -f nanoplot_func
parallel 'nanoplot_func {1}' :::: $SAMPLE_DIR
````

Launch the **slurm** job using ``sbatch qc_raw_nanoplot.sbatch``.

The images are not easily viewed from the cluster, so the best option is to download them to your computer. The same is recommended for tables, as they are explored more comfortably on your computer than on the cluster. Download them from a terminal on your computer (after exiting the cluster):
````
scp igomez@login.spc.cica.es:/home/igomez/genome_assembly/raw_data/PacBio/qc/C_rostrata/*.{jpeg,txt} your/home/folder/
````

### Output and interpretation

The NanoPlot graphs explore the main metrics describing the quality of the sequenced reads. Most of them focus on the length and the quality of the reads. The quality of a read corresponds to the average Phred-like quality score of all the bases in the read. This Phred-like quality score measures the probability of a base call being incorrect: a base with a quality score of Q10 has an accuracy of 90% (1 in 10 bases incorrect), while a quality score of Q20 indicates an accuracy of 99% (1 in 100 bases incorrect).

The NanoPlot environment includes a package called NanoStats, which calculates metrics of the quality and length distribution of the reads. These metrics are:

* **Number of reads**
* **Mean, median and standard deviation of the read length distribution:** The longer the reads the better; around 10,000 bp would be ideal. The distribution should be roughly normal, so the mean and the median should be similar. For *C. rostrata*, the mean read length is around 14,000, which is quite high. The read length distribution also seems to be normal, as the mean and median are quite similar.
* **Mean and median of the read quality distribution:** The higher the quality score the better. The threshold for genome assembly is usually around Q20.
* **Quality cutoffs** (``Number, percentage and megabases of reads above quality cutoffs``): This section of the table displays the number and percentage of reads that meet or exceed each quality threshold, and how many bases pass that cutoff. This is useful for choosing a quality cutoff for our data, ensuring that enough reads remain while keeping acceptable quality. In our case virtually all reads have a quality score over Q20, so we are not going to purge any reads.
* **Top 5 reads with the highest quality** (``Top 5 highest mean basecall quality scores and their read lengths``)
* **Top 5 longest reads** (``Top 5 longest reads and their mean basecall quality score``): With PacBio and other long-read sequencing technologies, reads can reach tens of thousands of bases.

The table also includes metrics about the amount of data we have: the **number of reads** and the **total number of bases**.

````
cat ~/genome_assembly/raw_data/PacBio/qc/C_rostrata/C_rostrata_hifiNanoStats.txt
General summary:
Mean read length:                 14,804.9
Mean read quality:                    27.3
Median read length:               14,639.0
Median read quality:                  31.7
Number of reads:               2,115,296.0
Read length N50:                  15,931.0
STDEV read length:                 4,556.5
Total bases:              31,316,780,083.0
Number, percentage and megabases of reads above quality cutoffs
>Q10:	2115296 (100.0%) 31316.8Mb
>Q15:	2115296 (100.0%) 31316.8Mb
>Q20:	2114923 (100.0%) 31311.3Mb
>Q25:	1747805 (82.6%) 25376.8Mb
>Q30:	1259036 (59.5%) 17300.0Mb
Top 5 highest mean basecall quality scores and their read lengths
1:	93.0 (4923)
2:	93.0 (4365)
3:	93.0 (61)
4:	93.0 (2385)
5:	93.0 (4645)
Top 5 longest reads and their mean basecall quality score
1:	44000 (21.2)
2:	43415 (20.7)
3:	43079 (20.6)
4:	42358 (23.9)
5:	42352 (20.3)
````

NanoPlot also produces plots that visually represent the read length, the read quality and the interaction between the two. Here is a short explanation of the different plots.

* **Length histograms** (``C_rostrata_hifiHistogramReadlength.jpg`` and ``C_rostrata_hifiLogTransformed_HistogramReadlength.jpg``): For genome assembly using PacBio or other long-read sequencing technologies, reads are expected to reach tens of thousands of bases. Normally, the longer the reads the better. There should be only one peak, as multiple peaks can indicate contamination. If the distribution is very skewed towards shorter reads, it can indicate problems with library preparation, degradation of the sample and/or incomplete sequencing.
This program also produces a log-transformed version of the histogram.

<div style="display: flex; justify-content: space-between;">
<div style="width: 49%; text-align: center;">
<img src="https://hackmd.io/_uploads/H1qpmmvpR.jpg" alt="![C_rostrata_hifiHistogramReadlength (002)]">
<p style="font-size: 0.6em; font-weight: bold;">Normal</p>
</div>
<div style="width: 49%; text-align: center;">
<img src="https://hackmd.io/_uploads/ryVC77wTC.jpg" alt="![C_rostrata_hifiLogTransformed_HistogramReadlength (002)]" width="100%" style="display: block;">
<p style="font-size: 0.6em; font-weight: bold;">Log transformed</p>
</div>
</div>

NanoPlot also produces **weighted length histograms**, where reads are weighted by length: the frequency measures how many bases fall at each read length. These graphs give more weight to longer reads, which are important in genome assembly. A log-transformed version is also produced.

<div style="display: flex; justify-content: space-between;">
<div style="width: 49%; text-align: center;">
<img src="https://hackmd.io/_uploads/SyLmb0taR.jpg" alt="![C_rostrata_hifiWeighted_HistogramReadlength]">
<p style="font-size: 0.6em; font-weight: bold;">Normal</p>
</div>
<div style="width: 49%; text-align: center;">
<img src="https://hackmd.io/_uploads/S1y4-RY6A.jpg" alt="![C_rostrata_hifiWeighted_LogTransformed_HistogramReadlength]" width="100%" style="display: block;">
<p style="font-size: 0.6em; font-weight: bold;">Log transformed</p>
</div>
</div>

* **Length vs quality graphs** (``C_rostrata_hifiLengthvsQualityScatterPlot_dot.jpg`` and ``C_rostrata_hifiLengthvsQualityScatterPlot_loglength_dot.jpg``): These graphs represent how the quality of the reads changes with read length. In sequencing, the longer the read, the more errors occur in the last positions.
In our case we can see that most of the reads have a quality score between 20 and 30, but among the shorter reads some reach a score of 80. Even the longest reads do not fall below 20, which is a very good sign.

<div style="display: flex; justify-content: space-between;">
<div style="width: 49%; text-align: center;">
<img src="https://hackmd.io/_uploads/SJAx87vTR.jpg" alt="![C_rostrata_hifiLengthvsQualityScatterPlot_dot]">
<p style="font-size: 0.6em; font-weight: bold;">Normal</p>
</div>
<div style="width: 49%; text-align: center;">
<img src="https://hackmd.io/_uploads/ryFWU7w6R.jpg" alt="![C_rostrata_hifiLengthvsQualityScatterPlot_loglength_dot]" width="100%" style="display: block;">
<p style="font-size: 0.6em; font-weight: bold;">Log transformed read lengths</p>
</div>
</div>

* **Yield by length graph**: This graph represents which read lengths contribute most to the data. The y-axis represents the total number of bases (total yield) for all reads of a given length and longer, and the x-axis represents the read length. This visualization highlights which read lengths contribute most significantly to your data. In our case the decrease in yield happens between reads whose length is between 10 kb and 20 kb; most of the information comes from reads shorter than 20 kb.

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/HJDjHRYaA.jpg" style="width:75%; text-align: center">
</div>

## QC2. GenomeScope

GenomeScope estimates key genome features, such as size, repeat content and genetic variation, using k-mer frequencies from DNA sequencing data, providing a quick overview of the genome's structure and complexity. k-mers are substrings of a specific length obtained from a DNA (or RNA) sequence. The k-mer distribution of a sequence consists of all the possible substrings of a previously established length (*k*). In the image below we can see all the possible k-mers of length 4.
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/r1x0pDoc6A.png" style="width:60%; text-align: center">
</div>

We will count the k-mers of all the reads in our data set. From this k-mer database, we will compute a k-mer coverage histogram, where the coverage is how many times a certain k-mer appears (x-axis) and the frequency is the number of distinct k-mers with a certain coverage (y-axis). This histogram will be calculated using the ``jellyfish`` software. Then, we will use ``GenomeScope``, available online (http://genomescope.org/). Its algorithm predicts the **genome size**, **heterozygosity**, **repeat content** and **error rate** of our genome by fitting a model to the k-mer coverage histogram.

The length of our k-mers (*k*) is very important in this analysis. A larger k-mer will capture more complexity, as we are examining longer sequences; however, the coverage of each k-mer will be lower. Also, the computational resources required increase with k-mer size. In more complex (highly repetitive) genomes a higher k-mer length is normally recommended. However, coverage also has to be considered: if our overall coverage is low, using a high k-mer length could be counterproductive. The most common value of k is **21**, as it is a good compromise between sufficient coverage and detecting complexity. Nonetheless, if our genome is very complex or our coverage is low, we should consider changing the value of k. In our case we will use **k = 21**.

### Software requirements

* Conda environments:
  * kmers:
    * kmer-jellyfish 2.3.1 (https://anaconda.org/bioconda/kmer-jellyfish)

### Run the script

Create the sbatch file by typing ``nano qc_jellyfish_raw.sbatch``, then copy, paste and edit (if necessary) the following code.
```
#!/usr/bin/bash

# SLURM job submission settings:
#SBATCH --job-name=qc_jellyfish_raw       # Job name to identify the task
#SBATCH --output=qc_jellyfish_raw_output  # File for standard output logs
#SBATCH --error=qc_jellyfish_raw_error    # File for error logs
#SBATCH --partition=standard              # Job partition/queue
#SBATCH --nodes=1                         # Number of nodes to use
#SBATCH --ntasks=1                        # Number of tasks
#SBATCH --cpus-per-task=16                # Number of CPU cores allocated to the task
#SBATCH --time=12:00:00                   # Maximum allowed runtime

# Define directories and input files
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt"  # File with sample names (one per line)
INDIR="/home/igomez/genome_assembly/raw_data/PacBio"            # Directory containing input .fastq files
OUTDIR="/home/igomez/genome_assembly/raw_data/PacBio/qc"        # Directory where output files will be saved

# Set the number of threads to use based on the allocated CPUs
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Function to run Jellyfish for k-mer counting and histogram creation
jellyfish_histo(){
  SAMPLE=$1
  # Create output directory
  mkdir -p $OUTDIR/${SAMPLE}
  # Load and activate the necessary conda environment
  source ~/anaconda3/etc/profile.d/conda.sh
  conda activate kmers
  # Count k-mers in the input .fastq file following the parameters specified in GenomeScope
  # -C: count canonical k-mers (counts both strands)
  # -m 21: k-mer length of 21
  # -t: number of threads
  # -s 1000000000: hash size for the k-mer database
  # -o: output file for the counted k-mers
  jellyfish count -C -m 21 -t $SLURM_CPUS_PER_TASK $INDIR/${SAMPLE}/${SAMPLE}_hifi.fastq -s 1000000000 -o $OUTDIR/${SAMPLE}/${SAMPLE}.jf
  echo "k-mer counting for ${SAMPLE} completed"
  # Generate a histogram of k-mer frequencies
  # -t: number of threads
  # Output is redirected to a .histo file
  jellyfish histo -t $SLURM_CPUS_PER_TASK $OUTDIR/${SAMPLE}/${SAMPLE}.jf > $OUTDIR/${SAMPLE}/${SAMPLE}.histo
  echo "Histogram generation for ${SAMPLE} completed"
}

# Export necessary variables and function for use by GNU Parallel
export INDIR OUTDIR SAMPLE_DIR
export -f jellyfish_histo

# Run the jellyfish_histo function in parallel for each sample listed in SAMPLE_DIR
parallel 'jellyfish_histo {1}' :::: $SAMPLE_DIR
```

When the run has finished, download the ``.histo`` files to your computer using

```
scp igomez@login.spc.cica.es:/home/igomez/genome_assembly/raw_data/PacBio/qc/C_rostrata/*histo your/home/folder
```

and upload them to GenomeScope (http://genomescope.org/). This ``.histo`` file contains the information necessary to recreate the coverage histogram. It has two columns: the first indicates the k-mer coverage, while the second indicates the number of distinct k-mers with that coverage. Remember to change the k-mer length in the GenomeScope box if you used a different one in ``jellyfish``. This ``.histo`` file will be used multiple times throughout this tutorial.

### Output and interpretation

Once we have uploaded our ``C_rostrata.histo`` file to GenomeScope, it will lead us to an ``.html`` page with multiple graphs and information. First, we will see the k-mer coverage histogram. There are two coverage histograms; the one on the right is cumulative. We will focus on the one on the left.

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/HyKhwRqTC.png" style="width:75%; text-align: center">
</div>

In this graph, the blue bars represent our histogram (the observed data), and the coloured lines represent the model fitted by GenomeScope. The parameters of the model are shown in the bottom right corner. However, in this tutorial we will not focus on model fitting and its parameters. The first thing one should check is whether the lines accurately reproduce the shape of the k-mer coverage histogram; if they do not, the estimations are not reliable. In our case the model appears to fit the k-mer coverage histogram quite accurately. The peaks in our graph are indicated by the black dashed lines.
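If you want a quick command-line sanity check before (or instead of) uploading the file, the position of the main peak can be read directly from the two-column ``.histo`` file. Below is a minimal sketch: the tiny ``toy.histo`` file and the error cutoff of 10 are made-up assumptions for illustration; on real data, point it at your own ``.histo`` file and adjust the cutoff to where the error peak ends.

```shell
# Toy two-column .histo file: k-mer coverage, number of distinct k-mers.
cat > toy.histo <<'EOF'
1 900000
2 400000
3 120000
30 50000
31 80000
32 60000
60 150000
61 220000
62 170000
EOF

# Report the coverage with the highest k-mer count, skipping the
# low-coverage bins that are dominated by sequencing errors.
awk '$1 > 10 && $2 > max { max = $2; peak = $1 } END { print "peak coverage:", peak }' toy.histo
# prints: peak coverage: 61
```

On a real diploid sample you would expect to see this peak at either the heterozygous or the homozygous coverage, whichever is taller in your histogram.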
Now, we will analyse the typical peaks found in a diploid genome k-mer coverage histogram:

* **Errors peak** (red line): Peak on the left side of the graph, corresponding to k-mers with very low coverage, which are typically due to sequencing errors. Based on the amount of k-mers with this very low coverage, GenomeScope estimates the sequencing error rate.
* **Heterozygous sequences peak**: The next peak as coverage increases corresponds to the heterozygous sequences. The height of this peak compared to the next one (the homozygous peak) determines how heterozygous your genome is. The level of heterozygosity influences the difficulty of the assembly, as more heterozygous genomes are normally harder to assemble. For *C. rostrata* this first peak is even bigger than the homozygous one, which means that the genome is quite heterozygous. In contrast, in *C. helodes* this peak is almost undetectable (see suppl. materials), which means that it is a very homozygous genome. In the literature, when **genome coverage** is mentioned it refers to the coverage of this peak.
* **Homozygous sequences peak**: At a coverage close to double that of the previous peak we find another big peak. This is the peak from homozygous regions, which is why its coverage is double the previous one.
* **High coverage peaks**: Peaks located at coverages higher than the homozygous peak come from duplicated and repetitive regions. Based on them, GenomeScope estimates the percentage of repetitive regions in our genome. A high proportion of repetitive regions also complicates genome assembly.

GenomeScope creates a model including only unique sequences (both heterozygous and homozygous), excluding repetitive regions; this one is represented by the yellow line. It also creates another model that includes the repetitive regions (black line).
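The intuition behind GenomeScope's size estimate can be reproduced with toy arithmetic: genome size is roughly the total number of error-free k-mers divided by the coverage of the homozygous peak. The numbers below are hypothetical, chosen only to be of the same order of magnitude as the *C. rostrata* values:

```shell
TOTAL_KMERS=23250000000  # hypothetical k-mer yield after discarding error k-mers
PEAK_COV=75              # hypothetical homozygous-peak coverage from the histogram
echo "approx. genome size (bp): $(( TOTAL_KMERS / PEAK_COV ))"
# prints: approx. genome size (bp): 310000000
```

With these made-up inputs the result (~310 Mb) lands in the same range as the real estimates reported below; the actual tools refine this division with fitted models for errors and repeats.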
GenomeScope also produces a table with estimates of some genome properties, giving the minimum and the maximum for each parameter. The mean estimates appear in a box within the k-mer coverage histogram described before.

| Property | Min | Max |
|---------------------------|-------------------|-------------------|
| Heterozygosity | 1.45527% | 1.47433% |
| Genome Haploid Length | 310,810,170 bp | 311,367,356 bp |
| Genome Repeat Length | 98,318,205 bp | 98,494,459 bp |
| Genome Unique Length | 212,491,965 bp | 212,872,897 bp |
| Model Fit | 91.4374% | 95.0758% |
| Read Error Rate | 0.268019% | 0.268019% |

The GenomeScope model fitted the k-mer coverage histogram quite well, as we suspected from the graph. Also, the sequencing error rate is low (0.27%), far less than 1%, as had been previously established by NanoPlot. From this table we confirm the high heterozygosity of *C. rostrata* and a moderately high proportion of repetitive regions.

Genome size is estimated by dividing the genome yield (total k-mers) by the coverage, excluding the errors. The unique-sequence model estimates the genome size from k-mers whose coverage corresponds to the homozygous peak of a diploid or lower, whereas the full model also accounts for repetitive regions. In simpler, smaller genomes with few repetitive regions these two models give similar results; however, in *C. rostrata* this is not the case.

## QC3. findGSE

The GenomeScope algorithm tends to underestimate genome size, especially in genomes with either high or low heterozygosity. In contrast to the negative binomial model of GenomeScope, findGSE iteratively fits a skew normal distribution, which has proven more accurate across a wider variety of genome sizes (Sun et al., 2018).

### Software requirements

* R packages (we will run them from `R studio` on your computer):
  * findGSE (https://github.com/schneebergerlab/findGSE)

### Run the script

To use findGSE, one needs to input a k value (k = 21 is the most commonly used) and the corresponding k-mer ``.histo`` file produced by jellyfish.

```
library(findGSE)
setwd("your/path/way")
## The value of exp_hom is an estimate of the coverage of the homozygous peak.
## You can read it from the GenomeScope graph.
findGSE(histo="C_rostrata.histo", sizek=21, outdir="GSE", exp_hom=75)
```

### Output and interpretation

This program generates two output files in the folder you indicated in the R script, named after the input histo file: a text file (``v1.94.est.C_rostrata.histo.genome.size.estimated.k21to21.fitted.txt``) and an image (``v1.94.est.C_rostrata.histo.sizek21.curvefitted.pdf``).

The text file contains estimates of genome properties, similar to those calculated by GenomeScope, focusing more on genome size. There are five genome size estimates, depending on how they are calculated:

* ``size_all``: considering all k-mers.
* ``size_exl``: excluding low-coverage k-mers, as they are probably sequencing errors.
* ``size_fit``: using the model fitted to the k-mer coverage data.
* ``size_cat``: using the fitted model for values within a certain threshold; outside that threshold (extremely low or high coverage) the observed data is used.
* ``size_cor2``: the fitted model normalized by the observed mean k-mer coverage of the homozygous region.

findGSE also estimates: the heterozygosity (``Het_rate``), given as a range, plus an uncertainty estimate (``het_fitting_delta``) when the uncertainty is high enough, as is the case for *C. helodes*; the repeat content ratio (``Est. ratio of repeats``); and the k-mer coverage (``Final k-mer cov``).

````
size_all	400468061
size_exl	377573383
size_cat	385868875
size_fit	376361987
size_cor2	380561362
Het_rate	0.01432125	0.01432125
Est. ratio of repeats	0.36377277
Final k-mer cov	73.59223956
````

The image is another k-mer coverage histogram with the observed data (in grey) and the different models used to calculate genome size in different colors, as indicated in the legend.

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/Bk2JhCzCC.png" style="width:75%; text-align: center">
</div>

Then there is a second graph that visually represents the differences between the genome size estimates:

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/rkXpa0GCR.png" style="width:75%; text-align: center">
</div>

As we can see in the graph, for *C. rostrata* all the estimates are quite similar. Normally, the most reliable one is **size_cor2**, as it accounts for the inaccuracy of the fitted model both at low coverage (sequencing errors) and at high coverage (repetitive regions). If we compare this estimate to the one made by **GenomeScope**, we can see how the latter underestimates the genome size by not correctly considering repetitive regions, which make up an important proportion of the genome in this species.

## QC4. Smudgeplot

Smudgeplot is a tool that helps determine the ploidy level of an organism by analyzing k-mer frequencies in sequencing data. It identifies and visualizes distinct clusters of k-mers that correspond to different copy numbers in the genome, allowing researchers to infer whether the organism is diploid, triploid, tetraploid, or another ploidy level. Even when we are confident about the ploidy level of our organism, it is good practice to test it before starting genome assembly, especially in plants, where different ploidy levels are well documented.

Create the sbatch file by typing ``nano smudgeplot.sbatch``, then copy, paste and edit (if necessary) the following code.
```
#!/usr/bin/bash
#SBATCH --job-name=smudgeplot
#SBATCH --output=smudgeplot_output
#SBATCH --error=smudgeplot_error
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=240G #This program requires a lot of memory
#SBATCH --time=23:00:00

# Define directories and input files
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt" # File with sample names (one per line)
INDIR="/home/igomez/genome_assembly/raw_data/PacBio" # Directory containing input .fastq files
OUTDIR="/home/igomez/genome_assembly/raw_data/PacBio/qc" # Directory where output files will be saved

### Function to run smudgeplot
smudgeplot_func(){
    SAMPLE=$1
    cd $OUTDIR/${SAMPLE}

    # Load and activate the necessary conda environment
    source ~/anaconda3/etc/profile.d/conda.sh

    ## Calculate coverage cutoffs, as k-mers with too low (probably errors)
    ## and too high (repetitive regions) coverage can negatively affect our estimation.
    # Upper and lower limits are calculated from the k-mer distribution
    conda activate smudgeplot
    L=$(smudgeplot.py cutoff ${SAMPLE}.histo L)
    U=$(smudgeplot.py cutoff ${SAMPLE}.histo U)
    conda deactivate
    echo "cutoffs calculated for ${SAMPLE}"

    # Eliminate k-mers outside these limits
    conda activate kmers
    jellyfish dump -c -L $L -U $U ${SAMPLE}.jf -o ${SAMPLE}_dump
    conda deactivate
    echo "dumping done for ${SAMPLE}"

    ## Calculate unique k-mer pairs. This is the most
    ## time- and resource-consuming step
    conda activate smudgeplot
    smudgeplot.py hetkmers -o ${SAMPLE}_kmer_pairs ${SAMPLE}_dump

    ## Then we plot the results.
    smudgeplot.py plot ${SAMPLE}_kmer_pairs_coverages.tsv -o ${SAMPLE}_smudgeplot
    conda deactivate
}

### Parallelize the function
export INDIR OUTDIR SAMPLE_DIR
export -f smudgeplot_func
parallel 'smudgeplot_func {1}' :::: $SAMPLE_DIR
```

Download the results using scp:

```
scp igomez@login.spc.cica.es:/home/igomez/genome_assembly/raw_data/PacBio/qc/{C_rostrata_smudgeplot_smudgeplot_log10.png,C_rostrata_smudgeplot_smudgeplot.png} ./
```

Do the same for *C. helodes*.
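Rather than editing the ``scp`` command once per sample, you can generate the commands for every sample in a loop. A minimal sketch that only prints the commands (a dry run, using the host and paths from the examples above); pipe its output to ``sh`` on your local machine to actually download the files:

```shell
# Print (dry run) one scp command per smudgeplot image and per sample.
print_smudgeplot_scp() {
    HOST="igomez@login.spc.cica.es"
    QCDIR="/home/igomez/genome_assembly/raw_data/PacBio/qc"
    for SAMPLE in C_rostrata C_helodes; do
        for SUFFIX in smudgeplot.png smudgeplot_log10.png; do
            echo "scp ${HOST}:${QCDIR}/${SAMPLE}_smudgeplot_${SUFFIX} ./"
        done
    done
}
print_smudgeplot_scp
```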
You will obtain this graph, where we represent the frequency of pairs of k-mers against each other, revealing patterns that indicate whether the organism is haploid (one set), diploid (two sets), or polyploid (multiple sets). The "smudges" or clusters of points in the graph correspond to different ploidy levels. This is the one for *Carex rostrata*. As we can see, the highlighted smudge is AB, indicating that it is a diploid organism.

![rostrata_smudgeplot_smudgeplot](https://hackmd.io/_uploads/HkgOr_c9C.png)

## A. De novo assembly

## A1. Haplotype-resolved De Novo Genome Assembly with hifiasm

Once we have made sure that our sequence data is of good quality and we have characterized our genome, we will assemble our genome de novo (which means without previous information). We will use the software hifiasm (Cheng et al., 2022), which, given both HiFi and Hi-C data, can produce high-quality assemblies in a short time: both primary/alternate assemblies and haplotype-resolved assemblies.

A primary genome assembly is a single, blended sequence that combines genetic material from both parental haplotypes without distinguishing between them. This is the most complete and contiguous assembly. The additional sequences that represent significant alternative genetic variants or structural differences are called the alternate assembly. In contrast, a haplotype-resolved assembly creates separate sequences for each parental haplotype, preserving their unique variations. In the primary/alternate assembly we only care about the contiguity and completeness of the primary genome, while in the haplotype-resolved one we care about both haplotypes.

### Software requirements

- Conda environments:
    * hifiasm:
        * hifiasm 0.19.9 (https://anaconda.org/bioconda/hifiasm)

Create the folder `A1_hifihiC_map` within genome_assembly.
````
mkdir -p ~/genome_assembly/A1_hifihiC_map
cd ~/genome_assembly/A1_hifihiC_map
````

### Run hifiasm script

We will create the script to produce both primary/alternate and haplotype-resolved de novo genome assemblies with hifiasm. First, we create the slurm script file.

```nano A1_hifiasm_hifihic_assembly.sbatch```

Then, we write the commands within it. This job can take several hours to run.

```
#!/usr/bin/bash
#SBATCH --job-name=A1_hifiasm_hifihic
#SBATCH --output=A1_hifiasm_hifihic_output
#SBATCH --error=A1_hifiasm_hifihic_error
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --mem=40G
#SBATCH --partition=standard
#SBATCH --time=12:00:00

# Define variables
INDIR_hifi="/home/igomez/genome_assembly/raw_data/PacBio"
INDIR_hic="/home/igomez/genome_assembly/raw_data/hiC"
OUTDIR="/home/igomez/genome_assembly/A1_hifihiC_map/"
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt"

# Run hifiasm
hifiasm_assembly() {
    SAMPLE=$1
    mkdir -p $OUTDIR/${SAMPLE}

    # Activate the conda environment
    source ~/anaconda3/etc/profile.d/conda.sh
    conda activate hifiasm

    hifiasm --primary -t 32 -D 10 -N 200 -o $OUTDIR/${SAMPLE}/${SAMPLE}.hifihic.asm --h1 $INDIR_hic/${SAMPLE}/${SAMPLE}_hiC_1.fastq --h2 $INDIR_hic/${SAMPLE}/${SAMPLE}_hiC_2.fastq $INDIR_hifi/${SAMPLE}/${SAMPLE}_hifi.fastq
    echo "hifiasm ${SAMPLE} done"
    conda deactivate
}

### Parallelize over samples
export INDIR_hifi INDIR_hic OUTDIR SAMPLE_DIR
export -f hifiasm_assembly
parallel 'hifiasm_assembly {1}' :::: $SAMPLE_DIR

### VERSIONS
# hifiasm 0.19.9
```

### Hifiasm output

Below is the output you should get for C_rostrata; we will look through the files. There are four assemblies: primary (p_ctg), alternate (a_ctg) and both haplotype assemblies (hap1 and hap2). For each of the assemblies there is:

* GFA (Graphical Fragment Assembly) file (.gfa): They represent the assembly graph, containing the sequences of the contigs and the connections (edges) between them. These are the files that contain the assembly.
* BED (Browser Extensible Data) files (.lowQ.bed): They contain the coordinates of low-quality regions.
* Simplified version of the assembly graph (.noseq.gfa): The sequences have been removed from the graph so it can be used for visualization.
* Binary files (.bin): Used to store various types of data, typically related to the graph structure, overlaps, and other internal representations of the assembly process. They are not important for us.

There are also the u_ctg files; they represent an assembly without any ambiguities or gaps. We will not use these files either.

````
ls ~/genome_assembly/A1_hifihiC_map/C_rostrata
C_rostrata.hifihic.asm.ec.bin
C_rostrata.hifihic.asm.hic.a_ctg.gfa
C_rostrata.hifihic.asm.hic.a_ctg.lowQ.bed
C_rostrata.hifihic.asm.hic.a_ctg.noseq.gfa
C_rostrata.hifihic.asm.hic.hap1.p_ctg.gfa
C_rostrata.hifihic.asm.hic.hap1.p_ctg.lowQ.bed
C_rostrata.hifihic.asm.hic.hap1.p_ctg.noseq.gfa
C_rostrata.hifihic.asm.hic.hap2.p_ctg.gfa
C_rostrata.hifihic.asm.hic.hap2.p_ctg.lowQ.bed
C_rostrata.hifihic.asm.hic.hap2.p_ctg.noseq.gfa
C_rostrata.hifihic.asm.hic.lk.bin
C_rostrata.hifihic.asm.hic.p_ctg.gfa
C_rostrata.hifihic.asm.hic.p_ctg.lowQ.bed
C_rostrata.hifihic.asm.hic.p_ctg.noseq.gfa
C_rostrata.hifihic.asm.hic.p_utg.gfa
C_rostrata.hifihic.asm.hic.p_utg.lowQ.bed
C_rostrata.hifihic.asm.hic.p_utg.noseq.gfa
C_rostrata.hifihic.asm.hic.r_utg.gfa
C_rostrata.hifihic.asm.hic.r_utg.lowQ.bed
C_rostrata.hifihic.asm.hic.r_utg.noseq.gfa
C_rostrata.hifihic.asm.hic.tlb.bin
C_rostrata.hifihic.asm.ovlp.reverse.bin
C_rostrata.hifihic.asm.ovlp.source.bin
````

In order to use the .gfa files as input for other programs, we should transform them into the more commonly used FASTA format (.fa).
To do that we can run these commands:

````
cd ~/genome_assembly/A1_hifihiC_map/C_rostrata
awk '/^S/ {print ">" $2 "\n" $3}' C_rostrata.hifihic.asm.hic.p_ctg.gfa > C_rostrata.hifihic.asm.hic.p_ctg.fa
awk '/^S/ {print ">" $2 "\n" $3}' C_rostrata.hifihic.asm.hic.a_ctg.gfa > C_rostrata.hifihic.asm.hic.a_ctg.fa

cd ~/genome_assembly/A1_hifihiC_map/C_helodes
awk '/^S/ {print ">" $2 "\n" $3}' C_helodes.hifihic.asm.hic.p_ctg.gfa > C_helodes.hifihic.asm.hic.p_ctg.fa
awk '/^S/ {print ">" $2 "\n" $3}' C_helodes.hifihic.asm.hic.a_ctg.gfa > C_helodes.hifihic.asm.hic.a_ctg.fa
````

At this point, you should choose what type of assembly you need in your study: a primary or a haplotype-resolved one. This will depend on your goals. For comparative genomics, the primary assembly is more commonly used, while haplotype-resolved genomes can be required for functional genomics. Keep in mind that haplotype-resolved genome assembly takes twice the resources, as you have to do each step twice, once for each haplotype. In this tutorial we will do a primary assembly.

## A2. Reducing repetitiveness within the genome assembly by purging duplicates

Purge_dups is a software tool designed to enhance the accuracy of genome assemblies by identifying and removing duplicate sequences. These duplicates often result from assembly errors or the presence of highly similar genomic regions, such as tandem repeats or paralogous genes. The tool works by analyzing sequence coverage and alignments, distinguishing between true haplotypes and erroneous duplicates. By purging these duplicates, Purge_dups improves the continuity and correctness of assembled genomes, particularly in complex organisms with high repeat content. It is particularly useful for refining assemblies generated by long-read sequencing technologies like PacBio or Oxford Nanopore.
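Before moving on to purging, it is worth sanity-checking the GFA-to-FASTA conversion from the previous step: the number of ``>`` headers in the FASTA must equal the number of ``S`` (segment) lines in the GFA. A minimal sketch using a toy two-segment GFA (made up for illustration):

```shell
# Toy GFA with two segments and one link line.
printf 'S\tptg000001l\tACGT\nS\tptg000002l\tGGCC\nL\tptg000001l\t+\tptg000002l\t+\t0M\n' > demo.gfa

# Same conversion as above: every S line becomes a header + sequence pair.
awk '/^S/ {print ">" $2 "\n" $3}' demo.gfa > demo.fa

# Both counts must match; a mismatch means the conversion went wrong.
GFA_SEGS=$(grep -c '^S' demo.gfa)
FA_HEADERS=$(grep -c '^>' demo.fa)
echo "segments: $GFA_SEGS, headers: $FA_HEADERS"
```

On the real assemblies, replace `demo.gfa`/`demo.fa` with the `.p_ctg` and `.a_ctg` files.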
When we do *de novo* genome assembly of diploid or polyploid genomes, we must identify the homologous regions in heterozygous areas in order not to count them as different loci. This is called **purging duplicates**. While, as we have seen, hifiasm does this purging, it is recommended to additionally run a purging pipeline to make sure we have purged as many of the homologous regions as we can. This is especially important in highly heterozygous species such as *C. helodes*.

### Software requirements

- Conda environments:
    * purge_duplicates:
        * purge-dups 1.2.6 (https://github.com/dfguan/purge_dups)
        * Make sure to have installed the matplotlib 3.9.2 dependency (https://anaconda.org/conda-forge/matplotlib)
        * minimap2 2.28 (https://anaconda.org/bioconda/minimap2)

Create the folder `A2_purge_dups` within genome_assembly.

````
mkdir -p ~/genome_assembly/A2_purge_dups
cd ~/genome_assembly/A2_purge_dups
````

### Run purge-dups script

We will create the script to purge duplicates from both the primary and the alternate assembly from hifiasm. First, we create the slurm script file.

```nano A2_purge_dups_afterA1.sbatch```

Then, we write the commands within it. This job can take several hours to run.
```
#!/usr/bin/bash
#SBATCH --job-name=A2_purge_dups_after_hifiasm
#SBATCH --output=A2_purge_dups_after_hifiasm_output
#SBATCH --error=A2_purge_dups_after_hifiasm_error
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --mem=96G
#SBATCH --partition=standard
#SBATCH --time=23:30:00

# Define variables
OUTDIR="/home/igomez/genome_assembly/A2_purge_dups"
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt"

# Create the function with all the purge_dups commands
purge_dups_func() {
    SAMPLE=$1

    # Load conda source and activate conda environment
    source ~/anaconda3/etc/profile.d/conda.sh
    conda activate purge_dups

    # Load directories
    REF=/home/igomez/genome_assembly/A1_hifihiC_map/${SAMPLE}/${SAMPLE}.hifihic.asm.hic.p_ctg.fa
    ALT=/home/igomez/genome_assembly/A1_hifihiC_map/${SAMPLE}/${SAMPLE}.hifihic.asm.hic.a_ctg.fa
    READS=/home/igomez/genome_assembly/raw_data/PacBio/${SAMPLE}/${SAMPLE}_hifi.fastq

    # Step 1: Map reads to reference and calculate coverage
    mkdir -p $OUTDIR/${SAMPLE}
    cd $OUTDIR/${SAMPLE}
    minimap2 -xasm20 -t 12 $REF $READS | gzip -c - > ${SAMPLE}.paf.gz
    # Calculate coverage statistics
    pbcstat ${SAMPLE}.paf.gz
    # Calculate cutoff values based on coverage
    calcuts PB.stat > cutoffs 2> calcults.log
    echo "step 1 ${SAMPLE} done"

    # Intermediate step (optional): Create a histogram representing the coverage
    # and the cutoffs
    hist_plot.py -c cutoffs PB.stat cov_unpurged_hist.png

    # Step 2: Split the assembly to do self-self alignment
    split_fa $REF > ${SAMPLE}.split
    minimap2 -xasm5 -t 12 -DP ${SAMPLE}.split ${SAMPLE}.split | gzip -c - > ${SAMPLE}.split.self.paf.gz
    echo "step 2 ${SAMPLE} done"

    # Step 3: Purge haplotigs and overlaps
    purge_dups -2 -T cutoffs -c PB.base.cov ${SAMPLE}.split.self.paf.gz > dups.bed 2> purge_dups.log
    echo "step 3 ${SAMPLE} done"

    # Step 4: Get purged primary and haplotig sequences
    get_seqs -e dups.bed $REF
    echo "step 4 ${SAMPLE} done"

    ## We will do the same steps with the alternate assembly.
    # Create directory for the alternate assembly
    mkdir -p alternate
    cd alternate

    # When purging duplicates, more haplotig sequences were identified (hap)
    # that should be added to the ones identified by hifiasm, which are found
    # in the alternate assembly. So we will merge both files
    cat $ALT ../hap.fa > ${SAMPLE}_alt_hap.fa
    ALTREF=${SAMPLE}_alt_hap.fa

    # Step 1 again: Map reads to the reference genome
    minimap2 -xasm20 -t 12 $ALTREF $READS | gzip -c - > ${SAMPLE}.paf.gz
    pbcstat ${SAMPLE}.paf.gz
    calcuts PB.stat > cutoffs 2> calcults.log
    echo "step 1 part 2 ${SAMPLE} done"

    # Step 2 again: Align the reference genome with itself
    split_fa $ALTREF > ${SAMPLE}.split
    minimap2 -xasm5 -t 12 -DP ${SAMPLE}.split ${SAMPLE}.split | gzip -c - > ${SAMPLE}.split.self.paf.gz
    echo "step 2 part 2 ${SAMPLE} done"

    # Step 3 again: Purge duplicates
    purge_dups -2 -T cutoffs -c PB.base.cov ${SAMPLE}.split.self.paf.gz > dups.bed 2> purge_dups.log
    echo "step 3 part 2 ${SAMPLE} done"

    # Step 4 again: Get the alternate assembly
    get_seqs -e dups.bed $ALTREF
    echo "step 4 part 2 ${SAMPLE} done"

    conda deactivate
}

### Parallelize over samples
export OUTDIR SAMPLE_DIR
export -f purge_dups_func
parallel 'purge_dups_func {1}' :::: $SAMPLE_DIR

### SOFTWARE VERSIONS
# minimap2 2.28
# purge_dups 1.2.6
```

### Purging duplicates output

Your output should contain the files listed further below. This is an explanation of the most important ones:

* **Coverage per base pair:** The coverage per base pair is calculated by aligning the sequence reads to the assembled genome. Based on coverage, the heterozygous, homozygous, error and repetitive regions are identified.
    * **PB.base.cov:** Coverage for each position in the genome.
    * **PB.stat:** Statistics based on the per-base coverage distribution.
    * **cutoffs:** File containing the coverage thresholds for distinguishing between errors, heterozygous, homozygous and repetitive regions.
    * **cov_unpurged_hist.png**: Histogram representing the per-base coverage model (based on PB.stat) with the cutoffs. This is useful to identify the most important cutoffs:
        * Low coverage threshold (in red): Sequences with coverage below this threshold are considered sequencing errors.
        * Mid coverage threshold (in green): Separates heterozygous and homozygous sequences.
        * High coverage threshold (in blue): Sequences above this threshold are considered repetitive; they can come from repetitive regions or from duplicated regions that need purging.

````
cat ~/genome_assembly/A2_purge_dups/C_helodes/cutoffs
5	33	53	66	108	198
````

![cov_unpurged_hist_1JMC18](https://hackmd.io/_uploads/r11eqpxi0.png)

* **Primary purged assembly**: These files are within the species folder.
    * **purged.fa**: This is the purged primary assembly; it should be of similar size to the .p_ctg.fa file but a bit smaller (as the duplicated sequences are not present).
    * **hap.fa**: The sequences purged from the primary assembly are in this file. During the pipeline we merge them with the alternate assembly (.a_ctg.fa).
    * **dups.bed**: Coordinates of the duplicated regions.
* **Alternate purged assembly**: It contains the same files as the primary purged assembly. They are within the alternate folder.
    * **C_helodes_alt_hap.fa**: We do not apply purging directly to the alternate file output by hifiasm; instead, we first merge it with the hap.fa file from the primary purge.
    * **purged.fa**: Purged alternate assembly.
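If you want to reuse the thresholds in ``cutoffs`` in your own scripts, note that it is a single line of six whitespace-separated numbers; going by the histogram above, the first and last fields correspond to the low and high coverage cutoffs (check the purge_dups documentation for the meaning of the middle fields). A minimal sketch:

```shell
# Recreate the C_helodes cutoffs file shown above, for illustration.
printf '5\t33\t53\t66\t108\t198\n' > cutoffs

# First field: low cutoff (below it, sequencing errors);
# last field: high cutoff (above it, repetitive sequences).
read -r LOW _ _ _ _ HIGH < cutoffs
echo "low=$LOW high=$HIGH"
```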
It is recommended to rename the purged alternate assembly so it differs from the purged primary assembly, as the identical names cause confusion for certain software in the next step:

```
cd ~/genome_assembly/A2_purge_dups
mv ./C_rostrata/alternate/purged.fa ./C_rostrata/alternate/purged_alt.fa
mv ./C_helodes/alternate/purged.fa ./C_helodes/alternate/purged_alt.fa
```

List of the output you should get after the purge_dups script:

````
ls -R ~/genome_assembly/A2_purge_dups/C_helodes
.:
C_helodes.paf.gz  C_helodes.split  C_helodes.split.self.paf.gz  C_helodes.log
cov_unpurged_hist.png  cutoffs  dups.bed  hap.fa  PB.base.cov  PB.cov.wig
PB.stat  purged.fa  purge_dups.log

./alternate:
C_helodes_alt_hap.fa  C_helodes.paf.gz  C_helodes.split  C_helodes.split.self.paf.gz
C_helodes.log  cutoffs  dups.bed  hap.fa  PB.base.cov  PB.cov.wig  PB.stat
purged.fa  purge_dups.log
````

## A3. Quality control of the genome assembly, both purged and unpurged

At this point, we will do a quality control of the assembly focusing on the completeness, contiguity and repetitiveness of the primary assembly. This analysis will tell us if we should move forward with the assembly or if we should revise some of the previous steps. These tests will be repeated at later steps of the genome assembly, so we will be able to monitor how each process affects the quality of our genome assembly. For example, we will see how purging affects our genome assembly by applying quality control to both the purged (within the A2_purge_dups folder) and the unpurged assembly (within the A1_hifihiC_map folder).
### Software requirements

* Conda environments:
    * compleasm:
        * compleasm 0.2.6 (https://anaconda.org/bioconda/compleasm)
    * gfastats:
        * gfastats 1.3.6 (https://anaconda.org/bioconda/gfastats)
    * merqury:
        * merqury 1.3 (https://anaconda.org/bioconda/merqury)
* Perl scripts (I normally save them in a software folder):
    * contig_length_fasta.pl (https://github.com/ishengtsai/perlscripts_jit/blob/master/contig_length_fasta.pl)

The analysis will be performed in the folder A3_QC within genome_assembly.

````
mkdir -p ~/genome_assembly/A3_QC
cd ~/genome_assembly/A3_QC
````

### Run Quality Control script

First, create the script file:

````nano A3_QC.sbatch````

Then, we write the commands within it. We will run four tests in one script, for both the purged and the unpurged assembly. This job does not take a lot of resources and will finish in one or two hours, so instead of parallelizing it we will run it as a loop. If the ``merqury.sh`` command fails within the job, I recommend reconnecting to the cluster and running it directly in the console.
````
#!/usr/bin/bash
#SBATCH --job-name=A3_QC
#SBATCH --output=A3_QC_output
#SBATCH --error=A3_QC_error
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --mem=30G
#SBATCH --partition=standard
#SBATCH --time=6:00:00

### Set directories
OUTDIR="/home/igomez/genome_assembly/A3_QC"
INDIR_UNPURGED="/home/igomez/genome_assembly/A1_hifihiC_map"
INDIR_PURGED="/home/igomez/genome_assembly/A2_purge_dups"
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt"

### Loop over samples for quality control:
for SAMPLE in $(cat "$SAMPLE_DIR" | cut -f1)
do
    ## Load conda source:
    source ~/anaconda3/etc/profile.d/conda.sh

    ## First, we will run the tests on the unpurged assembly
    # Create directories
    mkdir -p $OUTDIR/${SAMPLE}
    mkdir -p $OUTDIR/${SAMPLE}/unpurged
    cd $OUTDIR/${SAMPLE}/unpurged

    # First test: We will test genome completeness with compleasm
    conda activate compleasm
    compleasm run -a $INDIR_UNPURGED/${SAMPLE}/${SAMPLE}.hifihic.asm.hic.p_ctg.fa -o $OUTDIR/${SAMPLE}/unpurged/ -l poales_odb10 -t 16
    conda deactivate

    # Second test: We will test contiguity with gfastats
    conda activate gfastats
    gfastats $INDIR_UNPURGED/${SAMPLE}/${SAMPLE}.hifihic.asm.hic.p_ctg.fa -j 16 > $OUTDIR/${SAMPLE}/unpurged/${SAMPLE}_gfastats.txt
    conda deactivate

    # Third test: General contig length distribution with a perl script.
    perl /home/igomez/software/contig_length_fasta.pl $INDIR_UNPURGED/${SAMPLE}/${SAMPLE}.hifihic.asm.hic.p_ctg.fa > $OUTDIR/${SAMPLE}/unpurged/${SAMPLE}_contig_length.txt

    # Fourth test: Did we purge enough?
    # with merqury
    conda activate merqury
    READS=/home/igomez/genome_assembly/raw_data/PacBio/${SAMPLE}/${SAMPLE}_hifi.fastq
    # meryl counts k-mers in the sequenced reads; this step is only
    # needed once per genome
    meryl k=21 count $READS output $OUTDIR/${SAMPLE}/${SAMPLE}.meryl
    merqury.sh $OUTDIR/${SAMPLE}/${SAMPLE}.meryl $INDIR_UNPURGED/${SAMPLE}/${SAMPLE}.hifihic.asm.hic.p_ctg.fa $INDIR_UNPURGED/${SAMPLE}/${SAMPLE}.hifihic.asm.hic.a_ctg.fa merqury_${SAMPLE}_unpurged
    conda deactivate

    ## Then, we run the tests for the purged assembly.
    mkdir -p $OUTDIR/${SAMPLE}/purged
    cd $OUTDIR/${SAMPLE}/purged

    conda activate compleasm
    compleasm run -a $INDIR_PURGED/${SAMPLE}/purged.fa -o $OUTDIR/${SAMPLE}/purged/ -l poales_odb10 -t 16
    conda deactivate

    conda activate gfastats
    gfastats $INDIR_PURGED/${SAMPLE}/purged.fa -j 16 > $OUTDIR/${SAMPLE}/purged/${SAMPLE}_gfastats.txt
    conda deactivate

    perl /home/igomez/software/contig_length_fasta.pl $INDIR_PURGED/${SAMPLE}/purged.fa > $OUTDIR/${SAMPLE}/purged/${SAMPLE}_contig_length.txt

    conda activate merqury
    merqury.sh $OUTDIR/${SAMPLE}/${SAMPLE}.meryl $INDIR_PURGED/${SAMPLE}/purged.fa $INDIR_PURGED/${SAMPLE}/alternate/purged_alt.fa merqury_${SAMPLE}_purged
    conda deactivate
done

### SOFTWARE VERSIONS
# compleasm 0.2.6
# contig_length_fasta.pl
# gfastats 1.3.6
# merqury 1.3
````

### Quality control output and interpretation

#### A3.1. Compleasm

##### What does it do?

This software evaluates whether complete genes can be found in our assembly. It has a database of genes from genomes distributed across the tree of life. In our case, the database is formed by the genes found in 17 different species of Poales. In addition to searching for those genes, it tells you whether a gene has been found multiple times or whether only part of the gene was found.

##### Output and general interpretation

The folders ``mb downloads`` and ``poales_odb10`` contain the gene database. The file that contains the results for our assemblies is ``summary.txt``.
If you open this file you will see something like this:

````
cat ~/genome_assembly/A3_QC/C_rostrata/purged/summary.txt
## lineage: poales_odb10
S:76.29%, 3735
D:4.84%, 237
F:0.96%, 47
I:0.00%, 0
M:17.91%, 877
N:4896

cat ~/genome_assembly/A3_QC/C_rostrata/unpurged/summary.txt
## lineage: poales_odb10
S:75.98%, 3720
D:5.17%, 253
F:0.96%, 47
I:0.00%, 0
M:17.89%, 876
N:4896
````

These are the results for the purged and unpurged *C. rostrata* assembly, respectively. This analysis tells you what number and percentage of genes...

* are found a single time (**S**)
* are found more than one time, as they are duplicated (**D**)
* are fragmented, i.e. the gene is distributed along more than one contig (**F**)
* are incomplete, i.e. only one part of the gene was found (**I**)
* are missing (**M**)

out of the total number of genes studied (**N**).

The higher the percentage of genes appearing only one time (**S**), the better the assembly. If the proportion of fragmented or incomplete genes is very high, that means the genome is still poorly assembled. A high proportion of duplicated genes could indicate that the genome assembly requires further purging to remove redundancy. However, it may also reflect true biological duplications, where the organism naturally possesses multiple copies of those genes. A lot of missing genes can indicate low-quality sequencing and/or a poor assembly, but it could also result from genetic divergence between the reference genomes used in the gene database and the species under study.

Comparing the results from the purged and unpurged genome assemblies is interesting to evaluate the purging. The proportion of duplicated genes should be lower in the purged assembly. The proportion of missing, fragmented or incomplete genes should not increase after purging. If this happens, it indicates that we have purged too much, deleting valuable genetic information.
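For quick side-by-side comparisons across samples and assemblies, the metrics in ``summary.txt`` can be pulled out with standard tools. A minimal sketch, assuming the one-metric-per-line ``S:…``/``D:…`` layout shown above:

```shell
# Recreate the purged C_rostrata summary shown above, for illustration.
cat > summary.txt <<'EOF'
## lineage: poales_odb10
S:76.29%, 3735
D:4.84%, 237
F:0.96%, 47
I:0.00%, 0
M:17.91%, 877
N:4896
EOF

# Split on ':' and '%' to extract the single-copy and duplicated percentages.
S_PCT=$(awk -F'[:%]' '/^S:/ {print $2}' summary.txt)
D_PCT=$(awk -F'[:%]' '/^D:/ {print $2}' summary.txt)
echo "single-copy=${S_PCT}%  duplicated=${D_PCT}%"
```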
##### *Carex rostrata* results:

| Metric | Unpurged Assembly | Purged Assembly |
|-----------------------|-------------------------|-------------------------|
| **Single Copy Genes (S)** | 75.98% (3720) | 76.29% (3735) |
| **Duplicated Genes (D)** | 5.17% (253) | 4.84% (237) |
| **Fragmented Genes (F)** | 0.96% (47) | 0.96% (47) |
| **Incomplete Genes (I)** | 0.00% (0) | 0.00% (0) |
| **Missing Genes (M)** | 17.89% (876) | 17.91% (877) |
| **Total Genes (N)** | 4896 | 4896 |

In our case the proportion of genes found only once is high (76%), while at the same time the proportions of fragmented (1%) and incomplete genes (0%) are very low. This indicates a good level of completeness and integrity in our assembly. However, there is a relatively high proportion of missing genes (18%). Although we should keep that in mind, it is most probably related to the genetic distance between the reference genomes and our species, as none of them are from the Cyperaceae family. The proportion of duplicated genes is quite high (5%) compared to the 1-2% normally expected. While the purging did decrease the number of duplicated genes, it also slightly increased the number of missing genes. Considering this, we may have overpurged a bit, compromising some valuable genetic information. Given all this, we could conclude that the genome itself is repetitive, leading to a high number of duplicated genes, that the assembly did not benefit from the further purging, and that we should use the hifiasm output (the one in the A1_hifihiC_map folder).

#### A3.2. Gfastats

##### What does it do?

This tool calculates summary statistics about the number, length and base composition of contigs and scaffolds. This analysis is useful to monitor the contiguity of the genome along the assembly process.

##### Output and general interpretation

As we have not yet done any scaffolding, we will focus on contig metrics.
In fact, as scaffolding has not been done yet, the scaffold and contig metrics are the same. The concept of gaps only applies after scaffolding is done, so we will ignore those metrics for now. The typical metrics in gfastats related to contigs are:

* the number of contigs (**# contigs:**)
* the sum of the lengths of all contigs (**Total contig length:**)
* the average length of the contigs (**Average contig length:**)
* the N50: the length of the contig at which 50% of the total assembly length is contained in contigs of this length or longer. This is one of the most commonly used metrics to assess the quality of a genome assembly, as contig length does not tend to follow a normal distribution. The higher, the better, as it indicates there are fewer and longer contigs. (**Contig N50:**)
* the auN (area under the N-curve), a more refined metric than the N50. It provides a weighted average of the contig lengths, considering their contribution to the total assembly. (**Contig auN:**)
* the L50: the number of contigs that make up half of the total contig length. A smaller L50 indicates that fewer, larger contigs are contributing to the bulk of the assembly, which is typically a sign of a good assembly. (**Contig L50:**)
* the largest and smallest contig lengths.
* the base composition of the assembly, i.e. the proportion of each of the four DNA bases (**Base composition (A:C:G:T):**)
* **GC content %:** This metric is useful to check whether our genome is heavily contaminated (as bacterial genomes have a higher GC content than eukaryotes).
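The GC content reported by gfastats can be cross-checked directly on any FASTA with an awk one-liner; a minimal sketch on a toy sequence (8 G/C bases out of 16):

```shell
# Toy FASTA for illustration; for a real check, point awk at purged.fa instead.
printf '>demo\nACGTACGTGGCCAATT\n' > demo_gc.fa

# Skip header lines, then count G/C bases over total bases.
awk '!/^>/ {
         total += length($0)          # bases in this line
         gc    += gsub(/[GCgc]/, "")  # G/C bases (gsub returns the count)
     }
     END { printf "GC content %%: %.2f\n", 100 * gc / total }' demo_gc.fa
```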
````
cat ~/genome_assembly/A3_QC/C_rostrata/purged/C_rostrata_gfastats.txt
+++Assembly summary+++:
# scaffolds: 106
Total scaffold length: 380458824
Average scaffold length: 3589234.19
Scaffold N50: 11161153
Scaffold auN: 11947747.58
Scaffold L50: 13
Largest scaffold: 20519163
Smallest scaffold: 20145
# contigs: 106
Total contig length: 380458824
Average contig length: 3589234.19
Contig N50: 11161153
Contig auN: 11947747.58
Contig L50: 13
Largest contig: 20519163
Smallest contig: 20145
# gaps in scaffolds: 0
Total gap length in scaffolds: 0
Average gap length in scaffolds: 0.00
Gap N50 in scaffolds: 0
Gap auN in scaffolds: 0.00
Gap L50 in scaffolds: 0
Largest gap in scaffolds: 0
Smallest gap in scaffolds: 0
Base composition (A:C:G:T): 125278479:65002847:64961350:125216148
GC content %: 34.16
# soft-masked bases: 0
# segments: 106
Total segment length: 380458824
Average segment length: 3589234.19
# gaps: 0
# paths: 106
````

##### *C. rostrata* results

| Metric | Purged Assembly | Unpurged Assembly |
|-----------------------|-------------------------|-------------------------|
| **Number of Contigs** | 106 | 1056 |
| **Total Contig Length** | 380,458,824 bp | 427,114,718 bp |
| **Average Contig Length** | 3,589,234.19 bp | 404,464.70 bp |
| **Contig N50** | 11,161,153 bp | 10,645,412 bp |
| **Contig auN** | 11,947,747.58 bp | 10,665,737.45 bp |
| **Contig L50** | 13 | 15 |
| **Largest Contig** | 20,519,163 bp | 20,519,163 bp |
| **Smallest Contig** | 20,145 bp | 14,793 bp |
| **GC content (%)** | 34.16 % | 34.58 % |

The number of contigs before the additional purging was 1056, with an N50 of 10.6 Mb and an L50 of 15, which is a good starting point. After the additional purging with ``purge_dups``, the number of contigs decreased to 106, as did the total contig length, while the N50 increased to 11.2 Mb, which means there are fewer and bigger contigs.
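To demystify the N50, it can be recomputed by hand from any list of contig lengths (one length per line): sort the lengths from largest to smallest and walk down the list until the running sum reaches half of the total; the contig length at that point is the N50. A sketch with made-up lengths:

```shell
# Five made-up contig lengths; total = 100, so half = 50.
printf '40\n30\n15\n10\n5\n' > lengths.txt

# Sort largest-first, then find the length at which the cumulative sum
# first reaches half of the total assembly length.
sort -rn lengths.txt | awk '
    { sum += $1; lens[NR] = $1 }
    END {
        half = sum / 2
        run = 0
        for (i = 1; i <= NR; i++) {
            run += lens[i]
            if (run >= half) { print "N50:", lens[i]; exit }
        }
    }'
```

Here the cumulative sums are 40, 70, ..., so the running sum first reaches 50 at the second contig and the N50 is 30.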
We can interpret that during the purging some contigs were discarded from the primary assembly, especially the smaller ones, which normally contain repetitive regions, leading to an overall better assembly. This purge of smaller repetitive contigs was not reflected in compleasm, as these regions do not tend to contain genes.

#### A3.3. Contig_length_fasta

##### What does it do?

This script simply creates a list with all the contigs and their lengths. This file is mainly used to manually inspect the contig lengths, and to create graphs to monitor the evolution of the contig length distribution. It will be more useful in the next steps.

````
head ~/genome_assembly/A3_QC/C_rostrata/purged/C_rostrata_contig_length.txt
ptg000001l	8716604
ptg000002l	13702659
ptg000003l	8824609
ptg000004l	9347619
ptg000005l	8179128
ptg000006l	5587025
ptg000007l	9570554
ptg000008l	7559943
ptg000009l	15431245
ptg000010l	14602151
````

#### A3.4. Merqury

##### What does it do?

Merqury is a tool for evaluating de novo genome assemblies based on the comparison of k-mers in the assembly to those found in unassembled high-accuracy reads. It can estimate base-level accuracy and completeness. A key feature that sets Merqury apart from other analysis tools is its ability to incorporate alternate assemblies into the evaluation process. This includes assessing the accuracy and completeness of the alternate assembly and determining how much it contributes to the overall genome assembly.

##### Output and general interpretation

Merqury produces multiple files in its output. We will examine the most important ones.

* **.meryl files**: These are databases of k-mers generated by the software. A database is created for the primary and the alternate genome assembly and for the PacBio reads. These k-mer databases are compared to produce the rest of the analysis.
* **.completeness.stats**: The completeness of the assembly is evaluated by calculating the percentage of k-mers obtained from the PacBio reads that are found in the genome assembly. This is calculated for the primary assembly, the alternate assembly and the combination of both. For example, for the primary purged assembly, out of the 316552299 k-mers from the PacBio reads, 253234707 were found in the assembly (79.9977%).

````
cat ~/genome_assembly/A3_QC/C_rostrata/purged/merqury_C_rostrata_purged.completeness.stats
purged	all	253234707	316552299	79.9977
purged_alt	all	81051800	316552299	25.6046
both	all	264858723	316552299	83.6698

cat ~/genome_assembly/A3_QC/C_rostrata/unpurged/merqury_C_rostrata_unpurged.completeness.stats
C_rostrata.hifihic.asm.hic.p_ctg	all	254052832	316552299	80.2562
C_rostrata.hifihic.asm.hic.a_ctg	all	260600899	316552299	82.3248
both	all	313210214	316552299	98.9442
````

* **Consensus quality scores (.qv)**: The table shows measures of the probability of error. This is shown in the last column; consequently, the lower the better. The 4th column shows the consensus quality score, a Phred-scaled score where QV = −10 log10 E for a probability of error E at each base in the assembly. In this case, the higher the better.

````
cat ~/genome_assembly/A3_QC/C_rostrata/purged/merqury_C_rostrata_purged.qv
purged	2574	380456704	64.9191	3.2217e-07
purged_alt	14947	124956463	52.444	5.6964e-06
Both	17521	505413167	57.823	1.65082e-06

cat ~/genome_assembly/A3_QC/C_rostrata/unpurged/merqury_C_rostrata_unpurged.qv
C_rostrata.hifihic.asm.hic.p_ctg	17521	427093598	57.0917	1.95355e-06
C_rostrata.hifihic.asm.hic.a_ctg	28116	561414272	56.2254	2.38485e-06
Both	45637	988507870	56.5787	2.1985e-06
````

* **Graphs based on k-mer coverage and frequency:** Merqury creates multiple graphs based on the distribution of k-mer coverage, where one can see how many k-mers are present at different coverage levels. They are similar to the histogram created with ``jellyfish``.
The k-mer distribution graphs are generated both for each assembly separately and for the two assemblies combined; here we will only analyze the combined ones.
````
scp -r igomez@login.spc.cica.es:/home/igomez/genome_assembly/A3_QC/C_rostrata/purged/merqury_C_rostrata_purged.spectra*.png ./your/folder
scp -r igomez@login.spc.cica.es:/home/igomez/genome_assembly/A3_QC/C_rostrata/unpurged/merqury_C_rostrata_unpurged.spectra*.png ./your/folder
scp -r igomez@login.spc.cica.es:/home/igomez/genome_assembly/A3_QC/C_helodes/purged/merqury_C_helodes_purged.spectra*.png ./your/folder
scp -r igomez@login.spc.cica.es:/home/igomez/genome_assembly/A3_QC/C_helodes/unpurged/merqury_C_helodes_unpurged.spectra*.png ./your/folder
````
There are two different kinds of plots, depending on the additional information they represent on top of the k-mer coverage frequencies:
* **spectra-asm plot**: This plot shows which proportion of the k-mers is found in the primary assembly, in the alternate assembly, or in both.

<div style="display: flex; justify-content: space-between;"> <div style="width: 49%; text-align: center;"> <img src="https://hackmd.io/_uploads/ByHiGiEoR.png" alt="![merqury_C_rostrata_unpurged.spectra-asm.fl]"> <p style="font-size: 0.8em; font-weight: bold;">unpurged</p> </div> <div style="width: 49%; text-align: center;"> <img src="https://hackmd.io/_uploads/rkx-ZiVoC.png" alt="![merqury_C_rostrata_purged.spectra-asm.fl]" width="100%" style="display: block;"> <p style="font-size: 0.8em; font-weight: bold;">purged</p> </div> </div>

* **spectra-cn plot**: This plot represents the estimated ploidy based on the peaks of the k-mer coverage distribution. The coverage cutoffs for the different ploidy levels are found in the ``hist.ploidy`` files.
<div style="display: flex; justify-content: space-between;"> <div style="width: 49%; text-align: center;"> <img src="https://hackmd.io/_uploads/H16e4iVoR.png" alt="![merqury_C_rostrata_unpurged.spectra-cn.fl]"> <p style="font-size: 0.8em; font-weight: bold;">unpurged</p> </div> <div style="width: 49%; text-align: center;"> <img src="https://hackmd.io/_uploads/Sk3JViEjC.png" alt="![merqury_C_rostrata_purged.spectra-cn.fl]" width="100%" style="display: block;"> <p style="font-size: 0.8em; font-weight: bold;">purged</p> </div> </div> For both plots, there are 3 .png files, each with a different histogram aesthetic: *.st* (stacked histogram), *.fl* (filled) and *.ln* (lines). The four histograms above use the filled style. Here is an example of the spectra-cn plots of *C.rostrata* purged assembly in the 3 histogram types: stacked, filled and lines. <div style="display: flex; justify-content: space-between;"> <div style="width: 30%; text-align: center;"> <img src="https://hackmd.io/_uploads/HJsv3tVoR.png" alt="merqury_C_rostrata_purged.spectra-asm.st" width="100%" style="display: block;"> <p style="font-size: 0.6em; font-weight: bold;">stacked (.st)</p> </div> <div style="width: 30%; text-align: center;"> <img src="https://hackmd.io/_uploads/SyD53t4jA.png" alt="merqury_C_rostrata_purged.spectra-asm.fl" width="100%" style="display: block;"> <p style="font-size: 0.6em; font-weight: bold;">filled (.fl)</p> </div> <div style="width: 30%; text-align: center;"> <img src="https://hackmd.io/_uploads/H1np2FEjA.png" alt="merqury_C_rostrata_purged.spectra-asm.ln" width="100%" style="display: block;"> <p style="font-size: 0.6em; font-weight: bold;">lines (.ln)</p> </div> </div> ## B. Scaffolding The next step in our genome assembly is scaffolding. 
This process involves linking shorter DNA sequences (contigs) into longer sequences (scaffolds) using paired-end reads, short DNA sequences that carry information from two reads. When one read of a pair maps to one contig and the other read maps to a different contig, we can infer that these two contigs are close to each other in the genome, even if we do not have the sequence that connects them. This connection between two contigs via a paired-end read is called a 'contact'. Normally, we rely on more than one contact to group contigs into scaffolds, as there can be mapping errors, especially in repetitive regions. The more contacts there are between two contigs, the more likely it is that they are close, and the more confidently we can place them in one scaffold. Beyond forming scaffolds, contacts also let us explore which scaffolds lie closer together.

In our case, the paired-end reads we have used are Hi-C reads (see the figure below). Hi-C adds more information than other paired-end reads, as it captures the three-dimensional structure of the genome, linking contigs based on their spatial proximity within the nucleus. This is achieved by first crosslinking DNA regions that are close in the nucleus, so that when they are digested by restriction enzymes they end up sequenced in the same Hi-C DNA molecule, as they are stuck together. The single-stranded 5' overhangs are then filled in, labeling the digested ends with a biotinylated nucleotide. Next, spatially proximal digested ends of DNA are ligated, preserving both short- and long-range DNA contiguity. In contrast to other paired-end reads, Hi-C data also lets us explore structure between chromosomes in the nucleus, as read pairs can link different chromosomes.

![1869fig1](https://hackmd.io/_uploads/rJJzwMcjR.jpg) <span style="font-size: 0.8em;">**Figure :** Hi-C overview.
Cells are cross-linked with formaldehyde, resulting in covalent links between spatially adjacent chromatin segments (DNA fragments: dark blue, red; proteins, which can mediate such interactions, are shown in light blue and cyan). Chromatin is digested with a restriction enzyme (here, HindIII; restriction site: dashed line). The resulting sticky ends are filled in with nucleotides, one of which is biotinylated (purple dot). Ligation is performed under extremely dilute conditions favoring intramolecular ligation events; the HindIII site is lost and an NheI site is created (inset). DNA is purified and sheared, and biotinylated junctions are isolated using streptavidin beads. Interacting fragments are identified by paired-end sequencing. Obtained from Van Berkum et al., 2010. </span>

## B1. Aligning Hi-C reads to our genome assembly

The first step in scaffolding is mapping the Hi-C reads to our genome assembly. This mapping procedure also includes trimming steps to eliminate adapter sequences and molecular barcodes, as well as some filtering of Hi-C reads to correct for erroneous mapping. The Hi-C technique produces two types of reads, depending on the closeness of the crosslinked regions:
* **Non-chimeric reads:** If the two crosslinked regions are contiguous, the Hi-C read maps to only one region in the genome. These behave as single-end reads.
* **Chimeric reads:** If the two crosslinked regions are not contiguous in the genome, the Hi-C read maps to two parts of our reference genome, the two regions that were crosslinked together. These are the most important ones for scaffolding.

Because chimeric reads exist, we will map each read of the pair separately as a single-end read, keeping the alignment at its 5' end, so both ends of the DNA fragment get mapped. Then, we will pair up the read partners in subsequent steps. We will follow the Arima User Guide mapping pipeline.
To find more information, visit their github page (https://github.com/ArimaGenomics/mapping_pipeline/tree/master).

### Software requirements
* Conda environments:
    * ref_map:
        * bwa 0.7.18 (https://anaconda.org/bioconda/bwa)
        * fastp 0.23.4 (https://anaconda.org/bioconda/fastp)
        * picard 3.2.0 (https://anaconda.org/bioconda/picard)
        * samtools 1.20 (https://anaconda.org/bioconda/samtools)
* Perl scripts from the Arima Mapping Pipeline. They will be stored in SOFTWAREDIR ("/home/igomez/software"):
    * filter_five_end.pl (https://github.com/ArimaGenomics/mapping_pipeline/blob/master/filter_five_end.pl)
    * get_stats.pl (https://github.com/ArimaGenomics/mapping_pipeline/blob/master/get_stats.pl)
    * two_read_bam_combiner.pl (https://github.com/ArimaGenomics/mapping_pipeline/blob/master/two_read_bam_combiner.pl)

This step requires the most disk space, almost 400 GB per sample, as many big intermediate files need to be created. Make sure that you have enough space before running this script. The script also takes multiple hours to finish, almost a whole day, and requires 120 GB of CPU memory or more.

### Run the script
First, we create the slurm script file.
````
nano B1_hiCmap_shallow.sbatch
````
This script will be run in the following folder:
````
mkdir -p ~/genome_assembly/B1_hiC_mapping
cd ~/genome_assembly/B1_hiC_mapping
````
Then, we write the commands within it.
This job can take several hours to run.
````
#!/bin/bash
#SBATCH --job-name=B1_hiC_mapping
#SBATCH --output=B1_hiC_mapping_output
#SBATCH --error=B1_hiC_mapping_error
#SBATCH --export=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=240G
#SBATCH --cpus-per-task=32
#SBATCH --partition=standard
#SBATCH --time=23:30:00

#### Set directories
INDIR_hic=/home/igomez/genome_assembly/raw_data/hiC
OUTDIR=/home/igomez/genome_assembly/B1_hiC_mapping
SOFTWAREDIR=/home/igomez/software
SAMPLE_DIR="/home/igomez/genome_assembly/raw_data/samples.txt"

#### Create a function to parallelize the mapping of Hi-C reads
mapHiCreads() {
SAMPLE=$1
mkdir -p $OUTDIR/${SAMPLE}

## Set conda source and activate environment
source ~/anaconda3/etc/profile.d/conda.sh
conda activate ref_map

## Index the reference, both with samtools and bwa
REF=/home/igomez/genome_assembly/A2_purge_dups/${SAMPLE}/purged.fa
bwa index -a bwtsw $REF
samtools faidx $REF

## Trim 5 bases from the 5' end of read 1 and read 2,
## as they contain adapter sequences
awk '{ if(NR%2==0) {print substr($1,6)} else {print} }' $INDIR_hic/${SAMPLE}/${SAMPLE}_hiC_1.fastq | gzip > $INDIR_hic/${SAMPLE}/${SAMPLE}_R1_5trim.fastq.gz
awk '{ if(NR%2==0) {print substr($1,6)} else {print} }' $INDIR_hic/${SAMPLE}/${SAMPLE}_hiC_2.fastq | gzip > $INDIR_hic/${SAMPLE}/${SAMPLE}_R2_5trim.fastq.gz
echo "5 bases trim done for ${SAMPLE}"

## Trim poly-G tails from the ends of sequencing reads, as they are sequencing artifacts
fastp --trim_poly_g -l 120 -q 30 -w 16 \
--in1 $INDIR_hic/${SAMPLE}/${SAMPLE}_R1_5trim.fastq.gz \
--out1 $INDIR_hic/${SAMPLE}/${SAMPLE}_R1_trimmed.fastq.gz \
--in2 $INDIR_hic/${SAMPLE}/${SAMPLE}_R2_5trim.fastq.gz \
--out2 $INDIR_hic/${SAMPLE}/${SAMPLE}_R2_trimmed.fastq.gz
echo "poly-G tail trim done for ${SAMPLE}"

## Map Hi-C reads to the reference using bwa mem.
## First we map them as single-end reads, once for each of the two read files.
mkdir -p $OUTDIR/${SAMPLE}/rawbamdir
bwa mem -t 16 $REF $INDIR_hic/${SAMPLE}/${SAMPLE}_R1_trimmed.fastq.gz | samtools view -@ 16 -Sb - > $OUTDIR/${SAMPLE}/rawbamdir/${SAMPLE}_1.bam
bwa mem -t 16 $REF $INDIR_hic/${SAMPLE}/${SAMPLE}_R2_trimmed.fastq.gz | samtools view -@ 16 -Sb - > $OUTDIR/${SAMPLE}/rawbamdir/${SAMPLE}_2.bam
echo "single read mapping done for ${SAMPLE}"

## Eliminate the 3' end of chimeric alignments, as it is redundant with the 5' end of the read partner
mkdir -p $OUTDIR/${SAMPLE}/filtbamdir
samtools view -h $OUTDIR/${SAMPLE}/rawbamdir/${SAMPLE}_1.bam | perl $SOFTWAREDIR/filter_five_end.pl | \
samtools view -Sb - > $OUTDIR/${SAMPLE}/filtbamdir/${SAMPLE}_1.bam
samtools view -h $OUTDIR/${SAMPLE}/rawbamdir/${SAMPLE}_2.bam | perl $SOFTWAREDIR/filter_five_end.pl | \
samtools view -Sb - > $OUTDIR/${SAMPLE}/filtbamdir/${SAMPLE}_2.bam
echo "3' end eliminated for ${SAMPLE}"

## Now we pair the two mapped reads with two_read_bam_combiner.pl. Its mapping-quality
## cutoff (the last argument, here 1) is an important parameter. The script outputs a
## sorted, mapping-quality-filtered, paired-end BAM file. We then add read groups to
## this BAM file using Picard Tools.
mkdir -p $OUTDIR/${SAMPLE}/tmpdir
perl $SOFTWAREDIR/two_read_bam_combiner.pl $OUTDIR/${SAMPLE}/filtbamdir/${SAMPLE}_1.bam \
$OUTDIR/${SAMPLE}/filtbamdir/${SAMPLE}_2.bam samtools 1 | \
samtools view -bS -t ${REF}.fai - | samtools sort -@ 16 -o $OUTDIR/${SAMPLE}/tmpdir/${SAMPLE}.bam -

mkdir -p $OUTDIR/${SAMPLE}/pairdir
picard -Xmx100g AddOrReplaceReadGroups INPUT=$OUTDIR/${SAMPLE}/tmpdir/${SAMPLE}.bam OUTPUT=$OUTDIR/${SAMPLE}/pairdir/${SAMPLE}.bam ID=${SAMPLE} \
LB=${SAMPLE} SM=scaffolding PL=ILLUMINA PU=none
echo "read pairing done for ${SAMPLE}"

## Remove PCR duplicates from the paired-end BAM file
mkdir -p $OUTDIR/${SAMPLE}/repdir
picard -Xmx100g MarkDuplicates \
INPUT=$OUTDIR/${SAMPLE}/pairdir/${SAMPLE}.bam OUTPUT=$OUTDIR/${SAMPLE}/repdir/${SAMPLE}_final.bam \
METRICS_FILE=$OUTDIR/${SAMPLE}/repdir/metrics.replicates.txt TMP_DIR=$OUTDIR/${SAMPLE}/tmpdir/ \
ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=TRUE
echo "PCR duplicates removed from ${SAMPLE}"

## Finally, index the final file and calculate statistics to monitor the mapping
samtools index $OUTDIR/${SAMPLE}/repdir/${SAMPLE}_final.bam
perl $SOFTWAREDIR/get_stats.pl $OUTDIR/${SAMPLE}/repdir/${SAMPLE}_final.bam > $OUTDIR/${SAMPLE}/repdir/${SAMPLE}_final.bam.stats

## Calculate statistics not only for the final file but also for the intermediate steps.
## Add them to the flagstat file
touch $OUTDIR/${SAMPLE}/flagstat.txt
echo ">>> Stats raw bam R1" >> $OUTDIR/${SAMPLE}/flagstat.txt
samtools flagstat $OUTDIR/${SAMPLE}/rawbamdir/${SAMPLE}_1.bam >> $OUTDIR/${SAMPLE}/flagstat.txt
echo ">>> Stats raw bam R2" >> $OUTDIR/${SAMPLE}/flagstat.txt
samtools flagstat $OUTDIR/${SAMPLE}/rawbamdir/${SAMPLE}_2.bam >> $OUTDIR/${SAMPLE}/flagstat.txt
echo ">>> Stats bam chimeric reads filtered R1" >> $OUTDIR/${SAMPLE}/flagstat.txt
samtools flagstat $OUTDIR/${SAMPLE}/filtbamdir/${SAMPLE}_1.bam >> $OUTDIR/${SAMPLE}/flagstat.txt
echo ">>> Stats bam chimeric reads filtered R2" >> $OUTDIR/${SAMPLE}/flagstat.txt
samtools flagstat $OUTDIR/${SAMPLE}/filtbamdir/${SAMPLE}_2.bam >> $OUTDIR/${SAMPLE}/flagstat.txt
echo ">>> Stats bam paired and mapping quality filtered" >> $OUTDIR/${SAMPLE}/flagstat.txt
samtools flagstat $OUTDIR/${SAMPLE}/tmpdir/${SAMPLE}.bam >> $OUTDIR/${SAMPLE}/flagstat.txt
echo ">>> Stats final bam (duplicates filtered)" >> $OUTDIR/${SAMPLE}/flagstat.txt
samtools flagstat $OUTDIR/${SAMPLE}/repdir/${SAMPLE}_final.bam >> $OUTDIR/${SAMPLE}/flagstat.txt
echo "index and stats done for ${SAMPLE}"
}

### Parallelize over individuals
export INDIR_hic OUTDIR SOFTWAREDIR SAMPLE_DIR
export -f mapHiCreads
parallel 'mapHiCreads {1}' :::: $SAMPLE_DIR

### SOFTWARE VERSIONS
# fastp 0.23.4
# bwa 0.7.15
# Picard 1.141
# Arima mapping pipeline V03
# Samtools 1.20
````

### Output and interpretation
This script produces many big intermediate files. Below is the output you should obtain for *C. rostrata*; it is useful for troubleshooting, to see at which step the script failed.
````
cd ~/genome_assembly/B1_hiC_mapping/C_rostrata/
du -ah .
24G   ./pairdir/C_rostrata.bam
24G   ./pairdir
24G   ./tmpdir/C_rostrata.bam
24G   ./tmpdir
8.0K  ./flagstat.txt
18G   ./repdir/C_rostrata_final.bam
8.0K  ./repdir/metrics.replicates.txt
1.2M  ./repdir/C_rostrata_final.bam.bai
29G   ./repdir/C_rostrata_final_sorted.bam
4.0K  ./repdir/C_rostrata_final.bam.stats
46G   ./repdir
47G   ./filtbamdir/C_rostrata_2.bam
46G   ./filtbamdir/C_rostrata_1.bam
93G   ./filtbamdir
61G   ./rawbamdir/C_rostrata_2.bam
60G   ./rawbamdir/C_rostrata_1.bam
121G  ./rawbamdir
305G  .
````
Then, we will look at the files that contain the statistics of our mapping (``C_rostrata_final.bam.stats``) and that monitor the mapping process (``flagstat.txt``).

The file ``C_rostrata_final.bam.stats`` contains metrics on the frequency of different types of paired-end interactions, depending on the distance between read pairs and on whether they mapped within the same sequence.
* **All**: Number of read pairs mapped in total.
* **All intra:** Number of read pairs mapped within the same chromosome. If the Hi-C mapping is successful, the majority (around 80-90%) of the contacts should fall within the same chromosome. Note that at this stage the reference sequences are still unscaffolded contigs, so contacts between contigs that belong to the same chromosome are counted as "inter"; the intra fraction rises as the assembly becomes more contiguous.
* **All intra 1, 10, 15, 20 kb:** Number of pairs mapped at least 1, 10, 15 or 20 kb apart, respectively, within the same chromosome. If these are balanced, it is a sign of a good Hi-C mapping, as both long- and short-range interactions are captured.
* **All inter:** Number of read pairs mapped to different chromosomes.
````
cat ~/genome_assembly/B1_hiC_mapping/C_rostrata/repdir/C_rostrata_final.bam.stats
All             148040651
All intra       63999126
All intra 1kb   44548372
All intra 10kb  37390870
All intra 15kb  36188408
All intra 20kb  35348058
All inter       84041525
````
Then, we will look through the ``flagstat.txt`` file. This file monitors the quality of the Hi-C reads, the success of their mapping to the genome, and their pairing with their read partners, at the different steps of the script.
The file contains the number of reads in different categories depending on their mapping and pairing; in each category, it distinguishes between Hi-C reads that pass the quality control test and those that fail it:
* **Total number of reads before mapping** (``in total``)
* **Metrics related to mapping:**
    * **Primary and secondary alignments** (``primary`` and ``secondary``): The best, most reliable alignment of a Hi-C read to the genome is its **primary alignment**. **Secondary alignments** are alternative alignments for a read that also aligns well to other locations in the genome but that are not the "best" alignment; they are common in repetitive regions of the genome. Generally, the higher the proportion of primary alignments the better, as we are more confident of the read mapping.
    * **Supplementary alignments** (``supplementary``): These occur when different parts of the read map to different genome regions. For other types of reads these are not desirable, but they are expected for chimeric Hi-C reads.
    * **Percentage of reads that were mapped** (``mapped``): Includes all reads that were mapped via their primary, secondary or supplementary alignment.
    * **Percentage of reads mapped via their primary alignment** (``primary mapped``).
* **Metrics related to pairing:**
    * **Read pairs in sequencing** (``paired in sequencing``, ``read1`` and ``read2``): Number of reads marked as part of a pair during sequencing. Among these, we distinguish those that start from the 5' end (``read1``) and those that start from the 3' end (``read2``) of the pair. Ideally, ``read1`` and ``read2`` should contain the same number of reads, and they should add up to ``paired in sequencing``.
    * **Mapped read pairs** (``with itself and mate mapped`` and ``singletons``): Percentage of paired reads for which both partners are mapped.
If only one partner of the read pair is mapped, the read pair is classified as a **singleton**.
    * **Properly paired read pairs** (``properly paired``): Percentage of read pairs where not only are both read partners mapped, but their relative orientation (read 1 upstream of read 2) and distance are also coherent.
    * **Interchromosomal read pairs** (``with mate mapped to a different chr`` and ``with mate mapped to a different chr (mapQ>=5)``): Number of read pairs whose partners are aligned to different chromosomes. The second metric counts those pairs only if the mapping quality is high (mapQ>=5).

The metrics above are measured after each of the steps below; for each step, the folder containing the resulting files is given in parentheses. We will not look at pairing metrics until step 3, when the pairing takes place. Before the pairing, each step has two sets of metrics, one per read file: **R1** for the reads that start at the 5' end and **R2** for the reads that start at the 3' end.
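Because ``flagstat.txt`` concatenates one flagstat block per step, the per-step totals can be pulled out with a short awk pass. A sketch, assuming the ``>>>`` headers written by the script above:

```shell
# Print "<step name>: <total reads>" for each ">>> ..." block in flagstat.txt
awk '/^>>> / { step = substr($0, 5) }
     /in total/ { printf "%s: %s reads\n", step, $1 }' flagstat.txt
```

This produces one line per step (e.g. ``Stats raw bam R1: 670034363 reads``), which is essentially how the comparison table below was assembled.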
| Stats | raw bam R1 | raw bam R2 | bam chimeric reads filtered R1 | bam chimeric reads filtered R2 | bam paired and mapping quality filtered | final bam (duplicates filtered) |
|---|---|---|---|---|---|---|
| In total (QC-passed reads + QC-failed reads) | 670034363 | 668863715 | 501226609 | 501226609 | 504186860 | 296081302 |
| Secondary | 0 | 0 | 0 | 0 | 0 | 0 |
| Supplementary | 168807754 | 167637106 | 76422484 | 75774711 | 0 | 0 |
| Duplicates | 0 | 0 | 0 | 0 | 0 | 0 |
| Mapped (%) | 660060481 (98.51%) | 658816371 (98.50%) | 443667898 (88.52%) | 444226454 (88.63%) | 504186860 (100.00%) | 296081302 (100.00%) |
| Paired in sequencing | 0 | 0 | 0 | 0 | 504186860 | 296081302 |
| Read1 | 0 | 0 | 0 | 0 | 252093430 | 148040651 |
| Read2 | 0 | 0 | 0 | 0 | 252093430 | 148040651 |
| Properly paired (%) | 0 | 0 | 0 | 0 | 504186860 (100.00%) | 296081302 (100.00%) |
| With itself and mate mapped | 0 | 0 | 0 | 0 | 504186860 | 296081302 |
| Singletons (%) | 0 | 0 | 0 | 0 | 0 (0.00%) | 0 (0.00%) |
| With mate mapped to a different chr | 0 | 0 | 0 | 0 | 287123470 | 168083050 |
| With mate mapped to a different chr (mapQ>=5) | 0 | 0 | 0 | 0 | 280843995 | 164320590 |

* **1. Hi-C read mapping as single-end reads for both read files (R1 and R2)** (``rawbamdir``): First, we can check the quality of the Hi-C reads: if the number of reads that fail the quality control test is too high, the sequencing was not successful. Then, we evaluate the success of the mapping by looking at the percentage of reads that are mapped. We should also consider what percentage of the mapped reads are mapped to their primary alignment.
A very high number of secondary alignments can indicate problems with the mapping process, for example incorrectly sized paired-end reads, as shorter reads can align to more positions in the genome. However, a high number of secondary alignments can also be common in complex genomes.
* **2. Elimination of 3'-end mappings of chimeric reads for both read files (R1 and R2)** (``filtbamdir``): As we have seen before, a certain number of supplementary alignments is expected in Hi-C due to the nature of the technique. These alignments should decrease significantly after the filtering step. Some may remain, as they are caused by other factors such as structural variants or repetitive regions.
* **3. Pairing of read partners, sorting and filtering by mapping quality** (``tmpdir``): From this step on, we can evaluate the read pairing. First, we look at the percentage of ``properly paired`` reads. Once we have made sure it is high enough, we look at ``with mate mapped to a different chr``; a high number of these is expected and desired in Hi-C, and indicates that scaffolding will make the genome more contiguous. In this step there is also a further filtering of alignments by mapping quality.
* **4.
Filtering of PCR duplicates** (``repdir``): Finally, PCR duplicates (DNA strands that originate from the same DNA fragment due to amplification) are eliminated.
````
cat ~/genome_assembly/B1_hiC_mapping/C_rostrata/flagstat.txt
>>> Stats raw bam R1
670034363 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
168807754 + 0 supplementary
0 + 0 duplicates
660060481 + 0 mapped (98.51% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
>>> Stats raw bam R2
668863715 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
167637106 + 0 supplementary
0 + 0 duplicates
658816371 + 0 mapped (98.50% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
>>> Stats bam chimeric reads filtered R1
501226609 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
76422484 + 0 supplementary
0 + 0 duplicates
443667898 + 0 mapped (88.52% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
>>> Stats bam chimeric reads filtered R2
501226609 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
75774711 + 0 supplementary
0 + 0 duplicates
444226454 + 0 mapped (88.63% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
>>> Stats bam paired and mapping quality filtered
504186860 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
504186860 + 0 mapped (100.00% : N/A)
504186860 + 0 paired in sequencing
252093430 + 0 read1
252093430 + 0 read2
504186860 + 0 properly paired (100.00% : N/A)
504186860 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
287123470 + 0 with mate mapped to a different chr
280843995 + 0 with mate mapped to a different chr (mapQ>=5)
>>> Stats final bam (duplicates filtered)
296081302 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
296081302 + 0 mapped (100.00% : N/A)
296081302 + 0 paired in sequencing
148040651 + 0 read1
148040651 + 0 read2
296081302 + 0 properly paired (100.00% : N/A)
296081302 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
168083050 + 0 with mate mapped to a different chr
164320590 + 0 with mate mapped to a different chr (mapQ>=5)
````
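As a final check, the fraction of read pairs removed as PCR duplicates can be computed from the last two flagstat blocks above (totals before and after MarkDuplicates):

```shell
# Duplicate rate = (reads before dedup - reads after dedup) / reads before dedup
awk 'BEGIN {
  before = 504186860          # total reads in the paired, quality-filtered BAM
  after  = 296081302          # total reads in the final BAM
  printf "PCR duplicate rate: %.2f%%\n", (before - after) / before * 100
}'
# prints: PCR duplicate rate: 41.28%
```

A rate in this range is not unusual for deeply sequenced Hi-C libraries; only the unique pairs retained in the final BAM are informative for scaffolding.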