# Germline Variant Calling WGS Pipeline
Over the past few years, whole genome sequencing (WGS) has rapidly advanced as a diagnostic test for genetic disorders. Because WGS covers both coding and non-coding regions, it can detect variants across a broad spectrum of complex conditions. Running bioinformatic pipelines for variant calling and annotation on WGS data is a precise way to identify variants for clinical diagnostic applications. A pipeline automates the analysis of WGS data, including quality control, alignment, variant calling, and annotation, which saves time and keeps results consistent across analyses.
The aim of this project is to develop a comprehensive, user-friendly, and efficient WGS variant calling and annotation pipeline on the Nextflow workflow management system, combining diverse bioinformatic tools to identify variants precisely for the clinical diagnosis and treatment of a variety of diseases.
---
The WGS variant calling and annotation pipeline is built with the Nextflow workflow management system (https://www.nextflow.io/).
It uses the Genome Analysis Toolkit 4 (GATK4) for data pre-processing, variant calling, and annotation, together with various other VCF and annotation tools, in accordance with Broad Institute best practices; running several tools side by side allows their results to be compared and increases the reliability of the pipeline's output.
# Full List of Tools Used and Required in the Pipeline
| Software | Availability |
| -------- | -------- |
| Nextflow | https://www.nextflow.io/ |
| Singularity | https://sylabs.io/ |
| Fastp | https://github.com/OpenGene/fastp |
| MultiQC | http://multiqc.info/ |
| FastQC | https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ |
| BWA-MEM | http://bio-bwa.sourceforge.net/ |
| GATK4 | https://software.broadinstitute.org/gatk/ |
| Freebayes | https://github.com/ekg/freebayes |
| Samtools | https://github.com/samtools/samtools |
| snpEff | http://snpeff.sourceforge.net/ |
# GitHub
To clone the WGSpipe repository from GitHub, you can follow these steps:
First, ensure you have git installed on your system. If not, you can install it by following the instructions [here](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
Once git is installed, open a terminal and navigate to the directory where you want to clone the repository. Then, use the following command to clone the repository:
```
git clone https://github.com/CompGenomeLab/WGSpipe.git
```
This command creates a new directory named WGSpipe in your current location, containing all the files and history from the WGSpipe repository. After cloning, navigate into the WGSpipe directory and set up the environment needed to execute the workflows and scripts supplied by the repository.
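For example:
```
cd WGSpipe
ls   # list the repository contents described below
```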
The following directories, Nextflow files, and shell scripts will be present after cloning:
```
WGS-pipe-picard.nf # the main .nf file of the pipeline
nextflow.config    # CPU, memory, and container image configuration of the processes
genome.config      # the default parameter configuration
/modules           # directory containing the source code of all processes
/subworkflows      # directory containing the source code of all sub-workflows
data.sh            # shell script that downloads the datasets required to run the pipeline
sif.sh             # shell script that builds the Singularity images
```
# How to Use the WGS Pipeline
To set up the WGS pipeline, first connect to an HPC system, which can be ***sakura.sabanciuniv.edu*** or any other HPC account. The Sakura system already has Singularity installed.
This pipeline is written in Nextflow, a workflow management system. Installing it on ***sakura*** is as easy as:
```
module load nextflow
```
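You can then confirm that Nextflow is available with:
```
nextflow -version
```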
Another way to set up Nextflow is with [conda](https://docs.conda.io/en/latest/). Follow the [instructions](https://nf-co.re/docs/usage/getting_started/installation) below in the given order so that the channel priority is set correctly.
```
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
```
Then create an environment by installing Nextflow from conda.
```
conda create --name env_nf nextflow
conda activate env_nf
```
If you want to deactivate the environment simply run:
```
conda deactivate
```
Singularity is used as the container solution for this WGS pipeline because it avoids the risk of containers not being available on high-performance computing centers.
Conda is used to set up Singularity; you can install it into the environment created when setting up Nextflow (env_nf):
```
conda install conda-forge::singularity
```
The SIF images must be generated in the "Singularity" directory, so first create a directory called "Singularity":
```
mkdir -p Singularity
```
Once the directory has been created, you can build the images needed for the pipeline to work.
A shell script, `sif.sh`, is provided to install the snpEff database and build the images.
Simply make the script executable:
```
chmod +x sif.sh
```
And execute the script by running:
```
./sif.sh
```
---
# Datasets
With the required software installed and the containers built as images, the WGS pipeline is almost ready to run.
The pipeline also relies on a set of default datasets so that the tools run without errors.
A shell script called `data.sh` is provided to download the required datasets.
Simply make the script executable:
```
chmod +x data.sh
```
And execute the script by running:
```
./data.sh
```
# snpEff and ENSEMBL-VEP Database Installation
The snpEff database to install depends on the desired genome version. You must enter the container, start a shell, and download the database. Create the snpeff-data directory in the launching directory.
* A default genome version (GRCh38.105) is provided by the shell script.
* To use a different genome version, change it and download the database manually; automating this inside the pipeline caused many complications.
```
# Install the snpEff database matching the desired genome version.
# Start a shell inside the snpEff container from the launching directory.
singularity shell Singularity/snpeff.sif
# List all databases included in snpEff.
snpEff databases
# Filter by species to find the genome and cache version you need.
snpEff databases | grep -i homo_sapiens
# Create the data directory and download the database into it.
mkdir snpeff-data
snpEff download -dataDir /path/to/snpeff-data GRCh38.105
# /path/to/snpeff-data is an example; give the full path of the directory.
# You can check the full path with pwd.
```
---
`data.sh` also installs the default database for the ENSEMBL-VEP annotation analysis from the ENSEMBL-VEP public directory. You can download databases from the [public directory](https://ftp.ensembl.org/pub/) according to the cache, genome, and species version you want to use.
* The release version corresponds to the cache version.
* The genome version defines the GRCh(37/38) assembly.
>You need to navigate to the directory /pub/release-(CACHE VERSION)/variation/vep.
>If you don't want the default database to be downloaded, you can simply remove it from data.sh.
You MUST download the ENSEMBL-VEP database into the vep-data directory in the launching directory.
```
#make sure you are in the launching directory by checking with pwd command.
mkdir -p vep-data
wget https://...
```
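For example, assuming the default cache version 105 for human GRCh38 (adjust the release, species, and assembly to match your parameters, and extract the archive if the pipeline expects an unpacked cache):
```
mkdir -p vep-data
cd vep-data
wget https://ftp.ensembl.org/pub/release-105/variation/vep/homo_sapiens_vep_105_GRCh38.tar.gz
tar -xzf homo_sapiens_vep_105_GRCh38.tar.gz
cd ..
```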
>If you want to utilize a different database, be sure to adjust params.vep_cache_version, params.vep_species, and params.vep_genome.
Another way to integrate the ENSEMBL-VEP databases is to add the --database option to the annotation tool's command in the source code. However, this takes much longer because of the high demand on the public servers.
>This will retrieve all annotations directly from the Ensembl MySQL databases hosted at ensembldb.ensembl.org
# Setting Up Configuration Files
After Nextflow has been set up, the pipeline can be run. You can modify both configuration files to suit your needs. The required parameters can be found in **genome.config** (see the example configuration after this list):
1. `params.reads`: The directory with the paired-end fastq files. Example: *params.reads = "/path/to/data/*_R{1,2}_*.fastq.gz"*
2. `params.ref`: The directory with the reference genome fasta. The index and dictionary files need to be located in the same directory. If you are using your own reference data, you will need to build the index files with BWA-index, a fasta index with samtools, and a reference dictionary with GATK.
3. `params.outdir`: By default, all directories generated by the processes are stored in this directory, called "results".
4. `params.snpeff_db`: The name of the SnpEff genome version database you wish to use.
5. `params.step`: Use this to start from a specific step of the pipeline. By default, all steps needed to analyze a WGS/WES fastq are included. Write the steps without spaces. Check the default values in genome.config.
6. `params.vc_tools`: Use this to select the variant calling tools. By default, all variant calling tools are included. Check the default values in genome.config.
7. `params.ann_tools`: Use this to select the annotation tools. By default, all annotation tools are included. Check the default values in genome.config.
8. `params.vep_cache_version`: The version of the VEP cache to use. Check the default value in genome.config.
9. `params.vep_species`: The species for which the variants are being annotated. Check the default value in genome.config.
10. `params.vep_genome`: The reference genome assembly. Check the default value in genome.config.
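As referenced above, here is a hedged example of what a filled-in **genome.config** might look like; the paths are placeholders and the step/tool keywords are inferred from the command-line examples in this document, so check the file shipped with the repository for the exact accepted values:
```
params {
    reads             = "/path/to/data/*_R{1,2}_*.fastq.gz"
    ref               = "/path/to/reference/Homo_sapiens_assembly38.fasta"
    outdir            = "results"
    snpeff_db         = "GRCh38.105"
    step              = "preprocessing,variant_calling,annotate"   // placeholder keywords
    vc_tools          = "haplotypecaller,deepvariant,freebayes"    // placeholder keywords
    ann_tools         = "snpeff,vep"                               // placeholder keywords
    vep_cache_version = "105"
    vep_species       = "homo_sapiens"
    vep_genome        = "GRCh38"
}
```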
---
# How are the various steps used?
Two inputs are required if you decide to skip the pipeline's data preprocessing step:
1. `params.bam_file`: The BAM file that is going to be analyzed.
2. `params.bam_file_idx`: The index file of the provided BAM file.
You can also provide these on the command line if you don't want to set them in the configuration file:
```
nextflow run main.nf --step "variant_calling,annotate" --bam_file /path/to/bam_file --bam_file_idx /path/to/bam_idx
```
Two inputs are required if you decide to skip the pipeline's data preprocessing and variant calling steps:
1. `params.vcf`: The VCF file that is going to be annotated.
2. `params.vcf_idx`: The index file of the provided VCF file.
You can also provide these on the command line if you don't want to set them in the configuration file:
```
nextflow run main.nf --step "annotate" --vcf /path/to/vcf --vcf_idx /path/to/vcf_idx
```
> The values of bam_file, bam_file_idx, vcf, and vcf_idx must be **null** if every step is utilized.
> In case the annotate and variant calling steps are included, vcf and vcf_idx must be **null**.
---
# How are various tools or a single tool used?
Two parameters select the tools: vc_tools specifies the tools used in the variant calling step, and ann_tools specifies the tools used in the annotation step.
An example for vc_tools:
```
nextflow run WGS-pipe-picard.nf --step "variant_calling" --vc_tools “deepvariant“ --bam_file "path/to/bam" --bam_file_idx "path/to/bam_idx"
```
An example for ann_tools:
```
nextflow run WGS-pipe-picard.nf --step "annotate" --ann_tools “snpeff“ --vcf "path/to/vcf" --vcf_idx "path/to/vcf_idx"
```
---
# DeepVariant
DeepVariant has a built-in command that requires a manual change depending on the data type (WGS or WES).
To change the regions option in DeepVariant:
```
# The --regions option is removed by default; if needed, add it to the DeepVariant command:
cd modules
nano deepvariant.nf
# --regions <BED file for WES data, or a chromosome/region for WGS data>
```
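For orientation, the sketch below shows how a `run_deepvariant` call with the optional `--regions` flag might look inside the module's script block; the flag names are standard DeepVariant options, but the actual process definition in `modules/deepvariant.nf` will differ:
```
# Hypothetical excerpt -- adapt to the actual script block in modules/deepvariant.nf.
# --model_type is WGS for genome data and WES for exome data.
# --regions takes a BED file for WES data, or a region/chromosome such as chr20 for WGS data.
run_deepvariant \
    --model_type=WGS \
    --ref=reference.fasta \
    --reads=sample.bam \
    --output_vcf=sample.vcf.gz \
    --num_shards=4 \
    --regions=targets.bed
```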
To change the model type used by the DeepVariant built-in command:
```
nextflow run main.nf --wes true
```
Set the boolean to true for WES data and false for WGS data.
---
# Hap.py
For Hap.py you need to input three parameters:
* params.benchmark = "path/to/benchmark.vcf.gz"
* params.bench_idx = "path/to/benchmark.vcf.gz.tbi"
* params.bench_bed = "path/to/benchmark.bed"
These parameters are set in the main Nextflow file. You can change them manually in main.nf or provide the paths on the command line.
Hap.py has a built-in command that requires a manual change depending on the data type (WGS or WES).
To add the -T option for WES data:
```
cd modules
nano happy.nf
# Add the following option to the hap.py command in the script block:
-T /path/to/BED-file-of-WES-data
```
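For context, a typical hap.py comparison looks like the sketch below; the file names are placeholders, and `-T` is the option that restricts the comparison to the WES target regions described above:
```
# Hypothetical example -- paths are placeholders.
hap.py \
    benchmark.vcf.gz \
    query.vcf.gz \
    -f benchmark.bed \
    -r reference.fasta \
    -o happy_output \
    -T /path/to/BED-file-of-WES-data
```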
---
# CPU, Memory, and Container Assignment
To modify container images, CPU and memory assignments, and report and trace files in the nextflow.config file, follow the structure of the process configuration as shown below:
```
process {
    withName: 'process-name' {
        container = 'your-image'
        conda     = 'your-yml-file'
        cpus      = 2
        memory    = '4 GB'
    }
}
```
Change these settings according to your needs specific to the process.
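Report and trace outputs mentioned above are configured in the same file using Nextflow's standard `report` and `trace` scopes; the file names below are arbitrary examples:
```
report {
    enabled = true
    file    = "results/execution_report.html"
}
trace {
    enabled = true
    file    = "results/trace.txt"
}
```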
# How to run in Sakura HPC
When working on a cluster like Sakura, you often need to run long jobs or pipelines that could take a while to complete. To do this efficiently, you can use the screen utility to keep your session active even if your connection drops, and salloc to allocate compute resources on a node.
The screen command allows you to start a new terminal session that can be detached and reattached later.
```
screen -S my_pipeline_session
```
Now, you are in a new terminal session. Any commands you run here will continue even if you disconnect from the server.
To detach from the screen session:
```
# Press Ctrl+A, then D (use the Control key, not Command, on macOS)
```
To reattach to the screen session:
```
screen -r my_pipeline_session
```
After you’ve set up your screen session, you’ll need to allocate a compute node to run your pipeline using the salloc command.
```
salloc -A users -p long_mdbf --qos=long_mdbf $SHELL
```
* -p long_mdbf: Specifies the partition (queue) to which the job is submitted. long_mdbf is likely configured for jobs that need to run for an extended time.
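You can also request specific resources in the same call using standard SLURM options; the values below are only examples:
```
# Example: 8 CPUs, 32 GB of memory, and a 24-hour time limit (adjust to your workload).
salloc -A users -p long_mdbf --qos=long_mdbf -c 8 --mem=32G --time=24:00:00 $SHELL
```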
After allocating resources with salloc, you might want to check which specific compute node has been assigned to your job. You can do this using the scontrol command with the job ID.
**Example:**
```
scontrol show job 2814850 | grep NodeList
```
The output will display something like NodeList=cn02, where cn02 is the name of the node allocated to your job. Then connect to the node using ssh.
**Example:**
```
ssh cn02
```
# Launching Nextflow
**Running main.nf**
To execute your Nextflow pipeline, navigate to the directory containing your main.nf script and run the following command:
```
nextflow run main.nf
```
---
**Resuming Nextflow in Case of Errors**
If your pipeline encounters an error or is interrupted, you can resume it from the last successful step using the -resume flag. This is useful to avoid re-running the entire pipeline from scratch and save cached processes:
```
nextflow run main.nf -resume
```
---
**Providing Parameters through the Command Line**
You can pass parameters to your Nextflow script on the command line using the -- prefix followed by the parameter name and its value. For example, if your script requires parameters like reads, outdir, and ref, you can call the script as follows:
```
nextflow run main.nf \
--reads data/fastq \
--outdir results/output_directory \
--ref data/reference_genome \
...
```
---
Be aware that the following files will be created in your working directory by the pipeline:
```
work             # Directory containing the Nextflow working files
results          # Finished results (configurable via params.outdir)
.nextflow.log    # Log file from Nextflow (hidden file)
                 # Other Nextflow hidden files, e.g. history of pipeline runs and old logs
```
**Work directory**
The task work directories are created in the folder work in the launching path by default. Once the analysis is finished you can remove the work directory.
A different location for the execution work directory can be specified using the command line option -w:
```
nextflow run main.nf -w /path/to/workdir
```
---
# Test the Pipeline
Paired-end fastq files are provided so that you can test the pipeline and check for mistakes made during installation and configuration.
The test fastq files can be found in the test_fastq_files directory.
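A hedged example of a test run is shown below; the glob pattern assumes the test files follow the `*_R{1,2}_*.fastq.gz` naming used by `params.reads`, so adjust it to the actual file names:
```
nextflow run WGS-pipe-picard.nf \
    --reads "test_fastq_files/*_R{1,2}_*.fastq.gz" \
    --outdir results/test_run
```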
---
# Workflow Overview
```mermaid
graph TD;
A[Start] --> B{Check Parameters};
B -->|No ref| C[Exit: Reference genome not specified!];
B -->|No reads| D[Exit: Paired-end reads not specified!];
B -->|Valid| E{Is ref default?};
E -->|No| F[Index & Create Dict];
F --> G[Set params.fasta_index & params.dict];
E -->|Yes| H[Preprocessing];
H --> I[FASTQC];
G --> I;
I --> J[MULTIQC];
J --> K[BWA_INDEX];
K --> L[FASTP];
L --> M[BWA_MEM];
M --> N[SAM_CONVERTER];
N --> O[MARK_DEDUP_PICARD];
O --> P[BASE_RECAP_PICARD];
P --> Q[APPLY_BQSR_PICARD];
Q --> R[Set final_bam_ch & final_bam_idx_ch];
R --> |ONLY PREPROCESSING IS CHOSEN|GG[END]
B -->|variant_calling| S{Use BAM file?};
S -->|No| T[Exit: BAM file not specified!];
S -->|Yes| U[Set final_bam_ch & final_bam_idx_ch from params];
R --> V{Variant Calling Tools};
U --> V{Variant Calling Tools};
V -->|HaplotypeCaller| W[Run HaplotypeCaller];
V -->|DeepVariant| X[Run DeepVariant];
V -->|FreeBayes| Y[Run FreeBayes];
W --> Z[VAR_RECAL];
Z --> AA[APPLY_VQSR];
AA --> AB[VARIANT_FILTER];
AB --> AC[Set vcf_annotation_ch];
X --> AC[Set vcf_annotation_ch];
Y --> AC[Set vcf_annotation_ch];
AC[Set vcf_annotation_ch] --> AG[Create VCF Index];
AC --> AF[HAPPY];
AC[Set vcf_annotation_ch] --> AL{Annotation Tools};
AG --> |IF EXCLUDING ANNOTATION|JJ[END]
AG[Create VCF Index] --> AL{Annotation Tools};
B -->|annotate| AI{Use VCF file?};
AI -->|No| AJ[Exit: VCF file not specified!];
AI -->|Yes| AK[Set vcf_annotation_ch from params];
AK --> AL{Annotation Tools};
AL -->|SnpEff| AM[Run SNPEFF];
AL -->|VEP| AN[Run ENSEMBL_VEP];
AM --> AH[END];
AN --> AH[END];
```
### Explanation of the Diagram
- **Start**: The workflow begins with parameter checks.
- **Parameter Checks**: If the required parameters (`ref`, `reads`, etc.) are not provided, the workflow exits with an error message.
- **Reference Genome Handling**: If the reference genome is not the default, it generates an index and dictionary.
- **Preprocessing Steps**: If preprocessing is included, it runs quality checks, indexing, trimming, alignment, deduplication, and base recalibration.
- **Variant Calling**:
- **final_bam_ch** and **final_bam_idx_ch** enter the variant calling section.
- **HaplotypeCaller** enters `VAR_RECAL`, `APPLY_VQSR`, and `VARIANT_FILTER` before setting `vcf_annotation_ch`.
- **DeepVariant** and **FreeBayes** directly set the `vcf_annotation_ch` without entering the recalibration and filtering steps.
- **Annotation**: If annotation is specified, it checks for VCF files and runs the selected annotation tools (SnpEff, VEP).
- **End**: The workflow concludes after processing.
# Data Sources
The benchmark data sets that have been used in this pipeline can be found in this [website](https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))).
# Questions
If you have any questions, you can reach me by e-mail: baharsevgin@sabanciuniv.edu