🧬 VCF Files Demo

# 🧬 VCF Files Demo This demo from the class guides you through viewing variants from VCF files. --- ## Step 0: Setup and Navigation ### 🔍 Let’s check where we are by printing the working directory: ```bash pwd ``` ### 📂 Let’s list the files in our current location: ```bash ls ``` ### ➡️ Let’s move into the directory with the VCF file: ```bash cd /Downloads/ ``` --- ## Step 1: Downloading Tools Let’s determine variation against the reference by scanning through alignments using a variant caller. There are many options, including `samtools mpileup`, `platypus`, and the GATK. Let’s keep things simple and use **freebayes**. It’s easy to set up and run, and it produces a well-annotated VCF output that’s ready for downstream filtering. ### 🧪 Let’s create a conda environment and install the required tools: ```bash conda create -n variant_env -c bioconda -c conda-forge \ bcftools samtools freebayes -y ``` > Remember to activate the environment you just created. --- ## Step 2: Viewing variants with `bcftools` ### 📖 Let’s view the help documentation for `bcftools`: ```bash bcftools --help ``` *This shows all available options for `bcftools`, a versatile toolkit for working with VCF/BCF files.* --- ### 👁️ Let’s view the contents of a VCF file: Let's use a VCF file from the 1000 Genome Project for exploring. Download the VCF file from [here](https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2b.20130502.genotypes.vcf.gz) ```bash bcftools view file.vcf | head ``` *The `head` command displays the first 10 lines by default.* VCF files contain: * **Meta-information** (`##`): format version, tools used, reference genome, INFO/FORMAT descriptions. * The INFO level fields have site-level information, that apply to the whole row and all of the samples contained within that row. * The FORMAT level fields have details that apply to each individual sample that's encoded. * **Header line** (`#CHROM`...): defines column names. 🔗 Column definitions: [VCF Wikipedia](https://en.wikipedia.org/wiki/Variant_Call_Format#The_columns_of_a_VCF) * **Data lines**: one row per variant call. --- ### 📜 Let’s scroll through the full content — especially the meta-information — using `less`: ```bash bcftools view file.vcf | less ``` --- ### 🚫 Let’s skip the header and view only the data rows: ```bash bcftools view file.vcf | grep -v "^#" | head ``` **Alternative:** ```bash bcftools view -H file.vcf | head ``` --- ### 🧮 Let’s count the number of variant records (i.e., body lines): ```bash bcftools view -H file.vcf | wc -l ``` *This gives us the total number of variant entries in the VCF.* --- ### ✅ Let’s filter for high-confidence (`PASS`) variants: ```bash bcftools query -i 'FILTER="PASS"' -f '%CHROM %POS %FILTER\n' file.vcf | head ``` --- ### ➕ Let’s extract insertion events (`<INS>`): ```bash bcftools query -i 'ALT="<INS>"' -f '%CHROM %POS %ALT\n' file.vcf | head ``` --- ### 📊 Let’s generate summary statistics: ```bash bcftools stats file.vcf ``` --- # 🧬 Assignment - Command Line Interface Let’s explore variant calling. --- ## Step 1: Let’s download the E. Coli K12 strain's reference genome: ```bash curl -s http://hypervolu.me/%7Eerik/genomes/E.coli_K12_MG1655.fa > E.coli_K12_MG1655.fa ``` ### Indexing the FASTA file To enable tools like the variant caller **FreeBayes** to quickly access specific regions of the reference genome, we first index the FASTA file using `samtools faidx`. This creates a `.fai` index file that records the length and byte-offset of each sequence in the reference: ```bash samtools faidx E.coli_K12_MG1655.fa ``` With the reference indexed, we can quickly extract specific regions of the genome without reading the entire file. For example, to retrieve 100 bp from position 2,700,000 to 2,700,100 on the *E. coli* K12 genome, we use the following format: `[chromosome]:[start]-[end]`. ```bash samtools faidx E.coli_K12_MG1655.fa NC_000913.3:2700000-2700100 ``` This returns the corresponding sequence in FASTA format: ```plaintext >NC_000913.3:2700000-2700100 ATTAGCATAATCATTCACTAAGTTAATTTATATAGTATCTGCCCAGACACTTATTTATAGTTATTAAAGG TGCGTCCGACTGGTTCACCGGACGCACCTTA ``` This functionality is essential for targeted analysis, visualization, or validation of specific genomic regions. --- ## Step 2: Let’s download the alignment of sequencing reads to the reference genome: We downloaded the E. Coli K12 Illumina 2x300bp MiSeq sequencing results in paired FASTQ format from the University of Exeter archive using the sratoolkit. We downsampled the FASTQs using seqtk tools for faster processing. We aligned our K12 strain reads against the K12 reference genome using BWA. The SAM output was then converted to BAM format, sorted, and processed to mark PCR duplicates, which result from the exact duplication of templates during amplification. Since you have already practiced alignment in Assignment 5, we want to keep it simple for you and give you the the aligned BAM file directly. Download the BAM file from this [Google Drive folder](https://drive.google.com/drive/folders/1J7kmUfSBHkMhrLtM4Lgm8-FknbDPFAqe?usp=sharing). ### Indexing the BAM file ``` samtools index SRR1770413.bam ``` --- ## Step 3: Let’s call variants with `FreeBayes`: FreeBayes requires both the reference genome and the sequencing reads aligned to it. The --ploidy 1 option is used to specify that the sample is haploid and should be genotyped accordingly. ``` freebayes -f E.coli_K12_MG1655.fa --ploidy 1 SRR1770413.bam > SRR1770413.vcf ``` > Keep in mind that even though the index files generated in earlier steps aren’t explicitly specified in the variant calling command, the variant caller automatically searches for them in the current working directory. --- ### 🔎 Assignment Questions – Let’s answer these and attach screenshots of the commands and outputs: 1. How many rows are there in the VCF header? 2. How many rows are there in the body (i.e., variant lines)? 3. What does `bcftools stats` say about the variants in the VCF file? 4. Describe every component of the output of the following command: ``` bcftools view SRR1770413.vcf.gz | sed -n '64,65p' ``` <details> <summary>HINT</summary> Remember that the `INFO` and `FORMAT` fields in the VCF metadata provide definitions to help you understand the meaning of each component in the data lines. </details> --- # 🧬 Assignment – Galaxy This tutorial walks you through variant calling and VCF analysis using the [Galaxy Project](https://usegalaxy.org). --- ## 📁 Step 0: Prepare and Upload Data ### 1. E. coli K12 Strain Reference Genome Upload using Galaxy’s left panel: * Click **Upload Data** * Select **Paste/Fetch Data** * Paste the following URL: ``` http://hypervolu.me/%7Eerik/genomes/E.coli_K12_MG1655.fa ``` Then click **Start** to upload the file. ### 2. Aligned BAM File (Illumina MiSeq 2×300 bp) We downloaded the E. Coli K12 Illumina 2x300bp MiSeq sequencing results in paired FASTQ format from the University of Exeter archive using the sratoolkit. We downsampled the FASTQs using seqtk tools for faster processing. We aligned our K12 strain reads against the K12 reference genome using BWA. The SAM output was then converted to BAM format, sorted, and processed to mark PCR duplicates, which result from the exact duplication of templates during amplification. Since you have already practiced alignment in Assignment 5, we want to keep it simple for you and give you the the aligned BAM file directly. Download and upload the BAM file from this [Google Drive folder](https://drive.google.com/drive/folders/1J7kmUfSBHkMhrLtM4Lgm8-FknbDPFAqe?usp=sharing). --- ## 🚀 Step 1: Access Galaxy 1. Visit [https://usegalaxy.org](https://usegalaxy.org) 2. Log in or create a free account. --- ## 🧪 Step 2: Run FreeBayes for Variant Calling 1. In the **Tools** panel (left side), search for **FreeBayes** and open it. 2. Set up the tool as follows: * **Choose the source for the reference genome** → `History` * **Use the following dataset as the reference sequence** → `E.coli_K12_MG1655.fa` * **BAM or CRAM dataset** → `SRR1770413.bam` * **Choose parameter selection level** → `5. Full list of options` * Under **Population model options** → `Set population model options` * Set **ploidy** to `1` (since bacterial genomes are haploid) 3. Click **Run Tool**. --- ## 🔬 Step 3: View the VCF Output Once FreeBayes completes: * Locate the `.vcf` file in your Galaxy history. * Click the **eye icon** to preview the file contents. --- ## 📊 Step 4: Generate Variant Statistics 1. Search for **`bcftools stats`** in the Tools panel. 2. Input: your `.vcf` file from Step 3. 3. The output will summarize: * Total variant count * SNPs vs. indels * Quality distributions * Sample-specific statistics (if applicable) ### 🔎 Assignment Questions – Let’s answer these and attach screenshots of the commands and outputs: 1. How many rows are there in the VCF header? 2. How many rows are there in the body (i.e., variant lines)? 3. What does `bcftools stats` say about the variants in the VCF file? 4. Describe every component of the output for the first variant line. <details> <summary>HINT</summary> Remember that the `INFO` and `FORMAT` fields in the VCF metadata provide definitions to help you understand the meaning of each component in the data lines. </details> --- :::success In this assignment, we used a genome reference FASTA file and an aligned BAM file as input for variant calling and analyzed the resulting VCF file to interpret the identified genetic variants. ::: --- ### 📚 Credits This assignment is based on: * **Eric Garrison’s variant calling tutorial** for **Evomics 2025** * The [YouTube walkthrough](https://www.youtube.com/watch?v=VooqrSM_Rqk&t=358s) --- ### Other resources on variant calling: [Genome Browsing IGV](https://genviz.org/module-01-intro/0001/05/01/GenomeBrowsingIGV/) [Short read mapping & variants on IGV](https://avantonder.github.io/Ghana_course/materials/05-short_read_mapping/5.1-short_read_mapping.html) [Structural variants on IGV](https://help.dragen.illumina.com/product-guides/dragen-v4.4/dragen-dna-pipeline/sv-calling/sv-igv-tutorial) [Long read structural variant calling](https://www.melbournebioinformatics.org.au/tutorials/tutorials/longread_sv_calling/longread_sv_calling/) [BCFtools cheatsheet](https://gist.github.com/elowy01/93922762e131d7abd3c7e8e166a74a0b)