# 🧬 VCF Files Demo
This demo from the class guides you through viewing variants from VCF files.
---
## Step 0: Setup and Navigation
### 🔍 Let’s check where we are by printing the working directory:
```bash
pwd
```
### 📂 Let’s list the files in our current location:
```bash
ls
```
### ➡️ Let’s move into the directory with the VCF file:
```bash
cd /Downloads/
```
---
## Step 1: Downloading Tools
Let’s determine variation against the reference by scanning through alignments using a variant caller. There are many options, including `samtools mpileup`, `platypus`, and the GATK.
Let’s keep things simple and use **freebayes**. It’s easy to set up and run, and it produces a well-annotated VCF output that’s ready for downstream filtering.
### 🧪 Let’s create a conda environment and install the required tools:
```bash
conda create -n variant_env -c bioconda -c conda-forge \
bcftools samtools freebayes -y
```
> Remember to activate the environment you just created.
---
## Step 2: Viewing variants with `bcftools`
### 📖 Let’s view the help documentation for `bcftools`:
```bash
bcftools --help
```
*This shows all available options for `bcftools`, a versatile toolkit for working with VCF/BCF files.*
---
### 👁️ Let’s view the contents of a VCF file:
Let's use a VCF file from the 1000 Genome Project for exploring.
Download the VCF file from [here](https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2b.20130502.genotypes.vcf.gz)
```bash
bcftools view file.vcf | head
```
*The `head` command displays the first 10 lines by default.*
VCF files contain:
* **Meta-information** (`##`): format version, tools used, reference genome, INFO/FORMAT descriptions.
* The INFO level fields have site-level information, that apply to the whole row and all of the samples contained within that row.
* The FORMAT level fields have details that apply to each individual sample that's encoded.
* **Header line** (`#CHROM`...): defines column names.
🔗 Column definitions: [VCF Wikipedia](https://en.wikipedia.org/wiki/Variant_Call_Format#The_columns_of_a_VCF)
* **Data lines**: one row per variant call.
---
### 📜 Let’s scroll through the full content — especially the meta-information — using `less`:
```bash
bcftools view file.vcf | less
```
---
### 🚫 Let’s skip the header and view only the data rows:
```bash
bcftools view file.vcf | grep -v "^#" | head
```
**Alternative:**
```bash
bcftools view -H file.vcf | head
```
---
### 🧮 Let’s count the number of variant records (i.e., body lines):
```bash
bcftools view -H file.vcf | wc -l
```
*This gives us the total number of variant entries in the VCF.*
---
### ✅ Let’s filter for high-confidence (`PASS`) variants:
```bash
bcftools query -i 'FILTER="PASS"' -f '%CHROM %POS %FILTER\n' file.vcf | head
```
---
### ➕ Let’s extract insertion events (`<INS>`):
```bash
bcftools query -i 'ALT="<INS>"' -f '%CHROM %POS %ALT\n' file.vcf | head
```
---
### 📊 Let’s generate summary statistics:
```bash
bcftools stats file.vcf
```
---
# 🧬 Assignment - Command Line Interface
Let’s explore variant calling.
---
## Step 1: Let’s download the E. Coli K12 strain's reference genome:
```bash
curl -s http://hypervolu.me/%7Eerik/genomes/E.coli_K12_MG1655.fa > E.coli_K12_MG1655.fa
```
### Indexing the FASTA file
To enable tools like the variant caller **FreeBayes** to quickly access specific regions of the reference genome, we first index the FASTA file using `samtools faidx`. This creates a `.fai` index file that records the length and byte-offset of each sequence in the reference:
```bash
samtools faidx E.coli_K12_MG1655.fa
```
With the reference indexed, we can quickly extract specific regions of the genome without reading the entire file. For example, to retrieve 100 bp from position 2,700,000 to 2,700,100 on the *E. coli* K12 genome, we use the following format: `[chromosome]:[start]-[end]`.
```bash
samtools faidx E.coli_K12_MG1655.fa NC_000913.3:2700000-2700100
```
This returns the corresponding sequence in FASTA format:
```plaintext
>NC_000913.3:2700000-2700100
ATTAGCATAATCATTCACTAAGTTAATTTATATAGTATCTGCCCAGACACTTATTTATAGTTATTAAAGG
TGCGTCCGACTGGTTCACCGGACGCACCTTA
```
This functionality is essential for targeted analysis, visualization, or validation of specific genomic regions.
---
## Step 2: Let’s download the alignment of sequencing reads to the reference genome:
We downloaded the E. Coli K12 Illumina 2x300bp MiSeq sequencing results in paired FASTQ format from the University of Exeter archive using the sratoolkit. We downsampled the FASTQs using seqtk tools for faster processing. We aligned our K12 strain reads against the K12 reference genome using BWA. The SAM output was then converted to BAM format, sorted, and processed to mark PCR duplicates, which result from the exact duplication of templates during amplification.
Since you have already practiced alignment in Assignment 5, we want to keep it simple for you and give you the the aligned BAM file directly.
Download the BAM file from this [Google Drive folder](https://drive.google.com/drive/folders/1J7kmUfSBHkMhrLtM4Lgm8-FknbDPFAqe?usp=sharing).
### Indexing the BAM file
```
samtools index SRR1770413.bam
```
---
## Step 3: Let’s call variants with `FreeBayes`:
FreeBayes requires both the reference genome and the sequencing reads aligned to it.
The --ploidy 1 option is used to specify that the sample is haploid and should be genotyped accordingly.
```
freebayes -f E.coli_K12_MG1655.fa --ploidy 1 SRR1770413.bam > SRR1770413.vcf
```
> Keep in mind that even though the index files generated in earlier steps aren’t explicitly specified in the variant calling command, the variant caller automatically searches for them in the current working directory.
---
### 🔎 Assignment Questions – Let’s answer these and attach screenshots of the commands and outputs:
1. How many rows are there in the VCF header?
2. How many rows are there in the body (i.e., variant lines)?
3. What does `bcftools stats` say about the variants in the VCF file?
4. Describe every component of the output of the following command:
```
bcftools view SRR1770413.vcf.gz | sed -n '64,65p'
```
<details>
<summary>HINT</summary>
Remember that the `INFO` and `FORMAT` fields in the VCF metadata provide definitions to help you understand the meaning of each component in the data lines.
</details>
---
# 🧬 Assignment – Galaxy
This tutorial walks you through variant calling and VCF analysis using the [Galaxy Project](https://usegalaxy.org).
---
## 📁 Step 0: Prepare and Upload Data
### 1. E. coli K12 Strain Reference Genome
Upload using Galaxy’s left panel:
* Click **Upload Data**
* Select **Paste/Fetch Data**
* Paste the following URL:
```
http://hypervolu.me/%7Eerik/genomes/E.coli_K12_MG1655.fa
```
Then click **Start** to upload the file.
### 2. Aligned BAM File (Illumina MiSeq 2×300 bp)
We downloaded the E. Coli K12 Illumina 2x300bp MiSeq sequencing results in paired FASTQ format from the University of Exeter archive using the sratoolkit. We downsampled the FASTQs using seqtk tools for faster processing. We aligned our K12 strain reads against the K12 reference genome using BWA. The SAM output was then converted to BAM format, sorted, and processed to mark PCR duplicates, which result from the exact duplication of templates during amplification.
Since you have already practiced alignment in Assignment 5, we want to keep it simple for you and give you the the aligned BAM file directly.
Download and upload the BAM file from this [Google Drive folder](https://drive.google.com/drive/folders/1J7kmUfSBHkMhrLtM4Lgm8-FknbDPFAqe?usp=sharing).
---
## 🚀 Step 1: Access Galaxy
1. Visit [https://usegalaxy.org](https://usegalaxy.org)
2. Log in or create a free account.
---
## 🧪 Step 2: Run FreeBayes for Variant Calling
1. In the **Tools** panel (left side), search for **FreeBayes** and open it.
2. Set up the tool as follows:
* **Choose the source for the reference genome** → `History`
* **Use the following dataset as the reference sequence** → `E.coli_K12_MG1655.fa`
* **BAM or CRAM dataset** → `SRR1770413.bam`
* **Choose parameter selection level** → `5. Full list of options`
* Under **Population model options** → `Set population model options`
* Set **ploidy** to `1` (since bacterial genomes are haploid)
3. Click **Run Tool**.
---
## 🔬 Step 3: View the VCF Output
Once FreeBayes completes:
* Locate the `.vcf` file in your Galaxy history.
* Click the **eye icon** to preview the file contents.
---
## 📊 Step 4: Generate Variant Statistics
1. Search for **`bcftools stats`** in the Tools panel.
2. Input: your `.vcf` file from Step 3.
3. The output will summarize:
* Total variant count
* SNPs vs. indels
* Quality distributions
* Sample-specific statistics (if applicable)
### 🔎 Assignment Questions – Let’s answer these and attach screenshots of the commands and outputs:
1. How many rows are there in the VCF header?
2. How many rows are there in the body (i.e., variant lines)?
3. What does `bcftools stats` say about the variants in the VCF file?
4. Describe every component of the output for the first variant line.
<details>
<summary>HINT</summary>
Remember that the `INFO` and `FORMAT` fields in the VCF metadata provide definitions to help you understand the meaning of each component in the data lines.
</details>
---
:::success
In this assignment, we used a genome reference FASTA file and an aligned BAM file as input for variant calling and analyzed the resulting VCF file to interpret the identified genetic variants.
:::
---
### 📚 Credits
This assignment is based on:
* **Eric Garrison’s variant calling tutorial** for **Evomics 2025**
* The [YouTube walkthrough](https://www.youtube.com/watch?v=VooqrSM_Rqk&t=358s)
---
### Other resources on variant calling:
[Genome Browsing IGV](https://genviz.org/module-01-intro/0001/05/01/GenomeBrowsingIGV/)
[Short read mapping & variants on IGV](https://avantonder.github.io/Ghana_course/materials/05-short_read_mapping/5.1-short_read_mapping.html)
[Structural variants on IGV](https://help.dragen.illumina.com/product-guides/dragen-v4.4/dragen-dna-pipeline/sv-calling/sv-igv-tutorial)
[Long read structural variant calling](https://www.melbournebioinformatics.org.au/tutorials/tutorials/longread_sv_calling/longread_sv_calling/)
[BCFtools cheatsheet](https://gist.github.com/elowy01/93922762e131d7abd3c7e8e166a74a0b)