# GT-seq Data Preparation

In this workflow, we focus on **Step 2: Preparing raw sequencing data for variant calling**. Each script’s name is prefixed with a two-digit number indicating its order in the workflow. The scripts operate sequentially, transforming data and passing it to the next stage. We will discuss the scripts in order:

* **00\_checksum.sh** – Verify raw data integrity (BCL archive checksum).
* **00\_untar.sh** – Extract raw sequencing output (BCL files).
* **01\_demultiplex.sh** – Convert BCL to FASTQ and assign reads to samples (demultiplexing).
* **02\_trimmomatic.sh** – Trim adapter sequences and low-quality bases from reads.
* **03\_flash.sh** – Merge overlapping paired-end reads into longer fragments.
* **04\_bwa.sh** – Align reads to reference loci and generate alignment files (BAM).
* **05\_extractpos81.sh** – Extract coverage depth at a specific target base position (quality control).
* **06\_submit\_Rsummarize.sh** – Run R scripts to summarize coverage and depth statistics.
* **07\_prep\_for\_GATK.sh** – Prepare alignments for GATK by adding read groups, sorting, and indexing BAMs.
* **08\_GATK\_haplotypecaller.sh** – Call genomic variants per sample using GATK HaplotypeCaller (producing GVCFs).
* **09\_GATK\_import.sh** – Import all sample GVCFs into a GenomicsDB (database for joint genotyping).
* **10\_GATK\_joingeno.sh** – Jointly genotype all samples at once using GenotypeGVCFs (producing final VCFs).

## 00\_checksum.sh – Verifying Raw Data Integrity

**Purpose:** This initial script verifies the integrity of the raw sequencing data file received from the sequencing provider (in this case, Novogene). Typically, raw data comes as a large compressed archive (e.g., a `.tar` file containing BCL data from an Illumina sequencer). The script computes an MD5 checksum of the file to ensure it wasn’t corrupted during transfer.

```bash
#!/bin/bash
#SBATCH -J "bclchecksum"
#SBATCH -o ./logs/log_%j%x
#SBATCH -c 2
#SBATCH -p cpu
#SBATCH --time=24:00:00
#SBATCH --mem=50G

md5sum /nese/meclab/path/to/BCL.tar
```

**Explanation:** The script uses the Linux `md5sum` utility on the BCL tar archive. The lines beginning with `#SBATCH` are SLURM directives scheduling this job on a compute cluster:

* `-J "bclchecksum"` gives the job a name (“bclchecksum”).
* `-o ./logs/log_%j%x` directs output logs to a file (named with the job ID and job name).
* `-c 2` and `-p cpu` request 2 CPU cores on a CPU partition.
* `--time=24:00:00` and `--mem=50G` allocate up to 24 hours and 50 GB RAM (computing a checksum on a large file is mostly I/O bound but can be slow).

`md5sum /nese/meclab/path/to/BCL.tar` computes the MD5 hash of the file. This hash can be compared against the checksum provided by the sequencing center to confirm that the file is intact. A mismatch would indicate a corrupted or incomplete download. This step is crucial before proceeding, since all downstream analyses rely on uncorrupted raw data.
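If the provider delivers a checksum file alongside the archive, the comparison can be automated with `md5sum -c`; a minimal sketch, assuming a hypothetical `MD5.txt` listing the expected hash:

```bash
# MD5.txt is a hypothetical file shipped with the data, containing lines like:
#   <expected_md5>  BCL.tar
cd /nese/meclab/path/to
md5sum -c MD5.txt    # prints "BCL.tar: OK" if the hash matches, "FAILED" otherwise
```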
```bash #!/bin/bash #SBATCH -J "untar" #SBATCH -o ./logs/log_%j%x #SBATCH -c 4 #SBATCH -p cpu #SBATCH --time=12:00:00 #SBATCH --mem=50G BCL=/nese/meclab/path/to/BCL.tar # path to input BCL tarball OUT=/scratch/workspace/path/to/01_fastq # target directory for FASTQ output SAMPLES=/scratch/workspace/path/to/bcl2fastq_samplesheet.csv # sample sheet (CSV) THREAD=20 tar xvf $BCL -C /nese/meclab/path/to/usftp21.novogene.com-BCL_file ``` **Explanation:** * **BCL**: the file path to the `.tar` archive of BCL files. * **OUT**: intended output directory for FASTQ files (named `01_fastq` as part of the pipeline’s directory structure). * **SAMPLES**: path to the sample sheet CSV (more on this in the demultiplexing step). * **THREAD**: number of threads (20) to use for later steps (likely for `bcl2fastq`). The main command is: ```bash tar xvf $BCL -C /path/to/extract_directory ``` This uses the Unix `tar` utility with flags `x` (extract), `v` (verbose output), and `f` (filename). It extracts `$BCL` into the specified directory (`-C` option sets the target directory). The result is an **extracted run folder** containing subfolders like `Data/Intensities/BaseCalls/` with the raw BCL files, `RunInfo.xml`, etc., exactly as generated by the sequencer. After running 00\_untar.sh, we have the raw BCL files ready on disk for conversion to FASTQ. *Why this matters:* Downstream tools (like Illumina’s `bcl2fastq`) cannot read from the tar archive directly – they need the files extracted to the proper run folder structure. This script simply prepares the raw data for the next step. ## 01\_demultiplex.sh – BCL to FASTQ Conversion (Demultiplexing) **Purpose:** This script converts the raw Illumina BCL data into FASTQ files and separates reads by sample, a process known as **demultiplexing**. Illumina’s software `bcl2fastq` is used for this conversion. The BCL files contain base calls and quality scores for every sequencing cycle and for every cluster (sequenced DNA fragment). Demultiplexing means sorting those reads according to their index barcodes (sample indices) so that each sample’s reads end up in its own FASTQ file. ```bash #!/bin/bash #SBATCH -J "01_demultiplexbcl" #SBATCH -o ./logs/log_%j%x #SBATCH -c 10 #SBATCH -p cpu #SBATCH --time=4:00:00 #SBATCH --mem=200G module load conda/latest conda activate bcl2fastq BCL=/nese/meclab/path/to/BCL_folder OUT=/scratch/workspace/path/to/01_fastq SAMPLES=/scratch/workspace/path/to/bcl2fastq_samplesheet_RC.csv THREAD=20 bcl2fastq -R $BCL -o $OUT \ --sample-sheet $SAMPLES \ --barcode-mismatch 0 \ -r 4 -p 10 -w 4 conda deactivate ``` This script runs the Illumina **bcl2fastq** conversion software. In one step, `bcl2fastq` reads the base call files (BCLs) and the **Sample Sheet** to output gzipped FASTQ files for each sample, organized by lane and sample name. ### Sample Sheet Format for Demultiplexing (`bcl2fastq_samplesheet.csv`) The sample sheet is a **CSV file** used during demultiplexing to assign reads to individual samples based on their unique index combinations. It must be correctly formatted or **demultiplexing will fail**, leading to large numbers of **unassigned ("unknown") barcodes**. 
**Required Columns (basic format):**

```
Lane,SampleID,index,index2
```

- `Lane`: The sequencing lane number (specified in the data from Novogene)
- `SampleID`: A unique identifier for each sample
- `index`: The **i7** barcode (forward index), given as its *reverse complement*
- `index2`: The **i5** barcode (reverse index), given as its *reverse complement*

**Example:**

```
Lane,SampleID,index,index2
1,UniqueID,AAGAATT,AAACGG
```

In this case:

- The sample `"UniqueID"` was sequenced on **Lane 1**
- It uses **AAGAATT** as the **i7** index
- It uses **AAACGG** as the **i5** index

> ⚠️ **Important:** i7 (`index`) and i5 (`index2`) **must not be swapped.** Mixing them up will cause improper read assignment and can drastically increase the number of unassigned reads.

Key points:

* `bcl2fastq -R $BCL` sets the run folder (the directory containing `Data/Intensities/BaseCalls`). Here `$BCL` is not a file but the directory path (note: in 00\_untar, the archive was extracted to something like `/path/to/usftp21.novogene.com-BCL_file`, which should be set as `$BCL` here).
* `-o $OUT` sets the output directory for FASTQ files (we use the pipeline’s `01_fastq` directory).
* `--sample-sheet $SAMPLES` provides the sample sheet CSV, which maps indexes to sample IDs. This tells bcl2fastq how to demultiplex.
* `--barcode-mismatches 0` specifies how many mismatches to allow in index barcode matching. Here, 0 mismatches means only perfect index matches will be accepted for assigning reads to samples (strict demultiplexing; reads with any index errors go to “Undetermined”).
* `-r 4 -p 10 -w 4` are resource options for bcl2fastq:
  * `-r 4`: use 4 reader threads (for reading BCLs),
  * `-p 10`: use 10 processing threads (for demultiplexing/writing FASTQ),
  * `-w 4`: use 4 writer threads (to write output FASTQs).

After `bcl2fastq` finishes, the script deactivates the conda environment. The output will be FASTQ files, usually one pair of gzipped FASTQ files per sample (one for Read 1 and one for Read 2 of paired-end data). They often follow a naming convention like:

```
<SampleName>_S<sample-number>_L<lane>_R1_001.fastq.gz
<SampleName>_S<sample-number>_L<lane>_R2_001.fastq.gz
```

where the lane and read (R1/R2) are indicated. The pipeline’s output directory should therefore contain FASTQs named with `_R1_` and `_R2_` for paired reads.

Important steps internally:

1. **Base call reading:** For each cluster and each cycle, bcl2fastq pulls the base and quality from the BCL files.
2. **Index decoding:** It compiles the index reads (e.g., i7 and i5 reads) for the cluster, then looks up which sample that index combination corresponds to (via the sample sheet).
3. **Sorting reads:** It appends the read sequences and qualities to the FASTQ file for the identified sample. (If no match, it appends to an `Undetermined_S0.fastq.gz` file.)
4. **Output:** Writes FASTQ files (usually compressed .gz) to `$OUT`, creating subdirectories per lane if there are multiple lanes.

Because bcl2fastq can also handle adapter trimming (if configured) and quality filtering, it can do some additional processing, but in the command above we did not specify adapter trimming (we’ll do trimming in the next step via Trimmomatic).

**Output structure:** After this, the `01_fastq` directory should contain FASTQ files for each sample. The pipeline can now treat each sample separately (or in parallel) for read cleaning and alignment.
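A quick sanity check after demultiplexing is to compare per-sample read counts against the `Undetermined` file; a minimal sketch, assuming the output layout and naming described above:

```bash
cd /scratch/workspace/path/to/01_fastq

# Read count = lines / 4 for each R1 FASTQ (Undetermined_S0 is included by the glob)
for fq in *_R1_001.fastq.gz; do
    n=$(( $(zcat "$fq" | wc -l) / 4 ))
    printf '%s\t%d\n' "$fq" "$n"
done

# A large Undetermined count relative to the samples usually points to a sample
# sheet problem (e.g., i7/i5 swapped or indexes not reverse-complemented).
```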
## 02\_trimmomatic.sh – Trimming Adapters and Low-Quality Bases

**Purpose:** After obtaining per-sample FASTQ files, the next step is to **clean the reads** by removing any lingering adapter sequences and trimming low-quality bases. This script uses **Trimmomatic**, a flexible read trimming tool for Illumina NGS data, to perform these tasks. In targeted amplicon sequencing like GT-seq, reads may still contain sequencing adapter remnants or primer sequences, especially if reads run past the actual genomic insert. Also, base quality tends to drop near the ends of reads, so trimming those can improve overall data quality for alignment.

```bash
#!/bin/bash
#SBATCH -J "trimmomatic"
#SBATCH -o ./logs/log_%j%x
#SBATCH -c 16
#SBATCH -p cpu
#SBATCH --time=10:00:00
#SBATCH --mem=100G

module load trimmomatic/0.39

for filename in *_R1_*.fastq.gz
do
    base=$(basename $filename .fastq.gz)
    baseR2=${base/_R1_/_R2_}

    java -jar /modules/apps/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 -threads 16 \
        ${base}.fastq.gz ${baseR2}.fastq.gz \
        ../02_trimmed_fastq/${base}.qc.fq.gz s1_se \
        ../02_trimmed_fastq/${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:/nese/meclab/path/to/NexteraPE-PE.fa:2:30:10 \
        LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:75

    gzip -9c s1_se s2_se >> ../02_trimmed_fastq/orphans.qc.fq.gz
    rm -f s1_se s2_se
done
```

* The SLURM header requests 16 cores and 100 GB of memory for up to 10 hours; trimming is CPU-intensive (16 threads) and will process many FASTQ files.
* The script loops over every file matching `*_R1_*.fastq.gz` in the current directory. This pattern targets the FASTQ files for Read 1 of each sample:
  * `base=$(basename $filename .fastq.gz)` strips the directory and `.fastq.gz` suffix, giving a base name (e.g., `SampleA_S1_L001_R1_001.fastq.gz` -> base becomes `SampleA_S1_L001_R1_001`).
  * `baseR2=${base/_R1_/_R2_}` constructs the corresponding Read 2 base name by replacing `_R1_` with `_R2_` (e.g., `SampleA_S1_L001_R2_001`).
* The core Trimmomatic command is the long `java -jar ... trimmomatic-0.39.jar PE -phred33 -threads 16 ...` line, which runs Trimmomatic in **paired-end mode (PE)** with 16 threads, assuming Phred+33 quality scores (standard for Illumina). The arguments are:
  1. Input read pair: `${base}.fastq.gz` (R1 file) and `${baseR2}.fastq.gz` (R2 file).
  2. Output files for paired and unpaired reads after trimming:
     * `../02_trimmed_fastq/${base}.qc.fq.gz` – trimmed Read 1 output (paired).
     * `s1_se` – a temporary file for singleton reads from R1 (if its mate gets dropped).
     * `../02_trimmed_fastq/${baseR2}.qc.fq.gz` – trimmed Read 2 output (paired).
     * `s2_se` – a temporary file for singleton reads from R2.

     Trimmomatic in PE mode requires four output files: two for surviving paired reads and two for any reads that become unpaired (if their mate is removed due to quality). Here the unpaired outputs go to `s1_se` and `s2_se` (temporary files in the current directory), and the paired outputs go to the `02_trimmed_fastq` folder with a `.qc.fq.gz` suffix (“quality controlled”).
  3. Trimming steps (parameters):
     * `ILLUMINACLIP:/nese/meclab/path/to/NexteraPE-PE.fa:2:30:10` – this tells Trimmomatic to perform adapter clipping using the Nextera adapter sequences. The format is `ILLUMINACLIP:<fastaWithAdapters>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>`.
       * The provided FASTA (`NexteraPE-PE.fa`) contains the adapter sequences used in library prep.
       * `2:30:10` means: allow up to 2 mismatches in the initial matching seed, and use a threshold of 30 for palindrome mode (for detecting adapter read-through in paired reads) and 10 for simple mode matches. These thresholds are alignment scores; essentially, 30 is a stricter requirement for paired-read “palindrome” adapter detection, and 10 for normal adapter matches. This helps ensure partial adapter sequences are removed.
     * `LEADING:5` – trim leading bases from each read if they are below quality 5. This removes low-quality bases at the start of reads.
     * `TRAILING:5` – trim trailing bases below quality 5 (i.e., remove low-quality ends).
     * `SLIDINGWINDOW:4:15` – perform a sliding-window quality trim: scan through the read with a window of 4 bases, and cut the read once the average quality in the window falls below 15. This adaptively removes low-quality regions (often the tail of the read, where quality drops).
     * `MINLEN:75` – after all other trimming, discard any read that is now shorter than 75 bp. This ensures very short reads (perhaps largely adapter or too low quality) are dropped, as they may not align reliably.
* After Trimmomatic runs, for each pair the script executes:

  ```bash
  gzip -9c s1_se s2_se >> ../02_trimmed_fastq/orphans.qc.fq.gz
  rm -f s1_se s2_se
  ```

  This takes any unpaired reads (`s1_se` or `s2_se`, produced when one mate was discarded) and appends them (with high compression, `-9`) to a combined file `orphans.qc.fq.gz`. Then it removes the temporary files. This way, any singleton reads are not lost; they are stored in `orphans.qc.fq.gz` (the pipeline might not actively use them, but they’re kept if needed). Often, downstream paired-end pipelines ignore orphan reads; however, saving them can be useful for other analyses or QC.

**Result:** The `02_trimmed_fastq` directory now contains, for each sample, a `*.qc.fq.gz` file for R1 and R2 that are trimmed and filtered. These will be used in the merging step next. The total number of reads may be slightly lower than the input (some reads are dropped for being low quality or too short). Adapter sequences (from primers or Nextera adapters) should be removed, preventing them from causing false alignments later.

**Why this matters:** Adapter contamination can lead to reads aligning incorrectly or not at all. Low-quality tails can produce mismatches or alignment errors. Trimming increases the accuracy of the alignment and variant calling.
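Trimmomatic prints per-pair survival statistics to the SLURM log, but a quick independent check of read counts before and after trimming can be done with `zcat`; a minimal sketch (the sample name is just an illustration):

```bash
# Count reads (lines / 4) in the raw and trimmed R1 files for one hypothetical sample
raw=$(( $(zcat SampleA_S1_L001_R1_001.fastq.gz | wc -l) / 4 ))
trimmed=$(( $(zcat ../02_trimmed_fastq/SampleA_S1_L001_R1_001.qc.fq.gz | wc -l) / 4 ))
echo "SampleA R1: ${raw} raw reads, ${trimmed} surviving as pairs after trimming"
```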
## 03\_flash.sh – Merging Overlapping Paired-end Reads

**Purpose:** GT-seq targets are often short amplicons (e.g., around 150 bp). Illumina paired-end sequencing might produce, say, 2 × 150 bp reads, which overlap over the same amplicon. **FLASH** (Fast Length Adjustment of SHort reads) is a tool that merges paired-end reads when they overlap, reconstructing the original DNA fragment as a single read. Merging improves read length and can reduce redundancy (one fragment becomes one longer read instead of two). This script uses FLASH to merge read pairs that overlap by up to 135 bp.

```bash
#!/bin/bash
#SBATCH -c 8
#SBATCH -p cpu
#SBATCH --time=2:00:00
#SBATCH --mem=80G
#SBATCH -J "flashPEreads"
#SBATCH -o ../logs/log_%j%x

module load conda/latest
conda activate gtseq

FQDIR=/scratch3/workspace/path/to/02_trimmed_fastq

for FILE in $FQDIR/*_L001_R1_001.qc.fq
do
    baseR1=$(basename $FILE .qc.fq)
    baseR2=${baseR1/_R1_/_R2_}
    base=$(basename $FILE _L001_R1_001.qc.fq)

    flash --max-overlap 135 --output-prefix $base $FQDIR/$baseR1.qc.fq $FQDIR/$baseR2.qc.fq
done

conda deactivate
```

**Explanation:**

* The SLURM header requests 8 cores and 80 GB for 2 hours. Merging is not extremely heavy, but this allows a decent amount of memory (to hold many reads in memory while merging).
* The loop `for FILE in $FQDIR/*_L001_R1_001.qc.fq` goes through each trimmed R1 file. (Note that 02\_trimmomatic.sh writes gzipped `.qc.fq.gz` files, so either decompress them first or adjust the pattern to `.qc.fq.gz`; FLASH can read gzipped FASTQs directly.) Inside:
  * `baseR1=$(basename $FILE .qc.fq)` removes the `.qc.fq` extension. Example: `SampleA_S1_L001_R1_001.qc.fq` -> `SampleA_S1_L001_R1_001`.
  * `baseR2=${baseR1/_R1_/_R2_}` forms the corresponding R2 file name (`SampleA_S1_L001_R2_001`).
  * `base=$(basename $FILE _L001_R1_001.qc.fq)` strips the lane/read part too, leaving `SampleA_S1` or similar as a base identifier (this becomes the output prefix).
* The FLASH command:

  ```bash
  flash --max-overlap 135 --output-prefix $base $FQDIR/$baseR1.qc.fq $FQDIR/$baseR2.qc.fq
  ```

  This calls `flash` on the two input files (R1 and R2) with an output prefix equal to the sample’s base name (like `SampleA_S1`). Important parameters:
  * `--max-overlap 135` sets the maximum overlap length expected between R1 and R2 (135 bp here, fitting short amplicons sequenced with ~150 bp reads).
  * `--output-prefix $base` names the output files: `<prefix>.extendedFrags.fastq` (merged reads), `<prefix>.notCombined_1.fastq` and `<prefix>.notCombined_2.fastq` (pairs that could not be merged), plus overlap-length histogram files.

**What merging does:** FLASH takes a pair of reads: for example, Read 1: `ACTG...` (from the start of the amplicon) and Read 2: `...CAGT` (the reverse complement from the end). If these reads overlap on the amplicon, FLASH finds the overlapping sequence and stitches the reads into one longer read covering the entire amplicon. For example, imagine a 250 bp amplicon and 2×150 bp reads with a 50 bp overlap:

```
Amplicon (250 bp):  [------------------------------------------------------]
Read1 (150 bp):     [------------------------------>         ]
Read2 (150 bp):              [<------------------------------]
Overlap (~50 bp):            ^^^^^^^^^^^^^^^^^^^^^
```

After merging, you get one sequence ~250 bp long. If the reads do not overlap (amplicon longer than the combined read lengths minus some threshold), they remain separate in the output (in the notCombined files). In GT-seq, since the targets are known and relatively short, most pairs **will overlap** significantly and thus merge into high-confidence consensus reads.

**Output:** The merged reads (`*.extendedFrags.fastq`) are the primary interest, as they represent complete amplicon sequences per read pair. These will be used for alignment. Unmerged reads (if any) could be aligned separately as well, but in amplicon workflows a high overlap usually means merging success is high.

**Quality consideration:** FLASH also reconciles the base calls and qualities of the two reads in the overlap, so discrepancies in the overlap region can be resolved (a form of error correction), which can increase accuracy.
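FLASH reports the percentage of combined pairs in its log, and the same information can be recovered from the output files; a minimal per-sample sketch, assuming the output naming shown above and run from the directory holding the FLASH output:

```bash
# Merge-rate check: compare merged reads against pairs that failed to combine
for merged in *.extendedFrags.fastq; do
    prefix=${merged%.extendedFrags.fastq}
    m=$(( $(wc -l < "$merged") / 4 ))                        # merged reads
    u=$(( $(wc -l < "${prefix}.notCombined_1.fastq") / 4 ))  # unmerged pairs
    printf '%s\tmerged=%d\tunmerged_pairs=%d\n' "$prefix" "$m" "$u"
done
```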
## 04\_bwa.sh – Aligning Reads to Reference (BWA MEM)

**Purpose:** At this stage, we have high-quality reads (merged or unmerged) for each sample. The next step is to **align** these reads to a reference sequence of the target loci. GT-seq panels have a set of reference amplicon sequences or a reference genome containing those loci. This script uses **BWA** (Burrows-Wheeler Aligner), specifically the `bwa mem` algorithm, to align reads and produce sorted BAM files (alignments) for each sample. It also uses SAMtools for some post-processing (sorting, flagstat, depth).

```bash
#!/bin/bash
#SBATCH -c 10
#SBATCH -p cpu
#SBATCH --time=18:00:00
#SBATCH --mem=100G
#SBATCH -J "02_bwa"
#SBATCH -o ../logs/log_%j%x

module load bwa/0.7.17
module load samtools/1.19.2

REFDIR=/nese/meclab/path/to/reference
REF=targetamps11.21.fa

for file in *.extendedFrags.fastq
do
    base=$(echo "$file" | awk -F "." '{print $1}')

    bwa mem -t 10 -M \
        -R "@RG\tID:${base}\tLB:amplicon\tPL:ILLUMINA\tSM:${base}" \
        $REFDIR/$REF \
        ./$base.extendedFrags.fastq | \
        samtools sort -O bam -o ../03_alignments/$base.qc.flashed.bam -

    samtools flagstat ../03_alignments/$base.qc.flashed.bam > ../03_alignments/alnstats/$base.qc.flashed.flagstat.txt
    samtools depth ../03_alignments/$base.qc.flashed.bam > ../coverage/$base.coverage
    samtools view -f 4 ../03_alignments/$base.qc.flashed.bam > ../unmapped_reads/$base.unmapped.sam
done
```

**Breakdown:**

* SLURM requests 10 cores, 100G, and up to 18 hours. Alignment of many samples can be memory-intensive, especially if the reference indexes are large; here the reference might be just the amplicon sequences (which are small), but many reads across samples can accumulate.
* `REFDIR` and `REF` define the reference FASTA (`targetamps11.21.fa` contains all target amplicon sequences, or the subset of the genome relevant to the targets). The reference must have been indexed once with `bwa index` before alignment.
* The script loops over `*.extendedFrags.fastq` files (each merged read file from FLASH for each sample).
* `base=$(echo "$file" | awk -F "." '{print $1}')` extracts the base name before the first dot. If files are named like `SampleA.extendedFrags.fastq`, this yields `SampleA` as the base (the sample name), which is used to name outputs.
* The `bwa mem` command with options:
  * `-t 10` uses 10 threads for alignment (matching the SLURM cores).
  * `-M` marks shorter split hits as secondary.
  * `-R "@RG\tID:${base}\tLB:amplicon\tPL:ILLUMINA\tSM:${base}"` sets the **read group** for the alignments on the fly. The read group string includes:
    * `ID:${base}` – read group identifier, set to the sample name (unique per sample).
    * `LB:amplicon` – library name, here simply "amplicon" for all (could be more specific if needed).
    * `PL:ILLUMINA` – platform, Illumina.
    * `SM:${base}` – sample name, using the base (sample ID).

    These tags are crucial for downstream GATK steps – they identify which sample each read belongs to in a multi-sample context. BWA will embed this read group in each alignment record.
  * `$REFDIR/$REF` is the reference FASTA file path.
  * `./$base.extendedFrags.fastq` is the input reads. The script uses the merged reads as input. If some reads were not merged, ideally they would also be aligned (the pipeline script as written only loops over extendedFrags, ignoring notCombined reads).
* The output of `bwa mem` (which is SAM format to stdout by default) is piped into `samtools sort`. The command `samtools sort -O bam -o ../03_alignments/$base.qc.flashed.bam -` takes the SAM input and outputs a sorted BAM (`-O bam` sets the output format, `-o` sets the filename). The `-` at the end tells samtools to read from stdin (the pipe from BWA).
* As a result, the alignment for sample `${base}` is saved in `../03_alignments/$base.qc.flashed.bam`. The `.qc.flashed` in the name indicates that this BAM comes from QC’d, flashed (merged) reads.
* After the pipe, three standalone `samtools` commands are executed:
  1. `samtools flagstat ../03_alignments/$base.qc.flashed.bam > ../03_alignments/alnstats/$base.qc.flashed.flagstat.txt` – This generates alignment statistics. `flagstat` reports total reads, how many aligned, how many were properly paired, duplicates, etc., for quick QA. The output is saved in an `alnstats/` subfolder for each sample.
  2. `samtools depth ../03_alignments/$base.qc.flashed.bam > ../coverage/$base.coverage` – This computes the per-base depth (coverage) of the aligned reads in the BAM and writes it to a file.
  3. `samtools view -f 4 ../03_alignments/$base.qc.flashed.bam > ../unmapped_reads/$base.unmapped.sam` – This extracts reads with SAM flag 4 (unmapped reads) and saves them as a SAM file of unmapped reads. This can help diagnose why certain reads didn’t map to the reference (for example, off-target amplification or contaminants).

After the loop, each sample has:

* A sorted BAM of alignments (`*.bam`),
* A flagstat report (`*.flagstat.txt`),
* A coverage file (`*.coverage`),
* An unmapped reads file (`*.unmapped.sam`).

## 05\_extractpos81.sh – Extracting Coverage at Amplicon Center (Position 81)

**Purpose:** This is a small utility script to parse the coverage files from the previous step and extract the coverage depth at a specific position – position 81 of each amplicon. Why position 81? It is the target SNP position, i.e., the center of the 161 bp amplicon (for example, if primers were designed to flank a SNP at the center). This script pulls out the depth at position 81 for each amplicon per sample, to assess whether the target site was sufficiently covered.

```bash
#!/bin/bash

for file in *.coverage
do
    awk '$2 == 81' $file | sort -uk1,1 > $file.pos81
done
```

**Explanation:** This shell script doesn’t use SLURM directives (it can be run interactively or as a quick job). It iterates over all `*.coverage` files (from step 4, which contain lines like `<contig> <position> <depth>` for every covered position). For each coverage file:

* `awk '$2 == 81' $file` filters the lines where the second column (`$2`, the position) equals 81. This extracts all entries at position 81. In a coverage file, the first column is the contig (amplicon ID or reference name), the second is the position, and the third is the depth. If the reference has multiple contigs (amplicons), this gets the depth at position 81 for each contig that has that position.
* `| sort -uk1,1` sorts these lines by the first column (contig name) and keeps only the unique contig entries (`-u` with `-k1,1` means unique on key field 1). This is useful in case the coverage file had multiple entries for position 81 (it shouldn’t, since depth counts each base once per position; this is just a safeguard).
* `> $file.pos81` redirects output to a new file with a `.pos81` extension.

The result: for each sample’s coverage file, we get a corresponding `.coverage.pos81` file that lists each reference contig and the depth at position 81 for that contig, for that sample – essentially a distilled coverage report at the SNP positions of interest (assuming center = SNP position). For example, if a sample has 100 target loci (amplicons), each `.pos81` file might have up to 100 lines, each like:

```
Amplicon1   81   57
Amplicon2   81   42
...
```

meaning 57 reads covered position 81 of Amplicon1, 42 reads at Amplicon2, etc. These files feed into the R script summarization (step 6) and can be used to quickly check whether each locus has coverage. A typical QC is to ensure each sample has a minimum coverage (e.g., ≥10 reads) at each target SNP.
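To pull the per-sample `.pos81` files together ahead of the R summary (and to flag low-coverage targets using the ≥10-read rule of thumb mentioned above), a small aggregation step can be run; a minimal sketch:

```bash
# Combine all per-sample .pos81 files into one long-format table:
#   sample <tab> amplicon <tab> depth_at_pos81
for f in *.coverage.pos81; do
    sample=${f%.coverage.pos81}
    awk -v s="$sample" '{print s "\t" $1 "\t" $3}' "$f"
done > pos81_depth_long.tsv

# Flag sample/locus combinations below 10x at the target SNP
awk '$3 < 10' pos81_depth_long.tsv > low_coverage_targets.txt
```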
## 06\_submit\_Rsummarize.sh – Summarizing Coverage and Depth (R Script Submission)

**Purpose:** This step triggers an R script that reads the coverage data (especially the extracted position 81 depths) and produces summary statistics. The pipeline notes indicate it *“Runs R scripts to summarize coverage and depth stats”* and outputs per-sample summary files. It calculates how many loci are well-covered per sample, depth distributions, etc., which helps decide whether a sample passes QC or whether certain loci failed consistently.

```bash
#!/bin/bash
#SBATCH -J "Rprimeranalysis"
#SBATCH -o log_%x.%j
#SBATCH -c 6
#SBATCH -p cpu
#SBATCH --time=15:00:00
#SBATCH --mem=50G

module load r-rocker-ml-verse/4.4.0+apptainer

Rscript scripts/summarize.depth.at.target.SNPs.R
```

**Explanation & Expectations:** In the absence of the exact R code, we infer:

* It collects the `.pos81` data for all samples and creates a matrix of coverage (samples x loci).

## 07\_prep\_for\_GATK.sh – Read Group Addition, Sorting & Indexing

**Purpose:** This script prepares the alignment files (BAMs) for variant calling with GATK. GATK requires that each BAM file have proper **read group information**, and typically be coordinate-sorted and indexed. If read groups weren’t added earlier or need standardization, this is done via Picard (or GATK’s AddOrReplaceReadGroups). Additionally, the BAMs are re-sorted if needed, and a BAM index (`.bai`) file is created for each.

From the pipeline snippet, the main step is adding read groups using Picard’s `AddOrReplaceReadGroups` tool:

```bash
#!/bin/bash
#SBATCH -J prepgatk
#SBATCH -o ./logs/%x_%A_%a.log
#SBATCH -e ./logs/%x_%A_%a.err
#SBATCH -p cpu
#SBATCH -t 4:00:00
#SBATCH -c 4                     # Number of Cores per Task
#SBATCH --threads-per-core=1
#SBATCH --mem=10000              # Requested Memory
#SBATCH --mail-type=ALL

module load samtools/1.19.2
module load uri/main picard/2.25.1-Java-11

ALIGNDIR=/scratch/workspace/earguetaherr_umass_edu-shared/brazil_gtseq/03_alignments   # This is where your bam files are
OUTDIR=/scratch/workspace/earguetaherr_umass_edu-shared/brazil_gtseq/05_genotyping     # This is the folder your alignments will output to

###### 1. sort and add read group headers to bam files, plus index them
cd $ALIGNDIR

for file in *.qc.flashed.bam
do
    date
    echo $file
    sample=$(basename $file .qc.flashed.bam)
    echo $sample
    java -jar $EBROOTPICARD/picard.jar AddOrReplaceReadGroups I=$file O="$sample"_RG.bam RGID=$sample RGLB=$sample RGPL=illumina RGPU=unit1 RGSM=$sample
done

# sort and index new bam files
for file in *_RG.bam
do
    sample=$(basename $file _RG.bam)
    echo $sample
    samtools sort $file -o "$sample"_RG_sort.bam
    samtools index "$sample"_RG_sort.bam
done

# make new bam file list
ls *_RG_sort.bam > bamlist.txt
```

After adding read groups, the second loop sorts each new `${sample}_RG.bam` with `samtools sort` and indexes the result with `samtools index` (Picard preserves the original record order, so this re-sort is mostly a safeguard). Finally, `ls *_RG_sort.bam > bamlist.txt` writes the list of finished BAMs, which the next step’s array job reads.

**Explanation:**

* This script loops over each BAM from step 4 (the `*.qc.flashed.bam` files).
* For each, it runs the `AddOrReplaceReadGroups` command as shown:
  * `I=$file` is the input BAM,
  * `O=${sample}_RG.bam` is the output BAM with read group tags added,
  * `RGID, RGLB, RGPL, RGPU, RGSM` are the read group fields:
    * RGID = sample (unique ID, often flowcell.lane or sample; here just the sample name),
    * RGLB = sample (library, using the sample name; a real pipeline might use a library prep ID if there are multiple libraries per sample),
    * RGPL = illumina (platform),
    * RGPU = unit1 (platform unit; often flowcell-barcode.lane, but “unit1” is a placeholder, acceptable if unique per RGID in the combined data),
    * RGSM = sample (sample name).

The final outcome: a coordinate-sorted, indexed BAM per sample with proper read groups. This is the input for GATK HaplotypeCaller (next step).
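Before launching HaplotypeCaller, it can be worth confirming that every prepared BAM actually carries an `@RG` header line and has an index; a minimal check, assuming the `_RG_sort.bam` naming used above (run in whatever directory `ALIGNDIR` points to):

```bash
cd /scratch/workspace/earguetaherr_umass_edu-shared/brazil_gtseq/03_alignments

for bam in *_RG_sort.bam; do
    rg=$(samtools view -H "$bam" | grep -c '^@RG')   # number of @RG header lines
    [ -e "${bam}.bai" ] || echo "MISSING INDEX: $bam"
    [ "$rg" -ge 1 ]     || echo "MISSING @RG:   $bam"
done
```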
## 08\_GATK\_haplotypecaller.sh – Calling Variants per Sample (GVCF mode)

**Purpose:** Use GATK’s HaplotypeCaller to identify variants (SNPs/indels) in each sample’s reads. Importantly, this is done in **GVCF mode** to produce an intermediate genomic VCF (gVCF) per sample, which records not only variant sites but also reference (non-variant) sites with coverage, along with genotype likelihoods. This gVCF per sample is necessary for the next step of joint genotyping across samples.

The script runs as a SLURM array job, with one task per prepared BAM listed in `bamlist.txt`:

```bash
#!/bin/bash
#SBATCH -J SNPcall
#SBATCH -o ./logs/%x_T1.log
#SBATCH -e ./logs/%x_T1.err
#SBATCH -p cpu
#SBATCH -t 1:00:00
#SBATCH -c 4                     # Number of Cores per Task
#SBATCH --threads-per-core=1
#SBATCH --mem=10000              # Requested Memory
#SBATCH --array=1-987            # Number of samples
#SBATCH --mail-type=ALL

module load uri/main GATK/4.3.0.0-GCCcore-11.3.0-Java-11

# This is a per individual version of the SNP calling step (not arrayed by chromosome). You've been warned

REFDIR=/scratch/workspace/earguetaherr_umass_edu-shared/scripts/SNPdisco   # You need to download your reference genome to here
REF=targetamps11.21.fa
ALIGNDIR=/scratch/workspace/earguetaherr_umass_edu-shared/brazil_gtseq/03_alignments   # This is where your bam files are
SAMPLE=$(sed -n ${SLURM_ARRAY_TASK_ID}'p' /scratch/workspace/earguetaherr_umass_edu-shared/brazil_gtseq/03_alignments/bamlist.txt)   # redo samples that didn't run correctly
OUTDIR=/scratch/workspace/earguetaherr_umass_edu-shared/brazil_gtseq/05_genotyping   # This is the folder your alignments will output to

gatk --java-options "-Xms4G -Xmx10G -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
    HaplotypeCaller \
    -R ${REFDIR}/${REF} \
    -ERC GVCF \
    -mbq 20 \
    -I ${ALIGNDIR}/${SAMPLE} \
    -O ${OUTDIR}/${SAMPLE%_RG_sort.bam}.g.vcf   # ${SAMPLE%_RG_sort.bam} strips _RG_sort.bam from the sample name
```

**Explanation of key options:**

* `--java-options "-Xms4G -Xmx10G"` gives Java its memory parameters (initial 4G, max 10G heap) – GATK is Java-based, and variant calling can use a lot of memory for assembly.
* `-R ${REFDIR}/${REF}` specifies the reference FASTA (the same reference used for alignment).
* `-ERC GVCF` enables **emit reference confidence** in GVCF mode. This means that instead of outputting only variant sites, HaplotypeCaller outputs a gVCF with blocks of reference calls and variant calls. Essentially, every position in the targets is summarized so that joint genotyping can later consider all positions.
* `-mbq 20` sets a minimum base quality of 20 for bases to be considered in variant calling. Bases below Q20 are ignored (this helps reduce false positives from low-quality tails, etc.).
* `-I ${ALIGNDIR}/${SAMPLE}` is the input BAM file (with read groups, sorted, etc.). `${ALIGNDIR}` and `${SAMPLE}` refer to, for example, `03_alignments/SampleA_RG_sort.bam`.
* `-O ${OUTDIR}/${SAMPLE%_RG_sort.bam}.g.vcf` sets the output filename. The notation `${SAMPLE%_RG_sort.bam}` strips the suffix `_RG_sort.bam` from the input name to create a base (e.g., `SampleA`) and then adds `.g.vcf`, so the output is `SampleA.g.vcf` (or `.g.vcf.gz` if compressed). A GVCF is typically plain VCF text but can be block-gzipped.

**Process:** HaplotypeCaller performs a local re-assembly of reads at active regions, determines variant likelihoods, and outputs a record for each variant site and a reference-block record for each non-variant region in the gVCF. Each sample run is independent here. Given that GT-seq targets are specific loci, HaplotypeCaller could be restricted to intervals (the particular amplicon regions) to speed it up; however, the script as shown doesn’t list `-L` intervals. Since `REF` appears to be a FASTA of just those amplicons, HaplotypeCaller is in practice limited to those contigs anyway (see the sketch at the end of this section for an explicit `-L` variant).

**Output:** A `.g.vcf` file per sample containing that sample’s variant data. This file includes genotype likelihoods for reference and variant calls. It is the basis for joint genotyping.

**Why GVCF mode:** Best practices from the Broad Institute emphasize calling each sample independently in GVCF mode, then performing joint genotyping on all samples together. This workflow improves sensitivity and avoids repeatedly handling each sample’s data for every combination of samples. It also allows adding or reprocessing samples incrementally (the solution to the so-called “N+1” problem).

**Performance:** Each task runs with 10 GB of memory per sample. Because the job is submitted as a SLURM array (`--array=1-987`), each sample’s HaplotypeCaller run is scheduled as its own task, so samples are processed in parallel rather than in a serial loop.

**Line-by-line considerations:** Nothing too complex beyond the parameter choices above. The `-ERC GVCF` flag is crucial, as is providing the correct reference and input. The output naming is just string manipulation to drop the file extension.
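For reference, if one did want to restrict calling to the target regions explicitly, GATK accepts an intervals file via `-L`; a hedged sketch (the `targets.intervals` file name and its contents are hypothetical, reusing the variables from the script above):

```bash
# targets.intervals (hypothetical) lists one region per line, e.g.:
#   Amplicon1:1-161
#   Amplicon2:1-161
gatk --java-options "-Xms4G -Xmx10G" HaplotypeCaller \
    -R ${REFDIR}/${REF} \
    -ERC GVCF \
    -mbq 20 \
    -L targets.intervals \
    -I ${ALIGNDIR}/${SAMPLE} \
    -O ${OUTDIR}/${SAMPLE%_RG_sort.bam}.g.vcf
```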
After this step, we do **not yet have a final VCF** – we have one GVCF *per sample*. Next, we combine them.

## 09\_GATK\_import.sh – Importing GVCFs into GenomicsDB

**Purpose:** When dealing with many samples, GATK offers a tool, `GenomicsDBImport`, to combine individual gVCF files into a database (called GenomicsDB). This is more efficient for joint genotyping than combining gVCFs with the traditional CombineGVCFs. Essentially, it creates a repository of all per-sample data at each genomic position of interest.

```bash
module load uri/main GATK/4.3.0.0-GCCcore-11.3.0-Java-11

SAMPLE_DIR=/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/00_info
OUTDIR=/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/05_genotyping/FlashedReads
SAMPLE_FILE=nofailHI_genomicsDBimport.map
SAMPLE_NAME=${SAMPLE_FILE%_genomicsDBimport.map}
INTERVAL_FILE=/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/00_info/ampnames.txt
INTERVAL=$(sed -n ${SLURM_ARRAY_TASK_ID}'p' /scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/00_info/ampnames.txt)

echo "Using samples from ${SAMPLE_FILE} with the interval listed in the file ${INTERVAL}. Output is written to ${OUTDIR} with the folder name ${SAMPLE_NAME}."

cd ${OUTDIR}

# creating genomicsDB per chromosome
mkdir -p ${SAMPLE_NAME}   # make main overarching directory if it doesn't exist

gatk --java-options "-Xms4g -Xmx4g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport \
    --sample-name-map ${SAMPLE_DIR}/${SAMPLE_FILE} \
    --genomicsdb-workspace-path ${OUTDIR}/${SAMPLE_NAME}/${INTERVAL} \
    -L $INTERVAL \
    --batch-size 50 \
    --reader-threads 10
```

**Explanation:**

* `GenomicsDBImport` takes a list of GVCFs (or a map of sample names to GVCF paths) and imports them.
* `--sample-name-map ${SAMPLE_DIR}/${SAMPLE_FILE}`: GATK requires a map file listing each sample and its GVCF. This is a text file where each line looks like `SampleA\t/path/to/SampleA.g.vcf`. The script uses variables for the location of this file.
* `--genomicsdb-workspace-path ${OUTDIR}/${SAMPLE_NAME}/${INTERVAL}`: This sets the output directory (workspace) for the GenomicsDB. The workspaces are separated by interval: `${SAMPLE_NAME}` is a cohort/batch name (see below) and `${INTERVAL}` is a region name. Each interval (like each amplicon, or a group of them) is imported into a separate database for parallelism; if INTERVAL is a chromosome or contig (like an amplicon ID), this creates a folder for that interval’s database.
* `-L $INTERVAL`: This specifies the genomic interval to import. The import is run per target contig: `$INTERVAL` corresponds to one amplicon (reference contig), so GenomicsDBImport only considers that interval from each GVCF, reducing memory. GT-seq may have hundreds of contigs, so one could either:
  * import everything into one DB (with an interval list covering all contigs), or
  * (as done here) import interval by interval, using `--sample-name-map` for all samples but `-L` for one target. This script is submitted as a SLURM array job, with each task picking one interval from `ampnames.txt` via `${SLURM_ARRAY_TASK_ID}`.
* `--batch-size 50`: This option limits memory usage by importing 50 samples at a time before merging (a tuning parameter). With hundreds of samples, batching avoids using too much memory at once.
* `--reader-threads 10`: Use 10 threads for reading input files (parallel I/O).

**What it does:** It reads each sample’s GVCF at positions in `$INTERVAL` and builds a database that knows, for each position in that interval, each sample’s possible genotype information. It is essentially a combined multi-sample representation, but stored in a tiled database format (GenomicsDB is based on TileDB).

**Output:** A directory (workspace) that stores the combined data for that interval. This is not human-readable, but GATK’s GenotypeGVCFs can query it.

**Preparation:** The user must create the sample map file (`--sample-name-map`) beforehand, listing all `.g.vcf` files with their sample IDs (a sketch for generating it appears at the end of this section). Note that `${SAMPLE_NAME}` is derived from the map file name (here `nofailHI`), so it is a cohort/batch label rather than an individual sample; inside `OUTDIR`, each `${INTERVAL}` gets its own subfolder holding that locus’s database.

**Line-by-line:** The command as shown is basically one long invocation with arguments, called once per interval by the array job. If one wanted to run all intervals in one go, one could provide a list of intervals to GenomicsDBImport, but splitting is easier to run concurrently.

**Why not just CombineGVCFs?** For many samples, GenomicsDBImport is preferred for speed and to avoid handling extremely large files. It is scalable to large cohorts and effectively does the job of combining gVCFs, but in a more database-like way.
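The sample map itself can be generated with a short loop over the per-sample GVCFs; a minimal sketch, assuming the GVCFs are named `<sample>.g.vcf` as in step 08 (the directory path here is illustrative):

```bash
# Build a GenomicsDBImport sample map: "<sample name> <TAB> <path to its gVCF>"
GVCFDIR=/path/to/05_genotyping        # wherever 08_GATK_haplotypecaller.sh wrote the .g.vcf files
MAP=nofailHI_genomicsDBimport.map

> $MAP
for gvcf in ${GVCFDIR}/*.g.vcf; do
    sample=$(basename "$gvcf" .g.vcf)
    printf '%s\t%s\n' "$sample" "$gvcf" >> $MAP
done
```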
## 10\_GATK\_joingeno.sh – Joint Genotyping (GenotypeGVCFs)

**Purpose:** The final step takes the combined gVCF information (in the GenomicsDB or a combined gVCF) and performs joint genotyping to produce a final VCF of variants across all samples. GATK’s **GenotypeGVCFs** tool is used to convert the reference-confidence data into concrete genotype calls for each sample at each site.

```bash
module load uri/main GATK/4.3.0.0-GCCcore-11.3.0-Java-11

REFDIR=/nese/meclab/Jamie/RAD_GTHI/cm_RAD_2/6_SNPdisco   # You need to download your reference genome to here
REF=targetamps11.21.fa
DIR=/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/05_genotyping/FlashedReads
OUTDIR=/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/05_genotyping/FlashedReads
SAMPLE_DB=nofailHI
INTERVAL=$(sed -n ${SLURM_ARRAY_TASK_ID}'p' /scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/00_info/reformat_bedforgatk4.txt)
AMPNAME=$(echo ${INTERVAL} | awk -F ":" '{print $1 ":" $2}')

cd ${OUTDIR}

gatk --java-options "-Xms4G -Xmx4G -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenotypeGVCFs \
    -R ${REFDIR}/${REF} \
    -V gendb://${SAMPLE_DB}/${AMPNAME} \
    -O ${OUTDIR}/${SAMPLE_DB}_${AMPNAME}.vcf.gz
```

* `-R ${REFDIR}/${REF}` specifies the reference FASTA file used for alignment and variant calling.
* `-V gendb://${SAMPLE_DB}/${AMPNAME}` points at the GenomicsDB workspace generated from the imported gVCFs. This URI syntax tells GATK to query the database directly.
* `-O ${OUTDIR}/${SAMPLE_DB}_${AMPNAME}.vcf.gz` defines the output file, naming it after the sample database and the amplicon or interval.

**Function:** GenotypeGVCFs reads the GenomicsDB database for a specific interval (amplicon) and performs joint genotyping across all samples at each position. It emits a finalized multi-sample VCF containing genotype calls and associated metadata (e.g., allele depths, genotype quality). Run as an array across intervals (note the use of `${SLURM_ARRAY_TASK_ID}`), each task calls variants for one amplicon or locus. The output files are gzipped VCFs per interval, containing variants and genotypes for all samples.

**Completing the Pipeline:** At this point, we have gone from raw sequence data to a final set of variant calls, which was the goal of Step 2 (data preparation and variant calling).

### Conclusion and File Organization

Throughout the pipeline, intermediate files are organized by stage (as noted in the pipeline notes: raw > trimmed > aligned > GVCF > VCF). This helps manage the workflow and troubleshoot if needed. To recap:

* **Raw Data**: BCL files (tar archive from the sequencer).
* **FASTQ**: After demultiplexing (the `01_fastq` directory) – the raw reads per sample.
* **Trimmed FASTQ**: After Trimmomatic (`02_trimmed_fastq`) – cleaned reads.
* **Merged reads**: After FLASH – extended reads (still in 02 or a FLASH output directory).
* **Alignments**: BAM files after BWA (`03_alignments`) + coverage and stats.
* **Coverage stats**: the `coverage/` and `unmapped_reads/` directories, plus the extracted pos81 coverage per sample (for QC).
* **Prepared BAMs**: the read-grouped, sorted, indexed BAMs for GATK input (written alongside the originals in `03_alignments` in this pipeline).
* **GVCFs**: per-sample gVCF files (written to the genotyping directory, `05_genotyping`).
* **GenomicsDB**: a directory structure of per-interval workspaces containing the combined data.
* **VCFs**: the final joint-called VCFs (one per interval, alongside the GenomicsDB workspaces).
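Because step 10 emits one VCF per interval, a common convenience step (not one of the numbered scripts) is to combine them into a single multi-locus VCF; a hedged sketch using bcftools, assuming it is available as a module and following the naming from the script above:

```bash
module load bcftools   # module name is an assumption; any recent bcftools works

cd /scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/05_genotyping/FlashedReads

# Index each per-interval VCF, then concatenate them into one file
for vcf in nofailHI_*.vcf.gz; do
    bcftools index -t "$vcf"
done
ls nofailHI_*.vcf.gz > vcf_list.txt
bcftools concat -f vcf_list.txt -O z -o nofailHI_all_loci.vcf.gz
bcftools index -t nofailHI_all_loci.vcf.gz
```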
By following each script, one ensures that data flows correctly: integrity-checked -> extracted -> demultiplexed -> cleaned -> merged -> aligned -> checked -> labeled -> variant-called -> combined -> genotyped. Each script addresses a specific need in the pipeline:

* Integrity & extraction (00 scripts)
* Conversion to analysis-ready reads (01-03)
* Alignment and QC (04-06)
* Final variant calling steps (07-10)

This design makes it easier to rerun parts if needed (e.g., if the adapter trimming parameters change, just rerun 02 onward). The final output VCF(s) can now be used for downstream analysis such as identifying genotypes, running kinship or population structure analyses, etc.