Inferring genotypes from RNA sequencing data

# Inferring genotypes from RNA sequencing data Downloaded files from: https://www.ebi.ac.uk/ena/browser/text-search?query=1000%20genomes https://www.ebi.ac.uk/ena/browser/view/PRJEB3365 This RNA sequencing data set of 465 human lymphoblastoid cell line samples from the CEU, FIN, GBR, TSI and YRI populations from the 1000 Genomes sample collection was created by the Geuvadis consortium (www.geuvadis.org, http://www.geuvadis.org/web/geuvadis/our-rnaseq-project). I consider only 21 samples (YRI and CEU pop), 11 YRI and 10 CEU | Name | Sample| Pop| | -------- | -------- | ------| |ERR187488.fastq.gz|NA07000|CEU| |ERR187489.fastq.gz|NA19197|YRI| |ERR187491.fastq.gz|NA18908|YRI| |ERR187492.fastq.gz|NA12383 | CEU| |ERR187493.fastq.gz|NA19117| YRI| |ERR187495.fastq.gz|NA12717| CEU| |ERR187497.fastq.gz|NA12812| CEU| |ERR187501.fastq.gz|NA12890|CEU| |ERR187507.fastq.gz|NA11893|CEU| |ERR187510.fastq.gz|NA19099|YRI| |ERR187514.fastq.gz|NA12763|CEU| |ERR187517.fastq.gz|NA07357|CEU| |ERR187518.fastq.gz|NA12762|CEU| |ERR187522.fastq.gz|NA11920| CEU| |ERR187523.fastq.gz|NA19095|YRI| |ERR187526.fastq.gz|NA19107| YRI| |ERR187527.fastq.gz|NA18508|YRI| |ERR187528.fastq.gz|NA18907|YRI| |ERR187530.fastq.gz|NA18511|YRI| |ERR187535.fastq.gz|NA19200|YRI| |ERR187536.fastq.gz|NA19175|YRI| ## 1. Run Bowtie (https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) Bowtie is a tool used for aligning sequencing reads to a reference genome. In the context of RNA sequencing (RNA-seq), it is used to align reads generated from RNA fragments to a reference genome and it can also handle reads that contain splice junctions. script `bowtie.sh` ``` #!/bin/bash # List of sample names samples=(ERR187488 ERR187489 ERR187491 ERR187492 ERR187493 ERR187495 ERR187497 ERR187501 ERR187507 ERR187510 ERR187514 ERR187517 ERR187518 ERR187522 ERR187523 ERR187526 ERR187527 ERR187528 ERR187530 ERR187535 ERR187536) # Run the command for each sample for sample in "${samples[@]}" do bowtie --no-unal -p 20 -x /lizardfs/flaviav/rna-seq/index_name "/lizardfs/flaviav/rna-seq/${sample}.fastq.gz" -S "/lizardfs/flaviav/rna-seq/output_${sample}.sam" done ``` `--no-unal`: Suppress SAM records for reads that failed to align. `-p`: set thred ## 2. Change the mapping quality column Usually the mapping quality column has these values: 0 = maps to 5 or more locations 1 = maps to 3-4 locations 3 = maps to 2 locations 255 = unique mapping Bowtie doesn't calculate Maq-like mapping quality values, so (per the SAM spec) it prints 255 there if the read aligns or 0 otherwise. It was impossible for GATK do the variant calling so I change the mapping quality column to 60. As an alternative I need to check if it is possible menage this with bowtie. https://www.seqanswers.com/forum/bioinformatics/bioinformatics-aa/36196-bowtie-mapped-reads-all-unique#post240828, this is a problem of all. ``` for file in *.sam do sed -i 's/\t255\t/\t60\t/g' "$file" done ``` ## 2. Sort/index SAM files and convert to BAM files ``` for samfile in *.sam; do samtools sort -@ 4 -O BAM -o ${samfile/.sam/.bam} $samfile samtools index ${samfile/.sam/.bam} done ``` ## 4. Add reads groups script `fix.sh` The aim of the command is to add read group information to each input BAM file, which is a necessary step before calling variants in RNA-seq data. Read groups provide information about the sequencing library preparation and sequencing platform used to generate the data, as well as any other relevant details such as sample ID. `for file in /lizardfs/flaviav/rna-seq/merged/*.bam; do java -jar /lizardfs/flaviav/picard.jar AddOrReplaceReadGroups -I "$file" -O "${file%.*}_withRG.bam" -SORT_ORDER coordinate -RGID foo -RGLB bar -RGPL illumina -RGSM Sample1 -CREATE_INDEX True -RGPU machine; done` The read groups are specified using the following parameters: RGID: read group ID (arbitrary value, set to "foo" in this case) RGLB: read group library (arbitrary value, set to "bar" in this case) RGPL: platform/technology used to produce the reads (set to "illumina" in this case) RGSM: sample name (set to "Sample1" in this case) RGPU: platform unit (arbitrary value, set to "machine" in this case) ## 5. Variant calling with GATK script `vc.sh` - Loop through all the input files: this step involves iterating over all the input BAM files containing aligned reads of RNA-seq samples. - Get the sample name from the file name: the sample name is extracted from the file name, usually by removing the extension (".bam"). - Add read groups to the BAM file: read groups are added to the BAM file to uniquely identify the RNA-seq sample, as well as any other relevant information such as the sequencing platform. - Index the BAM file with read groups: the BAM file is indexed using the read groups. - Call variants with HaplotypeCaller and output GVCF file: The HaplotypeCaller tool from GATK is used to call variants on each input BAM file. The output is in the form of a GVCF (Genomic Variant Call Format) file, which contains information on all variants at a given position in the genome. N.B. I use these two parameters to facilitate the variant calling: ``` --dont-use-soft-clipped-bases --stand-call-conf ``` In RNA-seq data, reads often contain soft-clipped bases, which are bases at the end of a read that do not perfectly match the reference genome and have been clipped off by the alignment software. These soft-clipped bases can provide valuable information for identifying transcript isoforms and detecting fusion genes, but they can also introduce errors in variant calling. The `--dont-use-soft-clipped-bases `option tells the variant caller to ignore soft-clipped bases when making variant calls. This can be useful in RNA-seq data, where soft-clipping is common and variant calling can be challenging. The` --stand-call-conf` option in GATK's HaplotypeCaller is used to set the minimum level of confidence required to make a variant call. In RNA-seq data, it is important to balance sensitivity (the ability to detect true variants) and specificity (the ability to avoid false positives) when calling variants. I setted the `--stand-call-conf 20` that tells the variant caller to make calls with a Phred-scaled confidence score of at least 20 (corresponding to a 1 in 100 chance of error), which may be more appropriate for RNA-seq data. - Combine GVCF files into a GenomicsDB database: the GVCF files from all RNA-seq samples are combined into a GenomicsDB database. - Jointly genotype GVCF files from the GenomicsDB database: the GenomicsDB database is used to jointly genotype the GVCF files, generating a VCF file containing all variants called across all RNA-seq samples. ## 6. Statistics BAM FILES I check the percentage of mapping and unmapping reads. ``` for file in *.sam; do sample=$(basename "$file" .sam) total_seqs=$(samtools stats "$file" | grep ^SN | awk -F '\t' '{if ($1 == "raw total sequences:") print $2}') echo "Sample: ${sample}" > "${sample}_stats.txt" samtools stats "$file" | grep ^SN | cut -f 2- >> "${sample}_stats.txt" done ``` Following by R plot. ![](https://i.imgur.com/3dFCgI0.png) ## 6b.1000Genomes DNA-sequencing https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=varRep&hgta_track=tgpPhase3&hgta_table=tgpPhase3&hgta_doSchema=describe+table+schema http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ Extact only the samples that are interesting and reheader. `bcftools view -S samplelist ALL.chr20.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz > 21samples.vcf` 993,880 markers ## 7. Statistics VCF (chr20) bcftools stats on DNA-RNA files. | Variants | | | -------- | ------| | DNA | 1,817,492 | | RNA | 661 | | Variants types | DNA | RNA | | -------- | -------- | -------- | | SNPs | 1,706,442 |657 | INDELs | 111,050 |5 ``` bcftools isec -p overlap dna.vcf.gz rnavcf.gz ``` OR ``` # find overlapping positions between the two files bcftools isec -n~11 -c all dnaseq.vcf.gz rnaseq.vcf.gz > sharedPositions.txt # filter vcf files for the overlapping positions bcftools view -T sharedPositions.txt dnaseq.vcf.gz-Oz > dna.sharedPositions.vcf.gz bcftools view -T sharedPositions.txt rnaseq.vcf.gz -Oz > rna.sharedPositions.vcf.gz ``` shared variants: 84,0.04 % ## 8. Pop Gen Analyses (let's check if all works well) - ### 8a. PCA (we have only few samples, only two populations and only 611 SNPs markers!!) I removed multiallelics and retain only SNPs variants. ``` bcftools view multi_sample.vcf.gz grch38#chr20 >chr20.vcf bcftools view --max-alleles 2 --exclude-types indels chr20.vcf > chr20.onlysnps.vcf bcftools annotate --set-id '%CHROM\_%POS\_%REF\_%ALT' chr20.onlysnps.vcf -o chr20.pca.vcf sed -i.bak 's/^grch38#//g' chr20.pca.vcf ./plink2 --vcf chr20.pca.vcf --freq ./plink2 --vcf chr20.pca.vcf --make-bed ./plink2 --bfile plink2 --read-freq plink2.afreq --pca N.B. PCA with only two components is empty, is due by few variation. ``` Following R code. ![](https://i.imgur.com/RkWv7ly.png) ![](https://i.imgur.com/KFK7o5P.png) - ### 8b. ADMIXTURE ANALYSES https://owensgl.github.io/biol525D/Topic_8-9/ as guide. ONLY CHROMOSOME 20. - Lets first filter the VCF to a smaller set. I am going to use bcftools for processing and filtering VCF files. Here are what we want to filter for: At least 9/10 samples genotyped (which is 18 alleles since they’re diploid). Only one alternate allele. No indels. At least 2 copies of the alternate allele ``` bcftools view multi_sample.vcf.gz grch38#chr20 > chr20.vcf bcftools view \ -c 2 \ -i 'INFO/AN >= 18' \ -m2 \ -M2 \ -v snps \ chr20.vcf\ -O z > chr20.filtered.vcf.gz ``` - I have to convert our VCF to the bed format. I am going to use plink and fix chromosome name (plink accepts 20 not chr20) ``` tabix -p vcf chr20.filtered.vcf.gz zcat chr20.filtered.vcf.gz |\ sed s/^grch38#chr//g |\ gzip > chr20.filtered.numericChr.vcf.gz ./plink2 --make-bed \ --vcf chr20.filtered.numericChr.vcf.gz \ --out chr20.filtered.numericChr \ --set-missing-var-ids @:# \ --double-id \ --allow-extra-chr ``` This produces files with the suffix .nosex, .log, .fam, .bim, .bed. We can use these in Admixture. ``` echo "K CV_error" > chr20_log_CVerror.txt for K in {1..13} do echo "running K = "$K admixture_linux-1.3.0/dist/admixture_linux-1.3.0/admixture32 --cv chr20.filtered.numericChr.bed $K | tee chr20_log_K${K}.out #print cv error estimations, lowest ones are best estimates y=$(grep -h CV chr20_log_K"$K".out | awk '{print $4}') echo $K $y >> chr20_log_CVerror.txt done ``` **The best K value for Admixture is typically the K value with the lowest cross-validation (CV) error**. The CV error are in the .out files we saved. One easy way to look at all those scores is to print all the .out files and then keep only the lines that include “CV” using grep. ``` cat *out | grep CV ``` CV error (K=1): 1.16630 CV error (K=2): 1.43266 CV error (K=3): 1.43576 CV error (K=4): 1.50122 CV error (K=5): 0.73594 CV error (K=6): 0.77360 CV error (K=7): 0.92013 CV error (K=8): 0.63637 CV error (K=9): 0.6995 CV error (K=10): 0.31699 CV error (K=11): 0.48489 CV error (K=12): 0.24646 CV error (K=13): 0.26958 This shows that the lowest CV error is with K=13. Let's check k=13, all seems ok, but there is something strange with the last sample. ![](https://i.imgur.com/D9tO28k.png) ##### CODE R PLOTS 1. BAM STATISTICS ``` # List all the files in the current directory files <- list.files(pattern = "\\.txt$") # Initialize an empty data frame to store the results df <- data.frame(sample = character(), mapped = numeric(), unmapped = numeric(), stringsAsFactors = FALSE) # Loop over the files and extract the relevant information for (file in files) { # Read the lines from the file lines <- readLines(file) # Extract the sample name sample <- gsub("Sample: ", "", lines[1]) # Extract the mapped and unmapped counts mapped <- as.numeric(gsub("reads mapped: *", "", lines[grep("reads mapped:", lines)])) unmapped <- as.numeric(gsub("reads unmapped: *", "", lines[grep("reads unmapped:", lines)])) # Add the results to the data frame df <- rbind(df, data.frame(sample = sample, mapped = mapped, unmapped = unmapped)) } # Aggregate the results by sample name df_agg <- aggregate(cbind(mapped, unmapped) ~ sample, df, sum) # Aggregate the results by sample name df_agg <- aggregate(cbind(mapped, unmapped) ~ sample, df, sum) # Load the required packages library(tidyr) library(ggplot2) # Convert the data from wide to long format df_long <- gather(df_agg, key = "category", value = "count", mapped, unmapped) # Plot the data using ggplot2 ggplot(df_long, aes(x = sample, y = count, fill = category)) + geom_bar(stat = "identity", position = "dodge") + labs(x = "Sample", y = "Read count", fill = NULL) + theme_minimal() ``` 2. PCA code ``` library(RColorBrewer) library(ggplot2) #I include the first 20 PCs: myd<-read.csv("plink2.eigenvec", sep = "\t", header =T) pop<-read.table("samples.map.txt", sep = "\t", header = T) my_pca <- prcomp(myd[, 3:ncol(myd)], scale. = TRUE, center = TRUE, n = 20) my_pca_df <- as.data.frame(my_pca$x) my_pca_df <- cbind(pop, my_pca_df) ggplot(my_pca_df, aes(x = PC1, y = PC2, color = factor(POP))) + geom_point() + geom_text(aes(label = INDIVIDUALS), hjust = 0, vjust = 0, size = 1) + scale_color_manual(values = c("red", "blue", "green")) + labs(title = "PCA YRI-CEU", x = "PC1", y = "PC2") + theme_bw() ``` 3. ADMIXTURE PLOT ``` samplelist <- read.table("samples.txt", sep="\t", col.names = c("sample", "pop")) all_data <- tibble(sample = character(), k = numeric(), Q = character(), pop = character(), value = numeric()) for (k in 1:13){ data <- read.delim(paste0("chr20.filtered.numericChr.",k,".Q"), col.names = paste0("Q",seq(1:k)), sep = " ") data[nrow(data) + 1, ] <- 1 data$sample[nrow(data)] <- samplelist$sample[k] data$sample <- samplelist$sample data$pop <- samplelist$pop # This step converts from wide to long. data %>% gather(Q, value, -sample, -pop, paste0("Q", seq(1:k))) -> data data$k <- k all_data <- rbind(all_data,data) } # Sort by k and then by pop all_data = all_data %>% arrange(k, pop) # Define the desired order of labels on the x-axis label_order <- c("ERR187488", "ERR187492", "ERR187495", "ERR187497", "ERR187501", "ERR187507", "ERR187514", "ERR187517", "ERR187518", "ERR187522", "ERR187489", "ERR187491", "ERR187493", "ERR187510", "ERR187523", "ERR187526", "ERR187527", "ERR187528", "ERR187530", "ERR187535") # Convert sample column to factor with the desired order of levels all_data$sample <- factor(all_data$sample, levels = label_order) ggplot(all_data, aes(x = sample, y = value, fill = factor(Q))) + geom_bar(stat = "identity") + labs(title = "RNA-seq", x = "Sample", y = "Value") + theme_bw() + scale_fill_discrete(name = "Q") + facet_wrap(~k, ncol=1)+ theme(axis.text.x = element_text(angle = 60, hjust = 1)) +ylim(0,1) + theme_bw() + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + scale_fill_manual(values = cols, name = "K", labels = seq(1:13)) + guides(color = guide_legend(nrow = 2, byrow = TRUE)) ```