# Inferring genotypes from RNA sequencing data
Downloaded files from:
https://www.ebi.ac.uk/ena/browser/text-search?query=1000%20genomes
https://www.ebi.ac.uk/ena/browser/view/PRJEB3365
This RNA sequencing data set of 465 human lymphoblastoid cell line samples from the CEU, FIN, GBR, TSI and YRI populations from the 1000 Genomes sample collection was created by the Geuvadis consortium (www.geuvadis.org, http://www.geuvadis.org/web/geuvadis/our-rnaseq-project).
I consider only 21 samples (YRI and CEU pop), 11 YRI and 10 CEU
| Name | Sample| Pop|
| -------- | -------- | ------|
|ERR187488.fastq.gz|NA07000|CEU|
|ERR187489.fastq.gz|NA19197|YRI|
|ERR187491.fastq.gz|NA18908|YRI|
|ERR187492.fastq.gz|NA12383 | CEU|
|ERR187493.fastq.gz|NA19117| YRI|
|ERR187495.fastq.gz|NA12717| CEU|
|ERR187497.fastq.gz|NA12812| CEU|
|ERR187501.fastq.gz|NA12890|CEU|
|ERR187507.fastq.gz|NA11893|CEU|
|ERR187510.fastq.gz|NA19099|YRI|
|ERR187514.fastq.gz|NA12763|CEU|
|ERR187517.fastq.gz|NA07357|CEU|
|ERR187518.fastq.gz|NA12762|CEU|
|ERR187522.fastq.gz|NA11920| CEU|
|ERR187523.fastq.gz|NA19095|YRI|
|ERR187526.fastq.gz|NA19107| YRI|
|ERR187527.fastq.gz|NA18508|YRI|
|ERR187528.fastq.gz|NA18907|YRI|
|ERR187530.fastq.gz|NA18511|YRI|
|ERR187535.fastq.gz|NA19200|YRI|
|ERR187536.fastq.gz|NA19175|YRI|
## 1. Run Bowtie
(https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)
Bowtie is a tool used for aligning sequencing reads to a reference genome. In the context of RNA sequencing (RNA-seq), it is used to align reads generated from RNA fragments to a reference genome and it can also handle reads that contain splice junctions.
script `bowtie.sh`
```
#!/bin/bash
# List of sample names
samples=(ERR187488 ERR187489 ERR187491 ERR187492 ERR187493 ERR187495 ERR187497 ERR187501 ERR187507 ERR187510 ERR187514 ERR187517 ERR187518 ERR187522 ERR187523 ERR187526 ERR187527 ERR187528 ERR187530 ERR187535 ERR187536)
# Run the command for each sample
for sample in "${samples[@]}"
do
bowtie --no-unal -p 20 -x /lizardfs/flaviav/rna-seq/index_name "/lizardfs/flaviav/rna-seq/${sample}.fastq.gz" -S "/lizardfs/flaviav/rna-seq/output_${sample}.sam"
done
```
`--no-unal`: Suppress SAM records for reads that failed to align.
`-p`: set thred
## 2. Change the mapping quality column
Usually the mapping quality column has these values:
0 = maps to 5 or more locations
1 = maps to 3-4 locations
3 = maps to 2 locations
255 = unique mapping
Bowtie doesn't calculate Maq-like mapping quality values, so (per the SAM spec) it prints 255 there if the read aligns or 0 otherwise. It was impossible for GATK do the variant calling so I change the mapping quality column to 60.
As an alternative I need to check if it is possible menage this with bowtie.
https://www.seqanswers.com/forum/bioinformatics/bioinformatics-aa/36196-bowtie-mapped-reads-all-unique#post240828, this is a problem of all.
```
for file in *.sam
do
sed -i 's/\t255\t/\t60\t/g' "$file"
done
```
## 2. Sort/index SAM files and convert to BAM files
```
for samfile in *.sam; do
samtools sort -@ 4 -O BAM -o ${samfile/.sam/.bam} $samfile
samtools index ${samfile/.sam/.bam}
done
```
## 4. Add reads groups
script `fix.sh`
The aim of the command is to add read group information to each input BAM file, which is a necessary step before calling variants in RNA-seq data. Read groups provide information about the sequencing library preparation and sequencing platform used to generate the data, as well as any other relevant details such as sample ID.
`for file in /lizardfs/flaviav/rna-seq/merged/*.bam; do java -jar /lizardfs/flaviav/picard.jar AddOrReplaceReadGroups -I "$file" -O "${file%.*}_withRG.bam" -SORT_ORDER coordinate -RGID foo -RGLB bar -RGPL illumina -RGSM Sample1 -CREATE_INDEX True -RGPU machine; done`
The read groups are specified using the following parameters:
RGID: read group ID (arbitrary value, set to "foo" in this case)
RGLB: read group library (arbitrary value, set to "bar" in this case)
RGPL: platform/technology used to produce the reads (set to "illumina" in this case)
RGSM: sample name (set to "Sample1" in this case)
RGPU: platform unit (arbitrary value, set to "machine" in this case)
## 5. Variant calling with GATK
script `vc.sh`
- Loop through all the input files: this step involves iterating over all the input BAM files containing aligned reads of RNA-seq samples.
- Get the sample name from the file name: the sample name is extracted from the file name, usually by removing the extension (".bam").
- Add read groups to the BAM file: read groups are added to the BAM file to uniquely identify the RNA-seq sample, as well as any other relevant information such as the sequencing platform.
- Index the BAM file with read groups: the BAM file is indexed using the read groups.
- Call variants with HaplotypeCaller and output GVCF file: The HaplotypeCaller tool from GATK is used to call variants on each input BAM file. The output is in the form of a GVCF (Genomic Variant Call Format) file, which contains information on all variants at a given position in the genome.
N.B. I use these two parameters to facilitate the variant calling:
```
--dont-use-soft-clipped-bases
--stand-call-conf
```
In RNA-seq data, reads often contain soft-clipped bases, which are bases at the end of a read that do not perfectly match the reference genome and have been clipped off by the alignment software. These soft-clipped bases can provide valuable information for identifying transcript isoforms and detecting fusion genes, but they can also introduce errors in variant calling.
The `--dont-use-soft-clipped-bases `option tells the variant caller to ignore soft-clipped bases when making variant calls. This can be useful in RNA-seq data, where soft-clipping is common and variant calling can be challenging.
The` --stand-call-conf` option in GATK's HaplotypeCaller is used to set the minimum level of confidence required to make a variant call. In RNA-seq data, it is important to balance sensitivity (the ability to detect true variants) and specificity (the ability to avoid false positives) when calling variants. I setted the `--stand-call-conf 20` that tells the variant caller to make calls with a Phred-scaled confidence score of at least 20 (corresponding to a 1 in 100 chance of error), which may be more appropriate for RNA-seq data.
- Combine GVCF files into a GenomicsDB database: the GVCF files from all RNA-seq samples are combined into a GenomicsDB database.
- Jointly genotype GVCF files from the GenomicsDB database: the GenomicsDB database is used to jointly genotype the GVCF files, generating a VCF file containing all variants called across all RNA-seq samples.
## 6. Statistics BAM FILES
I check the percentage of mapping and unmapping reads.
```
for file in *.sam; do
sample=$(basename "$file" .sam)
total_seqs=$(samtools stats "$file" | grep ^SN | awk -F '\t' '{if ($1 == "raw total sequences:") print $2}')
echo "Sample: ${sample}" > "${sample}_stats.txt"
samtools stats "$file" | grep ^SN | cut -f 2- >> "${sample}_stats.txt"
done
```
Following by R plot.

## 6b.1000Genomes DNA-sequencing
https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=varRep&hgta_track=tgpPhase3&hgta_table=tgpPhase3&hgta_doSchema=describe+table+schema
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/
Extact only the samples that are interesting and reheader.
`bcftools view -S samplelist ALL.chr20.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz > 21samples.vcf`
993,880 markers
## 7. Statistics VCF (chr20)
bcftools stats on DNA-RNA files.
| Variants | |
| -------- | ------|
| DNA | 1,817,492 |
| RNA | 661 |
| Variants types | DNA | RNA |
| -------- | -------- | -------- |
| SNPs | 1,706,442 |657
| INDELs | 111,050 |5
```
bcftools isec -p overlap dna.vcf.gz rnavcf.gz
```
OR
```
# find overlapping positions between the two files
bcftools isec -n~11 -c all dnaseq.vcf.gz rnaseq.vcf.gz > sharedPositions.txt
# filter vcf files for the overlapping positions
bcftools view -T sharedPositions.txt dnaseq.vcf.gz-Oz > dna.sharedPositions.vcf.gz
bcftools view -T sharedPositions.txt rnaseq.vcf.gz -Oz > rna.sharedPositions.vcf.gz
```
shared variants: 84,0.04 %
## 8. Pop Gen Analyses (let's check if all works well)
- ### 8a. PCA (we have only few samples, only two populations and only 611 SNPs markers!!)
I removed multiallelics and retain only SNPs variants.
```
bcftools view multi_sample.vcf.gz grch38#chr20 >chr20.vcf
bcftools view --max-alleles 2 --exclude-types indels chr20.vcf > chr20.onlysnps.vcf
bcftools annotate --set-id '%CHROM\_%POS\_%REF\_%ALT' chr20.onlysnps.vcf -o chr20.pca.vcf
sed -i.bak 's/^grch38#//g' chr20.pca.vcf
./plink2 --vcf chr20.pca.vcf --freq
./plink2 --vcf chr20.pca.vcf --make-bed
./plink2 --bfile plink2 --read-freq plink2.afreq --pca
N.B. PCA with only two components is empty, is due by few variation.
```
Following R code.


- ### 8b. ADMIXTURE ANALYSES
https://owensgl.github.io/biol525D/Topic_8-9/ as guide.
ONLY CHROMOSOME 20.
- Lets first filter the VCF to a smaller set. I am going to use bcftools for processing and filtering VCF files. Here are what we want to filter for:
At least 9/10 samples genotyped (which is 18 alleles since they’re diploid).
Only one alternate allele.
No indels.
At least 2 copies of the alternate allele
```
bcftools view multi_sample.vcf.gz grch38#chr20 > chr20.vcf
bcftools view \
-c 2 \
-i 'INFO/AN >= 18' \
-m2 \
-M2 \
-v snps \
chr20.vcf\
-O z > chr20.filtered.vcf.gz
```
- I have to convert our VCF to the bed format. I am going to use plink and fix chromosome name (plink accepts 20 not chr20)
```
tabix -p vcf chr20.filtered.vcf.gz
zcat chr20.filtered.vcf.gz |\
sed s/^grch38#chr//g |\
gzip > chr20.filtered.numericChr.vcf.gz
./plink2 --make-bed \
--vcf chr20.filtered.numericChr.vcf.gz \
--out chr20.filtered.numericChr \
--set-missing-var-ids @:# \
--double-id \
--allow-extra-chr
```
This produces files with the suffix .nosex, .log, .fam, .bim, .bed. We can use these in Admixture.
```
echo "K CV_error" > chr20_log_CVerror.txt
for K in {1..13}
do
echo "running K = "$K
admixture_linux-1.3.0/dist/admixture_linux-1.3.0/admixture32 --cv chr20.filtered.numericChr.bed $K | tee chr20_log_K${K}.out
#print cv error estimations, lowest ones are best estimates
y=$(grep -h CV chr20_log_K"$K".out | awk '{print $4}')
echo $K $y >> chr20_log_CVerror.txt
done
```
**The best K value for Admixture is typically the K value with the lowest cross-validation (CV) error**. The CV error are in the .out files we saved. One easy way to look at all those scores is to print all the .out files and then keep only the lines that include “CV” using grep.
```
cat *out | grep CV
```
CV error (K=1): 1.16630
CV error (K=2): 1.43266
CV error (K=3): 1.43576
CV error (K=4): 1.50122
CV error (K=5): 0.73594
CV error (K=6): 0.77360
CV error (K=7): 0.92013
CV error (K=8): 0.63637
CV error (K=9): 0.6995
CV error (K=10): 0.31699
CV error (K=11): 0.48489
CV error (K=12): 0.24646
CV error (K=13): 0.26958
This shows that the lowest CV error is with K=13.
Let's check k=13, all seems ok, but there is something strange with the last sample.

##### CODE R PLOTS
1. BAM STATISTICS
```
# List all the files in the current directory
files <- list.files(pattern = "\\.txt$")
# Initialize an empty data frame to store the results
df <- data.frame(sample = character(),
mapped = numeric(),
unmapped = numeric(),
stringsAsFactors = FALSE)
# Loop over the files and extract the relevant information
for (file in files) {
# Read the lines from the file
lines <- readLines(file)
# Extract the sample name
sample <- gsub("Sample: ", "", lines[1])
# Extract the mapped and unmapped counts
mapped <- as.numeric(gsub("reads mapped: *", "", lines[grep("reads mapped:", lines)]))
unmapped <- as.numeric(gsub("reads unmapped: *", "", lines[grep("reads unmapped:", lines)]))
# Add the results to the data frame
df <- rbind(df, data.frame(sample = sample, mapped = mapped, unmapped = unmapped))
}
# Aggregate the results by sample name
df_agg <- aggregate(cbind(mapped, unmapped) ~ sample, df, sum)
# Aggregate the results by sample name
df_agg <- aggregate(cbind(mapped, unmapped) ~ sample, df, sum)
# Load the required packages
library(tidyr)
library(ggplot2)
# Convert the data from wide to long format
df_long <- gather(df_agg, key = "category", value = "count", mapped, unmapped)
# Plot the data using ggplot2
ggplot(df_long, aes(x = sample, y = count, fill = category)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Sample", y = "Read count", fill = NULL) +
theme_minimal()
```
2. PCA code
```
library(RColorBrewer)
library(ggplot2)
#I include the first 20 PCs:
myd<-read.csv("plink2.eigenvec", sep = "\t", header =T)
pop<-read.table("samples.map.txt", sep = "\t", header = T)
my_pca <- prcomp(myd[, 3:ncol(myd)], scale. = TRUE, center = TRUE, n = 20)
my_pca_df <- as.data.frame(my_pca$x)
my_pca_df <- cbind(pop, my_pca_df)
ggplot(my_pca_df, aes(x = PC1, y = PC2, color = factor(POP))) +
geom_point() +
geom_text(aes(label = INDIVIDUALS), hjust = 0, vjust = 0, size = 1) +
scale_color_manual(values = c("red", "blue", "green")) +
labs(title = "PCA YRI-CEU", x = "PC1", y = "PC2") +
theme_bw()
```
3. ADMIXTURE PLOT
```
samplelist <- read.table("samples.txt", sep="\t", col.names = c("sample", "pop"))
all_data <- tibble(sample = character(), k = numeric(), Q = character(), pop = character(), value = numeric())
for (k in 1:13){
data <- read.delim(paste0("chr20.filtered.numericChr.",k,".Q"),
col.names = paste0("Q",seq(1:k)),
sep = " ")
data[nrow(data) + 1, ] <- 1
data$sample[nrow(data)] <- samplelist$sample[k]
data$sample <- samplelist$sample
data$pop <- samplelist$pop
# This step converts from wide to long.
data %>% gather(Q, value, -sample, -pop, paste0("Q", seq(1:k))) -> data
data$k <- k
all_data <- rbind(all_data,data)
}
# Sort by k and then by pop
all_data = all_data %>% arrange(k, pop)
# Define the desired order of labels on the x-axis
label_order <- c("ERR187488", "ERR187492", "ERR187495", "ERR187497", "ERR187501",
"ERR187507", "ERR187514", "ERR187517", "ERR187518", "ERR187522",
"ERR187489", "ERR187491", "ERR187493", "ERR187510", "ERR187523",
"ERR187526", "ERR187527", "ERR187528", "ERR187530", "ERR187535")
# Convert sample column to factor with the desired order of levels
all_data$sample <- factor(all_data$sample, levels = label_order)
ggplot(all_data, aes(x = sample, y = value, fill = factor(Q))) +
geom_bar(stat = "identity") +
labs(title = "RNA-seq", x = "Sample", y = "Value") +
theme_bw() +
scale_fill_discrete(name = "Q") +
facet_wrap(~k, ncol=1)+
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +ylim(0,1) +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_fill_manual(values = cols, name = "K", labels = seq(1:13)) +
guides(color = guide_legend(nrow = 2, byrow = TRUE))
```