---
title: Mapping ddRAD reads to a reference genome
breaks: false
tags: RADseq, Nora
---
# Mapping ddRAD reads to a reference genome
### Preparing the reference:
Good idea to make a new directory for your reference files. The reference needs to be uncompressed for this analysis. Unzip the reference:
```
mkdir ./reference
mv CAJRAG01.1.fsa_nt.gz ./reference/
gunzip CAJRAG01.1.fsa_nt.gz
```
Index the reference genome:
```
bwa index CAJRAG01.1.fsa_nt
```
This will create a few other files with the same basename. They all need to be in the same directory with the .fsa_nt file in order for mapping and such to work.
### Map the reads:
Run from the directory with the read files, this loop will align the reads for each pair of read files, generating a .sam file
```
bwa mem -M -t 4 /home/bioinfo8/reference_mapping/reference_files/CAJRAG01.1.fsa_nt sampleID.1.fq.gz sampleID.2.fq.gz > sampleID.sam
```
For references that might be more distantly related, you might get low percentage of reads mapping to the reference. You can check the percent mapping to the reference with:
```
samtools flagstat sampleID.sam
```
Good mapping percentages are in the 90's. You can mess around with the -K values of bwa mem to see if you get a higher percent mapping to the reference. This is a tradeoff because the lower the -K, that means that the fewer exact basepairs need to match the reference in order to start mapping a read. Default is 19 when bwa mem is run without a -K value. It can be a good idea to run bwa mem on one sample with several values of -K : default, -K 16 ; -K 14; -K 12 ; -K 10. Evaluate which maps the most reads and then use the loop for the rest of the samples:
```
for i in *.1.fq.gz; do
NM=${i//.1.fq.gz/}
REVERSE=${i//.1.fq.gz/.2.fq.gz}
bwa mem -M -t 4 -K 10 /home/bioinfo8/reference_mapping/reference_files/CAJRAG01.1.fsa_nt $i $REVERSE > $NM.sam ; done
```
Now convert the .sam files into .bam files:
```
for i in *.sam; do
NM=${i//.sam/};
samtools view $i -b -o $NM.bam; done
```
and then sort the .bam files:
```
for i in *.bam; do NM=${i//.bam/}; samtools sort $i -o $NM.sort.bam; done
```
As long as your sorted .bam files are looking good, then you can delete the .sam files because they are quite large and take up a lot of space. I put all my sorted .bams into their own folder.
### Build stacks loci from aligned reads:
Good idea to do this in screen
```
gstacks -I ~/sorted_bams/ -M ~/popmap -O ~/reference_stacks/ -t 8
```
#note from May 2024: maybe should not use --rm-pcr-ducplicates https://groups.google.com/g/stacks-users/c/qmAxcH7cbus
and now run populations, also in screen
```
populations -P ~/reference_stacks/ -M ~/popmap -r 0.65 --vcf --genepop --fstats --smooth --ordered-export --hwe -t 8
```
This is where I stopped as far as looking for amplicon loci with the reference mapped stacks, because there were many SNPS from the same loci. From what I can gather, the flag --write-random-snp doesn't work with reference aligned stacks, and it is suggested instead to used --ordered-export (also note that this is --ordered_export in the STACKS documentation, but it runs only with --ordered-export with "-" instead of "_" go figure)
## Summarizing results:
To calculate average missingness in the filtered vcf
```
vcftools --gzvcf engleri_filtered1.recode-1.vcf --missing-indv --out dp
vcftools --gzvcf engleri_filtered1.recode-1.vcf --depth --out dp
```