Mapping ddRAD reads to a reference genome

--- title: Mapping ddRAD reads to a reference genome breaks: false tags: RADseq, Nora --- # Mapping ddRAD reads to a reference genome ### Preparing the reference: Good idea to make a new directory for your reference files. The reference needs to be uncompressed for this analysis. Unzip the reference: ``` mkdir ./reference mv CAJRAG01.1.fsa_nt.gz ./reference/ gunzip CAJRAG01.1.fsa_nt.gz ``` Index the reference genome: ``` bwa index CAJRAG01.1.fsa_nt ``` This will create a few other files with the same basename. They all need to be in the same directory with the .fsa_nt file in order for mapping and such to work. ### Map the reads: Run from the directory with the read files, this loop will align the reads for each pair of read files, generating a .sam file ``` bwa mem -M -t 4 /home/bioinfo8/reference_mapping/reference_files/CAJRAG01.1.fsa_nt sampleID.1.fq.gz sampleID.2.fq.gz > sampleID.sam ``` For references that might be more distantly related, you might get low percentage of reads mapping to the reference. You can check the percent mapping to the reference with: ``` samtools flagstat sampleID.sam ``` Good mapping percentages are in the 90's. You can mess around with the -K values of bwa mem to see if you get a higher percent mapping to the reference. This is a tradeoff because the lower the -K, that means that the fewer exact basepairs need to match the reference in order to start mapping a read. Default is 19 when bwa mem is run without a -K value. It can be a good idea to run bwa mem on one sample with several values of -K : default, -K 16 ; -K 14; -K 12 ; -K 10. Evaluate which maps the most reads and then use the loop for the rest of the samples: ``` for i in *.1.fq.gz; do NM=${i//.1.fq.gz/} REVERSE=${i//.1.fq.gz/.2.fq.gz} bwa mem -M -t 4 -K 10 /home/bioinfo8/reference_mapping/reference_files/CAJRAG01.1.fsa_nt $i $REVERSE > $NM.sam ; done ``` Now convert the .sam files into .bam files: ``` for i in *.sam; do NM=${i//.sam/}; samtools view $i -b -o $NM.bam; done ``` and then sort the .bam files: ``` for i in *.bam; do NM=${i//.bam/}; samtools sort $i -o $NM.sort.bam; done ``` As long as your sorted .bam files are looking good, then you can delete the .sam files because they are quite large and take up a lot of space. I put all my sorted .bams into their own folder. ### Build stacks loci from aligned reads: Good idea to do this in screen ``` gstacks -I ~/sorted_bams/ -M ~/popmap -O ~/reference_stacks/ -t 8 ``` #note from May 2024: maybe should not use --rm-pcr-ducplicates https://groups.google.com/g/stacks-users/c/qmAxcH7cbus and now run populations, also in screen ``` populations -P ~/reference_stacks/ -M ~/popmap -r 0.65 --vcf --genepop --fstats --smooth --ordered-export --hwe -t 8 ``` This is where I stopped as far as looking for amplicon loci with the reference mapped stacks, because there were many SNPS from the same loci. From what I can gather, the flag --write-random-snp doesn't work with reference aligned stacks, and it is suggested instead to used --ordered-export (also note that this is --ordered_export in the STACKS documentation, but it runs only with --ordered-export with "-" instead of "_" go figure) ## Summarizing results: To calculate average missingness in the filtered vcf ``` vcftools --gzvcf engleri_filtered1.recode-1.vcf --missing-indv --out dp vcftools --gzvcf engleri_filtered1.recode-1.vcf --depth --out dp ```