# Pairwise Sequentially Markovian Coalescent (PSMC) model By Tanya Lama These scripts were initially run for my dissertation along with other pop gen metrics, but the analysis was eventually abandoned because I didn't have time/support in interpreting the results at the time. This analysis has become much more common among popgen studies so there are likely better resources out there, but if you need to get scripts running in a pinch this should do it for you. Essentially, step one is to take the fasta Canada lynx genome file, align it to a reference (in this case we have Canada lynx but previously had also used Felis Catus) and use the resultant bam file for the PSMC workflow. Only have one genome (e.g., just the reference?) No problem! Just align the raw reads from your assembly to the assembly itself, and use that bam file for the PSMC. I beliee Bentley et al did this for this paper in PNAS and I bet the scripts are decent too: [here](https://www.pnas.org/doi/10.1073/pnas.2201076120) #This may be easier than converting our .maf to a bam? # ssh on ssh tl50a@ghpcc06.umassrc.org ## load your env module load anaconda2/4.4.0 module load GATK/3.5 module load java/1.8.0_77 module load gnuplot/5.2.0 module load psmc/0.6.5 source activate PSMC cd /project/uma_lisa_komoroske/Tanya/scripts/PSMC ## Dependencies We'll use the module system and a conda environment to control software versions samtools/bcftools conda create --name PSMC conda install: samtools==1.2 bcftools==1.4.1 picard==2.17.8 gatk==3.5 gnuplot==5.0.4 ## Download the Canada lynx reference genome (p1 assembly) from NCBI (RefSeq not GenBank) wget -m ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/007/474/595/GCF_007474595.1_mLynCan4_v1.p/GCF_007474595.1_mLynCan4_v1.p_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/ All files need to be in the /refs/ directory ``` mv chrX.fna.gz /project/uma_lisa_komoroske/Tanya/download/refs/chrX.fna.gz ``` #### Create indices and dictionaries for each chromosome We also need to create a sequence dictionary for our reference genome using PICARD CreateSequenceDictionary GATK uses two files to access reference files: a .dict dictionary of the contig names and sizes and a .fai fasta index file to allow efficient random access to the reference bases. You have to generate these files in order to use a Fasta file as reference. Use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file.#Do these files need to be gunzipped???? ``` bsub -q short -W 1:00 -R rusage[mem=16000] -n 4 -R span\[hosts=1\] "picard CreateSequenceDictionary R=/project/uma_lisa_komoroske/Tanya/download/refs/chrF2.fa O=/project/uma_lisa_komoroske/Tanya/download/refs/chrF2.dict" ``` #### Creating the fasta index file We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file. This produces a text file with one record per line for each of the fasta contigs. Each record is of the: contig, size, location, basesPerLine, bytesPerLine. ``` bsub -q short -W 0:30 -R rusage[mem=16000] -n 4 -R span\[hosts=1\] "samtools faidx /project/uma_lisa_komoroske/Tanya/download/refs/chrF2.fa" ``` output is:chrA1.fa.fai # Requirements: We are starting from a BAM file that contains reads already mapped to the Canada lynx reference genome. #### Calling consensus sequence (samtools -> bcftools -> vcfutils.pl) Working directory is /project/uma_lisa_komoroske/Tanya/scripts/PSMC/ ##### Our mapped reads are aligned to mLynCan4.fa... Starting from mapped reads, the first step is to produce a consensus sequence in FASTQ format. To do this, we will use the samtools/bcftools suite. The basic idea behind generating a consensus sequence is to first use *samtools mpileup* to take the mapped reads and produce a VCF file. The consensus sequence is then generated by bcftools with the original consensus calling model, and converted to fastq (with some additional filtering) by *vcfutils.pl* A few notes: We will only analyze autosomes, relying on the fact that the n autosomes are named chr1A - chrF2 in the reference. This means we can easily use a slurm job array to separately process each chromosome. We'll use Unix pipes and the ability of samtools to work with streaming input and output to run the whole pipeline (samtools -> bcftools -> vcfutils.pl) as one command. First, let's set up some job parameters for each chromosome: #SBATCH -n 1 #each chromosome will be processed on a single core (-n 1) #SBATCH -R 1 #on one machine (-R 1) #SBATCH -q long #we'll use the serial_requeue queue (I don't think we have one of these) #SBATCH -W 0-12:00 #most runs should finish in 3-4 hours but we'll a bit extra #SBATCH --mem 2000 #most runs should be under 1 GB but we'll add a buffer #SBATCH -N *mpileup* #a basic name for this job We know in advance that we want to run this on the 27 autosomes, so we'll specify the job array size in our script, like so: #SBATCH --array = 1-27 #### Now, on to the actual code. First, we'll load the necessary software using the module system: source new-modules.sh Now, since we've launched 27 jobs with the job array, we need to tell each separate job which chromosome to work on. We'll do that using a bash variable. We'll also set a bash variable to store the location of the directory with the bam files and reference genome: CHR="chr${SLURM_ARRAY_TASK_ID}" DATAPATH="/n/regal/informatics/workshops/PSMC_20150911" Note: the variable $SLURM_ARRAY_TASK_ID stores the array id (1,2,3,4, ... ,27) for each of the separate array jobs. Finally, we run our: # Consensus calling pipeline, consisting of a linked set of samtools, bcftools, and vcfutils.pl commands: Note: May 2020 #We've already generated the required fasta index (.fai) and dictionary (.dict) for our newest genome reference located at /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_v1.p/ ``` bsub -q long -W 48:00 -R rusage[mem=1000] -n 1 "samtools mpileup -Q 30 -q 30 -u -v \ -f /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4/mLynCan4.fa /project/uma_lisa_komoroske/Tanya/analyses/bams/cb42_RemoveBadReads.bam | bcftools call -c | vcfutils.pl vcf2fq -d 3 -D 500 -Q 30 > cb42.fq" #changed span parameter to 1 host -- much faster #trying this with our new reference genome -- worked successfully. Moving on bsub -q long -W 200:00 -R rusage[mem=2000] -n 1 -R span\[hosts=1\] "samtools mpileup -Q 30 -q 30 -u -v \ -f /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_v1.p/GCF_007474595.1_mLynCan4_v1.p_genomic.fa /project/uma_lisa_komoroske/Tanya/analyses/bams_mLynCan4_v1/a794_RemoveBadReads.bam | bcftools call -c | vcfutils.pl vcf2fq -d 3 -D 500 -Q 30 > a794.fq" /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_v1.p/GCF_007474595.1_mLynCan4_v1.p_genomic.fa ``` ###final output is located in scripts instead of analyses (oops) This takes as input an aligned bam file and a reference genome, generates an mpileup using samtools, calls the consensus sequence with bcftools, and then filters and converts the consensus to fastq format, writing the results for each chromosome to a separate fastq file. > [samtools: -Q and -q in mpileup determine the cutoffs for baseQ and mapQ, respectively -v tells mpileup to produce vcf output, and -u says that should be uncompressed -f is the reference fasta used (needs to be indexed) -r is the region to call the mpileup for (in this case, a particular chromosome based on the array task id) P964.bam is the bam file to use bcftools: call -c calls a consensus sequence from the mpileup using the original calling method vcfutils.pl: -d 5 and -d 34 determine the minimum and maximum coverage to allow for vcf2fq, anything outside that range is filtered -Q 30 sets the root mean squared mapping quality minimum to 30] # Running Consensus Sequence separately 1) Samtools mpileup ``` samtools mpileup -Q 30 -q 30 -u -v \ -f $DATAPATH/loxAfr4.fa -r $CHR $DATAPATH/P964.bam ``` 2) BCFtools call 3) VCFutils.pl vcfutils.pl vcf2fq -d 3 -D 500 -Q 30 > /project/uma_lisa_komoroske/Tanya/scripts/PSMC/all.fq vcfutils.pl varFilter [options] <in.vcf> Options: -d 5 and -D 34 determine the minimum and maximum coverage to allow for vcf2fq, anything outside that range is filtered -Q 30 sets the root mean squared mapping quality minimum to 30] # Running PSMC PSMC takes the consensus fastq file, and infers the history of population sizes. Although it takes a variety of parameters to control the details of the model fitting, we will follow Palkopoulou et al and use the defaults. First, let's set up some slurm parameters for the PSMC run: #SBATCH -n 1 #SBATCH -N 1 #SBATCH -p serial_requeue #SBATCH -t 0-16:00 #SBATCH --mem 6000 #SBATCH -J psmc source new-modules.sh The first thing we need to do is merge all our single-chromosome fastq files into a single consensus sequence, which we'll do using the unix tool cat. cat P964.chr*.fq > P964.consensus.fq #### Convert the fastq file to the input format for PSMC: The contents of the utils directory have been moved to **/share/pkg/psmc/0.6.5/utils** This is super quick and might not even need to be run on the cluster, only takes a few minutes ``` bsub -q short -W 0:15 -R rusage[mem=1000] -n 1 "/share/pkg/psmc/0.6.5/utils/fq2psmcfa b23.fq > b23.psmcfa" ``` #### Run psmc using the default options Note that we specify the -p parameter as the defaults reported in the paper differ from the current defaults: ``` bsub -q long -W 8:00 -R rusage[mem=8000] -n 24 "psmc -p "4+25*2+4+6" -o a494.psmc a494.psmcfa" ``` #### Make the PSMC plot Using the per-generation mutation rate -u and the generation time in years -g reported in the paper. Because the paper does not give exact parameters for how they produced the plot, this likely will look a bit different than the figure, but hopefully it will be pretty close. Usage is: psmc_plot.pl -M "sample1=0.1,sample2=0.2" prefix sample1.psmc sample2.psmc ``` bsub -q short -W 0:15 -R rusage[mem=16000] -n 4 "/share/pkg/psmc/0.6.5/utils/psmc_plot.pl -u 2.5e-08 -g 35 -p plotb cb42.psmc" bsub -q short -W 0:15 -R rusage[mem=16000] -n 4 "/share/pkg/psmc/0.6.5/utils/psmc_plot.pl -g 1 -p lynx1.0 a109.psmc a182.psmc a202.psmc a33.psmc a475.psmc a697.psmc a772.psmc a803.psmc BOBCAT1.psmc BOBCAT2.psmc" Options: -u FLOAT absolute mutation rate per nucleotide [$opts{u}] -u 2.5e-08 -s INT skip used in data preparation [$opts{s}] -X FLOAT maximum generations, 0 for auto [0] -x FLOAT minimum generations, 0 for auto [$opts{x}] -Y FLOAT maximum popsize, 0 for auto [$opts{Y}] -m INT minimum number of iteration [$opts{m}] -n INT take n-th iteration (suppress GOF) [$opts{n}] -M titles multiline mode [null] -f STR font for title, labels and tics [$opts{f}] -g INT number of years per generation [$opts{g}] -w INT line width [$opts{w}] -P STR position of the keys [$opts{P}] -T STR figure title [null] -N FLOAT false negative rate [0] -S no scaling -L show the last bin -p convert to PDF (with epstopdf) -R do not remove temporary files -G plot grid ``` APPENDIX II: PSMC adjustments for low coverage genomes ======================================== For diploid genomes sequenced to low coverage, heterozygotes will be randomly lost due to the lack of coverage of both alleles. This has the same effect as smaller mutation rate and can be corrected. If you know the fraction of hets missed due to low coverage, you can generate the PSMC plot with: psmc_plot.pl -M "sample1=0.1,sample2=0.2" prefix sample1.psmc sample2.psmc This says that sample1 has 10% false negative rate (FNR) on hets and sample2 has 20%. The plotting script does not correct FNR for bootstrapping. If you want to plot the result with your own scripts, you can increase \theta_0 to \theta_0/(1-FNR). ### Converting output to pdf ``` epstopdf lynx1.0.eps ``` ### Download the pdf to local computer scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/scripts/PSMC/lynx1.0.pdf /Users/tanyalama/Desktop/plots/lynx1.0.pdf scp -r /Users/tanyalama/Desktop/fastqc_html/LIC46bpsmc.html tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/scripts/PSMC/consensus_plot.eps consensus_plot -g 10 -u ### Parameterization 1) consensus_plot 2) -u 3.83e-08 -g 31 **seems to be the most detailed plot 4) .20 missing and -u 3.83e-08 -g 50 Plot$ and Details plot5 -u 3.83e-08 -g 50 0 missing het plot6 -u 2.5e-08 -g 50 0 missing het plot7 -u 2.5e-08 -g 100 0 missing het *total flatline* plot8 -u 2.5e-08 -g 20 0 missing het plot9 -u 2.5e-08 -g 30 0 missing het *accidentally deleted plot9. rerun if you want it, looked good plot10 -u 2.5e-08 -g 35 0 missing het *best looking plot so far. great detail Mutation Rates: 2.5e-08 (human) 3.83e-08 ### A note about plotting I know that there's been some difficulty getting multiple PSMC lines on a single plot. Ask David Ray if you need the solution for this (which I believe needs to be plotted in R or something annoying like that). good luck!