# Preliminary analyses of lynx SNP data
By: Tanya Lama
12/24/2019
This is a preliminary analysis of the first ~26 lynx genomes from Maine. The full dataset is ~70 genomes spanning all of Maine and several eastern Canadian provinces, and includes one bobcat/lynx hybrid and several bobcats. The bioinformatic pipeline used to generate the SNP dataset analyzed here can be viewed on HackMD or GitHub (see NGSPipeline). Briefly, our goals here are to:
*1) Explore the lynx SNP dataset (e.g., depth of coverage of the joint .vcf -- what else can this package do in terms of data exploration?)*
*1a) What is our average sequence depth?*
* Ranges ~5x to ~14x
*1b) How many total SNPs?*
* Starting with 695,060 SNPs
*2) Prune/further reduce the SNP set*
* We use a few different methods here for reducing the number of SNPs (LD-based pruning, keeping only bi-allelic sites, removing monoSNPs, etc.). This script is best run in an interactive R session on the cluster, because the raw .vcf is a lot for desktop R to handle.
*3) Describe population structure among lynx in Maine using the sNMF method and one PCA-based method, and confirm the result with two identity-by-descent relatedness analyses (MoM and MLE)*
* The null hypothesis assumes no structure
* The evidence fails to support the null hypothesis
* K = 4
*4) Confirm relatedness among individuals*
* Individuals are not related
* If structure exists, calculate admixture among all individuals
* To do: plot admixture pie-plots on a map of our study area (Maine) to see where structure falls on the landscape
*5) QUAST: genome assembly comparison among all felid assemblies on NCBI (complete)*
## Part 1: Data management
## Data in VCF format and some basic summary stats
Selecting/dropping individuals from this dataset must be done using *vcftools* on the command line and cannot be accomplished in R. Instructions for doing so can be found on my HackMD or GitHub. No individuals will be dropped from the data explored here.
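The chunks below call functions from several packages without ever showing the `library()` calls. A minimal setup sketch, with the package names inferred from the functions used (read.vcfR/extract.gt from vcfR, the snpgds*/gdsn functions from SNPRelate and gdsfmt, snmf/vcf2geno/ped2geno/Q from LEA, textxy from calibrate):
```{r setup, message=FALSE, warning=FALSE}
# Packages assumed by the analysis chunks below (inferred, not listed in the original)
library(vcfR)      # read.vcfR, extract.gt
library(gdsfmt)    # add.gdsn, read.gdsn, index.gdsn
library(SNPRelate) # snpgdsVCF2GDS, snpgdsLDpruning, snpgdsPCA, snpgdsIBDMLE
library(LEA)       # snmf, vcf2geno, ped2geno, Q, cross.entropy
library(calibrate) # textxy
```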
```{r warning=FALSE}
#Read in the vcf and proceed
vcf <- read.vcfR("/Users/tanyalama/Box/Lynx Whole Genomes/Analyses/onlylynx.vcf", verbose = FALSE) #this file should have a .vcf suffix
head(vcf)
##CHROM       POS   ID REF ALT QUAL    FILTER INFO                  FORMAT
#scaffold_11  43536 .  T   C   5028.07 PASS   AC=33;AF=0.635;AN=52; GT:AD:DP:GQ:PL
dp <- extract.gt(vcf, element = "DP", as.numeric = TRUE) #DP = depth of coverage. The average coverage for this resequencing data was ~5-14x
colnames(dp) #what are the column names (sample IDs) of dp?
#Plot read depth (DP) per individual
boxplot(dp, col = c("#C0C0C0", "#808080"), ylab = "Read Depth (DP)",
        las = 2, cex = 0.4, cex.axis = 0.5) #adjust column names if needed
#Zoom in to smaller values
boxplot(dp, col = c("#C0C0C0", "#808080"), ylab = "Read Depth (DP)",
        las = 2, cex = 0.4, cex.axis = 0.5, ylim = c(0, 50))
abline(h = 8, col = "red")
```
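As a quick numeric check on the boxplots, a minimal sketch using the `dp` matrix extracted above (not part of the original script):
```
#Per-individual mean and median read depth, to confirm the ~5-14x range
round(apply(dp, 2, mean, na.rm = TRUE), 1)
round(apply(dp, 2, median, na.rm = TRUE), 1)
```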

**Fig1.** MQ should all be >40. Looks good.

**Fig2.** Read depth box plots. Focus on the median, not the outliers

**Fig3.** Coverage falls between 5x and 14x. The red line indicates 8x coverage. We needed at least 3x coverage to run our whole-genome pipeline, which uses a genotype-likelihood method (GATK HaplotypeCaller) to call SNPs from low-coverage data.
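Fig1 refers to mapping quality (MQ), which the chunk above never extracts. A hedged sketch of how MQ could be checked with vcfR's `extract.info`, assuming MQ is present in the INFO field of this VCF:
```
mq <- extract.info(vcf, element = "MQ", as.numeric = TRUE) #per-site mapping quality from INFO
summary(mq)
hist(mq, breaks = 50, xlab = "Mapping quality (MQ)", main = "")
abline(v = 40, col = "red") #Fig1 expectation: MQ should all be > 40
```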
## Part 2: Reduce the SNP dataset
Note: depending on what the data exploration says (and since we can use an R session on the cluster), we may not need to do this.
We have 695,060 SNPs. We will reduce our SNP set using LD-based SNP pruning and by eliminating monoSNPs (monomorphic sites) and multi-allelic SNPs. It is suggested to use a pruned set of SNPs that are in approximate linkage equilibrium with each other, to avoid the strong influence of SNP clusters in principal component and relatedness analyses. Note: our NGS pipeline already did a lot of filtering for quality and clustering, so we are being pretty stringent here.
```{r Data Prep, message=FALSE, warning=FALSE, paged.print=FALSE}
#Clear the cache if necessary (do this before you knit the R Markdown):
#closeAllConnections()
#snpgdsClose(genofile) #in case you need to close or re-write a gds file
#closefn.gds(test1.gds)
#Load our vcf file (SNP data for 26 lynx)
vcf.fn <- "/Users/tanyalama/Box/Lynx Whole Genomes/Analyses/onlylynx.vcf"
#Reformat VCF --> SNP GDS (genomic data structure)
#The gdsfmt package provides the genomic data structure (GDS) file format for array-oriented bioinformatic data, which is a container for annotation and SNP genotypes.
#Human translation: apply the function snpgdsVCF2GDS to the input file vcf.fn. The output file is test.gds. Select only biallelic SNPs.
snpgdsVCF2GDS(vcf.fn, "test.gds", method = "biallelic.only") #keeping only biallelic SNPs is another pruning step, because we have a lot of multiallelic SNPs in our dataset
```
### Summary: 695,060 SNPs in 26 individuals
```
snpgdsSummary("test.gds")
#Open our GDS-formatted file as "genofile". The file name comes from the summary you just read.
genofile <- snpgdsOpen("/Users/tanyalama/Desktop/test.gds", readonly = FALSE)
#Add population information to the gds genofile. Under the null hypothesis we assume no structure, so population = 1 for every individual (one value per individual; 26 lynx here)
add.gdsn(genofile, "pop_code", rep(1, 26))
#Get sample id information
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id")) #node names may differ in other datasets
```
### Proceed to snpgdsLDpruning
Note: look up best practices for ld.threshold (0.1 used here); a quick sensitivity sketch appears at the end of Part 2 below.
```{r message=FALSE, warning=FALSE}
set.seed(1) #should be 1000, reduced for rmarkdown
#Try different LD thresholds for sensitivity analysis. #Need to include autosome.only = FALSE clause to run.
snpset <- snpgdsLDpruning(genofile, autosome.only = FALSE, remove.monosnp = FALSE, ld.threshold=0.1, verbose = TRUE) #n/n markers are selected in total with .1 ld.
```
### SNP pruning based on LD:
```
Working space: 26 samples, 695,060 SNPs
using 1 (CPU) core
sliding window: 500,000 basepairs, Inf SNPs
|LD| threshold: 0.1
method: composite
Chromosome scaffold_11_arrow_ctg1: 1.28%, 375/29,312
Chromosome Super_Scaffold_6: 1.20%, 413/34,499
Chromosome Super_Scaffold_7: 1.23%, 352/28,538
Chromosome Super_Scaffold_8: 1.46%, 397/27,249
Chromosome Super_Scaffold_10: 2.39%, 370/15,472
Chromosome scaffold_17_arrow_ctg1_1: 1.67%, 4/239
Chromosome scaffold_17_arrow_ctg1: 1.31%, 302/23,055
Chromosome scaffold_18_arrow_ctg1: 1.31%, 271/20,738
Chromosome scaffold_18_arrow_ctg1_1: 1.22%, 1/82
Chromosome Super_Scaffold_11: 1.28%, 267/20,862
Chromosome Super_Scaffold_4: 1.44%, 869/60,195
Chromosome Super_Scaffold_5: 1.38%, 696/50,293
Chromosome scaffold_21_arrow_ctg1: 1.17%, 177/15,169
Chromosome scaffold_26_arrow_ctg1: 1.89%, 2/106
Chromosome scaffold_2_arrow_ctg1: 1.38%, 828/59,960
Chromosome scaffold_35_arrow_ctg1: 0.59%, 10/1,681
Chromosome scaffold_36_arrow_ctg1: 0.83%, 7/848
Chromosome Super_Scaffold_1: 1.37%, 931/67,749
Chromosome scaffold_42_arrow_ctg1: 0.87%, 4/459
Chromosome scaffold_44_arrow_ctg1: 3.45%, 1/29
Chromosome scaffold_45_arrow_ctg1: 1.02%, 4/392
Chromosome scaffold_47_arrow_ctg1: 4.76%, 4/84
Chromosome scaffold_49_arrow_ctg1: 42.86%, 3/7
Chromosome Super_Scaffold_3: 1.34%, 598/44,738
Chromosome scaffold_51_arrow_ctg1: 1.59%, 1/63
Chromosome scaffold_52_arrow_ctg1: 16.67%, 1/6
Chromosome scaffold_53_arrow_ctg1: 50.00%, 1/2
Chromosome scaffold_54_arrow_ctg1: 10.00%, 3/30
Chromosome scaffold_55_arrow_ctg1: 33.33%, 1/3
Chromosome scaffold_57_arrow_ctg1: 14.29%, 2/14
Chromosome scaffold_58_arrow_ctg1: 16.67%, 1/6
Chromosome scaffold_59_arrow_ctg1: 100.00%, 1/1
Chromosome Super_Scaffold_2: 1.48%, 561/37,941
Chromosome scaffold_61_arrow_ctg1: 2.08%, 1/48
Chromosome scaffold_62_arrow_ctg1: 11.11%, 2/18
Chromosome scaffold_63_arrow_ctg1: 14.29%, 1/7
Chromosome scaffold_64_arrow_ctg1: 20.00%, 1/5
Chromosome scaffold_65_arrow_ctg1: 4.88%, 2/41
Chromosome scaffold_66_arrow_ctg1: 25.00%, 3/12
Chromosome scaffold_67_arrow_ctg1: 11.11%, 2/18
Chromosome scaffold_68_arrow_ctg1: 20.00%, 3/15
Chromosome scaffold_69_arrow_ctg1: 14.29%, 1/7
Chromosome Super_Scaffold_13: 1.45%, 519/35,826
Chromosome scaffold_70_arrow_ctg1: 22.22%, 4/18
Chromosome scaffold_72_arrow_ctg1: 11.11%, 2/18
Chromosome scaffold_74_arrow_ctg1: 2.56%, 1/39
Chromosome scaffold_75_arrow_ctg1: 100.00%, 1/1
Chromosome scaffold_76_arrow_ctg1: 1.85%, 1/54
Chromosome scaffold_77_arrow_ctg1: 14.29%, 1/7
Chromosome scaffold_78_arrow_ctg1: 20.00%, 1/5
Chromosome scaffold_79_arrow_ctg1: 25.00%, 1/4
Chromosome Super_Scaffold_9: 1.39%, 543/39,133
Chromosome scaffold_80_arrow_ctg1: 25.00%, 1/4
Chromosome scaffold_81_arrow_ctg1: 11.11%, 1/9
Chromosome scaffold_82_arrow_ctg1: 25.00%, 2/8
Chromosome scaffold_83_arrow_ctg1: 40.00%, 2/5
Chromosome scaffold_84_arrow_ctg1: 5.08%, 3/59
Chromosome scaffold_85_arrow_ctg1: 100.00%, 1/1
Chromosome scaffold_86_arrow_ctg1: 6.67%, 1/15
Chromosome scaffold_87_arrow_ctg1: 50.00%, 1/2
Chromosome scaffold_88_arrow_ctg1: 22.22%, 2/9
Chromosome scaffold_89_arrow_ctg1: 1.53%, 3/196
Chromosome Super_Scaffold_12: 1.36%, 601/44,192
Chromosome scaffold_90_arrow_ctg1: 9.09%, 1/11
Chromosome Super_Scaffold_14: 1.31%, 466/35,451
```
### Summary: 9,633 markers are selected in total.
```
# Create an object to hold our ~9,633 selected SNPs
snpset.id <- unlist(snpset)
# Take our selected snps from snpset.id and generate a .ped file for further analyses
snpgdsGDS2PED(genofile, "ped.fn", sample.id=NULL, snp.id=snpset.id, use.snp.rsid=TRUE,
format=c("A/G/C/T", "A/B", "1/2"), verbose=TRUE)
# Convert the .ped file to .geno file for sNMF
ped2geno("/Users/tanyalama/Desktop/ped.fn.ped", output.file = ("/Users/tanyalama/Desktop/ped.fn.ped.geno"), force = TRUE)
# Our final .geno file: ped.fn.ped.geno
```
- number of detected individuals: 26
- number of detected loci: 9633
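Regarding the ld.threshold note above, a hedged sensitivity sketch (assuming `genofile` is still open; the thresholds are illustrative and each run takes a while on the full SNP set):
```
#Count how many SNPs survive LD pruning at a few different thresholds
thresholds <- c(0.1, 0.2, 0.5)
sapply(thresholds, function(t) {
  set.seed(1)
  length(unlist(snpgdsLDpruning(genofile, autosome.only = FALSE,
                                remove.monosnp = FALSE,
                                ld.threshold = t, verbose = FALSE)))
})
```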
## Part 3: Population structure
### Principal Component Analysis (PCA)
The functions in SNPRelate for PCA include calculating the genetic covariance matrix from genotypes, computing the correlation coefficients between sample loadings and genotypes for each SNP, calculating SNP eigenvectors (loadings), and estimating the sample loadings of a new dataset from specified SNP eigenvectors.
```{r PCA, echo=TRUE, warning=FALSE}
# Run PCA using the ldpruned set of 9k SNPs (snpset.id) instead of the full set.
pca <- snpgdsPCA(genofile, sample.id = NULL, snp.id=snpset.id, autosome.only = FALSE, num.thread=2, verbose= TRUE)
```
```
Principal Component Analysis (PCA) on genotypes:
Excluding 92 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
Working space: 26 samples, 9,541 SNPs
using 2 (CPU) cores
```
```
#The following code shows how to calculate the percent of variation accounted for by the top principal components. The first two eigenvectors clearly hold the largest percentage of variance in the population, although the total variance accounted for is still less than one-quarter of the total. #update with lynx results
# variance proportion (%)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
#get population information
pop_code<- read.gdsn(index.gdsn(genofile, "pop_code"))
head(cbind(sample.id, pop_code))
# make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
pop = factor(pop_code)[match(pca$sample.id,sample.id)],
EV1 = pca$eigenvect[,1], # the first eigenvector
EV2 = pca$eigenvect[,2], # the second eigenvector
stringsAsFactors = FALSE)
tab #view the dataframe
plot(tab$EV2, tab$EV1, xlab="eigenvector 2", ylab="eigenvector 1") #three distinct groups?
```
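To put numbers on how much variance the leading axes capture, a small sketch using `pc.percent` from the chunk above (not in the original script):
```
#Scree-style barplot of the variance explained by the first 10 principal components
barplot(head(pc.percent, 10), names.arg = 1:10,
        xlab = "Principal component", ylab = "Variance explained (%)")
```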
## Plot our PCA results
```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=TRUE}
#Draw
plot.new()
plot(tab$EV2, tab$EV1, xlim = c(-0.3, 0.7), ylim = c(-0.2, 0.57), col=as.factor(tab$sample.id), xlab="eigenvector 2", ylab="eigenvector 1")
textxy(tab$EV2, tab$EV1, labs = pca$sample.id, cex = 0.45, pos = 4) #label all 26 samples
legend("bottomright", legend=as.factor(tab$sample.id), cex = 0.6,pch="o", col=as.factor(tab$sample.id))
```

**Fig4.** LIC24 and LIC54 are very clearly clustering together, but it's hard to say what is going on with the rest of the samples.

**Fig5.** This is a very similar pattern to what we see from previous HapMap results with hybrid and bobcats (not included here).
## sNMF: a program to estimate ancestry coefficients
sNMF is a fast and efficient method for estimating individual ancestry coefficients based on sparse non-negative matrix factorization algorithms. In [1], the performance of sNMF was compared to the likelihood algorithm implemented in the program ADMIXTURE; without loss of accuracy, sNMF computed estimates of ancestry coefficients with run-times approximately 10 to 30 times shorter than those of ADMIXTURE.
```{r echo=T, warning=FALSE, include= TRUE}
#Not used: commands to load functions and import files in STRUCTURE format
#source("http://membres-timc.imag.fr/Olivier.Francois/Conversion.R")
#source("http://membres-timc.imag.fr/Olivier.Francois/POPSutilities.R")
#Importing input files
#Our files need to be convertible to the .geno or .lfmm format used by LEA. The LEA package has a couple of very helpful functions for converting vcf files to these formats in R. If you can't use these functions, you may need to convert your vcf to STRUCTURE format outside of R using PGDSpider (downloaded at ).
#We found that using the vcf2geno or vcf2lfmm functions was much easier than converting our vcf dataset to STRUCTURE format and then using the struct2geno function.
#identify your input file (use the new revised vcf)
input.file <- "/Users/tanyalama/Box/Lynx Whole Genomes/Analyses/onlylynx.vcf"
#vcf2geno function. The output.file should be same as the input file ending in .geno. force = TRUE if you want to overwrite an existing file under that name.
vcf2geno(input.file, output.file = "/Users/tanyalama/Box/Lynx Whole Genomes/Analyses/onlylynx.vcf.geno", force = FALSE)
#vcf2lfmm(input.file, output.file = "/Users/tanyalama/file.lfmm", force = TRUE) #this command was not used
```
Now that we have the correct .geno file format (see snpset and .gds -> .ped -> .geno above), we can run a population structure analysis that estimates ancestry coefficients across a range of K clusters. This can be done with the snmf function of the LEA package. The input.file here should end in .geno.
The **null hypothesis** is that the Maine lynx population is panmictic (K = 1): barriers to dispersal/gene flow are not present in the landscape, and lynx in Maine operate as a single contiguous population. ***We are already seeing results from the PCA and HapMap that do not support this hypothesis.***
### Trying K = 1-10 clusters, with 1 repetition per K for expediency (normally we would run ~1,000)
```{r snmf, echo=T, warning=FALSE, include= T}
#Trying with our new .geno based on our snpset
obj.snmf = snmf("/Users/tanyalama/Desktop/ped.fn.ped.geno", K = 1:10, project = "new", repetitions = 1, tolerance = 0.00001, entropy=TRUE, ploidy = 2)
#Running with our original (unfiltered 600k SNPset). Looking for differences in result
#obj.snmf = snmf("/Users/tanyalama/Box/Lynx Whole Genomes/Analyses/onlylynx.vcf.geno", K = 1:10, project = "new", repetitions = 10, tolerance = 0.00001, entropy=TRUE, ploidy = 2)
```
### Example output for k=10
```
summary of the options:
-n (number of individuals) 26
-L (number of loci) 9633
-K (number of ancestral pops) 10
-x (genotype file) /Users/tanyalama/Desktop/ped.fn.ped.geno
-q (individual admixture) /Users/tanyalama/Desktop/ped.fn.ped.snmf/K10/run1/ped.fn.ped_r1.10.Q
-g (ancestral frequencies) /Users/tanyalama/Desktop/ped.fn.ped.snmf/K10/run1/ped.fn.ped_r1.10.G
-i (with masked genotypes) /Users/tanyalama/Desktop/ped.fn.ped.snmf/masked/ped.fn.ped_I.geno
- diploid
Cross-Entropy (all data): 0.486377
Cross-Entropy (masked data): 1.88852
```
### Examine the results. The new results show k=4 is optimal
```{r echo= T, include = T}
#Plot cross-entropy criterion of all runs of the project
export.snmfProject(obj.snmf, "/Users/tanyalama/Box/Lynx Whole Genomes/Analyses/R_sNMF") #path to BOX
plot(obj.snmf, cex = 1.2, col = "lightblue", pch = 19)
```

**Fig6**. The "elbow" of the cross-entropy plot helps us identify the most likely number of clusters for our population (looks like K = 4), but this is a much weaker signal than I like to see on cross-entropy plots...
```
# get the cross-entropy value for each run at K = 4
ce <- cross.entropy(obj.snmf, K=4)
ce
# select the run with the lowest cross-entropy value for K = 4
best <- which.min(ce) # the best run is #n at k=4
best
#At the end of the run, the qmatrix object contains the matrix of ancestry coefficients for each individual for K = 4 clusters. The Q-matrix has 26 rows and 4 columns, and it is traditionally displayed as a barplot (see Fig7 below).
qmatrix = Q(obj.snmf, K = 4, run = best)
qmatrix
setwd("/Users/tanyalama/Box/Lynx Whole Genomes/Analyses/R_sNMF")
write.csv(qmatrix, "qmatrixlynx.csv") #export the qmatrix to use for mapping
```
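Fig7 below is that admixture barplot; the chunk above stops at exporting the Q-matrix, so here is a minimal sketch of the barplot representation mentioned in the comments (colors are illustrative, and the sketch assumes the rows of qmatrix follow the same individual order as sample.id):
```
#Stacked barplot of ancestry coefficients for K = 4 (one bar per individual)
barplot(t(qmatrix), col = c("orange", "violet", "lightgreen", "lightblue"),
        border = NA, space = 0.2,
        names.arg = sample.id, las = 2, cex.names = 0.5, #assumes same order as qmatrix rows
        xlab = "Individuals", ylab = "Ancestry proportions (K = 4)")
```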

**Fig7.** We see again that LIC54 and LIC24 are clustering together, as are possibly LIC23 and LIC27B. This pattern holds for the PCA and HapMap results as well. Is it possible that these individuals are clustering together due to relatedness? Let's confirm this is not the case.
## Part 4: Confirm relatedness is not the cause of clustering
Using an identity-by-descent (IBD) analysis (MLE)
```{r}
set.seed(1) #should be 1000
ibd <- snpgdsIBDMLE(genofile, sample.id = NULL, snp.id=snpset.id,autosome.only = FALSE, remove.monosnp= TRUE, maf=0.05, missing.rate=0.05, kinship = TRUE, num.thread=2, verbose = TRUE)
ibd.coeff <- snpgdsIBDSelection(ibd)
plot(ibd.coeff$k0, ibd.coeff$k1, xlim=c(0,1), ylim=c(0,1), col=as.factor(ibd.coeff$ID2),
xlab="k0", ylab="k1", main="sample relatedness (MLE)")
lines(c(0,1), c(1,0), col="red", lty=2)
textxy(ibd.coeff$k0, ibd.coeff$k1, labs = ibd.coeff$ID2, cex = 0.45, pos = 3)
legend("left", legend=as.factor(ibd.coeff$ID2), cex = 0.4,pch="o", col=as.factor(ibd.coeff$ID2)) #legend optional
```
```
Identity-By-Descent analysis (MLE) on genotypes:
Excluding 4,735 SNPs (monomorphic: TRUE, MAF: 0.05, missing rate: 0.05)
Working space: 26 samples, 4,898 SNPs
using 2 (CPU) cores
MLE IBD: the sum of all selected genotypes (0,1,2) = 140365
MLE IBD: Wed Feb 12 11:25:10 2020 0%
MLE IBD: Wed Feb 12 11:25:11 2020 100%
```

**Fig8.** Our closest clustering samples LIC54 and LIC24 are not related.
The IBD coefficients table shows that the kinship coefficient does not exceed 0.10 for any pair of individuals. For example, the first row of the table:

|   | ID1  | ID2   | k0        | k1           | niter | **kinship**      |
|---|------|-------|-----------|--------------|-------|------------------|
| 1 | L155 | LIC11 | 0.9750412 | 1.077873e-05 | 135   | **0.0124767037** |
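The outline above (goal 3) also calls for a method-of-moments (MoM) relatedness check alongside the MLE run. A hedged sketch using SNPRelate's snpgdsIBDMoM with the same inputs (not run here; close agreement between the two estimators would strengthen the conclusion):
```
#PLINK method-of-moments IBD estimates on the pruned SNP set
ibd.mom <- snpgdsIBDMoM(genofile, sample.id = NULL, snp.id = snpset.id,
                        autosome.only = FALSE, maf = 0.05, missing.rate = 0.05,
                        num.thread = 2, verbose = TRUE)
ibd.mom.coeff <- snpgdsIBDSelection(ibd.mom)
plot(ibd.mom.coeff$k0, ibd.mom.coeff$k1, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "k0", ylab = "k1", main = "sample relatedness (MoM)")
lines(c(0, 1), c(1, 0), col = "red", lty = 2)
```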
## Next up:
1. Ask Kris W for code for the resistance surface model. Look at the new popgenomics package -- can we borrow a genetic map from there? Re-run the pipeline with a reference that has a genetic map, or ask Giulio whether we can generate a simple genetic map for lynx using the Bionano physical mapping data -- surely other genome teams also want access to a genetic map for their work.
2. Annotate SNPs
## Part 5: QUAST Quality Assessment Tool for Genomes
QUAST can evaluate assembly quality even without a reference genome, so that researchers can assess the quality of assemblies of new species that do not yet have a finished reference genome. In addition, QUAST is rather fast, and its most time-consuming steps are parallelized; therefore, it can be effectively run on multi-core processors.
Here, we are assessing the quality of the Canada lynx genome assembly compared with other felid genome assemblies available through NCBI:
* Acinonyx jubatus
* Felis catus
* Lynx canadensis
* Puma concolor
* Lynx pardinus
* Panthera onca
* Panthera leo
* Panthera tigris altaica
* Prionailurus bengalensis euptilurus

This is a "reference-free" comparison.
## Result:

**Figure 1.** A QUAST (Gurevich et al., 2013) Nx plot of contig lengths (Mbp) as a percent of total scaffolds for felid genome assemblies available through NCBI.
## Step 1. Set up our workspace
```
ssh tl50a@ghpcc06.umassrc.org
module load anaconda2/4.4.0
module load gatk
module load java/1.8.0_77
module load gcc/8.1.0
source activate QUAST
cd /project/uma_lisa_komoroske/Tanya/scripts/
#Install QUAST
wget https://downloads.sourceforge.net/project/quast/quast-5.0.2.tar.gz
tar -xzf quast-5.0.2.tar.gz
cd quast-5.0.2
```

To run QUAST in QUAST-LG mode, just add the `--large` option to your command.

WITHOUT a reference: in most assembly projects a prior reference is not available, and assembly quality assessment must rely on other sources of information. We use de novo, eukaryote-targeted completeness measures to make reference-free comparisons.

The default QUAST package does not include:
* GRIDSS (needed for structural variant detection)
* SILVA 16S rRNA database (needed for reference genome detection in metagenomic datasets)
* BUSCO tools and databases (needed for searching BUSCO genes) -- works in Linux only!

To be able to use those, please run:
```
quast-download-gridss
quast-download-silva
quast-download-busco
```
Some libraries failed to install, so we had to download the BUSCO databases manually to /home/tl50a/.conda/envs/QUAST/lib/python2.7/site-packages/quast_libs/busco:
```
https://busco-archive.ezlab.org/v2/datasets/bacteria_odb9.tar.gz
https://busco-archive.ezlab.org/v2/datasets/eukaryota_odb9.tar.gz
https://busco-archive.ezlab.org/v2/datasets/fungi_odb9.tar.gz
```
And unpacked them using:
```
tar -xzf eukaryota_odb9.tar.gz
```
## Usage
```
python /path/to/quast.py --large /path/to/genome/genome.fna.gz -o /path/to/output_directory
# location of quast.py: /home/tl50a/.conda/envs/QUAST/bin/quast.py
# location of the fasta assembly files:
# /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_genomic.fna.gz
# /project/uma_lisa_komoroske/Tanya/download/refs/LYPA_genomic.fna.gz
```
## Running on a single genome assembly
```
bsub -q short -W 1:00 -R rusage[mem=16000] -n 4 "python /home/tl50a/.conda/envs/QUAST/bin/quast.py --large /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_genomic.fna.gz -o /project/uma_lisa_komoroske/Tanya/scripts/QUAST"
```
## Running on multiple genomes
```
bsub -q short -W 4:00 -R rusage[mem=64000] -n 16 "python /home/tl50a/.conda/envs/QUAST/bin/quast.py --large /project/uma_lisa_komoroske/Tanya/download/refs/felidae/lynx_pardinus.fna.gz /project/uma_lisa_komoroske/Tanya/download/refs/felidae/lynx_canadensis.fna.gz /project/uma_lisa_komoroske/Tanya/download/refs/felidae/acinonyx_jubatus.fna.gz /project/uma_lisa_komoroske/Tanya/download/refs/felidae/felis_catus.fna.gz /project/uma_lisa_komoroske/Tanya/download/refs/felidae/felis_nigripes.fna.gz /project/uma_lisa_komoroske/Tanya/download/refs/felidae/panthera_leo.fna.gz /project/uma_lisa_komoroske/Tanya/download/refs/felidae/panthera_onca.fna.gz /project/uma_lisa_komoroske/Tanya/download/refs/felidae/panthera_pardus.fna.gz /project/uma_lisa_komoroske/Tanya/download/refs/felidae/panthera_tigris_altaica.fna.gz /project/uma_lisa_komoroske/Tanya/download/refs/felidae/prionailurus_bengalensis_euptilurus.fna.gz /project/uma_lisa_komoroske/Tanya/download/refs/felidae/puma_concolor.fna.gz -o /project/uma_lisa_komoroske/Tanya/scripts/QUAST/felidae2"
```
## Download outputs to your local computer/home directory for viewing
```
scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/scripts/QUAST/felids felids
```
## Notes
1) If you don't have any assembly but still want to perform read alignments (against the reference genome) and get some stats, you can "cheat" and specify the reference twice -- as a reference (-r option) and as an assembly (without option). So, the beginning of your running command will look like:
./quast.py -r ~/e/Planaria_10X/Fastq/For_10x_Denovo_Data/PG2103_03BE5/Fastq/longranger-indexed/tigmint/DjScaff_fnl20141213.fa ~/e/Planaria_10X/Fastq/For_10x_Denovo_Data/PG2103_03BE5/Fastq/longranger-indexed/tigmint/DjScaff_fnl20141213.fa -o ~/e/Planaria_10X/quast/
2) all of your reference files need to be in a single folder (e.g. all in felids) to run for comparison
3) Note: QUAST will NOT work on GenBank assemblies -- I think we fixed this. The Iberian lynx genome was from GenBank, and we had some "permissions" issues getting it to run.
## Note: DELETE FILES ON CLUSTER WHEN DONE WITH QUAST:
/home/tl50a/.conda/envs/QUAST/lib/python2.7/site-packages/quast_libs/busco
They're taking up a lot of space.