CRISPR Analysis
===
## Current Directions
* Cluster repeats + spacers
* Need to account for non-crispr direct repeats
* Make graph to illustrate heterogeneity
* other ways to quantify heterogeneity?
* Visualize alignments
* explore gap regions.
* Annotate cas genes.
## Possible Directions
Initial analysis based on Lizzy's data:
* focus on a single locus/repeat and figure out an analysis pipeline. Build pipeline to be able to analyze long reads using repeats already found by Lizzy.
* Steps for single locus:
* identify long reads with crispr arrays.
* extract spacers.
* map reads to assembled genome to verify origin.
* map spacers to metagenomic data to determine connections between spacers.
* perhaps use crisprDetect to use short reads for the specific repeat.
* visualize reads?
* Way to summarize data.
* Build pipeline to analyze all repeats found by Lizzy.
Find additional crispr arrays
* Determine good crispr finding programs that we will be able to use.
* read documentation to understand methods.
* Run crispr finder on pacbio data and moleculo data.
* Compare results to data that Lizzy has assembled.
* How does a high error rate affect results?
* Analyze data similar to moleculo reads and compare.
Metagenomic assembly of crispr array from short reads.
* Read about crisprDetect.
* Are there other similar tools?
* Do you feed in the repeats?
* construct network of transitions from one viral spacer to the next in the array. (Won't get more linkage info).
* Determine viral contigs in the metagenome
* Map viral spacers to contigs to identify viral contigs.
* Use viral contigs to determine if different spacers correspond to the same virus. Can use this info to refine long read analysis.
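A minimal sketch of the spacer transition network mentioned above, assuming each read or array has already been reduced to an ordered list of spacer IDs. The IDs (`s1`, `s2`, ...) and the function name are hypothetical, not taken from any tool's output.

```python
from collections import Counter


def spacer_transitions(arrays):
    """Count directed transitions between consecutive spacers.

    `arrays` is an iterable of spacer-ID lists, each in array order.
    Returns a Counter mapping (spacer, next_spacer) -> number of observations.
    """
    edges = Counter()
    for spacers in arrays:
        # Pair each spacer with the one that follows it in the same array.
        for current, following in zip(spacers, spacers[1:]):
            edges[(current, following)] += 1
    return edges


# Hypothetical spacer orders recovered from three reads.
reads = [
    ["s1", "s2", "s3"],
    ["s2", "s3", "s4"],
    ["s1", "s2", "s5"],
]
edges = spacer_transitions(reads)
for (src, dst), count in sorted(edges.items()):
    print(f"{src} -> {dst}: {count}")
```

A node with more than one outgoing edge (here `s2`) marks a point where arrays in the population diverge.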
## Questions
* Understand basic bio of CRISPR
* What is unique about crispr arrays and can be used for identification or verification?
* How do you define a single crispr locus?
* Size of crispr arrays, number of arrays per genome, different types of crisprs
* wild population data on crisprs?
* how variable are crispr regions in a population?
* is recombination of crispr arrays important? [paper](https://academic.oup.com/gbe/article/7/7/1925/631120/The-Contribution-of-Genetic-Recombination-to)
* Finding CRISPR arrays in data
* Have crispr arrays in metagenomic data been studied before? How did they do it?
* Methods to detect crisprs?
* For crispr-finding programs: how are outputs given? What about the directionality of the crispr-array? How can i check this?
* Data questions
* Do the reads need to be trimmed or filtered?
* Formulate research questions
* Can crisprs help identify pop structure or coalescence?
* Compare metagenomic detection of crispr to long reads.
## CRISPR basics
Repeat Spacer arrays:
* repeats: typically 28-37 bp, range 22 - 55
* some have palindromic regions (associated with hairpins) but not necessary
* spacers: typical 32-38 bp, range 22-70bp
* array size: usually <50
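The size ranges above could serve as a first sanity filter on candidate arrays. This is a rough sketch only: the function names are made up, and the bounds are simply the full ranges quoted in these notes, not cutoffs from any particular program.

```python
# Full ranges quoted in the notes above (bp); not cutoffs from any specific tool.
REPEAT_RANGE = (22, 55)
SPACER_RANGE = (22, 70)


def in_range(seq, bounds):
    """True if the sequence length lies within the inclusive bounds."""
    low, high = bounds
    return low <= len(seq) <= high


def plausible_array(repeats, spacers):
    """True if every repeat and spacer length falls in the quoted ranges."""
    return (all(in_range(r, REPEAT_RANGE) for r in repeats)
            and all(in_range(s, SPACER_RANGE) for s in spacers))


# Hypothetical sequences, used only to exercise the length checks.
print(plausible_array(["A" * 30] * 3, ["C" * 35] * 2))
print(plausible_array(["A" * 30] * 2, ["C" * 90]))
```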
Cas genes:
* encode for proteins that cleave foreign dna and incorporate it into the array
* flank the repeat-spacer array
* many different types of crispr/cas systems. Distinguished by a signature protein or by phylogeny of cas1 protein.
* organisms can contain multiple such systems
* cas 1 and cas2 are universal across system types
Spacer acquisition:
* cas1 + cas2 form complex that incorporates protospacer DNA into the array.
* New spacers added to leader end of the array
* Is this the 3' or 5' end? What does that mean??
* Protospacers (sequence in foreign DNA that will become a spacer) often appear next to PAMs (protospacer adjacent motif) which are important to excising the spacer sequence.
* imp for crispr/cas system type I and II but not necessarily III
* PAMs are 3-5bp in length. PAM may be evolutionarily linked to cas1 and leader sequence.
* In type I-E system, first repeat is copied and then spacer is inserted between repeats.
* In one species, one crispr locus has been found to incorporate spacers randomly throughout the array. [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3468723/)
* some crispr loci contain many spacers corresponding to the same phage. Spacers already targeting a phage (even with mismatches) 'prime' the acquisition machinery. Mechanism not well understood. [paper](https://www.nature.com/articles/ncomms1937)
*
## CRISPR analysis background
Banfield papers
* [Metagenomic reconstructions of bacterial CRISPR loci constrain population histories](https://www.nature.com/ismej/journal/v10/n4/full/ismej2015162a.html), Sun 2016
* [New CRISPR–Cas systems from uncultivated microbes](https://www.nature.com/nature/journal/v542/n7640/abs/nature21059.html), Burstein 2017
* [Rapidly evolving CRISPRs implicated in acquired resistance of microorganisms to viruses](http://onlinelibrary.wiley.com/doi/10.1111/j.1462-2920.2007.01444.x/full), Tyson + Banfield 2007
* [Paez-Espino 2013](https://www.nature.com/articles/ncomms2440), a few spacers (derived from virus) dominate during infection over evo experiment.
Examples of analysis
* [Rho 2012](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1002441#s4) Diverse CRISPRs Evolving in Human Microbiomes
* [Sorokin 2010](http://aem.asm.org/content/76/7/2136.short), evo dynamics of crispr for ocean metagenome.
* [Gogleva 2014](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-15-202), analysis of crispr cassettes in human gut microbiome
* Using crispr to differentiate otherwise clonal individuals in a clinal setting.
*
Reviews
* [Westra 2016](http://www.annualreviews.org/doi/abs/10.1146/annurev-ecolsys-121415-032428) review of evo and eco of crispr
Other papers
* Viral Diversity Threshold for Adaptive Immunity in Prokaryotes. [paper](http://mbio.asm.org/content/3/6/e00456-12.short)
* Dealing with the Evolutionary Downside of CRISPR Immunity: Bacteria and Beneficial Plasmids [paper](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003844)
* Study of sequence similarity between crispr repeats. [paper](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2007-8-4-r61)
* CRISPR loci diversity seen in Ecoli [paper](http://mic.microbiologyresearch.org/content/journal/micro/10.1099/mic.0.036046-0#tab2)
## Finding CRISPR arrays
List of Programs:
* [CRF(CRISPR Finder by Random Forest)](https://peerj.com/articles/3219/)
* [CRISPRdigger](https://www.nature.com/articles/srep32942?WT.feed_name=subjects_genetics) -- more for genomes. [Project](https://github.com/greyspring/CRISPRdigger)
* [CRISPR detection from overlap reads](http://online.liebertpub.com/doi/abs/10.1089/cmb.2015.0226) -- no code available
* [CRISPR Recognition Tool (CRT)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-209)
* [Crass](https://academic.oup.com/nar/article/41/10/e105/1074020/Crass-identification-and-reconstruction-of-CRISPR)
* [CrisprDetect](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2627-0)
* MetaCrast
* Minced
CRISPR databases
* [crispr db 2007](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-172)
### CrisprDetect [paper](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2627-0) [project](http://brownlabtools.otago.ac.nz/CRISPRDetect/predict_crispr_array.html)
Overview:
* finds crispr arrays, corrects the direction of arrays, and annotates different types of sequence variations (e.g. insertion/deletion) in near identical repeats.
* claims to account better for crispr biology: e.g. mutations or indels can occur in repeat and be propagated forward in the array so important to account for direction.
* repeats at trailing end (older) can accumulate mutations which means that this part of the array may end up going undetected.
### CRASS [paper](https://academic.oup.com/nar/article/41/10/e105/1074020/Crass-identification-and-reconstruction-of-CRISPR) [project](http://ctskennerton.github.io/crass/)
Overview:
* locates individual reads that contain direct repeats (DRs) and clusters them together based on DR type.
* reconstruct CRISPR arrays using a graph approach
* can output assembled contigs of individual strains using external assembly software such as Velvet.
Possible Issues:
* Finding crisprs relies on exact matches.
* Can accommodate some seq errors in the graph construction during the clean up phase.
* clusters similar DR types into single DR but not sure how exactly this is done.
* Does not use quality scores. This may mean that reads should be quality filtered.
* Does not use paired-end info
Algorithm:
* Short Read algorithm (<178bp):
* looks for two copies of a kmer (k=length of DR length) in the same read separated by an appropriate distance.
* how well does this handle seq errors?
* long read algorithm (>178bp):
* similar to CRT, searches for kmers (default 8) separated by approp distance, then extends search.
* QC of DR. Probably cutoffs based on distributions from crispr db for:
* based on size of repeats + spacers.
* similarity of spacer-spacer and spacer-repeat
* Clusters DRs together if one shares six 7-mers with another DR or is a substring of the other(?)
* searches for reads with single copy of DR.
* Graph construction:
* use first and last k bases to set up graph. "inner edges" connect first to last bases and "outer edges" connect spacer to next spacer.
* Initial graph is cleaned up by removing hanging edges. Final result is a graph of only the outer edges.
CrisprDetect
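The DR clustering rule in the algorithm notes above is only tentatively understood, so the following is a sketch of that reading (six shared 7-mers, or one DR being a substring of the other) and not a reimplementation of Crass. The function names are hypothetical; the test sequence is the shared stretch recorded in the Clustering Repeats notes, with one base changed by hand.

```python
def kmers(seq, k=7):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def same_dr_cluster(dr_a, dr_b, k=7, min_shared=6):
    """Tentative rule: same cluster if one DR contains the other,
    or if the two DRs share at least `min_shared` distinct k-mers."""
    if dr_a in dr_b or dr_b in dr_a:
        return True
    return len(kmers(dr_a, k) & kmers(dr_b, k)) >= min_shared


dr = "TGACCCGATCTAAAGGGGATTAAGAC"
dr_variant = "TGACCCGATCTAAAGGGGATTAAGTC"  # one substitution near the 3' end
unrelated = "ACACACACACACACACACACACAC"     # made-up low-complexity sequence

print(same_dr_cluster(dr, dr_variant))
print(same_dr_cluster(dr, unrelated))
```

Because a single mismatch only disrupts the 7-mers that overlap it, this kind of rule tolerates a few sequencing errors per repeat.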
## Data
[github overview](https://github.com/rhine3/pink-berries/blob/master/data-guide.md)
Bioinformatics guides
* [Biopython documentation](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc77)
### Audra's files
`moleculo_reads_analysis`
* `crispr_containing_reads.txt`, `unique_crispr_containing_reads.txt`: files containing
* Do these have the same reads? Just sorted?
* Are these combined from multiple programs?
* are results for all the data or only for `subsample_random.fa`?
* `out_allSpacers.fa`: spacers of arrays labeled as "array0spacer1" for example.
* matches data lizzy previously sent me.
* contains some duplicates. E.g. array0 and array47 are likely part of the same array.
## Finding reads with crisprs.
**BLASTn approach**
Treat the fasta file of moleculo reads as a BLAST database and search for sequences with regions of similarity to known crispr repeats.
To make database of fasta file:
```makeblastdb -in moleculo.fasta -parse_seqids -dbtype nucl```
To run Blast for a set of sequences in `repeats.fasta` and put results into `blast_results`:
```blastn -db moleculo.fasta -query repeats.fasta -out blast_results```
Options for BLAST (found [here](https://www.ncbi.nlm.nih.gov/books/NBK279675/)):
* `-task "blastn-short"` for short query sequences (<30bp)
* can change word size (length of initial exact match), reward/penalty for matches/mismatches, percent identity cutoff
This method can identify reads containing the crispr repeats.
* How can I check that results make sense?
* How can I make a pipeline to get data into a nice format?
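One possible way to get the results into a usable format: a sketch that assumes the `blastn` command above is rerun with `-outfmt 6` added, which writes tab-separated hits with the standard 12 columns. The thresholds (`max_evalue`, `min_identity`) and the example rows are placeholders, not values from the actual data.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                  "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def reads_with_hits(handle, max_evalue=1e-5, min_identity=90.0):
    """Map each repeat (BLAST query) to the set of reads (BLAST subjects) it hits.

    Thresholds are illustrative defaults and would need tuning on real data.
    """
    hits = {}
    reader = csv.DictReader(handle, fieldnames=BLAST6_COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["evalue"]) <= max_evalue and float(row["pident"]) >= min_identity:
            hits.setdefault(row["qseqid"], set()).add(row["sseqid"])
    return hits


# Made-up rows in -outfmt 6 layout, standing in for the real results file.
example = io.StringIO(
    "rep1\tread_A\t100.000\t30\t0\t0\t1\t30\t500\t529\t2e-10\t56.5\n"
    "rep1\tread_A\t96.667\t30\t1\t0\t1\t30\t566\t595\t1e-08\t50.9\n"
    "rep1\tread_B\t85.000\t20\t3\t0\t1\t20\t10\t29\t0.5\t20.1\n"
)
hits = reads_with_hits(example)
print(hits)
```

On a real file the handle would come from `open("blast_results", newline="")`. Counting the hits per read (several repeat copies at regular spacing) is one way to check that a hit reflects an array and not a chance match.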
## Alignment
* Tried aligning with different alignment tools. The output aligned some reads well with each other but failed to align the crispr regions for most of the reads, even when a high-scoring local alignment could be achieved.
## Clustering Repeats
* found that some reads look to be likely repeats, e.g. 'mol-32-18d0-021982'. But these reads have quite short arrays.
* Interestingly found that two long arrays have repeats that have a stretch of identical base pairs up to a few mutations.
* would be interesting to see if these arrays map to the same region of the genome.
* sequence: "TGACCCGATCTAAAGGGGATTAAGAC". Perhaps blast against all moleculo sequences.
* Did not find any additional repeats besides the ones used.
Blast approach:
1. Blast all repeats and spacers against each other.
2. Collapse repeats together. Take the repeat of the longest array as the official repeat.
* if any repeats are a reverse complement, then reverse array and change 's150' to 's150rev'.
* Algorithm
1. Find valid pairs. Criteria: evalue<1e-8 and aligned_length > .9 x repeat_length
3. Collapse spacers together. Take the spacer of the longest array as the official spacer.
* correct any reversals
4. Check for lingering reversals. If any left, simply make a new dictionary element with the reverse complement sequence.
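A sketch of steps 1 and 2 of the Blast approach above, using the stated criteria (evalue < 1e-8 and aligned length > 0.9 x repeat length) and a union-find to collapse repeats. Assumptions: `repeat_length` is taken to be the query repeat's length, the names (`r1`, ...) and numbers are invented, and orientation handling (the 's150rev' renaming) is left out.

```python
def valid_pair(evalue, aligned_length, repeat_length):
    """Criteria from the notes: evalue < 1e-8 and aligned_length > 0.9 x repeat_length."""
    return evalue < 1e-8 and aligned_length > 0.9 * repeat_length


def collapse(repeats, array_sizes, hits):
    """Map every repeat name to the 'official' repeat of its cluster.

    repeats:     name -> repeat sequence
    array_sizes: name -> number of repeats in that array
    hits:        iterable of (query, subject, evalue, aligned_length)
    """
    parent = {name: name for name in repeats}

    def find(name):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for query, subject, evalue, aligned_length in hits:
        if valid_pair(evalue, aligned_length, len(repeats[query])):
            parent[find(query)] = find(subject)

    clusters = {}
    for name in repeats:
        clusters.setdefault(find(name), []).append(name)

    official = {}
    for members in clusters.values():
        # The repeat of the longest array becomes the official repeat.
        representative = max(members, key=lambda m: array_sizes[m])
        for member in members:
            official[member] = representative
    return official


# Invented example: four 30 bp repeats and their pairwise BLAST hits.
repeats = {name: "A" * 30 for name in ("r1", "r2", "r3", "r4")}
array_sizes = {"r1": 5, "r2": 12, "r3": 8, "r4": 20}
hits = [
    ("r1", "r2", 1e-12, 30),  # valid
    ("r2", "r3", 1e-10, 29),  # valid: 29 > 27
    ("r3", "r4", 1e-3, 30),   # rejected: evalue too high
    ("r4", "r1", 1e-12, 20),  # rejected: alignment too short
]
official = collapse(repeats, array_sizes, hits)
print(official)
```

The same function could be reused for step 3 by passing spacers in place of repeats.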
## Pipeline
1. Get long reads of interest
* Identify long reads with crisprs using blastn and the given repeat sequences.
* Make fasta file with reads
2. Use CRISPRDetect to extract repeat and spacers.
3. Process CRISPRDetect output
1.
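For the "Make fasta file with reads" part of step 1, a standard-library sketch that copies only the reads of interest into a new FASTA file. The function names and read IDs are hypothetical; Biopython's `SeqIO.parse` would do the same job if it is already installed.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_selected(in_handle, out_handle, wanted_ids):
    """Copy records whose ID (first word of the header) is in wanted_ids.

    Returns the number of records written.
    """
    kept = 0
    for header, seq in read_fasta(in_handle):
        read_id = header.split()[0] if header.strip() else ""
        if read_id in wanted_ids:
            out_handle.write(f">{header}\n{seq}\n")
            kept += 1
    return kept


# In-memory stand-ins for the moleculo FASTA and the output file.
source = io.StringIO(">read_A some description\nACGT\nACGT\n>read_B\nTTTT\n")
selected = io.StringIO()
kept = write_selected(source, selected, {"read_A"})
print(kept)
print(selected.getvalue())
```

The resulting file is what would be handed to CRISPRDetect in step 2.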
## References
### [Andersson + Banfield 2008](http://science.sciencemag.org/content/320/5879/1047)
* Goal: use crispr spacers to fish out viral dna from metagenomes and possibly reconstruct whole virus genomes.
* Sample from biofilm growing in acid mine drainage
* Steps of analysis:
* Found crispr repeat sequences in contigs.
* CRISPR-bearing reads were identified as those having a 20-mer sequence repeated at least three times, with a periodicity of 40 to 82 bp, with unique 8-mer sequences upstream each 20-mer (to avoid direct repeats)
* CRISPR repeat was defined by first assigning it to the 20-mer with highest frequency on the read, and then, in steps of one, increasing the oligomer size and assigning the repeat to the oligomer with highest frequency, until the frequency dropped below 80% of the most frequent 20-mer (80% to allow for some degeneracy in the repeat sequences).
* Found spacer-containing non-crispr (SNC) contigs of putative viral origin.
* Clustered SNC contigs based on 4-mer frequencies. Used these clusters to group SNC contigs into viral populations.
* Could identify SNPs within a viral population. Linkage disequilibrium decayed quickly over span of 25bp. (How good are frequencies inferred by analysis?)
* Claim that spacers often target more conserved regions of the viral genomes.
* Could this be artefact of the analysis?
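The read-level detection rule described above (a 20-mer occurring at least three times with a periodicity of 40 to 82 bp) can be sketched as below. This is an approximation, not the paper's code: it only looks at consecutive occurrences of each 20-mer, and it omits both the unique upstream 8-mer check and the repeat-extension step. The repeat sequence and function names are made up.

```python
import random
from collections import defaultdict


def has_periodic_kmer(read, k=20, min_copies=3, min_period=40, max_period=82):
    """True if some k-mer occurs at least `min_copies` times in a row with
    consecutive start positions spaced min_period..max_period bp apart."""
    positions = defaultdict(list)
    for i in range(len(read) - k + 1):
        positions[read[i:i + k]].append(i)
    for starts in positions.values():
        run = 1  # length of the current chain of well-spaced occurrences
        for prev, cur in zip(starts, starts[1:]):
            if min_period <= cur - prev <= max_period:
                run += 1
                if run >= min_copies:
                    return True
            else:
                run = 1
    return False


# Synthetic test data: a made-up 28 bp repeat and random 35 bp spacers.
rng = random.Random(0)
repeat = "GTTTCAGACGAACCCTTGTGGGATTGAA"
spacers = ["".join(rng.choice("ACGT") for _ in range(35)) for _ in range(3)]

array_read = "".join(repeat + spacer for spacer in spacers)       # 3 repeat copies
short_read = "".join(repeat + spacer for spacer in spacers[:2])   # 2 repeat copies

print(has_periodic_kmer(array_read))
print(has_periodic_kmer(short_read))
```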
### [Sun + Banfield 2016](https://www.nature.com/ismej/journal/v10/n4/full/ismej2015162a.html)
* The crRNA silencing requires identity with targeted sequences, and immunity may be lost by mutation in either the target region or an associated proto-spacer adjacent motif (PAM) that is required for CRISPR function
* Proto-spacers are the regions in phage and plasmid sequences flanked by PAMs that give rise to spacer sequences during adaptation; they also refer to the regions targeted by the CRISPR spacers during interference.
* studies in a certain type of CRISPR-Cas system have shown that mutations in the proto-spacer, nearest the PAM, allow the phage to escape, whereas mutations in other regions of the proto-spacer have no impact on immunity
* Approach to identifying crispr-cas system followed [Makarova 2011](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3380444/): found cas genes and compared to phylogenetic tree to determine which cas system it is.
* Sampling: targeted regions immediately next to repeat-spacer array for PCR amplification. PCR fragments were not size selected so possibility of preferential sequencing of smaller fragments.
* Trimmed and filtered reads because analysis is at read level and not on assembled data.
* After clustering (needed because of seq errors) identified a total of 3933 unique groups from Leptospirillum group II and 296 unique groups from Leptospirillum group III across all samples
* rarefaction curves show no saturation, suggesting high spacer diversity. Sampled ~400k spacers.
* locus reconstruction: align sequences of spacers. Sequence of spacers at the older end of crispr loci were well conserved. Some sequences shared across crispr loci, but with some locus-specific spacers.
* found possible loci recombination: individuals with spacers in a sequence corresponding to one variant followed by a different variant's sequence.
* detected PAM sequences in viral dna next to protospacers.
* Matches to non-crispr, non-genome dna in samples (corresponds to viral origin):
* spacers in trailer end (older) are enriched for imperfect matches without PAM (since viruses have likely evolved since then).
* spacers in leader end have greater amount of perfect matches with PAM
* divide spacers into new, older, oldest. New and older spacers in same locus tend to target the same phage population.
* For one viral population with a reconstructed genome, they found 1400 matches to PAM. This represents max number of spacers (up to complications due to rapid sequence evolution)
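To make the rarefaction point above concrete, a small sketch of how such a curve could be computed from spacer cluster labels. It uses a single random ordering with a fixed seed, whereas a real analysis would average over many orderings. The labels and counts are invented and have nothing to do with the paper's data.

```python
import random


def rarefaction_curve(spacer_labels, step, seed=0):
    """Return (n_sampled, n_unique) points from one random ordering of spacers."""
    rng = random.Random(seed)
    shuffled = list(spacer_labels)
    rng.shuffle(shuffled)
    seen = set()
    curve = []
    for i, label in enumerate(shuffled, start=1):
        seen.add(label)
        # Record a point every `step` spacers and at the very end.
        if i % step == 0 or i == len(shuffled):
            curve.append((i, len(seen)))
    return curve


# Invented cluster labels: 100 sampled spacers falling into 4 clusters.
labels = ["c1"] * 50 + ["c2"] * 30 + ["c3"] * 15 + ["c4"] * 5
curve = rarefaction_curve(labels, step=10)
for n_sampled, n_unique in curve:
    print(n_sampled, n_unique)
```

A curve that keeps climbing at the full sample size, as reported in the paper, means many spacer clusters remain unsampled.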