---
tags: NCCUProject
---

# BlastFrost/Barcodes Figures

## BlastFrost detail

```
Figure 2, 3 in BlastFrost
```

1. BlastFrost can account for potential gaps in the k-mer hits by extending the k-mer hit results, and it produces a subgraph for each successful k-mer query.
2. BlastFrost greedily completes a path by traversing non-branching paths of the same color within the graph.
3. The subgraph is then used to reconstruct the corresponding sub-sequence of each color from the path in the Bifrost graph.
4. The authors test the accuracy of BlastFrost in detecting sequence variants of a large number of query sequences in a large number of related genomes.

## Fig 2 in BlastFrost

- Indexing by building Bifrost graph
  Bifrost created a graph of 926 representative Salmonella genomes in less than 24 min and required less than 5 GB of memory.
- Bifrost graph info
  The graph occupies 2.3 GB of disk space and contains more than 33 million unitigs.
- Querying with BlastFrost method
  We ran BlastFrost (parameter d = 1 to support inexact searches) on the Bifrost graph with 21,065 query sequences, consisting of one representative allele for each locus, and extracted all allelic variants from the corresponding subgraphs.
- Query Standard:
  We bin query hits by the nucleotide identity between the query and the EnteroBase allele, or by the nucleotide identity between the query and the search result if an allele is not stored in EnteroBase.

---

```
Figure 4, 5 in Barcode
```

## Figure 4 - Barcodes in feature space

Q: Do different classes of genomes have unique characteristics in their barcodes?

A: Yes. Figure 4 plots genomes using two features: one measures the overall frequency variation for all 4-mers across the genome's barcode, and the other measures the overall similarity level among all the M-bp fragments of the genome, each considered as a vector of 4-mer frequencies.

[ Detail ]
- X-axis: the variation of each 4-mer's frequency across the whole genome, averaged over all 4-mers.
- Y-axis: the similarity level among all 1000-bp partitioned fragments of the genome, each represented as a 136-dimensional vector of 4-mer frequencies.

[ Summary ]

Figure 4 suggests that barcodes also capture a higher-level similarity beyond individual genome sequence similarities through the textures of their images, which are the common and unique characteristics of different classes of genomes. Barcodes are not just a simple visualization tool; they capture some fairly basic information about genomes!

From an application point of view, we believe this feature will prove useful for metagenome analyses, as fragments from different classes of genomes, such as eukaryotes, prokaryotes or different organelle genomes, have different characteristics in their barcode images.

![](https://i.imgur.com/4Q8rmI8.png)

P.S. The barcode analyses in this paper are mainly based on data from prokaryotes.

---

## Figure 5 - Distribution of ratios between barcode variations of all prokaryotic genomes and their corresponding randomly generated nucleotide sequences

Q: Do all nucleotide sequences have the barcode property that genome sequences have?

A: No. As shown in Figure 5, barcodes for genomes and for randomly generated nucleotide sequences have different characteristics.

[ Detail ]
- barcode variation: standard deviation of the list of the averaged frequencies of all the k-mers along the genome
- corresponding random nucleotide sequence: a random sequence of the same length and with the same mono-nucleotide frequencies as the genome, generated using a zeroth-order Markov chain model

![](https://i.imgur.com/2HugpQs.png)

---

## 明則 vs barcode, genome similarity

barcode:
![](https://i.imgur.com/VE8qOPa.png)

mm10:
![](https://i.imgur.com/znJngUO.png)
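
As a rough illustration of the barcode quantities in the Figure 4 and Figure 5 notes above, here is a minimal Python sketch. It is not the paper's code. The function names (`barcode`, `barcode_variation`, `random_like`, `variation_ratio`) are hypothetical. It assumes the 136 dimensions come from merging each 4-mer with its reverse complement, and it reads "barcode variation" as the per-4-mer standard deviation across fragments, averaged over all 4-mers; the notes allow other readings.

```python
import random
import statistics
from itertools import product

K = 4
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])

# Assumption: 256 4-mers collapse to 136 once reverse complements are merged.
KMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=K)})
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def barcode(seq, fragment_len=1000):
    """One row of 4-mer frequencies per non-overlapping fragment of the sequence."""
    rows = []
    for start in range(0, len(seq) - fragment_len + 1, fragment_len):
        fragment = seq[start:start + fragment_len]
        counts = [0] * len(KMERS)
        for i in range(len(fragment) - K + 1):
            col = INDEX.get(canonical(fragment[i:i + K]))
            if col is not None:  # skip k-mers containing non-ACGT symbols
                counts[col] += 1
        total = sum(counts)
        rows.append([c / total if total else 0.0 for c in counts])
    return rows

def barcode_variation(rows):
    """Assumed reading: std of each 4-mer's frequency across fragments, averaged."""
    return statistics.fmean([statistics.pstdev(col) for col in zip(*rows)])

def random_like(seq, rng):
    """Zeroth-order Markov sequence: same length, bases drawn independently
    with the input's mono-nucleotide frequencies."""
    weights = [seq.count(base) for base in "ACGT"]
    return "".join(rng.choices("ACGT", weights=weights, k=len(seq)))

def variation_ratio(seq, rng):
    """Barcode variation of a sequence relative to its random counterpart."""
    return barcode_variation(barcode(seq)) / barcode_variation(barcode(random_like(seq, rng)))

# Toy usage on a synthetic sequence (not real genome data).
rng = random.Random(0)
toy = "".join(rng.choices("ACGT", k=5000))
print(len(KMERS), len(barcode(toy)), variation_ratio(toy, rng))
```

On a real genome, `variation_ratio` would be computed once per genome and the resulting ratios collected into a histogram, which is the kind of distribution Figure 5 describes.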