Amplicon visualization prototypes

---
tags: GeneLab
title: Amplicon visualization prototypes
---

# Amplicon visualization prototypes

[toc]

Just a page to help serve as a checklist of what we've done so far and what we have planned

---

## Bar plots of some basic summary stats - static

- [x] Number of starting reads
    - This is the “starting_reads” column in the “*read-count-tracking.tsv” file
- [x] Number of quality-filtered reads
    - This is the “dada2_filtered” column in the “*read-count-tracking.tsv” file
- [x] Number of unique ASVs/OTUs
    - each unique one of these sequences may have been seen in many identical copies, so here we are just counting how many unique sequences (ASVs) were recovered
    - this info isn’t explicitly captured, but it can be derived from the “*-taxonomy-and-counts.tsv” table
        - the total number of unique ASVs across the whole dataset would be the number of rows in that table (not counting the header)
        - the total number of unique ASVs in a given sample would be the number of rows for which the count value is greater than 0 for that sample (i.e., that unique ASV was detected in some abundance in that sample)
        - I don’t know which of these two ways would be better: one is a view of each individual sample, the other is a view of the dataset as a whole; this might be a case where we should just do both
- [x] Total counts of ASVs/OTUs retained
    - this would be the sum of all the copies of all the unique ASVs/OTUs
        - e.g., ASV_1 (which again represents some exact DNA sequence) only counts as 1 unique ASV, but it might have been observed 100 times. So in that case we’re talking about 1 unique ASV, but the “counts” for that ASV would be 100
    - this can also be done per sample or across all

![](https://i.imgur.com/NzH3r9Y.png)
This is across all samples (interactive version)

---

## Taxonomic summaries

> Quick note on taxonomy: any individual, unique ASV may or may not be successfully taxonomically classified (this depends on the method, the reference database, and the ASV sequence of course).
> Any given ASV may be classified down to the genus level (which is a relatively more specific taxonomic designation), while another might only be classified to the domain level (a much less specific taxonomic designation). Also, taxonomy is arbitrary and highly flawed, but still very practically useful and not inherently evil. We just shouldn’t trust anything we get as the “right” answer, rather it is what we get when we do what we did – which still has a lot of practical utility :)

- [x] Number of unique ASVs classified at specific taxonomic ranks
    - Nick shared an example of this in his suggestions document (attached in the initial email), it is this figure 1D he points to:
        - <img src="https://i.imgur.com/gbJo9du.png" width="50%">
    - This was done across the whole dataset, here for the ranks of Family, Genus, and Species (and including those not classified to those levels). In our case, for a general view, we’d probably want to start at Domain and go down to Genus, so Domain, Phylum, Class, Order, Family, Genus, and Unclassified (which would be everything not classified even to the Domain level). We don’t classify down to the species level (because it’s frequently impossible to have that level of resolution with the short fragment of DNA we are working with as our amplicon). Also don’t mind that that paper refers to them as “ESVs” (exact sequence variants), instead of ASVs (amplicon sequence variants). There’s a funny story behind that I’ll share if wanted/remembered when we chat next – but yea, terminology is a mess
- [x] Counts of the copies of all the unique ASVs that were detected at specific taxonomic ranks
    - Another view Nick points to is figure 1C from that same paper:
        - <img src="https://i.imgur.com/FDFiTyU.png" width="50%">
    - Here we are again talking about total counts of each unique ASV that was successfully classified to a particular taxonomic rank. Same as above in that we should do it for all ranks down to genus.
- [x] Phylum and class-level plots
    - [ ] individual samples depicted
- [ ] Sunburst plots

**07/20 UPDATE:** I found a colorblind palette that I think looks nice. Below are two examples using only this palette. The threshold is now only on the deepest level (it's actually set to 0 on these examples). **Is there anything else that needs to be modified on those plots?**

![](https://i.imgur.com/9V70nEl.png)
![](https://i.imgur.com/4joq8q5.png)

*Old remarks*

![](https://i.imgur.com/xtQNCoj.png)
Sunburst plot with the 'parent' way to present the data, shows up to class, threshold on class at 95%
--> Only shows unique ASV, doesn't handle possible duplicates if same subcat name but different parents

![](https://i.imgur.com/SAcvHT1.png)
![](https://i.imgur.com/OYxBPQs.png)
Sunburst plots with the 'path' way to present the data, shows up to class, threshold on class at 95%, unique ASVs (up) and total counts (down)
--> Colors not great (HOW SHOULD WE HANDLE COLORS?), handles duplicates, possibility to count total and unique ASVs with simple boolean, separates 'Unclassified' data and 'Others' (= below threshold) (GOOD IDEA?)
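The two rank-level summaries in this section (unique ASVs per rank, and total counts per rank) could be sketched roughly like this in plain Python. This is only an illustration under assumptions: the rank column names (`domain` through `genus`), the `ASV_ID` column, the `NA` marker for an unclassified rank, and the toy table are all hypothetical, and the real “*-taxonomy-and-counts.tsv” layout may differ. It also assumes each ASV is tallied at its deepest classified rank, which is one of several ways to read the figures above.

```python
import csv
import io

# Assumed rank columns, ordered from least to most specific.
RANKS = ["domain", "phylum", "class", "order", "family", "genus"]

# Toy stand-in for a "*-taxonomy-and-counts.tsv" table (hypothetical layout).
TOY_TSV = """ASV_ID\tdomain\tphylum\tclass\torder\tfamily\tgenus\tS1\tS2
ASV_1\tBacteria\tProteobacteria\tGammaproteobacteria\tEnterobacterales\tEnterobacteriaceae\tEscherichia\t100\t0
ASV_2\tBacteria\tFirmicutes\tBacilli\tNA\tNA\tNA\t5\t7
ASV_3\tNA\tNA\tNA\tNA\tNA\tNA\t0\t3
"""


def deepest_rank(row):
    """Return the most specific rank with a classification, or 'Unclassified'."""
    deepest = "Unclassified"
    for rank in RANKS:
        if row[rank] in ("", "NA"):
            break
        deepest = rank
    return deepest


def summarize_ranks(tsv_text):
    """Tally unique ASVs and total counts by deepest classified rank."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Every column that is not the ID or a rank is treated as a sample.
    sample_cols = [c for c in reader.fieldnames
                   if c != "ASV_ID" and c not in RANKS]
    unique_asvs = {rank: 0 for rank in RANKS + ["Unclassified"]}
    total_counts = dict(unique_asvs)
    for row in reader:
        rank = deepest_rank(row)
        unique_asvs[rank] += 1  # one row = one unique ASV
        total_counts[rank] += sum(int(row[s]) for s in sample_cols)
    return unique_asvs, total_counts


unique_asvs, total_counts = summarize_ranks(TOY_TSV)
print(unique_asvs)
print(total_counts)
```

Setting a single boolean to switch between `unique_asvs` and `total_counts` would mirror the "count total and unique ASVs" option mentioned for the 'path' sunburst.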
---

## Beta diversity

> See [here](https://astrobiomike.github.io/amplicon/dada2_workflow_ex#beta-diversity) for some text on what beta diversity means in this context

- [ ] ordination based on a normalized count table
    - we can just normalize by percent to start, but then work in a more appropriate one (maybe median ratio, i have an implementation in python already in [this script](https://github.com/AstrobioMike/bit/blob/master/bit-normalize-table) starting at [line 84](https://github.com/AstrobioMike/bit/blob/696a8f1287969b9a2b5900ad3c5aa65a08429902/bit-normalize-table#L84))
    - could get euclidean distances of normalized table then make a PCoA (like done in my amplicon tutorial [here](https://astrobiomike.github.io/amplicon/dada2_workflow_ex#beta-diversity), though that code is in R)

**07/20 UPDATE:** Here is the new version of the PCoA plot (compared to the old one below). The background is now white, the points are colored depending on where they were used, and the shape of the symbol depends on when they were.

![](https://i.imgur.com/aCkrA6C.png)

*Previous version*

![](https://i.imgur.com/LR3AynL.png)

- [ ] hierarchical clustering
    - could get euclidean distances of normalized table like above, then do Ward.D2 clustering

**07/20 UPDATE:** I have not tested the `to_tree` conversion on the data yet (to create a Plotly graph but with a Sklearn clustering, which we trust more). I will soon and update this!

---

## Alpha diversity

> See [here](https://astrobiomike.github.io/amplicon/dada2_workflow_ex#alpha-diversity) for some text on what alpha diversity means in this context

- [ ] shannon diversity

**07/20 UPDATE:** Here are 3 Shannon diversity plots. The first one is just the index value for each sample, with the same color/symbol method as for the PCoA plot.

![](https://i.imgur.com/Od3QvU7.png)

The second and third ones are sort of equivalent to the second one on your website: the samples are grouped by 'category'.
**I wasn't sure which category was the most relevant but it's really easy to change.** On the first one I kept the symbol changing based on the 'time' factor value, it may not be necessary. I feel like the legend may be a bit busy, I can use fewer colors and also not show the names of the samples. Maybe changing colors based on time, always with the same symbol shape, would be better. What do you think? It would look more like the third plot, where I grouped the samples based on when rather than where.

![](https://i.imgur.com/LD5aLEm.png)
![](https://i.imgur.com/EdqL00B.png)

- [ ] inverse simpson
    - not sure the best way to visualize these, maybe points with error bars (which are not a part of the examples i have down from [here](https://astrobiomike.github.io/amplicon/dada2_workflow_ex#alpha-diversity) a little bit)

**07/20 UPDATE:** For the inverse Simpson plot, the remarks are the same as for the Shannon one. I can play with colors/shapes in the same way. One 'concern' I have is I didn't find a function returning the inverse Simpson index ($\frac{1}{\sum_i p_i^2}$) so I cheated and used the Simpson index ($1 - \sum_i p_i^2$) and then I calculated the inverse by hand (checking that there is no division by 0). It may not be the cleanest way to do it though...

![](https://i.imgur.com/T1LQ04l.png)

*Old*
Ideas: https://docs.onecodex.com/en/articles/4136553-alpha-diversity

---

## Questions & remarks

- In your website, you plotted Chao1 and Shannon next to each other. Do you want me to do the same with Shannon and inv. Simpson (= have both graphs on a single plot)? Or is it not necessary since the user will be able to have them next to each other on the dashboard?
- Would you prefer dotted lines for the grid? Or any layout change?
- I am still working on the Django app, I hope to have a first example ready by the end of the week, I'll keep you updated.
- I am also going to put my code on Github (I completely forgot), but you will probably not have time to look at it this week.
- If you have other plots you would like me to make, I'd be happy to! Also, if you have a bigger list of studies to test my plots on it would be great (for now I'm working on GLDS-26, 170 and 200).

---

## Presentation

- if you get into a specific dashboard or anything

---

## Storing code

- Maybe you should make a github for your code? Or could be something else if you have something you prefer
    --> Github OK + have moved most of the code to .py files and created separate files for modules (ex: taxonomy_tools which builds all the taxonomy graphs)
- Should also include a README that explains how to set up an environment to be able to run things, conda environment would be easiest

```bash
cp file .
```

## Access local server

- Right-click on Finder --> Connect to server... --> paste: http://127.0.0.1:8050/ --> connect

## Github CLI notes

> Quick example of how to send changes performed on our local computer to github. This needs to be run from inside the github repository on our local computer (so after, e.g., doing `git clone` to get it on our computer).

### tl;dr

```bash
git status
git add *
git commit -m "message about what we changed/added goes here"
git push
```

This example is with my happy belly github.

### Checking status

`git status` checks if there have been any changes from how it was pulled from github. This is before I made any changes:

```bash
git status
```

![](https://i.imgur.com/IDVHOsk.png)

After making a change:

![](https://i.imgur.com/r4X9ysM.png)

### Adding changes to be committed

Things can be added as individual files if wanted, or all changes can be added like so:

```bash
git add *
```

```bash
git status
```

![](https://i.imgur.com/IugSgTc.png)

### Committing changes

The `commit` needs to have a message attached to it, useful to describe what we're changing or adding.
That can be provided with the `-m` flag as shown here:

```bash
git commit -m "small change for git CLI example"
```

![](https://i.imgur.com/Fuf2TDF.png)

```bash
git status
```

![](https://i.imgur.com/Fjz5QOI.png)

Now the status says we are ahead of "origin/master" (the GitHub repository) by 1 commit.

### Pushing changes to GitHub

```bash
git push
```

![](https://i.imgur.com/CJ8lqh8.png)

And now the changes are updated in GitHub, and anyone can access the latest, e.g.:

![](https://i.imgur.com/1C5C1bi.png)
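Coming back to the inverse Simpson question in the alpha diversity section: here is a minimal plain-Python sketch of Shannon and inverse Simpson computed from one sample's ASV counts. It shows that the inverse Simpson index $\frac{1}{\sum_i p_i^2}$ can be recovered from the Simpson index $1 - \sum_i p_i^2$ as $\frac{1}{1 - \text{Simpson}}$. The function names and toy counts are hypothetical, and this is not the code used for the plots above.

```python
import math


def proportions(counts):
    """Convert one sample's raw ASV counts to proportions, dropping zeros."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return [c / total for c in counts if c > 0]


def shannon(counts):
    """Shannon index, natural log: -sum(p_i * ln(p_i))."""
    return -sum(p * math.log(p) for p in proportions(counts))


def simpson(counts):
    """Simpson index in the 1 - sum(p_i^2) form."""
    return 1.0 - sum(p ** 2 for p in proportions(counts))


def inverse_simpson(counts):
    """Inverse Simpson index computed directly: 1 / sum(p_i^2)."""
    return 1.0 / sum(p ** 2 for p in proportions(counts))


def inverse_simpson_from_simpson(simpson_value):
    """Recover 1 / sum(p_i^2) from a Simpson value of the form 1 - sum(p_i^2)."""
    dominance = 1.0 - simpson_value  # this is sum(p_i^2)
    if dominance <= 0:
        raise ZeroDivisionError("sum(p_i^2) must be positive")
    return 1.0 / dominance


toy_counts = [50, 50, 0]  # two equally abundant ASVs, one absent
print(shannon(toy_counts))
print(inverse_simpson(toy_counts))
print(inverse_simpson_from_simpson(simpson(toy_counts)))
```

For any sample with at least one count, $\sum_i p_i^2$ is strictly positive, so the division-by-zero check should only ever trigger on an empty sample.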