1.6. Graphics with Pandas and Python II

--- tags: BMMB554, Block1 --- # 1.6. Graphics with Pandas and Python II [![hackmd-github-sync-badge](https://hackmd.io/DQDeRE6sSku3yfaqCY-xcw/badge)](https://hackmd.io/DQDeRE6sSku3yfaqCY-xcw) > Building a mini genome browser # Warm up Before diving in go through the following notebook and complete as many exercises as possible: Open [this notebook](https://colab.research.google.com/github/nekrut/BMMB554/blob/master/2021/ipynb/Plotting2_try.ipynb) and complete all exercises. Type your code in cells containing exercise comment: ``` # EXERCISE ``` # Building a custoM "browser" When working with SARS-CoV-2 data we discovered an interesting problem. There is a region of the genome that is completely void of any variation in intrahost samples. It is highlighted by an orange box in the figure below: ![](https://i.imgur.com/H0K85Zt.png) What is that? One possible explanation is that this region is for some reason difficult to sequence. To check what is going on we will build a complex plot that would show different types on data in one unified figure. We will be plotting four types of data: 1. Coverage distribution computed from BAM files. 2. Location of variants identified in our study. 3. Coordinates of SARS-CoV-2 genes. First, let's look at each dataset separately. ## Coverage data Coverage data is computed by taking sequencing reads aligned to genome (usually in the form of [BAM](https://academic.oup.com/bioinformatics/article/25/16/2078/204688) files) by iterating over genomic coordinates and computing how many reads cover each position (usually done with [Samtools](http://www.htslib.org/), [BEDTools](https://bedtools.readthedocs.io/en/latest/), etc.). :::success :point_right: We will learn about BAM format, read mappers and other essential tools in the next block of this course. ::: let's import coverage data computed using `samtools depth` tool from Galaxy: ```python= cvrg = pd.read_csv('https://usegalaxy.org/datasets/bbd44e69cb8906b528225f578c3a071b/display?to_ext=tabular',header=None,sep='\t') ``` the data looks like this (here we are showing first 6 columns, but the actual dataset here contains over 600): | | 0 | 1 | 2 | 3 | 4 | 5 | |---:|:------------|------:|----:|-----:|-----:|-----:| | 0 | NC_045512.2 | 15000 | 193 | 6396 | 6329 | 7719 | | 1 | NC_045512.2 | 15001 | 181 | 6245 | 6243 | 7679 | | 2 | NC_045512.2 | 15002 | 182 | 6200 | 6194 | 7640 | | 3 | NC_045512.2 | 15003 | 181 | 6183 | 6165 | 7596 | | 4 | NC_045512.2 | 15004 | 181 | 6187 | 6176 | 7601 | this data is in wide format, which is not what we need. Let's reshape it: ```python= # Melt cvrg=cvrg.melt(id_vars=[1],value_vars=cvrg.columns[2:],var_name='samples',value_name='cvrg') # Name position column properly cvrg.rename({1:'position'},axis=1,inplace=True) ``` The data now looks like this: | | position | samples | cvrg | |---:|-----------:|----------:|-------:| | 0 | 15000 | 2 | 193 | | 1 | 15001 | 2 | 181 | | 2 | 15002 | 2 | 182 | | 3 | 15003 | 2 | 181 | | 4 | 15004 | 2 | 181 | This is a very long table: ```python len(cvrg) 6390639 ``` Now we will aggregate coverage data by computing `min`, `max`, `mean` at each position across all samples: ```python= stat = cvrg.sort_values(by=['position']).groupby(['position'])['cvrg'].agg([np.amin, np.std, np.amax]) ``` Now let's look at at: ```python= p = figure( plot_height=400, plot_width = 1200, y_axis_label='Coverage (lines)', x_range=Range1d(start=18000, end=23000,bounds=(0, 30000)) ) p.vline_stack(['amin','std','amax'], x='position', source=stat, line_width=3, color=colors[0:3], legend_label=['min','std','max']) p.legend.title = 'Read coverage' show(p) ``` ![](https://i.imgur.com/U2xU9nX.png) because range of Y-values is so large we can use log scale by adding `y_axis_type='log': ```python= p = figure( plot_height=400, plot_width = 1200, # Log scale for y-axis y_axis_type='log', y_axis_label='Coverage (lines)', x_range=Range1d(start=18000, end=23000,bounds=(0, 30000)) ) p.vline_stack(['amin','std','amax'], x='position', source=stat, line_width=3, color=colors[0:3], legend_label=['min','std','max']) p.legend.title = 'Read coverage' show(p) ``` ![](https://i.imgur.com/PcPNYfN.png) ## Dealing with annotations We can extract gene coordinates directly from [SARS-CoV-2 GenBank file](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512). However, this would require an introduction to [BioPython](https://biopython.org/), which we will have later in this course. For now, let's use this dict: ```python= annot = { 'start': [0, 265, 805, 2719, 8554, 10054, 10972, 11842, 12091, 12685, 13024, 13441, 16236, 18039, 19620, 20658, 13441, 13475, 13487, 21562, 25392, 26244, 26522, 27201, 27393, 27755, 27893, 28273, 29557, 29608, 29628, 29674, 29727], 'end': [265, 805, 2719, 8554, 10054, 10972, 11842, 12091, 12685, 13024, 13441, 16236, 18039, 19620, 20658, 21552, 13480, 13503, 13542, 25384, 26220, 26472, 27191, 27387, 27759, 27887, 28259, 29533, 29674, 29644, 29657, 29903, 29768], 'func': ['.', 'leader', 'nsp2', 'nsp3', 'nsp4', '3Cpro', 'nsp6', 'nsp7', 'nsp8', 'nsp9', 'nsp10', 'RdRp', 'helicase', 'ExoN', 'endoR', 'MethTr', 'nsp11', 'FS SL1', 'FS SL2', 'S', 'orf3a', 'E', 'M', 'orf6', 'orf7a', 'orf7b', 'orf8', 'N', 'orf10', '3-UTR SL1', '3-UTR SL2', '.', 'S2M'], } ``` First, we create a Pandas dataframe out of this dict: ```python= annot = pd.DataFrame.from_dict(annot)[['start','end','func']].sort_values(by=['start']).reset_index() ``` generating this: | | index | start | end | func | |---:|--------:|--------:|------:|:-------| | 0 | 0 | 0 | 265 | . | | 1 | 1 | 265 | 805 | leader | | 2 | 2 | 805 | 2719 | nsp2 | | 3 | 3 | 2719 | 8554 | nsp3 | | 4 | 4 | 8554 | 10054 | nsp4 | For better visual representation we would want adjacent genes to be shifted relative to one another. So, let's add necessary info the DataFrame: ```python= annot['top']= annot.index % 2 annot['bottom'] = annot['top']-1 annot.loc[annot['top'] == 0, 'color'] = colors[4] annot.loc[annot['top'] != 0, 'color'] = colors[5] ``` resulting in this: | | index | start | end | func | top | bottom | color | |---:|--------:|--------:|------:|:-------|------:|---------:|:--------| | 0 | 0 | 0 | 265 | . | 0 | -1 | #ff7f00 | | 1 | 1 | 265 | 805 | leader | 1 | 0 | #ffff33 | | 2 | 2 | 805 | 2719 | nsp2 | 0 | -1 | #ff7f00 | | 3 | 3 | 2719 | 8554 | nsp3 | 1 | 0 | #ffff33 | | 4 | 4 | 8554 | 10054 | nsp4 | 0 | -1 | #ff7f00 | Here `top` and `bottom` will be used a bounds for glyphs representing genes and `color` will give color then in alternating fashion. Now let's look at at: ```python= r = figure( plot_height=200, plot_width = 1200, y_axis_label='Genes', x_axis_label='Position in genome', y_range=Range1d(start=-1, end=2,bounds=(-1, 2)), ) gene_coord = ColumnDataSource(annot) labels = LabelSet(x='start', y=1.5, text='func', level='glyph', x_offset=5, y_offset=-5, source=gene_coord, render_mode='canvas') genes = Quad(left="start", bottom='bottom', right='end', top='top',line_color='black',fill_color='color') r.add_glyph(gene_coord, genes) r.add_layout(labels) r.yaxis.visible = False r.ygrid.grid_line_color = None show(r) ``` ![](https://i.imgur.com/7zuURGb.png) ## Variant data First, upload list of called variants from GitHub: ```python= var = pd.read_csv("https://github.com/galaxyproject/SARS-CoV-2/raw/master/data/var/bos_by_var.tsv.gz",sep='\t') var = var[['POS','countunique(Sample)']] var = var.rename(columns={'countunique(Sample)':'samples'}) ``` | | POS | samples | |---:|------:|----------:| | 0 | 13 | 3 | | 1 | 19 | 1 | | 2 | 20 | 1 | | 3 | 21 | 1 | | 4 | 36 | 1 | Let's look at it: ```python= q = figure( plot_height=100, plot_width=1200, y_axis_label='# Samples', ) q.circle(x='POS',y='samples',source=var) show(q) ``` ![](https://i.imgur.com/grdVLPN.png) ## Putting everything together ```python= # Coverage figure p = figure( plot_height=400, plot_width = 1200, # Log scale for y-axis y_axis_type='log', y_axis_label='Coverage (lines)', x_range=Range1d(start=18000, end=23000,bounds=(0, 30000)) ) p.vline_stack(['amin','std','amax'], x='position', source=stat, line_width=3, color=colors[0:3], legend_label=['min','std','max']) p.legend.title = 'Read coverage' # Varinat figure q = figure( plot_height=100, plot_width=1200, y_axis_label='# Samples', ) q.circle(x='POS',y='samples',source=var) # Annotation figure r = figure( plot_height=200, plot_width = 1200, y_axis_label='Genes', x_axis_label='Position in genome', y_range=Range1d(start=-1, end=2,bounds=(-1, 2)), ) gene_coord = ColumnDataSource(annot) labels = LabelSet(x='start', y=1.5, text='func', level='glyph', x_offset=5, y_offset=-5, source=gene_coord, render_mode='canvas') genes = Quad(left="start", bottom='bottom', right='end', top='top',line_color='black',fill_color='color') r.add_glyph(gene_coord, genes) r.add_layout(labels) r.yaxis.visible = False r.ygrid.grid_line_color = None show(column(p,q,r)) ``` This is *almost* what we want, but not quite: ![](https://i.imgur.com/Ddg6udE.png) First, we need to sync up axes. This is done by linking plots `q` and `r` to the plot `p` by adding `x_range=p.x_range` to definition of each figure: ```python= # Coverage figure p = figure( plot_height=400, plot_width = 1200, # Log scale for y-axis y_axis_type='log', y_axis_label='Coverage (lines)', x_range=Range1d(start=18000, end=23000,bounds=(0, 30000)) ) p.vline_stack(['amin','std','amax'], x='position', source=stat, line_width=3, color=colors[0:3], legend_label=['min','std','max']) p.legend.title = 'Read coverage' # Varinat figure q = figure( plot_height=100, plot_width=1200, y_axis_label='# Samples', x_range=p.x_range ) q.circle(x='POS',y='samples',source=var) # Annotation figure r = figure( plot_height=200, plot_width = 1200, y_axis_label='Genes', x_axis_label='Position in genome', y_range=Range1d(start=-1, end=2,bounds=(-1, 2)), x_range=p.x_range ) gene_coord = ColumnDataSource(annot) labels = LabelSet(x='start', y=1.5, text='func', level='glyph', x_offset=5, y_offset=-5, source=gene_coord, render_mode='canvas') genes = Quad(left="start", bottom='bottom', right='end', top='top',line_color='black',fill_color='color') r.add_glyph(gene_coord, genes) r.add_layout(labels) r.yaxis.visible = False r.ygrid.grid_line_color = None show(column(p,q,r)) ``` And now we have a functional "genome browser" for the data: ![](https://i.imgur.com/oFpFMXL.png) ## Try yourself To begin: 1. Create a new [colab](https://colab.research.google.com) notebook. 2. Paste the following into the first cell: ```python= import numpy as np import pandas as pd # Our main plotting package (must have explicit import of submodules) import bokeh.io import bokeh.plotting from bokeh.models import ColumnDataSource,Quad,Range1d,LabelSet from bokeh.plotting import figure, show from bokeh.palettes import Set1_7 from bokeh.layouts import column colors = Set1_7 # Enable viewing Bokeh plots in the not bokeh.io.output_notebook() ``` 3. Go over all steps in this post.