---
tags: BMMB554, Block1
---
# 1.6. Graphics with Pandas and Python II
[](https://hackmd.io/DQDeRE6sSku3yfaqCY-xcw)
> Building a mini genome browser
# Warm up
Before diving in go through the following notebook and complete as many exercises as possible:
Open [this notebook](https://colab.research.google.com/github/nekrut/BMMB554/blob/master/2021/ipynb/Plotting2_try.ipynb) and complete all exercises. Type your code in cells containing exercise comment:
```
# EXERCISE
```
# Building a custoM "browser"
When working with SARS-CoV-2 data we discovered an interesting problem. There is a region of the genome that is completely void of any variation in intrahost samples. It is highlighted by an orange box in the figure below:

What is that? One possible explanation is that this region is for some reason difficult to sequence. To check what is going on we will build a complex plot that would show different types on data in one unified figure. We will be plotting four types of data:
1. Coverage distribution computed from BAM files.
2. Location of variants identified in our study.
3. Coordinates of SARS-CoV-2 genes.
First, let's look at each dataset separately.
## Coverage data
Coverage data is computed by taking sequencing reads aligned to genome (usually in the form of [BAM](https://academic.oup.com/bioinformatics/article/25/16/2078/204688) files) by iterating over genomic coordinates and computing how many reads cover each position (usually done with [Samtools](http://www.htslib.org/), [BEDTools](https://bedtools.readthedocs.io/en/latest/), etc.).
:::success
:point_right: We will learn about BAM format, read mappers and other essential tools in the next block of this course.
:::
let's import coverage data computed using `samtools depth` tool from Galaxy:
```python=
cvrg = pd.read_csv('https://usegalaxy.org/datasets/bbd44e69cb8906b528225f578c3a071b/display?to_ext=tabular',header=None,sep='\t')
```
the data looks like this (here we are showing first 6 columns, but the actual dataset here contains over 600):
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---:|:------------|------:|----:|-----:|-----:|-----:|
| 0 | NC_045512.2 | 15000 | 193 | 6396 | 6329 | 7719 |
| 1 | NC_045512.2 | 15001 | 181 | 6245 | 6243 | 7679 |
| 2 | NC_045512.2 | 15002 | 182 | 6200 | 6194 | 7640 |
| 3 | NC_045512.2 | 15003 | 181 | 6183 | 6165 | 7596 |
| 4 | NC_045512.2 | 15004 | 181 | 6187 | 6176 | 7601 |
this data is in wide format, which is not what we need. Let's reshape it:
```python=
# Melt
cvrg=cvrg.melt(id_vars=[1],value_vars=cvrg.columns[2:],var_name='samples',value_name='cvrg')
# Name position column properly
cvrg.rename({1:'position'},axis=1,inplace=True)
```
The data now looks like this:
| | position | samples | cvrg |
|---:|-----------:|----------:|-------:|
| 0 | 15000 | 2 | 193 |
| 1 | 15001 | 2 | 181 |
| 2 | 15002 | 2 | 182 |
| 3 | 15003 | 2 | 181 |
| 4 | 15004 | 2 | 181 |
This is a very long table:
```python
len(cvrg)
6390639
```
Now we will aggregate coverage data by computing `min`, `max`, `mean` at each position across all samples:
```python=
stat = cvrg.sort_values(by=['position']).groupby(['position'])['cvrg'].agg([np.amin, np.std, np.amax])
```
Now let's look at at:
```python=
p = figure(
plot_height=400,
plot_width = 1200,
y_axis_label='Coverage (lines)',
x_range=Range1d(start=18000, end=23000,bounds=(0, 30000))
)
p.vline_stack(['amin','std','amax'],
x='position',
source=stat,
line_width=3,
color=colors[0:3],
legend_label=['min','std','max'])
p.legend.title = 'Read coverage'
show(p)
```

because range of Y-values is so large we can use log scale by adding `y_axis_type='log':
```python=
p = figure(
plot_height=400,
plot_width = 1200,
# Log scale for y-axis
y_axis_type='log',
y_axis_label='Coverage (lines)',
x_range=Range1d(start=18000, end=23000,bounds=(0, 30000))
)
p.vline_stack(['amin','std','amax'],
x='position',
source=stat,
line_width=3,
color=colors[0:3],
legend_label=['min','std','max'])
p.legend.title = 'Read coverage'
show(p)
```

## Dealing with annotations
We can extract gene coordinates directly from [SARS-CoV-2 GenBank file](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512). However, this would require an introduction to [BioPython](https://biopython.org/), which we will have later in this course. For now, let's use this dict:
```python=
annot = {
'start': [0, 265, 805, 2719, 8554, 10054, 10972, 11842, 12091, 12685, 13024, 13441, 16236, 18039, 19620, 20658, 13441, 13475, 13487, 21562, 25392, 26244, 26522, 27201, 27393, 27755, 27893, 28273, 29557, 29608, 29628, 29674, 29727],
'end': [265, 805, 2719, 8554, 10054, 10972, 11842, 12091, 12685, 13024, 13441, 16236, 18039, 19620, 20658, 21552, 13480, 13503, 13542, 25384, 26220, 26472, 27191, 27387, 27759, 27887, 28259, 29533, 29674, 29644, 29657, 29903, 29768],
'func': ['.', 'leader', 'nsp2', 'nsp3', 'nsp4', '3Cpro', 'nsp6', 'nsp7', 'nsp8', 'nsp9', 'nsp10', 'RdRp', 'helicase', 'ExoN', 'endoR', 'MethTr', 'nsp11', 'FS SL1', 'FS SL2', 'S', 'orf3a', 'E', 'M', 'orf6', 'orf7a', 'orf7b', 'orf8', 'N', 'orf10', '3-UTR SL1', '3-UTR SL2', '.', 'S2M'],
}
```
First, we create a Pandas dataframe out of this dict:
```python=
annot = pd.DataFrame.from_dict(annot)[['start','end','func']].sort_values(by=['start']).reset_index()
```
generating this:
| | index | start | end | func |
|---:|--------:|--------:|------:|:-------|
| 0 | 0 | 0 | 265 | . |
| 1 | 1 | 265 | 805 | leader |
| 2 | 2 | 805 | 2719 | nsp2 |
| 3 | 3 | 2719 | 8554 | nsp3 |
| 4 | 4 | 8554 | 10054 | nsp4 |
For better visual representation we would want adjacent genes to be shifted relative to one another. So, let's add necessary info the DataFrame:
```python=
annot['top']= annot.index % 2
annot['bottom'] = annot['top']-1
annot.loc[annot['top'] == 0, 'color'] = colors[4]
annot.loc[annot['top'] != 0, 'color'] = colors[5]
```
resulting in this:
| | index | start | end | func | top | bottom | color |
|---:|--------:|--------:|------:|:-------|------:|---------:|:--------|
| 0 | 0 | 0 | 265 | . | 0 | -1 | #ff7f00 |
| 1 | 1 | 265 | 805 | leader | 1 | 0 | #ffff33 |
| 2 | 2 | 805 | 2719 | nsp2 | 0 | -1 | #ff7f00 |
| 3 | 3 | 2719 | 8554 | nsp3 | 1 | 0 | #ffff33 |
| 4 | 4 | 8554 | 10054 | nsp4 | 0 | -1 | #ff7f00 |
Here `top` and `bottom` will be used a bounds for glyphs representing genes and `color` will give color then in alternating fashion.
Now let's look at at:
```python=
r = figure(
plot_height=200,
plot_width = 1200,
y_axis_label='Genes',
x_axis_label='Position in genome',
y_range=Range1d(start=-1, end=2,bounds=(-1, 2)),
)
gene_coord = ColumnDataSource(annot)
labels = LabelSet(x='start', y=1.5, text='func', level='glyph',
x_offset=5, y_offset=-5, source=gene_coord, render_mode='canvas')
genes = Quad(left="start", bottom='bottom', right='end', top='top',line_color='black',fill_color='color')
r.add_glyph(gene_coord, genes)
r.add_layout(labels)
r.yaxis.visible = False
r.ygrid.grid_line_color = None
show(r)
```

## Variant data
First, upload list of called variants from GitHub:
```python=
var = pd.read_csv("https://github.com/galaxyproject/SARS-CoV-2/raw/master/data/var/bos_by_var.tsv.gz",sep='\t')
var = var[['POS','countunique(Sample)']]
var = var.rename(columns={'countunique(Sample)':'samples'})
```
| | POS | samples |
|---:|------:|----------:|
| 0 | 13 | 3 |
| 1 | 19 | 1 |
| 2 | 20 | 1 |
| 3 | 21 | 1 |
| 4 | 36 | 1 |
Let's look at it:
```python=
q = figure(
plot_height=100,
plot_width=1200,
y_axis_label='# Samples',
)
q.circle(x='POS',y='samples',source=var)
show(q)
```

## Putting everything together
```python=
# Coverage figure
p = figure(
plot_height=400,
plot_width = 1200,
# Log scale for y-axis
y_axis_type='log',
y_axis_label='Coverage (lines)',
x_range=Range1d(start=18000, end=23000,bounds=(0, 30000))
)
p.vline_stack(['amin','std','amax'],
x='position',
source=stat,
line_width=3,
color=colors[0:3],
legend_label=['min','std','max'])
p.legend.title = 'Read coverage'
# Varinat figure
q = figure(
plot_height=100,
plot_width=1200,
y_axis_label='# Samples',
)
q.circle(x='POS',y='samples',source=var)
# Annotation figure
r = figure(
plot_height=200,
plot_width = 1200,
y_axis_label='Genes',
x_axis_label='Position in genome',
y_range=Range1d(start=-1, end=2,bounds=(-1, 2)),
)
gene_coord = ColumnDataSource(annot)
labels = LabelSet(x='start', y=1.5, text='func', level='glyph',
x_offset=5, y_offset=-5, source=gene_coord, render_mode='canvas')
genes = Quad(left="start", bottom='bottom', right='end', top='top',line_color='black',fill_color='color')
r.add_glyph(gene_coord, genes)
r.add_layout(labels)
r.yaxis.visible = False
r.ygrid.grid_line_color = None
show(column(p,q,r))
```
This is *almost* what we want, but not quite:

First, we need to sync up axes. This is done by linking plots `q` and `r` to the plot `p` by adding `x_range=p.x_range` to definition of each figure:
```python=
# Coverage figure
p = figure(
plot_height=400,
plot_width = 1200,
# Log scale for y-axis
y_axis_type='log',
y_axis_label='Coverage (lines)',
x_range=Range1d(start=18000, end=23000,bounds=(0, 30000))
)
p.vline_stack(['amin','std','amax'],
x='position',
source=stat,
line_width=3,
color=colors[0:3],
legend_label=['min','std','max'])
p.legend.title = 'Read coverage'
# Varinat figure
q = figure(
plot_height=100,
plot_width=1200,
y_axis_label='# Samples',
x_range=p.x_range
)
q.circle(x='POS',y='samples',source=var)
# Annotation figure
r = figure(
plot_height=200,
plot_width = 1200,
y_axis_label='Genes',
x_axis_label='Position in genome',
y_range=Range1d(start=-1, end=2,bounds=(-1, 2)),
x_range=p.x_range
)
gene_coord = ColumnDataSource(annot)
labels = LabelSet(x='start', y=1.5, text='func', level='glyph',
x_offset=5, y_offset=-5, source=gene_coord, render_mode='canvas')
genes = Quad(left="start", bottom='bottom', right='end', top='top',line_color='black',fill_color='color')
r.add_glyph(gene_coord, genes)
r.add_layout(labels)
r.yaxis.visible = False
r.ygrid.grid_line_color = None
show(column(p,q,r))
```
And now we have a functional "genome browser" for the data:

## Try yourself
To begin:
1. Create a new [colab](https://colab.research.google.com) notebook.
2. Paste the following into the first cell:
```python=
import numpy as np
import pandas as pd
# Our main plotting package (must have explicit import of submodules)
import bokeh.io
import bokeh.plotting
from bokeh.models import ColumnDataSource,Quad,Range1d,LabelSet
from bokeh.plotting import figure, show
from bokeh.palettes import Set1_7
from bokeh.layouts import column
colors = Set1_7
# Enable viewing Bokeh plots in the not
bokeh.io.output_notebook()
```
3. Go over all steps in this post.