owned this note
owned this note
Published
Linked with GitHub
---
tags: BMMB554-23
---
```vega
{
"config": {"view": {"continuousWidth": 400, "continuousHeight": 300}},
"layer": [
{
"mark": {"type": "rect", "opacity": 1},
"encoding": {
"color": {
"field": "run_accession",
"scale": {"scheme": "goldred", "type": "log"},
"title": "# Samples",
"type": "quantitative"
},
"tooltip": [
{
"field": "instrument_platform",
"title": "Machine",
"type": "nominal"
},
{
"field": "run_accession",
"title": "Number of runs",
"type": "quantitative"
},
{"field": "library_strategy", "title": "Protocol", "type": "nominal"}
],
"x": {
"field": "instrument_platform",
"title": "Instrument",
"type": "nominal"
},
"y": {
"axis": {"orient": "right"},
"field": "library_strategy",
"title": "Strategy",
"type": "nominal"
}
},
"height": 150,
"title": {
"text": [
"Breakdown of 6,164,922 datasets (unique accessions) from ENA",
"by Platform and Library Strategy"
],
"subtitle": "(Updated February 14, 2023)"
},
"width": 500
},
{
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"fontSize": 12,
"fontWeight": "bold"
},
"encoding": {
"color": {
"condition": {
"value": "white",
"test": "(datum.run_accession > 200)"
},
"value": "black"
},
"text": {
"field": "run_accession",
"format": ",.0f",
"type": "quantitative"
},
"tooltip": [
{
"field": "instrument_platform",
"title": "Machine",
"type": "nominal"
},
{
"field": "run_accession",
"title": "Number of runs",
"type": "quantitative"
},
{"field": "library_strategy", "title": "Protocol", "type": "nominal"}
],
"x": {
"field": "instrument_platform",
"title": "Instrument",
"type": "nominal"
},
"y": {
"axis": {"orient": "right"},
"field": "library_strategy",
"title": "Strategy",
"type": "nominal"
}
},
"height": 150,
"title": {
"text": [
"Breakdown of 6,164,922 datasets (unique accessions) from ENA",
"by Platform and Library Strategy"
],
"subtitle": "(Updated February 14, 2023)"
},
"width": 500
}
],
"data": {"name": "data-9da197cec4ddf3332c0c1c66ca8c87cd"},
"$schema": "https://vega.github.io/schema/vega-lite/v4.17.0.json",
"datasets": {
"data-9da197cec4ddf3332c0c1c66ca8c87cd": [
{
"instrument_platform": "BGISEQ",
"library_strategy": "AMPLICON",
"run_accession": 21
},
{
"instrument_platform": "BGISEQ",
"library_strategy": "OTHER",
"run_accession": 1067
},
{
"instrument_platform": "BGISEQ",
"library_strategy": "RNA-Seq",
"run_accession": 64
},
{
"instrument_platform": "BGISEQ",
"library_strategy": "Targeted-Capture",
"run_accession": 38
},
{
"instrument_platform": "BGISEQ",
"library_strategy": "WGA",
"run_accession": 1
},
{
"instrument_platform": "CAPILLARY",
"library_strategy": "AMPLICON",
"run_accession": 3
},
{
"instrument_platform": "DNBSEQ",
"library_strategy": "AMPLICON",
"run_accession": 325
},
{
"instrument_platform": "DNBSEQ",
"library_strategy": "OTHER",
"run_accession": 5
},
{
"instrument_platform": "DNBSEQ",
"library_strategy": "RNA-Seq",
"run_accession": 64
},
{
"instrument_platform": "ILLUMINA",
"library_strategy": "AMPLICON",
"run_accession": 4833110
},
{
"instrument_platform": "ILLUMINA",
"library_strategy": "OTHER",
"run_accession": 204
},
{
"instrument_platform": "ILLUMINA",
"library_strategy": "RNA-Seq",
"run_accession": 34826
},
{
"instrument_platform": "ILLUMINA",
"library_strategy": "Targeted-Capture",
"run_accession": 14843
},
{
"instrument_platform": "ILLUMINA",
"library_strategy": "WCS",
"run_accession": 74
},
{
"instrument_platform": "ILLUMINA",
"library_strategy": "WGA",
"run_accession": 89499
},
{
"instrument_platform": "ILLUMINA",
"library_strategy": "WGS",
"run_accession": 59618
},
{
"instrument_platform": "ILLUMINA",
"library_strategy": "WXS",
"run_accession": 2
},
{
"instrument_platform": "ILLUMINA",
"library_strategy": "miRNA-Seq",
"run_accession": 28
},
{
"instrument_platform": "ION_TORRENT",
"library_strategy": "AMPLICON",
"run_accession": 108617
},
{
"instrument_platform": "ION_TORRENT",
"library_strategy": "RNA-Seq",
"run_accession": 75
},
{
"instrument_platform": "ION_TORRENT",
"library_strategy": "WGA",
"run_accession": 724
},
{
"instrument_platform": "ION_TORRENT",
"library_strategy": "WGS",
"run_accession": 1385
},
{
"instrument_platform": "OXFORD_NANOPORE",
"library_strategy": "AMPLICON",
"run_accession": 445491
},
{
"instrument_platform": "OXFORD_NANOPORE",
"library_strategy": "OTHER",
"run_accession": 31
},
{
"instrument_platform": "OXFORD_NANOPORE",
"library_strategy": "RNA-Seq",
"run_accession": 18617
},
{
"instrument_platform": "OXFORD_NANOPORE",
"library_strategy": "WGA",
"run_accession": 7842
},
{
"instrument_platform": "OXFORD_NANOPORE",
"library_strategy": "WGS",
"run_accession": 13670
},
{
"instrument_platform": "PACBIO_SMRT",
"library_strategy": "AMPLICON",
"run_accession": 533112
},
{
"instrument_platform": "PACBIO_SMRT",
"library_strategy": "FL-cDNA",
"run_accession": 11
},
{
"instrument_platform": "PACBIO_SMRT",
"library_strategy": "RNA-Seq",
"run_accession": 1013
},
{
"instrument_platform": "PACBIO_SMRT",
"library_strategy": "Synthetic-Long-Read",
"run_accession": 29
},
{
"instrument_platform": "PACBIO_SMRT",
"library_strategy": "Targeted-Capture",
"run_accession": 465
},
{
"instrument_platform": "PACBIO_SMRT",
"library_strategy": "WGS",
"run_accession": 48
}
]
}
}
```
# Lecture 12: Making graphics programmatically
>A very short introduction
## Understanding [Altair](https://altair-viz.github.io/)
Before we will try to reproduce the heatmap shown above, let's review your [reading](https://www.practicaldatascience.org/html/plotting_altair_part1.html) for today's class. Open two links listed below in two browser tabs and preferably position them side-by-side:
- [Reading](https://www.practicaldatascience.org/html/plotting_altair_part1.html) - the chapter you were supposed to read (:smirk:)
- [Examples](https://colab.research.google.com/github/nekrut/BMMB554/blob/master/2023/ipynb/L12_altair_review.ipynb) - a notebook containing ready-to-run examples from the reading above
## Creating a heatmap
Open [this notebook](https://colab.research.google.com/github/nekrut/BMMB554/blob/master/2023/ipynb/L12_heatmap.ipynb) and this page side by side.
:::info
Outputs of your notebook may have different counts compared to this page.
:::
### Importing data
First, import all the packages we need:
```py=
import pandas as pd
import altair as alt
from datetime import date
today = date.today()
```
Next, read a gigantic dataset from DropBox:
```py=
sra = pd.read_csv(
"https://www.dropbox.com/s/sy2c1ds5gbi16x2/ena.tsv.gz?dl=1",
compression='gzip',
sep="\t",
low_memory=False
)
```
This dataset contains *a lot* of rows:
```py=
len(sra)
```
Which would output:
```
6187417
```
Let's look at the five random lines from this table (scroll sideways):
```py=
sra.sample(5)
```
| | study_accession | run_accession | collection_date | instrument_platform | library_strategy | library_construction_protocol |
|--------:|:------------------|:----------------|:--------------------|:----------------------|:-------------------|:--------------------------------|
| 2470282 | PRJEB37886 | ERR8618165 | 2022-02-03 00:00:00 | ILLUMINA | AMPLICON | nan |
| 5747352 | PRJNA686984 | SRR21294506 | 2022-07-10 00:00:00 | OXFORD_NANOPORE | AMPLICON | nan |
| 2588443 | PRJEB37886 | ERR8997118 | 2022-02-15 00:00:00 | ILLUMINA | AMPLICON | nan |
| 3662323 | PRJNA716984 | SRR16033134 | 2021-09-08 00:00:00 | PACBIO_SMRT | AMPLICON | Freed primers |
| 5298351 | PRJNA716984 | SRR20455932 | 2022-04-10 00:00:00 | PACBIO_SMRT | AMPLICON | Freed primers |
This dataset also has *a lot* of columns:
```py=
for _ in sra.columns: print(_)
```
We do not need all the columns, so let's restrict the dataset only to columns we would need. This would also make it much smaller:
```py=
sra = sra[
[
'study_accession',
'run_accession',
'collection_date',
'instrument_platform',
'library_strategy',
'library_construction_protocol'
]
]
```
### Cleaning the data
The `collection_date` field will be useful for us to be able to filter out nonsense as you will see below. But to use it properly, we need tell Pandas that it is not just a text, but actually dates:
```py=
sra = sra.assign(collection_date = pd.to_datetime(sra["collection_date"]))
```
Let's see what are the earliest and latest entries:
```py=
print('Earliest entry:', sra['collection_date'].min())
print('Latest entry:', sra['collection_date'].max())
```
:::danger
:scream: | Don't get surprised here - the metadata is only as good as the person who entered it. So, **when you enter metadata for you sequencing data -- pay attention!!!**
:::
:::success
Can you write a statement that would show how many rows contain these erroneous dates?
:::spoiler
```py=
sra[sra['collection_date'] == sra['collection_date'].min()]['run_accession'].nunique()
sra[sra['collection_date'] == sra['collection_date'].max()]['run_accession'].nunique()
```
:::
:::
It is likely that the data will need a bit of cleaning:
```py=
sra = sra[
( sra['collection_date'] >= '2020-01-01' )
&
( sra['collection_date'] <= '2023-02-16' )
]
```
Finally, in order to buold the heatmap, we need to aggregate the data:
```py=
heatmap_2d = sra.groupby(
['instrument_platform','library_strategy']
).agg(
{'run_accession':'nunique'}
).reset_index()
```
This will look something like this:
| | instrument_platform | library_strategy | run_accession |
|---:|:----------------------|:--------------------|----------------:|
| 0 | BGISEQ | AMPLICON | 21 |
| 1 | BGISEQ | OTHER | 1067 |
| 2 | BGISEQ | RNA-Seq | 64 |
| 3 | BGISEQ | Targeted-Capture | 38 |
| 4 | BGISEQ | WGA | 1 |
| 5 | CAPILLARY | AMPLICON | 3 |
| 6 | DNBSEQ | AMPLICON | 325 |
| 7 | DNBSEQ | OTHER | 5 |
| 8 | DNBSEQ | RNA-Seq | 64 |
| 9 | ILLUMINA | AMPLICON | 4833110 |
| 10 | ILLUMINA | OTHER | 204 |
| 11 | ILLUMINA | RNA-Seq | 34826 |
| 12 | ILLUMINA | Targeted-Capture | 14843 |
| 13 | ILLUMINA | WCS | 74 |
| 14 | ILLUMINA | WGA | 89499 |
| 15 | ILLUMINA | WGS | 59618 |
| 16 | ILLUMINA | WXS | 2 |
| 17 | ILLUMINA | miRNA-Seq | 28 |
| 18 | ION_TORRENT | AMPLICON | 108617 |
| 19 | ION_TORRENT | RNA-Seq | 75 |
| 20 | ION_TORRENT | WGA | 724 |
| 21 | ION_TORRENT | WGS | 1385 |
| 22 | OXFORD_NANOPORE | AMPLICON | 445491 |
| 23 | OXFORD_NANOPORE | OTHER | 31 |
| 24 | OXFORD_NANOPORE | RNA-Seq | 18617 |
| 25 | OXFORD_NANOPORE | WGA | 7842 |
| 26 | OXFORD_NANOPORE | WGS | 13670 |
| 27 | PACBIO_SMRT | AMPLICON | 533112 |
| 28 | PACBIO_SMRT | FL-cDNA | 11 |
| 29 | PACBIO_SMRT | RNA-Seq | 1013 |
| 30 | PACBIO_SMRT | Synthetic-Long-Read | 29 |
| 31 | PACBIO_SMRT | Targeted-Capture | 465 |
| 32 | PACBIO_SMRT | WGS | 48 |
### Plotting the data
Now let's create a graph. This graph will be layered: the "back" will be the heatmap squares and the "front" will be the numbers (see heatmap at the beginning of this page):
```back = alt.Chart(heatmap_2d).mark_rect(opacity=1).encode(
x=alt.X(
"instrument_platform:N",
title="Instrument"
),
y=alt.Y(
"library_strategy:N",
title="Strategy",
axis=alt.Axis(orient='right')
),
color=alt.Color(
"run_accession:Q",
title="# Samples",
scale=alt.Scale(
scheme="goldred",
type="log"
),
),
tooltip=[
alt.Tooltip(
"instrument_platform:N",
title="Machine"
),
alt.Tooltip(
"run_accession:Q",
title="Number of runs"
),
alt.Tooltip(
"library_strategy:N",
title="Protocol"
)
]
).properties(
width=500,
height=150,
title={
"text":
["Breakdown of datasets (unique accessions) from ENA",
"by Platform and Library Strategy"],
"subtitle":"(Updated {})".format(today.strftime("%B %d, %Y"))
}
)
back
```
This would give us a grid:

Now, it would be nice to fill the rectangles with actual numbers:
```py=
front = bottom.mark_text(
align="center",
baseline="middle",
fontSize=12,
fontWeight="bold",
).encode(
text=alt.Text("run_accession:Q",format=",.0f"),
color=alt.condition(
alt.datum.run_accession > 200,
alt.value("white"),
alt.value("black")
)
)
front
```
This would give us the text:

To superimpose these on top of each other we should simple do this:
```py=
back + front
```
