Lecture 12: Making graphics programmatically

--- tags: BMMB554-23 --- ```vega { "config": {"view": {"continuousWidth": 400, "continuousHeight": 300}}, "layer": [ { "mark": {"type": "rect", "opacity": 1}, "encoding": { "color": { "field": "run_accession", "scale": {"scheme": "goldred", "type": "log"}, "title": "# Samples", "type": "quantitative" }, "tooltip": [ { "field": "instrument_platform", "title": "Machine", "type": "nominal" }, { "field": "run_accession", "title": "Number of runs", "type": "quantitative" }, {"field": "library_strategy", "title": "Protocol", "type": "nominal"} ], "x": { "field": "instrument_platform", "title": "Instrument", "type": "nominal" }, "y": { "axis": {"orient": "right"}, "field": "library_strategy", "title": "Strategy", "type": "nominal" } }, "height": 150, "title": { "text": [ "Breakdown of 6,164,922 datasets (unique accessions) from ENA", "by Platform and Library Strategy" ], "subtitle": "(Updated February 14, 2023)" }, "width": 500 }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "fontSize": 12, "fontWeight": "bold" }, "encoding": { "color": { "condition": { "value": "white", "test": "(datum.run_accession > 200)" }, "value": "black" }, "text": { "field": "run_accession", "format": ",.0f", "type": "quantitative" }, "tooltip": [ { "field": "instrument_platform", "title": "Machine", "type": "nominal" }, { "field": "run_accession", "title": "Number of runs", "type": "quantitative" }, {"field": "library_strategy", "title": "Protocol", "type": "nominal"} ], "x": { "field": "instrument_platform", "title": "Instrument", "type": "nominal" }, "y": { "axis": {"orient": "right"}, "field": "library_strategy", "title": "Strategy", "type": "nominal" } }, "height": 150, "title": { "text": [ "Breakdown of 6,164,922 datasets (unique accessions) from ENA", "by Platform and Library Strategy" ], "subtitle": "(Updated February 14, 2023)" }, "width": 500 } ], "data": {"name": "data-9da197cec4ddf3332c0c1c66ca8c87cd"}, "$schema": "https://vega.github.io/schema/vega-lite/v4.17.0.json", "datasets": { "data-9da197cec4ddf3332c0c1c66ca8c87cd": [ { "instrument_platform": "BGISEQ", "library_strategy": "AMPLICON", "run_accession": 21 }, { "instrument_platform": "BGISEQ", "library_strategy": "OTHER", "run_accession": 1067 }, { "instrument_platform": "BGISEQ", "library_strategy": "RNA-Seq", "run_accession": 64 }, { "instrument_platform": "BGISEQ", "library_strategy": "Targeted-Capture", "run_accession": 38 }, { "instrument_platform": "BGISEQ", "library_strategy": "WGA", "run_accession": 1 }, { "instrument_platform": "CAPILLARY", "library_strategy": "AMPLICON", "run_accession": 3 }, { "instrument_platform": "DNBSEQ", "library_strategy": "AMPLICON", "run_accession": 325 }, { "instrument_platform": "DNBSEQ", "library_strategy": "OTHER", "run_accession": 5 }, { "instrument_platform": "DNBSEQ", "library_strategy": "RNA-Seq", "run_accession": 64 }, { "instrument_platform": "ILLUMINA", "library_strategy": "AMPLICON", "run_accession": 4833110 }, { "instrument_platform": "ILLUMINA", "library_strategy": "OTHER", "run_accession": 204 }, { "instrument_platform": "ILLUMINA", "library_strategy": "RNA-Seq", "run_accession": 34826 }, { "instrument_platform": "ILLUMINA", "library_strategy": "Targeted-Capture", "run_accession": 14843 }, { "instrument_platform": "ILLUMINA", "library_strategy": "WCS", "run_accession": 74 }, { "instrument_platform": "ILLUMINA", "library_strategy": "WGA", "run_accession": 89499 }, { "instrument_platform": "ILLUMINA", "library_strategy": "WGS", "run_accession": 59618 }, { "instrument_platform": "ILLUMINA", "library_strategy": "WXS", "run_accession": 2 }, { "instrument_platform": "ILLUMINA", "library_strategy": "miRNA-Seq", "run_accession": 28 }, { "instrument_platform": "ION_TORRENT", "library_strategy": "AMPLICON", "run_accession": 108617 }, { "instrument_platform": "ION_TORRENT", "library_strategy": "RNA-Seq", "run_accession": 75 }, { "instrument_platform": "ION_TORRENT", "library_strategy": "WGA", "run_accession": 724 }, { "instrument_platform": "ION_TORRENT", "library_strategy": "WGS", "run_accession": 1385 }, { "instrument_platform": "OXFORD_NANOPORE", "library_strategy": "AMPLICON", "run_accession": 445491 }, { "instrument_platform": "OXFORD_NANOPORE", "library_strategy": "OTHER", "run_accession": 31 }, { "instrument_platform": "OXFORD_NANOPORE", "library_strategy": "RNA-Seq", "run_accession": 18617 }, { "instrument_platform": "OXFORD_NANOPORE", "library_strategy": "WGA", "run_accession": 7842 }, { "instrument_platform": "OXFORD_NANOPORE", "library_strategy": "WGS", "run_accession": 13670 }, { "instrument_platform": "PACBIO_SMRT", "library_strategy": "AMPLICON", "run_accession": 533112 }, { "instrument_platform": "PACBIO_SMRT", "library_strategy": "FL-cDNA", "run_accession": 11 }, { "instrument_platform": "PACBIO_SMRT", "library_strategy": "RNA-Seq", "run_accession": 1013 }, { "instrument_platform": "PACBIO_SMRT", "library_strategy": "Synthetic-Long-Read", "run_accession": 29 }, { "instrument_platform": "PACBIO_SMRT", "library_strategy": "Targeted-Capture", "run_accession": 465 }, { "instrument_platform": "PACBIO_SMRT", "library_strategy": "WGS", "run_accession": 48 } ] } } ``` # Lecture 12: Making graphics programmatically >A very short introduction ## Understanding [Altair](https://altair-viz.github.io/) Before we will try to reproduce the heatmap shown above, let's review your [reading](https://www.practicaldatascience.org/html/plotting_altair_part1.html) for today's class. Open two links listed below in two browser tabs and preferably position them side-by-side: - [Reading](https://www.practicaldatascience.org/html/plotting_altair_part1.html) - the chapter you were supposed to read (:smirk:) - [Examples](https://colab.research.google.com/github/nekrut/BMMB554/blob/master/2023/ipynb/L12_altair_review.ipynb) - a notebook containing ready-to-run examples from the reading above ## Creating a heatmap Open [this notebook](https://colab.research.google.com/github/nekrut/BMMB554/blob/master/2023/ipynb/L12_heatmap.ipynb) and this page side by side. :::info Outputs of your notebook may have different counts compared to this page. ::: ### Importing data First, import all the packages we need: ```py= import pandas as pd import altair as alt from datetime import date today = date.today() ``` Next, read a gigantic dataset from DropBox: ```py= sra = pd.read_csv( "https://www.dropbox.com/s/sy2c1ds5gbi16x2/ena.tsv.gz?dl=1", compression='gzip', sep="\t", low_memory=False ) ``` This dataset contains *a lot* of rows: ```py= len(sra) ``` Which would output: ``` 6187417 ``` Let's look at the five random lines from this table (scroll sideways): ```py= sra.sample(5) ``` | | study_accession | run_accession | collection_date | instrument_platform | library_strategy | library_construction_protocol | |--------:|:------------------|:----------------|:--------------------|:----------------------|:-------------------|:--------------------------------| | 2470282 | PRJEB37886 | ERR8618165 | 2022-02-03 00:00:00 | ILLUMINA | AMPLICON | nan | | 5747352 | PRJNA686984 | SRR21294506 | 2022-07-10 00:00:00 | OXFORD_NANOPORE | AMPLICON | nan | | 2588443 | PRJEB37886 | ERR8997118 | 2022-02-15 00:00:00 | ILLUMINA | AMPLICON | nan | | 3662323 | PRJNA716984 | SRR16033134 | 2021-09-08 00:00:00 | PACBIO_SMRT | AMPLICON | Freed primers | | 5298351 | PRJNA716984 | SRR20455932 | 2022-04-10 00:00:00 | PACBIO_SMRT | AMPLICON | Freed primers | This dataset also has *a lot* of columns: ```py= for _ in sra.columns: print(_) ``` We do not need all the columns, so let's restrict the dataset only to columns we would need. This would also make it much smaller: ```py= sra = sra[ [ 'study_accession', 'run_accession', 'collection_date', 'instrument_platform', 'library_strategy', 'library_construction_protocol' ] ] ``` ### Cleaning the data The `collection_date` field will be useful for us to be able to filter out nonsense as you will see below. But to use it properly, we need tell Pandas that it is not just a text, but actually dates: ```py= sra = sra.assign(collection_date = pd.to_datetime(sra["collection_date"])) ``` Let's see what are the earliest and latest entries: ```py= print('Earliest entry:', sra['collection_date'].min()) print('Latest entry:', sra['collection_date'].max()) ``` :::danger :scream: | Don't get surprised here - the metadata is only as good as the person who entered it. So, **when you enter metadata for you sequencing data -- pay attention!!!** ::: :::success Can you write a statement that would show how many rows contain these erroneous dates? :::spoiler ```py= sra[sra['collection_date'] == sra['collection_date'].min()]['run_accession'].nunique() sra[sra['collection_date'] == sra['collection_date'].max()]['run_accession'].nunique() ``` ::: ::: It is likely that the data will need a bit of cleaning: ```py= sra = sra[ ( sra['collection_date'] >= '2020-01-01' ) & ( sra['collection_date'] <= '2023-02-16' ) ] ``` Finally, in order to buold the heatmap, we need to aggregate the data: ```py= heatmap_2d = sra.groupby( ['instrument_platform','library_strategy'] ).agg( {'run_accession':'nunique'} ).reset_index() ``` This will look something like this: | | instrument_platform | library_strategy | run_accession | |---:|:----------------------|:--------------------|----------------:| | 0 | BGISEQ | AMPLICON | 21 | | 1 | BGISEQ | OTHER | 1067 | | 2 | BGISEQ | RNA-Seq | 64 | | 3 | BGISEQ | Targeted-Capture | 38 | | 4 | BGISEQ | WGA | 1 | | 5 | CAPILLARY | AMPLICON | 3 | | 6 | DNBSEQ | AMPLICON | 325 | | 7 | DNBSEQ | OTHER | 5 | | 8 | DNBSEQ | RNA-Seq | 64 | | 9 | ILLUMINA | AMPLICON | 4833110 | | 10 | ILLUMINA | OTHER | 204 | | 11 | ILLUMINA | RNA-Seq | 34826 | | 12 | ILLUMINA | Targeted-Capture | 14843 | | 13 | ILLUMINA | WCS | 74 | | 14 | ILLUMINA | WGA | 89499 | | 15 | ILLUMINA | WGS | 59618 | | 16 | ILLUMINA | WXS | 2 | | 17 | ILLUMINA | miRNA-Seq | 28 | | 18 | ION_TORRENT | AMPLICON | 108617 | | 19 | ION_TORRENT | RNA-Seq | 75 | | 20 | ION_TORRENT | WGA | 724 | | 21 | ION_TORRENT | WGS | 1385 | | 22 | OXFORD_NANOPORE | AMPLICON | 445491 | | 23 | OXFORD_NANOPORE | OTHER | 31 | | 24 | OXFORD_NANOPORE | RNA-Seq | 18617 | | 25 | OXFORD_NANOPORE | WGA | 7842 | | 26 | OXFORD_NANOPORE | WGS | 13670 | | 27 | PACBIO_SMRT | AMPLICON | 533112 | | 28 | PACBIO_SMRT | FL-cDNA | 11 | | 29 | PACBIO_SMRT | RNA-Seq | 1013 | | 30 | PACBIO_SMRT | Synthetic-Long-Read | 29 | | 31 | PACBIO_SMRT | Targeted-Capture | 465 | | 32 | PACBIO_SMRT | WGS | 48 | ### Plotting the data Now let's create a graph. This graph will be layered: the "back" will be the heatmap squares and the "front" will be the numbers (see heatmap at the beginning of this page): ```back = alt.Chart(heatmap_2d).mark_rect(opacity=1).encode( x=alt.X( "instrument_platform:N", title="Instrument" ), y=alt.Y( "library_strategy:N", title="Strategy", axis=alt.Axis(orient='right') ), color=alt.Color( "run_accession:Q", title="# Samples", scale=alt.Scale( scheme="goldred", type="log" ), ), tooltip=[ alt.Tooltip( "instrument_platform:N", title="Machine" ), alt.Tooltip( "run_accession:Q", title="Number of runs" ), alt.Tooltip( "library_strategy:N", title="Protocol" ) ] ).properties( width=500, height=150, title={ "text": ["Breakdown of datasets (unique accessions) from ENA", "by Platform and Library Strategy"], "subtitle":"(Updated {})".format(today.strftime("%B %d, %Y")) } ) back ``` This would give us a grid: ![](https://i.imgur.com/74itE5M.png) Now, it would be nice to fill the rectangles with actual numbers: ```py= front = bottom.mark_text( align="center", baseline="middle", fontSize=12, fontWeight="bold", ).encode( text=alt.Text("run_accession:Q",format=",.0f"), color=alt.condition( alt.datum.run_accession > 200, alt.value("white"), alt.value("black") ) ) front ``` This would give us the text: ![](https://i.imgur.com/pKalDaC.png) To superimpose these on top of each other we should simple do this: ```py= back + front ``` ![](https://i.imgur.com/SzhM1QW.png)

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.