###### tags: `softserve` `article`
# On the visualization of longitudinal data
## Introduction
If you are working on eHealthecare project you know that there are lots of longitudinal data. It means that we have many measurements of many features at different times. E.g., in clinical trials, we could have a thousand patients and each of them has measurements of blood pressure (BP), biochemistry, pulse, etc., measured weekly over a few months.
Most of the time you may find it useful to visualize your data for gaining insights or for presentation. And there are only a handful common ways to do it:
+ individual line trends
+ boxplots
+ violinplots
+ Sankey plots
All those methods have their flaws. Individual trends could be a mess, especially if you have a lot of patients, boxplots could be misleading and lose to violinplots almost in everything (e.g., [5 reasons you should use a violin graph](https://blog.bioturing.com/2018/05/16/5-reasons-you-should-use-a-violin-graph/)), violin plots do not show transitions between the states and sankey plot could be hard to read.
During our work with a Fortune 500 list company, we had a task to improve the treatment protocol for a certain chronic condition. At some point we recognized that client's data had four patient clusters instead of two, as they'd thought previously. The initial (pretty common) idea was to divide patients into groups of responding and non-responding to treatment. However, if we take a closer look from the different timescale it becomes clear that they also have patients who responded to therapy at first but failed at the end, and vice versa, those who failed at first but responded at the end.
It was a very valuable finding, which helped us a lot and it looked like an insight with a great business value, but we had to present it to the client in a transparent and understandable way. We were aware that a poor presentation of work results often leads to misunderstanding, client dissatisfaction, and reputational losses.
No one will believe in the significance of your discovery if you can not convey your message in a way that is clear and understandable to vis-a-vis. Moreover, the decision makers usually don’t have the time and/or expertise to dive into the tangled DS Jupyter notebooks or series of complicated plots for answers.
Thus, we had to choose the right way to plot our findings. We solved this problem creatively, the client was satisfied and now we want to share our experience. Because of NDA, we are not able to share original data or plots, so I’ll provide explanations based on small synthetically generated dataset, which is completely irrelevant to the original task but shares similar properties.
## Use case data generation
Consider the case: you have a thousand patients whose BP was measured 6 times at certain time points. You decide to group your BP results in 5 groups (1..5) by range, where 1 could represent the 'very low', and 5 'very high'. And now, you want to visualize results.
Here I generate a completely artificial dataset for illustrative purposes, with the very strange behavior of BP levels.
It looks like this (columns are patients, rows are points in time, cell values are BP group that Patient's BP fell into at a given time point):
![](https://i.imgur.com/SetZcZh.png)
:::spoiler *Code for data generation (click to show)*
```python
import pandas as pd
import numpy as np
import random
random.seed(999)
patients = 1000
time_points = 6
groups = [1, 2, 3, 4, 5]
# state distribution splits from initial 3 to 1 and 5
tmp = []
for _ in range(time_points):
x = np.array(random.choices(groups, k=patients, weights=[0.01, 0.48, 0.02, 0.48 ,0.01]))
tmp.append(x)
df0 = pd.DataFrame(tmp)
```
:::
## Why Vyshyvanka plot?
Let's plot generated data using each of three common methods:
![](https://i.imgur.com/NgGCOyQ.png)
:::spoiler *Common methods plot code (click to show)*
```python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
def plot_common(df):
fig, axs = plt.subplots(3, 1, figsize=(12, 10), sharex=True)
fig.patch.set_facecolor('white')
fig.tight_layout(rect=[0, 0, 1, 0.96], h_pad=1.0, w_pad=0.94)
fig.suptitle(f'BP groups over time (n={df.shape[1]})', weight='bold')
# individual trends
for i in df:
axs[0].plot(df[i])
# boxplot and violin
ax = sns.boxplot(ax=axs[1], data=df.T, medianprops={'lw': 4})
ax = sns.violinplot(ax=axs[2], data=df.T, scale='count', inner='box')
for ax in axs:
_ = ax.set_yticks(np.unique(df0))
_ = ax.set_ylabel('BP group')
_ = axs[-1].set_xlabel('Time point')
plot_common(df0)
```
:::
<br />
We may see that individual trends are a mess, boxplots are not very informative and a violin plot shows us that patients are heavily distributed between 2nd and 4th groups and their distribution does not change significantly over time.
But this is simply not true, we created the dataset in a way where each patient has an equal chance to appear in the 2 or the 4 group at the next step. Will Sankey plot help us?
![](https://i.imgur.com/NrQbosf.png)
Well, a sort of (here `N_` numbers stands for BP group and `_M` numbers stands for certain time point, e.g. 3_2 is 3rd BP group at 2nd timepoint). We are able to see some transactions but the plot is too fiddly and obviously needs some tuning. You may find a separate chapter dedicated to Sankey plot problems on python below.
Meanwhile, compare it with a better way to visualize this:
![](https://i.imgur.com/HmP4tmf.png)
I'll provide the code, but for now, let me describe the plot. It shows BP group transition over time. The line width and markers opacity (black dots) correlates with the number of patients. Here we can see that a majority of patients started in the 2 and the 4 groups and these groups have a stable amount of patients during the whole period of observation. But it also shows that there were 2 types of transitions with relatively equal probabilities: to remain in the initial group or switch between 2nd and 4th, with a small chance to switch to groups 1, 3, or 5. Here we do not see patients' individual pathways, rather the transition probability trends.
And this could be a pretty valuable insight. Also, it is a convenient plot to present and explain.
We have invented this plot almost by accident under the pressure of practical needs during our work with the real customer. Our Lviv teammates have named it "Vyshyvanka plot" due to similarity of appearance with the [national Ukrainian costume](https://en.wikipedia.org/wiki/Vyshyvanka) pattern.
Now, as promised, I'll share the code and a few more examples. In the remaining post I'll be plotting violin and vyshyvanka plots only because they are obviously superior to individual line trends and boxplots. Also, to make it more general, we will discontinue the BP groups example and just talk about state transitions over time.
:::spoiler *Vyshyvanka code (click to show)*
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import itertools
def plot_vyshyvanka(df: pd.DataFrame, ax: plt.Axes,
coef_width=50, line_opacity=0.6, marker_opacity=0.1) -> (plt.Axes, int, np.ndarray):
"""
It will build the flow plot "Vyshyvanka": the transitions between possible states of
observations on the Y-axis over some time periods on the X-axis.
Line width and marker opacity correlate with the number of observations at a current point.
:param df: dataframe where columns are observations and rows are time points
:param ax: an instance of ``matplotlib.axes.Axes`` to plot on
:param coef_width: the bigger it, the thicker will be the lines
:param line_opacity: lines opacity
:param marker opacity: markers opacity
:return: (``matplotlib.axes.Axes``, number_of_observations_without_na, possible_states)
**Example:**
.. code-block:: python
import random
import pandas as pd
import matplotlib.pyplot as plt
obs_num = 1000
time_points = 8
states = [1, 2, 3, 4, 5]
tmp = []
for _ in range(time_points):
tmp.append(random.choices(states, k=obs_num,
weights=[0.05, 0.2 ,0.5, 0.2 ,0.05]))
df = pd.DataFrame(tmp)
fig, ax = plt.subplots(figsize=(15, 7))
fig.patch.set_facecolor('white')
ax, num, relevant_states = plot_vyshyvanka(df, ax)
_ = ax.set_yticks(relevant_states)
_ = ax.set_xlabel('Time points')
_ = ax.set_ylabel('States')
_ = ax.set_title(f'State transitions over time (n={num})')
"""
# drop observations with NA an count what left
df_plot = df.dropna(axis=1)
obs_num = len(df_plot.columns)
# iterate over time points using sliding window with length=2, stride=1
for row1, row2 in zip(itertools.islice(df_plot.iterrows(), 0, None, 1),
itertools.islice(df_plot.iterrows(), 1, None, 1)):
# get row idxs (from df.iterrows() output)
x1 = row1[0]
x2 = row2[0]
# get row values
tmp = pd.concat([row1[1], row2[1]], axis=1).to_numpy()
# count unique pairs was-now states
pairs, nums = np.unique(tmp, axis=0, return_counts=True)
# get all possible states
states = np.unique(pairs.flatten())
# calculate line widths for each pair (state transition)
# line width correlates with observations amount,
# where ``obs_num * coef_width`` is the maximum width
widths = nums / obs_num * coef_width
# add axis to widths in order to merge it with pairs for comfy iteration
for row in np.hstack([pairs, np.expand_dims(widths, axis=1)]):
y1, y2, w = row
# plot line
ax.plot([x1, x2], [y1, y2], linewidth=w, color='red', alpha=line_opacity, solid_capstyle='round')
# plot markers
# marker opacity correlates with observations amount
ax.plot(x1, y1, 'o', markersize=2, color='black', alpha=marker_opacity)
ax.plot(x2, y2, 'o', markersize=2, color='black', alpha=marker_opacity)
return ax, obs_num, states
```
:::
## Python problems with Sankey plot
There are not so many python plotting libs ([link1](https://mode.com/blog/python-data-visualization-libraries/), [link2](https://mode.com/blog/python-interactive-plot-libraries/)) and only a few that could do Sankey plots -- [matplotlib](https://matplotlib.org/api/sankey_api.html), [holoviews](https://holoviews.org/reference/elements/bokeh/Sankey.html) and [plotly](https://plotly.com/python/sankey-diagram/) (also there is a jupyter widget [ipysankeywidget](https://github.com/ricklupton/ipysankeywidget)). But only one lib could create beautiful enough Sankey plots -- plotly (through JS frontend).
:::spoiler *Here is the code for example Sankey plot (click to show)*
```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import itertools
def prepare_data_for_sankey(df: pd.DataFrame) -> (np.ndarray, np.ndarray, np.ndarray, list):
""" Convert our data for sankey
We need next one dimensional vectors:
FROM, TO, HOW_MANY, NODE_LABLES (source, target, value, labels)
:return: ``(source, target, value, labels)``
"""
tmp = df.to_numpy()
time_points = tmp.shape[0]
source = []
target = []
value = []
# get groups range
groups = range(np.unique(df).min(), np.unique(df).max() + 1)
# generate lables -- all possible combinations of groups and time points
labels = [f'{g}_{t}' for g, t in itertools.product(groups, range(time_points))]
# all possible transitions betveen 5 BP groups
all_variants = np.array(list(itertools.product(groups, groups)))
# sliding widow for time points with size=2 and step_size=1
for i in range(time_points - 1):
pairs, counts = np.unique(tmp[i+0:i+2].T, axis=0, return_counts=True)
# check every possible transaction
for v in all_variants:
# check if we have this theoretical variant in our real pairs
mask = (pairs == v).all(axis=1)
if mask.sum() == 1:
# get HOW_MANY from counts
value.append(counts[mask][0])
else:
value.append(0)
# i for BP group 'now'
source.append(labels.index(f'{v[0]}_{i}'))
# i+1 for BP group 'in future'
target.append(labels.index(f'{v[1]}_{i+1}'))
return source, target, value, labels
def plot_sankey(source: np.ndarray, target: np.ndarray,
value: np.ndarray, labels: list) -> go.Figure:
""" Based on <https://plotly.com/python/sankey-diagram/>
source and target are indices correspond to labels
"""
fig = go.Figure(data=[go.Sankey(
arrangement = "perpendicular",
node = dict(
pad = 15,
thickness = 20,
line = dict(color = "black", width = 0.5),
label = labels,
color = "red",
),
link = dict(
source = source,
target = target,
value = value
))])
return fig
def convert_and_plot_sankey(df: pd.DataFrame, order=False) -> go.Figure:
"Both above functions in one place + node ordering"
source, target, value, labels = prepare_data_for_sankey(df)
fig = plot_sankey(source, target, value, labels)
if order:
groups = len(np.unique(df))
time_points = df.shape[0]
# Xs
pre_x = np.expand_dims(np.linspace(1e-09, 1, time_points), axis=1)
fig.data[0].node.x = pre_x.repeat(groups, axis=1).T.flatten()
# Ys
pre_y = np.expand_dims(np.linspace(1e-09, 1, groups), axis=1)
fig.data[0].node.y = pre_y.repeat(time_points, axis=1).flatten()
# make lables smaller
fig.update_traces(textfont={'size': 10})
# remove big margins
fig.update_layout(margin=dict(l=25, r=25, b=50, t=50, pad=0))
return fig
```
:::
```python
fig = convert_and_plot_sankey(df0)
fig.show()
```
![](https://i.imgur.com/NrQbosf.png)
To make it more intuitive we must force-order our BP groups like on Vyshyvanka plot, thus we will able to keep track of group transitions much more easily:
```python
### order nodes on sankey plot
# Plot same states on the same line (as on vyshyvanka plot)
fig = convert_and_plot_sankey(df0, order=True)
fig.show()
```
:::spoiler *Ordering code explanation (click to show)*
```python
""" We need to set x and y normalized coordinates (0..1) for each node.
And they can't be exactly 0.
We just will try to distribute them evenly, but in certain order,
where all timepoints of single BP group must be on one line.
"""
time_points = 6
groups = 5
#We have nodes in that order
fig.data[0].node.label
>>>[1_0, 1_1, 1_2, 1_3, 1_4, 1_5,
2_0, 2_1, 2_2, 2_3, 2_4, 2_5,
3_0, 3_1, 3_2, 3_3, 3_4, 3_5,
4_0, 4_1, 4_2, 4_3, 4_4, 4_5,
5_0, 5_1, 5_2, 5_3, 5_4, 5_5]
# Xs: each group of (1_n..5_n) labels should be aligned vertically
np.expand_dims(np.linspace(0, 1, time_points), axis=1).repeat(groups, axis=1).T.flatten()
>>>array([0. , 0.2, 0.4, 0.6, 0.8, 1. ,
0. , 0.2, 0.4, 0.6, 0.8, 1. ,
0. , 0.2, 0.4, 0.6, 0.8, 1. ,
0. , 0.2, 0.4, 0.6, 0.8, 1. ,
0. , 0.2, 0.4, 0.6, 0.8, 1. ])
# Ys: each group of (n_1 .. n_5) labels should be aligned horizontally
np.expand_dims(np.linspace(1e-09, 1, groups), axis=1).repeat(time_points, axis=1).flatten()
>>>array([0. , 0. , 0. , 0. , 0. , 0.,
0.25, 0.25, 0.25, 0.25, 0.25, 0.25,
0.5 , 0.5 , 0.5 , 0.5 , 0.5 , 0.5 ,
0.75, 0.75, 0.75, 0.75, 0.75, 0.75,
1. , 1. , 1. , 1. , 1. , 1. ])
```
:::
</br>
![](https://i.imgur.com/12Oau7U.png)
Despite that on this example everything seems OK, plotly has a lot of problems with changing default nodes order, which makes it hard to use it on most data (see [examples](https://gitlab.com/banderlog/vyshyvanka_plot_article/-/blob/master/sankey_plot.ipynb)).
Consider these github issues (especially the last one):
+ [FR: Add textposition property for Sankey plot's node labels](https://github.com/plotly/plotly.py/issues/3004)
+ [Sankey node x and y positions require both variables set to reflect in chart](https://github.com/plotly/plotly.py/issues/1732)
+ [Sankey node x and y positions could not be equal to 0 or np.nan](https://github.com/plotly/plotly.py/issues/3002)
+ [Sankey node x and y positions ignored if node is empty](https://github.com/plotly/plotly.py/issues/3003)
## Examples
You may find all relevant Jupyter notebooks with all code from this post [here](https://gitlab.com/banderlog/vyshyvanka_plot_article).
:::spoiler *Wrapper plot function which we will use in further examples (click to show)*
```python
import matplotlib.pyplot as plt
import seaborn as sns
def plot_violins_and_vyshyvanka(df):
fig, axs = plt.subplots(2, 1, figsize=(10, 10), sharex=True)
fig.patch.set_facecolor('white')
fig.tight_layout(rect=[0, 0, 1, 0.96], h_pad=1.0, w_pad=0.94)
ax = sns.violinplot(ax=axs[0], data=df.T, scale='count', inner='box')
ax, obs_num, states = plot_vyshyvanka(df, axs[1])
_ = axs[1].set_xlabel('Time points')
for ax in axs:
_ = ax.set_yticks(states)
_ = ax.set_ylabel('States')
fig.suptitle(f'State transitions over time (n={obs_num})', weight='bold')
```
:::
### State distribution changes with mode of 1->5 transition
:::spoiler *Data geneation for this example (click to show)*
```python
# state distribution changes from 1 to 5 mostly
random.seed(999)
obs_num = 1000
time_points = 12
states = [1, 2, 3, 4, 5]
tmp = []
x = np.array(np.array(random.choices(states, k=obs_num, weights=[0.64, 0.2, 0.1, 0.05, 0.01])))
tmp.append(x)
for _ in range(time_points - 1):
x = x + np.array(random.choices([-1, 0, 1], k=obs_num, weights=[0.1, 0.45, 0.45]))
x = np.clip(x, 1, 5)
tmp.append(x)
df2 = pd.DataFrame(tmp)
```
:::
![](https://i.imgur.com/zOP3wss.png)
![](https://i.imgur.com/4fZE6b4.png)
### State distribution changes, bi-modal 1->5 and 5->1 transitions
:::spoiler *Data geneation for this example (click to show)*
```python
# state distribution changes from 1 to 5 and back to 1
random.seed(999)
obs_num = 1000
states = [1, 2, 3, 4, 5]
tmp = []
x = np.array(np.array(random.choices(states, k=obs_num, weights=[0.64, 0.2 ,0.1, 0.05 ,0.01])))
tmp.append(x)
for _ in range(5):
x = x + np.array(random.choices([-1, 0, 1], k=obs_num, weights=[0.05, 0.45, 0.5]))
x = np.clip(x, 1, 5)
tmp.append(x)
x = x + np.array(random.choices([-1, 0, 1], k=obs_num, weights=[0.05, 0.9, 0.05]))
x = np.clip(x, 1, 5)
tmp.append(x)
for _ in range(5):
x = x + np.array(random.choices([-1, 0, 1], k=obs_num, weights=[0.55, 0.4, 0.05]))
x = np.clip(x, 1, 5)
tmp.append(x)
df3 = pd.DataFrame(tmp)
```
:::
![](https://i.imgur.com/NeHT7nt.png)
![](https://i.imgur.com/zh7hxMI.png)
### State distribution splits from initial 3 flowing to either 1 or 5 ultimately
:::spoiler *Data geneation for this example (click to show)*
```python
# state distribution splits from initial 3 to 1 or 5 mostly
random.seed(999)
obs_num = 1000
time_points = 12
states = [1, 2, 3, 4, 5]
# state distribution splits from initial 3 to 1 and 5
tmp = []
x = np.array(random.choices(states, k=obs_num, weights=[0, 0, 1, 0 ,0]))
tmp.append(x)
for _ in range(time_points - 1):
x = x + np.array([random.choices([-1, 0, 1], k=int(obs_num/2), weights=[0.02, 0.48, 0.5]),
random.choices([-1, 0, 1], k=int(obs_num/2), weights=[0.5, 0.48, 0.02])]).ravel()
x = np.clip(x, 1, 5)
tmp.append(x)
df4 = pd.DataFrame(tmp)
```
:::
![](https://i.imgur.com/12WZryR.png)
![](https://i.imgur.com/aXp7Bct.png)
## Interactive Vyshyvanka plot
In rare cases, you might want to make it interactive so that to be able to know exact number and percent of transitioned states by clicking on a line. Also, interactive mode allows for finding optimal values for opacity and line widths. For this, we provide you with the code below. Keep in mind, that it will work in Jupyter notebook or a Jupyter-lab upon installation of [mplcursors](https://mplcursors.readthedocs.io/en/stable/) and [ipywidgets](https://github.com/jupyter-widgets/ipywidgets/blob/master/docs/source/user_install.md) (be careful with virtualenv)
:::spoiler *Interactive Vyshyvanka plot code (click to show)*
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import random
import itertools
import mplcursors
import ipywidgets as widgets
from ipywidgets import interact
from matplotlib.legend import Legend
from typing import List
%matplotlib widget
###### Dataset generation
# for reproducibility
random.seed(999)
# create dataset
patients = 1000
time_points = 6
groups = [1, 2, 3, 4, 5]
tmp = []
for _ in range(time_points -1 ):
x = np.array(random.choices(groups, k=patients, weights=[0.01, 0.48, 0.02, 0.48 ,0.01]))
tmp.append(x)
df = pd.DataFrame(tmp)
###### Static part
## Plot
coef_width = 50
fig, ax = plt.subplots(figsize=(9, 4))
fig.patch.set_facecolor('white')
ax, num, relevant_states = plot_vyshyvanka(df, ax, coef_width=coef_width)
_ = ax.set_yticks(relevant_states)
_ = ax.set_xlabel('Time points')
_ = ax.set_ylabel('States')
_ = ax.set_title(f'State transitions over time (n={num})')
## Generate legend
# create lines which represent different percent of observations (number of observations)
steps = ([0.25, 0.1, 0.05, 0.01])
lines = [Line2D([0], [0], color='red', alpha=0.6, solid_capstyle='round', lw=(coef_width * i)) for i in steps]
l_labels = [f'{i*100:.0f}% ({i*num:.0f})' for i in steps]
legend = ax.legend(lines, l_labels, loc='upper left', borderpad=0.8, handlelength=1, handletextpad=1)
###### Interactive part
## Interactive sliders for line opacity and width change
class Foo:
def __init__(self, ax: plt.Axes, legend: Legend):
""" This is the only known way to avoid global variables
with ipwidgets updates on interaction
"""
self.ax = ax
self.legend = legend
def update(self, width_coeff: int, line_opacity: float,
orig_w: List[float], orig_l_w: List[float]) -> None:
""" Will set new width and opacity for every line
on legend and plot
:param width_coeff: this affects line width
:param line_opacity: this affects line opacity
:param orig_w: previous line widths on Axes
:param orig_l_w: previous line widths on Legend
"""
# set new width and opacity for Axes lines
for line, w in zip(self.ax.get_lines(), orig_w):
line.set_lw(w * width_coeff)
line.set_alpha(line_opacity)
# set new width and opacity for Legend lines
for line, w in zip(self.legend.get_lines(), orig_l_w):
line.set_lw(w * width_coeff)
line.set_alpha(line_opacity)
# get current width for lines on plot and on legend
orig_widths = [line.get_lw() for line in ax.get_lines()]
orig_l_w = [line.get_lw() for line in legend.get_lines()]
# plot actual sliders
updater = Foo(ax, legend)
interact(updater.update,
width_coeff=widgets.FloatSlider(min=0, max=2, step=0.1, value=1),
line_opacity=widgets.FloatSlider(min=0, max=1, step=0.05, value=0.4),
orig_w=widgets.fixed(orig_widths),
orig_l_w=widgets.fixed(orig_l_w))
## Interactive annotation
# set label "percent_of_observations / number_of_observations" for each line
for line, w in zip(ax.get_lines(), orig_widths):
tmp = w / coef_width
line.set_label(f'{tmp * 100:.1f}% / {tmp * num:.0f}')
# on LB (on line) it will show annotation
# on RB (on annotation) it will hide it
mplcursors.cursor().connect("add", lambda sel: sel.annotation.set_text(sel.artist.get_label()));
```
:::
</br>
You should get something like this:
![](https://media.giphy.com/media/SfpfYK6gJVpaRhun9W/giphy.gif)
We hope some of you find the Vyshyvanka plot useful :)