# Boris Bikes
```python
import datetime
import h3 # uber geo package
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import json
import pickle
import os
import folium
import requests as requests
from geopandas import GeoDataFrame, points_from_xy
import movingpandas as mpd
from shapely.geometry import Point
from folium.plugins import TimestampedGeoJson
import re
import contextily as cx
import community
import networkx as nx
from tqdm.auto import tqdm
import pyproj
```
```python
os.environ['PROJ_LIB'] = pyproj.datadir.get_data_dir()
!echo $PROJ_LIB
```
/Users/mhauru/projects/TDSCamila/stories/2022-10-06-Boris-Bikes/.venv/lib/python3.10/site-packages/pyproj/proj_dir/share/proj
```python
tqdm.pandas()
```
TODO: this setting doesn't persist in the initial notebook run and has to be run again separately, with the charts redrawn :( - fix
```python
matplotlib.rcParams['figure.figsize'] = (20, 10)
```
# Introduction
In recent years, London has become a much more bike-friendly city, thanks in large part to the Transport for London (TfL) Santander Cycles scheme. The scheme was launched in July 2010 (sponsored by Barclays Bank until 2015) and has been running ever since.
The bicycles, colloquially called "Boris Bikes" after the Mayor of London at the time, are part of a sharing system that allows users to pick them up and drop them off at designated docking stations located around London. The bikes are available 24 hours a day, seven days a week, and can be used for a variety of journeys, from commuting to leisure rides. Riders can purchase access to the scheme through a range of payment options, including yearly and monthly memberships as well as a day pass. [The scheme also offers a variety of discounts for students, National Health Service workers and users of the Cycle to Work scheme](https://tfl.gov.uk/modes/cycling/santander-cycles/register-for-your-own-key).
The scheme has been praised for its convenience and affordability and has become a popular way for Londoners to get around the city. According to some [studies](https://www.bmj.com/content/348/bmj.g425) it has had a positive effect on the health of its users by increasing physical activity within the capital. As reported by TfL, by mid-2022 more than 111.2 million journeys had been made using the cycles, with [the record for cycle hires](https://tfl.gov.uk/info-for/media/press-releases/2015/july/london-celebrates-five-successful-years-of-the-mayor-s-flagship-cycle-hire-sche) in a single day standing at 73,000. In October 2022, TfL added new e-bikes to the scheme, the [first docked e-bikes in London](https://tfl.gov.uk/info-for/media/press-releases/2022/october/docked-e-bikes-now-available-for-hire-as-part-of-london-s-record-breaking-santander-cycles-scheme).
Cycling infrastructure also played an essential role in London, as in many other cities, during the pandemic: with many people avoiding public transport and opting to cycle instead, Boris Bikes became part of many commuters' journeys.
For these reasons, in this Turing Data Story we have decided to focus on the dataset provided by TfL on bike journeys as an opportunity to analyse how the scheme and its use have changed over time and how they have affected the daily lives of Londoners.
[TODO Add outline/table of contents, with links to the various sections?]
# Data description
TfL provides an [open dataset](https://cycling.data.tfl.gov.uk/) containing all bike journeys since the launch of the scheme in July 2010. The dataset contains information about each journey, such as the start and end stations, the start and end dates and times, and the duration of the journey. For this story we are using all available data from the start of the scheme to January 2022.
An example of the format of the data provided by TfL is the following:
| Rental Id |Duration | Bike Id| End Date| EndStation Id| EndStation Name| Start Date| StartStation Id |Start Station Name|
|-------- | -------- | -------- |-------- |-------- |-------- |-------- |-------- |-------- |
| 85323786 | 180 |6711 | 26/03/2019 14:25 |400 | George Street, Marylebone |26/03/2019 14:22 |6|Broadcasting House, Marylebone|
| 85290142 | 300| 7125 |25/03/2019 13:37 |624 |Courland Grove, Wandsworth Road |25/03/2019 13:32 |772 |Binfield Road, Stockwell |
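As an illustration of how one of these files can be read, here is a minimal sketch of loading and renaming a single raw journey file with pandas. The file name is hypothetical and the column labels follow the table above; in practice both vary between files, as discussed in the next section.
```python
import pandas as pd

# A minimal sketch of loading one raw journey file. The file name below is
# hypothetical and the column labels follow the table above; in practice both
# vary between files, which is part of what the cleaning step deals with.
raw = pd.read_csv(
    "JourneyDataExtract.csv",  # hypothetical file name
    parse_dates=["Start Date", "End Date"],
    dayfirst=True,  # TfL timestamps are day-first, e.g. 26/03/2019 14:22
)
raw = raw.rename(
    columns={
        "Rental Id": "rental_id",
        "Duration": "duration",
        "Bike Id": "bike_id",
        "StartStation Id": "start_station_id",
        "Start Station Name": "start_station_name",
        "EndStation Id": "end_station_id",
        "EndStation Name": "end_station_name",
        "Start Date": "start_date",
        "End Date": "end_date",
    }
)
```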
# Data processing and cleaning
The data covers a wide time range, and the format has seen several small changes in that period. Consequently, having a single table with all the journey data requires a bit of cleaning, harmonising, and general data-massaging. Some examples of the kinds of things one needs to deal with are:
* Data files are provided at weekly level and need to be downloaded programmatically from the TfL web page,
* Most of the data comes in CSV files, but some time periods are in Excel spreadsheets,
* Station name (string) to station ID (integer) mappings are many-to-many: the same station often has slightly differently spelled names in different files, sometimes the same station has gone by several IDs, and sometimes the same ID has been used for entirely different stations at different times,
* Station IDs are integers, except for one (there's always one isn't there?) that is a string,
* Some rows have missing data,
* Columns in CSV files have changed name a few times,
* etc.
Dealing with all this is a boring affair and would distract from the purpose of this data story, so we have done the cleaning elsewhere. However, if you enjoy boredom, you can find our data-cleaning script [here](https://github.com/alan-turing-institute/pfeffel/blob/main/src/clean_data.py).
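As a small taste of what this involves, here is a hedged sketch (not the actual cleaning script) of two of the steps listed above, continuing from the `raw` frame sketched earlier: coercing station IDs to integers and crudely normalising station name spellings.
```python
# Hedged sketch of two cleanup steps; the real logic lives in the cleaning
# script linked above and handles many more corner cases.

# Station IDs should be integers, but at least one appears as a string;
# coerce what we can and leave the rest as NaN for manual inspection.
raw["start_station_id"] = pd.to_numeric(raw["start_station_id"], errors="coerce")
raw["end_station_id"] = pd.to_numeric(raw["end_station_id"], errors="coerce")

# The same station is often spelled slightly differently across files; a crude
# normalisation strips whitespace and lowercases names before building the
# name-to-ID mapping.
for col in ["start_station_name", "end_station_name"]:
    raw[col] = raw[col].str.strip().str.lower()
```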
In the end, we have around 100M bicycle journeys of cleaned data (we had to drop a negligibly small number of rows that were missing or malformed beyond repair), all in one pandas DataFrame that, at around 7 gigabytes, is thankfully just small enough to handle in memory on a modern laptop.
In addition to bike journeys, we also make use of another dataset that TfL provides, which contains the coordinates of all the docking stations.
We have uploaded all the necessary datasets for this story to [Zenodo](https://doi.org/10.5281/zenodo.7551619). Before running this story, make sure to download the data and place it in a directory named `data` at the same level as this notebook.
```
!mkdir data
!wget https://zenodo.org/record/7551619/files/BorisBikes_journeys_cleaned_data.pickle -P data/
!wget https://zenodo.org/record/7551619/files/BorisBikes_station_names.pickle -P data/
!wget https://zenodo.org/record/7551619/files/BorisBikes_stations_coordinates.json -P data/
```
Now let's load the data.
## Load processed data
```python
# TODO update
RAW = "data/BorisBikes_journeys_cleaned_data.pickle"
LOCATION_REF = "data/BorisBikes_stations_coordinates.json"
```
```python
station_locations_df = pd.read_json(LOCATION_REF).T
station_locations_df.head()
```
```python
# load the cleaned journey data
df = pd.read_pickle(RAW)
```
```python
# num rows
len(df)
```
97656969
As mentioned above, we have around 100 million rows.
Some trips have bogus dates from before the scheme started, or from the future. We drop those rows from our dataset. We do allow for journeys that have no end date, since there are quite a lot of them, and presumably they mark bikes that were lost or broke down.
```python
EARLIEST_DATE = datetime.datetime(2010, 1, 1)
```
```python
%%time
# filter out of range dates
df = df[df["start_date"] > EARLIEST_DATE]
# allow NA for end dates
df = df[(df["end_date"] > EARLIEST_DATE) | df["end_date"].isna()]
# also drop entries where the end date is not after the start date
df = df[df["start_date"] < df["end_date"]]
# recalc duration
df["duration"] = df["end_date"] - df["start_date"]
```
CPU times: user 19.3 s, sys: 17.2 s, total: 36.6 s
Wall time: 44.8 s
We still have the vast majority of our data left after this:
```python
len(df)
```
97120201
The last journey date observed in our dataset is the following:
```python
max(df['end_date'])
```
Timestamp('2022-01-04 23:59:00')
# Statistics on bike usage
After some basic cleaning in the sections above we can start looking at some statistics. Let's see how many unique bikes we have in our dataset:
```python
df["bike_id"].nunique()
```
Now, let's look at the bike with the most trips in our dataset:
```python
bike_groups = df.groupby("bike_id")
# bike with the most trips
group_counts = bike_groups.count()["filename"] # pick an arbitrary column (without nulls) to get counts
b_id = group_counts.idxmax()
n_trips = group_counts.loc[b_id]
print(f"""
bike with most trips: {b_id}
did {n_trips} trips
""")
```
bike with most trips: 8875
did 10034 trips
Trips vary in length, so let's also find the bike with the greatest total time in use, summing up the durations of all its trips:
```python
# bike with the longest total trip duration
group_sums = bike_groups["duration"].sum()
b_id = group_sums.idxmax()
d_sum = group_sums.loc[b_id]
print(f"""
bike with longest sum duration of trips: {b_id}
total duration of {d_sum}
""")
```
bike with longest sum duration of trips: 2143
total duration of 215 days 05:46:00
At the macro level we can look at the distribution of the durations of the journeys (here we have decided to exclude outliers representing very long journeys to focus on the bulk of the distribution).
```python
fig, ax = plt.subplots()
ax = df[df["duration"].dt.seconds < 10000]["duration"].dt.seconds.hist(bins=50)
plt.xlabel('Trip duration (seconds)')
plt.ylabel('Frequency')
plt.title('Trip duration')
plt.show()
```

We can see in the figure above that the bikes are used mostly for short journeys, with the core of the distribution centred around journeys roughly 15 minutes long, and most journeys lasting less than 30 minutes.
## Long-lived bikes
Next, let's calculate the average lifetime of a bike. We define a bike's lifetime as the time period during which it appears in our dataset.
To do this, for each bike we find its earliest start date and latest end date in the dataset.
```python
trips_per_bike = bike_groups["filename"].count()
# first() and last() rely on the time ordering of df via rental_id
bike_start = bike_groups["start_date"].first()
bike_end = bike_groups["end_date"].last()
bike_lifetime = bike_end - bike_start
```
The distribution of the lifetimes in days is the following
```python
fig, ax = plt.subplots()
ax = bike_lifetime.dt.days.hist(bins=50)
plt.xlabel('Days')
plt.ylabel('Bikes')
plt.title('Lifetime of a bike')
plt.show()
```
The distribution looks quite flat, with a large peak at around 3,600 days. These are very likely bikes that have been part of the scheme from the beginning.
Now, let's see what the average utilisation of a bike is. By this we mean the total ride duration divided by its total lifetime.
```python
duration_sums = bike_groups["duration"].sum()
bike_utilisation = duration_sums / bike_lifetime
print("Mean utilisation:", bike_utilisation.mean())
print("Max utilisation:", bike_utilisation.max())
```
The mean above suggests that, over its lifetime, an average bike is in use for around 45 minutes per day.
We must note the caveat that the lifetime definition does not account for the time bikes spend in maintenance or otherwise out of service. As a consequence, this utilisation metric may be an underestimate, because it counts any downtime as the bike being available but not in use.
```python
fig, ax = plt.subplots()
ax = bike_utilisation.hist(bins=500)
plt.xlim([0, 0.15])
plt.xlabel('Utilisation (fraction of lifetime in use)')
plt.ylabel('Bikes')
plt.title('Bike utilisation')
plt.show()
```

### Time series of bike usage
The number of bikes and stations has evolved since the beginning of the scheme, with more being added every year.
Furthermore, London has changed since 2010 and bike usage might have changed with it.
Hence we'll take a look at the number of bikes and stations available and at bike utilisation as a function of time.
Our utilisation measure here will be slightly different from the previous one. Previously we looked at utilisation at the bike level and averaged it. Now we look at the total usage of the entire fleet in a month and divide it by the maximum possible usage for that month (every bike ridden 24/7).
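In other words, for each complete month $m$ we compute
$$
\text{utilisation}_m = \frac{\sum_{\text{trips starting in } m} \text{duration}}{T_m \times N_m},
$$
where $T_m$ is the total length of month $m$ and $N_m$ is the number of bikes alive in that month; a value of 1 would mean every bike in the fleet being ridden around the clock.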
Let's start by creating a monthly index for the time series:
```python
# don't want to include the first and last months as they may be incomplete; used in a filter later
incomplete_months = df["start_date"].iloc[[0, -1]].dt.to_period("M")
# create a complete monthly index that covers ALL months in period
complete_monthly_index = pd.date_range(start=df["start_date"].iloc[0], end=df["end_date"].iloc[-1], freq="M").to_period("M")
# remove incomplete months
complete_monthly_index = complete_monthly_index.delete(complete_monthly_index.isin(incomplete_months))
```
Next we create a function that counts how many bikes or stations are in use in a given month.
```python
# TODO: should stations count as alive for the next month rather than the current one?
def calc_alive_per_month(starts: pd.Series, ends: pd.Series, incomplete_months: pd.Series, complete_monthly_index: pd.PeriodIndex):
starts_per_month = starts.dt.to_period("M").value_counts()
ends_per_month = ends.dt.to_period("M").value_counts()
counts_df = complete_monthly_index.to_frame(name="foo").join(starts_per_month).join(ends_per_month).sort_index().fillna(0)
# ending items should only be counted at the start of next month, so shift
counts_df["end_date"] = counts_df["end_date"].shift(fill_value=0)
alive_per_month = counts_df["start_date"].cumsum() - counts_df["end_date"].cumsum()
return alive_per_month[~alive_per_month.index.isin(incomplete_months)]
```
Using the function defined above we create a dataframe of bikes available at the monthly level.
```python
alive_bikes_per_month = calc_alive_per_month(starts=bike_start, ends=bike_end, incomplete_months=incomplete_months, complete_monthly_index=complete_monthly_index)
```
We are also interested in the total usage of the bikes per month, in terms of the summed journey durations and the resulting utilisation.
```python
duration_sums_per_month = df[["duration"]].groupby(df["start_date"].dt.to_period("M"))["duration"].sum()
duration_sums_per_month = duration_sums_per_month.to_frame()
duration_sums_per_month["max_possible_duration"] = duration_sums_per_month.index.map(lambda x: x.end_time - x.start_time)
utilisation_per_month = duration_sums_per_month["duration"] / duration_sums_per_month["max_possible_duration"] / alive_bikes_per_month
# remove incomplete months
utilisation_per_month = utilisation_per_month[~utilisation_per_month.index.isin(incomplete_months)]
```
Let's also look at how many stations are available in each month.
```python
station_groups = df.groupby("start_station_id")
# relies on time ordering of df via rental_id
station_start = station_groups["start_date"].first()
station_end = station_groups["end_date"].last()
```
```python
alive_stations_per_month = calc_alive_per_month(starts=station_start, ends=station_end,
incomplete_months=incomplete_months, complete_monthly_index=complete_monthly_index)
```
Let's merge our bike usage dataframe with our new station usage data. This results in a monthly indexed dataframe that we can use for visualisation later on:
```python
# forward fill gaps
stats_df = complete_monthly_index.to_frame(name="date")\
.join(alive_bikes_per_month.rename("alive_bikes"))\
.join(alive_stations_per_month.rename("alive_stations"))\
.join(utilisation_per_month.rename("utilisation"))\
.fillna(method="ffill")
```
```python
stats_df.head()
```
Let's look at our time series from March 2012.
```python
stats_df[1:].plot.area(subplots=True)
plt.xlabel('Years')
```
Here we can clearly see the expansion of the scheme in 2012, when new stations were added to [include East London](https://www.theguardian.com/uk/2011/feb/01/bike-hire-scheme-east-london-spring-2012). The effect of the pandemic in 2020 is also evident as a spike in utilisation, which otherwise seems to have slowly decreased after the excitement of the first couple of years of the scheme.
# Looking at the behaviour of individual bikes
In the next section of the story we are interested in the movement patterns of individual bikes. To follow the different journeys a bike takes during a time period, we define a new concept which we call a chain: a sequence of consecutive trips by a given bike, where each trip starts at the station where the previous trip ended.
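As a toy illustration of the idea (with made-up station IDs), consider a single bike with four consecutive trips; a new chain starts whenever a trip does not begin at the station where the previous trip ended.
```python
import pandas as pd

# Made-up trips for one bike: trips 1 and 2 form a chain (3 -> 7, 7 -> 12);
# trip 3 starts at station 5 rather than 12, so a second chain begins there.
toy = pd.DataFrame(
    {
        "start_station_id": [3, 7, 5, 12],
        "end_station_id": [7, 12, 12, 3],
    }
)
new_chain = toy["start_station_id"] != toy["end_station_id"].shift()
print(new_chain.cumsum())  # chain number per trip: 1, 1, 2, 2
```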
Creating chains for all available bikes in the dataset would be time-consuming, and we are only interested in a few examples, so we will compute chains for a subset only: the bikes with the longest lifetimes.
Let's get the top ten longest-lived bikes in our dataset.
```python
top_ten_lived_bike_ids = bike_lifetime.sort_values()[-10:].index.values
```
```python
top_ten_bike_subset = df[df["bike_id"].isin(top_ten_lived_bike_ids)].copy()
```
Let's create a function that, for a given bike, assigns a chain ID to each sequence of journeys. The chain ID changes whenever the starting station of a journey is not the same as the end station of the previous one.
```python
def add_chains(bike_id: int, bike_group: pd.DataFrame) -> pd.Series:
    """Return a Series of chain ids, indexed to match bike_group."""
# note fillna for end station to allow for comparison to NA
breaks = bike_group[bike_group["start_station_id"] != bike_group.shift()["end_station_id"].fillna(-1)]
break_indices = breaks.index.values
chains = list()
for i, (start, end) in enumerate(zip([None, *break_indices], [*break_indices, None])):
chain = bike_group.loc[start:end]
chain_id = f"{bike_id}_{i}"
chains.append(pd.Series(chain_id, index=chain.index))
return pd.concat(chains)
```
Using the function defined above, let's get the chains for the top ten longest-lived bikes.
```python
chains = list()
for k, g in tqdm(top_ten_bike_subset.groupby("bike_id")):
    chains.append(add_chains(bike_id=k, bike_group=g))
```
```python
top_ten_bike_subset = top_ten_bike_subset.join(pd.concat(chains).rename("chain_id"))
```
# A day in the life of a bike
Now that we have "chains" associated with our long-lived bikes, we can use them to follow the bikes around over a given time period. We are interested in how they move around the city.
There is still some missingness in our dataset, in the form of station IDs that do not appear in our station location dataset. Let's write a simple function to remove journeys where the station information is missing.
```python
def remove_missing_stations(df, stations):
    def check_id(row, stations):
        # keys of the stations dict are station ids as strings
        start_id = str(int(row["start_station_id"]))
        end_id = str(int(row["end_station_id"]))
        return start_id in stations and end_id in stations
    df["check_stations_ids"] = df.apply(
        lambda row: check_id(row, stations), axis=1
    )
    df = df[df.check_stations_ids.eq(True)]
    return df
```
```python
with open("data/BorisBikes_stations_coordinates.json") as f:
stations = json.load(f)
data = remove_missing_stations(top_ten_bike_subset,stations)
data.head()
```
Now we can make this code a bit more formal by building objects that represent bikes and trips. A trip is a journey between two stations; the TfL data only provides each journey's start and end stations, but we can use the https://www.cyclestreets.net/ API to estimate the most probable route between the two points given the duration of that journey.
Our Trip object (containing the journey and route data) is defined as the following:
```python
class Trip:
def __init__(self, data, bike_id, trip_id, station_data):
df = data[data.index == trip_id]
self.init_station = {
"name": df.start_station_name.values[0],
"id": df.start_station_id.values[0],
"latitude": station_data[str(int(df.start_station_id.values[0]))][
"lat"
],
"longitude": station_data[str(int(df.start_station_id.values[0]))][
"lon"
],
}
self.end_station = {
"name": df.end_station_name.values[0],
"id": df.end_station_id.values[0],
"latitude": station_data[str(int(df.end_station_id.values[0]))][
"lat"
],
"longitude": station_data[str(int(df.end_station_id.values[0]))][
"lon"
],
}
self.bike = df.bike_id.values[0]
self.duration = df.duration.values[0]
self.date = {
"start": df.start_date.values[0],
"end": df.end_date.values[0],
}
self.circular = self.init_station == self.end_station
self.route = {}
self.bike_id = bike_id
self.trip_id = trip_id
def get_route(self, key, route_path= 'routes/'):
if not os.path.exists(route_path):
os.makedirs(route_path)
route_file_path = (
route_path
+ str(self.bike_id)
+ "_"
+ str(self.trip_id)
+ ".json"
)
if os.path.isfile(route_file_path):
with open(route_file_path, "r") as fp:
data = json.load(fp)
self.route = data
else:
if self.circular:
self.route = {}
else:
plans = ["balanced", "fastest", "quietest", "shortest"]
closest_time = False
trip_data = {}
for plan in plans:
name = (
"https://www.cyclestreets.net/api/journey.json?key="
+ key
+ "&itinerarypoints="
+ str(self.init_station["longitude"])
+ ","
+ str(self.init_station["latitude"])
+ "|"
+ str(self.end_station["longitude"])
+ ","
+ str(self.end_station["latitude"])
+ "&plan="
+ plan
)
data = requests.get(name).json()["marker"][0][
"@attributes"
]
time = int(data["time"])
if closest_time is False:
closest_time = abs(time - self.duration)
trip_data = data
elif abs(self.duration - time) < closest_time:
closest_time = abs(time - self.duration)
trip_data = data
self.route = trip_data
with open(route_file_path, "w") as fp:
json.dump(self.route, fp)
```
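As an aside, the route request inside `get_route` can be written a little more readably using the `params` argument of `requests.get`. The sketch below is just a hedged restatement of the same call (same endpoint, parameter names and response parsing as above), not part of the pipeline.
```python
import requests

def fetch_route_attributes(key, start, end, plan="fastest"):
    """Fetch one CycleStreets route; start and end are (lon, lat) tuples."""
    response = requests.get(
        "https://www.cyclestreets.net/api/journey.json",
        params={
            "key": key,
            "itinerarypoints": f"{start[0]},{start[1]}|{end[0]},{end[1]}",
            "plan": plan,  # one of: balanced, fastest, quietest, shortest
        },
    )
    response.raise_for_status()
    # The route's attributes (including its estimated time) live in the first marker
    return response.json()["marker"][0]["@attributes"]
```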
A Bike object represents a bike identified by its ID. The Bike object has methods that allow you to fetch its "story", which is the set of all trips recorded in the data along with routes obtained from https://www.cyclestreets.net/.
In order to fetch data from the Cyclestreets API you need to apply for access and obtain a key. For this story we have already downloaded the routes from the API for the bikes in this example.
```python
class Bike:
def __init__(self, id):
self.id = id
def get_chains(self, stations):
chain_ids = self.bike_rides.chain_id.to_list()
chains = {}
for chain_id in chain_ids:
chain_rides = self.bike_rides[
self.bike_rides["chain_id"] == chain_id
]
chains[chain_id] = [Trip(chain_rides, self.id, trip_id, stations) for trip_id in chain_rides.index]
#self.get_trips(chain_rides, stations)
self.chains = chains
def get_story(self, dataset, stations,key):
bike_rides = dataset[dataset["bike_id"] == self.id]
self.bike_rides = bike_rides
self.get_chains(stations)
        for chain_id, chain in self.chains.items():
            for trip in chain:
                trip.get_route(key)
```
We can visualise the journeys of a bike on a map using Folium and MovingPandas.
```python
def get_colours(steps):
colours = sns.color_palette("mako").as_hex()
rev_colours = sns.color_palette("mako").as_hex()
rev_colours.reverse()
colours = rev_colours + colours
while len(colours) < steps:
colours += colours
return colours
def get_trajectory(bike_id, route_folder = "routes/"):
chains = [
filename
for filename in sorted(os.listdir(route_folder))
if str(bike_id) + "_" in filename
]
times = []
geometry = []
colours = []
many_colurs = get_colours(len(chains))
for c in range(len(chains)):
chain = chains[c]
with open(route_folder + chain) as f:
d = json.load(f)
if len(d) > 0:
geometry += [
Point([float(y) for y in x.split(",")])
for x in d["coordinates"].split(" ")
]
if len(times) == 0:
time_now = datetime.datetime.now()
else:
time_now = times[-1]
times += [
time_now + datetime.timedelta(seconds=1 * t + 1)
for t in range(len(d["coordinates"].split(" ")))
]
colours += [
many_colurs[c] for x in range(len(d["coordinates"].split(" ")))
]
df = pd.DataFrame()
df["t"] = times
df["trajectory_id"] = [1 for x in range(len(geometry))]
df["sequence"] = [x + 1 for x in range(len(geometry))]
df["colour"] = colours
gdf = GeoDataFrame(df, crs="EPSG:4326", geometry=geometry)
gdf = gdf.set_index("t")
trajs = mpd.TrajectoryCollection(gdf, "trajectory_id")
trajs = mpd.MinTimeDeltaGeneralizer(trajs).generalize(
tolerance=datetime.timedelta(seconds=1)
)
traj = trajs.trajectories[0]
return traj
def draw_map(traj):
features = traj_to_timestamped_geojson(traj)
# Create base map
London = [51.506949, -0.122876]
map = folium.Map(location=London, zoom_start=12, tiles="cartodbpositron")
TimestampedGeoJson(
{
"type": "FeatureCollection",
"features": features,
},
period="PT1S",
add_last_point=False,
transition_time=10,
).add_to(map)
return map
def traj_to_timestamped_geojson(trajectory):
features = []
df = trajectory.df.copy()
df["previous_geometry"] = df["geometry"].shift()
df["time"] = df.index
df["previous_time"] = df["time"].shift()
for _, row in df.iloc[1:].iterrows():
coordinates = [
[
row["previous_geometry"].xy[0][0],
row["previous_geometry"].xy[1][0],
],
[row["geometry"].xy[0][0], row["geometry"].xy[1][0]],
]
times = [row["previous_time"].isoformat(), row["time"].isoformat()]
features.append(
{
"type": "Feature",
"geometry": {
"type": "LineString",
"coordinates": coordinates,
},
"properties": {
"times": times,
"style": {
"color": row["colour"],
"weight": 5,
},
},
}
)
return features
```
```python
key = open("cycle_street_key.txt", "r").read()
```
```python
bike_id = 893
selected_data = data[(data['start_date']> '2020-03-23') & (data['start_date']< '2020-05-14')]
bike = Bike(id=bike_id)
bike.get_story(selected_data, stations,key)
```
```python
traj = get_trajectory(bike_id)
map_trajectory = draw_map(traj)
map_trajectory
```
```python
bike_id = 3278
selected_data =data[(data['start_date']> '2021-03-23') & (data['start_date']< '2021-05-14')]
bike = Bike(id=bike_id)
bike.get_story(selected_data, stations,key)
traj = get_trajectory(bike_id)
map_trajectory = draw_map(traj)
map_trajectory
```
# Bike mobility patterns
The patterns of mobility observed in this dataset can be abstracted using networks built from trips between stations.
We construct a directed, weighted graph (network) where the nodes are stations and the weight of the edge from station A to station B is the number of trips taken from A to B.
As we'll see below, this reveals some high-level patterns in different regions and around different stations in London, that vary between seasons, weekdays, and times of the day.
For the network analysis, we use the `networkx` package. We start by writing a function that builds a `networkx` network from a dataframe of trip data. A parameter `trip_count_threshold` is used to exclude edges that have a very small weight, to make the analysis computationally more tractable.
```python
def create_network_from_data(df, trip_count_threshold = 1e-5):
trip_counts = (
(
df[["start_station_id", "end_station_id", "bike_id"]]
.groupby(["start_station_id", "end_station_id"])
.count()
)
.reset_index()
.rename(columns={"bike_id": "trip_count"})
)
trip_counts = trip_counts.sort_values("trip_count")
total_num_trips = trip_counts["trip_count"].sum()
trip_counts = trip_counts[
trip_counts["trip_count"] >= trip_count_threshold * total_num_trips
]
graph = nx.from_pandas_edgelist(
trip_counts,
source="start_station_id",
target="end_station_id",
edge_attr="trip_count",
create_using=nx.DiGraph,
)
return graph
```
As we will see in a moment, our network, like many naturally occurring networks, has so-called communities:
these are clusters of nodes that are strongly linked to each other, and less linked to nodes outside the cluster.
In our case this corresponds to groups of bike docking stations that have a lot of traffic amongst each other.
Network theorists have developed many different kinds of algorithms for detecting community structure in networks.
We use the Louvain algorithm, and its implementation in the `python-louvain` package.
The Louvain algorithm chooses the communities by trying to maximise a metric called modularity, which, roughly speaking, measures how much of the total edge weight is internal to communities, compared to the weight of edges that cross from one community to another.
It is a greedy and quite fast (O(N log N)) algorithm that works by hierarchically grouping densely connected nodes together and replacing them with single nodes representing small communities, then grouping those small communities into bigger ones, and so on.
You can find more details on [Wikipedia](https://en.wikipedia.org/wiki/Louvain_method) or in the documentation of the `python-louvain` [package](https://github.com/taynaud/python-louvain).
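For reference, for a weighted graph the modularity of a partition into communities can be written as
$$
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),
$$
where $A_{ij}$ is the weight of the edge between stations $i$ and $j$ (here, the trip count), $k_i = \sum_j A_{ij}$ is the weighted degree of station $i$, $m = \frac{1}{2}\sum_{i,j} A_{ij}$ is the total edge weight, $c_i$ is the community assigned to station $i$, and $\delta(c_i, c_j)$ is 1 if the two stations are in the same community and 0 otherwise.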
Below we write a function that does such community detection. Note that we need to ignore the directionality of our graph, i.e. the distinction between bikes going from station A to station B rather than from B to A. This is a limitation of the Louvain algorithm, but for community-detection purposes the directionality isn't important.
```python
def network_community_detection(graph, edge_weight):
graph_undirected = nx.Graph()
undirected_edges = set(sorted(graph.edges))
for edge in undirected_edges:
reverse_edge = (edge[1], edge[0])
trip_count = graph.edges[edge][edge_weight]
if reverse_edge in graph.edges:
trip_count += graph.edges[reverse_edge][edge_weight]
graph_undirected.add_edge(edge[0], edge[1], trip_count=trip_count)
partition = community.best_partition(graph_undirected, weight=edge_weight)
df_partition = pd.DataFrame(partition, index=[0]).T.reset_index()
df_partition.columns = ["id", "partition"]
return df_partition
```
Finally, we write a function for visualising these graphs of bike trips and their communities on a map, together with some supporting functions to aid in that job.
```python
STATION_NAMES_FILE = "data/BorisBikes_station_names.pickle"
STATION_COORDS_FILE = LOCATION_REF
LABEL_STATIONS = [
"Belgrove Street",
"Waterloo Station 3",
"Hyde Park Corner",
"Aquatic Centre",
"Bethnal Green Road",
"Natural History Museum",
"Kennington Oval",
"Mudchute DLR",
]
def get_station_name(id):
with open(STATION_NAMES_FILE, "rb") as f:
station_allnames = pickle.load(f)
name = sorted(station_allnames[id])[0]
name = re.split(";|,|:", name)[0].strip()
return name
def get_node_info(graph):
with open(STATION_COORDS_FILE, "r") as f:
station_latlon = json.load(f)
nodes = graph.nodes()
pos = [station_latlon[str(int(node))] for node in nodes]
pos = [(p["lon"], p["lat"]) for p in pos]
station_sizes = [i[1] for i in list(graph.degree(weight="trip_count"))]
labels = [get_station_name(int(node)) for node in nodes]
nodes_df = pd.DataFrame(
{"id": list(nodes), "pos": pos, "size": station_sizes, "name": labels}
)
return nodes_df
def _scale_range(values, min_scaled, max_scaled):
values = np.array(values)
if min_scaled is not None:
max_value = np.max(values)
min_value = np.min(values)
mult_coeff = (max_scaled - min_scaled) / (max_value - min_value)
add_coeff = (max_value * min_scaled - min_value * max_scaled) / (
max_value - min_value
)
scaled = mult_coeff * values + add_coeff
else:
max_value = np.max(values)
scaled = max_scaled * values / max_value
return scaled
def _drop_stations_without_location(graph):
with open(STATION_COORDS_FILE, "r") as f:
station_latlon = json.load(f)
nodes = tuple(graph.nodes)
stations_with_location = tuple(map(int, station_latlon.keys()))
for n in nodes:
if n not in stations_with_location:
print(f"Removing node {n} because of missing location data.")
graph.remove_node(n)
return None
def create_network_and_map(
df,
label_stations= LABEL_STATIONS,
allow_self_loops=False,
arrows=True,
bg_map_zoom=11,
):
community_graph = create_network_from_data(df)
_drop_stations_without_location(community_graph)
nodes_info = get_node_info(community_graph)
visualisation_graph = community_graph.copy()
if not allow_self_loops:
visualisation_graph.remove_edges_from(
nx.selfloop_edges(community_graph)
)
community_df = network_community_detection(community_graph, "trip_count")
nodes_info = nodes_info.merge(community_df, on="id")
nodes_info = nodes_info.sort_values(by="size", ascending=False)
del community_df
nodes_info['lon'] = [p[0] for p in nodes_info["pos"]]
nodes_info['lat'] = [p[1] for p in nodes_info["pos"]]
nodes_info = GeoDataFrame(nodes_info,geometry=points_from_xy(nodes_info.lon, nodes_info.lat),crs="EPSG:4326")
# Project to UK national grid
nodes_info = nodes_info.to_crs("EPSG:27700")
labels = {
id: name
for id, name in zip(nodes_info["id"], nodes_info["name"])
if name in label_stations
}
fig, ax = plt.subplots(1, 1, figsize=(20, 20))
nodes_info.plot(ax=ax, markersize=1, aspect=None)
cx.add_basemap(ax, crs=nodes_info.crs, source=cx.providers.Stamen.TonerLite, zoom=bg_map_zoom)
xynps = [np.array([p.x for p in nodes_info["geometry"]]),
np.array([p.y for p in nodes_info["geometry"]])]
pos = {
k: (xynps[0][i], xynps[1][i]) for i, k in enumerate(nodes_info["id"])
}
MAX_NODE_SIZE = 300.0
MIN_NODE_SIZE = 5.0
sizes = _scale_range(nodes_info["size"], MIN_NODE_SIZE, MAX_NODE_SIZE)
weights = np.array(
[
visualisation_graph.edges[e]["trip_count"]
for e in visualisation_graph.edges
]
)
MAX_EDGE_WIDTH = 3.0
MIN_EDGE_WIDTH = None
weights = _scale_range(weights, MIN_EDGE_WIDTH, MAX_EDGE_WIDTH)
MAX_EDGE_ALPHA = 0.9
MIN_EDGE_ALPHA = None
edge_alpha = _scale_range(weights, MIN_EDGE_ALPHA, MAX_EDGE_ALPHA)
# Plots
nx.draw_networkx_nodes(
visualisation_graph,
pos=pos,
nodelist=nodes_info["id"],
node_color=nodes_info["partition"],
alpha=1.0,
node_size=sizes,
cmap="tab10",
ax=ax,
)
nx.draw_networkx_edges(
visualisation_graph,
pos=pos,
edge_color="#222222",
width=weights,
alpha=edge_alpha,
arrows=arrows,
ax=ax,
)
nx.draw_networkx_labels(
visualisation_graph,
pos=pos,
labels=labels,
font_size=12,
ax=ax,
)
return fig, ax, nodes_info
```
To make the network analysis computationally manageable, we restrict our attention to data from 2021 only.
```python
HARD_START_DATE = datetime.datetime(year=2010, month=1, day=1)
start_date = datetime.datetime(year=2021, month=1, day=1)
end_date = datetime.datetime(year=2022, month=1, day=1)
df = df[
(df["start_date"] > HARD_START_DATE) & (df["end_date"] > HARD_START_DATE)
]
df_year = df[(df["start_date"] > start_date) & (df["start_date"] < end_date)]
```
Plot time!
Below is the network of bike trips aggregated over weekday mornings only, in 2021.
The colours of the nodes mark the different communities as discovered by the Louvain algorithm.
The lines connecting nodes have arrows indicating the direction of travel and their thickness marks the amount of traffic.
```python
print("Plotting mornings")
df_year_mornings = df_year[
    df_year["start_date"].dt.hour.isin([7, 8, 9, 10])
    & df_year["start_date"].dt.weekday.isin((0, 1, 2, 3, 4))
]
fig, ax, nodes_info = create_network_and_map(df_year_mornings)
num_communities = len(nodes_info["partition"].unique())
print(f"Number of communities: {num_communities}")
plt.title("Weekday mornings (7-10)")
plt.show()
```
Plotting mornings
Number of communities: 5

The two most prominent types of weekday morning traffic are outgoing trips from King's Cross/St. Pancras (station called Belgrove Street marked on the map) and from Waterloo. Commuters presumably come in on trains and cover the last bit to their workplace by a Boris Bike. These two types of traffic form two of the communities found by the Louvain algorithm, with the others being less concentrated communities for East, West, and South London.
Below is a similar graph but for weekday afternoons rather than mornings. The dominant patterns are the reverse of the morning traffic, with people going from their offices to the big train stations. A new feature is a community of traffic around Hyde Park, presumably pleasure rides in the afternoon by people on holiday.
There are some other shifts in the communities marked by the colours, that we find hard to interpret. There's also some instability and randomness in the Louvain algorithm, where if you run it several times sometimes it divides the smaller communities in slightly different ways. The commuter communities are robust, though.
```python
print("Plotting afternoons")
df_year_afternoons = df_year[
    df_year["start_date"].dt.hour.isin([15, 16, 17, 18, 19])
    & df_year["start_date"].dt.weekday.isin((0, 1, 2, 3, 4))
]
fig, ax, nodes_info = create_network_and_map(df_year_afternoons)
num_communities = len(nodes_info["partition"].unique())
print(f"Number of communities: {num_communities}")
plt.title("Weekday afternoons (15-19)")
plt.show()
```
Plotting afternoons
Number of communities: 12

Below is the same plot but for weekends. Here the most common trips are around Hyde Park and are marked by self-loops, where the bike is returned to the same station where it was picked up. We don't know where the bikes were ridden in between, but probably around the park for fun or a picnic. The Olympic Park has a similar loop. The communities again show different regions of London, with people travelling within neighbourhoods, although some of them are somewhat surprising and hard to interpret, with outlier stations way outside the bulk of their communities.
```python
print("Plotting weekends")
df_year_weekends = df_year[df_year["start_date"].dt.weekday.isin((5, 6))]
fig, ax, nodes_info = create_network_and_map(
df_year_weekends,
allow_self_loops=True,
)
num_communities = len(nodes_info["partition"].unique())
print(f"Number of communities: {num_communities}")
plt.title("Weekends")
plt.show()
```
Plotting weekends
Finally, let's see how patterns have changed over time.
Below we create the same network plots again, this time not for a specific time of day or week but for specific years: 2013, 2017, 2020, and 2021.
```python
for year in (2013, 2017, 2020, 2021):
print(f"Plotting {year}")
start_date = datetime.datetime(year=year, month=1, day=1)
end_date = datetime.datetime(year=year + 1, month=1, day=1)
df_year = df[
(df["start_date"] > start_date) & (df["start_date"] < end_date)
]
fig, ax, nodes_info = create_network_and_map(
df_year,
allow_self_loops=False,
arrows=False,
)
num_communities = len(nodes_info["partition"].unique())
print(f"Number of communities: {num_communities}")
plt.title(f"Year {year}")
plt.show()
```
Plotting 2013
Number of communities: 6

Plotting 2017
Number of communities: 9

Plotting 2020
Number of communities: 13

Plotting 2021
Number of communities: 11

From 2013 to 2017 the big change is the growth of the network of stations.
Especially in the west and southwest, but also around the Olympic Park for instance, the Boris Bike network grows a lot.
From 2017 to 2020 the notable change is the effect of the pandemic: most of the commuter traffic from e.g. King's Cross and Waterloo has vanished. The dominant traffic patterns that did survive the pandemic were the rides around Hyde Park and the Olympic Park. In 2021 we see some of the commuter traffic return, but it remains far from the levels of 2017.
Note that we should be careful not to read too much into the coloured communities and how they've changed from year to year. First of all, the colours bear no particular meaning, and the same community may be coloured differently in different plots. Second, the Louvain algorithm for discovering them is stochastic, and many of the smaller communities might arrange themselves differently simply by running the same cell again. The communities with the most traffic remain robust, though.
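If reproducible community assignments are wanted, recent versions of `python-louvain` accept a `random_state` argument. Below is a hedged sketch of how the call inside `network_community_detection` could be pinned, assuming python-louvain 0.13 or later.
```python
# Hedged sketch: fixing the random state makes the Louvain assignment
# deterministic across reruns (requires python-louvain >= 0.13).
partition = community.best_partition(
    graph_undirected, weight=edge_weight, random_state=0
)
```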
# Conclusion
In this Turing data story we aimed to use the TfL bike journey dataset to understand the evolution of London as a cycling city and the patterns in how people move around the capital. This story was born out of a 'Hack week' organised by the [Research Engineering Group](https://www.turing.ac.uk/research-engineering) at the Turing, an annual event in which people from the team get to work on projects/datasets they find interesting.
Our strategy was to extract some general statistics about bike usage over time, followed by trying to characterise the movements of some long-lived bikes, and finally to use network analysis techniques to abstract these movements into high-level patterns of behaviour.
[TODO add summaries of other findings]
The analysis very clearly showed how drastically the pandemic changed the way we move around the city and the strength of the commuter links. Furthermore, in the last couple of years a number of other cycle hire schemes have been introduced, and significant improvements to the cycling infrastructure might have encouraged people to acquire their own bikes; both of these may have contributed to changes in Boris Bike use. A task for a future analysis (from a Turing Data Story reader :-) ) is to include 2022 data and see whether this change in behaviour has persisted.