# Boris Bikes
```python
import datetime
import h3 # uber geo package
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import json
import pickle
import os
import folium
import requests as requests
from geopandas import GeoDataFrame, points_from_xy
import movingpandas as mpd
from shapely.geometry import Point
from folium.plugins import TimestampedGeoJson
import re
import contextily as cx
import community
import networkx as nx
from tqdm.auto import tqdm
import pyproj
```
```python
os.environ['PROJ_LIB'] = pyproj.datadir.get_data_dir()
!echo $PROJ_LIB
```
/Users/mhauru/projects/TDSCamila/stories/2022-10-06-Boris-Bikes/.venv/lib/python3.10/site-packages/pyproj/proj_dir/share/proj
```python
tqdm.pandas()
```
TODO: this setting doesn't persist in the initial notebook run and has to be run again separately, with the charts redrawn :( - fix
```python
matplotlib.rcParams['figure.figsize'] = (20, 10)
```
# Introduction
In recent years, London has become a much more bike-friendly city, thanks in large part to the Transport for London (TfL) Santander Cycles scheme. The scheme was launched in July 2010 (sponsored by Barclays Bank until 2015) and has been running ever since.
The bicycles, colloquially called "Boris Bikes" after the Mayor of London at the time, are part of a sharing system that allows users to pick them up and drop them off at designated docking stations located around London. The bikes are available 24 hours a day, seven days a week, and can be used for a variety of journeys, from commuting to leisure rides. Riders can purchase access to the scheme through a range of payment options, including yearly and monthly memberships as well as a day pass. [The scheme also offers a variety of discounts for students, National Health Service workers and users of the Cycle to Work scheme](https://tfl.gov.uk/modes/cycling/santander-cycles/register-for-your-own-key).
The scheme has been praised for its convenience and affordability and has become a popular way for Londoners to get around the city. According to some [studies](https://www.bmj.com/content/348/bmj.g425) it has had a positive effect on the health of its users by increasing physical activity within the capital. As reported by TfL, by mid-2022 more than 111.2 million journeys had been made using the cycles, with [the record for cycle hires](https://tfl.gov.uk/info-for/media/press-releases/2015/july/london-celebrates-five-successful-years-of-the-mayor-s-flagship-cycle-hire-sche) in a single day standing at 73,000. In October 2022, TfL added new e-bikes to the scheme, the [first docked e-bikes in London](https://tfl.gov.uk/info-for/media/press-releases/2022/october/docked-e-bikes-now-available-for-hire-as-part-of-london-s-record-breaking-santander-cycles-scheme).
Cycling infrastructure also played an essential role in London, as in many other cities, during the pandemic: with many people avoiding public transport and opting to cycle instead, Boris Bikes became part of many commuters' journeys.
For these reasons, in this Turing Data Story we have decided to focus on the dataset provided by TfL on bike journeys as an opportunity to analyse how the scheme and its use have changed over time and how they have affected the daily lives of Londoners.
[TODO Add outline/table of contents, with links to the various sections?]
# Data description
TfL provides an [open dataset](https://cycling.data.tfl.gov.uk/) containing all bike journeys since the launch of the scheme in July 2010. The dataset contains information about each journey, such as the start and end stations, the start and end dates and times, and the duration of the journey. For this story we are using all available data from the start of the scheme to January 2022.
An example of the format of the data provided by TfL is the following:
| Rental Id |Duration | Bike Id| End Date| EndStation Id| EndStation Name| Start Date| StartStation Id |Start Station Name|
|-------- | -------- | -------- |-------- |-------- |-------- |-------- |-------- |-------- |
| 85323786 | 180 |6711 | 26/03/2019 14:25 |400 | George Street, Marylebone |26/03/2019 14:22 |6|Broadcasting House, Marylebone|
| 85290142 | 300| 7125 |25/03/2019 13:37 |624 |Courland Grove, Wandsworth Road |25/03/2019 13:32 |772 |Binfield Road, Stockwell |
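As an illustration of how one of these files can be read, here is a minimal sketch of loading and renaming a single raw journey file with pandas. The file name is hypothetical and the column labels follow the table above; in practice both vary between files, as discussed in the next section.
```python
import pandas as pd

# A minimal sketch of loading one raw journey file. The file name below is
# hypothetical and the column labels follow the table above; in practice both
# vary between files, which is part of what the cleaning step deals with.
raw = pd.read_csv(
    "JourneyDataExtract.csv",  # hypothetical file name
    parse_dates=["Start Date", "End Date"],
    dayfirst=True,  # TfL timestamps are day-first, e.g. 26/03/2019 14:22
)
raw = raw.rename(
    columns={
        "Rental Id": "rental_id",
        "Duration": "duration",
        "Bike Id": "bike_id",
        "StartStation Id": "start_station_id",
        "Start Station Name": "start_station_name",
        "EndStation Id": "end_station_id",
        "EndStation Name": "end_station_name",
        "Start Date": "start_date",
        "End Date": "end_date",
    }
)
```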
# Data processing and cleaning
The data covers a wide time range, and the format has seen several small changes in that period. Consequently, having a single table with all the journey data requires a bit of cleaning, harmonising, and general data-massaging. Some examples of the kinds of things one needs to deal with are:
* Data files are provided at weekly level and need to be downloaded programmatically from the TfL web page,
* Most of the data comes in CSV files, but some time periods are in Excel spreadsheets,
* Station name (string) to station ID (integer) mappings are many-to-many: the same station often has slightly differently spelled names in different files, sometimes the same station has gone by several IDs, and sometimes the same ID has been used for entirely different stations at different times,
* Station IDs are integers, except for one (there's always one isn't there?) that is a string,
* Some rows have missing data,
* Columns in CSV files have changed name a few times,
* etc.
Dealing with all this is a boring affair and would distract from the purpose of this data story, so we have done the cleaning elsewhere. However, if you enjoy boredom, you can find our data-cleaning script [here](https://github.com/alan-turing-institute/pfeffel/blob/main/src/clean_data.py).
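As a small taste of what this involves, here is a hedged sketch (not the actual cleaning script) of two of the steps listed above, continuing from the `raw` frame sketched earlier: coercing station IDs to integers and crudely normalising station name spellings.
```python
# Hedged sketch of two cleanup steps; the real logic lives in the cleaning
# script linked above and handles many more corner cases.

# Station IDs should be integers, but at least one appears as a string;
# coerce what we can and leave the rest as NaN for manual inspection.
raw["start_station_id"] = pd.to_numeric(raw["start_station_id"], errors="coerce")
raw["end_station_id"] = pd.to_numeric(raw["end_station_id"], errors="coerce")

# The same station is often spelled slightly differently across files; a crude
# normalisation strips whitespace and lowercases names before building the
# name-to-ID mapping.
for col in ["start_station_name", "end_station_name"]:
    raw[col] = raw[col].str.strip().str.lower()
```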
In the end, we have around 100M bicycle journeys of cleaned data (we had to drop a negligibly small number of rows that were missing or malformed beyond repair), all in one pandas DataFrame that, at around 7 gigabytes, is thankfully just small enough to handle in memory on a modern laptop.
In addition to bike journeys, we also make use of another dataset that TfL provides, which contains the coordinates of all the docking stations.
We have uploaded all the necessary datasets for this story to [Zenodo](https://doi.org/10.5281/zenodo.7551619). Before running this story, make sure to download the data and place it in a directory named `data` at the same level as this notebook.
```
!mkdir data
!wget https://zenodo.org/record/7551619/files/BorisBikes_journeys_cleaned_data.pickle -P data/
!wget https://zenodo.org/record/7551619/files/BorisBikes_station_names.pickle -P data/
!wget https://zenodo.org/record/7551619/files/BorisBikes_stations_coordinates.json -P data/
```
Now let's load the data.
## Load processed data
```python
# TODO update
RAW = "data/BorisBikes_journeys_cleaned_data.pickle"
LOCATION_REF = "data/BorisBikes_stations_coordinates.json"
```
```python
station_locations_df = pd.read_json(LOCATION_REF).T
station_locations_df.head()
```
```python
# load the cleaned journey data
df = pd.read_pickle(RAW)
```
```python
# num rows
len(df)
```
97656969
As mentioned above, we have around 100 million rows.
Some trips have bogus dates from before the scheme started, or from the future. We drop those rows from our dataset. We do allow for journeys that have no end date, since there are quite a lot of them, and presumably they mark bikes that were lost or broke down.
```python
EARLIEST_DATE = datetime.datetime(2010, 1, 1)
```
```python
%%time
# filter out of range dates
df = df[df["start_date"] > EARLIEST_DATE]
# allow NA for end dates
df = df[(df["end_date"] > EARLIEST_DATE) | df["end_date"].isna()]
# also drop entries where the end date is not after the start date
df = df[df["start_date"] < df["end_date"]]
# recalc duration
df["duration"] = df["end_date"] - df["start_date"]
```
CPU times: user 19.3 s, sys: 17.2 s, total: 36.6 s
Wall time: 44.8 s
We still have the vast majority of our data left after this:
```python
len(df)
```
97120201
The last journey date observed in our dataset is the following:
```python
max(df['end_date'])
```
Timestamp('2022-01-04 23:59:00')
# Statistics on bike usage
After some basic cleaning in the sections above we can start looking at some statistics. Let's see how many unique bikes we have in our dataset:
```python
df["bike_id"].nunique()
```
Now, let's look at the bike with the most trips in our dataset:
```python
bike_groups = df.groupby("bike_id")
# bike with the most trips
group_counts = bike_groups.count()["filename"] # pick an arbitrary column (without nulls) to get counts
b_id = group_counts.idxmax()
n_trips = group_counts.loc[b_id]
print(f"""
bike with most trips: {b_id}
did {n_trips} trips
""")
```
bike with most trips: 8875
did 10034 trips
Trips vary in length, so let's also find the bike with the greatest total time in use, summing up the durations of all its trips:
```python
# bike with the longest total trip duration
group_sums = bike_groups["duration"].sum()
b_id = group_sums.idxmax()
d_sum = group_sums.loc[b_id]
print(f"""
bike with longest sum duration of trips: {b_id}
total duration of {d_sum}
""")
```
bike with longest sum duration of trips: 2143
total duration of 215 days 05:46:00
At the macro level we can look at the distribution of the durations of the journeys (here we have decided to exclude outliers representing very long journeys to focus on the bulk of the distribution).
```python
fig, ax = plt.subplots()
ax = df[df["duration"].dt.seconds < 10000]["duration"].dt.seconds.hist(bins=50)
plt.xlabel('Trip duration (seconds)')
plt.ylabel('Frequency')
plt.title('Trip duration')
plt.show()
```

We can see in the figure above that the bikes are used mostly for short journeys, with the core of the distribution centred around journeys roughly 15 minutes long, and most journeys lasting less than 30 minutes.
## Long-lived bikes
Next, let's calculate the average lifetime of a bike. We define a bike's lifetime as the time period during which it appears in our dataset.
To do this, for each bike we find its earliest start date and latest end date in the dataset.
```python
trips_per_bike = bike_groups["filename"].count()
# first() and last() rely on the time ordering of df via rental_id
bike_start = bike_groups["start_date"].first()
bike_end = bike_groups["end_date"].last()
bike_lifetime = bike_end - bike_start
```
The distribution of the lifetimes in days is the following
```python
fig, ax = plt.subplots()
ax = bike_lifetime.dt.days.hist(bins=50)
plt.xlabel('Days')
plt.ylabel('Bikes')
plt.title('Lifetime of a bike')
plt.show()
```
The distribution looks quite flat, with a large peak at around 3,600 days. These are very likely bikes that have been part of the scheme from the beginning.
Now, let's see what the average utilisation of a bike is. By this we mean the total ride duration divided by its total lifetime.
```python
duration_sums = bike_groups["duration"].sum()
bike_utilisation = duration_sums / bike_lifetime
print("Mean utilisation:", bike_utilisation.mean())
print("Max utilisation:", bike_utilisation.max())
```
The mean above suggests that, over its lifetime, an average bike is in use for around 45 minutes per day.
We must note the caveat that the lifetime definition does not account for the time bikes spend in maintenance or otherwise out of service. As a consequence, this utilisation metric may be an underestimate, because it counts any downtime as the bike being available but not in use.
```python
fig, ax = plt.subplots()
ax = bike_utilisation.hist(bins=500)
plt.xlim([0, 0.15])
plt.xlabel('Utilisation (fraction of lifetime in use)')
plt.ylabel('Bikes')
plt.title('Bike utilisation')
plt.show()
```

### Time series of bike usage
The number of bikes and stations has evolved since the beginning of the scheme, with more being added every year.
Furthermore, London has changed since 2010 and bike usage might have changed with it.
Hence we'll take a look at the number of bikes and stations available and at bike utilisation as a function of time.
Our utilisation measure here will be slightly different from the previous one. Previously we looked at utilisation at the bike level and averaged it. Now we look at the total usage of the entire fleet in a month and divide it by the maximum possible usage for that month (every bike ridden 24/7).
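In other words, for each complete month $m$ we compute
$$
\text{utilisation}_m = \frac{\sum_{\text{trips starting in } m} \text{duration}}{T_m \times N_m},
$$
where $T_m$ is the total length of month $m$ and $N_m$ is the number of bikes alive in that month; a value of 1 would mean every bike in the fleet being ridden around the clock.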
Let's start by creating a monthly index for the time series:
```python
# don't want to include the first and last months as they may be incomplete; used in a filter later
incomplete_months = df["start_date"].iloc[[0, -1]].dt.to_period("M")
# create a complete monthly index that covers ALL months in period
complete_monthly_index = pd.date_range(start=df["start_date"].iloc[0], end=df["end_date"].iloc[-1], freq="M").to_period("M")
# remove incomplete months
complete_monthly_index = complete_monthly_index.delete(complete_monthly_index.isin(incomplete_months))
```
Next we create a function that counts how many bikes or stations are in use in a given month.
```python
# TODO: should stations count as alive for the next month rather than the current one?
def calc_alive_per_month(starts: pd.Series, ends: pd.Series, incomplete_months: pd.Series, complete_monthly_index: pd.PeriodIndex):
starts_per_month = starts.dt.to_period("M").value_counts()
ends_per_month = ends.dt.to_period("M").value_counts()
counts_df = complete_monthly_index.to_frame(name="foo").join(starts_per_month).join(ends_per_month).sort_index().fillna(0)
# ending items should only be counted at the start of next month, so shift
counts_df["end_date"] = counts_df["end_date"].shift(fill_value=0)
alive_per_month = counts_df["start_date"].cumsum() - counts_df["end_date"].cumsum()
return alive_per_month[~alive_per_month.index.isin(incomplete_months)]
```
Using the function defined above we create a dataframe of bikes available at the monthly level.
```python
alive_bikes_per_month = calc_alive_per_month(starts=bike_start, ends=bike_end, incomplete_months=incomplete_months, complete_monthly_index=complete_monthly_index)
```
We are also interested in the total usage of the bikes per month, in terms of the summed journey durations and the resulting utilisation.
```python
duration_sums_per_month = df[["duration"]].groupby(df["start_date"].dt.to_period("M"))["duration"].sum()
duration_sums_per_month = duration_sums_per_month.to_frame()
duration_sums_per_month["max_possible_duration"] = duration_sums_per_month.index.map(lambda x: x.end_time - x.start_time)
utilisation_per_month = duration_sums_per_month["duration"] / duration_sums_per_month["max_possible_duration"] / alive_bikes_per_month
# remove incomplete months
utilisation_per_month = utilisation_per_month[~utilisation_per_month.index.isin(incomplete_months)]
```
Let's also look at how many stations are available in each month.
```python
station_groups = df.groupby("start_station_id")
# relies on time ordering of df via rental_id
station_start = station_groups["start_date"].first()
station_end = station_groups["end_date"].last()
```
```python
alive_stations_per_month = calc_alive_per_month(starts=station_start, ends=station_end,
incomplete_months=incomplete_months, complete_monthly_index=complete_monthly_index)
```
Let's merge our bike usage dataframe with our new station usage data. This results in a monthly indexed dataframe that we can use for visualisation later on:
```python
# forward fill gaps
stats_df = complete_monthly_index.to_frame(name="date")\
.join(alive_bikes_per_month.rename("alive_bikes"))\
.join(alive_stations_per_month.rename("alive_stations"))\
.join(utilisation_per_month.rename("utilisation"))\
.fillna(method="ffill")
```
```python
stats_df.head()
```
Let's look at our time series from March 2012.
```python
stats_df[1:].plot.area(subplots=True)
plt.xlabel('Years')
```
Here we can clearly see the expansion of the scheme in 2012, when new stations were added to [include East London](https://www.theguardian.com/uk/2011/feb/01/bike-hire-scheme-east-london-spring-2012). The effect of the pandemic in 2020 is also evident as a spike in utilisation, which otherwise seems to have slowly decreased after the excitement of the first couple of years of the scheme.
# Looking at the behaviour of individual bikes
In the next section of the story we are interested in the movement patterns of individual bikes. To follow the different journeys a bike takes during a time period, we define a new concept which we call a chain: a sequence of consecutive trips by a given bike, where each trip starts at the station where the previous trip ended.
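As a toy illustration of the idea (with made-up station IDs), consider a single bike with four consecutive trips; a new chain starts whenever a trip does not begin at the station where the previous trip ended.
```python
import pandas as pd

# Made-up trips for one bike: trips 1 and 2 form a chain (3 -> 7, 7 -> 12);
# trip 3 starts at station 5 rather than 12, so a second chain begins there.
toy = pd.DataFrame(
    {
        "start_station_id": [3, 7, 5, 12],
        "end_station_id": [7, 12, 12, 3],
    }
)
new_chain = toy["start_station_id"] != toy["end_station_id"].shift()
print(new_chain.cumsum())  # chain number per trip: 1, 1, 2, 2
```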
Creating chains for all available bikes in the dataset would be time-consuming, and we are only interested in a few examples, so we will compute chains for a subset only: the bikes with the longest lifetimes.
Let's get the top ten longest-lived bikes in our dataset.
```python
top_ten_lived_bike_ids = bike_lifetime.sort_values()[-10:].index.values
```
```python
top_ten_bike_subset = df[df["bike_id"].isin(top_ten_lived_bike_ids)].copy()
```
Let's create a function that, for a given bike, assigns a chain ID to each sequence of journeys. The chain ID changes whenever the starting station of a journey is not the same as the end station of the previous one.
```python
def add_chains(bike_id: int, bike_group: pd.DataFrame) -> pd.Series:
    """Return a Series of chain ids, indexed to match bike_group."""
# note fillna for end station to allow for comparison to NA
breaks = bike_group[bike_group["start_station_id"] != bike_group.shift()["end_station_id"].fillna(-1)]
break_indices = breaks.index.values
chains = list()
for i, (start, end) in enumerate(zip([None, *break_indices], [*break_indices, None])):
chain = bike_group.loc[start:end]
chain_id = f"{bike_id}_{i}"
chains.append(pd.Series(chain_id, index=chain.index))
return pd.concat(chains)
```
Using the function defined above, let's get the chains for the top ten longest-lived bikes.
```python
chains = list()
for k, g in tqdm(top_ten_bike_subset.groupby("bike_id")):
    chains.append(add_chains(bike_id=k, bike_group=g))
```
```python
top_ten_bike_subset = top_ten_bike_subset.join(pd.concat(chains).rename("chain_id"))
```
# A day in the life of a bike
Now that we have "chains" associated with our long-lived bikes, we can use them to follow the bikes around over a given time period. We are interested in how they move around the city.
There is still some missingness in our dataset, in the form of station IDs that do not appear in our station location dataset. Let's write a simple function to remove journeys where the station information is missing.
```python
def remove_missing_stations(df, stations):
    def check_id(row, stations):
        # keys of the stations dict are station ids as strings
        start_id = str(int(row["start_station_id"]))
        end_id = str(int(row["end_station_id"]))
        return start_id in stations and end_id in stations
    df["check_stations_ids"] = df.apply(
        lambda row: check_id(row, stations), axis=1
    )
    df = df[df.check_stations_ids.eq(True)]
    return df
```
```python
with open("data/BorisBikes_stations_coordinates.json") as f:
stations = json.load(f)
data = remove_missing_stations(top_ten_bike_subset,stations)
data.head()
```
Now we can make this code a bit more formal by building objects that represent bikes and trips. A trip is a journey between two stations; the TfL data only provides each journey's start and end stations, but we can use the https://www.cyclestreets.net/ API to estimate the most probable route between the two points given the duration of that journey.
Our Trip object (containing the journey and route data) is defined as the following:
```python
class Trip:
def __init__(self, data, bike_id, trip_id, station_data):
df = data[data.index == trip_id]
self.init_station = {
"name": df.start_station_name.values[0],
"id": df.start_station_id.values[0],
"latitude": station_data[str(int(df.start_station_id.values[0]))][
"lat"
],
"longitude": station_data[str(int(df.start_station_id.values[0]))][
"lon"
],
}
self.end_station = {
"name": df.end_station_name.values[0],
"id": df.end_station_id.values[0],
"latitude": station_data[str(int(df.end_station_id.values[0]))][
"lat"
],
"longitude": station_data[str(int(df.end_station_id.values[0]))][
"lon"
],
}
self.bike = df.bike_id.values[0]
self.duration = df.duration.values[0]
self.date = {
"start": df.start_date.values[0],
"end": df.end_date.values[0],
}
self.circular = self.init_station == self.end_station
self.route = {}
self.bike_id = bike_id
self.trip_id = trip_id
def get_route(self, key, route_path= 'routes/'):
if not os.path.exists(route_path):
os.makedirs(route_path)
route_file_path = (
route_path
+ str(self.bike_id)
+ "_"
+ str(self.trip_id)
+ ".json"
)
if os.path.isfile(route_file_path):
with open(route_file_path, "r") as fp:
data = json.load(fp)
self.route = data
else:
if self.circular:
self.route = {}
else:
plans = ["balanced", "fastest", "quietest", "shortest"]
closest_time = False
trip_data = {}
for plan in plans:
name = (
"https://www.cyclestreets.net/api/journey.json?key="
+ key
+ "&itinerarypoints="
+ str(self.init_station["longitude"])
+ ","
+ str(self.init_station["latitude"])
+ "|"
+ str(self.end_station["longitude"])
+ ","
+ str(self.end_station["latitude"])
+ "&plan="
+ plan
)
data = requests.get(name).json()["marker"][0][
"@attributes"
]
time = int(data["time"])
if closest_time is False:
closest_time = abs(time - self.duration)
trip_data = data
elif abs(self.duration - time) < closest_time:
closest_time = abs(time - self.duration)
trip_data = data
self.route = trip_data
with open(route_file_path, "w") as fp:
json.dump(self.route, fp)
```
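As an aside, the route request inside `get_route` can be written a little more readably using the `params` argument of `requests.get`. The sketch below is just a hedged restatement of the same call (same endpoint, parameter names and response parsing as above), not part of the pipeline.
```python
import requests

def fetch_route_attributes(key, start, end, plan="fastest"):
    """Fetch one CycleStreets route; start and end are (lon, lat) tuples."""
    response = requests.get(
        "https://www.cyclestreets.net/api/journey.json",
        params={
            "key": key,
            "itinerarypoints": f"{start[0]},{start[1]}|{end[0]},{end[1]}",
            "plan": plan,  # one of: balanced, fastest, quietest, shortest
        },
    )
    response.raise_for_status()
    # The route's attributes (including its estimated time) live in the first marker
    return response.json()["marker"][0]["@attributes"]
```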
A Bike object represents a bike identified by its ID. The Bike object has methods that allow you to fetch its "story", which is the set of all trips recorded in the data along with routes obtained from https://www.cyclestreets.net/.
In order to fetch data from the Cyclestreets API you need to apply for access and obtain a key. For this story we have already downloaded the routes from the API for the bikes in this example.
```python
class Bike:
def __init__(self, id):
self.id = id
def get_chains(self, stations):
chain_ids = self.bike_rides.chain_id.to_list()
chains = {}
for chain_id in chain_ids:
chain_rides = self.bike_rides[
self.bike_rides["chain_id"] == chain_id
]
chains[chain_id] = [Trip(chain_rides, self.id, trip_id, stations) for trip_id in chain_rides.index]
#self.get_trips(chain_rides, stations)
self.chains = chains
def get_story(self, dataset, stations,key):
bike_rides = dataset[dataset["bike_id"] == self.id]
self.bike_rides = bike_rides
self.get_chains(stations)
        for chain_id, chain in self.chains.items():
            for trip in chain:
                trip.get_route(key)
```
We can visualise the journeys of a bike on a map using Folium and MovingPandas.
```python
def get_colours(steps):
colours = sns.color_palette("mako").as_hex()
rev_colours = sns.color_palette("mako").as_hex()
rev_colours.reverse()
colours = rev_colours + colours
while len(colours) < steps:
colours += colours
return colours
def get_trajectory(bike_id, route_folder = "routes/"):
chains = [
filename
for filename in sorted(os.listdir(route_folder))
if str(bike_id) + "_" in filename
]
times = []
geometry = []
colours = []
many_colurs = get_colours(len(chains))
for c in range(len(chains)):
chain = chains[c]
with open(route_folder + chain) as f:
d = json.load(f)
if len(d) > 0:
geometry += [
Point([float(y) for y in x.split(",")])
for x in d["coordinates"].split(" ")
]
if len(times) == 0:
time_now = datetime.datetime.now()
else:
time_now = times[-1]
times += [
time_now + datetime.timedelta(seconds=1 * t + 1)
for t in range(len(d["coordinates"].split(" ")))
]
colours += [
many_colurs[c] for x in range(len(d["coordinates"].split(" ")))
]
df = pd.DataFrame()
df["t"] = times
df["trajectory_id"] = [1 for x in range(len(geometry))]
df["sequence"] = [x + 1 for x in range(len(geometry))]
df["colour"] = colours
gdf = GeoDataFrame(df, crs="EPSG:4326", geometry=geometry)
gdf = gdf.set_index("t")
trajs = mpd.TrajectoryCollection(gdf, "trajectory_id")
trajs = mpd.MinTimeDeltaGeneralizer(trajs).generalize(
tolerance=datetime.timedelta(seconds=1)
)
traj = trajs.trajectories[0]
return traj
def draw_map(traj):
features = traj_to_timestamped_geojson(traj)
# Create base map
London = [51.506949, -0.122876]
map = folium.Map(location=London, zoom_start=12, tiles="cartodbpositron")
TimestampedGeoJson(
{
"type": "FeatureCollection",
"features": features,
},
period="PT1S",
add_last_point=False,
transition_time=10,
).add_to(map)
return map
def traj_to_timestamped_geojson(trajectory):
features = []
df = trajectory.df.copy()
df["previous_geometry"] = df["geometry"].shift()
df["time"] = df.index
df["previous_time"] = df["time"].shift()
for _, row in df.iloc[1:].iterrows():
coordinates = [
[
row["previous_geometry"].xy[0][0],
row["previous_geometry"].xy[1][0],
],
[row["geometry"].xy[0][0], row["geometry"].xy[1][0]],
]
times = [row["previous_time"].isoformat(), row["time"].isoformat()]
features.append(
{
"type": "Feature",
"geometry": {
"type": "LineString",
"coordinates": coordinates,
},
"properties": {
"times": times,
"style": {
"color": row["colour"],
"weight": 5,
},
},
}
)
return features
```
```python
key = open("cycle_street_key.txt", "r").read()
```
```python
bike_id = 893
selected_data = data[(data['start_date']> '2020-03-23') & (data['start_date']< '2020-05-14')]
bike = Bike(id=bike_id)
bike.get_story(selected_data, stations,key)
```
```python
traj = get_trajectory(bike_id)
map_trajectory = draw_map(traj)
map_trajectory
```
```python
bike_id = 3278
selected_data =data[(data['start_date']> '2021-03-23') & (data['start_date']< '2021-05-14')]
bike = Bike(id=bike_id)
bike.get_story(selected_data, stations,key)
traj = get_trajectory(bike_id)
map_trajectory = draw_map(traj)
map_trajectory
```
# Bike mobility patterns
The patterns of mobility observed in this dataset can be abstracted using networks built from trips between stations.
We construct a directed, weighted graph (network) where the nodes are stations and the weight of the edge from station A to station B is the number of trips taken from A to B.
As we'll see below, this reveals some high-level patterns in different regions and around different stations in London, that vary between seasons, weekdays, and times of the day.
For the network analysis, we use the `networkx` package. We start by writing a function that builds a `networkx` network from a dataframe of trip data. A parameter `trip_count_threshold` is used to exclude edges that have a very small weight, to make the analysis computationally more tractable.
```python
def create_network_from_data(df, trip_count_threshold = 1e-5):
trip_counts = (
(
df[["start_station_id", "end_station_id", "bike_id"]]
.groupby(["start_station_id", "end_station_id"])
.count()
)
.reset_index()
.rename(columns={"bike_id": "trip_count"})
)
trip_counts = trip_counts.sort_values("trip_count")
total_num_trips = trip_counts["trip_count"].sum()
trip_counts = trip_counts[
trip_counts["trip_count"] >= trip_count_threshold * total_num_trips
]
graph = nx.from_pandas_edgelist(
trip_counts,
source="start_station_id",
target="end_station_id",
edge_attr="trip_count",
create_using=nx.DiGraph,
)
return graph
```
As we will see in a moment, our network, like many naturally occurring networks, has so-called communities:
these are clusters of nodes that are strongly linked to each other, and less linked to nodes outside the cluster.
In our case this corresponds to groups of bike docking stations that have a lot of traffic amongst each other.
Network theorists have developed many different kinds of algorithms for detecting community structure in networks.
We use the Louvain algorithm, and its implementation in the `python-louvain` package.
The Louvain algorithm chooses the communities by trying to maximise a metric called modularity, which, roughly speaking, measures how much of the total edge weight is internal to communities, compared to the weight of edges that cross from one community to another.
It is a greedy and quite fast (O(N log N)) algorithm that works by hierarchically grouping densely connected nodes together and replacing them with single nodes representing small communities, then grouping those small communities into bigger ones, and so on.
You can find more details on [Wikipedia](https://en.wikipedia.org/wiki/Louvain_method) or in the documentation of the `python-louvain` [package](https://github.com/taynaud/python-louvain).
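For reference, for a weighted graph the modularity of a partition into communities can be written as
$$
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),
$$
where $A_{ij}$ is the weight of the edge between stations $i$ and $j$ (here, the trip count), $k_i = \sum_j A_{ij}$ is the weighted degree of station $i$, $m = \frac{1}{2}\sum_{i,j} A_{ij}$ is the total edge weight, $c_i$ is the community assigned to station $i$, and $\delta(c_i, c_j)$ is 1 if the two stations are in the same community and 0 otherwise.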
Below we write a function that does such community detection. Note that we need to ignore the directionality of our graph, i.e. the distinction between bikes going from station A to station B rather than from B to A. This is a limitation of the Louvain algorithm, but for community-detection purposes the directionality isn't important.
```python
def network_community_detection(graph, edge_weight):
graph_undirected = nx.Graph()
undirected_edges = set(sorted(graph.edges))
for edge in undirected_edges:
reverse_edge = (edge[1], edge[0])
trip_count = graph.edges[edge][edge_weight]
if reverse_edge in graph.edges:
trip_count += graph.edges[reverse_edge][edge_weight]
graph_undirected.add_edge(edge[0], edge[1], trip_count=trip_count)
partition = community.best_partition(graph_undirected, weight=edge_weight)
df_partition = pd.DataFrame(partition, index=[0]).T.reset_index()
df_partition.columns = ["id", "partition"]
return df_partition
```
Finally, we write a function for visualising these graphs of bike trips and their communities on a map, together with some supporting functions to aid in that job.
```python
STATION_NAMES_FILE = "data/BorisBikes_station_names.pickle"
STATION_COORDS_FILE = LOCATION_REF
LABEL_STATIONS = [
"Belgrove Street",
"Waterloo Station 3",
"Hyde Park Corner",
"Aquatic Centre",
"Bethnal Green Road",
"Natural History Museum",
"Kennington Oval",
"Mudchute DLR",
]
def get_station_name(id):
with open(STATION_NAMES_FILE, "rb") as f:
station_allnames = pickle.load(f)
name = sorted(station_allnames[id])[0]
name = re.split(";|,|:", name)[0].strip()
return name
def get_node_info(graph):
with open(STATION_COORDS_FILE, "r") as f:
station_latlon = json.load(f)
nodes = graph.nodes()
pos = [station_latlon[str(int(node))] for node in nodes]
pos = [(p["lon"], p["lat"]) for p in pos]
station_sizes = [i[1] for i in list(graph.degree(weight="trip_count"))]
labels = [get_station_name(int(node)) for node in nodes]
nodes_df = pd.DataFrame(
{"id": list(nodes), "pos": pos, "size": station_sizes, "name": labels}
)
return nodes_df
def _scale_range(values, min_scaled, max_scaled):
values = np.array(values)
if min_scaled is not None:
max_value = np.max(values)
min_value = np.min(values)
mult_coeff = (max_scaled - min_scaled) / (max_value - min_value)
add_coeff = (max_value * min_scaled - min_value * max_scaled) / (
max_value - min_value
)
scaled = mult_coeff * values + add_coeff
else:
max_value = np.max(values)
scaled = max_scaled * values / max_value
return scaled
def _drop_stations_without_location(graph):
with open(STATION_COORDS_FILE, "r") as f:
station_latlon = json.load(f)
nodes = tuple(graph.nodes)
stations_with_location = tuple(map(int, station_latlon.keys()))
for n in nodes:
if n not in stations_with_location:
print(f"Removing node {n} because of missing location data.")
graph.remove_node(n)
return None
def create_network_and_map(
df,
label_stations= LABEL_STATIONS,
allow_self_loops=False,
arrows=True,
bg_map_zoom=11,
):
community_graph = create_network_from_data(df)
_drop_stations_without_location(community_graph)
nodes_info = get_node_info(community_graph)
visualisation_graph = community_graph.copy()
if not allow_self_loops:
visualisation_graph.remove_edges_from(
nx.selfloop_edges(community_graph)
)
community_df = network_community_detection(community_graph, "trip_count")
nodes_info = nodes_info.merge(community_df, on="id")
nodes_info = nodes_info.sort_values(by="size", ascending=False)
del community_df
nodes_info['lon'] = [p[0] for p in nodes_info["pos"]]
nodes_info['lat'] = [p[1] for p in nodes_info["pos"]]
nodes_info = GeoDataFrame(nodes_info,geometry=points_from_xy(nodes_info.lon, nodes_info.lat),crs="EPSG:4326")
# Project to UK national grid
nodes_info = nodes_info.to_crs("EPSG:27700")
labels = {
id: name
for id, name in zip(nodes_info["id"], nodes_info["name"])
if name in label_stations
}
fig, ax = plt.subplots(1, 1, figsize=(20, 20))
nodes_info.plot(ax=ax, markersize=1, aspect=None)
cx.add_basemap(ax, crs=nodes_info.crs, source=cx.providers.Stamen.TonerLite, zoom=bg_map_zoom)
xynps = [np.array([p.x for p in nodes_info["geometry"]]),
np.array([p.y for p in nodes_info["geometry"]])]
pos = {
k: (xynps[0][i], xynps[1][i]) for i, k in enumerate(nodes_info["id"])
}
MAX_NODE_SIZE = 300.0
MIN_NODE_SIZE = 5.0
sizes = _scale_range(nodes_info["size"], MIN_NODE_SIZE, MAX_NODE_SIZE)
weights = np.array(
[
visualisation_graph.edges[e]["trip_count"]
for e in visualisation_graph.edges
]
)
MAX_EDGE_WIDTH = 3.0
MIN_EDGE_WIDTH = None
weights = _scale_range(weights, MIN_EDGE_WIDTH, MAX_EDGE_WIDTH)
MAX_EDGE_ALPHA = 0.9
MIN_EDGE_ALPHA = None
edge_alpha = _scale_range(weights, MIN_EDGE_ALPHA, MAX_EDGE_ALPHA)
# Plots
nx.draw_networkx_nodes(
visualisation_graph,
pos=pos,
nodelist=nodes_info["id"],
node_color=nodes_info["partition"],
alpha=1.0,
node_size=sizes,
cmap="tab10",
ax=ax,
)
nx.draw_networkx_edges(
visualisation_graph,
pos=pos,
edge_color="#222222",
width=weights,
alpha=edge_alpha,
arrows=arrows,
ax=ax,
)
nx.draw_networkx_labels(
visualisation_graph,
pos=pos,
labels=labels,
font_size=12,
ax=ax,
)
return fig, ax, nodes_info
```
To make the network analysis computationally manageable, we restrict our attention to data from 2021 only.
```python
HARD_START_DATE = datetime.datetime(year=2010, month=1, day=1)
start_date = datetime.datetime(year=2021, month=1, day=1)
end_date = datetime.datetime(year=2022, month=1, day=1)
df = df[
(df["start_date"] > HARD_START_DATE) & (df["end_date"] > HARD_START_DATE)
]
df_year = df[(df["start_date"] > start_date) & (df["start_date"] < end_date)]
```
Plot time!
Below is the network of bike trips aggregated over weekday mornings only, in 2021.
The colours of the nodes mark the different communities as discovered by the Louvain algorithm.
The lines connecting nodes have arrows indicating the direction of travel and their thickness marks the amount of traffic.
```python
print("Plotting mornings")
df_year_mornings = df_year[
    df_year["start_date"].dt.hour.isin([7, 8, 9, 10])
    & df_year["start_date"].dt.weekday.isin((0, 1, 2, 3, 4))
]
fig, ax, nodes_info = create_network_and_map(df_year_mornings)
num_communities = len(nodes_info["partition"].unique())
print(f"Number of communities: {num_communities}")
plt.title("Weekday mornings (7-10)")
plt.show()
```
Plotting mornings
Number of communities: 5

The two most prominent types of weekday morning traffic are outgoing trips from King's Cross/St. Pancras (station called Belgrove Street marked on the map) and from Waterloo. Commuters presumably come in on trains and cover the last bit to their workplace by a Boris Bike. These two types of traffic form two of the communities found by the Louvain algorithm, with the others being less concentrated communities for East, West, and South London.
Below is a similar graph but for weekday afternoons rather than mornings. The dominant patterns are the reverse of the morning traffic, with people going from their offices to the big train stations. A new feature is a community of traffic around Hyde Park, presumably pleasure rides in the afternoon by people on holiday.
There are some other shifts in the communities marked by the colours, that we find hard to interpret. There's also some instability and randomness in the Louvain algorithm, where if you run it several times sometimes it divides the smaller communities in slightly different ways. The commuter communities are robust, though.
```python
print("Plotting afternoons")
df_year_afternoons = df_year[
    df_year["start_date"].dt.hour.isin([15, 16, 17, 18, 19])
    & df_year["start_date"].dt.weekday.isin((0, 1, 2, 3, 4))
]
fig, ax, nodes_info = create_network_and_map(df_year_afternoons)
num_communities = len(nodes_info["partition"].unique())
print(f"Number of communities: {num_communities}")
plt.title("Weekday afternoons (15-19)")
plt.show()
```
Plotting afternoons
Number of communities: 12

Below is the same plot but for weekends. Here the most common trips are around Hyde Park and are marked by self-loops, where the bike is returned to the same station where it was picked up. We don't know where the bikes were ridden in between, but probably around the park for fun or a picnic. The Olympic Park has a similar loop. The communities again show different regions of London, with people travelling within neighbourhoods, although some of them are somewhat surprising and hard to interpret, with outlier stations way outside the bulk of their communities.
```python
print("Plotting weekends")
df_year_weekends = df_year[df_year["start_date"].dt.weekday.isin((5, 6))]
fig, ax, nodes_info = create_network_and_map(
df_year_weekends,
allow_self_loops=True,
)
num_communities = len(nodes_info["partition"].unique())
print(f"Number of communities: {num_communities}")
plt.title("Weekends")
plt.show()
```
Plotting weekends
Finally, let's see how patterns have changed over time.
Below we create the same network plots again, this time not for a specific time of day or week but for specific years: 2013, 2017, 2020, and 2021.
```python
for year in (2013, 2017, 2020, 2021):
print(f"Plotting {year}")
start_date = datetime.datetime(year=year, month=1, day=1)
end_date = datetime.datetime(year=year + 1, month=1, day=1)
df_year = df[
(df["start_date"] > start_date) & (df["start_date"] < end_date)
]
fig, ax, nodes_info = create_network_and_map(
df_year,
allow_self_loops=False,
arrows=False,
)
num_communities = len(nodes_info["partition"].unique())
print(f"Number of communities: {num_communities}")
plt.title(f"Year {year}")
plt.show()
```
Plotting 2013
Number of communities: 6

Plotting 2017
Number of communities: 9

Plotting 2020
Number of communities: 13

Plotting 2021
Number of communities: 11

From 2013 to 2017 the big change is the growth of the network of stations.
Especially in the west and southwest, but also around the Olympic Park for instance, the Boris Bike network grows a lot.
From 2017 to 2020 the notable change is the effect of the pandemic: most of the commuter traffic from e.g. King's Cross and Waterloo has vanished. The dominant traffic patterns that did survive the pandemic were the rides around Hyde Park and the Olympic Park. In 2021 we see some of the commuter traffic return, but it remains far from the levels of 2017.
Note that we should be careful not to read too much into the coloured communities and how they've changed from year to year. First of all, the colours bear no particular meaning, and the same community may be coloured differently in different plots. Second, the Louvain algorithm for discovering them is stochastic, and many of the smaller communities might arrange themselves differently simply by running the same cell again. The communities with the most traffic remain robust, though.
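If reproducible community assignments are wanted, recent versions of `python-louvain` accept a `random_state` argument. Below is a hedged sketch of how the call inside `network_community_detection` could be pinned, assuming python-louvain 0.13 or later.
```python
# Hedged sketch: fixing the random state makes the Louvain assignment
# deterministic across reruns (requires python-louvain >= 0.13).
partition = community.best_partition(
    graph_undirected, weight=edge_weight, random_state=0
)
```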
# Conclusion
In this Turing data story we aimed to use the TfL bike journey dataset to understand the evolution of London as a cycling city and the patterns in how people move around the capital. This story was born out of a 'Hack week' organised by the [Research Engineering Group](https://www.turing.ac.uk/research-engineering) at the Turing, an annual event in which people from the team get to work on projects/datasets they find interesting.
Our strategy was to extract some general statistics about bike usage over time, followed by trying to characterise the movements of some long-lived bikes, and finally to use network analysis techniques to abstract these movements into high-level patterns of behaviour.
[TODO add summaries of other findings]
The analysis very clearly showed how drastically the pandemic changed the way we move around the city and the strength of the commuter links. Furthermore, in the last couple of years a number of other cycle hire schemes have been introduced, and significant improvements to the cycling infrastructure might have encouraged people to acquire their own bikes; both of these may have contributed to changes in Boris Bike use. A task for a future analysis (from a Turing Data Story reader :-) ) is to include 2022 data and see whether this change in behaviour has persisted.