# Turing Data Stories: London Air Pollution - Notes
## Notes from hack (4/3/22)
Links for data (shared in Zoom chat):
* Deprivation: https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019
* London schools atlas: https://data.gov.uk/dataset/6b776872-c786-4960-af1d-dab521aa4ab0/london-schools-atlas
* LSOA: https://data.london.gov.uk/dataset/super-output-area-population-lsoa-msoa-london
* School travel: https://www.gov.uk/government/statistical-data-sets/nts06-age-gender-and-modal-breakdown#school-travel (this only gives us % mode for all of London)
## Notes from Breathe London meeting
Greg: senior data analyst
Dan (Austin, Texas):
- Calibration and maintenance of the network
- Reliability of low-cost data
- PM and NO2 sensors were adversely affected by meteorological conditions
- Precise to each other
- Precise to the reference grid
- During high humidity/low visibility, data was not reliable
- RMSE ~40-50% of mean concentrations
#### Temperature as a metric of confidence
- at high air temperatures (> 15/20 degrees), readings were biased high and the low-grade sensors read much higher than the reference-grade sensors
- good at representing relative changes, but the difference between sensors is the issue (due to calibration)
- comparing peak hours vs weekends to assess the effect of traffic on exposure at schools
#### Be aware of the robustness of the data and make sure your analysis is resilient
#### Effect of school streets
- GLA commissioned this research: http://schoolstreets.org.uk/
- Average deprivation of LSOAs feeding pupils into schools,
- school capacity and catchment to create weights https://www.get-information-schools.service.gov.uk/
#### Project
- Initiated by the Mayor of London and the GLA (EDF were the coordinators)
- 2 running projects
1. Mobile monitoring with reference-grade sensors mounted on Google Street View cars
2. Low-cost (£100s) sensors with data collected from them, to run a PoC on the reliability of low-cost sensor networks
- Using data to advocate for policies to improve air quality in London, and to help other NGOs get value for money from sensor networks
#### Current network
- Fill gaps in the reference network
- Permissions and access to sites made this challenging
- Labels are informative but imperfect; judgement was involved in determining the site type. Not all of them have 270 degrees of free airflow
- Standard definition of sensor location types (LLAQM)
- Section 2.2 for site type https://amt.copernicus.org/articles/15/321/2022/
#### Method they used
- grouped `curbside` and `roadside`
- buffer calculations (50m)
- tool (pollution distance from road calculator) published by DEFRA alongside technical guidance for boroughs
- councils use this formula to estimate the gradient from a road
- https://laqm.defra.gov.uk/air-quality/air-quality-assessment/no2-falloff/
- Dispersion model from Cambridge to model concentrations across space
- https://www.cerc.co.uk/environmental-software/ADMS-model.html
#### Tasks
- [ ] Dan to send over section of the report that talks about the uncertainty
## Draft narrative
### Title and teaser
**How bad is the air quality near London's schools?**
Exploring environmental and social deprivation near schools in London.
### ✍️ Authors and reviewers
* Christina Last
* Bill Finnegan
* Gantugs?
Reviewers TBD
### 🚨 Background
An estimated nine thousand Londoners die early every year due to air pollution-related causes (see [report](https://www.scribd.com/document/271641490/King-s-College-London-report-on-mortality-burden-of-NO2-and-PM2-5-in-London)), and in 2020, a [legal ruling](https://www.bbc.co.uk/news/uk-england-london-55330945) determined that air pollution was a cause of death for a 9-year-old girl living in London. Moreover, recent [research](https://www.london.gov.uk/press-releases/mayoral/toxic-air-linked-to-severity-of-covid-19) from Imperial College London suggests that exposure to high air pollution concentrations increases the risk of COVID-19 hospitalisation. We know that the impact of air pollution is felt more acutely by the young, with 1 in every 4 deaths of children under 5 related to environmental risks. (*Christina: do you have a source for this?*)
Air quality in London was a major issue in the [2016 mayoral race](https://time.com/4316873/london-mayor-election-air-pollution/), and Sadiq Khan has introduced a number of air quality improvement [measures](https://www.london.gov.uk/what-we-do/environment/pollution-and-air-quality), including the expansion of the Ultra Low Emission Zone, investing in cleaner buses, and auditing the air quality at London's schools. However, 25% of schools in London are located in areas with air pollution above limits defined by the World Health Organisation (WHO) according to [research](https://www.cleanairday.org.uk/news-stories/clean-air-day-2021over-a-quarter-of-uk-schools-are-above-who-air-pollution-limits-0) by the charity Global Action Plan announced on Clean Air Day in 2021.
This data story investigates air quality near schools in London and compares it to measures of social deprivation to explore the environmental justice implications of air pollution. We start by exploring an overview of daily and annual trends for air quality across the city. There are two different types of air pollution that we will be looking at in this data story:
* [NO2](https://www.londonair.org.uk/londonair/guide/WhatIsNO2.aspx): Nitrogen Dioxide is generated by road transportation and gas boilers and contributes to respiratory infections. This pollutant was mentioned by the coroner in the inquest into Ella Adoo-Kissi-Debrah's death.
* [PM2.5](https://www.londonair.org.uk/londonair/guide/WhatIsPM.aspx): Particulate matter refers to solid particles suspended in the air, for example dust or smoke generated by construction, industry, and road traffic. Particulates are measured based on size, with PM2.5 being 2.5 micrometres or smaller. These small particles contribute to heart and lung disease and premature death.
### 📊 Data
We are using data from [Global Clean Air](https://www.globalcleanair.org/innovative-air-quality-monitoring/london-uk/breathe-london-data/), which contains information about pollution's health effects, as well as readily available and understandable air pollution data and analysis. The Breathe London pilot project mapped and measured pollution across the capital, led for two years by Environmental Defense Fund Europe and launched in partnership with the Mayor of London and leading science and technology experts.
Using more than 100 lower-cost sensor pods and specially equipped Google Street View cars, Breathe London complemented and expanded upon London's existing monitoring networks. The project aimed to help people better understand their local air quality and support cities around the world with future monitoring initiatives.
*Christina: I didn't edit these paragraphs from the current notebook, but was wondering if this implies we are using the Google Street View car data, which I don't think we are.*
#### 🔧 Setup, import and prep
We begin by setting up our environment and importing the Python libraries that we will be using for the analysis. In particular, `pandas` and `numpy` are key data science libraries used for data processing, while `matplotlib` and `seaborn` will help us visualise our data.
`Code block`
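A minimal sketch of what this setup might contain (the `geopandas` import is an assumption, anticipating the mapping sections later in the story):

```python
# Core data-handling libraries
import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed here for the geospatial sections later in the story
import geopandas as gpd

sns.set_theme(style="whitegrid")  # consistent styling for all plots
```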
*Christina: Current notebook references accessing collab data, but this needs to be updated for the story. Will the notebook grab the data directly from the Global Clean Air url?*
`Code block`
We can see that there is a CSV recording the hourly time series of PM2.5 readings and a CSV containing information about the sensor locations. They both share a common ID in the column `pod_id_location`. Let's join the PM2.5 and NO2 readings to the location information, creating dataframes that contain both the time series values and the sensor metadata.
`Code block`
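A minimal sketch of the join, assuming the hourly readings and the sensor metadata have been loaded into hypothetical dataframes `no2_df` and `sites_df` (the file names below are placeholders):

```python
import pandas as pd

# Hypothetical file names: hourly NO2 readings and sensor location metadata,
# both keyed on the shared pod_id_location column.
no2_df = pd.read_csv("breathe_london_no2_hourly.csv")
sites_df = pd.read_csv("breathe_london_site_metadata.csv")

# Left-join the readings onto the location metadata; any column that appears
# in both frames gets a _DROP suffix on the metadata side so it can be removed later.
no2_merged = no2_df.merge(
    sites_df,
    on="pod_id_location",
    how="left",
    suffixes=("", "_DROP"),
)
```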
Here we can see that the NO2 data has successfully been joined with the metadata available for the sensor locations. We drop repeated columns using a regex filter on the suffix `_DROP`, which was added to duplicated columns during the merge.
`Code block`
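The duplicate-column clean-up might look something like this, reusing the `no2_merged` dataframe from the sketch above:

```python
# Find the duplicated columns that picked up the _DROP suffix during the merge
# using a regex filter on the column names, then drop them.
drop_cols = no2_merged.filter(regex=r"_DROP$").columns
no2_merged = no2_merged.drop(columns=drop_cols)
```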
The distribution of the numeric variables looks positively skewed and these variables may need to be transformed for modelling. It looks like `NA` values are coded differently depending on the variable. For example, the value `-999.00000` seems to indicate null values in both the PM2.5 and NO2 readings. Let's take a further look at the number of `NA` values in the data and where they exist.
`Code block`
Here we can see that most of the `NA` values exist in the variables `relocate_date_UTC`, `Distance_from_Road` and `Site_Type`. We can deal with this missing data by recoding and omitting the values, or by imputing them.
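A minimal sketch of the missing-value recode and count, assuming the `-999` sentinel and the merged dataframe from the sketches above:

```python
import numpy as np

# Recode the -999 sentinel used in the pollutant readings as a proper NaN.
no2_merged = no2_merged.replace(-999.0, np.nan)

# Count missing values per column and sort to see where most of the gaps are.
na_counts = no2_merged.isna().sum().sort_values(ascending=False)
print(na_counts.head(10))
```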
*Christina: the prep above references modelling, but I don't think we'll be doing any new modelling in this story. Does this need to be revisited?*
### 🔭 Preliminary data exploration
To start, we can find out a little more about the air pollution sensor pods.
`Code block`
Here we can see the number of unique values for each of our datasets: there are `107` unique `NO2` stations across `34` boroughs in London, and `95` stations capturing `PM2.5` values across `29` boroughs.
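Counting the unique stations and boroughs might be done along these lines (the `pm25_merged` dataframe and the `borough` column name are assumptions):

```python
# Number of unique sensor pods and boroughs covered by each pollutant dataset.
for name, df in [("NO2", no2_merged), ("PM2.5", pm25_merged)]:
    n_stations = df["pod_id_location"].nunique()
    n_boroughs = df["borough"].nunique()  # assumed column name
    print(f"{name}: {n_stations} stations across {n_boroughs} boroughs")
```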
#### 🗺 Plotting geospatial distribution
In order to display the map of the station locations accurately, we need to set the coordinate reference system (CRS) of the georeferenced information in our `GeoDataFrame`. This requires knowing some information about projection systems, but for now we know that any geospatial information given in `Latitude` and `Longitude` is well represented using the `EPSG:4326` coordinate reference system.
`Code block`
In the above map (using the metadata of the individual stations) we can see that there are more NO2 stations than PM2.5 stations (also indicated by the number of unique stations in each `dataframe`), and that these stations are located slightly further out. The stations in purple indicate an overlapping PM2.5 and NO2 sensor in the same location.
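A sketch of how the station metadata might be converted to a `GeoDataFrame` in the `EPSG:4326` coordinate reference system and plotted (the `Latitude`/`Longitude` column names are assumptions):

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Build point geometries from the Longitude/Latitude columns and declare that
# they use the EPSG:4326 (WGS84) coordinate reference system.
stations_gdf = gpd.GeoDataFrame(
    sites_df,
    geometry=gpd.points_from_xy(sites_df["Longitude"], sites_df["Latitude"]),
    crs="EPSG:4326",
)

# Quick plot of the station locations; a borough boundary layer could be added underneath.
ax = stations_gdf.plot(markersize=10, figsize=(8, 8))
ax.set_title("Breathe London sensor locations")
plt.show()
```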
#### 📈 Plotting average daily values
Let's create some additional columns to summarise the data with, and merge both dataframes on the newly created `date` column.
`Code block`
We can see the average daily PM2.5 and NO2 results plotted above. These demonstrate that there is high variability in the average daily pollution levels across the city, reflected in the spiky line graphs. We can also see some seasonal variability, with higher PM2.5 and NO2 concentrations being recorded in the autumn and winter months (September to February) of 2018-2019. We cannot see the same increase in concentrations the following winter (2019-2020), potentially due to lockdowns being enforced in the late winter to early spring of 2020.
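A sketch of how the daily averages across all sensors might be computed and plotted, assuming an hourly timestamp column called `date_UTC` and reading columns called `NO2` and `PM2.5` (all hypothetical names):

```python
import matplotlib.pyplot as plt
import pandas as pd

def daily_mean(df, value_col, time_col="date_UTC"):
    """Average a pollutant column across all sensors for each calendar day."""
    out = df.copy()
    out["date"] = pd.to_datetime(out[time_col]).dt.date
    return out.groupby("date")[value_col].mean()

fig, ax = plt.subplots(figsize=(10, 4))
daily_mean(no2_merged, "NO2").plot(ax=ax, label="NO2")
daily_mean(pm25_merged, "PM2.5").plot(ax=ax, label="PM2.5")
ax.set_ylabel("Daily mean concentration (µg/m³)")
ax.legend()
plt.show()
```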
*Christina: What do you think about moving the line graph of daily averages above the mapping? The idea being to start big picture and then drill down into details.*
#### 📍 Local view: annual and daily trends
Let's see how the pollutant levels vary at individual sensors by taking their local daily averages (sensor IDs are unique).
`Code block`
After sampling the yearly concentration values from the first 5 unique station IDs, we can plot the air pollution concentrations for each of the locations.
`Code block`
Here we can see the daily average pollution concentration data for the first 5 unique sensor IDs. The individual sensor readings show there is a lot of missing data (at differing intervals for each sensor). We can also see that all the sensors exhibit a spike in NO2 concentrations in July/August 2019, and again in 2020.
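A sketch of this per-sensor view, taking the first five unique sensor IDs (dataframe and column names as assumed above):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Take the first five unique sensor IDs and plot each one's daily mean NO2.
first_ids = no2_merged["pod_id_location"].unique()[:5]

fig, ax = plt.subplots(figsize=(10, 4))
for pod_id in first_ids:
    sensor = no2_merged[no2_merged["pod_id_location"] == pod_id].copy()
    sensor["date"] = pd.to_datetime(sensor["date_UTC"]).dt.date
    sensor.groupby("date")["NO2"].mean().plot(ax=ax, label=str(pod_id))

ax.set_ylabel("Daily mean NO2 (µg/m³)")
ax.legend(title="Sensor ID")
plt.show()
```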
We can also look at average values for the time of the measurement at a single location to explore daily trends.
*Christina: For consistency, is it also worth doing this for all sensors in combination with the daily averages above?*
`Code block`
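The time-of-day averages at a single location might be computed along these lines (same assumed column names as above):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Pick one sensor and average its readings by hour of day to show the daily cycle.
pod_id = no2_merged["pod_id_location"].unique()[0]
sensor = no2_merged[no2_merged["pod_id_location"] == pod_id].copy()
sensor["hour"] = pd.to_datetime(sensor["date_UTC"]).dt.hour

fig, ax = plt.subplots(figsize=(8, 4))
sensor.groupby("hour")["NO2"].mean().plot(kind="bar", ax=ax)
ax.set_xlabel("Hour of day")
ax.set_ylabel("Mean NO2 (µg/m³)")
plt.show()
```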
*Christina: Do you still want to include the distance from road analysis, or should we drop that?*
### 🏫 Schools and air quality
Using the `Site_Type` column, we can further investigate sensor locations near schools.
`Code block`
This shows that there are 98 sensors available in the network, 35 of which were located on or near schools.
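A sketch of how the school-sited sensors might be counted from the `Site_Type` column (the exact label text is an assumption):

```python
# Total number of sensor pods in the metadata.
n_sensors = sites_df["pod_id_location"].nunique()

# Sensor pods whose site type indicates a school location (label value assumed).
school_mask = sites_df["Site_Type"].str.contains("school", case=False, na=False)
n_school = sites_df.loc[school_mask, "pod_id_location"].nunique()

print(f"{n_sensors} sensors in the network, {n_school} located on or near schools")
```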
#### Analysing the location of schools proximate to air pollution sensors
While the site type is a useful starting point, we want to dig a little deeper into the locations of sensors and schools. We need to know whether schools are located in areas of high pollution, which means defining a methodology to assess whether a school's location intersects an area of high pollution.
Here we are pulling in data from [London Schools Atlas](https://data.gov.uk/dataset/6b776872-c786-4960-af1d-dab521aa4ab0/london-schools-atlas) on locations of schools, and some key information such as:
* count of the pupil flows from `LSOA` to each school
* multiple deprivation information such as free school meals
* the locations of each school and the number of total students they have.
`Code block`
In order to intersect the sensor locations with school locations, we need to generate a geographic "buffer zone" around each sensor, one which we think represents the area likely to also have high pollution levels. In this analysis, we use a 75m buffer around each sensor's location point.
`Code block`
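A sketch of the buffer-and-intersect step with `geopandas`, reusing the `stations_gdf` from earlier and a hypothetical `schools_gdf` of school locations. Note that buffering in metres requires reprojecting from `EPSG:4326` to a metric CRS such as the British National Grid (`EPSG:27700`):

```python
import geopandas as gpd

# Reproject to the British National Grid so buffer distances are in metres.
sensors_bng = stations_gdf.to_crs("EPSG:27700")
schools_bng = schools_gdf.to_crs("EPSG:27700")  # hypothetical GeoDataFrame of school point locations

# Draw a 75m buffer around each sensor location.
sensor_buffers = sensors_bng.copy()
sensor_buffers["geometry"] = sensor_buffers.geometry.buffer(75)

# Spatial join: keep the schools whose location falls inside a sensor buffer.
schools_near_sensors = gpd.sjoin(schools_bng, sensor_buffers, how="inner", predicate="within")
print(f"{len(schools_near_sensors)} school-sensor pairs within 75m")
```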
#### Defining exposure
Before we continue, we'd like to step back and come up with a definition of exposure to pollutants above WHO limits that will be used in the analysis below.
...
#### Exposure at school
...
#### Exposure in school catchments
...
#### Combining social deprivation
...
### Conclusions
TBD
#### 📚 Additional Resources
* [Clean Air Day (16 June 2020)](https://www.actionforcleanair.org.uk/campaigns/clean-air-day)
* [London Schools Pollution Helpdesk](https://www.pollutionhelpdesk.co.uk/)
* [Clean Air for Schools Framework](https://www.transform-our-world.org/programmes/clean-air-for-schools)
* [London Air Quality Network](https://www.londonair.org.uk/)
* [Breathe London](https://www.breathelondon.org/)