# **Weather and Air Pollution in Bergen**
Katarina Andersen, Astrid Bragstad Gjelsvik, Asena Goren
November 15, 2020
## **Introduction**
Air pollution is a major health problem in urban areas. The World Health Organization estimates that outdoor air pollution causes 4.2 million preventable deaths every year **(1)**. Geographic and atmospheric conditions play an important role in determining how the concentrations of pollutants from local emissions develop. If we want to understand and predict patterns of air pollution in urban areas, it is important to understand exactly how weather patterns affect the concentrations of pollution indicators. Our aim in this project is to explore how atmospheric variables such as air temperature, wind speed, relative humidity and precipitation relate to measured concentrations of nitrogen dioxide (NO2) and particulate matter (PM). Specifically, we look at particulate matter classified as PM10 and PM2.5 based on size. Particulate matter of size up to 10 $\mu$m is classified as PM10, and PM2.5 refers to particulate matter of size up to 2.5 $\mu$m. Because the classifications refer to a size maximum, PM2.5 particles are also included in PM10. However, the two categories are treated separately because small pollutant particles are thought to have different and more severe health impacts than larger particles **(2)**.
In this project, we use data from the Meteorological Institute of Norway (MET) and the Norwegian Institute for Air Research (NILU) to analyse trends in the Bergen city center over a one-year period, namely December 2009 through November 2010. This time period features an extreme air pollution event, which occurred on January 12th, 2010. The NO2 levels on January 12th surpassed the threshold of 200 $\mu$g/m$^3$, below which concentrations are generally considered safe by the Norwegian Environment Agency and the Norwegian Public Health Institute **(3)**. The NO2 concentration even surpassed the "alarm" threshold of 400 $\mu$g/m$^3$ for several hours.
This extreme pollution event was a result of a specific atmospheric condition, namely a strong inversion. A strong inversion is characterised by a stable atmospheric layer, in which warm air is situated above cool air. This hinders vertical mixing, and thus the NO2 concentration accumulates close to the ground. When we analyse weather conditions in relation to air pollution concentrations, we are therefore interested in considering both windspeed and air temperature. While wind breaks up inversions and transports pollutants away even on an inversion-free day, cold temperatures on winter days can indicate an inversion and thus high concentrations of air pollution. We also consider rain, as we are interested in whether rain could lead to wet deposition of pollutants and thereby lower their concentrations.
In the following sections, we first discuss our study area in more detail. Next, we present the data we used and how we processed it, and discuss seasonality in the data. We investigate the correlations between our weather variables and the concentrations of pollution indicators, and describe the statistical properties of our dataset. Thereafter, we further analyse the relationships between the air pollution indicators and the weather variables through linear regression models, principal component analysis and prediction by machine learning.
## **Study area**
Because we want to understand air pollution in urban environments, we tried to locate the air pollution station and the meteorological station closest to the city center of Bergen. As a result, we chose the NILU station at Rådhusplassen in Bergen and the MET station at the Geophysical Institute of the University of Bergen (Florida). The locations of the stations can be seen in the maps below. The stations are approximately 1.3 km apart.
The city centre is the most densely populated area and has a large amount of traffic, which causes high pollution concentrations. Additionally, the topography of Bergen causes air pollution to accumulate in the city center. The city of Bergen is surrounded by mountains, which can hinder air transport when vertical mixing is reduced.
*Maps of the Bergen city center with data collection stations marked*

## **Presentation of data**
Here we present the raw data used in this analysis. It is presented seasonwise, divided into three-month sections. For each three-month section we see the concentration of air pollutants (PM10, PM2.5 and NO2), followed by the daily average temperature, daily relative humidity, total daily precipitation and daily average windspeed. Generally, we will refer to these variables as temperature, relative humidity, precipitation and windspeed, respectively. Note that the plots for the different seasons are not scaled to have the same y-axis, in order to make the within-season variations more visible.
*All variables in December-January-February*

The graphs above show the highest concentrations of air pollutants for the whole year, which occurred in December, January and February. We recognize a peak higher than 200 $\mu$g/m$^3$ that corresponds with the extreme air pollution event reported on January 12th, 2010. We see that this event also corresponded with a low air temperature and relatively low wind speeds.
*All variables in March-April-May*

In March, April and May we see lower air pollution concentrations than in the months before. While there are some large peaks in the beginning of March, the concentrations seem to be more constrained as we approach April and May. We also see rising temperatures.
*All variables in June-July-August*

In June, July and August we also see relatively low air pollution concentrations compared to the previous months, with a peak of about 60 $\mu$g/m$^3$ in the middle of August. We note that this peak occurs simultaneously with a peak in temperature.
*All variables in September-October-November*

In September, October and November we see some larger peaks in air pollution as we approach the end of November, which correspond to decreases in temperature to below 0 degrees. This coincides with lower relative humidity and precipitation. It should also be noted that there are a couple of missing values in our data: one for PM10 in the beginning of January, and one for NO2 at the end of November.
*Correlation plot of initial data*

Above we see the correlations between different weather and air pollution variables. We can see that the air pollution indicators are highly correlated among themselves. Apart from this, we have somewhat high negative correlations between NO2 and temperature and between NO2 and wind (both below -0.5). PM10 and PM2.5 are a little less correlated with temperature and wind than NO2 is. All three air pollution indicators have stronger correlations with temperature and wind than with precipitation. Among the climate variables, relative humidity is the most weakly correlated with all three pollution indicators, and it is also notably correlated with precipitation. For this reason, many of our analyses exclude relative humidity as a climate variable.
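
A correlation table like the one described above can be sketched with pandas. The data here is synthetic and the column names are hypothetical stand-ins for the daily dataset; only the call to `DataFrame.corr` reflects the kind of computation involved.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily dataset (all values hypothetical).
rng = np.random.default_rng(6)
n = 365
temp = rng.normal(8.0, 6.0, n)   # daily mean temperature, deg C
wind = rng.gamma(4.0, 1.0, n)    # daily mean windspeed, m/s
no2 = np.exp(4.3 - 0.03 * temp - 0.6 * np.log(wind) + rng.normal(0.0, 0.3, n))
pm10 = 0.5 * no2 + rng.gamma(2.0, 3.0, n)

df = pd.DataFrame({"NO2": no2, "PM10": pm10, "temp": temp, "wind": wind})

# Pairwise Pearson correlations between all columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

Pearson correlation only captures linear association, which is one reason the scatter plots are worth inspecting alongside the table.
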
*Scatter plots showing correlations between PM10 and all other variables*

The plots above show the correlations between PM10 and the other pollution indicators (PM2.5 and NO2), as well as all the climate variables used in this study. It is not surprising to see a strong positive correlation between PM10 and PM2.5 since, as mentioned earlier, PM10 measurements include PM2.5 particles. Cars contribute to both the amount of coarse mode particles in the air and the NO2 concentrations, thus this might be a good explanation for the strong positive correlation between PM10 and NO2. Additionally, PM10 and NO2 concentrations are both impacted by the same meteorological conditions such as wind transportation and inversions.
As seen in both the scatterplots and the correlation table above, there is some negative correlation between the PM10 concentration and temperature and wind speed. Negative correlation suggests that the PM10 concentration is higher when wind speeds and temperatures are low. The relationships between PM10 concentration and precipitation and relative humidity are a little more convoluted, and the direct linear correlations are weak.
*Scatter plots showing correlations between PM2.5 and all other variables*

The scatterplots above showcase how the PM2.5 concentration relates to the other variables considered in this study. Again, we can see strong positive correlations between PM2.5 and the other pollution indicators, PM10 and NO2. This time we see an even stronger negative correlation between PM2.5 concentration and the daily averaged windspeeds and daily averaged temperatures, compared to what we saw above with PM10. This could be because the smaller particles are more affected by air transportation caused by the wind and warm temperatures.
*Scatter plots showing correlations between NO2 and all other variables*

The scatterplots above show how the NO2 mass concentration correlates with the other variables. Again, the pollution indicators have a strong positive correlation with each other. The negative correlation between NO2 and wind is even stronger than for PM10 and PM2.5. However, we can see that the trend between NO2 and wind is not quite linear, indicating that low concentrations of air pollution are not necessarily the direct result of high windspeeds.
*Histogram and boxplot showing the distributions of each variable*







The plots above show how the variables are distributed. Temperature is the variable closest to being normally distributed, with only a slight positive skew. A positive skew indicates that the distribution has a longer tail towards higher values. The temperature data did not have any outliers. The wind data has a fairly normal-looking peak that sits towards the left of its range. However, because windspeeds cannot be below 0 m/s, the left tail of the distribution stops abruptly. From the boxplot of the wind data, we see that there are some outliers to the right of the distribution, indicating that there were some days with very strong winds. The relative humidity histogram suggests that the data is somewhat normally distributed, but this time with the peak towards the right of its range. The boxplot shows some outliers to the left of the distribution, suggesting there were a few days in our dataset with rather low relative humidity.
The histograms and boxplots of the precipitation and the mass concentrations of PM10, PM2.5 and NO2 suggest that these variables are not normally distributed. All of these variables also have a cut-off at zero. Precipitation further has a large peak at zero because there are so many days without any rain, which lends it an almost binomial-like character. The pollution indicators are also quite peaked, showing higher kurtosis than a normal distribution.
## **Methods**
**Data Acquisition**
*MET data*
We accessed the MET data using Norsk klimaservicesenter (klimaservicesenter.no). From this webpage we collected daily mean temperature data, average of daily mean wind speed data, 24 hour precipitation data and daily mean relative humidity data. The period we extracted was December 1, 2009 - November 30, 2010. We selected this range to overlap two calendar years so that the winter season would be continuous. There is no missing data in this time period.
*NILU data*
We accessed the NILU data using NILU's application programming interface (api.nilu.no). This site contained the hourly measurements of NO2, PM10 and PM2.5 concentrations. The period we extracted was December 1, 2009 - November 30, 2010. There are only two datapoints missing in the time period, one for PM10 and one for NO2.
To make the NILU and MET data comparable on a daily scale, we calculated the daily averages of the concentrations of each pollution indicator. For all seasonal analyses, we divided the data into seasonal groups with time bins of three months each.
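
The daily averaging and seasonal binning described above could look roughly like the following sketch. The hourly series is synthetic and the column name `NO2` is a hypothetical stand-in for the NILU download; the season labels (DJF, MAM, JJA, SON) are also an assumption about how the three-month bins were named.

```python
import numpy as np
import pandas as pd

# Synthetic hourly NO2 series standing in for the NILU data (hypothetical values).
rng = np.random.default_rng(0)
hours = pd.date_range("2009-12-01", periods=365 * 24, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({"NO2": rng.gamma(2.0, 20.0, len(hours))}, index=hours)

# Daily mean of the hourly concentrations; missing hours are skipped by default.
daily = hourly.resample("D").mean()

# Three-month seasons, with December grouped with the following January and February.
season_of_month = {12: "DJF", 1: "DJF", 2: "DJF", 3: "MAM", 4: "MAM", 5: "MAM",
                   6: "JJA", 7: "JJA", 8: "JJA", 9: "SON", 10: "SON", 11: "SON"}
daily["season"] = daily.index.month.map(season_of_month)

print(daily.groupby("season")["NO2"].agg(["count", "mean"]))
```

Starting the extraction in December keeps the winter bin contiguous, which matches the reason given above for spanning two calendar years.
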
**Normality of Data**
To check if the climate and pollution variables we are investigating are normally distributed, we first made QQ-plots to visualize the data, as presented in the figures below. For the variables that did not have a very linear-looking QQ plot, a variety of basic transformations (such as log, inverse, square, square-root) were applied to the data to determine (visually) if the QQ-plot would become more linear. If the QQ-plot did look more linear, transformed or "normalized" versions of the variables were used for some analyses, as described in the sections below.
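
The visual QQ-plot check can be complemented by a number. The sketch below uses `scipy.stats.probplot`, which returns the QQ points together with the correlation coefficient of a straight-line fit through them; values closer to 1 mean a straighter QQ plot. The data is synthetic (lognormal by construction), and using this coefficient as a linearity score is our illustration, not necessarily the procedure used in the project.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic right-skewed "concentration" data (hypothetical values).
conc = rng.lognormal(mean=3.0, sigma=0.6, size=365)

# probplot returns ((theoretical quantiles, ordered data), (slope, intercept, r)).
(_, _), (_, _, r_raw) = stats.probplot(conc, dist="norm")
(_, _), (_, _, r_log) = stats.probplot(np.log(conc), dist="norm")

print(f"QQ-line correlation, raw: {r_raw:.3f}, log-transformed: {r_log:.3f}")
```
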
Extreme values of the pollution indicators were analyzed by looking at the maximum values of NO2, PM10, and PM2.5 concentrations. An outlier test was used to determine whether the maximum values are indeed outliers. The probabilities of getting values as high as the maximum values were calculated using a normal distribution and a right-skewed Gumbel distribution. The formula for the outlier test used is $X_H=\bar{X}+K_nS_X$, where $X_H$ is the upper threshold for outliers, $\bar{X}$ is the mean, $S_X$ is the standard deviation, and $K_n$ is defined as $K_n\approx 1.055+0.981\log_{10}n$, where $n$ is the sample size.
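
A minimal sketch of the outlier threshold and the two exceedance probabilities follows, using only the Python standard library. The sample values are hypothetical. Fitting the Gumbel distribution by the method of moments and using the sample standard deviation are assumptions made for this illustration; the original analysis may have fitted the distributions differently.

```python
import math
import statistics

def outlier_threshold(values):
    """Upper threshold X_H = mean + K_n * S_X, with K_n ~ 1.055 + 0.981*log10(n)."""
    k_n = 1.055 + 0.981 * math.log10(len(values))
    return statistics.mean(values) + k_n * statistics.stdev(values)

def normal_exceedance(x, mean, std):
    """P(X >= x) under a normal distribution with the given mean and std."""
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2.0)))

def gumbel_exceedance(x, mean, std):
    """P(X >= x) under a right-skewed Gumbel matched to the mean and std (assumption)."""
    scale = std * math.sqrt(6.0) / math.pi
    loc = mean - 0.5772156649 * scale  # Euler-Mascheroni constant times scale
    return 1.0 - math.exp(-math.exp(-(x - loc) / scale))

# Hypothetical daily concentrations with one extreme day.
values = [20, 25, 30, 35, 40, 28, 32, 27, 33, 150]
mean, std = statistics.mean(values), statistics.stdev(values)
threshold = outlier_threshold(values)
p_normal = normal_exceedance(max(values), mean, std)
p_gumbel = gumbel_exceedance(max(values), mean, std)

print(f"threshold={threshold:.1f}, max is outlier: {max(values) > threshold}")
print(f"P(normal)={p_normal:.4g}, P(Gumbel)={p_gumbel:.4g}")
```

Because the Gumbel distribution has a heavier right tail than the normal distribution, the same maximum is assigned a larger probability under it.
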
A one-sample Kolmogorov-Smirnov (KS) test was used to test whether each pollution indicator (NO2, PM10, and PM2.5 concentrations) follows a normal or right-skewed Gumbel distribution. A two-sample KS test was used to test whether the pollution indicators follow the same distribution as each other. The one-sample test was run on both the full dataset and randomly selected subsets of the dataset, because KS tests are less likely to reject the hypothesis that a dataset follows a given distribution when the number of datapoints is small. The results of the KS tests were compared with KS test results for generated random variables following a normal or Gumbel distribution.
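
The two kinds of KS test can be sketched with SciPy as below. The data is synthetic (drawn from Gumbel distributions), and estimating the reference distribution's parameters from the same sample is an assumption of this sketch; doing so makes the reported p-values more lenient than the nominal test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic stand-ins for two pollution indicators (hypothetical values).
no2 = rng.gumbel(loc=30.0, scale=12.0, size=365)
pm10 = rng.gumbel(loc=15.0, scale=6.0, size=365)

# Test of one sample against a normal distribution with estimated parameters.
norm_res = stats.kstest(no2, "norm", args=(no2.mean(), no2.std(ddof=1)))

# Test of one sample against a right-skewed Gumbel with fitted loc and scale.
gum_loc, gum_scale = stats.gumbel_r.fit(no2)
gumbel_res = stats.kstest(no2, "gumbel_r", args=(gum_loc, gum_scale))

# Test of whether two samples come from the same distribution.
two_res = stats.ks_2samp(no2, pm10)

print(f"normal: D={norm_res.statistic:.3f}, p={norm_res.pvalue:.3g}")
print(f"Gumbel: D={gumbel_res.statistic:.3f}, p={gumbel_res.pvalue:.3g}")
print(f"NO2 vs PM10: D={two_res.statistic:.3f}, p={two_res.pvalue:.3g}")
```

A small p-value is evidence against the tested hypothesis; a large p-value only means the test failed to reject it.
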
**Ordinary Least Squares Linear Regression**
To make a model of how climate variables might influence pollution levels, we used the ordinary least squares (OLS) method in the Python statsmodels library to generate linear regressions for each pollution indicator (dependent variable) based on the climate variables (independent variables). Relative humidity was not used in the linear regression because it had low correlation to all the pollution indicators and high correlation with precipitation. Therefore the climate variables to select from in building the linear regression were air temperature, windspeed, and precipitation.
Generally, the normalized (transformed) versions of the variables were used for making the linear regressions. The exception is precipitation: while the log transformation of the square root of precipitation appears more linear in the QQ-plot, precipitation is often zero and the logarithm of zero is undefined, so this transformation produced many NaN values and was not useful for making regressions. Therefore, the unmodified precipitation values were used for the regressions.
To select which climate variables would be included in the linear regression, first the two most highly correlated single climate variables were each used individually to make a linear regression. Then a linear regression was made using both of those variables. Lastly, a linear regression was made using all three climate variables. The models were compared using the $R^2$ value, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) to select one best model. If two models were hard to differentiate between, the coefficients were all checked for significance at the $\alpha=0.05$ level and then the model with the lowest BIC was selected.
**Principal Component Analysis**
The PCA model from the scikit-learn package was used to perform the principal component analysis (PCA). A PCA decomposition was run using both 2 and 3 principal components. A model was made for each pollution indicator using an OLS linear regression with the principal components. The linear model made from the principal components was compared with the linear model made from climate variables as described in the above section.
**Machine Learning**
The Random Forest Regressor from the scikit-learn package was used to make a model of NO2 concentration based on climate variables. The test set was constructed using 100 randomly selected data points from the year of data used in this study. This model was then used to make a prediction for the full year of data. This prediction was then compared with the prediction made by the OLS linear regression method described above.
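
A sketch of this workflow with scikit-learn is given below, using synthetic data. The number of trees, the random seeds and the feature set are hypothetical choices, not the project's settings. The sketch reports the correlation both for the full year and for the 100 held-out points, since a full-year prediction includes the days the model was trained on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 365
temp = rng.normal(8.0, 6.0, n)
wind = rng.gamma(4.0, 1.0, n)
precip = rng.exponential(6.0, n)
# Synthetic NO2 concentrations (hypothetical relationship).
no2 = np.exp(4.3 - 0.03 * temp - 0.6 * np.log(wind) + rng.normal(0.0, 0.3, n))
X = np.column_stack([temp, wind, precip])

# Hold out 100 randomly selected days as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, no2, test_size=100, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Prediction for the full year (training days included) and for the test days only.
pred_all = model.predict(X)
r_all = np.corrcoef(no2, pred_all)[0, 1]
r_test = np.corrcoef(y_test, model.predict(X_test))[0, 1]

print(f"Pearson r, full year: {r_all:.2f}; held-out test days: {r_test:.2f}")
```

The held-out correlation is the more conservative measure of predictive skill, because random forests fit their training data closely.
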
## **Results**
**Normality of Data**
The results of visual inspection of the QQ-plots and normalizing transformations of the data are represented in the graphs below.
*QQ plot of the original data and the log-transformed data*

The QQ plot is helpful for seeing when a transformation (in this case log transformation) is likely to make the distribution of the variable more normal. The closer the plot is to being a straight, diagonal line, the closer the data is to being normally distributed. From the first and third rows of the plot, we see that temperature and relative humidity are actually further from normally distributed after log transformation. By contrast, windspeed and pollutant concentrations (PM10, PM2.5 and NO2) look more normal after log transformation.
*QQ plot of $\log(\sqrt{\mathrm{precip}})$*

The precipitation data also seems to become a bit more normal after log transformation, but the resulting QQ plot is still not very linear. Log transformation of the square root of the precipitation data is the closest to normal of the transformations tested. However, as mentioned before, log transformation is not very useful for precipitation data because of the many zero values, which become undefined.
*Correlation table including the normalized data*

This correlation table reexamines correlations between the variables, including the log transformations that looked promising in the QQ plots above. We can see that the pollution indicators are still highly correlated with each other after log transformation. Log transformation of the wind also seems to have strengthened its correlation with the pollution indicators. Temperature is similarly correlated with the pollution indicators before and after they are log transformed.
**KS tests**
The Kolmogorov-Smirnov (KS) test against a reference distribution did not support the hypothesis that any of the pollution indicators follows either a normal or a Gumbel distribution. Likewise, the KS test comparing the pollution indicators with each other did not show that any of them follow the same distribution. This means we should be cautious when interpreting the results of the OLS linear regression, which makes assumptions about the normality of the data being used for it. That said, the KS test can be quite strict: even with large generated datasets of normally distributed random variables, the KS test does not always yield very high probabilities that the random values are normally distributed, or that two samples follow the same distribution.
**Extremes and Outliers**
The maximum values of the pollution indicators all exceeded the outlier thresholds. The following were the probabilities of the maximum values given a normal and a right-skewed Gumbel distribution:

| Pollution indicator | Probability of maximum value (normal distribution) | Probability of maximum value (Gumbel distribution) |
|---|---|---|
| NO2 | 1.36e-13 | 0.0007 |
| PM10 | 2.9e-14 | 0.0005 |
| PM2.5 | 1.29e-11 | 0.0013 |

The maximum values, while unlikely in both distributions, were much more likely to occur in the Gumbel distribution, which makes sense given the positive skew of the pollution indicators.
**Ordinary Least Squares Linear Regression**
For NO2, the best linear regression model was the one using the climate variables of temperature and wind, based on the relatively high $R^2$ value of 0.583 and relatively low BIC. The formula for the best fitting model is as follows:
$\log(\mathrm{NO2}) \approx -0.03\,T - 0.61\,\log(\mathrm{Wind}) + 4.3$
For PM10, the best linear regression model was the one using all three climate variables of temperature, wind and precipitation. This model had the highest $R^2$ ($R^2$=0.342) and lowest BIC of the ones tested. It is an interesting model because the coefficient for precipitation is very small. However, from the summary table below we can see that its confidence interval does not include zero, suggesting that although the coefficient is small, it is significantly different from zero. The formula for the best fitting model is as follows:
$\log(\mathrm{PM10}) \approx -0.02\,T - 0.36\,\log(\mathrm{Wind}) - 0.001\,\mathrm{Precip} + 3.3$
For PM2.5, the best linear regression model was the one using the climate variables of temperature and wind. For this model the $R^2= 0.469$ and the model formula is:
$\log(\mathrm{PM2.5}) \approx -0.02\,T - 0.54\,\log(\mathrm{Wind}) + 2.8$
The tables below show the summaries of the best fitting linear models as described above.
*Summary table of linear regression model for log(NO2) with temperature and log(wind):*

*Summary table of linear regression model for log(PM10) with temperature, log(wind) and precipitation:*

*Summary table of linear regression model for log(PM2.5) with temperature and log(wind):*

**Principal Component Analysis**
The PCA using two principal components (referred to below as the PC2 model) yielded an interesting model. The first principal component had an explained variance ratio of 0.481 and the second had an explained variance ratio of 0.281. From the scatterplot below we can see that the vectors for the climate variables point in different directions, and that precipitation and temperature are nearly orthogonal.
*Scatterplot showing the relationship between the 2 Principal Components identified by the PCA analysis*

The PCA using three principal components yielded a model that was pretty similar to the one with two principal components. The following were the explained variance ratios of the three principal components:
PC1 explained variance ratio: 0.481
PC2 explained variance ratio: 0.281
PC3 explained variance ratio: 0.238
When two principal components (the PC2 model) or three principal components (the PC3 model) were used to make a linear regression model for the pollution indicators, the models yielded very similar results. The PC2 model has a lower BIC, as expected, and so is generally the preferable of the two.
The following are comparisons of the $R^2$ values, AIC, and BIC for the principal component OLS linear regression and the climate variables OLS linear regression for each pollution indicator:

| Pollution indicator | Model | $R^2$ | AIC | BIC |
|---|---|---|---|---|
| NO2 | NO2 ~ PC2 | 0.499 | 3187.23 | 3198.92 |
| NO2 | NO2 ~ PC3 | 0.499 | 3189.19 | 3204.77 |
| NO2 | NO2 ~ Temp and Wind | 0.583 | 312.87 | 324.56 |
| PM10 | PM10 ~ PC2 | 0.258 | 2899.39 | 2911.08 |
| PM10 | PM10 ~ PC3 | 0.261 | 2899.64 | 2915.23 |
| PM10 | PM10 ~ Temp, Wind and Precipitation | 0.342 | 409.26 | 424.85 |
| PM2.5 | PM2.5 ~ PC2 | 0.35 | 2380.25 | 2391.95 |
| PM2.5 | PM2.5 ~ PC3 | 0.351 | 2381.72 | 2397.32 |
| PM2.5 | PM2.5 ~ Temp and Wind | 0.469 | 325.62 | 337.32 |

For all the pollution indicators, the linear regressions based on climate variables had higher $R^2$ values than those based on principal components. However, the principal components also had reasonable predictive power, especially in the center-region of the dataset that doesn't include the peak pollution measures. The figures below show comparisons between the PC models and earlier OLS models with climate variables. The graphs at the bottom also show comparisons between the PC2 and PC3 models which are very similar.
*Comparison between OLS using PCA with 2 components and OLS using climate variables (aka "OLS est" in legend) for predicting NO2 concentrations*

*Comparison between OLS using PCA with 2 components and OLS using climate variables (aka "OLS est" in legend) for predicting PM10 concentrations*

*Comparison between OLS using PCA with 2 components and OLS using climate variables (aka "OLS est" in legend) for predicting PM2.5 concentrations*

*Comparison between PC2 and PC3 models for each pollution indicator:*

**Machine Learning**
The Random Forest Regressor (RFR) was used to generate a model for NO2 based on all the climate variables in the dataset. The RFR model showed very similar predictive power to the best OLS linear regression models above. The Pearson correlation coefficient between observed NO2 and predicted NO2 using RFR is 0.8, compared to 0.76 for the correlation between observed NO2 and the prediction using the OLS regression. The figure below compares the RFR and OLS models by overlaying the predicted values with the observed values, and we can see the two models perform quite similarly.
*Comparison between RFR and OLS models:*

## **Discussion**
With our three different models for predicting air pollution using weather data, the model using a machine learning Random Forest Regression method seems to perform best, as this one's prediction has the highest correlation with the original data. Overall, all three models work pretty well for predicting pollution indicators from climate variables, especially considering that factors like temperature, wind, and precipitation do not actually create the pollutants.
Throughout this analysis we make many assumptions. For example, we assume that the MET weather station and NILU air pollution station, while not located in the same place, are close enough that spatial variation does not need to be considered in the analysis. We also assume that examining the data on a daily time scale is equally relevant for climatic and pollution variables. Furthermore, we treat the data collected as "perfect", and do not consider measurement errors in our analysis. In a more detailed analysis for the purposes of peer-reviewed publication, some of these underlying assumptions ought to be examined in more detail.
One assumption which we discussed in more detail concerns the normality of the data for performing a linear regression. While the assumptions underlying the linear regression model were not strictly met, this model still demonstrated predictive power with this dataset. It was fortunate that the climate variables were less correlated with each other than with the pollution indicators, which suggests that these were reasonable predictors to use in the regression models.
The Random Forest Regression (RFR) method provides an attractive alternative to the OLS linear regression, because we saw how much the data did not meet the underlying assumptions of the linear regression. RFR does not rely on any underlying assumptions about the distribution of the data and for this reason is quite powerful with difficult-to-analyse datasets. Also, seeing that RFR yielded similar results to the OLS lends credibility to the use of the linear regression model in spite of the violation of its assumptions.
None of the models presented were able to predict the high peaks and extreme values in the observed pollution data. This is likely because we lack sufficient explanatory variables to describe the development of air pollution concentrations. Also, the time series used is quite short for predicting extreme events, which occur with low frequency. This is to be expected, as we are likely to need more information to be able to predict when inversions will occur. In addition, air pollutant concentrations depend on anthropogenic factors, particularly traffic levels, as well as atmospheric conditions.
## **Conclusion**
In this project we have considered air pollution data in comparison to four different atmospheric variables: temperature, wind, relative humidity and precipitation. We have found that air pollution is most correlated with temperature, followed by wind, and less correlated with precipitation and relative humidity. We have seen that the correlation with wind increases when applying a logarithmic transformation to the air pollution indicators and wind speeds. These variables (temperature, logarithm of wind, and in some cases precipitation) have been used as predictors for different models of air pollution. The models described include linear regressions, principal component analyses and random forest regression with a machine learning approach. The random forest regression performs best for predicting air pollution from climate variables, with a Pearson correlation of 0.8 between the observed data and the prediction. The linear regression models showed similar performance. Further investigations need to be made to derive a more comprehensive understanding of how climate variables interact with air pollutants.
## **References**
(1) World Health Organization: https://www.who.int/health-topics/air-pollution#tab=tab_1. Last accessed 15-November-2020.
(2) California Air Resources Board: https://ww2.arb.ca.gov/resources/inhalable-particulate-matter-and-health. Last accessed 15-November-2020.
(3) Norwegian Environment Agency: https://www.miljodirektoratet.no/globalassets/publikasjoner/m129/m129.pdf. Last accessed 15-November-2020.