# Contributors
Ann Smith a.smith@hud.ac.uk
CJ Clarke courtney.clarke@ucdconnect.ie
Jaeseong Jeong jsngjeong123@gmail.com
Kate Finucane kate.finucane@ucdconnect.ie
Noriszura Ismail ni@ukm.edu.my
Ruzanna Ab Razak ruzanna@ukm.edu.my
Thais Pacheco Menezes thais.pachecomenezes@ucdconnect.ie
Wenxuan Liu wenxuan.liu@ucdconnect.ie
Brady Metherall metherall@maths.ox.ac.uk
Yang Zhou yz3259@bath.ac.uk
Constantin Octavian Puiu constantin.puiu@maths.ox.ac.uk
Markus Ferdinand Dablander markus.dablander@maths.ox.ac.uk
William Lee w.lee@hud.ac.uk
# Problem Presenter
Joshua Ryan-Saha joshua.ryan-saha@ei.ed.ac.uk
Traveltech Scotland
# Problem Statement
## Forecasting Edinburgh's Tourism Demand
### Introduction
Overtourism and unpredictable visitor fluctuations have become pressing issues for Edinburgh's tourism industry. Managing resources amid volatile demand hampers sustainability efforts and strains local small and medium enterprises (SMEs). Forecasting models that draw on tourism data, economic trends, and environmental impact projections could enable data-driven planning and resilient, sustainable growth. We would like to develop an accessible mathematical model that provides key insights to support tourism SMEs and policymakers.
### Problem Definition
The core problem this modelling challenge aims to address is unpredictable tourism demand and its cascading effects on sustainability and local businesses. Current resource allocation and planning practices lack robust quantitative methods to forecast visitor volume and associated environmental impacts. This results in reactive responses, waste from under- or over-preparation, missed opportunities, and uncoordinated policies. SMEs in particular need targeted decision support to optimise operations based on demand signals.
### Objectives
The objectives are to develop a practical forecasting tool providing visitor volume and visitor segmentation projections. These could look at any or all of the following time horizons:
* 1 week in advance: Enables SMEs to fine-tune staffing levels, inventory orders, and operational capacity for the upcoming week.
* 3 months in advance: Supports SMEs' and destination organisations' marketing plans, seasonal capacity adjustments, and resource allocation for the quarter ahead.
* 1 year in advance: Allows destination organisations and SMEs to strategically budget, manage capital projects, and coordinate with policymakers for major events/seasons.

The model should leverage historical tourism data, economic indicators, and environmental impact projections to provide actionable insights. The primary aim is to equip SMEs across the tourism industry with accurate short-, medium- and long-term demand forecasts to improve financial planning, staffing, marketing and sustainable practices. Secondarily, the model should inform infrastructure development and policies supporting sustainable tourism growth.
# Literature Search
Tourism prediction models commonly rely on search intensity indices. In a recent study, Andariesta and Wasesa (2022) developed predictive models to forecast international arrivals in Indonesia using multisource Internet data from prominent platforms such as TripAdvisor and Google Trends. Their models incorporated query indices and historical records of tourist arrivals. The predictive models explored in their research were Random Forests (RF), Artificial Neural Networks (ANNs), and Support Vector Machines (SVMs). Among these, the Random Forest model demonstrated the highest prediction accuracy.
Another study, by Yu and Chen (2022), focused on constructing a tourism demand forecasting model. They developed the SAE-LSTM prediction model, which outperformed the conventional Long Short-Term Memory (LSTM) model on their dataset, comprising the monthly tourist volume of a specific city along with relevant influential factors.
To address the limitations of commonly used models such as LSTM and Recurrent Neural Networks (RNNs) for tourism demand prediction in South Korea, Kim et al. (2021) proposed a multi-head attention Convolutional Neural Network (MHAC). Their method aimed to enhance the accuracy and reliability of tourism demand forecasts for South Korea.
In summary, recent research in tourism prediction models has seen advancements in the incorporation of diverse data sources and the development of innovative techniques to achieve improved forecasting accuracy in various tourist destinations.
# Approaches
# Data Collation/ Data dictionary
## Useful data sources
* Google Trends
* Data Thistle
* Edinburgh Castle
* National Museum of Scotland
* Skyscanner
## Google Trends Data Clustering
Among the 232 search keywords, we classify those containing "edinburgh" into three categories.

**Accommodation**: keywords including "hotels", "accommodation", "airbnb", "inn", "travelodge", "trivago", "booking", "ryanair", "hilton". The keyword places.to.stay.in.edinburgh is also included.

**Tourism**: keywords including "castle", "hub", "museum", "zoo", "holyrood", "restaurant". The keywords things.to.do.in.edinburgh and edinburgh.christmas.market are also included.

**Transportation**: keywords including "car", "train", "flight", "skyscanner", "easyjet", "parking", "airport". The keyword london.to.edinburgh.UK is also included.
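The grouping above can be sketched in base R as follows. This is a minimal illustration rather than the code used in the study: the keyword vector is a made-up sample, only a subset of the patterns is listed, and `classify_keyword` is a hypothetical helper that assigns the first matching category.

```r
# Hypothetical sample; the study used 232 Google Trends keywords.
keywords <- c("edinburgh.hotels", "airbnb.edinburgh", "edinburgh.castle",
              "edinburgh.zoo", "edinburgh.airport", "train.to.edinburgh",
              "edinburgh.weather", "glasgow.hotels")

# Keep only the keywords that mention Edinburgh.
keywords <- keywords[grepl("edinburgh", keywords, ignore.case = TRUE)]

# Subset of the category patterns listed above.
patterns <- list(
  Accommodation  = c("hotels", "accommodation", "airbnb", "inn", "booking"),
  Tourism        = c("castle", "museum", "zoo", "holyrood", "restaurant"),
  Transportation = c("car", "train", "flight", "airport", "parking")
)

# Return the first category whose patterns match, or NA if none do.
classify_keyword <- function(kw) {
  for (category in names(patterns)) {
    regex <- paste(patterns[[category]], collapse = "|")
    if (grepl(regex, kw, ignore.case = TRUE)) return(category)
  }
  NA_character_
}

categories <- vapply(keywords, classify_keyword, character(1))
print(categories)
```

Because matching is by substring and the first match wins, short patterns such as "car" or "inn" can capture unrelated keywords, so the resulting groups are worth checking by eye.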

Fig. Number of events and number of visitors to castles and museums in Edinburgh.

Fig. Number of Google searches about transportation and accommodation in Edinburgh.

Fig. Number of Google searches about activities in Edinburgh and Skyscanner searches for flights to Edinburgh.
# EDA
## Skyscanner data
International tourism has been severely affected by numerous unpredictable events, which shows how vulnerable it is to negative shocks such as financial crises and COVID-19 lockdowns.

For our exploratory study, we use data from Skyscanner, a popular online travel agency and metasearch engine that helps users find and compare flight, hotel, and car rental deals. The Skyscanner dataset has 182,005 rows of flight data covering 6/5/2019 to 2/5/2022. The origin and destination airports are given as airport codes (e.g. EDI, ABZ, INV, GLA, PIK, DND). We map the codes to airport cities and group them by country.

The exploratory study gives the number of flight searches and the top 20 countries of origin for flights to Edinburgh. The top 5 countries are the UK, Spain, the USA, Italy and China. We hope these results offer some rough guidance for planning and designing tourism events and attractions, and for developing marketing strategies aimed at specific prospective tourists.

We summarize the data as monthly Skyscanner flight route search volume (by departure time) from 2019-01 to 2023-05.

Fig. Number of flight searches by year

Search volume is highest mainly in summer and declined dramatically during the pandemic. The monthly graph shows this more clearly.

Fig. Number of flight searches by month

Fig. Most searched departure airport
We classified the departure airports by country, using the airport code list at https://www.nationsonline.org/oneworld/IATA_Codes/airport_code_list.htm, and derived the number of searches by country.

Fig. Most searched departure countries.
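The monthly and per-country summaries described in this section can be sketched in base R. The data frame `flights`, its column names (`departure_date`, `origin`, `searches`) and the lookup table `airports` are hypothetical stand-ins for the Skyscanner extract and the airport code list, so this shows the shape of the computation, not the original analysis.

```r
# Hypothetical extract of the flight search data.
flights <- data.frame(
  departure_date = as.Date(c("2019-07-03", "2019-07-21", "2019-08-10", "2020-04-15")),
  origin         = c("LHR", "MAD", "JFK", "LHR"),
  searches       = c(120, 80, 95, 5)
)

# Hypothetical airport-to-country lookup table.
airports <- data.frame(
  origin  = c("LHR", "MAD", "JFK"),
  country = c("UK", "Spain", "USA")
)

# Monthly search volume by departure month.
flights$month <- format(flights$departure_date, "%Y-%m")
monthly <- aggregate(searches ~ month, data = flights, FUN = sum)

# Search volume by departure country, largest first.
by_country <- aggregate(searches ~ country,
                        data = merge(flights, airports, by = "origin"),
                        FUN = sum)
by_country <- by_country[order(-by_country$searches), ]

print(monthly)
print(by_country)
```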

We tried to correlate the Skyscanner data (2019-2023) with the Google Trends data (2017-2021), but meaningful results are hard to expect because the two datasets overlap only during the pandemic period (2019-2021).
## Google Trends and Edinburgh Castle
We grouped 160 Edinburgh-related Google Trends series from 2017 to 2021, for the UK and worldwide, into three categories: Accommodation, Transportation and Tourism (see also Data Collation).

First, we look at the trends in the UK. The following figure shows the correlations between Google Trends (UK) and visitors.

Fig. Correlations of Google Trends (UK) and Visitors to Edinburgh Castle

The correlation coefficients between Google Trends (UK) and visitors are moderate or stronger (>0.50).

Fig. Correlations of Google Trends (World) and Visitors to Edinburgh Castle

The correlation coefficients between Google Trends (World) and visitors are also moderate or stronger (>0.50).

A line plot lets us check how closely search trends track actual castle visitors.

Fig. Google Trends (World) and Visitors to Edinburgh Castle

## Airbnb
We look at the proportion of Airbnb listings that are available on each date.
Median availability is 36%.
Availability is particularly low (<20%) on the following dates: June 25th-29th 2019, July 5th-6th 2019, August 2nd-17th 2019 and August 23rd 2019. These coincide with the Modern Portrait event (June and July) and the Edinburgh Fringe Festival (August).

Fig. Proportion of available Airbnb listings.
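A minimal sketch of how a daily availability proportion and the low-availability dates could be computed is given below. The `calendar` data frame and its columns (`date`, `listing_id`, `available`) are assumptions made for illustration; the values are invented and do not reproduce the 36% median reported above.

```r
# Hypothetical calendar extract: one row per listing per date.
calendar <- data.frame(
  date       = as.Date(rep(c("2019-08-01", "2019-08-02", "2019-08-03"), each = 5)),
  listing_id = rep(1:5, times = 3),
  available  = c(TRUE,  TRUE,  FALSE, FALSE, FALSE,   # 2 of 5 available
                 FALSE, FALSE, FALSE, FALSE, FALSE,   # 0 of 5 available
                 TRUE,  FALSE, FALSE, FALSE, FALSE)   # 1 of 5 available
)

# Proportion of listings available on each date (mean of a logical vector).
daily <- aggregate(available ~ date, data = calendar, FUN = mean)
names(daily)[2] <- "prop_available"

# Median availability and the dates below the 20% threshold.
median_availability <- median(daily$prop_available)
low_dates <- daily$date[daily$prop_available < 0.20]

print(daily)
print(median_availability)
print(low_dates)
```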
# Time Series forecasting models
## Introduction
We focus on the number of visits to Edinburgh Castle because it is one of the main tourist attractions in Edinburgh.
We address the challenge of forecasting tourist visits when data are missing or incomplete during the COVID-19 era. We cover data preparation, data segmentation (pre-COVID and post-COVID), and time series models suited to forecasting tourist demand. We also provide the R programs used for data preparation and analysis.
## Data
We use visitor data from two sources: Edinburgh Castle weekly visits and the weekly Google Trends (GT) popularity index for Edinburgh Castle. The visits data give weekly aggregate footfall at Edinburgh Castle from 09/04/2010 to 19/03/2021. The GT indices are normalized and measure search interest relative to all searches ("popularity") for a given period and location; they cover 04/06/2017 to 24/04/2022.
## Results
Figures 1-2 show, respectively, the number of weekly visits to Edinburgh Castle (9/4/2010-19/3/2021) and the GT weekly indices for Edinburgh Castle (04/06/2017-24/04/2022). Before COVID, the number of visits follows a similar pattern every year, peaking around the 30th week (end of July).

Figure 1: Number of visits to Edinburgh Castle

Figure 2: GT indices for Edinburgh Castle
We then compute the change rate of each series using log differences. The change rates are shown in Figures 3-4. Overall, the change rates of visits and GT indices fluctuate considerably and were affected by extreme events such as the COVID-19 pandemic in 2020.
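For reference, the log-difference change rate of a weekly series $y_t$ can be written as below; the symbols $y_t$ and $r_t$ are our notation, not the original text's.

$$
r_t = \ln y_t - \ln y_{t-1} = \ln\left(\frac{y_t}{y_{t-1}}\right)
$$

For small changes, $r_t$ is close to the relative change $(y_t - y_{t-1})/y_{t-1}$. The logarithm is undefined for $y_t = 0$, so weeks with zero visits cannot be given a finite change rate.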

Figure 3: Change rate of visits (2010-2021)

Figure 4: Change rate of GT indices (2017-2022)
*R program for plots and calculating log differences*
```r
library(forecast)  # autoplot(), autolayer(), forecast(), accuracy()
library(ggplot2)   # xlab(), ylab()
## load data: weekly_ed_castle
ecweekly <- weekly_ed_castle                   # rename data
ecweeklyprecovid <- weekly_ed_castle[1:508, ]  # pre-COVID rows (before 3/1/2020)
ecweeklycovid <- weekly_ed_castle[509:572, ]   # COVID-era rows (3/1/2020 onwards)
# converting data to weekly time series
ecw.ts   <- ts(ecweekly$Visitors, start = c(2010, 4), frequency = 52)
ecwpc.ts <- ts(ecweeklyprecovid$Visitors, start = c(2010, 4), frequency = 52)
ecwc.ts  <- ts(ecweeklycovid$Visitors, start = c(2020, 1), frequency = 52)
autoplot(ecw.ts, xlab = "", ylab = "")             # ts plot of visits, figure 1
autoplot(diff(log(ecw.ts)), xlab = "", ylab = "")  # figure 3
d.ecwpc.ts <- diff(log(ecwpc.ts))                  # change rate for pre-COVID series
autoplot(d.ecwpc.ts, xlab = "", ylab = "")         # figure 5
d.ecwc.ts <- diff(log(ecwc.ts))                    # change rate for post-COVID series
autoplot(d.ecwc.ts, xlab = "", ylab = "")          # figure 6
## load data: Google_UK
# the column name contains a space, so it must be quoted with backticks
guk.ts <- ts(Google_UK$`Edinburgh castle`, start = c(2017, 6), frequency = 52)
autoplot(guk.ts, xlab = "", ylab = "")             # ts plot, figure 2
autoplot(diff(log(guk.ts)), xlab = "", ylab = "")  # figure 4
# matching the Google_UK search index and ecweekly over the same period
ecweekly2 <- ecweekly[374:572, ]    # 2017-06-02 until 2021-03-19
ecw2.ts <- ts(ecweekly2$Visitors, start = c(2017, 6), frequency = 52)
Google_UK2 <- Google_UK[1:199, ]    # 2017-06-04 until 2021-03-21
guk2.ts <- ts(Google_UK2$`Edinburgh castle`, start = c(2017, 6), frequency = 52)
d.ecw2.ts <- diff(log(ecw2.ts))     # change rate for visitors
d.guk2.ts <- diff(log(guk2.ts))     # change rate for GT indices
# time series plot of both series (figure 7)
autoplot(d.ecw2.ts) +
  autolayer(d.guk2.ts, series = "") +
  xlab("") +
  ylab("")
```
The change rates of visits in the pre-COVID (prior to 3/1/2020) and post-COVID (3/1/2020 onwards) years are shown in Figures 5-6. Figure 6 shows that the visits series has missing data (zero visits) because of the lockdown during COVID (Mar 2020 - Dec 2021). In contrast, Figure 4 shows that people continued to search Google for Edinburgh Castle during lockdown.

Figure 5: Change rate of visits (pre-covid)

Figure 6: Change rate of visits (post-covid)
We compare the change rates of visits and GT indices over the same period (June 2017-Mar 2021) in Figure 7. The red line represents GT indices and the black line represents visitors. The change rates of the two series are similar: they peak and reach their lowest levels around the same weeks of the year.

Figure 7: Change rate of visits (black line) and GT indices (red line) in 2017-2021
The summary statistics for both series are shown in Table 1. They are computed on the pre-COVID part of the common period (June 2017 to December 2019, 134 weekly change rates). Negative skewness in the GT indices (-0.756) indicates a longer left tail, i.e. occasional large decreases, while the skewness of visits (0.019) is close to zero, i.e. the distribution is nearly symmetric. Positive excess kurtosis in both series (4.383 for visits and 1.365 for GT indices), also known as a leptokurtic or "heavy-tailed" distribution, indicates more extreme values than under a normal distribution. The Jarque-Bera test rejects the null hypothesis of normality for both series.
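For reference, the Jarque-Bera statistic combines sample skewness and excess kurtosis as shown below. The notation ($n$ observations, skewness $S$, excess kurtosis $K$) is ours, and small differences from reported values can arise because software packages use slightly different sample estimators of $S$ and $K$.

$$
JB = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right)
$$

Under the null hypothesis of normality, $JB$ is asymptotically $\chi^2$-distributed with 2 degrees of freedom, so large values lead to rejection.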
Table 1: Summary statistics and correlation
| Detail         | Visits     | Search index |
| -------------- | ----------:| ------------:|
| Mean           | -0.0002    | 0.0014       |
| Median         | 0.0170     | 0.0274       |
| Minimum        | -1.0190    | -0.6614      |
| Maximum        | 0.9007     | 0.2877       |
| Std dev        | 0.2529     | 0.1579       |
| Skewness       | 0.019      | -0.756       |
| Kurtosis       | 4.3832     | 1.3646       |
| Jarque-Bera    | 112.8      | 24.5         |
| p-value        | < 2.20e-16 | 4.87e-06     |
| # observations | 134        | 134          |
| Correlation    | 0.44       | -            |
We also calculate the Pearson correlation between the two time series, which is reported in Table 1. The correlation is positive (0.44) with a p-value that rounds to 0.00, indicating that the two series move together. One may perceive Google search as ‘people who dream to visit Edinburgh Castle and, due to lockdown, are unable to do so’. Our results suggest that the search data may be used as a ‘proxy’ for the visits data, which has zero visits (missing values) during lockdown.
*R program for summary statistics and Pearson correlation*
```r
## continue from the previous program
library(e1071)   # skewness(), kurtosis()
library(tseries) # jarque.bera.test()
library(psych)   # corr.test()
ecweekly3 <- ecweekly2[1:135, ]    # pre-COVID data
Google_UK3 <- Google_UK2[1:135, ]  # pre-COVID data
d.ecweekly3 <- diff(log(ecweekly3$Visitors))             # change rate
d.Google_UK3 <- diff(log(Google_UK3$`Edinburgh castle`)) # change rate
# descriptive statistics
summary(d.ecweekly3)
summary(d.Google_UK3)
sd(d.ecweekly3)        # std dev
sd(d.Google_UK3)
skewness(d.ecweekly3)  # package e1071
skewness(d.Google_UK3)
kurtosis(d.ecweekly3)  # package e1071 (excess kurtosis)
kurtosis(d.Google_UK3)
jarque.bera.test(d.ecweekly3)  # package tseries
jarque.bera.test(d.Google_UK3)
corr.test(d.ecweekly3, d.Google_UK3)  # package psych
```
## Forecasting Visits to Edinburgh Castle using Holt-Winters Model
The Holt-Winters model is suitable for time series with trend and seasonality, where the seasonality effect is relatively constant across different time periods and is independent of the series level and trend. We use the Holt-Winters model for forecasting visits to Edinburgh Castle because it incorporates the trend and seasonality components. The model does not require an extensive historical dataset to generate forecasts, making it suitable for cases where limited historical data is available.
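For reference, the additive Holt-Winters recursions are given below in the form documented for R's `HoltWinters()`. The notation is standard rather than taken from the original text: $a_t$ is the level, $b_t$ the trend, $s_t$ the seasonal component, $p$ the seasonal period (52 for weekly data), and $\alpha$, $\beta$, $\gamma$ are the smoothing parameters.

$$
\begin{aligned}
a_t &= \alpha\,(y_t - s_{t-p}) + (1-\alpha)\,(a_{t-1} + b_{t-1})\\
b_t &= \beta\,(a_t - a_{t-1}) + (1-\beta)\,b_{t-1}\\
s_t &= \gamma\,(y_t - a_t) + (1-\gamma)\,s_{t-p}\\
\hat{y}_{t+h} &= a_t + h\,b_t + s_{t-p+1+(h-1) \bmod p}
\end{aligned}
$$

The seasonal term enters additively, which is why the model suits series whose seasonal swings do not grow with the level.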
We fit the model to the weekly visits to Edinburgh Castle for the period 9/4/2010 until 21/12/2018. The actual and forecast values were then compared over the test period 28/12/2018 until 27/12/2019.
Table 2 provides the accuracy measures. On the training set the MAPE is 12.55% and the MASE is 0.77; on the test set they rise to 19.64% and 1.85. We regard the Holt-Winters model as adequate for forecasting weekly visits to Edinburgh Castle, although a test-set MASE above 1 means the forecast errors are larger than the in-sample errors of the naive benchmark, so there is room for improvement.
Table 2: Accuracy measures for weekly visits
| Detail | RMSE | MAE | MAPE | MASE |
| ------------ | ------- | ------- | -----:| ----:|
| Training set | 4666.33 | 3202.62 | 12.55 | 0.77 |
| Test set | 9329.92 | 7717.65 | 19.64 | 1.85 |
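The four accuracy measures in Table 2 can be computed by hand as sketched below. The helper `accuracy_measures` and the toy numbers are ours and purely illustrative; the MASE scaling by the in-sample mean absolute error of a (seasonal) naive forecast is assumed to mirror what `forecast::accuracy()` reports.

```r
# Hand-rolled accuracy measures for a forecast against held-out data.
# `period` is the lag of the naive benchmark (1 = naive, 52 = seasonal naive for weekly data).
accuracy_measures <- function(actual, forecast, train, period = 1) {
  err <- actual - forecast
  # In-sample mean absolute error of the naive forecast, used to scale MASE.
  naive_mae <- mean(abs(diff(train, lag = period)))
  c(RMSE = sqrt(mean(err^2)),
    MAE  = mean(abs(err)),
    MAPE = 100 * mean(abs(err / actual)),
    MASE = mean(abs(err)) / naive_mae)
}

# Toy numbers, not the castle data.
train    <- c(100, 120, 110, 130, 125)
actual   <- c(140, 135)
forecast <- c(130, 140)

measures <- accuracy_measures(actual, forecast, train)
print(round(measures, 3))
```

A MASE below 1 means the forecast beats the naive benchmark's average in-sample error; a value above 1 means it does worse.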
Figure 8 shows the actual vs. forecast series where the black line represents the actual data while the red line represents the forecast data. The plots show small differences between actual and forecast data.
We would like to highlight here that this is not a perfect model. We can also consider several time series models such as ARIMA and SARIMA, depending on the trend and seasonality of the data. Alternative methods using AI (Artificial Intelligence) can also be considered such as ANN (Artificial Neural Network) and SVM (Support Vector Machine).

Figure 8: Actual vs forecast series (2010-2020)
*R program for Holt-Winters model*
```r
# continuation from previous program
library(forecast)  # forecast(), accuracy(), autoplot(), autolayer()
library(ggplot2)   # xlab(), ylab()
# training data: 9/4/2010 until 21/12/2018
ecwpc.tstr <- ts(ecweeklyprecovid$Visitors[1:455], start = c(2010, 4), frequency = 52)
# test data: 28/12/2018 until 27/12/2019
ecwpc.tste <- ts(ecweeklyprecovid$Visitors[456:508], start = c(2018, 43), frequency = 52)
fit1 <- HoltWinters(ecwpc.tstr, gamma = FALSE)  # with trend, without seasonal
fit1
fit2 <- HoltWinters(ecwpc.tstr, gamma = TRUE)   # with trend & seasonal
fit2
forecastfit1 <- forecast(fit1, h = 52)
forecastfit2 <- forecast(fit2, h = 52)
# residual check: points should look scattered, with no visible trend
plot(as.vector(forecastfit2$residuals[53:455]), xlab = "", ylab = "")
accuracy(forecastfit1, ecwpc.tste)  # forecast evaluation
accuracy(forecastfit2, ecwpc.tste)
# actual vs forecast series (figure 8)
autoplot(ecwpc.ts) +
  autolayer(forecastfit2, series = "", PI = FALSE) +
  xlab("") +
  ylab("")
```
# Regression Models (none)
# Graph-based Dynamic Model of Tourist Flow and Clogging
## Basic Model
$$
\begin{aligned}
x_1 + \dots + x_n + s &= m \in \mathbb{N}\\
\frac{dx_i}{dt} &= \mu_{si}s-\mu_{is} x_i\\
\frac{ds}{dt} &= \sum_{i = 1}^{n} \left(\mu_{is}x_{i} - \mu_{si}s\right)
\end{aligned}
$$
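As a consistency check on the basic model, the total $m$ is conserved provided the sum in the equation for $s$ runs over both terms. The interpretation is our assumption, since the symbols are not defined in the text: $x_i$ is read as the number of tourists at node $i$, $s$ as the number in the shared pool outside the nodes, and $\mu_{si}$, $\mu_{is}$ as the transition rates into and out of node $i$.

$$
\frac{d}{dt}\left(\sum_{i=1}^{n} x_i + s\right)
= \sum_{i=1}^{n}\left(\mu_{si}s - \mu_{is}x_i\right) + \sum_{i=1}^{n}\left(\mu_{is}x_i - \mu_{si}s\right) = 0
$$

Setting $dx_i/dt = 0$ gives the steady state $x_i^* = (\mu_{si}/\mu_{is})\,s^*$, so each node's share is set by the ratio of its inflow and outflow rates.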
## With maximum capacities
$$
\begin{aligned}
x_1 + \dots + x_n + s &= m \in \mathbb{N}\\
\frac{dx_i}{dt} &= \mu_{si}\left(1-\frac{x_i}{c_i}\right)s-\mu_{is} x_i\\
\frac{ds}{dt} &= \sum_{i = 1}^{n} \left(\mu_{is}x_{i} - \mu_{si}\left(1-\frac{x_i}{c_i}\right)s\right)
\end{aligned}
$$
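A minimal forward-Euler simulation of the capacity-limited model is sketched below in base R. The function `simulate_flow`, the rates, capacities, step size and initial values are all hypothetical choices for illustration; the sum in the equation for $s$ is taken over both terms so that the total number of tourists is conserved.

```r
# Forward-Euler simulation of the capacity-limited flow model.
# x: tourists at each node, s: tourists in the shared pool (our reading of the symbols).
simulate_flow <- function(x0, s0, mu_in, mu_out, capacity, dt = 0.01, steps = 5000) {
  x <- x0
  s <- s0
  for (k in seq_len(steps)) {
    inflow  <- mu_in * (1 - x / capacity) * s  # mu_si * (1 - x_i / c_i) * s
    outflow <- mu_out * x                      # mu_is * x_i
    dx <- inflow - outflow
    ds <- sum(outflow - inflow)
    x <- x + dt * dx
    s <- s + dt * ds
  }
  list(x = x, s = s)
}

# Hypothetical parameters: two nodes, 1000 tourists starting in the pool.
capacity <- c(300, 500)
res <- simulate_flow(x0 = c(0, 0), s0 = 1000,
                     mu_in = c(0.5, 0.3), mu_out = c(0.2, 0.1),
                     capacity = capacity)

print(res$x)                # occupancy of each node at the end of the run
print(sum(res$x) + res$s)   # total number of tourists
```

The factor $(1 - x_i/c_i)$ shuts off the inflow as a node fills up, so occupancies started below capacity stay below it for a sufficiently small step size.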


## With maximum capacities and entrance areas
$$
\begin{aligned}
\frac{dx_i}{dt} &= \sum_{j \in \mathcal I_X}\mu_{ji} x_j\cdot\eta(\mu_{ji} x_j;\, \kappa_{ji}) - \tau_{x\to y}^{(i)}-\sum_{j \in \mathcal I_X}\mu_{ij}x_i\cdot\eta(\mu_{ij}x_i;\, \kappa_{ij})\\
\frac{dy_i}{dt} &= \mu_{x_i\to y_i}(c_i-y_i)x_i - \zeta_i y_i =:\tau_{x\to y}^{(i)}
\end{aligned}
$$
# Historic Cycle Scheme data
The Edinburgh Cycle Hire Scheme (ECHS) was launched by Serco in September 2018. The scheme ran for the contracted three-year period and closed in September 2021.
## Cycle Data X01 January 2019
The data give only the start and end points of each journey. They say nothing about multiple stops within a single hire, so they provide no evidence of any attractions visited along the way.

Fig. Cycle route start and end points joined by straight lines (no information on the route taken between them). Data file X01. Arrows indicate journey direction.

Fig. Heatmap indicating popularity of destinations (blue to red increasing popularity), January.
## Cycle Data X07 July 2019

Fig. Heatmap July 2019.
# Findings
* In the pre-COVID years, the number of visits to Edinburgh Castle follows a similar pattern every year, peaking around the 30th week (end of July).
* The change rates of the number of visits and of the Google indices are similar: they peak and reach their lowest levels around the same weeks of the year (pre-COVID and post-COVID years).
* One may perceive Google indices (Google search) as ‘people who dream to visit Edinburgh Castle and, due to lockdown, are unable to do so’. The search data may be used as a ‘proxy’ for visits data if we encounter the problem of limited or insufficient data in the future.
* The Holt-Winters model has a MAPE of 12.55% (training) and 19.64% (test) and a MASE of 0.77 (training) and 1.85 (test) for the weekly visits data. We regard it as an adequate model for forecasting when limited historical data is available.
* The flight search data provide information on prospective tourists' departure location. The results give the number of flight searches and the top 20 countries of origin for flights to Edinburgh. The top 5 countries are the UK, Spain, the USA, Italy and China.
* This exploratory study may offer some rough guidance for planning and designing tourism events and attractions, and for developing marketing strategies aimed at specific prospective tourists.
# Conclusions
# Further Work
## Modelling
## Data limitations: recommendations
# References
Andariesta, D. T. and Wasesa, M. "Machine learning models for predicting international tourist arrivals in Indonesia during the COVID-19 pandemic: a multisource Internet data approach." *Journal of Tourism Futures*, 2022.
Yu, N. and Chen, J. "Design of Machine Learning Algorithm for Tourism Demand Prediction." *Comput Math Methods Med.*, 2022.
Kim, D-K. et al. "A daily tourism demand prediction framework based on multi-head attention CNN: The case of the foreign entrant in South Korea." *2021 IEEE Symposium Series on Computational Intelligence (SSCI)*, 2021.
ECHS (18th September 2018 to 17th September 2021)
https://democracy.edinburgh.gov.uk/documents/s40133/7.5%20-%20Edinburgh%20Cycle%20Hire%20Scheme.pdf
https://www.visitscotland.com/places-to-go/edinburgh/things-to-do.
Choi, K.-H. and Kim, I. "Co-Movement between Tourist Arrivals of Inbound Tourism Markets in South Korea: Applying the Dynamic Copula Method Using Secondary Time Series Data." *Sustainability*, 13: 1283, 2021. https://doi.org/10.3390/su13031283
Petrevska, B. "Predicting tourism demand by ARIMA models." *Economic Research-Ekonomska Istraživanja*, 30(1): 939–950, 2017. https://doi.org/10.1080/1331677X.2017.1314822