# Modelling Business Space Occupancy
Useful reference: [ways to import CSV files in Google Colab](https://www.geeksforgeeks.org/ways-to-import-csv-files-in-google-colab/)

The team and our skillsets:
| Name | Specialities |
| -------- | -------- |
| Ellie | Data Analysis |
| Bea | Data Analysis, Machine Learning |
| Aldi | Stochastic Analysis |
| Santi | Game Theory, Networks, ML |
| Giulia | Numerical #stuff |
Here are the relevant links:
* [Google drive](https://drive.google.com/drive/folders/1xz9hn_Ex4lOIxfe5UbZ04nyQnrsa_ZTy)
* [Google colab](https://colab.research.google.com/drive/1Jg5LPudsLjbXD2sqhCcMoC0JJRxiA__M)
* [Overleaf](https://www.overleaf.com/project/62bec48688a16fb83d80db23)
## Tuesday
### Initial to do list
1. Clean data (remove repeating group labels)
2. Create some simple plots
    * pie chart of building types
    * building type vs. rate
    * building type vs. distance from city centre
    * distance from city centre vs. rate
    * histogram of length of occupancy
3. Group property categories together
    * Shops
    * Cafes + pubs + restaurants
    * Offices
    * Factories + warehouses + workshops
    * Other (things we aren't interested in, at least for now...)
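The grouping in step 3 could be sketched roughly as follows. This is only an illustration: the keyword sets and the example description strings are assumptions, and the real labels in the ratings data may need different keywords. Matching is done on whole words so that "WORKSHOP" is not mistaken for "SHOP".

```python
import re
from collections import Counter

# Hypothetical keyword sets: adjust to the description strings actually in the data.
CATEGORY_KEYWORDS = {
    "Shops": {"SHOP", "SHOPS"},
    "Cafes, pubs and restaurants": {"CAFE", "PUB", "RESTAURANT"},
    "Offices": {"OFFICE", "OFFICES"},
    "Factories, warehouses and workshops": {"FACTORY", "WAREHOUSE", "WORKSHOP"},
}


def group_category(description):
    """Map a raw property description to one of the grouped categories."""
    # Split into whole upper-case words so "WORKSHOP" does not match "SHOP".
    words = set(re.findall(r"[A-Z]+", description.upper()))
    for group, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return group
    return "Other"


# Made-up example descriptions, for illustration only.
descriptions = [
    "SHOP AND PREMISES",
    "Restaurant and premises",
    "WORKSHOP AND PREMISES",
    "CAR PARK AND PREMISES",
    "SHOP AND PREMISES",
]
counts = Counter(group_category(d) for d in descriptions)
```

The resulting `counts` can feed directly into the pie chart of building types.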
### Interpreting the data as a graph
We can represent the properties in Luton as nodes of a graph and connect two nodes whenever the Euclidean distance between them is less than a threshold x (we can vary x and see how it affects the graph). Each node will be encoded as occupied/unoccupied, and can additionally be encoded by property type.
*Hypothesis:* When a property/node becomes unoccupied, it has a knock-on effect on nearby properties.
To simplify things, we can start by looking at just a retail graph, i.e. shops + restaurants.
First thing is to create the graph. We can simplify this process using [GriSPy](https://grispy.readthedocs.io/en/latest/).
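As a minimal sketch of the construction described above, here is a brute-force version using only the standard library rather than GriSPy. The function name `build_distance_graph` and the coordinates are made up for illustration; for the full dataset a spatial index such as GriSPy would scale much better than checking every pair.

```python
from itertools import combinations
from math import dist


def build_distance_graph(coords, x):
    """Return an adjacency dict linking nodes whose Euclidean distance is < x."""
    adjacency = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if dist(coords[i], coords[j]) < x:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


# Made-up planar coordinates in metres, for illustration only.
coords = [(0.0, 0.0), (30.0, 40.0), (200.0, 0.0), (210.0, 0.0)]
graph = build_distance_graph(coords, x=60.0)
```

Rerunning with a larger `x` (for example 250.0) connects all four points, which is one way to see how the threshold changes the graph.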
### Occupied or Empty Sites
|  |
|:--:|
| <b> Map of empty (red) and occupied (blue) businesses. </b>|
||
|:--:|
| <b> Map of empty (red) and occupied (blue) businesses, combined into one map. </b>|
### Location by business type

### Histograms with the new categories
## Wednesday
### Graph
Working on embedding the graph onto a map.

* Thinking we can use this for feature mining using the graph structure.
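One possible graph-derived feature, shown here purely as an example and not necessarily the one we end up using, is the fraction of a node's neighbours that are vacant. The sketch assumes an adjacency dict and an occupancy dict; both names and the toy values are hypothetical.

```python
def vacant_neighbour_fraction(adjacency, occupied):
    """For each node, return the fraction of its neighbours that are vacant."""
    features = {}
    for node, neighbours in adjacency.items():
        if not neighbours:
            # Isolated nodes have no neighbours; use 0.0 by convention.
            features[node] = 0.0
        else:
            vacant = sum(1 for n in neighbours if not occupied[n])
            features[node] = vacant / len(neighbours)
    return features


# Toy graph and occupancy flags, for illustration only.
adjacency = {0: {1, 2}, 1: {0}, 2: {0}, 3: set()}
occupied = {0: True, 1: False, 2: True, 3: False}
features = vacant_neighbour_fraction(adjacency, occupied)
```

A feature like this would let a classifier pick up the knock-on effect from the hypothesis, if it exists.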
### Random Forest
Built a random forest classifier. The input features are:
* Business type (SHOP AND PREMISES, RESTAURANT AND PREMISES, OFFICE AND PREMISES, FACTORY AND PREMISES, OTHER)
* Rateable value
* Distance from city centre (city centre was chosen to be the town hall)
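The distance feature could be computed along these lines, assuming each property has a latitude and longitude (an assumption; the data may instead use eastings and northings, in which case plain Euclidean distance would do). The town hall coordinates below are approximate placeholders and should be checked before use.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
# Approximate, illustrative coordinates for the town hall: verify before use.
TOWN_HALL = (51.879, -0.420)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def distance_from_centre_km(lat, lon, centre=TOWN_HALL):
    """Distance of a property from the chosen city centre point."""
    return haversine_km(lat, lon, centre[0], centre[1])
```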
Example of a tree from the random forest:

*Results*
On the test set, we got an accuracy score of 0.72.
On the (bootstrapped) ensemble test set, we got an accuracy score of 0.76.
The confusion matrix:

The confusion matrices are for the first random forest tree and for the ensemble test.
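For reference, this is how an accuracy score and a confusion matrix relate to each other, sketched with the standard library. The labels and predictions are made-up toy values, not our results.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels, for illustration only.
y_true = ["occupied", "occupied", "empty", "empty", "occupied"]
y_pred = ["occupied", "empty", "empty", "occupied", "occupied"]
matrix = confusion_matrix(y_true, y_pred, labels=["occupied", "empty"])
score = accuracy(matrix)
```

The off-diagonal entries show which kind of mistake dominates, which a single accuracy number hides.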
Decided to add <span style="color:blue">additional </span> features:
* Business type (SHOP AND PREMISES, RESTAURANT AND PREMISES, OFFICE AND PREMISES, FACTORY AND PREMISES, OTHER)
* Rateable value <span style="color:blue">(binned into range categories)</span>
* Distance from city centre (city centre was chosen to be the town hall)
* <span style="color:blue">Radial distance from city centre </span>
Example of the random forest tree with the <span style="color:blue">additional </span> features:

and the new confusion matrix:

### Including data from other sources
Map of Luton colour coded according to the deprivation index:

## Thursday
### Random Forest
Decided to add <span style="color:red">more additional </span> features:
* Business type (SHOP AND PREMISES, RESTAURANT AND PREMISES, OFFICE AND PREMISES, FACTORY AND PREMISES, OTHER)
* Rateable value (binned into quartiles)
* Distance from city centre (city centre was chosen to be the town hall)
* Radial distance from city centre
* <span style="color:red">Deprivation index </span>
* <span style="color:red">House price index </span>
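The quartile binning of rateable value listed above could look something like this. It is a sketch using the standard library; the function name `quartile_bins` and the values are hypothetical, and the cut points depend on the quantile method chosen (here the default of `statistics.quantiles`).

```python
from bisect import bisect_right
from statistics import quantiles


def quartile_bins(values):
    """Assign each value a quartile index 0..3 and return the cut points."""
    cuts = quantiles(values, n=4)  # three cut points between the quartiles
    bins = [bisect_right(cuts, v) for v in values]
    return bins, cuts


# Made-up rateable values in pounds, for illustration only.
rateable_values = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
bins, cuts = quartile_bins(rateable_values)
```

Binning this way makes the feature robust to the few very large rateable values, at the cost of losing resolution within each quartile.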
The businesses that were found to be more likely to become unoccupied are shown in the following figure:

We can see some clustering in the location of businesses of the same type. For example, four of the factories in this category (orange) are in the same area, to the west of the city centre along the river.
Note that all of the places classified as *vulnerable* were small businesses. They appear to be the most sensitive to the other features used in the model.

This map shows the vacant sites that our model predicts to have the highest probability of becoming occupied. Again, we can see patterns in their location. In this case, some of them are aligned along the city's main arteries, which marks these as the best places for future development.