---
title: "Areal Interpolation with **R**"
author: "V. H. Do, T. Laurent, A. Vanhems"
date: "April, 2020"
output:
  html_document:
    highlight: tango
    number_sections: yes
    toc: yes
    toc_depth: 2
    keep_tex: yes
---
<link href="markdown7.css" rel="stylesheet">
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(echo = TRUE)
opts_chunk$set(fig.align = "center",
               fig.width = 10,
               cache = FALSE,
               cache.lazy = FALSE)
options(kableExtra.latex.load_packages = FALSE)
```
Packages needed:
```{r, eval = F}
install.packages(c("sf", # spatial data structure
"tidyverse", # tidyverse universe
"gridExtra", # ggplot2 extra plot
"ggspatial", # plot fanzy map
"readxl", # import excel files
"corrplot", # correlation plot
"rpart", # regression tree
"rattle" # fanzy regression tree plot
)
)
```
```{r, message = F}
library(ggspatial)
library(readxl)
library(sf)
library(tidyverse)
library(corrplot)
library(rattle)
library(rpart)
```
# Introduction
## Motivations
We consider a practical problem in political science: explaining the results obtained by a party in a given election with socio-demographic covariates. The election results are given at the voting place level by the "Ministère de l'intérieur", whereas the socio-demographic covariates are provided by the French statistical institute INSEE at different levels: cells of different sizes (since 2019), the iris (a subdivision of the commune), the communes, etc., but not at the voting place level.
Hence, the INSEE data (the source) need to be expressed at the voting place level (the target). To do this, one needs statistical methods for estimating the data of interest at the target level, using information given at the source level and, potentially, auxiliary information given on control zones.
The aim of this chapter is to illustrate the methods proposed by Do et al. (2015) on electoral data. The code has been implemented with the **R** software (**R** Core Team, 2019) and most of the functions used come from the **sf** package (Pebesma, 2018).
## Context
We consider the Departemental election which occurred in 2015 (see https://fr.wikipedia.org/wiki/%C3%89lections_d%C3%A9partementales_fran%C3%A7aises_de_2015 for more information about that election). In a French election, voters vote in a polling place. The results of the election are released in open access by the Ministry of the Interior (source: https://www.data.gouv.fr/fr/datasets/elections-departementales-2015-resultats-par-bureaux-de-vote/) at the polling place level. In our study, the polling places are the target zones, denoted $\{T_t\}_{t=1,2,\ldots,T}$.
In Nguyen et al. (2018), the authors use Compositional Data Regression Analysis to explain the extreme right votes obtained in the 2015 Departemental election with socio-economic covariates. They used data provided by different sources of open data, like the French Statistical Institute (INSEE) or Open Street Map. To translate the data from a source level to a target level, the authors mainly used the areal weighting interpolation (DAW) method.
The aim of this chapter is to present the different interpolation methods used in Do et al. (2015) and to give recommendations for choosing one method rather than another. We consider the same election data as Nguyen et al. (2018). The objective is to find open data at different statistical levels and to estimate it at the target level, which in our case is the polling place level.
# The data
We focus in this chapter on the results obtained at the first round in the city of Toulouse. Indeed, in rural communes, most of the time, there is only one polling place per commune. In that case, as INSEE provides its data by commune, the sources and the targets coincide at the commune level. The problem of interpolating data occurs when sources and targets do not coincide, which is the case in big cities.
In urban areas, polling places are delimited so that each contains a limited number of voters. The delimitations follow the street layout. Except for elections, this geographical level is not used by any other public administration. That is why we need areal interpolation methods to estimate the data at the polling place level.
## Targets
Our targets are the polling places in the 2015 Departemental elections. To get the administrative boundaries of the polling places, we use the maps provided by the Cartelec project (http://cartelec.univ-rouen.fr/).
We focus here on the polling places in Toulouse, but our code could be used for any other region.
```{r}
# define the name of the city
city_name_chapter <- "Toulouse"
# import the data
bv <- read_sf("./data/BV_grandes_villes_2015/BV_grandes_villes_2015.shp")
# keep only the variables CODE, NOM and BUREAU
bv <- select(bv, CODE, NOM, BUREAU)
# extract the rows corresponding to the chosen city
bv_sample <- bv[grep(city_name_chapter, bv$NOM), ]
# recode the BUREAU variable so that it is consistent
# across the different databases
bv_sample$BUREAU <- sapply(strsplit(bv_sample$BUREAU, "_"), function(x) x[2])
bv_sample$BUREAU <- ifelse(nchar(bv_sample$BUREAU) == 4, bv_sample$BUREAU,
                           paste("0", bv_sample$BUREAU, sep = ""))
```
There are $T=256$ targets (voting places) in Toulouse. We can see below that their areas vary considerably.
```{r}
summary(st_area(bv_sample))
```
Indeed, the targets in the center are smaller than those in the periphery, which can be explained by the higher population density in the center. The smallest target has an area of $23~846~m^2$ and the largest of $816~842~m^2$. The latter lies in an industrial zone where the population density is low. We will see that one method can be more appropriate than another depending on the area of the targets.
The next step is to associate the "Departementales 2015" election results with this geographical dataset. We import the results directly from https://www.data.gouv.fr/fr/datasets/elections-departementales-2015-resultats-par-bureaux-de-vote/:
```{r, message = F}
link_bv <- "https://www.data.gouv.fr/s/resources/elections-departementales-2015-resultats-par-bureaux-de-vote/"
don_dep_2015 <- read_csv2(paste0(link_bv, "/20150925-104418/DP15_Bvot_T1T2.txt"))
# We clean up the geographic code of the communes:
don_dep_2015$CODGEO <- paste(don_dep_2015$CODDPT,
                             ifelse(nchar(don_dep_2015$CODSUBCOM) == 1,
                                    paste("00", don_dep_2015$CODSUBCOM, sep = ""),
                                    ifelse(nchar(don_dep_2015$CODSUBCOM) == 2,
                                           paste("0", don_dep_2015$CODSUBCOM, sep = ""),
                                           don_dep_2015$CODSUBCOM)), sep = "")
# We keep only the polling places of the chosen city:
sample_2015 <- don_dep_2015[intersect(grep(city_name_chapter, don_dep_2015$LIBSUBCOM),
                                      grep(city_name_chapter, don_dep_2015$LIBCAN)), ]
# We separate the results of round 1 and round 2
sample_2015_T1 <- filter(sample_2015, NUMTOUR == 1)
sample_2015_T2 <- filter(sample_2015, NUMTOUR == 2)
# We reshape the data:
sample_2015_T1_bv <- sample_2015_T1 %>%
  select(CODDPT, CODCAN, CODGEO, CODBURVOT, NBRINS, CODNUA, NBRVOIX) %>%
  pivot_wider(names_from = "CODNUA",
              values_from = "NBRVOIX")
# We check that the number of polling places in the chosen city is
# the same as in the previous dataset:
# nrow(sample_2015_T1_bv) == nrow(bv_sample)
# We rename some variables:
ind_BC <- grep("BC", names(sample_2015_T1_bv))
names(sample_2015_T1_bv)[ind_BC] <- sapply(strsplit(names(sample_2015_T1_bv)[ind_BC], "-"),
                                           function(x) paste(x, collapse = "_"))
# We now merge the geographical data with the election data.
bv_sample <- merge(bv_sample, sample_2015_T1_bv, by.x = "BUREAU", by.y = "CODBURVOT")
# We compute the total number of voters by summing the results obtained
# by each party represented in the chosen city:
bv_sample$nb_voters <- bv_sample %>%
  st_set_geometry(NULL) %>%
  select(contains("BC_")) %>%
  rowSums(na.rm = TRUE)
# We compute the percentage of radical right votes and the turnout
bv_sample <- bv_sample %>%
  mutate(taux_fn = BC_FN / nb_voters * 100,
         turnout = (1 - nb_voters / NBRINS) * 100)
# Define the CRS (Lambert-93)
st_crs(bv_sample) <- paste("+proj=lcc +lat_1=49 +lat_2=44 +lat_0=46.5 +lon_0=3",
                           "+x_0=700000 +y_0=6600000 +ellps=GRS80 +units=m +no_defs")
```
At this step, we have created the following variables, which can be represented on maps:
* **taux_fn**: the dependent variable,
```{r}
ggplot(data = bv_sample) +
labs(fill = "XR score") +
geom_sf(aes(fill = taux_fn)) +
annotation_scale(location = "bl", width_hint = 0.5)
```
* **turnout**: the turnout percentage.
```{r}
ggplot(data = bv_sample) +
labs(fill = "Turnout") +
geom_sf(aes(fill = turnout)) +
annotation_scale(location = "bl", width_hint = 0.5)
```
We also represent the scatter plot of the two variables; turnout seems to have a positive association with the extreme right score:
```{r}
ggplot(data = bv_sample) +
aes(x = turnout, y = taux_fn) +
geom_point() +
geom_smooth(method = "lm",
col = "red") +
labs(x = "Turnout", y = "XR score")
```
```{r, results='asis'}
stargazer::stargazer(round(t(sapply(st_drop_geometry(bv_sample[,
c("taux_fn", "turnout")]), function(x) c(mean(x), sd(x), min(x), max(x)))), 2))
```
We now present the sources of data, which are given at other geographical levels.
## 1st kind of sources: data at the cell level
Data at the cell level have been provided by INSEE for a few years. This technique consists of cutting the territory into tiles in order to disseminate statistical information at a weakly aggregated level.
The INSEE data are derived from the income tax files (source: https://www.insee.fr/fr/statistiques/4176281?sommaire=4176305). There are two different levels of cells.
* The finest level is the cell of dimension $200m\times 200m$. There are $S=2~027$ sources. The areas of the cells are all identical and equal to $40~000~m^2$. For confidentiality reasons, data may be noised in cells containing only a few inhabitants.
```{r, eval = F}
cell_200m <- read_sf("./data/Filosofi2015/Filosofi2015_carreaux_200m_metropole.shp")
# select the cells falling in Toulouse
cell_200m <- cell_200m %>%
filter(Depcom == "31555") %>%
select(-Id_carr1km, -Id_carr_n, -Id_car2010, -Groupe, -Depcom)
# change the crs
cell_200m <- st_transform(cell_200m, st_crs(bv_sample))
```
```{r grid_1, echo=FALSE, fig.cap="Sources at the finest level", out.width = '50%', fig.align = "center"}
knitr::include_graphics("./Figures/cell_tlse1.png")
```
* The second level is more aggregated, so that the sources have enough inhabitants to avoid confidentiality problems. The number of sources is in that case $S=591$ and the cells have different sizes (473 cells of size $200m\times200m$, 109 cells of size $1000m\times1000m$ and 9 cells of size $2000m\times2000m$):
```{r, eval = F}
cell_big <- read_sf("./data/Filosofi2015_naturel/Filosofi2015_carreaux_niveau_naturel_metropole.shp")
# change the crs
cell_big <- st_transform(cell_big, st_crs(bv_sample))
# keep only the cells which intersect the targets
ind_touches <- st_intersects(cell_big, bv_sample)
cell_big <- cell_big[lengths(ind_touches) > 0, ]
```
```{r grid_2, echo=FALSE, fig.cap="Sources at the aggregated level", out.width = '50%', fig.align = "center"}
knitr::include_graphics("./Figures/cell_tlse2.png")
```
```{r, echo = F}
load("cell_tlse.RData")
```
### Which level should we choose?
In theory, the finer the sources, the better. However, with the cells of dimension $200m\times 200m$, we know that the data have been noised in the cells with confidentiality issues. Moreover, in practice, sources given at a very detailed level are not always necessary. In this work, we consider the two kinds of sources (small cells and big cells) and compare the results obtained with each method.
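As a quick check, we can compare the number of sources and the distribution of their areas (in $m^2$) at the two levels:
```{r}
# number of sources at each level
c(small_cells = nrow(cell_200m), big_cells = nrow(cell_big))
# distribution of the source areas
summary(as.numeric(st_area(cell_200m)))
summary(as.numeric(st_area(cell_big)))
```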
**Remark 1:** note that the two datasets are not nested. The total area covered by the small cells is smaller than the total area covered by the big cells. For the small cells, the total area is equal to:
```{r}
sum(st_area(cell_200m))
```
whereas in the case of the big cells, it is equal to:
```{r}
sum(st_area(cell_big))
```
**Remark 2:** a similar observation can be made for the variables of interest. For example, the total number of inhabitants observed on the small cells is equal to:
```{r}
sum(cell_200m$Ind)
```
whereas in the case of the big cells, it is equal to:
```{r}
sum(cell_big$Ind)
```
### What should I do if the sources and targets are not nested?
In theory, the area defined by the sources should coincide with the area defined by the targets. In our example, the sources are not nested within the targets. Indeed, the zones defined by the targets concern only the city of Toulouse (voters are necessarily registered in a commune). This is not the case with the data provided by INSEE at the cell levels, where a cell is independent of administrative boundaries. In other terms, a cell can belong to one or several communes.
In practice, sources and targets are not necessarily nested, and we will see that this leads us to adapt our methods.
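A minimal sketch to quantify the mismatch, using *st_union()* and *st_difference()* on the objects loaded above (here for the big cells):
```{r, eval = F}
targets_union <- st_union(bv_sample)
sources_union <- st_union(cell_big)
# share of the source area not covered by any target
outside <- st_difference(sources_union, targets_union)
as.numeric(st_area(outside)) / as.numeric(st_area(sources_union))
# share of the target area not covered by any source
uncovered <- st_difference(targets_union, sources_union)
as.numeric(st_area(uncovered)) / as.numeric(st_area(targets_union))
```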
### Variables observed
The variables of interest provided by the INSEE are the following:
* **Ind**: Number of individuals
* **Men**: Number of households
* **Men_pauv**: Number of poor households
* **Men_prop**: Number of owner households
* **Men_coll**: Number of households in collective dwellings
* **Men_mais**: Number of house households
* **Log_ap90**: Number of dwellings built since 1990
* **Log_soc**: Number of social housing
* **Ind_0_3**: Number of individuals aged 0 to 3
* **Ind_4_5**: Number of individuals aged 4 to 5
* **Ind_6_10**: Number of individuals 6 to 10 years old
* **Ind_11_17**: Number of individuals aged 11 to 17
* **Ind_18_24**: Number of individuals aged 18 to 24
* **Ind_25_39**: Number of individuals aged 25 to 39
* **Ind_40_54**: Number of individuals aged 40 to 54
* **Ind_55_64**: Number of individuals aged 55 to 64
* **Ind_65_79**: Number of individuals aged 65 to 79
* **Ind_80p**: Number of individuals aged 80 or over
* **Ind_inc**: Number of individuals whose age is unknown
```{r, results='asis'}
stargazer::stargazer(round(t(sapply(st_drop_geometry(cell_200m[, c("Ind", "Men", "Men_pauv", "Men_prop",
"Men_coll", "Men_mais", "Log_ap90", "Log_soc", "Ind_0_3", "Ind_4_5", "Ind_6_10",
"Ind_11_17", "Ind_18_24", "Ind_25_39", "Ind_40_54", "Ind_55_64", "Ind_65_79",
"Ind_80p", "Ind_inc")]), function(x) c(mean(x), sd(x), min(x), max(x)))), 0))
```
A particularity of these data is that all the variables are extensive. However, in the regression model used by Nguyen et al. (2018), most of the variables used to explain the extreme right vote are intensive, like the unemployment rate, the proportion of people who own assets, etc. Actually, most intensive variables can be seen as a ratio of two extensive variables.
Hence, in this work, to estimate the intensive variables at the target level, we will first estimate the extensive variables and then apply the ratios or rates to the estimates.
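For instance, if an intensive variable is the ratio $Z = Y^{(1)} / Y^{(2)}$ of two extensive variables, we estimate $\hat Y^{(1)}_t$ and $\hat Y^{(2)}_t$ on each target and then take $\hat Z_t = \hat Y^{(1)}_t / \hat Y^{(2)}_t$, rather than interpolating $Z$ directly.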
In practice, one could have to work on an intensive variable without any information on the extensive variables which define it. For pedagogical purposes, we create in the dataset two intensive variables and we will treat them as if we had no information on the extensive variables which define them. The intensive variables that we consider are:
* **prop_Ind_under_18**, the proportion of inhabitants under 18 years old, which is the ratio of the two extensive variables **Ind_0_17** (number of inhabitants under 18) and **Ind** (number of inhabitants).
* **pop_dens**, the population density, which corresponds to the number of inhabitants **Ind** divided by the area of the source **area** (computed below with *st_area()*).
Here, we create the variables **Ind_0_17**, **area**, **prop_Ind_under_18** and **pop_dens**:
```{r, eval = F}
cell_200m$area <- as.numeric(st_area(cell_200m)) / 1000 ^ 2
cell_big$area <- as.numeric(st_area(cell_big)) / 1000 ^ 2
cell_200m <- cell_200m %>%
mutate(
Ind_0_17 = (Ind_0_3 + Ind_4_5 + Ind_6_10 + Ind_11_17),
prop_Ind_under_18 = Ind_0_17 / Ind,
pop_dens = Ind / area)
cell_big <- cell_big %>%
mutate(
Ind_0_17 = (Ind_0_3 + Ind_4_5 + Ind_6_10 + Ind_11_17),
prop_Ind_under_18 = Ind_0_17 / Ind,
pop_dens = Ind / area)
```
Summary statistics:
```{r, results='asis'}
stargazer::stargazer(round(t(sapply(st_drop_geometry(cell_200m[,
c("prop_Ind_under_18", "pop_dens")]),
function(x) c(mean(x), sd(x), min(x), max(x)))), 2))
```
We define the labels of the extensive variables on the one hand and of the intensive variables on the other hand:
```{r}
var_extensive <- c("Men", "Men_pauv", "Men_1ind", "Men_5ind", "Men_prop",
"Men_fmp", "Ind_snv", "Men_surf", "Men_coll", "Men_mais",
"Log_av45", "Log_45_70", "Log_70_90", "Log_ap90", "Log_inc",
"Log_soc", "Ind_0_3", "Ind_4_5", "Ind_6_10", "Ind_11_17",
"Ind_18_24", "Ind_25_39", "Ind_40_54", "Ind_55_64", "Ind_65_79",
"Ind_80p", "Ind_inc", "Ind_0_17", "Ind", "area")
var_intensive <- c("prop_Ind_under_18", "pop_dens")
```
```{r, echo = F, eval = F}
save(cell_200m, cell_big, file = "cell_tlse.RData")
```
### Variables to estimate
Inspired by the work of Nguyen et al. (2018), and thanks to this dataset at the cell levels, our aim is to build the following covariates to explain the extreme right vote (a sketch of the derivation follows the list):
* population density (ratio of **Ind** to the area),
* percentage of individuals under 18 (sum of **Ind_0_3**, **Ind_4_5**, **Ind_6_10**, **Ind_11_17** divided by **Ind**),
* percentage of individuals between 18 and 40 (sum of **Ind_18_24**, **Ind_25_39** divided by **Ind**),
* percentage of individuals between 40 and 64 (sum of **Ind_40_54**, **Ind_55_64** divided by **Ind**),
* percentage of individuals aged 65 or over (sum of **Ind_65_79**, **Ind_80p** divided by **Ind**),
* percentage of poor households (ratio of **Men_pauv** to **Men**),
* percentage of owners (ratio of **Men_prop** to **Men**),
* percentage of recent dwellings (ratio of **Log_ap90** to the sum of **Men_coll** and **Men_mais**).
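A minimal sketch of the derivation, assuming target-level estimates of the extensive variables are available (for example the **_daw_cells** columns built in the DAW section below; the **prop_** names are illustrative):
```{r, eval = F}
bv_sample <- bv_sample %>%
  mutate(
    # hypothetical covariate names, derived as ratios of the estimates
    prop_18_40 = (Ind_18_24_daw_cells + Ind_25_39_daw_cells) / Ind_daw_cells,
    prop_40_64 = (Ind_40_54_daw_cells + Ind_55_64_daw_cells) / Ind_daw_cells,
    prop_65p   = (Ind_65_79_daw_cells + Ind_80p_daw_cells) / Ind_daw_cells,
    prop_poor  = Men_pauv_daw_cells / Men_daw_cells,
    prop_owner = Men_prop_daw_cells / Men_daw_cells
  )
```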
**Remark:** in this work, we illustrate the point-in-polygon (PIP) method, the areal weighting interpolation (DAW) method and the dasymetric method with auxiliary variable $X$ (DAX) on these two datasets at the cell levels.
## 2nd kind of sources: data at the iris level
The datasets at the cell level, especially the one with cells of size $200m\times200m$, are very interesting because the areas of the sources are much smaller than the areas of the targets: this case is favourable for areal interpolation, because it can be seen as an aggregation problem.
Unfortunately, these datasets contain only a few variables, and we would like to use additional covariates in our model to explain the extreme right score, like the unemployment rate or the proportion of foreign or immigrant people.
The finest level available in open access after the cell level is the iris level, which is a subdivision of the commune.
The data can be found here:
* Geographical boundaries: https://public.opendatasoft.com/explore/dataset/contours-iris-2015/export/?flg=fr&sort=iris&refine.nom_com=Toulouse
* dataset 1, which corresponds to the population in 2015: https://www.insee.fr/fr/statistiques/3627376#consulter
* dataset 2, which corresponds to the residents in 2015 (in particular, unemployment): https://www.insee.fr/fr/statistiques/2386631#consulter
We first import the data:
```{r}
# import the boundaries
iris <- read_sf("./data/iris/contours-iris-2015.geojson")
iris <- iris %>%
mutate(IRIS = paste0(insee_com, iris))
# Import the data 1
iris_don <- read_excel("./data/iris/base-ic-evol-struct-pop-2015.xls",
skip = 5)
# Import the data 2
iris_don_2 <- read_excel("./data/iris/base-ic-activite-residents-2013.xls",
skip = 5)
# Filter the data
iris_don <- iris_don %>%
filter(COM == "31555")
iris_don_2 <- iris_don_2 %>%
filter(COM == "31555") %>%
select(P13_CHOM1564, P13_ACT1564, IRIS)
# Merge polygons and data
iris <- merge(iris, iris_don, by = "IRIS")
iris <- merge(iris, iris_don_2, by = "IRIS")
# Change the CRS
iris <- st_transform(iris, st_crs(bv_sample))
```
The number of sources is in that case $S=153$. The distribution of the areas of the sources is:
```{r}
summary(st_area(iris))
```
```{r, results='asis'}
area_df <- rbind(
target = round(as.numeric(summary(st_area(bv_sample))), 0),
cell_200m = round(as.numeric(summary(st_area(cell_200m))), 0),
cell_big = round(as.numeric(summary(st_area(cell_big))), 0),
iris = round(as.numeric(summary(st_area(iris))), 0)
)
stargazer::stargazer(area_df)
```
In the following figure, we zoom in on the map, representing a few sources in red and the targets with dotted lines. It appears that the sources are generally bigger than the targets; in that case, the situation can be seen as a disaggregation problem. In other terms, the variables observed on the sources must first be split across the intersection zones and then aggregated to the targets.
```{r}
plot(st_geometry(iris[40:50, ]), lwd = 2, border = "red")
plot(st_geometry(iris), lwd = 2, border = "red", add = T)
plot(st_geometry(bv_sample),
add = T, col = scales::alpha("grey", alpha = 0.3), lty = 2)
```
The variables of interest in this dataset are:
* **C15_POP15P**: number of people (15 years or older)
* **C15_POP15P_CS1**: number of farmers (15 years or older)
* **C15_POP15P_CS2**: number of self-employed workers (15 years or older)
* **C15_POP15P_CS3**: number of highly qualified workers (15 years or older)
* **C15_POP15P_CS4**: number of intermediate occupations (15 years or older)
* **C15_POP15P_CS5**: number of lower supervisory and technical workers (15 years or older)
* **C15_POP15P_CS6**: number of manual workers (15 years or older)
* **C15_POP15P_CS7**: number of retired people (15 years or older)
* **C15_POP15P_CS8**: number of long-term unemployed (15 years or older)
* **P15_POP**: population count
* **P15_POP_IMM**: number of immigrants
These variables are all extensive.
```{r}
var_extensive_rp <- c("C15_POP15P", "C15_POP15P_CS1", "C15_POP15P_CS2",
"C15_POP15P_CS3", "C15_POP15P_CS4", "C15_POP15P_CS5",
"C15_POP15P_CS6",
"P15_POP", "P15_POP_IMM", "P13_CHOM1564",
"P13_ACT1564")
```
```{r, results='asis'}
stargazer::stargazer(round(t(sapply(st_drop_geometry(iris[, var_extensive_rp]),
function(x) c(mean(x), sd(x), min(x), max(x)))), 0))
```
Thanks to these variables, we aim to estimate on the targets the following variables, which we suspect to be related to the extreme right score:
* **prop_unemploy**: the ratio of **P13_CHOM1564** to **P13_ACT1564**,
* **prop_immi**: the ratio of **P15_POP_IMM** to **P15_POP**,
* **prop_csp_1**: the ratio of (**C15_POP15P_CS1** + **C15_POP15P_CS2**) to **P15_POP**,
* **prop_csp_2**: the ratio of (**C15_POP15P_CS3** + **C15_POP15P_CS4**) to **P15_POP**,
* **prop_csp_3**: the ratio of (**C15_POP15P_CS5** + **C15_POP15P_CS6**) to **P15_POP**.
**Remark:** in this work, we illustrate the dasymetric method with control zones on this dataset.
# Point-in-polygon method
In the point-in-polygon method, the sources are points: this is naturally the case when the sources are, for example, addresses. If the source is a polygon, it has to be replaced by a point; one could choose, for example, the centroid of the polygon. All the points falling in a target are aggregated to that target.
When the number of sources is larger than the number of targets, this method can be well suited, for two reasons:
* First, if a source is much smaller than a target, the probability that the source is completely included in a target is high. In other terms, a source is not shared between several targets, and it is natural that the source be absorbed by the target. Hence, the values observed for the variable of interest are fully assigned to the target.
* Second, this method is not costly from a computational point of view. Indeed, the operation needed here is the intersection of a point with a polygon, which is less demanding than the intersection between two polygons, which requires building new geometries (the intersection zones).
In the following figure, where we consider the small cells as the sources, we observe that the targets (solid lines) are bigger than the sources (dashed lines). The point-in-polygon method is usually appropriate for this kind of geometry.
```{r, echo = F, message = F, warning = F}
plot(st_geometry(bv_sample[195, ]))
plot(st_geometry(bv_sample), add = T)
plot(st_geometry(cell_200m), add = T, lty = 2)
plot(st_geometry(st_centroid(cell_200m)), add= T, col = "red")
```
To use this method, we reduce each cell to a point by taking its centroid (this is actually what is done by INSEE, which provides an **R** program for estimating the variables at different target levels). The function *st_centroid()* transforms the sources from cells to points and the function *st_intersects()* determines whether a source intersects a target.
## Extensive variables
For extensive variables, the aggregation simply consists in summing the values of the points falling in the same target. For the regular cells, it consists in executing this code:
```{r, eval = T, warning = F}
# convert the cell to points
cell_200m_as_points <- st_centroid(cell_200m)
# intersection of points and polygons
temp_inters_cell_200m <- st_intersects(bv_sample,
cell_200m_as_points, sparse = F)
# we detect the targets with no intersection
intersect_yes <- apply(temp_inters_cell_200m, 1, any)
# aggregate the data
for (i in which(intersect_yes)) {
bv_sample[i, paste0(var_extensive, "_pip_cells")] <-
apply(st_drop_geometry(cell_200m_as_points)
[temp_inters_cell_200m[i,], var_extensive], 2, sum)
}
# NA value for the targets with no values
bv_sample[!intersect_yes, paste0(var_extensive, "_pip_cells")] <- NA
```
We have computed the same estimates for the big cells; as the code is identical to the previous one, we do not print it.
```{r, eval = T, warning = F, echo = F}
# convert the cell to points
cell_big_as_points <- st_centroid(cell_big)
# intersection of points and polygons
temp_inters_big <- st_intersects(bv_sample, cell_big_as_points, sparse = F)
# we detect the targets with no intersection
intersect_yes <- apply(temp_inters_big, 1, any)
# aggregate the data
for (i in which(intersect_yes)) {
bv_sample[i, paste0(var_extensive, "_pip_big")] <-
apply(st_drop_geometry(cell_big_as_points)
[temp_inters_big[i,], var_extensive], 2, sum)
}
# NA value for the non intersected values
bv_sample[!intersect_yes, paste0(var_extensive, "_pip_big")] <- NA
```
### Border effects
It can happen that some points do not fall into any target due to border effects, as shown in the next figure. This may happen when the sources are not nested within the targets, which is the case in our data, especially for the big cells.
```{r}
plot(st_geometry(cell_big_as_points[1:5, ]))
plot(st_geometry(cell_big[1:5, ]), add = T)
plot(st_geometry(bv_sample), add = T)
```
In that case, there are two options for the point-in-polygon method:
* **exclude**: the user assumes that these sources should not be included in the targets because they lie outside the area covered by the targets. In that case, the sum of the extensive variables obtained on the targets will be lower than the sum of the extensive variables observed on the sources.
* **include**: the user assumes that every source should be associated with a target. One can use a nearest-neighbour algorithm to assign a source to its closest target.
In the second case, we proceed in two steps:
* identify the sources which do not fall into any target:
```{r}
temp_inters_mat <- st_intersects(cell_big_as_points, bv_sample, sparse = T)
ind_with_no_inter <- which(sapply(temp_inters_mat, length) == 0)
```
* use the function *st_nearest_feature()* from **sf** to detect the closest target and add the values to it:
```{r, message = F, warning =F}
bv_sample[, paste0(var_extensive, "_pip_big_2")] <- bv_sample[,
paste0(var_extensive, "_pip_big")]
for (k in ind_with_no_inter) {
ind_k <- st_nearest_feature(cell_big_as_points[k, ], bv_sample)
bv_sample[ind_k, paste0(var_extensive, "_pip_big_2")] <-
rowSums(cbind(
as.numeric(st_drop_geometry(bv_sample[ind_k, paste0(var_extensive, "_pip_big_2")])),
as.numeric(st_drop_geometry(cell_big_as_points)[k, var_extensive])
), na.rm = T)
}
```
We have done the same for the small cells, but we do not present the code as it is similar to the previous one.
```{r, message = F, echo = F, warning =F}
temp_inters_mat <- st_intersects(cell_200m_as_points, bv_sample, sparse = T)
ind_with_no_inter <- which(sapply(temp_inters_mat, length) == 0)
bv_sample[, paste0(var_extensive, "_pip_cells_2")] <- bv_sample[,
paste0(var_extensive, "_pip_cells")]
for (k in ind_with_no_inter) {
ind_k <- st_nearest_feature(cell_200m_as_points[k, ], bv_sample)
bv_sample[ind_k, paste0(var_extensive, "_pip_cells_2")] <-
rowSums(cbind(
as.numeric(st_drop_geometry(bv_sample[ind_k, paste0(var_extensive, "_pip_cells_2")])),
as.numeric(st_drop_geometry(cell_200m_as_points)[k, var_extensive])
), na.rm = T)
}
```
We compare the total number of inhabitants (variable **Ind**) on the sources and on the targets with respect to the two options above. With **exclude**, the estimated number of inhabitants is lower than the total number of inhabitants observed at the source level, whereas with **include**, all the inhabitants have been assigned to the targets.
For the small cells, the total number of individuals observed on the sources is equal to:
```{r}
sum(cell_200m$Ind)
```
The total number of individuals estimated on the targets is slightly lower when excluding (first value) and equals the total on the sources when including (second value):
```{r}
sum(bv_sample$Ind_pip_cells, na.rm = T)
sum(bv_sample$Ind_pip_cells_2, na.rm = T)
```
For the big cells, the total number of individuals observed on the sources is equal to:
```{r}
sum(cell_big$Ind)
```
Because the cells are bigger, the total number of individuals estimated on the targets is much lower when we exclude the sources whose centroids fall outside the targets.
```{r}
sum(bv_sample$Ind_pip_big, na.rm = T)
sum(bv_sample$Ind_pip_big_2, na.rm = T)
```
### Which option should I choose?
A motivation for including the values of all the sources in the targets is that the user wants to keep the total observed on the sources. It can be the case when the geometries of sources and targets differ but cover the same population of interest.
In our example, it is not justified, because the sources are independent of the communes. For example, a source shared between the city of Toulouse and a neighbouring commune contains the information relative to the people living in this area, independently of the commune. Hence, we keep the **"exclude"** option for estimating the covariates that will be used in our regression model.
### Comparison between small and big cells taken as sources
We compare the results obtained when considering the small cells or the big cells as sources. We keep the option which excludes the points falling outside the zone delimited by the targets. We focus on the number of inhabitants.
```{r, warning = F}
ggplot(bv_sample) +
aes(x = Ind_pip_cells, y = Ind_pip_big) +
geom_point() +
labs(x = "Estimates when Sources = Small cells", y = "Estimates when Sources = Big cells")
```
We also represent in the following maps the estimates when the sources are the small (resp. big) cells.
```{r}
ggplot(data = bv_sample) +
labs(title = "Estimates when Sources = small cells", fill = "Inhabitants") +
geom_sf(aes(fill = Ind_pip_cells)) +
annotation_scale(location = "bl", width_hint = 0.5)
ggplot(data = bv_sample) +
labs(title = "Estimates when Sources = big cells", fill = "Inhabitants") +
geom_sf(aes(fill = Ind_pip_big)) +
annotation_scale(location = "bl", width_hint = 0.5)
```
**Comments:** first, there are 57 targets (represented in grey on the map) which have not been estimated when considering the big cells as sources, whereas there are only two missing values when considering the small cells as sources. Indeed, when a source intersects several targets, only one target receives its values; in the case of the big cells, many targets therefore receive nothing. Secondly, we observe that even if some estimates are identical (the points located on the line $y=x$ in the scatter plot), the differences can be huge. For example, three targets have a value higher than $7~500$ inhabitants with the big cells as sources, whereas their values are smaller than $4~000$ with the small cells as sources.
### Are our estimates good or not?
Unfortunately, we are in an unsupervised situation: we are not able to compare our estimates with the true values. We could assume that the number of voters in a voting place should not be too far from the number of people aged 18 or over. However, in practice, this does not hold, because many people who have the right to vote do not register, and because people who vote in a polling place do not necessarily live there.
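As a rough and only indicative sanity check of this heuristic, one could compare the number of registered voters (**NBRINS**) with the estimated adult population on each target:
```{r, eval = F}
# heuristic check only: registration and residence do not match exactly
adults_est <- bv_sample$Ind_pip_cells - bv_sample$Ind_0_17_pip_cells
cor(bv_sample$NBRINS, adults_est, use = "complete.obs")
```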
## Intensive variables
### The extensive variables defining the intensive variable are known
Once the aggregation has been made on the extensive variables, if an intensive variable is defined as the ratio of two extensive variables, it is possible to take the ratio of the two estimates. For example, we apply the PIP method to the proportion of inhabitants under 18 (**prop_Ind_under_18**) and the population density (**pop_dens**):
```{r}
bv_sample <- bv_sample %>%
  mutate(
    prop_Ind_under_18_pip_cells = Ind_0_17_pip_cells / Ind_pip_cells,
    pop_dens_pip_cells = Ind_pip_cells / area_pip_cells,
    prop_Ind_under_18_pip_big = Ind_0_17_pip_big / Ind_pip_big,
    pop_dens_pip_big = Ind_pip_big / area_pip_big
  )
```
We represent the estimates of the two intensive variables computed when the sources correspond to the small cells.
```{r}
ggplot(data = bv_sample) +
geom_sf(aes(fill = pop_dens_pip_cells)) +
labs(fill = "Pop Dens") +
annotation_scale(location = "bl", width_hint = 0.5)
ggplot(data = bv_sample) +
geom_sf(aes(fill = prop_Ind_under_18_pip_cells)) +
annotation_scale(location = "bl", width_hint = 0.5) +
labs(fill = 'Pct < 18')
```
### The extensive variables defining the intensive variable are unknown
If the two extensive variables which define the intensive variable are unknown, one could estimate the intensive variable at the targets by taking the mean of the points falling in each target. Doing so assumes that each point has the same weight, which is rarely the case. For example, consider a first point with 1000 inhabitants and a proportion of inhabitants under 18 equal to $10\%$, and a second point with 10 inhabitants and a ratio equal to $50\%$: the estimate is $30\%$ without the weights and $10.4\%$ with them.
In the following code, we estimate the intensive variables without the weights and compare the results with those obtained previously. With no weights, the estimates differ.
In the case of the small cells taken as sources, even if the results are correlated, the estimates are slightly different.
```{r, warning = F}
# aggregate the data
for (i in 1:nrow(bv_sample)) {
bv_sample[i, paste0(var_intensive, "_pip_cells_bad")] <-
apply(st_drop_geometry(cell_200m)
[temp_inters_cell_200m[i,], var_intensive], 2, mean)
}
ggplot(bv_sample) +
aes(x = prop_Ind_under_18_pip_cells, y = prop_Ind_under_18_pip_cells_bad) +
geom_point() +
labs(title = "Estimates of people < 18 (sources = small cells)",
x = "extensive variables are known",
y = "extensive variables are unknown") +
geom_abline(intercept = 0, slope = 1)
```
For the big cells, there seem to be fewer differences. This could be due to the fact that the number of sources falling inside each target is lower than for the small cells. Indeed, suppose that exactly one source falls inside one target: in that case, the two methods would be equivalent.
```{r, warning = F}
# aggregate the data
for (i in 1:nrow(bv_sample)) {
bv_sample[i, paste0(var_intensive, "_pip_big_bad")] <-
apply(st_drop_geometry(cell_big)
[temp_inters_big[i, ], var_intensive], 2, mean)
}
ggplot(bv_sample) +
aes(x = prop_Ind_under_18_pip_big, y = prop_Ind_under_18_pip_big_bad) +
geom_point() +
labs(title = "Estimates of people < 18 (sources = Big cells)",
x = "extensive variables are known",
y = "extensive variables are unknown") +
geom_abline(intercept = 0, slope = 1)
```
## Limitations of the point-in-polygon method
The limitations of this method are illustrated in the two following figures:
* In the first figure, around $75\%$ of a source (represented in red) falls into the green target. However, as the centroid falls into the neighbouring target, the number of households (486 here) is assigned to the neighbour. In this example, the sizes of the targets and the sources are close; in that case, it is preferable to use another method, like DAW or DAX.
```{r, echo = F}
plot(st_geometry(bv_sample[1, ]))
plot(st_geometry(bv_sample), add = T)
plot(st_geometry(bv_sample[1, ]), col = "green", add = T)
plot(st_geometry(cell_big), add = T, lty = 2)
plot(st_geometry(cell_big[336, ]), add = T, border = "red")
plot(st_geometry(cell_big_as_points[336, ]), add = T, col = "red")
text(st_coordinates(cell_big_as_points)[336, 1],
st_coordinates(cell_big_as_points)[336, 2],
paste0("Household: ", as.numeric(st_drop_geometry(cell_big_as_points)[336, "Men"])),
pos = 3)
```
* In the second figure, we can see that the target in green gets no estimate, because no red point falls inside the green polygon. Once again, this is due to the fact that the big cells ($1km \times 1km$ and $2km \times 2km$) are bigger than the targets.
```{r, echo = F}
plot(st_geometry(bv_sample[54, ]))
plot(st_geometry(bv_sample), add = T)
plot(st_geometry(bv_sample[54, ]), col = "green", add = T)
plot(st_geometry(cell_big), add = T, lty = 2)
plot(st_geometry(cell_big_as_points), add = T, col = "red")
```
In our application, 57 (resp. 2) targets have not been estimated when using the big (resp. small) cells as sources.
```{r}
sum(is.na(bv_sample$pop_dens_pip_big))
```
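Similarly, for the small cells:
```{r}
sum(is.na(bv_sample$pop_dens_pip_cells))
```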
To conclude this section, the PIP method is efficient when the sources are much smaller than the targets, which is more or less the case in our example with the small cells taken as sources. Besides, the border problem should be taken into account when sources and targets do not coincide; in our example, we decided not to take into account the sources whose centroids do not fall into the targets. Finally, the method works for both extensive and intensive variables. If an intensive variable is defined from known extensive variables, it is better to estimate the extensive variables first and then recompute the intensive variable from the estimates.
# Areal weighting interpolation (DAW) method
## Extensive variables
Consider the previous example, where one source intersects at least two targets. Areal weighting interpolation consists in disaggregating the variables between the targets proportionally to the areas of the intersection zones. For example, if the number of households is equal to 486 in the source, $75\%$ of its area intersects target 1 and $25\%$ intersects target 2, then $486 \times 0.75$ households are assigned to target 1 and $486 \times 0.25$ households are assigned to target 2.
For extensive variables, $\hat Y_{s,t}=\frac{|A_{s,t}|}{|S_s|}Y_s$, and then $\hat Y_{t}=\sum_s\frac{|A_{s,t}|}{|S_s|}Y_s$.
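The toy computation for the example above:
```{r}
y_s <- 486           # households observed in the source
w   <- c(0.75, 0.25) # area shares of the two intersection zones
y_s * w              # households assigned to targets 1 and 2
```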
To do this computation, one needs the areas of the intersection zones $\{A_{s,t}\}$ between $\{S_s\}$ and $\{T_t\}$. We can use the function *st_intersection()* to build the intersection zones and *st_area()* to get their areas.
### Border effect issue
As for the point-in-polygon method, the border effect issue still occurs. For example, if a source intersects $30\%$ with target 1, $40\%$ with target 2, and $30\%$ with an empty zone, how should we disaggregate the variable $x=100$ inhabitants? There are two possibilities:
* **exclude**: 30 inhabitants are assigned to target 1, 40 inhabitants to target 2, and 30 inhabitants are not assigned. In that case, the sum of $x$ on the targets will be lower than the sum of $x$ on the sources.
* **include**: $\frac{30}{30+40} \times 100=42.86$ inhabitants are assigned to target 1 and $\frac{40}{30+40} \times 100 =57.14$ inhabitants to target 2.
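The two options on this example:
```{r}
x <- 100
shares <- c(target_1 = 0.30, target_2 = 0.40) # 30% falls in an empty zone
x * shares               # exclude: 30 and 40 inhabitants
x * shares / sum(shares) # include: 42.86 and 57.14 inhabitants
```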
In our illustration, we compare the two options. We present the code for the small cells taken as sources:
```{r, warning = F, echo = T}
# intersection of sources and targets
temp_inters_cell_200m <- st_intersection(bv_sample[, "BUREAU"],
cell_200m[, c(var_extensive, "IdINSPIRE")])
# compute the areas of the intersection zones (in km^2)
temp_inters_cell_200m$Area_intersect <-
as.numeric(st_area(temp_inters_cell_200m)) / 1000 ^ 2
# compute the sum of the intersected zones per source
sum_intersect_cell_200m <- temp_inters_cell_200m %>%
select(Area_intersect, IdINSPIRE) %>%
group_by(IdINSPIRE) %>%
summarise(sum_area = sum(Area_intersect))
# merge with the intersected zones
temp_inters_cell_200m <- merge(temp_inters_cell_200m,
st_drop_geometry(sum_intersect_cell_200m), by = "IdINSPIRE")
temp_inters_cell_200m$Area_share_1 <- temp_inters_cell_200m$Area_intersect /
temp_inters_cell_200m$area
temp_inters_cell_200m$Area_share_2 <- temp_inters_cell_200m$Area_intersect /
temp_inters_cell_200m$sum_area
# Multiply the variables by the proportions
# * for method 1
temp_inters_cell_200m[, paste0(var_extensive, "_1")] <-
sapply(st_drop_geometry(temp_inters_cell_200m[, var_extensive]),
function(x) x * temp_inters_cell_200m$Area_share_1)
# Aggregate the variables by target
temp_targets <- aggregate(temp_inters_cell_200m[, paste0(var_extensive, "_1")],
by = list(temp_inters_cell_200m$BUREAU),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename_at(vars(paste0(var_extensive, "_1")), ~
paste0(var_extensive, "_daw_cells"))
# Merge with the targets
bv_sample <- merge(bv_sample, st_drop_geometry(temp_targets),
by.x = "BUREAU", by.y = "Group.1")
# * for method 2
temp_inters_cell_200m[, paste0(var_extensive, "_2")] <- sapply(
st_drop_geometry(temp_inters_cell_200m[, var_extensive]),
function(x) x * temp_inters_cell_200m$Area_share_2)
# Aggregate the variables by target
temp_targets <- aggregate(temp_inters_cell_200m[, paste0(var_extensive, "_2")],
by = list(temp_inters_cell_200m$BUREAU),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename_at(vars(paste0(var_extensive, "_2")), ~
paste0(var_extensive, "_daw_cells_2"))
# Merge with the targets
bv_sample <- merge(bv_sample, st_drop_geometry(temp_targets),
by.x = "BUREAU", by.y = "Group.1")
```
We have done the same for the big cells, but we do not present the code here, as it is the same as the one shown previously.
```{r, warning = F, echo = F}
# intersection of sources and targets
temp_inters_big <- st_intersection(bv_sample[, "BUREAU"],
cell_big[, c(var_extensive, "Id_carr_n")])
# compute the areas of the intersection zones (in km^2)
temp_inters_big$Area_intersect <-
as.numeric(st_area(temp_inters_big)) / 1000 ^ 2
# compute the sum of the intersected zones per source
sum_intersect_big <- temp_inters_big %>%
select(Area_intersect, Id_carr_n) %>%
group_by(Id_carr_n) %>%
summarise(sum_area = sum(Area_intersect))
# merge with the intersected zones
temp_inters_big <- merge(temp_inters_big,
st_drop_geometry(sum_intersect_big), by = "Id_carr_n")
temp_inters_big$Area_share_1 <- temp_inters_big$Area_intersect / temp_inters_big$area
temp_inters_big$Area_share_2 <- temp_inters_big$Area_intersect / temp_inters_big$sum_area
# Multiply the variables by the proportions
# * for method "exclude"
temp_inters_big[, paste0(var_extensive, "_1")] <-
sapply(st_drop_geometry(temp_inters_big[, var_extensive]),
function(x) x * temp_inters_big$Area_share_1)
# Aggregate the variables by target
temp_targets <- aggregate(temp_inters_big[, paste0(var_extensive, "_1")],
by = list(temp_inters_big$BUREAU),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename_at(vars(paste0(var_extensive, "_1")), ~ paste0(var_extensive, "_daw_big"))
# Merge with the targets
bv_sample <- merge(bv_sample, st_drop_geometry(temp_targets),
by.x = "BUREAU", by.y = "Group.1")
# * for method "include"
temp_inters_big[, paste0(var_extensive, "_2")] <- sapply(st_drop_geometry(
temp_inters_big[, var_extensive]),
function(x) x * temp_inters_big$Area_share_2)
# Aggregate the variables by target
temp_targets <- aggregate(temp_inters_big[, paste0(var_extensive, "_2")],
by = list(temp_inters_big$BUREAU),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename_at(vars(paste0(var_extensive, "_2")), ~ paste0(var_extensive, "_daw_big_2"))
# Merge with the targets
bv_sample <- merge(bv_sample, st_drop_geometry(temp_targets), by.x = "BUREAU", by.y = "Group.1")
```
### Comparison between the two sources of data
We now compare the DAW results with respect to the choice of the sources (small cells or big cells). A few observations lie on the line $y=x$, but we observe differences between the two sets of estimates. As for the PIP method, we recommend using, if possible, the sources with the most detailed geographical level.
```{r}
ggplot(bv_sample) +
aes(x = Men_daw_cells, y = Men_daw_big) +
geom_point() +
labs(title = "Estimates of the number of households",
x = "Sources = small cells",
y = "Sources = big cells") +
geom_abline(intercept = 0, slope = 1)
ggplot(bv_sample) +
aes(x = Log_soc_daw_cells, y = Log_soc_daw_big) +
geom_point() +
labs(title = "Estimates of the number of social accommodation",
x = "Sources = small cells",
y = "Sources = big cells") +
geom_abline(intercept = 0, slope = 1)
```
## Comparison between PIP and DAW
We compare the results obtained with point-in-polygon and with DAW on the number of households. First, we present the figure in the case where the big cells are taken as sources.
```{r, warning = F}
ggplot(bv_sample) +
aes(x = Men_pip_big, y = Men_daw_big) +
labs(title = "Households estimates (sources = big cells)",
x = "Method = PIP",
y = "Method = DAW") +
geom_point() +
geom_abline(intercept = 0, slope = 1)
```
It appears that there are huge differences. The DAW method corrects the approximation made by the PIP method (one source is assigned to one and only one target).
We now look at the differences in the case of the small cells taken as sources:
```{r, warning = F}
ggplot(bv_sample) +
aes(x = Men_pip_cells, y = Men_daw_cells) +
labs(title = "Households estimates (sources = small cells)",
x = "Method = PIP",
y = "Method = DAW") +
geom_point() +
geom_abline(intercept = 0, slope = 1)
```
In the case of small cells, the differences are less obvious, because the approximation made by the PIP method (one source is assigned to one and only one target) has less impact when the sources are smaller than the targets: a source is then more likely to fall entirely inside one target. In other terms, the smaller the sources are relative to the targets, the more similar the PIP and DAW estimates should be.
**Remark**: the DAW method has been implemented by Pebesma (2018) in the function *st_interpolate_aw()*, which can be used as follows. The scatter plot shows that it coincides with what we have done in the case where we exclude the non-intersected zones. However, *st_interpolate_aw()* does not keep the identification of the sources and targets which define the intersection zones.
```{r, warning = F}
a1 <- st_interpolate_aw(cell_200m["Men"], bv_sample, extensive = TRUE)
plot(a1$Men, bv_sample$Men_daw_cells,
xlab = "st_interpolate_aw() function",
ylab = "R codes presented in this chapter")
```
## Intensive variables
### Extensive variables are known
If the extensive variables which define the intensive variables are known, the DAW method is first applied to the extensive variables to get the estimates at the target level, and the intensive variables are then re-created from the estimates. For example, we apply the DAW method to the proportion of inhabitants under 18 (**prop_Ind_under_18**) and the population density (**pop_dens**):
```{r}
bv_sample <- bv_sample %>%
mutate(
prop_Ind_under_18_daw_cells = Ind_0_17_daw_cells / Ind_daw_cells,
pop_dens_daw_cells = Ind_daw_cells / area_daw_cells,
prop_Ind_under_18_daw_big = Ind_0_17_daw_big / Ind_daw_big,
pop_dens_daw_big = Ind_daw_big / area_daw_big
)
```
We represent the estimates of the two variables by using the small cells as sources.
```{r}
ggplot(data = bv_sample) +
geom_sf(aes(fill = pop_dens_daw_cells)) +
labs(fill = "Pop Dens") +
annotation_scale(location = "bl", width_hint = 0.5)
ggplot(data = bv_sample) +
geom_sf(aes(fill = prop_Ind_under_18_daw_cells)) +
annotation_scale(location = "bl", width_hint = 0.5) +
labs(fill = 'Age < 18')
```
### Extensive variables are unknown
If the extensive variables are not known, we suppose that $\hat Y_{s,t}=Y_s$. Then $\hat Y_t=\sum_s \frac{|A_{s,t}|}{|T_t|}Y_s$.
```{r, message = F, warning = F, echo = T}
# intersection of sources and targets
temp_inters_cells <- st_intersection(bv_sample[, "BUREAU"],
cell_200m[, c(var_intensive, "area", "IdINSPIRE")])
# compute the areas of the intersection zones (in km^2)
temp_inters_cells$Area_intersect <- as.numeric(st_area(temp_inters_cells)) /
1000 ^ 2
# compute the area of each target
sum_targets <- bv_sample %>%
mutate(sum_area = as.numeric(st_area(bv_sample)) / 1000 ^ 2)
# merge with the intersected zones
temp_inters_cells <- merge(temp_inters_cells,
st_drop_geometry(sum_targets), by = "BUREAU")
temp_inters_cells$Area_share <- temp_inters_cells$Area_intersect /
temp_inters_cells$sum_area
temp_inters_cells[, var_intensive] <- sapply(
st_drop_geometry(temp_inters_cells[, var_intensive]),
function(x) x * temp_inters_cells$Area_share)
# Aggregate the variables by target
temp_targets <- aggregate(temp_inters_cells[, var_intensive],
by = list(temp_inters_cells$BUREAU),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename_at(vars(var_intensive), ~ paste0(var_intensive, "_daw_cells_bad"))
# Merge with the targets
bv_sample <- merge(bv_sample, st_drop_geometry(temp_targets),
by.x = "BUREAU", by.y = "Group.1")
```
We have done the same when the sources are the big cells, but we do not present the code, as it is the same as the one presented above.
```{r, message = F, warning = F, echo = F}
# intersection of sources and targets
temp_inters_big <- st_intersection(bv_sample[, "BUREAU"],
cell_big[, c(var_intensive, "area", "Id_carr_n")])
# compute the areas of the intersection zones (in km^2)
temp_inters_big$Area_intersect <- as.numeric(st_area(temp_inters_big)) / 1000 ^ 2
# merge with the intersected zones
temp_inters_big <- merge(temp_inters_big,
st_drop_geometry(sum_targets), by = "BUREAU")
temp_inters_big$Area_share <- temp_inters_big$Area_intersect /
temp_inters_big$sum_area
temp_inters_big[, var_intensive] <- sapply(
st_drop_geometry(temp_inters_big[, var_intensive]),
function(x) x * temp_inters_big$Area_share)
# Aggregate the variables by target
temp_targets <- aggregate(temp_inters_big[, var_intensive],
by = list(temp_inters_big$BUREAU),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename_at(vars(var_intensive), ~ paste0(var_intensive, "_daw_big_bad"))
# Merge with the targets
bv_sample <- merge(bv_sample, st_drop_geometry(temp_targets),
by.x = "BUREAU", by.y = "Group.1")
```
### Comparing the two methods
#### Sources = small cells
We compare the estimates obtained in the cases "extensive variables are known" and "extensive variables are unknown". We do it first for the small cells taken as sources, and for the population density variable. We observe that the two approaches give similar results.
```{r}
ggplot(bv_sample) +
aes(x = pop_dens_daw_cells, y = pop_dens_daw_cells_bad) +
geom_point() +
labs(title = "DAX method for population density (sources = small cells)",
x = "Extensive variables are known",
y = "Extensive variables are unknown") +
geom_abline(intercept = 0, slope = 1)
```
Let's now look at the proportion of people under 18. The approach "extensive variables are unknown" clearly underestimates the proportion for some targets.
```{r}
ggplot(bv_sample) +
aes(x = prop_Ind_under_18_daw_cells, y = prop_Ind_under_18_daw_cells_bad) +
geom_point() +
labs(title = "DAX method for percent of young people (sources = small cells)",
x = "Extensive variables are known",
y = "Extensive variables are unknown") +
geom_abline(intercept = 0, slope = 1)
```
Let's try to better understand what happens in these cases. We represent in red a target where the percentage of young people is estimated at $34.9\%$ in the case "extensive variables are known" and at $6.7\%$ in the case "extensive variables are unknown".
```{r, warning = F}
op <- par(oma= c(0, 0, 0, 0), mar = c(0, 0, 0, 0),
mai = c(0, 0, 0, 0))
plot(st_geometry(bv_sample[187,]), border = "red")
plot(st_geometry(bv_sample), border = "black", add = T, lty = 2)
plot(st_geometry(bv_sample[187, ]), border = "red", add = T)
plot(st_geometry(cell_200m), border = "lightblue", add = T)
par(op)
```
It appears that the non-intersected zone between sources and targets is large. In that case, $\hat Y_t$ is underestimated because of the term $\frac{|A_{s,t}|}{|T_t|}$, and more precisely because of the term $|T_t|$, which includes the non-intersected zones.
This situation is comparable to the previous case: **include** or **exclude** the non-intersected zones. We suggest that, when the intensive variable depends on the area, it makes sense to include the non-intersected zones. Indeed, in our case, the non-intersected zones correspond to zones with no inhabitants, and this information should be taken into account when estimating, for example, the population density.
However, when we consider an intensive variable which does not depend on the area (like the proportion of people under 18), the formula creates biases. One solution consists in replacing the term $|T_t|$ by $\sum_s|A_{s,t}|$. This is actually what is done by the function *st_interpolate_aw()* with the option **extensive = FALSE**.
```{r, warning = F}
a1 <- st_interpolate_aw(cell_200m["prop_Ind_under_18"],
bv_sample, extensive = FALSE)
bv_sample$prop_Ind_under_18_daw_sf <- a1$prop_Ind_under_18
```
In that case, we observe fewer differences between the cases "extensive variables are known" and "extensive variables are unknown":
```{r}
ggplot(bv_sample) +
aes(x = prop_Ind_under_18_daw_cells, y = prop_Ind_under_18_daw_sf) +
geom_point() +
labs(title = "DAX method for percent of young people (sources = small cells)",
x = "Extensive variables are known",
y = "Extensive variables are unknown") +
geom_abline(intercept = 0, slope = 1)
```
#### Sources = big cells
When the sources are the big cells, the correlation seems stronger. This confirms the remark made with the PIP method: when the variable is intensive, if sources and targets have more or less the same size, the approaches "extensive variables are known" and "extensive variables are unknown" should give similar results. For example, for the population density:
```{r}
ggplot(bv_sample) +
aes(x = pop_dens_daw_big, y = pop_dens_daw_big_bad) +
geom_point() +
labs(title = "DAX method for pop density (sources = big cells)",
x = "Extensive variables are known",
y = "Extensive variables are unknown") +
geom_abline(intercept = 0, slope = 1)
```
For the variable "percentage of people with less than 18 years old", we use the function *st_interpolate_aw()* with the option **extensive=F**.
```{r, warning = F}
a1 <- st_interpolate_aw(cell_big["prop_Ind_under_18"],
bv_sample, extensive = FALSE)
bv_sample$prop_Ind_under_18_daw_big_sf <- a1$prop_Ind_under_18
```
The two methods "extensive variables are known" VS "extensive variables are unknown" are also correlated.
```{r}
ggplot(bv_sample) +
aes(x = prop_Ind_under_18_daw_big, y = prop_Ind_under_18_daw_big_sf) +
geom_point() +
labs(title = "DAX method (sources = big cells)",
x = "Extensive variables are known",
y = "Extensive variables are unknown") +
geom_abline(intercept = 0, slope = 1)
```
## Conclusion
The DAW method corrects many weaknesses of the PIP method. When the sources are much smaller than the targets, PIP remains valid; however, when the sizes of sources and targets are comparable, DAW is much more precise.
We have treated in our illustration the case where the sources and targets are not nested.
If the variable is extensive, we have decided in this application not to take into account the part of a source which does not intersect any target. This choice could be different if the user wants the sum observed on the sources to equal the sum of the estimates obtained on the targets.
When the variable is intensive, the user should, whenever possible, estimate the extensive variables first. If it is not possible, they should consider the nature of the intensive variable: does it depend on the area or not? If yes, the formula should include the non-intersected zones. If not, the user can directly use the function *st_interpolate_aw()* to make the estimates. The smaller the sources are compared to the targets, the larger the differences between the two approaches.
One drawback of the DAW method is that the variables of interest may not be related to the area. The idea of the next method is to replace the area by auxiliary information.
# Dasymetric method with auxiliary variable $X$ (DAX)
To use this method, the user needs the auxiliary variable $X$ both at the source and at the intersection levels (note that if sources and targets are nested, the information at the intersection level can be aggregated to recover the information at the source level).
In practice, it is difficult to find such a variable. Indeed, in our application, it seems very difficult to get socio-economic variables at the intersection level. The INSEE has individual data concerning housing (see https://www.insee.fr/fr/metadonnees/definition/c1815) that could be used to obtain information at the intersection level, but it is not in open access.
To get data at the intersection level, we can use for instance road data (source: https://www.data.gouv.fr/fr/datasets/filaire-de-voirie-toulouse-metropole/) or OSM data. These data are not areal: they are given at the point or line level, but it is easy to transfer them to the intersection level. If the data are points, the user can get the information at the intersection level with the PIP method. For the road data, we simply compute the length of the roads within the intersection zones.
By using these data, we make the assumption that the INSEE variables are correlated with the density of the roads. In other words, the more roads there are, the more people there are.
```{r, eval = T, message = F, warning = F}
# We import the boundaries of the road maps in Toulouse:
voieries <- read_sf("./data/voierie/filaire-de-voirie.geojson")
voieries <- st_transform(voieries, crs = st_crs(bv_sample))
```
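All the DAX computations below rely on the same basic pattern: clip the road segments by a polygon layer with *st_intersection()*, measure the clipped pieces with *st_length()*, and sum the lengths by polygon. Here is a minimal sketch of this pattern for a hypothetical polygon layer `zones` with an identifier column `id`:
```{r, eval = F}
# Clip the road segments by the polygons (one row per clipped piece of road)
roads_in_zones <- st_intersection(zones[, "id"], voieries[, "id_troncon"])
# Measure each piece and sum the lengths by polygon
roads_in_zones$length <- as.numeric(st_length(roads_in_zones))
X <- aggregate(length ~ id, data = st_drop_geometry(roads_in_zones), FUN = sum)
```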
## Extensive variables
We apply the formula $\hat Y_{s,t}=\frac{X_{s,t}}{X_s}Y_s$, and then: $\hat Y_{t}=\sum_s\frac{X_{s,t}}{X_s}Y_s$.
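As a toy numerical illustration (our numbers): if a source $s$ contains $Y_s = 100$ households and $X_s = 1000$ m of roads, of which $X_{s,t_1} = 300$ m fall in target $t_1$ and $X_{s,t_2} = 700$ m in target $t_2$, then
$$\hat Y_{s,t_1}=\frac{300}{1000}\times 100 = 30 \quad \text{and} \quad \hat Y_{s,t_2}=\frac{700}{1000}\times 100 = 70 \text{ households}.$$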
In terms of programming, the DAX method is very close to DAW: the terms related to the areas are replaced by the auxiliary variable. Computing the intersections between the roads and the intersected zones $A_{s,t}$ is demanding, but it can be done in an acceptable time (a few seconds).
Here, we do not take into account the part of the sources which does not intersect any target (in other words, we keep the case **exclude** seen previously).
We present here the codes used in DAX method when the sources are the small cells.
```{r, echo = F}
load("bv_sample_dax.RData")
```
```{r, eval = F, echo = T, message = F, warning = F}
# We find the intersections between :
# - (intersection of sources and targets) and road maps
# intersection of sources and targets
temp_inters_cells <- st_intersection(bv_sample[, "BUREAU"],
cell_200m[, c(var_extensive, "IdINSPIRE")])
temp_inters_sourcestargets_roads <- st_intersection(
temp_inters_cells[, c("BUREAU", "IdINSPIRE")],
voieries[, "id_troncon"])
# compute the length of the roads
temp_inters_sourcestargets_roads$length <-
as.numeric(st_length(temp_inters_sourcestargets_roads))
# aggregate the lengths by sources
temp_inters_sources_roads <- st_intersection(
cell_200m[, "IdINSPIRE"], voieries[, "id_troncon"])
temp_inters_sources_roads$length <-
as.numeric(st_length(temp_inters_sources_roads))
temp_sources <- aggregate(temp_inters_sources_roads[, "length"],
by = list(temp_inters_sources_roads$IdINSPIRE),
FUN = sum)
temp_sources <- temp_sources %>%
rename(length_s = length)
# aggregate the lengths by (intersection of sources and targets)
temp_intersects <- aggregate(temp_inters_sourcestargets_roads[, "length"],
by = list(temp_inters_sourcestargets_roads$BUREAU,
temp_inters_sourcestargets_roads$IdINSPIRE),
FUN = sum)
# add the variable Xst
temp_inters_cells <- merge(temp_inters_cells,
st_drop_geometry(temp_intersects),
by.x = c("BUREAU", "IdINSPIRE"),
by.y = c("Group.1", "Group.2"))
# add the variable Xs
temp_inters_cells <- merge(temp_inters_cells,
st_drop_geometry(temp_sources),
by.x = "IdINSPIRE",
by.y = "Group.1")
temp_inters_cells <- temp_inters_cells %>%
mutate(ratio = length / length_s)
# apply the ratio with extensive variables
temp_inters_cells[, paste0(var_extensive)] <- sapply(
st_drop_geometry(temp_inters_cells[, var_extensive]),
function(x) x * temp_inters_cells$ratio)
# Aggregate the variables by target
temp_targets <- aggregate(temp_inters_cells[, var_extensive], by =
list(temp_inters_cells$BUREAU),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename_at(vars(var_extensive), ~ paste0(var_extensive, "_dax_cells"))
# Merge with the targets
bv_sample <- merge(bv_sample, st_drop_geometry(temp_targets),
by.x = "BUREAU", by.y = "Group.1")
```
We have done the same computations with the big cells taken as sources, but we do not present the code here because it is similar to the previous chunk.
```{r, eval = F, message = F, warning = F, echo = F}
# We find the intersections between :
# - (intersection of sources and targets) and road maps
# intersection of sources and targets
temp_inters_big <- st_intersection(bv_sample[, "BUREAU"],
cell_big[, c(var_extensive, "Id_carr_n")])
temp_inters_sourcestargets_roads <- st_intersection(
temp_inters_big[, c("BUREAU", "Id_carr_n")],
voieries[, "id_troncon"])
# compute the length of the roads
temp_inters_sourcestargets_roads$length <-
as.numeric(st_length(temp_inters_sourcestargets_roads))
# aggregate the lengths by sources
temp_inters_sources_roads <- st_intersection(
cell_big[, "Id_carr_n"], voieries[, "id_troncon"])
temp_inters_sources_roads$length <-
as.numeric(st_length(temp_inters_sources_roads))
temp_sources <- aggregate(temp_inters_sources_roads[, "length"],
by = list(temp_inters_sources_roads$Id_carr_n),
FUN = sum)
temp_sources <- temp_sources %>%
rename(length_s = length)
# aggregate the lengths by (intersection of sources and targets)
temp_intersects <- aggregate(temp_inters_sourcestargets_roads[, "length"],
by = list(temp_inters_sourcestargets_roads$BUREAU,
temp_inters_sourcestargets_roads$Id_carr_n),
FUN = sum)
# add the variable Xst
temp_inters_big <- merge(temp_inters_big, st_drop_geometry(temp_intersects),
by.x = c("BUREAU", "Id_carr_n"),
by.y = c("Group.1", "Group.2"))
# add the variable Xs
temp_inters_big <- merge(temp_inters_big, st_drop_geometry(temp_sources),
by.x = "Id_carr_n",
by.y = "Group.1")
temp_inters_big <- temp_inters_big %>%
mutate(ratio = length / length_s)
# apply the ratio with extensive variables
temp_inters_big[, paste0(var_extensive)] <- sapply(
st_drop_geometry(temp_inters_big[, var_extensive]),
function(x) x * temp_inters_big$ratio)
# Aggregate the variables by target
temp_targets <- aggregate(temp_inters_big[, var_extensive], by =
list(temp_inters_big$BUREAU),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename_at(vars(var_extensive), ~ paste0(var_extensive, "_dax_big"))
# Merge with the targets
bv_sample <- merge(bv_sample, st_drop_geometry(temp_targets),
by.x = "BUREAU",
by.y = "Group.1")
```
### Comparison between the two sources of data
We compare the DAX results with respect to the choice of the sources (small cells or big cells) for the variable "number of households". Few observations lie exactly on the regression line $y=x$, and we observe differences between the two estimates. As for the PIP and DAW methods, we recommend using the sources at the most detailed geographical level.
```{r}
ggplot(bv_sample) +
aes(x = Men_dax_cells, y = Men_dax_big) +
geom_point() +
labs(title = "DAX estimates of the number of households",
x = "Sources = small cells",
y = "Sources = big cells") +
geom_abline(intercept = 0, slope = 1)
```
## Comparison between DAW and DAX
We compare the results obtained with DAX and DAW for the variable "number of households". First, we present the figure in the case where the small cells are taken as sources.
```{r}
ggplot(bv_sample) +
aes(x = Men_dax_cells, y = Men_daw_cells) +
labs(title = "Households estimates (sources = small cells)",
x = "Method = DAX",
y = "Method = DAW") +
geom_point() +
geom_abline(intercept = 0, slope = 1)
```
DAW and DAX seem to give similar results. We now look at the case of the big cells taken as sources.
```{r, warning = F}
ggplot(bv_sample) +
aes(x = Men_dax_big, y = Men_daw_big) +
labs(title = "Households estimates (sources = big cells)",
x = "Method = DAX",
y = "Method = DAW") +
geom_point() +
geom_abline(intercept = 0, slope = 1)
```
In that case, it appears that the variability between the two methods is higher. Let's try to understand the differences between the two methods better. We have represented in red the target with the largest difference between the DAX and DAW estimates. For this target, the estimated value is 1460 with DAX and 2376 with DAW. We observe that the density of roads is very low in the target, whereas the area covered by the target is large. This explains why the value is underestimated with DAX compared to the DAW method.
```{r, warning = F}
plot(st_geometry(bv_sample[70,]), border = "red")
plot(st_geometry(bv_sample), border = "black", add = T, lty = 2)
plot(st_geometry(bv_sample[70,]), border = "red", add = T)
plot(st_geometry(cell_big), border = "lightblue", add = T)
text(st_coordinates(st_centroid(cell_big))[, 1],
st_coordinates(st_centroid(cell_big))[, 2],
cell_big$Men)
plot(st_geometry(voieries), col = "grey", add = T)
```
## Intensive variables
### Extensive variables are known
If the extensive variables which define the intensive variables are known, the DAX method is first applied to the extensive variables to get the estimates at the target level, and the intensive variables are then re-created from these estimates.
```{r}
bv_sample <- bv_sample %>%
mutate(
prop_Ind_under_18_dax_cells = Ind_0_17_dax_cells / Ind_dax_cells,
pop_dens_dax_cells = Ind_dax_cells / area_dax_cells,
prop_Ind_under_18_dax_big = Ind_0_17_dax_big / Ind_dax_big,
pop_dens_dax_big = Ind_dax_big / area_dax_big
)
```
### Extensive variables are unknown
It works similarly to the DAW method, except that we replace the area by the auxiliary information. The formula is $\hat Y_t=\sum_s \frac{X_{s,t}}{\sum_{s'}X_{s',t}}Y_s$. Note that we have chosen $\sum_s X_{s,t}$ instead of $X_t$ for the reasons seen in the DAW section: the roads of $T_t$ which intersect no source would otherwise deflate the estimate.
```{r, eval = F, message = F, warning = F}
# We find the intersections between :
# - (intersection of sources and targets) and road maps
# intersection of sources and targets
temp_inters_cells <- st_intersection(bv_sample[, "BUREAU"],
cell_200m[, c(var_intensive, "IdINSPIRE")])
temp_inters_sourcestargets_roads <- st_intersection(
temp_inters_cells[, c("BUREAU", "IdINSPIRE")],
voieries[, "id_troncon"])
# compute the length of the roads
temp_inters_sourcestargets_roads$length <-
as.numeric(st_length(temp_inters_sourcestargets_roads))
# aggregate the lengths by targets
temp_targets <- aggregate(temp_inters_sourcestargets_roads[, "length"],
by = list(temp_inters_sourcestargets_roads$BUREAU),
sum)
temp_targets <- temp_targets %>%
rename(length_t = length)
# aggregate the lengths by (intersection of sources and targets)
temp_intersects <- aggregate(temp_inters_sourcestargets_roads[, "length"],
by = list(temp_inters_sourcestargets_roads$BUREAU,
temp_inters_sourcestargets_roads$IdINSPIRE),
FUN = sum)
# add the variable Xst
temp_inters_cells <- merge(temp_inters_cells, st_drop_geometry(temp_intersects),
by.x = c("BUREAU", "IdINSPIRE"),
by.y = c("Group.1", "Group.2"))
# add the variable Xt
temp_inters_cells <- merge(temp_inters_cells, st_drop_geometry(temp_targets),
by.x = "BUREAU",
by.y = "Group.1")
temp_inters_cells <- temp_inters_cells %>%
mutate(ratio = length / length_t)
# apply the ratio to the intensive variables
temp_inters_cells[, paste0(var_intensive)] <- sapply(
st_drop_geometry(temp_inters_cells[, var_intensive]),
function(x) x * temp_inters_cells$ratio)
# Aggregate the variables by target
temp_targets <- aggregate(temp_inters_cells[, var_intensive], by =
list(temp_inters_cells$BUREAU),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename_at(vars(var_intensive), ~ paste0(var_intensive, "_dax_cells_bad"))
# Merge with the targets
bv_sample <- merge(bv_sample, st_drop_geometry(temp_targets),
by.x = "BUREAU",
by.y = "Group.1")
```
We have done the same computations with the big cells taken as sources, but we do not present the code here because it is similar to what has been done previously.
```{r, eval = F, message = F, warning = F, echo = F}
# We find the intersections between :
# - (intersection of sources and targets) and road maps
# intersection of sources and targets
temp_inters_big <- st_intersection(bv_sample[, "BUREAU"],
cell_big[, c(var_intensive, "Id_carr_n")])
temp_inters_sourcestargets_roads <- st_intersection(
temp_inters_big[, c("BUREAU", "Id_carr_n")],
voieries[, "id_troncon"])
# compute the length of the roads
temp_inters_sourcestargets_roads$length <-
as.numeric(st_length(temp_inters_sourcestargets_roads))
# aggregate by targets
temp_targets <- aggregate(temp_inters_sourcestargets_roads[, "length"],
by = list(temp_inters_sourcestargets_roads$BUREAU),
sum)
temp_targets <- temp_targets %>%
rename(length_t = length)
# aggregate the lengths by (intersection of sources and targets)
temp_intersects <- aggregate(temp_inters_sourcestargets_roads[, "length"],
by = list(temp_inters_sourcestargets_roads$BUREAU,
temp_inters_sourcestargets_roads$Id_carr_n),
FUN = sum)
# add the variable Xst
temp_inters_big <- merge(temp_inters_big, st_drop_geometry(temp_intersects),
by.x = c("BUREAU", "Id_carr_n"),
by.y = c("Group.1", "Group.2"))
# add the variable Xt
temp_inters_big <- merge(temp_inters_big, st_drop_geometry(temp_targets),
by.x = "BUREAU",
by.y = "Group.1")
temp_inters_big <- temp_inters_big %>%
mutate(ratio = length / length_t)
# apply the ratio to the intensive variables
temp_inters_big[, paste0(var_intensive)] <- sapply(
st_drop_geometry(temp_inters_big[, var_intensive]),
function(x) x * temp_inters_big$ratio)
# Aggregate the variables by target
temp_targets <- aggregate(temp_inters_big[, var_intensive], by =
list(temp_inters_big$BUREAU),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename_at(vars(var_intensive), ~ paste0(var_intensive, "_dax_big_bad"))
# Merge with the targets
bv_sample <- merge(bv_sample, st_drop_geometry(temp_targets),
by.x = "BUREAU",
by.y = "Group.1")
```
```{r, echo = F, eval = F}
save(bv_sample, file = "bv_sample_dax.RData")
```
### Comparison between the two methods
#### Sources = small cells
We compare the estimates obtained in the case "extensive variables are known" versus "extensive variables are unknown". We do it first for the population density.
```{r}
ggplot(bv_sample) +
aes(x = pop_dens_dax_cells, y = pop_dens_dax_cells_bad) +
geom_point() +
labs(title = "Population density DAX method (sources = small cells)",
x = "Extensive variables are known",
y = "Extensive variables are unknown") +
geom_abline(intercept = 0, slope = 1)
```
We remark that the scatter plot fits well around the line $y=x$. The same remark holds for the variable "percentage of people under 18", even if there is a little more variability here.
```{r}
ggplot(bv_sample) +
aes(x = prop_Ind_under_18_dax_cells, y = prop_Ind_under_18_dax_cells_bad) +
geom_point() +
labs(title = "People < 18 with DAX method (sources = small cells)",
x = "Extensive variables are known",
y = "Extensive variables are unknown") +
geom_abline(intercept = 0, slope = 1)
```
#### Sources = big cells
When the sources are the big cells, the two cases "extensive variables are known" versus "extensive variables are unknown" also fit quite well.
```{r}
ggplot(bv_sample) +
aes(x = pop_dens_dax_big, y = pop_dens_dax_big_bad) +
geom_point() +
labs(title = "Population density with DAX method (sources = big cells)",
x = "Extensive variables are known",
y = "Extensive variables are unknown") +
geom_abline(intercept = 0, slope = 1)
```
```{r}
ggplot(bv_sample) +
aes(x = prop_Ind_under_18_dax_big, y = prop_Ind_under_18_dax_big_bad) +
geom_point() +
labs(title = "People < 18 with DAX method (sources = big cells)",
x = "Extensive variables are known",
y = "Extensive variables are unknown") +
geom_abline(intercept = 0, slope = 1)
```
## Comparison between DAX and DAW
For the big cells:
```{r}
ggplot(bv_sample) +
aes(x = prop_Ind_under_18_dax_big, y = prop_Ind_under_18_daw_big) +
geom_point() +
labs(title = "People < 18 (sources = big cells)",
x = "DAX estimates",
y = "DAW estimates") +
geom_abline(intercept = 0, slope = 1)
```
For the small cells:
```{r}
ggplot(bv_sample) +
aes(x = prop_Ind_under_18_dax_cells, y = prop_Ind_under_18_daw_cells) +
geom_point() +
labs(title = "People < 18 (sources = small cells)",
x = "DAX estimates",
y = "DAW estimates") +
geom_abline(intercept = 0, slope = 1)
```
## Conclusion
In our application, even if we observe a few differences, the DAX and DAW methods give very close estimates. The choice of one method rather than another depends on the variables.
If the variables are related to the area, DAW should be precise enough. If this is not the case and the user has auxiliary information related to the variables, DAX should be more efficient.
Moreover, in the case of extensive variables, the smaller the sources, the more similar the DAW and DAX results will be.
# Dasymetric method with control zones (DAC)
In this part, we consider the iris as the sources, the polling places as the targets, and the cells at the smallest level as the control zones. We use the two-step DAC algorithm, which consists in (a condensed sketch follows the list):
* **Step 1:** estimating, at the intersection level (between iris and polling places), the variable "number of inhabitants", which is available on the control zones. To do this, we use the DAW method.
* **Step 2:** applying the DAX method between the iris and the polling places, the auxiliary information being the number of inhabitants estimated at the first step.
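Before detailing each step, here is a condensed sketch of the whole pipeline; *daw_to_intersections()* and *dax_spread()* are hypothetical helpers which stand for the detailed code of Steps 1 and 2 below.
```{r, eval = F}
# Intersections between the sources (iris) and the targets (polling places)
inter <- st_intersection(bv_sample[, "BUREAU"],
                         iris[, c(var_extensive_rp, "IRIS")])
# Step 1 (DAW): estimate the control variable "Ind" on each intersection,
# with the small cells as sources (hypothetical helper)
inter$Ind_daw <- daw_to_intersections(cell_200m["Ind"], inter)
# Step 2 (DAX): spread the iris variables to the polling places with
# weights Ind_daw / (sum of Ind_daw by iris) (hypothetical helper)
bv_estimates <- dax_spread(inter, weights = "Ind_daw", target = "BUREAU")
```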
## Step 1: DAW method
The aim is to apply the DAW method by using the control zones as sources and the intersections between iris and polling places as targets. The variable of interest is the "number of individuals".
We define first the geometries of the intersections between iris and polling places:
```{r}
# The targets
control <- st_intersection(bv_sample[, "BUREAU"],
iris[, c(var_extensive_rp, "IRIS")])
control <- control %>%
mutate(ID = paste0(IRIS, "_", BUREAU))
```
We can then apply the DAW method by considering the small cells as the sources and the intersections between iris and polling places as the targets.
```{r}
# intersection of sources and targets
temp_inters_cell_200m <- st_intersection(control[, "ID"],
cell_200m[, c("Ind", "area", "IdINSPIRE")])
# compute the areas of the intersections divided by the area of the source
temp_inters_cell_200m$Area_intersect <-
as.numeric(st_area(temp_inters_cell_200m)) / 1000 ^ 2
# compute the sum of the intersected zones per source
sum_intersect_cell_200m <- temp_inters_cell_200m %>%
select(Area_intersect, IdINSPIRE) %>%
group_by(IdINSPIRE) %>%
summarise(sum_area = sum(Area_intersect))
# merge with the intersected zones
temp_inters_cell_200m <- merge(temp_inters_cell_200m,
st_drop_geometry(sum_intersect_cell_200m),
by = "IdINSPIRE")
temp_inters_cell_200m$Area_share_1 <- temp_inters_cell_200m$Area_intersect /
temp_inters_cell_200m$area
# Multiply the variables by the proportions
temp_inters_cell_200m[, "Ind"] <-
sapply(st_drop_geometry(temp_inters_cell_200m[, "Ind"]),
function(x) x * temp_inters_cell_200m$Area_share_1)
# Aggregate the variables by target
temp_targets <- aggregate(temp_inters_cell_200m[, "Ind"],
by = list(temp_inters_cell_200m$ID),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename(Ind_daw = Ind)
# Merge with the targets
control <- merge(control, st_drop_geometry(temp_targets),
by.x = "ID", by.y = "Group.1")
```
## Step 2: DAX method
The aim is to use the iris as sources and the polling places as targets. We use here the auxiliary information created at step 1: the number of inhabitants estimated at the intersection level.
We apply the method to the extensive variables.
```{r}
# aggregate the X by sources
control_sources <- aggregate(control[, "Ind_daw"],
by = list(control$IRIS),
FUN = sum)
control_sources <- control_sources %>%
rename(Ind_daw_s = Ind_daw)
# add the variable Xs
control <- merge(control, st_drop_geometry(control_sources),
by.x = "IRIS",
by.y = "Group.1")
control <- control %>%
mutate(ratio = Ind_daw / Ind_daw_s)
# apply the ratio with extensive variables
control[, paste0(var_extensive_rp)] <- sapply(
st_drop_geometry(control[, var_extensive_rp]),
function(x) x * control$ratio)
# Aggregate the variables by target
temp_targets <- aggregate(control[, var_extensive_rp], by =
list(control$BUREAU),
FUN = sum)
# Rename the variables
temp_targets <- temp_targets %>%
rename_at(vars(var_extensive_rp), ~ paste0(var_extensive_rp, "_dac"))
# Merge with the targets
bv_sample <- merge(bv_sample, st_drop_geometry(temp_targets),
by.x = "BUREAU",
by.y = "Group.1")
```
### Comparison between DAC and DAX
We have also computed the estimates with the DAX method. We do not present the code because it is similar to what has been presented in the previous section.
```{r}
source("codes_DAX_iris.R")
```
We compare the results of the two methods, DAX and DAC, for the variable "number of unemployed persons".
```{r}
# Comparaison DAC VS DAX
ggplot(bv_sample) +
aes(x = P13_CHOM1564_dax, y = P13_CHOM1564_dac) +
geom_point() +
labs(title = "Estimates of the number of unemployement",
x = "DAX",
y = "DAC") +
geom_abline(intercept = 0, slope = 1)
```
We observe some targets with very different estimates. Let's have a look at what is happening for one of these targets, represented in red in the following figure. The estimated value is equal to 674 with DAX and to 197 with DAC. In the DAX method, the main contribution to this value comes from the source represented in violet, which shares $82\%$ of its roads (represented in grey) with the target.
```{r}
plot(st_geometry(bv_sample[c(239, 241), ]), lty = 2, border = "white")
plot(st_geometry(bv_sample[239, ]), border = "red", add = T)
plot(st_geometry(voieries), col = "grey", add = T, lty = 2, lwd = 0.5)
plot(st_geometry(iris), border = "black", add = T)
plot(st_geometry(iris[iris$IRIS == "315554702", ]),
border = "violet", add = T, lwd = 1.5)
```
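For completeness, here is a sketch of how such a share can be computed (our code, not part of the original analysis): clip the roads of the violet iris, then measure the part falling inside the red target.
```{r, eval = F}
# Roads of the violet iris, then the part of them inside the red target
iris_v <- iris[iris$IRIS == "315554702", ]
roads_iris <- st_intersection(voieries[, "id_troncon"], iris_v)
roads_in_target <- st_intersection(roads_iris, bv_sample[239, "BUREAU"])
# Share of the iris roads intersecting the target
sum(as.numeric(st_length(roads_in_target))) /
  sum(as.numeric(st_length(roads_iris)))
```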
What is happening with the two-step DAC method? We remark in the following figure that the intersection between the target in red and the source in violet contains little population, as shown by the control zones represented in light blue. In that case, the source in violet attributes the main part of its value to the zone which does not intersect the target, a zone with a lot of population.
```{r}
plot(st_geometry(bv_sample[c(239, 241), ]), lty = 2, border = "white")
plot(st_geometry(bv_sample[239, ]), border = "red", add = T)
plot(st_geometry(cell_200m), border = "lightblue", add = T)
text(st_coordinates(st_centroid(cell_200m))[, 1],
st_coordinates(st_centroid(cell_200m))[, 2],
paste0(cell_200m$Ind), cex = 0.5)
plot(st_geometry(voieries), col = "grey", add = T)
plot(st_geometry(iris), border = "black", add = T)
plot(st_geometry(iris[iris$IRIS == "315554702", ]),
border = "violet", add = T, lwd = 1.5)
```
## Conclusion
When control zones are available and provide auxiliary information (supposed to be related to the target variables) at a more detailed geographical level than the sources, the two-step DAC method should improve the estimation with respect to the DAX or DAW methods. This is the method we use to estimate the covariates of our model that are provided by the INSEE at the iris level.
# Regression Modelling
## Construction of the covariates
Thanks to the variables provided by the INSEE at the cell level, we construct, using the DAX method, the following covariates:
* population density,
* percentage of individuals under 18,
* percentage of individuals between 18 and 40,
* percentage of individuals between 40 and 64,
* percentage of individuals aged 65 and over,
* percentage of poor households,
* percentage of owners,
* percentage of recent dwellings.
We use the estimates obtained for the extensive variables and then compute the ratios.
For the small cells:
```{r}
bv_sample <- bv_sample %>%
mutate(
pop_dens = Ind_dax_cells / area_dax_cells,
prop_Ind_18_40 = (Ind_18_24_dax_cells + Ind_25_39_dax_cells) / Ind_dax_cells,
prop_Ind_40_64 = (Ind_40_54_dax_cells + Ind_55_64_dax_cells) / Ind_dax_cells,
prop_Ind_up_65 = (Ind_65_79_dax_cells + Ind_80p_dax_cells) / Ind_dax_cells,
prop_pour_hos = Men_pauv_dax_cells / Men_dax_cells,
prop_owner = Men_prop_dax_cells / Men_dax_cells,
prop_recent = Log_ap90_dax_cells / Men_dax_cells
)
```
We do the same for the PIP method:
```{r}
bv_sample_pip <- bv_sample
bv_sample_pip <- bv_sample_pip %>%
mutate(
pop_dens = Ind_pip_cells / area_pip_cells,
prop_Ind_18_40 = (Ind_18_24_pip_cells + Ind_25_39_pip_cells) / Ind_pip_cells,
prop_Ind_40_64 = (Ind_40_54_pip_cells + Ind_55_64_pip_cells) / Ind_pip_cells,
prop_Ind_up_65 = (Ind_65_79_pip_cells + Ind_80p_pip_cells) / Ind_pip_cells,
prop_pour_hos = Men_pauv_pip_cells / Men_pip_cells,
prop_owner = Men_prop_pip_cells / Men_pip_cells,
prop_recent = Log_ap90_pip_cells / Men_pip_cells
)
```
We do the same for the DAW method:
```{r}
bv_sample_daw <- bv_sample
bv_sample_daw <- bv_sample_daw %>%
mutate(
pop_dens = Ind_daw_cells / area_daw_cells,
prop_Ind_18_40 = (Ind_18_24_daw_cells + Ind_25_39_daw_cells) / Ind_daw_cells,
prop_Ind_40_64 = (Ind_40_54_daw_cells + Ind_55_64_daw_cells) / Ind_daw_cells,
prop_Ind_up_65 = (Ind_65_79_daw_cells + Ind_80p_daw_cells) / Ind_daw_cells,
prop_pour_hos = Men_pauv_daw_cells / Men_daw_cells,
prop_owner = Men_prop_daw_cells / Men_daw_cells,
prop_recent = Log_ap90_daw_cells / Men_daw_cells
)
```
For the big cells:
```{r}
bv_sample_big <- bv_sample
bv_sample_big <- bv_sample_big %>%
mutate(
pop_dens = Ind_dax_big / area_dax_big,
prop_Ind_18_40 = (Ind_18_24_dax_big + Ind_25_39_dax_big) / Ind_dax_big,
prop_Ind_40_64 = (Ind_40_54_dax_big + Ind_55_64_dax_big) / Ind_dax_big,
prop_Ind_up_65 = (Ind_65_79_dax_big + Ind_80p_dax_big) / Ind_dax_big,
prop_pour_hos = Men_pauv_dax_big / Men_dax_big,
prop_owner = Men_prop_dax_big/ Men_dax_big,
prop_recent = Log_ap90_dax_big / Men_dax_big
)
```
We do the same for the PIP method:
```{r}
bv_sample_pip_big <- bv_sample
bv_sample_pip_big <- bv_sample_pip_big %>%
mutate(
pop_dens = Ind_pip_big / area_pip_big,
prop_Ind_18_40 = (Ind_18_24_pip_big + Ind_25_39_pip_big) / Ind_pip_big,
prop_Ind_40_64 = (Ind_40_54_pip_big + Ind_55_64_pip_big) / Ind_pip_big,
prop_Ind_up_65 = (Ind_65_79_pip_big + Ind_80p_pip_big) / Ind_pip_big,
prop_pour_hos = Men_pauv_pip_big / Men_pip_big,
prop_owner = Men_prop_pip_big/ Men_pip_big,
prop_recent = Log_ap90_pip_big / Men_pip_big
)
```
We do the same for the DAW method:
```{r}
bv_sample_daw_big <- bv_sample
bv_sample_daw_big <- bv_sample_daw_big %>%
mutate(
pop_dens = Ind_daw_big / area_daw_big,
prop_Ind_18_40 = (Ind_18_24_daw_big + Ind_25_39_daw_big) / Ind_daw_big,
prop_Ind_40_64 = (Ind_40_54_daw_big + Ind_55_64_daw_big) / Ind_daw_big,
prop_Ind_up_65 = (Ind_65_79_daw_big + Ind_80p_daw_big) / Ind_daw_big,
prop_pour_hos = Men_pauv_daw_big / Men_daw_big,
prop_owner = Men_prop_daw_big/ Men_daw_big,
prop_recent = Log_ap90_daw_big / Men_daw_big
)
```
Thanks to the variables provided by the INSEE at the iris level, we then construct, using the two-step DAC method, the following covariates:
* unemployment rate,
* immigration rate,
* percentage of farmers,
* percentage of intermediate/highly qualified jobs,
* percentage of workers,
* percentage of retired people.
We use the estimates obtained for the extensive variables and then compute the ratios.
```{r}
bv_sample <- bv_sample %>%
mutate(prop_immi = P15_POP_IMM_dac / P15_POP_dac,
prop_unemploy = P13_CHOM1564_dac / P13_ACT1564_dac,
prop_csp_1 = (C15_POP15P_CS1_dac + C15_POP15P_CS2_dac) / C15_POP15P_dac,
prop_csp_2 = (C15_POP15P_CS3_dac + C15_POP15P_CS4_dac) / C15_POP15P_dac,
prop_csp_3 = (C15_POP15P_CS5_dac + C15_POP15P_CS6_dac) / C15_POP15P_dac
)
```
## Exploratory Analysis
The correlation plot indicates that the extreme right vote is **positively** correlated with:
* proportion of people younger than 18,
* proportion of immigrants,
* turnout,
* proportion of workers,
* proportion of recent housing.
It is **negatively** correlated with:
* percentage of highly qualified professions,
* population density,
* percentage of people between 18 and 40,
* percentage of people aged 65 and over.
```{r}
my_data <- st_drop_geometry(bv_sample[, c("taux_fn", "turnout", "pop_dens", "prop_pour_hos",
"prop_owner", "prop_recent",
"prop_Ind_18_40", "prop_Ind_40_64", "prop_Ind_up_65",
"prop_unemploy", "prop_immi", "prop_csp_1", "prop_csp_2",
"prop_csp_3"), ])
M <- cor(my_data)
res1 <- cor.mtest(my_data, conf.level = .95)
corrplot(M, p.mat = res1$p, method = "color", type = "upper",
sig.level = c(.001, .01, .05), pch.cex = .9,
insig = "label_sig", pch.col = "white")
```
**Remark 1:** the covariates are strongly correlated, which suggests that the linear model may suffer from collinearity.
**Remark 2:** the relationships between the extreme right vote and the covariates are not necessarily linear. For example, if we look at the scatter plot of the extreme right vote with respect to the unemployment rate, the relationship is not linear.
```{r}
ggplot(data = bv_sample) +
aes(x = prop_unemploy, y = taux_fn) +
geom_point() +
geom_smooth(method = "loess",
col = "red") +
labs(x = "unemployement rate", y = "XR score")
```
Linear modelling is based on the hypothesis that the covariates are linearly independent, which is not the case here. For that reason, we propose two regression models:
* Linear modelling
* Regression tree
## Regression modelling
### Linear modelling
```{r}
form <- taux_fn ~ turnout + pop_dens + prop_pour_hos +
  prop_owner + prop_recent +
  prop_Ind_18_40 + prop_Ind_40_64 + prop_Ind_up_65
# Fit the same model on the covariates obtained with each method and source
res_lm_1_big <- lm(form, data = bv_sample_pip_big)  # PIP, big cells
res_lm_2_big <- lm(form, data = bv_sample_daw_big)  # DAW, big cells
res_lm_3_big <- lm(form, data = bv_sample_big)      # DAX, big cells
res_lm_1 <- lm(form, data = bv_sample_pip)          # PIP, small cells
res_lm_2 <- lm(form, data = bv_sample_daw)          # DAW, small cells
res_lm_3 <- lm(form, data = bv_sample)              # DAX, small cells
```
#### Full Model
```{r}
res_lm <- lm(taux_fn ~ turnout + pop_dens + prop_pour_hos +
prop_owner + prop_recent +
prop_Ind_18_40 +
prop_Ind_40_64 +
prop_Ind_up_65 +
prop_unemploy + prop_immi +
prop_csp_1 + prop_csp_2 + prop_csp_3,
data = bv_sample)
```
```{r, results='asis'}
stargazer::stargazer(res_lm)
```
```{r, results='asis'}
stargazer::stargazer(res_lm_1_big, res_lm_2_big, res_lm_3_big,
res_lm_1, res_lm_2, res_lm_3, res_lm)
```
The adjusted $R^2$ of our model is reasonably good, with a value of $56.58\%$. Among the parameter estimates, the sign associated with the unemployment variable does not correspond to what we would expect. This is probably due to the collinearity between the covariates and to the nonlinear link between the dependent variable and this covariate.
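To quantify this collinearity, one can compute variance inflation factors. Below is a quick sketch using base **R** only (our code), based on $VIF_j = 1/(1-R_j^2)$, where $R_j^2$ is obtained by regressing covariate $j$ on the other covariates:
```{r, eval = F}
covariates <- c("turnout", "pop_dens", "prop_pour_hos", "prop_owner",
                "prop_recent", "prop_Ind_18_40", "prop_Ind_40_64",
                "prop_Ind_up_65", "prop_unemploy", "prop_immi",
                "prop_csp_1", "prop_csp_2", "prop_csp_3")
X <- st_drop_geometry(bv_sample[, covariates])
# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing covariate j on the others
vif <- sapply(covariates, function(v) {
  r2 <- summary(lm(reformulate(setdiff(covariates, v), response = v),
                   data = X))$r.squared
  1 / (1 - r2)
})
sort(vif, decreasing = TRUE)
```
Values well above 10 would confirm that some covariates are nearly redundant.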
We now try a regression tree, which is less sensitive to collinearity and to nonlinear relationships.
### Regression tree
```{r}
my_tree <- rpart(taux_fn ~ turnout + pop_dens + prop_pour_hos + prop_owner +
prop_recent +
prop_Ind_18_40 +
prop_Ind_40_64 +
prop_Ind_up_65 +
prop_unemploy + prop_immi + prop_csp_1 +
prop_csp_2 + prop_csp_3, data = bv_sample)
fancyRpartPlot(my_tree, sub = "")
pred_tree <- predict(my_tree)
```
We focus first on the terminal node which corresponds to the polling places with the highest predicted value (equal here to $26\%$). It corresponds to 9 polling places, represented on the following map, which are located at the periphery of the city in all directions. If we follow the tree from the top to this terminal node, it corresponds to observations with a proportion of workers above $29\%$ and a proportion of highly qualified jobs below $14\%$.
```{r}
par(mar= c(0,0,0,0))
plot(st_geometry(bv_sample))
plot(st_geometry(bv_sample[pred_tree == max(pred_tree), ]),
add = T, col = "red")
```
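Rather than reading the plot by eye, the sequence of splits leading to a terminal node can be extracted with *path.rpart()*, included in the **rpart** package; here is a short sketch (our code):
```{r, eval = F}
# Locate the leaf with the highest fitted value and print its path of splits
leaves <- which(my_tree$frame$var == "<leaf>")
best <- leaves[which.max(my_tree$frame$yval[leaves])]
path.rpart(my_tree, nodes = as.numeric(rownames(my_tree$frame))[best])
```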
On the contrary, if we focus on the terminal node which corresponds to the polling places with the lowest predicted value (equal here to $11\%$), it corresponds to 52 polling places, represented in red in the following map and mainly located in the centre of the city. If we follow the tree from the top to the bottom, it corresponds to polling places with a proportion of workers lower than $20\%$ (the variable appears at the first node with a threshold of $28\%$ and again at the third node with a threshold of $20\%$), a population density larger than 2887 inhabitants per square kilometre, and a high proportion of people aged between 18 and 40.
```{r}
par(mar= c(0,0,0,0))
plot(st_geometry(bv_sample))
plot(st_geometry(bv_sample[pred_tree == min(pred_tree), ]),
add = T, col = "red")
```
**References**
* Do V.H., Thomas-Agnan C. and A. Vanhems (2015). Spatial reallocation of areal data: a review, *RERU*.
* Nguyen T.H.A, Laurent T., Thomas-Agnan C. and A. Ruiz-Gazen (2018). Analyzing the impacts of socio-economic factors on French departmental elections with CODA methods, *TSE Working Paper*.
* Pebesma, E. (2018). Simple Features for **R**: Standardized Support for Spatial Vector Data. *The **R** Journal*, **10** (1), 439-446, https://doi.org/10.32614/RJ-2018-009
* **R** Core Team (2019). ***R**: A language and environment for statistical computing*. **R** Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.