Spatial data analysis with R

Schedule

Unfold day to see schedule

Monday

9:00-10:30 Handling and plotting vector data in R
10:30-10:45 Coffee break
10:45-12:00 Handling and plotting cont.
12:00-13:00 Lunch break
13:00-14:30 Spatial operations (intersection, clipping, conversions etc)
14:30-14:45 Coffee break
14:45-16:00 Spatial operations cont.

Tuesday

9:00-10:30 Spatial analysis of vector data (clustering, density surfaces, autocorrelation)
10:30-10:45 Coffee break
10:45-12:00 Spatial analysis cont.
12:00-13:00 Lunch break
13:00-14:15 Visualizing spatial data
14:15-14:30 Coffee break
14:30-16:00 Spatial analysis with R in Puhti supercluster, inc running R code in parallel

Wednesday

9:00-10:30 Raster basics with R
10:30-10:45 Coffee break
10:45-12:00 Raster data manipulation
12:00-13:00 Lunch break
13:00-14:30 Map algebra
14:30-14:45 Coffee break
14:45-16:00 Spatial modelling with raster data

Thursday

13:00-15:00 Using Puhti supercomputer with R for spatial analysis and code parallization hands-on workshop. Intro to the new Puhti web interface. Optional for course participants, open to everybody.
Zoom link

Course materials

Exercise environment: RStudio

Notebooks RStudio

For exercises we will use RStudio in CSCs Notebook environment, for which you will only need an updated webrowser (Firefox, Chrome or Safari are recommended, others may or may not work)

  • If you want to use this service, please try logging in to https://notebooks.csc.fi/ using your HAKA, Virtu or CSC user account. If you do not have a HAKA or CSC user account (=LUKE) see separate email (coming latest Friday before course start).
  • First day only, navigate to 'Account' in the top bar and 'join group' using the code geospatial_r-2o4c9 (please do not share this code outside this course).
  • Now you should see 'Spatial data analysis with R course' in the list on the dashboard.
  • Click Launch new under 'Spatial data analysis with R course'
  • Wait for a moment, so that Open in Browser link shows up and follow that link.
  • Log in to RStudio with username: rstudio and password from clipboard.
  • Clone Github repository, in Terminal (next to Console): git clone https://github.com/csc-training/r-spatial-course.git

After each course day

  • Close the web-browser tab with RStudio
  • Click Destroy in Notebooks Dashboard. Even if you do not do this, the RStudio session will be automatically end after 10h from starting it.
  • If you want to store what you have done on your computer, you can mark what you want to store, eg the full r spatial course directory in the file browser panel in RStudio, click on 'More' and find 'Export', give it a name in the popup and click 'download'. This should lead to another popup in which you can save the compressed files to your computer.

Local RStudio

If you want, you can also follow the course in your own RStudio on your computer.

Practicalities

  • 2 screens helps to keep Zoom, HackMD/Material and RStudio in workable sizes.
  • If you have only one screen, consider using a tablet/smartphone as a second screen.
  • Please ask the presenter to increase font size etc, if not readable for you
  • Headphones / Headset, for better audio.

During the course

  • If you have questions:
    • Preferably ask with audio
    • If less related to current topic / also later answer ok, ask here in HackMD
  • If you need help during exercises:
    • Ask with audio or chat.
    • Share your screen if asked for troubleshooting
  • If you are ready with an exercise, change your name in Zoom participants list, for example to "Maria E(xercise)3 done"
  • Please mute yourself when not talking but keep your video on during the whole course, if your Internet connection allows
  • During the breaks your are also welcome to have free chat in Zoom

Optional course 'day': Thursday

unfold to see information for Thursday
  • On Thursday we will provide an extra optional session for exercises and questions around using R on Puhti for parallel code execution (13:00-15:00) with a separate Zoom link, this part is open and free for everyone, in case you have interested colleagues.

Questions and Answers

  • Ask a question like this
    • Anyone can answer like this
      • Ask more about it like this

Day 1, Session 1, questions from Zoom chat

  • Is there a way to see this r markdown “visualized” in rstudio like we read normal markdown in eg. github?

    • There is a possibility to 'knit' the Rmd file to eg html
    • You can find the knit buttom in the top bar of the Rmd window in Rstudio
    • When knitting to html it will create a html file in your current workspace which can then be opened in your webbrowser
  • Permission issues, paths not found that are clearly there?

    • something like this: Warning in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), : GDAL Error 4: sqlite3_open(../Data/villages.gpkg) failed: unable to open database file Creating dataset ../Data/villages.gpkg failed. Error in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), : Creation failed
    • rename the cloned directory in your workspace
    • clone again
    • git clone https://github.com/csc-training/r-spatial-course.git
  • Instead of maps I got plots. What is wrong?

    • maybe the sf library is not loaded?
    • if sf is not loaded, R will not know what to do with spatial information and choose some other way of plotting the data
    • to load the sf library, try library(sf) in the console

Note: if you want to run all code in an Rmd you can go on the triangle button next to 'Run' in the upper right corner of the Rmd script window within RStudio and choose 'Run all', or use the keyboard shortcut Control + Alt + R

  • so feature is like an observation and field is like a variable?

    • yes, each row in the datasets are observations -> features in geospatial terms
    • and variables are the columns -> attributes in geospatial terms
  • I had a weirdish issue: i managed to load the nicely, but now i ran the code again and it says the file doesn't exist. the same code worked 10min ago. :)

    • answer coming soon
  • st_sf can not be used directly, st_sfc needs to be used first?

    • no, extra step is needed to create simple feature column before using st_sf
  • When is it necessary to use the package before the tool you are using?

    • if you have not loaded library using library command you will need to use library::function

Exercise 1 (D1S1) solutions:

Solutions to the exercises you can now find in https://github.com/csc-training/r-spatial-course/tree/main/Day 1

Solutions
read_sf("../Data/villages.gpkg") %>% dplyr::select(vil_id, HumanFootprint, IncomeDependency) %>% dplyr::filter(IncomeDependency > 50)
villages%>% mutate(HFP = ifelse( HumanFootprint > mean(HumanFootprint),1,0)) %>% group_by(HFP) %>% summarise(mean(HumanFootprint)))
five_villages <- select(villages, vil_id) %>% dplyr::slice(1:5) new_point <- st_pint(c(1,1)) %>% sf_sfc() %>% st_sf() %>% mutate(vil_id = 1) %>% st_set_crs(st_ers(five_villages)) r_bind(five_villages, new_point)

Questions Day 1, session 2

It is good practise to clean workspace (broom symbol in environment panel)
Also helps to restart R (in tabs: Session > restart R) before starting new project/session to make sure everything is removed

  • I actually got the following error when running code in RMarkdown document: ”Error in plot.xy(xy.coords(x, y), type = type, ) : plot.new has not been called yet”

    • Rmd: has trouble with building plots over several lines, so you will need to run all plot commands at once, ie running the full chunck
    • or via command line
    • this is specific to RStudio
  • Error in plot.new() : figure margins too large

    • try: resizing the Rmd window
    • copy command that throws the error and run it in the console
    • you can check the margins with par("mar")
    • and you can change the margins with par(mar=c(1,1,1,1))
    • also make sure your Rmd panel (where the Rmd file is loaded) is large enough to hold the plot

If you want you can also see the plot in separate window (where the plot shows up in the Rmd, right upper corner left most button 'show in new window')

  • difference between $ and []?

    • [] gives a simple feature collection, still a sf object, keeps geometry
    • with $ you get a vector, only accesses the contents, does not keep the sf object
    • -> for plotting spatial data always use []
  • If someone else has problems with knitting second part .Rmd to html: it seems to be out out memory error https://stackoverflow.com/questions/51410248/r-markdown-to-html-via-knitr-pandoc-error-137

    • And i guess it’s due to 1gb memory limit in the notebook that was discussed earlier. Works fine when run local
    • We also got more memory now, so if you destroy the instance in the notebooks webinterface now and launch a new one, you should get 4GB instead :)
  • I re-launched the notebook, but I don’t remember what was the command to get the material from the git repository.

    • run in terminal : git clone https://github.com/csc-training/r-spatial-course.git

for plotting multiple plots with ggplot use |, eg p1 |p2 to have p1 and p2 side by side
guides= "collect" to get a legend with information from all plots

  • do you prefer mapview over tmap?
    • MK: no preference
      • interactive exploration done with QGIS mostly
      • mapview is faster to type, as in tmap shape and esthetics have to be defined first

Exercise 2 (D1S2)

Solutions to the exercises you can now find in https://github.com/csc-training/r-spatial-course/tree/main/Day 1

Solutions
# for CSC notebooks  replace ".." with "/home/rstudio/r-spatial-course"
villages <- read_sf("../Data/villages.gpkg")
admin <- read_sf("../Data/LAO_adm1.shp")
rivers <- read_sf("../Data/rivers.gpkg")

ggplot() +
    geom_sf(data = admin, fill = "transparent") + 
    geom_sf(data = rivers, lty = 2, col = "darkblue") + 
    geom_sf(data = villages, aes(size = IncomeDependency))


2

bbox <- admin %>% 
    dplyr::filter(NAME_1 == "Vientiane_Capital") %>% 
    st_bbox()
bbox

ggplot() +
    geom_sf(data = admin, fill = "transparent") + 
    geom_sf(data = rivers, lty = 2) + 
    geom_sf(data = villages, size = 0.1) +
    coord_sf(xlim = c(bbox[1],bbox[3]), ylim = c(bbox[2],bbox[4]))

Questions Day 1 Session 3 (D1S3)

Some general info about CRS (Coordinate Reference System): https://rspatial.org/raster/spatial/6-crs.html

R does not provide any 'on-the-fly' CRS transformation (like QGIS etc)!

You can take a look at the metadata of any variable by typing the variable name in the chunk to be executed

class(variablename) to get the class information of a variable

Note: library rmapshaper does not currently work in the notebook environment because it is relying on some external library which we could not make work. Sorry for the inconvenience. In normal Desktop RStudio on your computer this should work fine. (Let us know if you encounter any problems)

"Creating regular grids" part was skipped in class

  • what does reset=FALSE do in the plot? line 265 in D1S3

    • makes it possible to add something to the same plot, without it adding something else to does not work
  • after villages <- sf::read_sf("/home/rstudio/r-spatial-course/Data/villages.gpkg") %>% sf::st_transform(32648) it gives Error in UseMethod("st_transform") : no applicable method for 'st_transform' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

    • it looks like the data is not loaded correctly, ie without sf library loaded?
    • or an issue with the villages.gpkg file, try recreating the file (done in D1S1)
      Image Not Showing Possible Reasons
      • The image file may be corrupted
      • The server hosting the image is unavailable
      • The image path is incorrect
      • The image format is not supported
      Learn More →
  • running this: ( vil_sel <- dplyr::filter(basins, HYBAS_ID == 4061080120) %>% sf::st_intersection(villages) ) ends up in the error: Error in geos_op2_geom("intersection", x, y, ) : st_crs(x) == st_crs(y) is not TRUE and I cannot run the rest

    • basins and villages do not seem to have the same CRS
    • -> look at st_crs(villages) and st_crs(basins)
      • these should be same
      • if not, run st_transform to transform one of them to the crs of the other

Note: https://proj.org/operations/projections/aea.html for exercise

  • what does st_transform do?
    • reprojection
    • for defining projection: sf::set_crs (see line 100)

Exercise 3 (D1S3)

Solutions to the exercises you can now find in https://github.com/csc-training/r-spatial-course/tree/main/Day 1

Solutions

Note for exercise 1: Use
*"+proj=aea +lat_1=13 +lat_2=22"* instead of *"+proj=aea"*

read_sf("../Data/LAO_adm1.shp") %>% 
  st_transform("+proj=aea +lat_1=13 +lat_2=22") %>% 
  plot()
admin <- read_sf("../Data/LAO_adm1.shp") %>% 
    select(NAME_1) %>% 
    st_transform(32648)

intersections <- major_basins %>% 
    select(MAIN_BAS) %>% 
    st_intersection(admin) %>% 
    mutate(area = sf::st_area(geom))

intersections

intersections %>% 
  group_by(NAME_1) %>% 
  summarise(dplyr::n())

help(st_area)
help(st_length)

intersections %>%
  st_cast("LINESTRING") %>% 
  mutate(length = st_length(geom))

Questions Day1 Session 4, D1S4

Spatial operations do not result in new geometries!

  • Would we get an error if gridcells and villages were reversed? (line 52ff)
    • no, but every value is FALSE because points cannot contain polygons

Skipped some spatial queries chunks in the course

Exercise 4 (D1S4)

Solutions to the exercises you can now find in https://github.com/csc-training/r-spatial-course/tree/main/Day 1

Questions Day 2 Session 1 (D2S1)

Yesterday there was some issue with point sizes in ggplot, today we will learn how to scale the points (use scale_size_continuous(range=c(0.2,2)))

  • Spatial autocorrelation for random points

  • Why unclass?

    • see (class( unclass(villages)))
    • villages becomes a numeric matrix instead of some (complicated) class
  • How to interpret Moran's?

    • >0.3 would be considered strong autocorrelation
    • 0.2: there is autocorrelation, but not strong, but depends on own interpretation
    • 0.5: very strong autocorrelation; weights have strong influence (no rule of thumb on how to set the weights, comes from experience)
    • -0.9: assumption: close to a checkerboard pattern
  • I got error on plot(vil_voronoi[,"Morans_p"])
    Error in h(simpleError(msg, call)) :
    error in evaluating the argument 'x' in selecting a method for function 'plot': undefined columns selected

    • try changing $ selection to using dplyr::pull function to (local_moran,5) around line 122:
    • vil_voronoi <- vil_voronoi %>% dplyr::mutate(Morans_I = local_moran$Ii, Morans_p = dplyr::pull(local_moran,5))
  • My session on the Rstudio Server seems to be jammed because I tried to run something too large. Is there a way to stop a process instead of closing the whole session?

    • find red square in top right of the chunk you are running to stop the chunk process
    • in console, there might also be a stop button, but it only exists
    • or Session > interrupt

Questions Day 2 Session 2 (D2S2)

line 68 needed to be adjusted during course: (remove scalefillmanual because of library issues in notebooks) :

# 4 clusters, with a maximum of 100 iterations
clusters <- kmeans(data, 4, iter.max = 100)

voronoi$cluster <- clusters$cluster
plot(voronoi[,"cluster"])
  • Why do I get different clusters from teacher?

    • kmeans is undeterministic, so it is a little bit different everytime it is run
  • When reasonable to scale coordinates?

    • check what happens if unscaled coordinates are used?
      • large range
      • clustering is entirely defined by coordinates , which we do not want
    • we could also eg control the strength of coordinates on the clustering results by scaling them to 050 instead of 0100
  • can the number of initial values be controlled? to not get so different values when running kmeans multiple times.

    • many algorithms run kmeans multple times
    • here we only run kmeans once, initial values are controlled by the seed set

ctrl + alt + i to create new chunk in Rmd

gather function creates a so-called long format

for errors regarding cluster must not be duplictaed: code chunks should only be run once!

Questions Day 2 Session 3 (D2S3)

  • Spatial regression, do all variables need to be converted to spatial format?

    • no, you only need the distance matrix (corresponding to weight matrix)
    • the rest can be non-spatial
  • If there is no distance matrix given, what happens?

    • it computes it itself based on the data given
  • how do SAR and GWR differ from "simple" linear models (e.g. gls) with a spatial correlation structure (correlation=…)?

  • Do we get any statistical inference if change in coefficients is significant?

    • not directly
    • but all information is available from object
    • SDF can be used to attach the statistics
  • literature suggestions?

arange(vil_id) in beginning of file, to keep the order of villages

For those interested, Jaakko Madetoja's dissertation is dealing with GWR and uncertainty in GWR models. It has a nice introduction to GWR in general, too (page 39 -> ). https://aaltodoc.aalto.fi/bitstream/handle/123456789/29575/isbn9789526078052.pdf?sequence=1&isAllowed=y


Questions Day 3 Session 1 (D3S1)

  • number of NA, counted within the boundingbox?
    • yes, empty cells in raster within boundingbox

find all colorscales: palettes_c_names

scico : scientific colorscale that do not hamper interpretation

  • How you can set same temperature scale for all plots in the stack?

    • define breaks as sequence (brk <- seq(-11,20, length.out = 10)) and add to the plot command breaks= brk
  • Benefit of using paleteer_c instead of directly using scico?

    • no benefit
    • paleteer_c is just collecting all pallettes together
      • gives common way to access them
      • makes it a bit easier to search for nice ones

Questions Day 3 Session 2 (D3S2)

project can be used with a projection (eg EPSG code) or using other raster as target (project will then reproject the raster to same extent and crs of the raster given)

always try to rather reproject vectors if you have a choice (vectors can be reprojected back, rasters not)

  • will missing values be problem with mean?

    • yes, with missing values the output will be missing value
    • adding na.rm=TRUE helps
  • How would you make a smoothing window that is determinend by the central cell?

    • ie central cell defining the impact
    • you could define own function instead of using mean
  • how does the self-made weights and NAs behave with mean?

    • same as before if na.rm is set to TRUE
    • with missing values, own mean function would need to be defined, otherwise weights would be scaled differently
  • What was the command for finding help when having cursor over some function?

    • clicking F1 brings up the help. you can also use help("function_name") or ?function_name

Some information about the scientific colour schemes, like the Turku one we used earlier: https://www.fabiocrameri.ch/colourmaps/

khroma::smooth_rainbow -a rainbow scheme that is optimized also for colorblind people, not recommended for continuous data, but ok for discrete

A list for additional reading about R and GIS: https://docs.csc.fi/apps/r-env-for-gis/#references

Questions Day 3 Session 3 (D3S3)

  • does cv refer to your fuction or of R?

    • own function, basic R does not have built-in cv function
  • merge instead of left_join?

    • yes can be used also, just more information needed

Questions Day 3 Session 4 (D3S4)

  • does here -package work with these notebooks?

    • Seems to work, after installing with install.packages("here")
  • How about puting building and testing 50:50?

    • also works, just make sure to have enough data in both training and testing dataset
  • Raster data and ggplot2; a threat or a possibility?

    • somewhere in between
    • use tmap for plotting raster and vector data together

Do not write below this line!

Select a repo