Spatial Software Standards

--- title: Spatial Software Standards tags: statistical-software robots: noindex, nofollow ---  ## Spatial Software Standards for spatial software begin with a consideration and standardisation of domains of applicability. Following that we proceed to standards according to which spatial software is presumed to perform one or more of the following steps: 1. Accept and validate input data 2. Apply one or more analytic algorithms 3. Return the result of that algorithmic application 4. Offer additional functionality such as printing or summarising return results 5. Testing Each standard for spatial software is prefixed with “**SP**”. ### 1 Spatial Domains Many developers of spatial software in R, including many of those those featured on the CRAN Task view on [“Analysis of Spatial Data”](https://cran.r-project.org/web/views/Spatial.html), have been primarily focussed on geographic data; that is, data quantifying positions, structures, and relationships on the Earth and other planets. Spatial analyses are nevertheless both broader and more general than geography alone. In particular, spatial software may be *geometric* – that is, concerned with positions, structures, and relationships in space in any general or specific sense, not necessarily confined to geographic systems alone. It is important to distinguish these two domains because many algorithms and procedures devised in one of these two domains are not necessarily (directly) applicable in the other, most commonly because geometric algorithms presume space to be rectilinear or Cartesian, while geographic algorithms (generally) presume it be have a specific curvilinear form (commonly spherical or elliptical). Algorithms designed for Cartesian space may not be directly applicable in curvilinear space, and vice-versa. Moreover, spatial software and algorithms might be intended to apply in spaces of arbitrary dimensionality. The phrase “Cartesian” refers to any space of arbitrary dimensionality in which all dimensions are orthogonal and described by straight lines; dimensions in a curvilinear space or arbitrary dimensionality are described by curved lines. A planar geometry is a two-dimensional Cartesian space; a spherical geometry is a two- (or maybe three-)dimensional curvilinear space. One of the earliest and still most widely used R spatial packages, [`spatstat`](https://cran.r-project.org/web/packages/spatstat/) (first released 2002), describes itself as, “\[f\]ocused mainly on two-dimensional point patterns, including multitype/marked points, in any spatial region.” Routines from this package are thus generally applicable to two-dimensional Cartesian data only, even through the final phrase might be interpreted to indicate a comprehensive generality. `spatstat` routines may not necessarily give accurate results when applied in curvilinear space. These considerations motivate the first standard for spatial software: - **SP1.0** *Spatial software should explicitly indicate its domain of applicability, and in particular distinguish whether the software may be applied in Cartesian/rectilinear/geometric domains, curvilinear/geographic domains, or both.* We encourage the use of clear and unambiguous phrases such as “planar”, “spherical”, “Cartesian”, “rectilinear” or “curvilinear”, along with clear indications of dimensionality such as “two-” or “three-dimensional.” Concepts of dimensionality should be interpreted to refer explicitly to the dimensionality of independent spatial coordinates. Elevation is a third spatial dimension, and time may also be considered an additional dimension. Beyond those two, other attributes measured at spatial locations do not represent additional dimensions. - **SP1.1** *Spatial software should explicitly indicate its dimensional domain of applicability, in particular through identifying whether it is applicable to two or three dimensions only, or whether there are any other restrictions on dimensionality.* These considerations of domains of applicability permeate much of the ensuring standards, which distinguish “geometric software” from “geographic software”, where these phrases are to be interpreted as shorthand references to software intended for use in the respective domains. ### 2 Input data structures and validation Input validation is an important software task, and an important part of our standards. While there are many ways to approach validation, the class systems of R offer a particularly convenient and effective means. For Spatial Software in particular, a range of class systems have been developed, for which we refer to the CRAN Task view on [“Analysis of Spatial Data”](https://cran.r-project.org/web/views/Spatial.html). Software which uses and relies on defined classes can often validate input through affirming appropriate class(es). Software which does not use or rely on class systems will generally need specific routines to validate input data structures. As for our standards for [Time-Series Software](https://ropenscilabs.github.io/statistical-software-review-book/standards.html#time-series-software), these standards for Spatial Software also suggest that software should use explicit class systems designed and intended for spatial data. New packages may implement new class systems for spatial data, and these may even be as simple as appending a class attribute to a matrix of coordinates. The primary motivation of the following standard is nevertheless to encourage and enhance inter-operability with the rich system of classes for spatial data in R. - **SP2.0** *Spatial software should only accept input data of one or more classes explicitly developed to represent such data.* - **SP2.0a** *Where new classes are implemented, conversion to other common classes for spatial data in R should be documented.* - **SP2.0b** *Class systems should ensure that functions error appropriately, rather than merely warning, in response to data from inappropriate spatial domains.* **Spatial Workflows, Packages, and Classes** Spatial software encompasses an enormous diversity, yet workflows implemented by spatial software often share much in common. In particular, coordinate reference systems used to precisely relate pairs of coordinates to precise locations in a curvilinear space, and in particular to the Earth’s ellipsoid, need to be able to be compared and transformed regardless of the specificities of individual software. This ubiquitous need has fostered the development of the [`PROJ` library](https://proj.org/) for representing and transforming spatial coordinates. Several other libraries have been built on top or or alongside that, notably including the [`GDAL` (“Geospatial Data Abstraction Library”)](https://gdal.org) and [`GEOS` (“Geometry Engine, Open Source”)](https://trac.osgeo.org/geos/) libraries. These libraries are used by, and integrated within, most geographical spatial software commonly used today, and will likely continue to be used. While not a standard in itself, it is expected that spatial software should not, absent very convincing and explicit justification, attempt to reconstruct aspects of these generic libraries. Given that, the following standards aim to ensure that spatial software remains as compatible as possible with workflows established by preceding packages which have aimed to expose and integrate as much of the functionality of these generic libraries as possible. The use of specific class systems for spatial data, and the workflows encapsulated in associated packages, ensures maximal ongoing compatibility with these libraries and with spatial workflows in general. Notable class systems and associated packages in R include [`sp`](https://cran.r-project.org/package=sp), [`sf`](https://cran.r-project.org/package=sf), and [`raster`](https://rspatial.org/raster/), and more recent extensions such as [`stars`](https://cran.r-project.org/package=stars), [`terra`](https://rspatial.org/terra), and [`s2`](https://r-spatial.github.io/s2/). With regard to these packages, the following single standard applies, because the maintainer of sp has made it clear that new software [should build upon sf, not sp](https://github.com/r-spatial/discuss/issues/48#issuecomment-798543339). - **SP2.1** *Spatial Software should not use the [`sp` package](https://cran.r-project.org/package=sp), rather should use [`sf`](https://cran.r-project.org/package=sf).* More generally, - **SP2.2** *Geographical Spatial Software should ensure maximal compatibility with established packages and workflows, minimally through:* - **SP2.2a** *Clear and extensive documentation demonstrating how routines from that software may be embedded within, or otherwise adapted to, workflows which rely on these established packages; and* - **SP2.2b** *Tests which clearly demonstrate that routines from that software may be successfully translated into forms and workflows which rely on these established packages.* This standard is further refined in a number of subsequent standards concerning documentation and testing. - **SP2.3** *Software which accepts spatial input data in any standard format established in other R packages (such as any of the formats able to be read by [`GDAL`](https://gdal.org), and therefore by the [`sf` package](https://cran.r-project.org/package=sf)) should include example and test code which load those data in spatial formats, rather than R-specific binary formats such as `.Rds`.* See the `sf` vignette on [“*Reading, Writing and Converting Simple Features*”](https://cran.r-project.org/web/packages/sf/vignettes/sf2.html) for useful examples. **Coordinate Reference Systems** As described above, one of the primary reasons for the development of classes in Spatial Software is to represent the coordinate reference systems in which data are represented, and to ensure compatibility with the [`PROJ` system](https://proj.org/) and other generic spatial libraries. The [`PROJ`](https://proj.org/) standards and associated software library have been recently (2020) updated (to version number 7) with “breaking changes” that are not backwards-compatible with previous versions, and in particular with the long-standing version 4. The details and implications of these changes within the context of spatial software in R can be examined in [this blog entry](https://www.r-spatial.org//r/2020/03/17/wkt.html) on [`r-spatial.org`](https://r-spatial.org), and in [this vignette](https://cran.r-project.org/web/packages/rgdal/vignettes/PROJ6_GDAL3.html) for the [`rgdal` package](https://cran.r-project.org/web/packages/rgdal/). The “breaking” nature of these updates partly reflects analogous “breaking changes” associated with updates in the [“Well-Known Text” (WKT)](http://docs.opengeospatial.org/is/12-063r5/12-063r5.html) system for representing coordinate reference systems. The following standard applies to software which directly or indirectly relies on geographic data which uses or relies upon coordinate reference systems. - **SP2.4** *Geographical Spatial Software should be compliant with version 6 or larger of* [`PROJ`](https://proj.org/), *and with* `WKT2` *representations. The primary implication, described in detail in the articles linked to above, is that:* - **SP2.4a** *Software should not permit coordinate reference systems to be represented merely by so-called “PROJ4-strings”, but should use at least WKT2.* **General Input Structures** New spatial software may nevertheless eschew these prior packages and classes in favour of implementing new classes. Whether or not prior classes are used or expected, geographic software should accord as much as possible with the principles of these prior systems by according with the following standards: - **SP2.5** *Class systems for input data must contain meta data on associated coordinate reference systems.* - **SP2.5a** *Software which implements new classes to input spatial data (or the spatial components of more general data) should provide an ability to convert such input objects into alternative spatial classes such as those listed above.* - **SP2.6** *Spatial Software should explicitly document the types and classes of input data able to be passed to each function.* - **SP2.7** *Spatial Software should implement validation routines to confirm that inputs are of acceptable classes (or represented in otherwise appropriate ways for software which does not use class systems).* - **SP2.8** *Spatial Software should implement a single pre-processing routine to validate input data, and to appropriately transform it to a single uniform type to be passed to all subsequent data-processing functions.* - **SP2.9** *The pre-processing function described above should maintain those metadata attributes of input data which are relevant or important to core algorithms or return values.* ### 3 Algorithms The following standards will be conditionally applicable to some but not all spatial software. Procedures for standards deemed not applicable to a particular piece of software are described in the [`srr` package](https://github.com/ropenscilabs/srr). - **SP3.0** *Spatial software which considers spatial neighbours should enable user control over neighbourhood forms and sizes. In particular:* - **SP3.0a** *Neighbours (able to be expressed) on regular grids should be able to be considered in both rectangular only, or rectangular and diagonal (respectively “rook” and “queen” by analogy to chess).* - **SP3.0b** *Neighbourhoods in irregular spaces should be minimally able to be controlled via an integer number of neighbours, an area (or equivalent distance defining an area) in which to include neighbours, or otherwise equivalent user-controlled value.* - **SP3.1** *Spatial software which considers spatial neighbours should wherever possible enable neighbour contributions to be weighted by distance (or other continuous weighting variable), and not rely exclusively on a uniform-weight rectangular cut-off.* - **SP3.2** *Spatial software which relies on sampling from input data (even if only of spatial coordinates) should enable sampling procedures to be based on local spatial densities of those input data.* An example of software which would *not* adhere to **SP3.2** would be where input data were a simple matrix of spatial coordinates, and sampling were implemented using the [`sample()` function](https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html) to randomly select elements of those input data (like `sample(nrow(xy), n)`). In the context of an example based on the [`sample()` function](https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html), adhering to the standard would require including an additional `prob` vector where each point was weighted by the local density of surrounding points. Doing so would lead to higher probabilities of samples being taken from central clusters of higher densities than from outlying extreme points. Note that the standard merely suggests that software should *enable* such density-based samples to be taken, not that it must, or even necessarily should by default. Algorithms for spatial software are often related to other categories of statistical software, and it is anticipated that spatial software will commonly also be subject to standards from these other categories. Nevertheless, because spatial analyses frequently face unique challenges, some of these category-specific standards also have extension standards when applied to spatial software. The following standards will be applicable for any spatial software which also fits any of the other listed categories of statistical software. **Regression Software** - **SP3.3** *Spatial regression software should explicitly quantify and distinguish autocovariant or autoregressive processes from those covariant or regressive processes not directly related to spatial structure alone.* **Unsupervised Learning Software** The following standard applies to any spatial unsupervised learning software which uses clustering algorithms. - **SP3.4** *Where possible, spatial clustering software should avoid using standard non-spatial clustering algorithms in which spatial proximity is merely represented by an additional weighting factor in favour of explicitly spatial algorithms.* **Machine Learning Software** One common application in which machine learning algorithms are applied to spatial software is in analyses of raster images. The first of the following standards applies because the individual cells or pixels of these raster images represent fixed spatial coordinates. (This standard also renders **ML2.1** inapplicable). - **SP3.5** *Spatial machine learning software should ensure that broadcasting procedures for reconciling inputs of different dimensions are **not** applied*. A definition of broadcasting is given at the end of the introduction to corresponding [Machine Learning Standards](file:///data/mega/code/repos/ropensci-stats/statistical-software-review-book/docs/standards.html#machine-learning-software), just above *Input Data Specification*. - **SP3.6** *Spatial machine learning software should ensure that test and training data are generated using sampling procedures appropriate to the domain or intended use of that software.* - **SP3.6a** *The effects of generating test and training data using inappropriate sampling procedures should be documented and/or tested.* We note that there are no comparable *General Standard* for [*Machine Learning Software*](#ml-standards), but that such distinction is particularly important for spatial machine learning software because it is frequently inappropriate to distinguish test and training data by taking samples from the same spatial region. One common method employed to generate distinct test and training data is spatial partitioning \[@muenchow\_chapter\_nodate;@brenning\_spatial\_2012;@schratz\_hyperparameter\_2019;@valavi\_blockcv\_2019\]. There may nevertheless be cases in which such sampling from a common spatial region is appropriate, for example for software intended to analyse or model temporally-structured spatial data for which a more appropriate distinction might be temporal rather than spatial. Adherence to this standard merely requires that the distinction between test and training data be explicitly considered and documented (and possibly tested as well). ### 4 Return Results For (functions within) Spatial Software which return spatial data: - **SP4.0** *Return values should either:* - **SP4.0a** *Be in same class as input data, or* - **SP4.0b** *Be in a unique, preferably class-defined, format.* - **SP4.1** *Any aspects of input data which are included in output data (either directly, or in some transformed form) and which contain units should ensure those same units are maintained in return values.* - **SP4.2** *The type and class of all return values should be explicitly documented.* ### 5 Visualization Spatial Software which returns objects in a custom class structure explicitly designed to represent or include spatial data should: - **SP5.0** *Implement default `plot` methods for any implemented class system.* - **SP5.1** *Implement appropriate placement of variables along x- and y-axes.* - **SP5.2** *Ensure that axis labels include appropriate units.* An example of **SP5.1** might be ensuring that longitude is placed on the x-axis, latitude on the y, although standard orientations may depend on coordinate reference systems and other aspects of data and software design. The preceding three standards will generally not apply to software which returns objects in a custom class structure yet which is not inherently spatial. Spatial Software which returns objects with geographical coordinates should: - **SP5.3** *Offer an ability to generate interactive (generally `html`-based) visualisations of results.* ### 6 Testing The following standards apply to all Spatial Software which is intended or able to be applied to data represented in curvilinear systems, notably including all geographical data. The only Spatial Software to which the following standards do not (necessarily) apply would be software explicitly intended to be applied exclusively to Cartesian spatial data, and which ensured appropriate rejection of curvilinear data according to **SP2.0b**. **Round-Trip Tests** - **SP6.0** *Software which implements routines for transforming coordinates of input data should include tests which demonstrate ability to recover the original coordinates.* This standard is applicable to any software which implements any routines for coordinate transformations, even if those routines are implemented via [`PROJ`](https://proj.org). Conversely, software which has no routines for coordinate transformations need not adhere to **SP6.0**, even if that software relies on `PROJ` for other purposes. - **SP6.1** *All functions which can be applied to both Cartesian and curvilinear data should be tested through application to both.* - **SP6.1a** *Functions which may yield inaccurate results when applied to data in one or the other forms (such as the preceding examples of centroids and buffers from ellipsoidal data) should test that results from inappropriate application of those functions are indeed less accurate.* - **SP6.1b** *Functions which yield accurate results regardless of whether input data are rectilinear or curvilinear should demonstrate equivalent accuracy in both cases, and should also demonstrate how equivalent results may be obtained through first explicitly transforming input data.* **Extreme Geographical Coordinates** - **SP6.2** *Geographical Software should include tests with extreme geographical coordinates, minimally including extension to polar extremes of +/-90 degrees.* While such tests should generally confirm that software generates reliable results to such extreme coordinates, software which is unable to generate reliable results to such inputs should nevertheless include tests to indicate both approximate bounds of reliability, and the expected characteristics of unreliable results. The remaining standards for testing Spatial Software extend directly from the preceding Algorithmic Standards (**SP3**), with the same sub-section headings used here. - **SP6.3** *Spatial Software which considers spatial neighbours should explicitly test all possible ways of defining them, and should explicitly compare quantitative effects of different ways of defining neighbours.* - **SP6.4** *Spatial Software which considers spatial neighbours should explicitly test effects of different schemes to weight neighbours by spatial proximity.* **Unsupervised Learning Software** - **SP6.5** *Spatial Unsupervised Learning Software which uses clustering algorithms should implement tests which explicitly compare results with equivalent results obtained with a non-spatial clustering algorithm.* **Machine Learning Software** - **SP6.6** \*Spatial Machine Learning Software should implement tests which explicitly demonstrate the detrimental consequences of sampling test and training data from the same spatial region, rather than from spatially distinct regions.