# Defensive Programming

[Defensive programming](https://en.wikipedia.org/wiki/Defensive_programming) is an umbrella term for various programming techniques that can help make software more robust and secure. A type of defensive programming known as *assertive programming* is concerned with catching unwanted or incorrect program inputs. Since we will likely be dealing with messy real-world data, it seems prudent to plan ahead and think about how we can catch data errors early. This note provides a brief introduction to assertive programming and introduces the R package [assertr](https://github.com/ropensci/assertr) and the Python package [Bulwark](https://github.com/zaxr/bulwark), which are designed for this kind of data verification.

## Introduction

[Assertions](https://en.wikipedia.org/wiki/Assertion_(software_development)) are used to check certain properties of variables in a piece of code at runtime. The idea of assertive programming is to employ assertions to ensure a certain level of quality of the data ingested by a program or function. A defensive approach to programming generally aims to ensure a program fails if it encounters an unexpected value, instead of continuing silently (and possibly incorrectly). For instance, we may want to abort the program when a variable ``x`` is infinite or NaN. To do so, in R we could use

```r
> stopifnot(is.finite(x))
```

and in Python

```python
>>> import math
>>> assert math.isfinite(x)
```

The ``assertr`` and ``Bulwark`` packages we'll discuss below contain more sophisticated assertion functions, but the basic idea is the same.

Some tests that come to mind for the data we'll likely encounter are:

* Test whether a feature is non-negative (e.g. *age*, *weight*, *blood pressure*).
* Test whether a column has no missing values.
* Test whether the values in a column are unique (e.g. an *ID* column).
* Test whether a presumed categorical column indeed contains no values outside the expected set.
* Test whether a numerical column has no values outside, say, three standard deviations from the mean.

Of course, we must bear in mind that even with *some* tests on the data in place, there may still be data errors **that we did not anticipate** (see also: [Murphy's Law](https://en.wikipedia.org/wiki/Murphy%27s_law)).

## R: assertr

*For a longer introduction to the assertr package, see [the vignette](https://cran.r-project.org/web/packages/assertr/vignettes/assertr.html) or the [documentation](https://docs.ropensci.org/assertr/reference/index.html).*

The ``assertr`` package includes, among others, the ``verify``, ``assert``, and ``insist`` functions, which differ slightly in how they operate on the rows and columns of a data frame. The [assertr README](https://github.com/ropensci/assertr/blob/master/README.md) gives an excellent overview of the functionality with an example on the ``mtcars`` dataset, which we'll summarise here.

Consider the ``mtcars`` dataset, available in R.

```r
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

We wish to ensure that:

* The ``mpg`` (miles per gallon) column contains only positive numbers
* The ``am`` (automatic/manual) column is binary with classes 0 and 1
* The ``mpg`` column does not contain values outside 3 standard deviations of its mean
* Each row is jointly unique between the ``mpg``, ``am``, and ``wt`` (weight) columns.

We can use the ``assertr`` library together with the ``dplyr`` chaining functionality to check these requirements as follows:

```r
> library(dplyr)
> library(assertr)
>
> mtcars %>%
+   verify(mpg > 0) %>%
+   assert(in_set(0, 1), am) %>%
+   insist(within_n_sds(3), mpg) %>%
+   assert_rows(col_concat, is_uniq, mpg, am, wt)
>
```

This example illustrates four of the functions available in the assertr library:

- ``verify`` takes a logical expression and evaluates it within the context of the dataframe (where the ``mpg`` name is known).
- ``assert`` takes a predicate function (one returning a boolean truth value) and an arbitrary number of column names. Here the ``in_set`` predicate function provided by ``assertr`` is used. The predicate function is applied to each element of the provided columns to ensure it returns ``TRUE`` for each.
- ``insist`` takes a *predicate-generating* function that creates a separate predicate function for each of the supplied columns. For instance, in the above example the ``within_n_sds(3)`` call creates a predicate function that, when applied to the ``mpg`` column, checks whether each value is within 3 standard deviations of the mean. This somewhat complex mechanism allows the user to construct predicate functions that depend on specific properties of the column they test (in this case, the bounds of the interval extending 3 standard deviations either side of the column mean).
- ``assert_rows`` takes a row reduction function, a predicate function, and an arbitrary number of columns. First the columns are selected, then the row reduction is applied, and finally the predicate function is evaluated on the reduced values, similarly to the ``assert`` function.

Each of these functions technically takes the dataframe as its first argument. In the example above we have used ``dplyr``-style chaining, which takes care of this for us.

Try changing ``within_n_sds(3)`` to ``within_n_sds(2)`` in the example above to see what happens when an assertion fails.

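
To make the predicate-generating idea behind ``insist`` more concrete, here is a minimal sketch in plain Python that uses only the standard library. It is not how ``assertr`` is implemented: the names ``within_n_sds`` and ``insist`` are borrowed purely for illustration, the list-of-dicts data layout is an assumption, and the data consists of the ``mpg`` values of the six rows shown by ``head(mtcars)``.

```python
import statistics


def within_n_sds(n):
    """Return a predicate generator (illustrative, not the assertr API)."""
    def generate_predicate(values):
        # The bounds depend on the column itself, which is why the
        # predicate can only be built once the column is known.
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample standard deviation, like R's sd()
        lower, upper = mean - n * sd, mean + n * sd

        def predicate(value):
            return lower <= value <= upper

        return predicate
    return generate_predicate


def insist(rows, predicate_generator, column):
    """Fail loudly if any value in `column` violates the generated predicate."""
    values = [row[column] for row in rows]
    predicate = predicate_generator(values)
    violations = [(i, v) for i, v in enumerate(values) if not predicate(v)]
    if violations:
        raise AssertionError(f"column {column!r} has violations: {violations}")
    return rows  # return the data unchanged so that checks can be chained


# The mpg values of the six mtcars rows printed by head(mtcars).
rows = [{"mpg": v} for v in (21.0, 21.0, 22.8, 21.4, 18.7, 18.1)]
checked = insist(rows, within_n_sds(3), "mpg")
```

Calling ``insist(rows, within_n_sds(1), "mpg")`` on the same six values raises an ``AssertionError``, because some of them lie more than one standard deviation from the mean. Returning the data from a passing check is what makes chaining possible.
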
## Python: Bulwark

*The reference documentation of the Bulwark package can be found [here](https://bulwark.readthedocs.io/en/latest/bulwark.checks.html).*

The Bulwark package for Python is inspired by the ``assertr`` package. Besides the functional checks that we can apply to a dataframe, Bulwark also supports decorators. For didactic purposes, we will replicate the above example on the ``mtcars`` dataset using both the functional and the decorator approach. In practice, either technique alone suffices. The ``mtcars`` dataset is available through the [statsmodels](https://www.statsmodels.org/stable/index.html) package. We present the Python approach as an executable script.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import bulwark.decorators as dc
import bulwark.checks as ck

from statsmodels.datasets import get_rdataset


def col_within_n_sds(df, spec):
    """ Check for "within n std" for a specific column """
    for col, n in spec.items():
        ck.has_vals_within_n_std(df[[col]], n)


def joint_unique(df, cols):
    """ Check for joint uniqueness of columns """
    tuples = list(df[list(cols)].itertuples(index=False))
    # Assert (rather than return a boolean) so that a failed check
    # actually aborts the program
    assert len(tuples) == len(set(tuples)), "rows are not jointly unique"


@dc.HasValsWithinRange({"mpg": (0, float("inf"))})
@dc.HasValsWithinSet({"am": {0, 1}})
@dc.CustomCheck(col_within_n_sds, {"mpg": 3})
@dc.CustomCheck(joint_unique, {"mpg", "am", "wt"})
def verify(df):
    # this function could of course do something more complicated with the
    # validated data. Here we simply return the input
    return df


def verify_func(df):
    ck.has_vals_within_range(df, {"mpg": (0, float("inf"))})
    ck.has_vals_within_set(df, {"am": {0, 1}})
    col_within_n_sds(df, {"mpg": 3})
    joint_unique(df, {"mpg", "am", "wt"})


def main():
    mtcars = get_rdataset("mtcars")
    mtcars = mtcars.data

    # First verify with the functional approach
    verify_func(mtcars)

    # Next verify with the decorator approach
    mtcars = verify(mtcars)


if __name__ == "__main__":
    main()
```

Try changing ``col_within_n_sds(df, {"mpg": 3})`` to ``col_within_n_sds(df, {"mpg": 2})`` in the script to see what happens when an assertion fails.

Note that the Python example required us to define two custom checks to match the checks in ``assertr``. Bulwark does have a ``has_vals_within_n_std`` check, but it can only be called on an entire data frame, which is why ``col_within_n_sds`` passes it a single-column data frame.

## Conclusion

I hope this brief note gave you some insight into defensive programming, as well as the ``assertr`` and ``Bulwark`` packages. If anything is unclear, please let me know!