# Defensive Programming
[Defensive programming](https://en.wikipedia.org/wiki/Defensive_programming)
is an umbrella term for various programming techniques that can help make
software more robust and secure. A type of defensive programming known as
*assertive programming* is concerned with catching unwanted or incorrect
program inputs. Since we will likely be dealing with real-world messy data, it
seems prudent to plan ahead and think about how we can ensure that we catch
data errors early. This note provides a brief introduction to assertive
programming and introduces the R package
[assertr](https://github.com/ropensci/assertr) and the Python package
[Bulwark](https://github.com/zaxr/bulwark), which are designed for this kind
of data verification.
## Introduction
[Assertions](https://en.wikipedia.org/wiki/Assertion_(software_development))
are used to check certain properties of variables in a piece of code at
runtime. The idea of assertive programming is to employ assertions to ensure a
certain level of quality of the data ingested by a program or function. A
defensive approach to programming generally aims to ensure a program fails if
it encounters an unexpected value, instead of continuing silently (and
possibly incorrectly).
For instance, we may want to abort the program when a variable ``x`` is
infinite or NaN. To do so, in R we could use
```r
> stopifnot(is.finite(x))
```
and in Python
```python
>>> import math
>>> assert math.isfinite(x)
```
The ``assertr`` and ``Bulwark`` packages we'll discuss below contain more
sophisticated assertion functions, but the basic idea is the same.
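One caveat on the Python side: ``assert`` statements are skipped when the interpreter runs with the ``-O`` flag, so a check that must always run can be written with an explicit ``raise`` instead. The following is a minimal sketch of that idea; ``require_finite`` is a hypothetical helper name introduced here for illustration, not part of any package discussed in this note.

```python
import math


def require_finite(x: float) -> float:
    """Return x unchanged, or raise ValueError if it is infinite or NaN.

    Hypothetical helper: unlike a bare ``assert``, this check is not
    removed when Python is started with the -O flag.
    """
    if not math.isfinite(x):
        raise ValueError(f"expected a finite number, got {x!r}")
    return x


# A finite value passes through untouched.
value = require_finite(3.14)
```

Calling ``require_finite(float("nan"))`` or ``require_finite(float("inf"))`` raises a ``ValueError`` rather than letting the bad value propagate.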
Some tests that come to mind for the data we'll likely encounter are:
* Test whether a feature is non-negative (e.g. *age*, *weight*, *blood
pressure*).
* Test whether a column has no missing values.
* Test whether a column is unique (e.g. an *ID* column).
* Test whether a presumed categorical column indeed contains no values outside
the expected set.
* Test whether a numerical column has no values outside, say, three standard
deviations from the mean.
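As a rough illustration, the checks listed above could be written with plain ``assert`` statements on a pandas data frame. This is only a sketch: the function name ``check_frame`` and the column names ``id``, ``age``, ``weight``, and ``sex`` are assumptions made up for this example.

```python
import pandas as pd


def check_frame(df: pd.DataFrame, n_sds: float = 3.0) -> None:
    """Raise AssertionError on the first violated expectation.

    Column names are hypothetical and chosen only for this example.
    """
    # Non-negative feature.
    assert (df["age"] >= 0).all(), "age contains negative values"
    # No missing values in a column.
    assert df["weight"].notna().all(), "weight contains missing values"
    # Unique identifier column.
    assert df["id"].is_unique, "id contains duplicates"
    # Categorical column restricted to an expected set.
    assert df["sex"].isin({"F", "M"}).all(), "sex contains unexpected values"
    # Numerical column within n_sds standard deviations of its mean.
    z = (df["weight"] - df["weight"].mean()) / df["weight"].std()
    assert (z.abs() <= n_sds).all(), "weight contains outliers"


clean = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [34, 51, 27, 68],
    "weight": [70.5, 82.0, 64.3, 77.1],
    "sex": ["F", "M", "F", "M"],
})
check_frame(clean)  # all expectations hold, so nothing is raised
```

Keep in mind that the standard deviation check is of limited use on very small samples, where no single value can lie far from the mean in units of the sample standard deviation.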
Of course, we must bear in mind that just because we have *some* tests on the
data in place, there may still be data errors **that we did not anticipate** (see
also: [Murphy's Law](https://en.wikipedia.org/wiki/Murphy%27s_law)).
## R: assertr
*For a longer introduction to the assertr package, see [the
vignette](https://cran.r-project.org/web/packages/assertr/vignettes/assertr.html)
or the
[documentation](https://docs.ropensci.org/assertr/reference/index.html).*
The ``assertr`` package includes, among others, the ``verify``, ``assert``,
and ``insist`` functions, which have slight differences in how they operate on
the rows and columns of a data frame. The [assertr
README](https://github.com/ropensci/assertr/blob/master/README.md) gives an
excellent overview of the functionality with an example on the ``mtcars``
dataset, which we'll summarise here.
Consider the ``mtcars`` dataset, available in R.
```r
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```
We wish to ensure that:
* The ``mpg`` (miles per gallon) column contains only positive numbers.
* The ``am`` (automatic/manual) column contains only the values 0 and 1.
* The ``mpg`` column does not contain values outside 3 standard deviations of
its mean.
* Each row is jointly unique between the ``mpg``, ``am``, and ``wt`` (weight)
columns.
We can use the ``assertr`` library together with the ``dplyr`` chaining
functionality to check these requirements as follows:
```r
> library(dplyr)
> library(assertr)
>
> mtcars %>%
+   verify(mpg > 0) %>%
+   assert(in_set(0, 1), am) %>%
+   insist(within_n_sds(3), mpg) %>%
+   assert_rows(col_concat, is_uniq, mpg, am, wt)
>
```
This example illustrates four of the functions available in the assertr
library:
- ``verify`` takes a logical expression and evaluates it within the context of
the dataframe (where the ``mpg`` name is known).
- ``assert`` takes a predicate function, one returning a boolean truth value,
and an arbitrary number of column names. Here the ``in_set`` predicate
function provided by ``assertr`` is used. The predicate function is applied
to each element of the provided columns to ensure it returns ``TRUE`` for
each.
- ``insist`` takes a *predicate-generating* function that creates a separate
predicate function for each of the supplied columns. For instance, in the
above example the ``within_n_sds(3)`` call creates a predicate function
that, when applied to the ``mpg`` column, checks whether each value is
within 3 standard deviations of the column mean. This somewhat complex mechanism
allows the user to construct predicate functions that depend on specific
properties of the column they test (in this case, the column's mean and
standard deviation, which fix the bounds of the 3-sigma interval).
- ``assert_rows`` takes a row reduction function, a predicate function, and an
arbitrary number of columns. First the columns are selected, then the row
reduction is applied, and finally the predicate function is evaluated
similarly to the ``assert`` function.
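The difference between a predicate and a *predicate-generating* function may be easier to see in code. The following plain-Python sketch mimics the idea behind ``insist`` and ``within_n_sds``; it is an illustration under simplifying assumptions (columns are plain lists, the sample standard deviation is used), not a description of how ``assertr`` is implemented.

```python
from statistics import mean, stdev


def within_n_sds(n):
    """Return a predicate generator: given a column, build a predicate."""
    def make_predicate(column):
        # The bounds depend on the column itself, which is why a plain
        # predicate (a function of a single value) is not enough.
        mu, sigma = mean(column), stdev(column)
        lower, upper = mu - n * sigma, mu + n * sigma

        def predicate(value):
            return lower <= value <= upper

        return predicate
    return make_predicate


def insist(make_predicate, column):
    """Build the predicate for this column, then apply it to each value."""
    predicate = make_predicate(column)
    violations = [v for v in column if not predicate(v)]
    if violations:
        raise AssertionError(f"values outside bounds: {violations}")
    return column


# The first six mpg values from the table above.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1]
insist(within_n_sds(3), mpg)  # passes: every value is within 3 sds
```

With a tighter bound such as ``within_n_sds(1)``, the same call raises an ``AssertionError`` on this sample, because some values lie more than one standard deviation from the mean.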
Each of these functions technically takes the dataframe as its first
argument. In the example above we have used ``dplyr`` style chaining, which
takes care of this for us.
Try changing ``within_n_sds(3)`` to ``within_n_sds(2)`` in the example above
to see what happens when an assertion fails.
## Python: Bulwark
*The reference documentation of the Bulwark package can be found
[here](https://bulwark.readthedocs.io/en/latest/bulwark.checks.html).*
The Bulwark package for Python is inspired by the ``assertr`` package. Besides
the functional checks that we can apply on a dataframe, Bulwark also has
support for using decorators. We will replicate the above example on the
``mtcars`` dataset using Bulwark with both the functional and decorator
approaches, for didactic purposes. In practice, either technique on its own
suffices. The ``mtcars`` dataset can be loaded through the
[statsmodels](https://www.statsmodels.org/stable/index.html) package. We
present the Python approach as an executable script.
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import bulwark.checks as ck
import bulwark.decorators as dc
from statsmodels.datasets import get_rdataset


def col_within_n_sds(df, spec):
    """Check "within n std" for specific columns; spec maps column to n."""
    for col, n in spec.items():
        # The Bulwark check looks at the whole frame it is given, so we
        # pass a single-column selection.
        ck.has_vals_within_n_std(df[[col]], n)


def joint_unique(df, cols):
    """Check for joint uniqueness of columns."""
    tuples = list(df[list(cols)].itertuples(index=False))
    # Raise on failure (rather than return a bool) so that the result
    # cannot be silently ignored by the caller.
    assert len(tuples) == len(set(tuples)), "rows are not jointly unique"


@dc.HasValsWithinRange({"mpg": (0, float("inf"))})
@dc.HasValsWithinSet({"am": {0, 1}})
@dc.CustomCheck(col_within_n_sds, {"mpg": 3})
@dc.CustomCheck(joint_unique, {"mpg", "am", "wt"})
def verify(df):
    # This function could of course do something more complicated with the
    # validated data. Here we simply return the input.
    return df


def verify_func(df):
    ck.has_vals_within_range(df, {"mpg": (0, float("inf"))})
    ck.has_vals_within_set(df, {"am": {0, 1}})
    col_within_n_sds(df, {"mpg": 3})
    joint_unique(df, {"mpg", "am", "wt"})


def main():
    mtcars = get_rdataset("mtcars")
    mtcars = mtcars.data

    # First verify with the functional approach
    verify_func(mtcars)

    # Next verify with the decorator approach
    mtcars = verify(mtcars)


if __name__ == "__main__":
    main()
```
Try changing ``col_within_n_sds(df, {"mpg": 3})`` to
``col_within_n_sds(df, {"mpg": 2})`` in the script to see what happens when an
assertion fails.
Note that the Python example required us to define two custom checks to match
the checks in ``assertr``. Bulwark does provide ``has_vals_within_n_std``, but
this check is applied to the entire data frame it receives, which is why the
script passes it a single-column selection.
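As an aside, joint uniqueness can also be checked with pandas alone, using ``DataFrame.duplicated`` on a subset of columns. The sketch below is one possible way to write such a check; ``assert_jointly_unique`` is a hypothetical name, and the small data frame is made up for the example.

```python
import pandas as pd


def assert_jointly_unique(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Raise AssertionError if two rows agree on all of the given columns."""
    # keep=False marks every member of a duplicated group, which makes the
    # error message more informative than flagging only the later copies.
    dupes = df[df.duplicated(subset=cols, keep=False)]
    assert dupes.empty, f"{len(dupes)} rows are not unique in {cols}"
    return df


# Made-up example data: the rows differ in at least one of the columns.
cars = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8],
    "am": [1, 1, 1],
    "wt": [2.620, 2.875, 2.320],
})
assert_jointly_unique(cars, ["mpg", "am", "wt"])
```

Because the function returns the data frame on success, it can be used in a method chain via ``df.pipe(assert_jointly_unique, ["mpg", "am", "wt"])``.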
## Conclusion
I hope this brief note gave you some insight into defensive programming, as
well as the ``assertr`` and ``Bulwark`` packages. If anything is unclear,
please let me know!