---
title: Exploratory Data Analysis Software Standards
tags: statistical-software
robots: noindex, nofollow
---
<!-- Edit the .Rmd not the .md file -->
## Exploratory Data Analysis
Exploration is a part of all data analyses, and Exploratory Data
Analysis (EDA) is not something that is entered into and exited from at
some point prior to “real” analysis. Exploratory Analyses are also not
strictly limited to *Data*, but may extend to exploration of *Models* of
those data. The category could thus equally be termed, “*Exploratory
Data and Model Analysis*”, yet we opt to utilise the standard acronym of
EDA in this document.
EDA is nevertheless somewhat different to many other categories included
within rOpenSci’s program for peer-reviewing statistical software.
Primary differences include:
- EDA software often has a strong focus upon visualization, a category
which we have otherwise explicitly excluded from the scope of the
project at the present stage.
- The assessment of EDA software requires addressing more general
questions than software in most other categories, notably including
the important question of intended audience(s).
Examples of EDA software include:
1. [`gtsummary`](https://github.com/ddsjoberg/gtsummary), a package
rejected by rOpenSci as out-of-scope, which provides
“Presentation-ready data summary and analytic result tables.”
2. The [`smartEDA` package](https://github.com/daya6489/SmartEDA) (with
accompanying [JOSS
paper](https://joss.theoj.org/papers/10.21105/joss.01509)) “for
automated exploratory data analysis”. The package, “automatically
selects the variables and performs the related descriptive
statistics. Moreover, it also analyzes the information value, the
weight of evidence, custom tables, summary statistics, and performs
graphical techniques for both numeric and categorical variables.”
This package is potentially as much a workflow package as it is a
statistical reporting package, and illustrates the ambiguity between
these two categories.
3. The [`modeLLtest`
package](https://github.com/ShanaScogin/modeLLtest) (with
accompanying [JOSS
paper](https://joss.theoj.org/papers/10.21105/joss.01542)) is “An R
Package for Unbiased Model Comparison using Cross Validation.” Its
main functionality allows different statistical models to be
compared, likely implying that this represents a kind of meta
package.
4. The [`insight` package](https://github.com/easystats/insight) (with
accompanying [JOSS
paper](https://joss.theoj.org/papers/10.21105/joss.01412)) provides
“a unified interface to access information from model objects in R,”
with a strong focus on unified and consistent reporting of
statistical results.
5. The [`arviz` software for
python](https://github.com/arviz-devs/arviz) (with accompanying
[JOSS paper](https://joss.theoj.org/papers/10.21105/joss.01143))
provides “a unified library for exploratory analysis of Bayesian
models in Python.”
6. The [`iRF` package](https://github.com/sumbose/iRF) (with
accompanying [JOSS
paper](https://joss.theoj.org/papers/10.21105/joss.01077)) enables
“extracting interactions from random forests”, yet also focusses
primarily on enabling interpretation of random forests through
reporting on interaction terms.
Click on the following link to view a demonstration [Application of
Exploratory Data Analysis
Standards](https://hackmd.io/K8F1RIhdQeuZFqMnzdqNVw).
Reflecting these considerations, the following standards are somewhat
differently structured than equivalent standards developed to date for
other categories, particularly through being more qualitative and
abstract. In particular, while documentation is an important component
of standards for all categories, clear and instructive documentation is
of paramount importance for EDA Software, and so warrants its own
sub-section within this document.
### 1 Documentation Standards
The following refer to *Primary Documentation*, meaning the main package
`README` or vignette(s), and *Secondary Documentation*, meaning
function-level documentation.
The *Primary Documentation* (`README` and/or vignette(s)) of EDA
software should:
- **EA1.0** *Identify one or more target audiences for whom the
software is intended*
- **EA1.1** *Identify the kinds of data the software is capable of
analysing (see* Kinds of Data *below).*
- **EA1.2** *Identify the kinds of questions the software is intended
to help explore.*
Important distinctions between kinds of questions include whether they
are inferential, predictive, associative, causal, or representative of
other modes of statistical enquiry.

The *Secondary Documentation* (within individual functions) of EDA
software should:
- **EA1.3** *Identify the kinds of data each function is intended to
accept as input*
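As a rough illustration of **EA1.3**, function-level documentation might state the accepted input types directly in the parameter description. The sketch below uses `roxygen2`-style comments; the function `summarise_vector()` and its behaviour are hypothetical and serve only to show the documentation pattern.

``` r
#' Summarise a numeric vector (hypothetical example)
#'
#' @param x A `numeric` or `integer` vector. Custom classes and additional
#'   attributes are ignored, and `NA` values are removed before summarising.
#' @return A named `numeric` vector with elements `mean` and `sd`.
summarise_vector <- function (x) {
    # Reject input which is not stored as numeric data
    stopifnot (is.numeric (x))
    # as.vector() on an atomic vector drops class and other attributes
    x <- as.vector (x)
    x <- x [!is.na (x)]
    c (mean = mean (x), sd = stats::sd (x))
}
```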
### 2 Input Data
A further primary difference between EDA software and software in our
other categories is that input data for other statistical software may
generally be presumed to be of one or more specific types, whereas EDA
software often accepts data of more general and varied types. EDA
software should aim
to accept and appropriately transform as many diverse kinds of input
data as possible, through addressing the following standards, considered
in terms of the two cases of input data in uni- and multi-variate form.
All of the general standards for kinds of input (G2.0 - G2.12) apply to
input data for EDA Software.
#### 2.1 Index Columns
The following standards refer to an *index column*, which is understood
to imply an explicitly named or identified column which can be used to
provide a unique index into any and all rows of that table. Index
columns ensure the universal applicability of standard table join
operations, such as those implemented via the [`dplyr`
package](https://dplyr.tidyverse.org).
- **EA2.0** *EDA Software which accepts standard tabular data and
implements or relies upon extensive table filter and join operations
should utilise an **index column** system*
- **EA2.1** *All values in an index column must be unique, and this
uniqueness should be affirmed as a pre-processing step for all input
data.*
- **EA2.2** *Index columns should be explicitly identified, either:*
- **EA2.2a** *by using an appropriate class system, or*
- **EA2.2b** *through setting an `attribute` on a table, `x`, of
`attr(x, "index") <- <index_col_name>`.*
For EDA software which either implements custom classes or explicitly
sets attributes specifying index columns, these attributes should be
used as the basis of all table join operations, and in particular:
- **EA2.3** *Table join operations should not be based on any assumed
variable or column names*
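The index-column standards above might be sketched as follows, using the attribute approach of **EA2.2b**. The helper names `set_index()` and `join_on_index()` are hypothetical, and base `merge()` stands in for whatever join routine a package actually uses.

``` r
# Hypothetical helper: declare `index_col` as the index column of `x`,
# after confirming that its values are unique (EA2.1, EA2.2b)
set_index <- function (x, index_col) {
    stopifnot (is.data.frame (x), index_col %in% names (x))
    if (anyDuplicated (x [[index_col]]) > 0L) {
        stop ("Values in index column '", index_col, "' are not unique")
    }
    attr (x, "index") <- index_col
    x
}

# Hypothetical helper: join two tables on their declared index columns,
# rather than on any assumed column names (EA2.3)
join_on_index <- function (x, y) {
    ix <- attr (x, "index")
    iy <- attr (y, "index")
    if (is.null (ix) || is.null (iy)) {
        stop ("Both tables must declare an index column")
    }
    merge (x, y, by.x = ix, by.y = iy)
}

a <- set_index (data.frame (id = 1:3, height = c (1.6, 1.7, 1.8)), "id")
b <- set_index (data.frame (key = c (3L, 1L, 2L), mass = c (80, 60, 70)), "key")
joined <- join_on_index (a, b)
```

The two tables here use different names for their index columns, yet the join succeeds because it relies only on the declared attribute.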
#### 2.2 Multi-tabular input
EDA software designed to accept multi-tabular input should:
- **EA2.4** *Use and demand an explicit class system for such input
(for example, via the [`dm`
package](https://github.com/krlmlr/dm)).*
- **EA2.5** *Ensure all individual tables follow the above standards
for Index Columns*
#### 2.3 Classes and Sub-Classes
*Classes* are understood here to be the classes which define single input
objects, while *Sub-Classes* refer to the class definitions of
components of input objects (for example, of columns of an input
`data.frame`). EDA software which is intended to receive input in
general vector formats (see *Uni-variate Input* section of [*General
Standards*](#general-standards)) should ensure that it complies with
**G2.**, so that vector input is appropriately processed regardless of
input class. An additional standard for EDA software is that,
- **EA2.6** *Routines should appropriately process vector data
regardless of additional attributes*
The following code illustrates some ways by which “metadata” defining
classes and additional attributes associated with a standard vector
object may be modified.
``` r
x <- 1:10
class (x) <- "notvector"
attr (x, "extra_attribute") <- "another attribute"
attr (x, "vector attribute") <- runif (5)
attributes (x)
#> $class
#> [1] "notvector"
#>
#> $extra_attribute
#> [1] "another attribute"
#>
#> $`vector attribute`
#> [1] 0.03521663 0.49418081 0.60129563 0.75804346 0.16073301
```
All statistical software should appropriately deal with such input data,
as exemplified by the `storage.mode()`, `length()`, and `sum()`
functions of the `base` package, which return the appropriate values
regardless of redefinition of class or additional attributes.
``` r
storage.mode (x)
#> [1] "integer"
length (x)
#> [1] 10
sum (x)
#> [1] 55
storage.mode (sum (x))
#> [1] "integer"
```
Tabular inputs in `data.frame` class may contain columns which are
themselves defined by custom classes, and which possess additional
attributes. The ability of software to accept such inputs is covered by
the *Tabular Input* section of the [*General
Standards*](#general-standards).
### 3 Analytic Algorithms
EDA software will generally not directly implement what might be
considered as statistical algorithms in their own right. Where
algorithms are implemented, the following standards apply.
- **EA3.0** *The algorithmic components of EDA Software should enable
automated extraction and/or reporting of statistics at some
sufficiently “meta” level (such as variable or model selection), for
which previous or reference implementations require manual
intervention.*
- **EA3.1** *EDA software should enable standardised comparison of
inputs, processes, models, or outputs which previous or reference
implementations otherwise only enable in some comparably
unstandardised form.*
Both of these standards also relate to the following standards for
output values, visualisation, and summary output.
### 4 Return Results / Output Data
- **EA4.0** *EDA Software should ensure all return results have types
which are consistent with input types.*
Examples of such compliance include ensuring that `sum`, `min`, or `max`
values applied to `integer`-type vectors return `integer` values.
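A minimal sketch of what type consistency can mean in practice is given below. The function `most_frequent()` is a hypothetical example: it is written so that the returned value has the same type as the input, whereas a common shortcut via `table()` silently converts the result to `character`.

``` r
# Hypothetical helper: most frequent value of a vector, returned with the
# same type as the input
most_frequent <- function (x) {
    ux <- unique (x)
    ux [which.max (tabulate (match (x, ux)))]
}

x <- c (2L, 3L, 3L, 5L)
typeof (most_frequent (x))
#> [1] "integer"
typeof (most_frequent (c ("a", "b", "a")))
#> [1] "character"
# By contrast, a shortcut via table() returns a character value
typeof (names (which.max (table (x))))
#> [1] "character"
```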
- **EA4.1** *EDA Software should implement parameters to enable
explicit control of numeric precision*
- **EA4.2** *The primary routines of EDA Software should return
objects for which default `print` and `plot` methods give sensible
results. Default `summary` methods may also be implemented.*
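One possible way of combining **EA4.1** and **EA4.2** is sketched below. The class name `eda_summary` and the parameter name `digits` are assumptions made for illustration only; the sketch stores the requested precision with the returned object so that the default `print` method can respect it.

``` r
# Hypothetical summary routine returning a classed object; `digits` is an
# assumed parameter name controlling the precision of printed output
eda_summary <- function (x, digits = 3L) {
    stopifnot (is.numeric (x))
    res <- list (mean = mean (x), sd = stats::sd (x), n = length (x))
    structure (res, digits = digits, class = "eda_summary")
}

# Default print method which formats values with the stored precision
print.eda_summary <- function (x, ...) {
    digits <- attr (x, "digits")
    cat ("n    =", x$n, "\n")
    cat ("mean =", formatC (x$mean, digits = digits, format = "f"), "\n")
    cat ("sd   =", formatC (x$sd, digits = digits, format = "f"), "\n")
    invisible (x)
}

s <- eda_summary (c (1.23456, 2.34567, 3.45678), digits = 2L)
print (s)
```

The underlying values in `s` remain at full precision; only the printed representation is rounded.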
### 5 Visualization and Summary Output
Visualization commonly represents one of the primary functions of EDA
Software, and thus visualization output is given greater consideration
in this category than in other categories in which visualization may
nevertheless play an important role. In particular, one component of
this sub-category is *Summary Output*, taken to refer to all forms of
screen-based output beyond conventional graphical output, including
tabular and other text-based forms. Standards for visualization itself
are considered in the two primary sub-categories of static and dynamic
visualization, where the latter includes interactive visualization.
Prior to these individual sub-categories, we consider a few standards
applicable to visualization in general, whether static or dynamic.
- **EA5.0** *Graphical presentation in EDA software should be as
accessible as possible or practicable. In particular, EDA software
should consider accessibility in terms of:*
- **EA5.0a** *Typeface sizes, which should default to sizes which
explicitly enhance accessibility*
- **EA5.0b** *Default colour schemes, which should be carefully
constructed to ensure accessibility.*
- **EA5.1** *Any explicit specifications of typefaces which override
default values provided through other packages (including the
`graphics` package) should consider accessibility*
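The accessibility considerations above might translate into defaults along the following lines. This is only a sketch: `plot_groups()` is a hypothetical routine, the default `cex` value is an arbitrary assumption, and the "Okabe-Ito" palette (available via `grDevices::palette.colors()` in R 4.0.0 and later) is one of several palettes designed with colour-vision deficiency in mind.

``` r
# Colour-blind-friendly qualitative palette shipped with R (>= 4.0.0)
cols <- grDevices::palette.colors (n = 4, palette = "Okabe-Ito")

# Hypothetical plotting routine with accessible defaults which users can
# override through the `cols` and `cex` parameters
plot_groups <- function (x, y, group, cols, cex = 1.2) {
    group <- as.factor (group)
    op <- graphics::par (cex = cex) # enlarge text and symbols
    on.exit (graphics::par (op))    # restore previous settings on exit
    graphics::plot (x, y, col = cols [as.integer (group)], pch = 19)
    graphics::legend ("topleft", legend = levels (group),
                      col = cols [seq_along (levels (group))], pch = 19)
    invisible (NULL)
}
```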
#### 5.1 Summary and Screen-based Output
- **EA5.2** *Screen-based output should never rely on default print
formatting of `numeric` types, but should rather use some version
of `round(., digits)`, `formatC`, `sprintf`, or similar functions
for numeric formatting according to the parameter described in*
**EA4.1**.
- **EA5.3** *Column-based summary statistics should always indicate
the `storage.mode`, `class`, or equivalent defining attribute of
each column.*
An example of compliance with the latter standard is the `print` method
for tibbles provided by the [`tibble` package](https://tibble.tidyverse.org).
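The two standards for screen-based output could be combined in a column summary such as the following sketch. The function `summarise_columns()` and its `digits` parameter are hypothetical; numeric values are formatted explicitly with `formatC()`, and each column is reported alongside its class and storage mode.

``` r
# Hypothetical column summary: one row per column, reporting class and
# storage mode, with numeric values formatted explicitly
summarise_columns <- function (x, digits = 2L) {
    stopifnot (is.data.frame (x))
    col_mean <- vapply (x, function (col) {
        if (is.numeric (col)) {
            formatC (mean (col, na.rm = TRUE), digits = digits, format = "f")
        } else {
            NA_character_
        }
    }, character (1))
    data.frame (
        column = names (x),
        class = vapply (x, function (col) class (col) [1], character (1)),
        storage_mode = vapply (x, storage.mode, character (1)),
        mean = col_mean,
        row.names = NULL,
        stringsAsFactors = FALSE
    )
}

dat <- data.frame (a = 1:3, b = c (0.1234, 2, 3), c = letters [1:3],
                   stringsAsFactors = FALSE)
summarise_columns (dat)
```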
#### 5.2 General Standards for Visualization (Static and Dynamic)
- **EA5.4** *All visualisations should ensure values are rounded
sensibly (for example, via the `pretty()` function).*
- **EA5.5** *All visualisations should include units on all axes where
such are specified or otherwise obtainable from input data or other
routines.*
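A small sketch of both visualization standards using base graphics follows. The helper `plot_with_units()` is hypothetical, and reading units from a `"units"` attribute on the input vectors is an assumption made for illustration, not a convention defined by these standards.

``` r
# Hypothetical plotting helper: axis labels include units where the input
# vectors carry an (assumed) "units" attribute
plot_with_units <- function (x, y) {
    label <- function (v, name) {
        u <- attr (v, "units")
        if (is.null (u)) name else paste0 (name, " (", u, ")")
    }
    # pretty() chooses "round" tick positions covering the data range
    xt <- pretty (x)
    yt <- pretty (y)
    graphics::plot (as.vector (x), as.vector (y),
                    xlim = range (xt), ylim = range (yt),
                    xlab = label (x, "x"), ylab = label (y, "y"),
                    axes = FALSE)
    graphics::axis (1, at = xt)
    graphics::axis (2, at = yt)
    graphics::box ()
    invisible (NULL)
}

pretty (c (0.137, 9.862))
#> [1]  0  2  4  6  8 10
```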
#### 5.3 Dynamic Visualization
Dynamic visualization routines are commonly implemented as interfaces to
`javascript` routines. Unless routines have been explicitly developed as
an internal part of an R package, standards shall not be considered to
apply to the code itself, but rather only to decisions presented as
user-controlled parameters exposed within the R environment. That said,
one standard may nevertheless be applied, which aims to maximise
inter-operability between packages.
- **EA5.6** *Any packages which internally bundle libraries used for
dynamic visualization and which are also bundled in other,
pre-existing R packages, should explain the necessity and advantage
of re-bundling that library.*
### 6 Testing
#### 6.1 Return Values
- **EA6.0** *Return values from all functions should be tested,
including tests for the following characteristics:*
- **EA6.0a** *Classes and types of objects*
- **EA6.0b** *Dimensions of tabular objects*
- **EA6.0c** *Column names (or equivalent) of tabular objects*
- **EA6.0d** *Classes or types of all columns contained within
`data.frame`-type tabular objects*
- **EA6.0e** *Values of single-valued objects; for `numeric`
values either using `testthat::expect_equal()` or equivalent
with a defined value for the `tolerance` parameter, or using
`round(..., digits = x)` with some defined value of `x` prior to
testing equality.*
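The characteristics listed under **EA6.0** might be tested along the following lines with the `testthat` package. The function under test, `col_means()`, is a hypothetical example, and the tolerance value is an arbitrary choice for this sketch.

``` r
library (testthat)

# Hypothetical function under test: mean of each column of a data.frame
col_means <- function (x) {
    data.frame (column = names (x),
                mean = vapply (x, mean, numeric (1)),
                row.names = NULL,
                stringsAsFactors = FALSE)
}

test_that ("col_means returns a well-formed table", {
    res <- col_means (data.frame (a = 1:4, b = c (0.1, 0.2, 0.3, 0.4)))
    expect_s3_class (res, "data.frame")                # EA6.0a
    expect_identical (dim (res), c (2L, 2L))           # EA6.0b
    expect_named (res, c ("column", "mean"))           # EA6.0c
    expect_type (res$column, "character")              # EA6.0d
    expect_type (res$mean, "double")                   # EA6.0d
    expect_equal (res$mean [2], 0.25, tolerance = 1e-8) # EA6.0e
})
```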
#### 6.2 Graphical Output
- **EA6.1** *The properties of graphical output from EDA software
should be explicitly tested, for example via the [`vdiffr`
package](https://github.com/r-lib/vdiffr) or equivalent.*
Tests for graphical output are frequently only run as part of an
extended test suite.