---
title: Machine Learning Software Standards
tags: statistical-software
robots: noindex, nofollow
---
<!-- Edit the .Rmd not the .md file -->
## Machine Learning Software
R has an extensive and diverse ecosystem of Machine Learning (ML)
software which is very well described in the corresponding [CRAN Task
View](https://cran.r-project.org/web/views/MachineLearning.html). Unlike
most other categories of statistical software considered here, the
primary distinguishing feature of ML software is not (necessarily or
directly) algorithmic, but rather pertains to a *workflow* typical of
machine learning tasks. In particular, we consider ML software to
approach data analysis via the two primary steps of:
1. Passing a set of *training* data to an algorithm in order to
generate a candidate mapping between that data and some form of
pre-specified output or response variable. Such mappings will be
referred to here as “models”, with a single analysis of a single set
of training data generating one model.
2. Passing a set of test data to the model(s) generated by the first
step in order to derive some measure of predictive accuracy for that
model.
A single ML task generally yields two distinct outputs:
1. The model derived in the first of the previous steps; and
2. Associated statistics of model performance, as evaluated within the
context of the test data used to assess that performance.
Click on the following link to view a demonstration [Application of
Machine Learning Software
Standards](https://hackmd.io/Ix1YwD8YTWGuzdiXsVQadA).
**A Machine Learning Workflow**
Given those initial considerations, we now attempt the difficult task of
envisioning a typical standard workflow for inherently diverse ML
software. The following workflow ought to be considered an “extensive”
workflow, with shorter versions, and correspondingly more restricted
sets of standards, possible dependent upon envisioned areas of
application. For example, the workflow presumes input data to be too
large to be stored as a single entity in local memory. Adaptation to
situations in which all training data can be loaded into memory may mean
that some of the following workflow stages, and therefore corresponding
standards, may not apply.
Just as typical workflows are potentially very diverse, so are outputs
of ML software, which depend on areas of application and intended
purpose of software. The following refers to the “desired output” of ML
software, a phrase which is intentionally left non-specific, but which
is intended to connote any and all forms of “response variable” and
other “pre-specified outputs” such as categorical labels or validation
data, along with outputs which may not necessarily be able to be
pre-specified in simple uni- or multi-variate form, such as measures of
distance between sets of training and validation data.
Such “desired outputs” are presumed to be quantified in terms of a
“loss” or “cost” function (hereafter, simply “loss function”)
quantifying some measure of distance between a model estimate (resulting
from applying the model to one or more components of a training data
set) and a pre-defined “valid” output (during training), or a test data
set (following training).
Given the foregoing considerations, we consider a typical ML workflow to
progress through (at least some of) the following steps:
1. ***Input Data Specification*** Obtain a local copy of input data,
often as multiple *objects* (either on-disk or in memory) in some
suitably structured form such as in a series of sub-directories or
accompanied by additional data defining the structural properties of
input objects. Regardless of form, multiple objects are commonly
given generic labels which distinguish between `training` and `test`
data, along with optional additional categories and labels such as
`validation` data used, for example, to determine accuracy of models
applied to training data yet prior to testing.
2. ***Pre-Processing*** Define transformations of input data, including
but not restricted to, broadcasting dimensions (as defined below)
and standardising data ranges (typically to defined values of mean
and standard deviation).
3. ***Model and Algorithm Specification*** Specify the model and
associated processes which will be applied to map the input data on
to the desired output. This step minimally includes the following
distinct stages (generally in no particular order):
1. Specify the kind of model which will be applied to the training
data. ML software often allows the use of pre-trained models, in
which case this step includes downloading or otherwise
obtaining a pre-trained model, along with specification of which
aspects of those models are to be modified through application
to a particular set of training and validation data.
2. Specify the kind of algorithm which will be used to explore the
search space (for example some kind of gradient descent
algorithm), along with parameters controlling how that algorithm
will be applied (for example a learning rate, as defined below).
3. Specify the kind of loss function which will be used to quantify
distance between model estimates and desired output.
4. ***Model Training*** Apply the specified model to the training data
to generate a series of estimates from the specified loss function.
This stage may also include specifying parameters such as stopping
or exit criteria, and parameters controlling batch processing of
input data. Moreover, this stage may involve retaining some of the
following additional data:
1. Potential “pre-processing” stages such as initial estimates of
optimal learning rates (see above).
2. Details of summaries of actual paths taken through the search
space towards convergence on local or global minimum.
5. ***Model Output and Performance*** Measure the performance of the
trained model when applied to the test data set, generally requiring
the specification of a metric of model performance or accuracy.
Importantly, ML workflows may be partly iterative. This may in turn
potentially confound distinctions between training and test data, and
accordingly confound expectations commonly placed upon statistical
analyses of statistical independence of response variables. ML routines
such as cross-validation repeatedly (re-)partition data between training
and test sets. Resultant models cannot then be considered to have been
developed through application to any single set of truly “independent”
data. In the context of the standards that follow, these considerations
admit a potential lack of clarity in any notional categorical
distinction between training and test data, and between model
specification and training.
The preceding workflow mentions a couple of concepts, the
interpretations of which in the context of these standards may be seen
by clicking on the corresponding items below. Following that, we proceed
to standards for ML software, enumerated and developed with reference to
the preceding workflow steps. In order that the following standards
initially adhere to the enumeration of workflow steps given above, more
general standards pertaining to aspects such as documentation and
testing are given following the initial five “workflow” standards.
<details>
<summary>
Click for a definition of *broadcasting*, referred to in Step 2, above.
</summary>
<p>
The following definition comes from a vignette for the [`rray`
package](https://github.com/r-lib/rray) named
[*Broadcasting*](https://rray.r-lib.org/articles/broadcasting.html).
- ***Broadcasting*** is, “repeating the dimensions of one object to
match the dimensions of another.”
This concept runs counter to aspects of standards in other categories,
which often suggest that functions should error when passed input
objects which do not have commensurate dimensions. Broadcasting is a
pre-processing step which enables objects with incommensurate dimensions
to be dimensionally reconciled.
The following demonstration is taken directly from the [`rray`
package](https://github.com/r-lib/rray) (which is not currently on
CRAN).
``` r
library (rray)
a <- array(c(1, 2), dim = c(2, 1))
b <- array(c(3, 4), dim = c(1, 2))
# rbind (a, b) # error!
rray_bind (a, b, .axis = 1)
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> [3,]    3    4
rray_bind (a, b, .axis = 2)
#>      [,1] [,2] [,3]
#> [1,]    1    3    4
#> [2,]    2    3    4
```
Broadcasting is commonly employed in ML software because it enables ML
operations to be implemented on objects with incommensurate dimensions.
One example is image analysis, in which training data may all be
dimensionally commensurate, yet test images may have different
dimensions. Broadcasting allows data to be submitted to ML routines
regardless of potentially incommensurate dimensions.
</p>
</details>
<details>
<summary>
Click for a definition of *learning rate*, referred to in Steps 3 and 4, above.
</summary>
<p>
- ***Learning Rate*** (generally) determines the step size used to
search for local optima as a fraction of the local gradient.
This parameter is particularly important for training ML algorithms like
neural networks, the results of which can be very sensitive to
variations in learning rates. A useful overview of the importance of
learning rates, and a useful approach to automatically determining
appropriate values, is given in [this blog
post](https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html).
</p>
</details>
<br>
Partly because of its widespread and current relevance, the category of
Machine Learning software is one for which there have been other notable
attempts to develop standards. A particularly useful reference is the
[MLPerf organization](https://www.mlperf.org/) which, among other
activities, hosts several [github
repositories](https://github.com/mlperf) providing reference datasets
and benchmark conditions for comparing performance aspects of ML
software. While such reference or benchmark standards are not explicitly
referred to in the current version of the following standards, we expect
them to be gradually adapted and incorporated as we start to apply and
refine our standards in application to software submitted to our review
system.
### 1 Input Data Specification
Many of the following standards refer to the labelling of input data as
“testing” or “training” data, along with potentially additional labels
such as “validation” data. In regard to such labelling, the following
two standards apply:
- **ML1.0** *Documentation should make a clear conceptual distinction
between training and test data (even where such may ultimately be
confounded as described above.)*
- **ML1.0a** *Where these terms are ultimately eschewed, these
should nevertheless be used in initial documentation, along with
clear explanation of, and justification for, alternative
terminology.*
- **ML1.1** *Absent clear justification for alternative design
decisions, input data should be expected to be labelled “test”,
“training”, and, where applicable, “validation” data.*
- **ML1.1a** *The presence and use of these labels should be
explicitly confirmed via pre-processing steps (and tested in
accordance with **ML7.0**, below).*
- **ML1.1b** *Matches to expected labels should be
case-insensitive and based on partial matching such that, for
example, “Test”, “test”, or “testing” should all suffice.*
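
For example, the partial and case-insensitive matching required by
**ML1.1b** might be implemented by a small helper like the following
(a minimal sketch only; the function name and prefixes are illustrative,
not part of any standard):

``` r
# Minimal sketch of partial, case-insensitive matching of data labels;
# the function name and prefixes used here are purely illustrative.
match_data_label <- function (label) {
    prefixes <- c (training = "train", test = "test", validation = "valid")
    i <- which (startsWith (tolower (label), prefixes))
    if (length (i) != 1L) {
        stop ("Label '", label, "' does not unambiguously match ",
              "'training', 'test', or 'validation'.", call. = FALSE)
    }
    names (prefixes) [i]
}
match_data_label ("Test")     # "test"
match_data_label ("training") # "training"
```
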
The following three standards (**ML1.2**–**ML1.4**) represent three
possible design intentions for ML software. Only one of these three will
generally be applicable to any one piece of software, although it is
nevertheless possible that more than one of these standards may apply.
The first of these three standards applies to ML software which is
intended to process, or capable of processing, input data as a single
(generally tabular) object.
- **ML1.2** *Training and test data sets for ML software should be
able to be input as a single, generally tabular, data object, with
the training and test data distinguished either by*
- *A specified variable containing, for example, `TRUE`/`FALSE` or
`0`/`1` values, or which uses some other system such as missing
(`NA`) values to denote test data; and/or*
- *An additional parameter designating case or row numbers, or
labels of test data.*
The second of these three standards applies to ML software which is
intended to process, or capable of processing, input data represented as
multiple objects which exist in local memory.
- **ML1.3** *Input data should be clearly partitioned between training
and test data (for example, through having each passed as a distinct
`list` item), or should enable an additional means of categorically
distinguishing training from test data (such as via an additional
parameter which provides explicit labels). Where applicable,
distinction of validation and any other data should also accord with
this standard.*
The third of these three standards for data input applies to ML software
for which data are expected to be input as references to multiple
external objects, generally expected to be read from either local or
remote connections.
- **ML1.4** *Training and test data sets, along with other necessary
components such as validation data sets, should be stored in their
own distinctly labelled sub-directories (for distinct files), or
according to an explicit and distinct labelling scheme (for example,
for database connections). Labelling should in all cases adhere to
**ML1.1**, above.*
The following standard applies to all ML software regardless of the
applicability or otherwise of the preceding three standards.
- **ML1.5** *ML software should implement a single function which
summarises the contents of test and training (and other) data sets,
minimally including counts of numbers of cases, records, or files,
and potentially extending to tables or summaries of file or data
types, sizes, and other information (such as unique hashes for each
component).*
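
A function satisfying **ML1.5** might, for example, look something like
the following minimal sketch, here presuming in-memory, tabular data
(all names are hypothetical):

``` r
# Hypothetical sketch of a single data-summary function as envisioned by
# ML1.5, returning counts of cases and variables for each data set.
summarize_ml_data <- function (training, test, validation = NULL) {
    n <- function (x) if (is.null (x)) 0L else nrow (x)
    data.frame (dataset = c ("training", "test", "validation"),
                n_cases = c (n (training), n (test), n (validation)),
                n_variables = c (ncol (training), ncol (test),
                                 if (is.null (validation)) NA_integer_ else
                                     ncol (validation)))
}
summarize_ml_data (training = mtcars [1:20, ], test = mtcars [21:32, ])
```
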
#### 1.1 Missing Values
Missing data are handled differently by different ML routines, and it is
also difficult to suggest generally applicable standards for
pre-processing missing values in ML software. The [*General
Standards*](#general-standards) for missing values (**G2.13**–**G2.16**)
do not apply to Machine Learning software, in the place of which the
following standards attempt to cover a practical range of typical
approaches and applications.
- **ML1.6** *ML software which does not admit missing values, and
which expects no missing values, should implement explicit
pre-processing routines to identify whether data has any missing
values, and should generally error appropriately and informatively
when passed data with missing values. In addition, ML software which
does not admit missing values should:*
- **ML1.6a** *Explain why missing values are not admitted.*
- **ML1.6b** *Provide explicit examples (in function
documentation, vignettes, or both) for how missing values may be
imputed, rather than simply discarded.*
- **ML1.7** *ML software which admits missing values should clearly
document how such values are processed.*
- **ML1.7a** *Where missing values are imputed, software should
offer multiple user-defined ways to impute missing data.*
- **ML1.7b** *Where missing values are imputed, the precise
imputation steps should also be explicitly documented, either in
tests (see **ML7.2** below), function documentation, or
vignettes.*
- **ML1.8** *ML software should enable equal treatment of missing
values for both training and test data, with optional user ability
to control application to either one or both.*
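
By way of illustration of **ML1.6b**, documentation might contrast
discarding incomplete cases with a simple imputation such as the
following (mean imputation is shown purely as an illustrative sketch;
real software should generally offer more principled alternatives, as
per **ML1.7a**):

``` r
# Simple mean imputation, shown only as an example of documenting how
# missing values may be imputed rather than discarded (ML1.6b).
impute_mean <- function (x) {
    x [is.na (x)] <- mean (x, na.rm = TRUE)
    x
}
dat <- data.frame (a = c (1, NA, 3), b = c (NA, 2, 4))
dat [] <- lapply (dat, impute_mean)
dat
#>   a b
#> 1 1 3
#> 2 2 2
#> 3 3 4
```
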
### 2 Pre-processing
As reflected in the workflow envisioned at the outset, ML software
operates somewhat differently to statistical software in many other
categories. In particular, ML software often requires explicit
specification of a workflow, including specification of input data (as
per the standards of the preceding sub-section), and of both
transformations and statistical models to be applied to those data. This
section of standards refers exclusively to the transformation of input
data as a pre-processing step prior to any specification of, or
submission to, actual models.
- **ML2.0** *A dedicated function should enable pre-processing steps
to be defined and parametrized.*
- **ML2.0a** *That function should return an object which can be
directly submitted to a specified model (see section 3, below).*
- **ML2.0b** *Absent explicit justification otherwise, that return
object should have a defined class minimally intended to
implement a default `print` method which summarizes the input
data set (as per **ML1.5** above) and associated transformations
(see the following standard).*
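
As a concrete, purely illustrative sketch of **ML2.0**, a
pre-processing constructor might return a classed object with a default
`print` method along the following lines; all names, and the particular
transformations shown, are hypothetical:

``` r
# Hypothetical pre-processing constructor returning a classed object with
# a default print method (cf. ML2.0a-b); centring and scaling are shown
# only as example transformations.
ml_preprocess <- function (data, center = TRUE, scale = TRUE) {
    x <- scale (as.matrix (data), center = center, scale = scale)
    obj <- list (data = x,
                 transforms = list (center = attr (x, "scaled:center"),
                                    scale = attr (x, "scaled:scale")))
    class (obj) <- "ml_preproc"
    obj
}
print.ml_preproc <- function (x, ...) {
    cat ("Pre-processed data:", nrow (x$data), "cases,",
         ncol (x$data), "variables\n")
    cat ("Recorded transformations:",
         paste (names (x$transforms), collapse = ", "), "\n")
    invisible (x)
}
ml_preprocess (mtcars)
#> Pre-processed data: 32 cases, 11 variables
#> Recorded transformations: center, scale
```
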
Standards for most other categories of statistical software suggest that
pre-processing routines should ensure that input data sets are
commensurate, for example, through having equal numbers of cases or
rows. In contrast, ML software is commonly intended to accept input data
which can not be guaranteed to be dimensionally commensurate, such as
software intended to process rectangular image files which may be of
different sizes.
- **ML2.1** *ML software which uses broadcasting to reconcile
dimensionally incommensurate input data should offer an ability to
at least optionally record transformations applied to each input
file.*
Beyond broadcasting and dimensional transformations, the following
standards apply to the pre-processing stages of ML software.
- **ML2.2** *ML software which requires or relies upon numeric
transformations of input data (such as change in mean values or
variances) should allow optional explicit specification of target
values, rather than restricting transformations to default generic
values only (such as transformations to z-scores).*
- **ML2.2a** *Where the parameters have default values, reasons
for those particular defaults should be explicitly described.*
- **ML2.2b** *Any extended documentation (such as vignettes) which
demonstrates the use of explicit values for numeric
transformations should explicitly describe why particular values
are used.*
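
The kind of explicit target values envisaged by **ML2.2** might, for
example, be exposed as follows (an illustrative sketch only):

``` r
# Rescaling to user-specified target mean and standard deviation, rather
# than to fixed z-scores only (cf. ML2.2); names are illustrative.
rescale_to <- function (x, target_mean = 0, target_sd = 1) {
    target_mean + target_sd * (x - mean (x)) / sd (x)
}
set.seed (1)
y <- rescale_to (rnorm (100, mean = 5, sd = 2),
                 target_mean = 10, target_sd = 3)
round (c (mean (y), sd (y)), 1)
#> [1] 10  3
```
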
For all transformations applied to input data, whether of dimension
(**ML2.1**) or scale (**ML2.2**),
- **ML2.3** *The values associated with all transformations should be
recorded in the object returned by the function described in the
preceding standard (**ML2.0**).*
- **ML2.4** *Default values of all transformations should be
explicitly documented, both in documentation of parameters where
appropriate (such as for numeric transformations), and in extended
documentation such as vignettes.*
- **ML2.5** *ML software should provide options to bypass or otherwise
switch off all default transformations.*
- **ML2.6** *Where transformations are implemented via distinct
functions, these should be exported to a package’s namespace so they
can be applied in other contexts.*
- **ML2.7** *Where possible, documentation should be provided for how
transformations may be reversed. For example, documentation may
demonstrate how the values retained via **ML2.3**, above, can be
used along with transformations either exported via **ML2.6** or
otherwise exemplified in demonstration code to independently
transform data, and then to reverse those transformations.*
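
For example, documentation addressing **ML2.7** might demonstrate
reversal of a centring-and-scaling transformation using the values
retained in accordance with **ML2.3**:

``` r
# Reversing a centring/scaling transformation using the retained
# "scaled:center" and "scaled:scale" values (cf. ML2.3, ML2.7).
x <- matrix (rnorm (20, mean = 5, sd = 2), ncol = 2)
xs <- scale (x)   # forward transformation
x_back <- sweep (sweep (xs, 2, attr (xs, "scaled:scale"), "*"),
                 2, attr (xs, "scaled:center"), "+")
all.equal (as.numeric (x), as.numeric (x_back))
#> [1] TRUE
```
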
### 3 Model and Algorithm Specification
A “model” in the context of ML software is understood to be a means of
specifying a mapping between input and output data, generally applied to
training and validation data. Model specification is the step of
specifying *how* such a mapping is to be constructed. The specification
of *what* the values of such a model actually are occurs through
training the model, and is described in the following sub-section. These
standards also refer to *control parameters* which specify how models
are trained. These parameters commonly include values specifying numbers
of iterations, training rates, and parameters controlling algorithmic
processes such as re-sampling or cross-validation.
- **ML3.0** *Model specification should be implemented as a distinct
stage subsequent to specification of pre-processing routines (see
Section 2, above) and prior to actual model fitting or training (see
Section 4, below). In particular,*
- **ML3.0a** *A dedicated function should enable models to be
specified without actually fitting or training them, or if this
(**ML3**) and the following (**ML4**) stages are controlled by a
single function, that function should have a parameter enabling
models to be specified yet not fitted (for example,
`nofit = FALSE`).*
- **ML3.0b** *That function should accept as input the objects
produced by the previous Input Data Specification stage, and
defined according to **ML2.0**, above.*
- **ML3.0c** *The function described above (**ML3.0a**) should
return an object which can be directly trained as described in
the following sub-section (**ML4**).*
- **ML3.0d** *That return object should have a defined class
minimally intended to implement a default `print` method which
summarises the model specification, including values of all
relevant parameters.*
- **ML3.1** *ML software should allow the use of both untrained
models, specified through model parameters only, as well as
pre-trained models. Use of the latter commonly entails an ability to
submit a previously-trained model object to the function defined
according to **ML3.0a**, above.*
- **ML3.2** *ML software should enable different models to be applied
to the object specifying data inputs and transformations (see
sub-sections 1–2, above) without needing to re-define those
preceding steps.*
A function fulfilling **ML3.0–3.2** might, for example, permit the
following arguments:
1. `data`: Input data specification constructed according to **ML1**
2. `model`: An optional previously-trained model
3. `control`: A list of parameters controlling how the model algorithm
is to be applied during the subsequent training phase (**ML4**).
A function with the arguments defined above would fulfil the preceding
three standards, because the `data` argument would represent the output
of **ML1**, while the `model` argument would allow different pre-trained
models to be submitted using the same data and associated specifications
(**ML3.1**). The provision of a separate `data` argument would fulfil
**ML3.2** by allowing one or both of the `model` or `control` parameters
to be re-defined while submitting the same `data` object.
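
A minimal, purely hypothetical sketch of such a model-specification
constructor, which specifies but does not fit a model (**ML3.0a**), and
which implements the default `print` method of **ML3.0d**, might look as
follows (all argument, class, and control-parameter names are
illustrative only):

``` r
# Hypothetical model-specification constructor (cf. ML3.0a-d).
ml_model <- function (data, model = NULL, control = list ()) {
    spec <- list (data = data, model = model, control = control,
                  trained = FALSE)
    class (spec) <- "ml_model_spec"
    spec
}
print.ml_model_spec <- function (x, ...) {
    cat ("ML model specification (untrained)\n")
    cat ("  pre-trained model supplied:", !is.null (x$model), "\n")
    cat ("  control parameters:",
         if (length (x$control) == 0L) "(defaults)" else
             paste (names (x$control), collapse = ", "), "\n")
    invisible (x)
}
m <- ml_model (data = mtcars,
               control = list (optimizer = "sgd", learning_rate = 0.01))
print (m)
#> ML model specification (untrained)
#>   pre-trained model supplied: FALSE
#>   control parameters: optimizer, learning_rate
```
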
- **ML3.3** *Where ML software implements its own distinct classes of
model objects, the properties and behaviours of those specific
classes of objects should be explicitly compared with objects
produced by other ML software. In particular, where possible, ML
software should provide extended documentation (as vignettes or
equivalent) comparing model objects with those from other ML
software, noting both unique abilities and restrictions of any
implemented classes.*
- **ML3.4** *Where training rates are used, ML software should provide
explicit documentation both in all functions which use training
rates, and in extended form such as vignettes, of the importance of,
and/or sensitivity to, different values of training rates. In
particular,*
- **ML3.4a** *Unless explicitly justified otherwise, ML software
should offer abilities to automatically determine appropriate or
optimal training rates, either as distinct pre-processing
stages, or as implicit stages of model training.*
- **ML3.4b** *ML software which provides default values for
training rates should clearly document anticipated restrictions
of validity of those default values; for example through clear
suggestions that user-determined and -specified values may
generally be necessary or preferable.*
#### 3.1 Control Parameters
Control parameters are considered here to specify how a model is to be
applied to a set of training data. These are generally distinct from
parameters specifying the actual model (such as model architecture).
While we recommend that control parameters be submitted as items of a
single named list, this is neither a firm expectation nor an explicit
part of the current standards.
- **ML3.5** *Parameters controlling optimization algorithms should
minimally include:*
- **ML3.5a** *Specification of the type of algorithm used to
explore the search space (commonly, for example, some kind of
gradient descent algorithm)*
- **ML3.5b** *The kind of loss function used to assess distance
between model estimates and desired output.*
- **ML3.6** *Unless explicitly justified otherwise (for example
because ML software under consideration is an implementation of one
specific algorithm), ML software should:*
- **ML3.6a** *Implement or otherwise permit usage of multiple ways
of exploring search space*
- **ML3.6b** *Implement or otherwise permit usage of multiple loss
functions.*
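
Control parameters satisfying **ML3.5**, submitted as a single named
list as recommended above, might for example look like the following
(all names and values are purely illustrative):

``` r
# Illustrative control-parameter list naming both the optimization
# algorithm (ML3.5a) and the loss function (ML3.5b).
control <- list (optimizer = "sgd",      # how the search space is explored
                 learning_rate = 0.01,
                 loss = "mse",           # loss function used in training
                 max_iter = 1000L)
```
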
#### 3.2 CPU and GPU processing
ML software often involves manipulation of large numbers of rectangular
arrays for which graphics processing units (GPUs) are often more
efficient than central processing units (CPUs). ML software thus
commonly offers options to train models using either CPUs or GPUs. While
these standards do not currently suggest any particular design choice in
this regard, we do note the following:
- **ML3.7** *For ML software in which algorithms are coded in C++,
user-controlled use of either CPUs or GPUs (on NVIDIA processors at
least) should be implemented through direct use of
[`libcudacxx`](https://github.com/NVIDIA/libcudacxx).*
This library can be “switched on” by including a single C++ header file,
enabling code to be switched from CPU to GPU.
### 4 Model Training
Model training is the stage of the ML workflow envisioned here in which
the actual computation is performed by applying a model specified
according to **ML3** to data specified according to **ML1** and **ML2**.
- **ML4.0** *ML software should generally implement a unified
single-function interface to model training, able to receive as
input a model specified according to all preceding standards. In
particular, models with categorically different specifications, such
as different model architectures or optimization algorithms, should
be able to be submitted to the same model training function.*
- **ML4.1** *ML software should at least optionally retain explicit
information on paths taken as an optimizer advances towards minimal
loss. Such information should minimally include:*
- **ML4.1a** *Specification of all model-internal parameters, or
equivalent hashed representation.*
- **ML4.1b** *The value of the loss function at each point*
- **ML4.1c** *Information used to advance to the next point, for
example, quantification of the local gradient.*
- **ML4.2** *The subsequent extraction of information retained
according to the preceding standard should be explicitly documented,
including through example code.*
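
The kind of per-iteration information envisaged by **ML4.1** is
illustrated by the following toy gradient-descent loop, which retains
parameter values, loss, and gradient at each step. This is entirely
illustrative; real ML software will generally retain such information
internally, with **ML4.2** then requiring documented examples of its
extraction.

``` r
# Toy gradient descent on a least-squares problem, retaining parameters,
# loss, and gradient at every iteration (cf. ML4.1a-c).
set.seed (1)
X <- cbind (1, rnorm (50))
y <- X %*% c (1, 2) + rnorm (50, sd = 0.1)
loss <- function (b) sum ((y - X %*% b) ^ 2)
grad <- function (b) -2 * t (X) %*% (y - X %*% b)
b <- c (0, 0)
path <- list ()
for (i in seq_len (100)) {
    g <- grad (b)
    b <- b - 0.001 * g
    path [[i]] <- list (par = as.numeric (b), loss = loss (b),
                        gradient = as.numeric (g))
}
vapply (tail (path, 3), function (p) p$loss, numeric (1)) # loss near convergence
```
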
#### 4.1 Batch Processing
The following standards apply to ML software which implements batch
processing, commonly to train models on data sets too large to be loaded
in their entirety into memory.
- **ML4.3** *All parameters controlling batch processing and
associated terminology should be explicitly documented, and it
should not, for example, be presumed that users will understand the
definition of “epoch” as implemented in any particular ML software.*
According to that standard, it would for example be inappropriate to
have a parameter, `nepochs`, described as “Number of epochs used in
model training”. Rather, the definition and particular implementation of
“epoch” must be explicitly defined.
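
For example, parameter documentation complying with **ML4.3** might read
something like the following (a hypothetical roxygen2-style snippet;
both the parameter name and the particular definition of “epoch” are
illustrative only):

``` r
#' @param nepochs Number of training epochs, where one "epoch" is defined
#'   here as a single pass of the optimizer over every batch of the full
#'   training data set.
```
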
- **ML4.4** *Explicit guidance should be provided on selection of
appropriate values for parameters controlling batch processing, for
example, on trade-offs between batch sizes and numbers of epochs
(with both terms provided as Control Parameters in accordance with
the preceding standard, **ML3**).*
- **ML4.5** *ML software may optionally include a function to estimate
likely time to train a specified model, through estimating initial
timings from a small sample of the full batch.*
- **ML4.6** *ML software should by default provide explicit
information on the progress of batch jobs (even where those jobs may
be implemented in parallel on GPUs). That information may be
optionally suppressed through additional parameters.*
#### 4.2 Re-sampling
As described at the outset, ML software does not always rely on
pre-specified and categorical distinctions between training and test
data. For example, models may be fit to what is effectively one single
data set in which specified cases or rows are used as training data, and
the remainder as test data. Re-sampling generally refers to the practice
of re-defining categorical distinctions between training and test data.
One training run accordingly connotes training a model on one particular
set of training data and then applying that model to the specified set
of test data. Re-sampling starts that process anew, through constructing
an alternative categorical partition between test and training data.
Even where test and training data are distinguished by more than a
simple data-internal category (such as a labelling column), for example,
by being stored in distinctly-named sub-directories, re-sampling may be
implemented by effectively shuffling data between training and test
sub-directories.
- **ML4.7** *ML software should provide an ability to combine results
from multiple re-sampling iterations using a single parameter
specifying numbers of iterations.*
- **ML4.8** *Absent any additional specification, re-sampling
algorithms should by default partition data according to proportions
of original test and training data.*
- **ML4.8a** *Re-sampling routines of ML software should
nevertheless offer an ability to explicitly control or override
such default proportions of test and training data.*
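
For example, re-sampling which preserves original proportions by default
(**ML4.8**), yet allows those proportions to be explicitly overridden
(**ML4.8a**), might be sketched as follows (illustrative only):

``` r
# Re-partition labelled data, preserving the original proportion of test
# cases by default (cf. ML4.8); names are illustrative.
resample_partition <- function (labels, prop_test = mean (labels == "test")) {
    n <- length (labels)
    out <- rep ("training", n)
    out [sample (n, size = round (prop_test * n))] <- "test"
    out
}
labels <- rep (c ("training", "test"), times = c (80, 20))
table (resample_partition (labels))                   # default 80/20 split
table (resample_partition (labels, prop_test = 0.5))  # explicit override
```
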
### 5 Model Output and Performance
Model output is considered here as a stage distinct from model
performance. Model output refers to the end result of model training
(**ML4**), while model performance involves the assessment of a trained
model against a test data set. The present section first describes
standards for model output, which are standards guiding the form of a
model trained according to the preceding standards (**ML4**). Model
Performance is then considered as a separate stage.
#### 5.1 Model Output
- **ML5.0** *The result of applying the training processes described
above should be contained within a single model object returned by
the function defined according to **ML4.0**, above. Even where the
output reflects application to a test data set, the resultant object
need not include any information on model performance (see
**ML5.3**–**ML5.4**, below).*
- **ML5.0a** *That object should either have its own class, or
extend some previously-defined class.*
- **ML5.0b** *That class should have a defined `print` method
which summarises important aspects of the model object,
including but not limited to summaries of input data and
algorithmic control parameters.*
- **ML5.1** *As for the untrained model objects produced according to
the above standards, and in particular as a direct extension of
**ML3.3**, the properties and behaviours of trained models produced
by ML software should be explicitly compared with equivalent objects
produced by other ML software. (Such comparison will generally be
done in terms of comparing model performance, as described in the
following standards **ML5.3**–**ML5.4**).
- **ML5.2** *The structure and functionality of objects representing
trained ML models should be thoroughly documented. In particular,*
- **ML5.2a** *Either all functionality extending from the class of
model object should be explicitly documented, or a method for
listing or otherwise accessing all associated functionality
explicitly documented and demonstrated in example code.*
- **ML5.2b** *Documentation should include examples of how to save
and re-load trained model objects for their re-use in accordance
with **ML3.1**, above.*
- **ML5.2c** *Where general functions for saving or serializing
objects, such as
[`saveRDS`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/readRDS.html)
are not appropriate for storing local copies of trained models,
an explicit function should be provided for that purpose, and
should be demonstrated with example code.*
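
Where standard serialization is appropriate, documentation fulfilling
**ML5.2b** might be as simple as the following, in which a base `lm`
object merely stands in for a trained ML model:

``` r
# Saving and re-loading a trained model object with saveRDS()/readRDS();
# the lm object here stands in for a trained ML model.
m <- lm (mpg ~ wt, data = mtcars)
f <- tempfile (fileext = ".rds")
saveRDS (m, f)
m2 <- readRDS (f)
identical (coef (m), coef (m2))
#> [1] TRUE
```
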
The [`R6` system](https://r6.r-lib.org) for representing classes in R is
an example of a system with explicit functionality, all components of
which are accessible by a simple
[`ls()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/ls.html)
call. Adherence to **ML5.2a** would nevertheless require explicit
description of the ability of
[`ls()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/ls.html)
to supply a list of all functions associated with an object. The [`mlr3`
package](https://github.com/mlr-org/mlr3), for example, uses [`R6`
classes](https://r6.r-lib.org), yet neither explicitly describes the use
of
[`ls()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/ls.html)
to list all associated functions, nor explicitly lists those functions.
#### 5.2 Model Performance
Model performance refers to the quantitative assessment of a trained
model when applied to a set of test data.
- **ML5.3** *Assessment of model performance should be implemented as
one or more functions distinct from model training.*
- **ML5.4** *Model performance should be able to be assessed according
to a variety of metrics.*
- **ML5.4a** *All model performance metrics represented by
functions internal to a package must be clearly and distinctly
documented.*
- **ML5.4b** *It should be possible to submit custom metrics to a
model assessment function, and the ability to do so should be
clearly documented including through example code.*
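
For instance, a custom metric submitted in accordance with **ML5.4b**
might simply be a function of observed and predicted values; the
`assess_model()` call shown in the comment below is hypothetical:

``` r
# A user-defined performance metric (root mean square error) of the kind
# that ML5.4b suggests should be accepted by model-assessment functions.
rmse <- function (observed, predicted) {
    sqrt (mean ((observed - predicted) ^ 2))
}
# assess_model (trained_model, test_data, metric = rmse) # hypothetical call
```
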
The remaining sub-sections specify general standards beyond the
preceding workflow-specific ones.
### 6 Documentation
- **ML6.0** *Descriptions of ML software should make explicit
reference to a workflow which separates training and testing stages,
and which clearly indicates a need for distinct training and test
data sets.*
The following standard applies to packages which are intended, or
otherwise only able, to encompass a restricted subset of the five
primary workflow steps enumerated at the outset. Envisioned here are
packages explicitly
intended to aid one particular aspect of the general workflow envisioned
here, such as implementations of ML optimization functions, or specific
loss measures.
- **ML6.1** *ML software intentionally designed to address only a
restricted subset of the workflow described here should clearly
document how it can be embedded within a typical full ML workflow in
the sense considered here.*
- **ML6.1a** *Such demonstrations should include and contrast
embedding within a full workflow using at least two other
packages to implement that workflow.*
### 7 Testing
#### 7.1 Input Data
- **ML7.0** *Tests should explicitly confirm partial and
case-insensitive matching of “test”, “train”, and, where applicable,
“validation” data.*
- **ML7.1** *Tests should demonstrate effects of different numeric
scaling of input data (see **ML2.2**).*
- **ML7.2** *For software which imputes missing data, tests should
compare internal imputation with explicit code which directly
implements imputation steps (even where such imputation is a
single-step implemented via some external package). These tests
serve as an explicit reference for how imputation is performed.*
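
A test fulfilling **ML7.2** might, for example, look like the following
[testthat](https://testthat.r-lib.org) sketch, in which `impute_mean()`
stands in for a package-internal imputation routine:

``` r
# Sketch of a test comparing imputation with an explicit re-implementation
# (cf. ML7.2); impute_mean() stands in for a package-internal routine.
library (testthat)
impute_mean <- function (x) {
    x [is.na (x)] <- mean (x, na.rm = TRUE)
    x
}
test_that ("imputation equals explicit mean imputation", {
    x <- c (1, NA, 3)
    expected <- x
    expected [2] <- mean (x, na.rm = TRUE)
    expect_equal (impute_mean (x), expected)
})
```
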
#### 7.2 Model Classes
The following standard applies to models in both untrained and trained
forms, considered to be the respective outputs of the preceding
standards **ML3** and **ML4**.
- **ML7.3** *Where model objects are implemented as distinct classes,
tests should explicitly compare the functionality of these classes
with functionality of equivalent classes for ML model objects from
other packages.*
- **ML7.3a** *These tests should explicitly identify restrictions
on the functionality of model objects in comparison with those
of other packages.*
- **ML7.3b** *These tests should explicitly identify functional
advantages and unique abilities of the model objects in
comparison with those of other packages.*
#### 7.3 Model Training
- **ML7.4** *ML software should explicitly document the effects of
different training rates, and in particular should demonstrate
divergence from optima with inappropriate training rates.*
- **ML7.5** *ML software which implements routines to determine
optimal training rates (see **ML3.4**, above) should implement tests
to confirm the optimality of resultant values.*
- **ML7.6** *ML software which implements independent training “epochs”
should demonstrate in tests the effects of lesser versus greater
numbers of epochs.*
- **ML7.7** *ML software should explicitly test different optimization
algorithms, even where software is intended to implement one
specific algorithm.*
- **ML7.8** *ML software should explicitly test different loss
functions, even where software is intended to implement one specific
measure of loss.*
- **ML7.9** *Tests should explicitly compare all possible combinations
of categorical differences, such as different model architectures with
the same optimization algorithm, the same model architecture with
different optimization algorithms, and differences in both.*
- **ML7.9a** *Such combinations will generally be formed from
multiple categorical factors, for which explicit use of
functions such as
[`expand.grid()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/expand.grid.html)
is recommended.*
The following example illustrates:
``` r
architecture <- c ("archA", "archB")
optimizers <- c ("optA", "optB", "optC")
cost_fns <- c ("costA", "costB", "costC")
expand.grid (architecture, optimizers, cost_fns)
#>     Var1 Var2  Var3
#> 1  archA optA costA
#> 2  archB optA costA
#> 3  archA optB costA
#> 4  archB optB costA
#> 5  archA optC costA
#> 6  archB optC costA
#> 7  archA optA costB
#> 8  archB optA costB
#> 9  archA optB costB
#> 10 archB optB costB
#> 11 archA optC costB
#> 12 archB optC costB
#> 13 archA optA costC
#> 14 archB optA costC
#> 15 archA optB costC
#> 16 archB optB costC
#> 17 archA optC costC
#> 18 archB optC costC
```
All possible combinations of these categorical parameters could then be
tested by iterating over the rows of that output.
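
Iterating over those rows might then look like the following, where
`test_one_combination()` is a hypothetical function running the tests
for a single combination:

``` r
# Iterate over all combinations generated above; test_one_combination()
# is hypothetical.
combinations <- expand.grid (architecture, optimizers, cost_fns)
for (i in seq_len (nrow (combinations))) {
    # test_one_combination (combinations [i, 1], combinations [i, 2],
    #                       combinations [i, 3])
}
```
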
- **ML7.10** *The successful extraction of information on paths taken
by optimizers (see **ML4.1**, above) should be tested, including
testing the general properties, but not necessarily the actual values,
of such data.*
#### 7.4 Model Performance
- **ML7.11** *All performance metrics available for a given class of
trained model should be thoroughly tested and compared.*
- **ML7.11a** *Tests which compare metrics should do so over a
range of inputs (generally implying differently trained models)
to demonstrate relative advantages and disadvantages of
different metrics.*