---
title: Machine Learning Software Standards
tags: statistical-software
robots: noindex, nofollow
---

<!-- Edit the .Rmd not the .md file -->

## Machine Learning Software

R has an extensive and diverse ecosystem of Machine Learning (ML) software which is very well described in the corresponding [CRAN Task View](https://cran.r-project.org/web/views/MachineLearning.html). Unlike most other categories of statistical software considered here, the primary distinguishing feature of ML software is not (necessarily or directly) algorithmic, but rather pertains to a *workflow* typical of machine learning tasks. In particular, we consider ML software to approach data analysis via the two primary steps of:

1. Passing a set of *training* data to an algorithm in order to generate a candidate mapping between that data and some form of pre-specified output or response variable. Such mappings will be referred to here as “models”, with a single analysis of a single set of training data generating one model.
2. Passing a set of test data to the model(s) generated by the first step in order to derive some measure of predictive accuracy for that model.

A single ML task generally yields two distinct outputs:

1. The model derived in the first of the previous steps; and
2. Associated statistics of model performance, as evaluated within the context of the test data used to assess that performance.

Click on the following link to view a demonstration of the [Application of Machine Learning Software Standards](https://hackmd.io/Ix1YwD8YTWGuzdiXsVQadA).

**A Machine Learning Workflow**

Given those initial considerations, we now attempt the difficult task of envisioning a typical standard workflow for inherently diverse ML software. The following workflow ought to be considered an “extensive” workflow, with shorter versions, and correspondingly more restricted sets of standards, possible depending upon envisioned areas of application. For example, the workflow presumes input data to be too large to be stored as a single entity in local memory. Adaptation to situations in which all training data can be loaded into memory may mean that some of the following workflow stages, and therefore corresponding standards, may not apply.

Just as typical workflows are potentially very diverse, so are the outputs of ML software, which depend on areas of application and the intended purpose of the software. The following refers to the “desired output” of ML software, a phrase which is intentionally left non-specific, but which is intended to connote any and all forms of “response variable” and other “pre-specified outputs” such as categorical labels or validation data, along with outputs which may not necessarily be able to be pre-specified in simple uni- or multi-variate form, such as measures of distance between sets of training and validation data. Such “desired outputs” are presumed to be quantified in terms of a “loss” or “cost” function (hereafter, simply “loss function”) quantifying some measure of distance between a model estimate (resulting from applying the model to one or more components of a training data set) and a pre-defined “valid” output (during training), or a test data set (following training).

Given the foregoing considerations, we consider a typical ML workflow to progress through (at least some of) the following steps:
1. ***Input Data Specification*** Obtain a local copy of input data, often as multiple *objects* (either on-disk or in memory) in some suitably structured form such as in a series of sub-directories or accompanied by additional data defining the structural properties of input objects. Regardless of form, multiple objects are commonly given generic labels which distinguish between `training` and `test` data, along with optional additional categories and labels such as `validation` data used, for example, to determine the accuracy of models applied to training data yet prior to testing.
2. ***Pre-Processing*** Define transformations of input data, including, but not restricted to, broadcasting dimensions (as defined below) and standardising data ranges (typically to defined values of mean and standard deviation).
3. ***Model and Algorithm Specification*** Specify the model and associated processes which will be applied to map the input data on to the desired output. This step minimally includes the following distinct stages (generally in no particular order):
    1. Specify the kind of model which will be applied to the training data. ML software often allows the use of pre-trained models, in which case this step includes downloading or otherwise obtaining a pre-trained model, along with specification of which aspects of those models are to be modified through application to a particular set of training and validation data.
    2. Specify the kind of algorithm which will be used to explore the search space (for example some kind of gradient descent algorithm), along with parameters controlling how that algorithm will be applied (for example a learning rate, as defined below).
    3. Specify the kind of loss function which will be used to quantify distance between model estimates and the desired output.
4. ***Model Training*** Apply the specified model to the training data to generate a series of estimates from the specified loss function. This stage may also include specifying parameters such as stopping or exit criteria, and parameters controlling batch processing of input data. Moreover, this stage may involve retaining some of the following additional data:
    1. Potential “pre-processing” stages such as initial estimates of optimal learning rates (see above).
    2. Details or summaries of actual paths taken through the search space towards convergence on a local or global minimum.
5. ***Model Output and Performance*** Measure the performance of the trained model when applied to the test data set, generally requiring the specification of a metric of model performance or accuracy.

Importantly, ML workflows may be partly iterative. This may in turn potentially confound distinctions between training and test data, and accordingly confound expectations commonly placed upon statistical analyses regarding the statistical independence of response variables. ML routines such as cross-validation repeatedly (re-)partition data between training and test sets. Resultant models cannot then be considered to have been developed through application to any single set of truly “independent” data. In the context of the standards that follow, these considerations admit a potential lack of clarity in any notional categorical distinction between training and test data, and between model specification and training.

The preceding workflow mentioned a couple of concepts, the interpretations of which in the context of these standards may be seen by clicking on the corresponding items below.
Following that, we proceed to standards for ML software, enumerated and developed with reference to the preceding workflow steps. In order that the following standards initially adhere to the enumeration of workflow steps given above, more general standards pertaining to aspects such as documentation and testing are given following the initial five “workflow” standards.

<details>
<summary>
Click for a definition of *broadcasting*, referred to in Step 2, above.
</summary>
<p>

The following definition comes from a vignette for the [`rray` package](https://github.com/r-lib/rray) named [*Broadcasting*](https://rray.r-lib.org/articles/broadcasting.html).

- ***Broadcasting*** is “repeating the dimensions of one object to match the dimensions of another.”

This concept runs counter to aspects of standards in other categories, which often suggest that functions should error when passed input objects which do not have commensurate dimensions. Broadcasting is a pre-processing step which enables objects with incommensurate dimensions to be dimensionally reconciled. The following demonstration is taken directly from the [`rray` package](https://github.com/r-lib/rray) (which is not currently on CRAN).

``` r
library (rray)

a <- array(c(1, 2), dim = c(2, 1))
b <- array(c(3, 4), dim = c(1, 2))

# rbind (a, b) # error!
rray_bind (a, b, .axis = 1)
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> [3,]    3    4

rray_bind (a, b, .axis = 2)
#>      [,1] [,2] [,3]
#> [1,]    1    3    4
#> [2,]    2    3    4
```

Broadcasting is commonly employed in ML software because it enables ML operations to be implemented on objects with incommensurate dimensions. One example is image analysis, in which training data may all be dimensionally commensurate, yet test images may have different dimensions. Broadcasting allows data to be submitted to ML routines regardless of potentially incommensurate dimensions.

</p>
</details>

<details>
<summary>
Click for a definition of *learning rate*, referred to in Step 3, above.
</summary>
<p>

- ***Learning Rate*** (generally) determines the step size used to search for local optima as a fraction of the local gradient. This parameter is particularly important for training ML algorithms like neural networks, the results of which can be very sensitive to variations in learning rates. A useful overview of the importance of learning rates, and a useful approach to automatically determining appropriate values, is given in [this blog post](https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html).

</p>
</details>

<br>

Partly because of its widespread and current relevance, the category of Machine Learning software is one for which there have been other notable attempts to develop standards. A particularly useful reference is the [MLPerf organization](https://www.mlperf.org/) which, among other activities, hosts several [GitHub repositories](https://github.com/mlperf) providing reference datasets and benchmark conditions for comparing performance aspects of ML software. While such reference or benchmark standards are not explicitly referred to in the current version of the following standards, we expect them to be gradually adapted and incorporated as we start to apply and refine our standards in application to software submitted to our review system.

### 1 Input Data Specification

Many of the following standards refer to the labelling of input data as “testing” or “training” data, along with potentially additional labels such as “validation” data.
In regard to such labelling, the following two standards apply:

- **ML1.0** *Documentation should make a clear conceptual distinction between training and test data (even where such may ultimately be confounded as described above.)*
    - **ML1.0a** *Where these terms are ultimately eschewed, these should nevertheless be used in initial documentation, along with clear explanation of, and justification for, alternative terminology.*
- **ML1.1** *Absent clear justification for alternative design decisions, input data should be expected to be labelled “test”, “training”, and, where applicable, “validation” data.*
    - **ML1.1a** *The presence and use of these labels should be explicitly confirmed via pre-processing steps (and tested in accordance with **ML7.0**, below).*
    - **ML1.1b** *Matches to expected labels should be case-insensitive and based on partial matching such that, for example, “Test”, “test”, or “testing” should all suffice.*

The following three standards (**ML1.2**–**ML1.4**) represent three possible design intentions for ML software. Only one of these three will generally be applicable to any one piece of software, although it is nevertheless possible that more than one of these standards may apply.

The first of these three standards applies to ML software which is intended to process, or capable of processing, input data as a single (generally tabular) object.

- **ML1.2** *Training and test data sets for ML software should be able to be input as a single, generally tabular, data object, with the training and test data distinguished either by*
    - *A specified variable containing, for example, `TRUE`/`FALSE` or `0`/`1` values, or which uses some other system such as missing (`NA`) values to denote test data; and/or*
    - *An additional parameter designating case or row numbers, or labels of test data.*

The second of these three standards applies to ML software which is intended to process, or capable of processing, input data represented as multiple objects which exist in local memory.

- **ML1.3** *Input data should be clearly partitioned between training and test data (for example, through having each passed as a distinct `list` item), or should enable an additional means of categorically distinguishing training from test data (such as via an additional parameter which provides explicit labels). Where applicable, distinction of validation and any other data should also accord with this standard.*

The third of these three standards for data input applies to ML software for which data are expected to be input as references to multiple external objects, generally expected to be read from either local or remote connections.

- **ML1.4** *Training and test data sets, along with other necessary components such as validation data sets, should be stored in their own distinctly labelled sub-directories (for distinct files), or according to an explicit and distinct labelling scheme (for example, for database connections). Labelling should in all cases adhere to **ML1.1**, above.*
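
The following is a minimal sketch of the kind of case-insensitive, partial label matching envisaged in **ML1.1b**. The `match_label()` function and its argument names are purely hypothetical, and actual implementations may of course differ.

``` r
# Hypothetical helper illustrating ML1.1b: match user-supplied labels to
# expected categories, case-insensitively and via partial (prefix) matching.
match_label <- function (label, expected = c ("train", "test", "valid")) {
    i <- which (startsWith (tolower (label), expected))
    if (length (i) != 1L) {
        stop ("Unable to match label '", label, "' to one of [",
              paste (expected, collapse = ", "), "]")
    }
    expected [i]
}
match_label ("Test")       # "test"
match_label ("training")   # "train"
match_label ("Validation") # "valid"
```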
The following standard applies to all ML software regardless of the applicability or otherwise of the preceding three standards.

- **ML1.5** *ML software should implement a single function which summarises the contents of test and training (and other) data sets, minimally including counts of numbers of cases, records, or files, and potentially extending to tables or summaries of file or data types, sizes, and other information (such as unique hashes for each component).*

#### 1.1 Missing Values

Missing data are handled differently by different ML routines, and it is also difficult to suggest generally applicable standards for pre-processing missing values in ML software. The [*General Standards*](#general-standards) for missing values (**G2.13**–**G2.16**) do not apply to Machine Learning software, in the place of which the following standards attempt to cover a practical range of typical approaches and applications.

- **ML1.6** *ML software which does not admit missing values, and which expects no missing values, should implement explicit pre-processing routines to identify whether data has any missing values, and should generally error appropriately and informatively when passed data with missing values. In addition, ML software which does not admit missing values should:*
    - **ML1.6a** *Explain why missing values are not admitted.*
    - **ML1.6b** *Provide explicit examples (in function documentation, vignettes, or both) for how missing values may be imputed, rather than simply discarded.*
- **ML1.7** *ML software which admits missing values should clearly document how such values are processed.*
    - **ML1.7a** *Where missing values are imputed, software should offer multiple user-defined ways to impute missing data.*
    - **ML1.7b** *Where missing values are imputed, the precise imputation steps should also be explicitly documented, either in tests (see **ML7.2** below), function documentation, or vignettes.*
- **ML1.8** *ML software should enable equal treatment of missing values for both training and test data, with optional user ability to control application to either one or both.*

### 2 Pre-processing

As reflected in the workflow envisioned at the outset, ML software operates somewhat differently to statistical software in many other categories. In particular, ML software often requires explicit specification of a workflow, including specification of input data (as per the standards of the preceding sub-section), and of both transformations and statistical models to be applied to those data. This section of standards refers exclusively to the transformation of input data as a pre-processing step prior to any specification of, or submission to, actual models.

- **ML2.0** *A dedicated function should enable pre-processing steps to be defined and parametrized.*
    - **ML2.0a** *That function should return an object which can be directly submitted to a specified model (see Section 3, below).*
    - **ML2.0b** *Absent explicit justification otherwise, that return object should have a defined class minimally intended to implement a default `print` method which summarizes the input data set (as per **ML1.5** above) and associated transformations (see the following standard).*
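
As a concrete, though entirely hypothetical, illustration of **ML2.0**–**ML2.0b**, a pre-processing function might look something like the following minimal sketch (all function and class names are assumptions used for illustration only):

``` r
# Hypothetical pre-processing function returning a classed object (ML2.0a)
# which records the input data along with the specified transformations.
preprocess <- function (data, centre = 0, scale = 1) {
    structure (list (data = data,
                     transforms = list (centre = centre, scale = scale)),
               class = "ml_preproc")
}
# Default `print` method summarising input data and transformations (ML2.0b):
print.ml_preproc <- function (x, ...) {
    cat ("Pre-processing specification:",
         nrow (x$data), "cases of", ncol (x$data), "variables;",
         "centre =", x$transforms$centre, "and scale =", x$transforms$scale, "\n")
    invisible (x)
}
preprocess (mtcars, centre = 0, scale = 1)
```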
Standards for most other categories of statistical software suggest that pre-processing routines should ensure that input data sets are commensurate, for example, through having equal numbers of cases or rows. In contrast, ML software is commonly intended to accept input data which cannot be guaranteed to be dimensionally commensurate, such as software intended to process rectangular image files which may be of different sizes.

- **ML2.1** *ML software which uses broadcasting to reconcile dimensionally incommensurate input data should offer an ability to at least optionally record transformations applied to each input file.*

Beyond broadcasting and dimensional transformations, the following standards apply to the pre-processing stages of ML software.

- **ML2.2** *ML software which requires or relies upon numeric transformations of input data (such as changes in mean values or variances) should allow optional explicit specification of target values, rather than restricting transformations to default generic values only (such as transformations to z-scores).*
    - **ML2.2a** *Where the transformation parameters have default values, reasons for those particular defaults should be explicitly described.*
    - **ML2.2b** *Any extended documentation (such as vignettes) which demonstrates the use of explicit values for numeric transformations should explicitly describe why particular values are used.*

For all transformations applied to input data, whether of dimension (**ML2.1**) or scale (**ML2.2**):

- **ML2.3** *The values associated with all transformations should be recorded in the object returned by the function described in the preceding standard (**ML2.0**).*
- **ML2.4** *Default values of all transformations should be explicitly documented, both in documentation of parameters where appropriate (such as for numeric transformations), and in extended documentation such as vignettes.*
- **ML2.5** *ML software should provide options to bypass or otherwise switch off all default transformations.*
- **ML2.6** *Where transformations are implemented via distinct functions, these should be exported to a package’s namespace so they can be applied in other contexts.*
- **ML2.7** *Where possible, documentation should be provided for how transformations may be reversed. For example, documentation may demonstrate how the values retained via **ML2.3**, above, can be used along with transformations either exported via **ML2.6** or otherwise exemplified in demonstration code to independently transform data, and then to reverse those transformations.*

### 3 Model and Algorithm Specification

A “model” in the context of ML software is understood to be a means of specifying a mapping between input and output data, generally applied to training and validation data. Model specification is the step of specifying *how* such a mapping is to be constructed. The specification of *what* the values of such a model actually are occurs through training the model, and is described in the following sub-section.

These standards also refer to *control parameters* which specify how models are trained. These parameters commonly include values specifying numbers of iterations, training rates, and parameters controlling algorithmic processes such as re-sampling or cross-validation.

- **ML3.0** *Model specification should be implemented as a distinct stage subsequent to specification of pre-processing routines (see Section 2, above) and prior to actual model fitting or training (see Section 4, below).
In particular,*
    - **ML3.0a** *A dedicated function should enable models to be specified without actually fitting or training them, or if this (**ML3**) and the following (**ML4**) stages are controlled by a single function, that function should have a parameter enabling models to be specified yet not fitted (for example, `nofit = FALSE`).*
    - **ML3.0b** *That function should accept as input the objects produced by the previous Input Data Specification stage, and defined according to **ML2.0**, above.*
    - **ML3.0c** *The function described above (**ML3.0a**) should return an object which can be directly trained as described in the following sub-section (**ML4**).*
    - **ML3.0d** *That return object should have a defined class minimally intended to implement a default `print` method which summarises the model specification, including values of all relevant parameters.*
- **ML3.1** *ML software should allow the use of both untrained models, specified through model parameters only, as well as pre-trained models. Use of the latter commonly entails an ability to submit a previously-trained model object to the function defined according to **ML3.0a**, above.*
- **ML3.2** *ML software should enable different models to be applied to the object specifying data inputs and transformations (see sub-sections 1–2, above) without needing to re-define those preceding steps.*

A function fulfilling **ML3.0**–**ML3.2** might, for example, permit the following arguments:

1. `data`: Input data specification constructed according to **ML1**.
2. `model`: An optional previously-trained model.
3. `control`: A list of parameters controlling how the model algorithm is to be applied during the subsequent training phase (**ML4**).

A function with the arguments defined above would fulfil the preceding three standards, because the `data` stage would represent the output of **ML1**, while the `model` stage would allow for different pre-trained models to be submitted using the same data and associated specifications (**ML3.1**). The provision of a separate `data` argument would fulfil **ML3.2** by allowing one or both of the `model` or `control` parameters to be re-defined while submitting the same `data` object.

- **ML3.3** *Where ML software implements its own distinct classes of model objects, the properties and behaviours of those specific classes of objects should be explicitly compared with objects produced by other ML software. In particular, where possible, ML software should provide extended documentation (as vignettes or equivalent) comparing model objects with those from other ML software, noting both unique abilities and restrictions of any implemented classes.*
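
Returning to the `data`, `model`, and `control` arguments described above, the following is a minimal and purely hypothetical sketch of one way such a specification function might fulfil **ML3.0**–**ML3.2**, returning an untrained model object which can later be passed to a training function (**ML4**):

``` r
# Hypothetical model specification function: the model is specified here but
# not fitted (ML3.0a), and pre-trained models may be supplied via `model` (ML3.1).
model_spec <- function (data, model = NULL, control = list ()) {
    structure (list (data = data, model = model, control = control),
               class = "ml_model_spec")
}
# Default `print` method summarising the specification (ML3.0d):
print.ml_model_spec <- function (x, ...) {
    cat ("Untrained model specification with control parameters: [",
         paste (names (x$control), collapse = ", "), "]\n")
    invisible (x)
}
model_spec (mtcars, control = list (rate = 0.001, niters = 100L))
```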
- **ML3.4** *Where training rates are used, ML software should provide explicit documentation both in all functions which use training rates, and in extended form such as vignettes, of the importance of, and/or sensitivity to, different values of training rates. In particular,*
    - **ML3.4a** *Unless explicitly justified otherwise, ML software should offer abilities to automatically determine appropriate or optimal training rates, either as distinct pre-processing stages, or as implicit stages of model training.*
    - **ML3.4b** *ML software which provides default values for training rates should clearly document anticipated restrictions of validity of those default values; for example through clear suggestions that user-determined and -specified values may generally be necessary or preferable.*

#### 3.1 Control Parameters

Control parameters are considered here to specify how a model is to be applied to a set of training data. These are generally distinct from parameters specifying the actual model (such as model architecture). While we recommend that control parameters be submitted as items of a single named list, this is neither a firm expectation nor an explicit part of the current standards.

- **ML3.5** *Parameters controlling optimization algorithms should minimally include:*
    - **ML3.5a** *Specification of the type of algorithm used to explore the search space (commonly, for example, some kind of gradient descent algorithm).*
    - **ML3.5b** *The kind of loss function used to assess distance between model estimates and desired output.*
- **ML3.6** *Unless explicitly justified otherwise (for example because ML software under consideration is an implementation of one specific algorithm), ML software should:*
    - **ML3.6a** *Implement or otherwise permit usage of multiple ways of exploring search space.*
    - **ML3.6b** *Implement or otherwise permit usage of multiple loss functions.*

#### 3.2 CPU and GPU Processing

ML software often involves manipulation of large numbers of rectangular arrays for which graphics processing units (GPUs) are often more efficient than central processing units (CPUs). ML software thus commonly offers options to train models using either CPUs or GPUs. While these standards do not currently suggest any particular design choice in this regard, we do note the following:

- **ML3.7** *For ML software in which algorithms are coded in C++, user-controlled use of either CPUs or GPUs (on NVIDIA processors at least) should be implemented through direct use of [`libcudacxx`](https://github.com/NVIDIA/libcudacxx).*

This library can be “switched on” through activating a single C++ header file, in order to switch processing from CPU to GPU.

### 4 Model Training

Model training is the stage of the ML workflow envisioned here in which the actual computation is performed by applying a model specified according to **ML3** to data specified according to **ML1** and **ML2**.

- **ML4.0** *ML software should generally implement a unified single-function interface to model training, able to receive as input a model specified according to all preceding standards. In particular, models with categorically different specifications, such as different model architectures or optimization algorithms, should be able to be submitted to the same model training function.*
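
As background to the preceding standards on training rates (**ML3.4**) and the corresponding tests below (**ML7.4**), the following self-contained sketch illustrates how sensitive even the simplest gradient descent can be to the value of the learning rate; the function and values are purely illustrative:

``` r
# Gradient descent on f(x) = x ^ 2 (gradient 2 * x): small rates converge
# towards the minimum at zero, while overly large rates diverge.
descend <- function (rate, x = 1, n = 25) {
    for (i in seq_len (n)) {
        x <- x - rate * 2 * x
    }
    x
}
descend (rate = 0.1) # converges towards 0
descend (rate = 1.1) # diverges
```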
- **ML4.1** *ML software should at least optionally retain explicit information on paths taken as an optimizer advances towards minimal loss. Such information should minimally include:*
    - **ML4.1a** *Specification of all model-internal parameters, or an equivalent hashed representation.*
    - **ML4.1b** *The value of the loss function at each point.*
    - **ML4.1c** *Information used to advance to the next point, for example quantification of the local gradient.*
- **ML4.2** *The subsequent extraction of information retained according to the preceding standard should be explicitly documented, including through example code.*

#### 4.1 Batch Processing

The following standards apply to ML software which implements batch processing, commonly to train models on data sets too large to be loaded in their entirety into memory.

- **ML4.3** *All parameters controlling batch processing and associated terminology should be explicitly documented, and it should not, for example, be presumed that users will understand the definition of “epoch” as implemented in any particular ML software.*

According to that standard, it would for example be inappropriate to have a parameter, `nepochs`, described as “Number of epochs used in model training”. Rather, the definition and particular implementation of “epoch” must be explicitly defined.

- **ML4.4** *Explicit guidance should be provided on the selection of appropriate values for parameters controlling batch processing, for example, on trade-offs between batch sizes and numbers of epochs (with both terms provided as Control Parameters in accordance with the preceding standards, **ML3**).*
- **ML4.5** *ML software may optionally include a function to estimate likely time to train a specified model, through estimating initial timings from a small sample of the full batch.*
- **ML4.6** *ML software should by default provide explicit information on the progress of batch jobs (even where those jobs may be implemented in parallel on GPUs). That information may be optionally suppressed through additional parameters.*

#### 4.2 Re-sampling

As described at the outset, ML software does not always rely on pre-specified and categorical distinctions between training and test data. For example, models may be fit to what is effectively one single data set in which specified cases or rows are used as training data, and the remainder as test data. Re-sampling generally refers to the practice of re-defining categorical distinctions between training and test data. One training run accordingly connotes training a model on one particular set of training data and then applying that model to the specified set of test data. Re-sampling starts that process anew, through constructing an alternative categorical partition between test and training data. Even where test and training data are distinguished by more than a simple data-internal category (such as a labelling column), for example, by being stored in distinctly-named sub-directories, re-sampling may be implemented by effectively shuffling data between training and test sub-directories.

- **ML4.7** *ML software should provide an ability to combine results from multiple re-sampling iterations using a single parameter specifying numbers of iterations.*
- **ML4.8** *Absent any additional specification, re-sampling algorithms should by default partition data according to proportions of original test and training data.*
    - **ML4.8a** *Re-sampling routines of ML software should nevertheless offer an ability to explicitly control or override such default proportions of test and training data.*
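
A minimal sketch of the default behaviour envisaged in **ML4.8**, assuming a single data set with a hypothetical `label` column distinguishing test from training cases, might preserve original proportions like this:

``` r
# Re-partition cases between "train" and "test" while retaining the
# original proportion of test cases (ML4.8).
resample <- function (d) {
    n_test <- sum (d$label == "test")
    d$label <- "train"
    d$label [sample (nrow (d), size = n_test)] <- "test"
    d
}
d <- data.frame (x = rnorm (10), label = rep (c ("train", "test"), c (7, 3)))
table (resample (d)$label) # still 7 "train" and 3 "test", in new positions
```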
### 5 Model Output and Performance

Model output is considered here as a stage distinct from model performance. Model output refers to the end result of model training (**ML4**), while model performance involves the assessment of a trained model against a test data set. The present section first describes standards for model output, which are standards guiding the form of a model trained according to the preceding standards (**ML4**). Model Performance is then considered as a separate stage.

#### 5.1 Model Output

- **ML5.0** *The result of applying the training processes described above should be contained within a single model object returned by the function defined according to **ML4.0**, above. Even where the output reflects application to a test data set, the resultant object need not include any information on model performance (see **ML5.3**–**ML5.4**, below).*
    - **ML5.0a** *That object should either have its own class, or extend some previously-defined class.*
    - **ML5.0b** *That class should have a defined `print` method which summarises important aspects of the model object, including but not limited to summaries of input data and algorithmic control parameters.*
- **ML5.1** *As for the untrained model objects produced according to the above standards, and in particular as a direct extension of **ML3.3**, the properties and behaviours of trained models produced by ML software should be explicitly compared with equivalent objects produced by other ML software. (Such comparison will generally be done in terms of comparing model performance, as described in the following standards **ML5.3**–**ML5.4**).*
- **ML5.2** *The structure and functionality of objects representing trained ML models should be thoroughly documented. In particular,*
    - **ML5.2a** *Either all functionality extending from the class of model object should be explicitly documented, or a method for listing or otherwise accessing all associated functionality explicitly documented and demonstrated in example code.*
    - **ML5.2b** *Documentation should include examples of how to save and re-load trained model objects for their re-use in accordance with **ML3.1**, above.*
    - **ML5.2c** *Where general functions for saving or serializing objects, such as [`saveRDS`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/readRDS.html), are not appropriate for storing local copies of trained models, an explicit function should be provided for that purpose, and should be demonstrated with example code.*

The [`R6` system](https://r6.r-lib.org) for representing classes in R is an example of a system with explicit functionality, all components of which are accessible by a simple [`ls()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/ls.html) call. Adherence to **ML5.2a** would nevertheless require explicit description of the ability of [`ls()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/ls.html) to supply a list of all functions associated with an object. The [`mlr3` package](https://github.com/mlr-org/mlr3), for example, uses [`R6` classes](https://r6.r-lib.org), yet neither explicitly describes the use of [`ls()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/ls.html) to list all associated functions, nor explicitly lists those functions.
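
To illustrate the point about [`ls()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/ls.html) made above, the following minimal sketch constructs an entirely illustrative [`R6`](https://r6.r-lib.org) class and lists its associated functionality:

``` r
library (R6)
# Purely illustrative R6 class with two methods:
Model <- R6Class ("Model",
    public = list (
        train = function (data) invisible (self),
        predict = function (newdata) rep (NA_real_, nrow (newdata))
    )
)
m <- Model$new ()
# `ls()` lists all functionality bound to the object (here "clone",
# "predict", and "train"), which ML5.2a requires be explicitly documented.
ls (m)
```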
#### 5.2 Model Performance

Model performance refers to the quantitative assessment of a trained model when applied to a set of test data.

- **ML5.3** *Assessment of model performance should be implemented as one or more functions distinct from model training.*
- **ML5.4** *Model performance should be able to be assessed according to a variety of metrics.*
    - **ML5.4a** *All model performance metrics represented by functions internal to a package must be clearly and distinctly documented.*
    - **ML5.4b** *It should be possible to submit custom metrics to a model assessment function, and the ability to do so should be clearly documented including through example code.*

The remaining sub-sections specify general standards beyond the preceding workflow-specific ones.

### 6 Documentation

- **ML6.0** *Descriptions of ML software should make explicit reference to a workflow which separates training and testing stages, and which clearly indicates a need for distinct training and test data sets.*

The following standard applies to packages which are intended or otherwise able to only encompass a restricted subset of the five primary workflow steps enumerated at the outset. Envisioned here are packages explicitly intended to aid one particular aspect of the general workflow, such as implementations of ML optimization functions, or specific loss measures.

- **ML6.1** *ML software intentionally designed to address only a restricted subset of the workflow described here should clearly document how it can be embedded within a typical full ML workflow in the sense considered here.*
    - **ML6.1a** *Such demonstrations should include and contrast embedding within a full workflow using at least two other packages to implement that workflow.*

### 7 Testing

#### 7.1 Input Data

- **ML7.0** *Tests should explicitly confirm partial and case-insensitive matching of “test”, “train”, and, where applicable, “validation” data.*
- **ML7.1** *Tests should demonstrate effects of different numeric scaling of input data (see **ML2.2**).*
- **ML7.2** *For software which imputes missing data, tests should compare internal imputation with explicit code which directly implements imputation steps (even where such imputation is a single step implemented via some external package). These tests serve as an explicit reference for how imputation is performed.*
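
The kind of test envisaged by **ML7.2** might, in a deliberately simplified and hypothetical form, look like the following, where `impute_mean()` stands in for a package-internal imputation routine:

``` r
# Stand-in for a package-internal routine which imputes missing values
# with the mean of non-missing values.
impute_mean <- function (x) {
    x [is.na (x)] <- mean (x, na.rm = TRUE)
    x
}
testthat::test_that ("imputation matches explicit, step-by-step code", {
    x <- c (1, NA, 3)
    x_explicit <- x
    x_explicit [2] <- mean (c (1, 3)) # the imputation step made explicit
    testthat::expect_identical (impute_mean (x), x_explicit)
})
```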
#### 7.2 Model Classes

The following standard applies to models in both untrained and trained forms, considered to be the respective outputs of the preceding standards **ML3** and **ML4**.

- **ML7.3** *Where model objects are implemented as distinct classes, tests should explicitly compare the functionality of these classes with the functionality of equivalent classes for ML model objects from other packages.*
    - **ML7.3a** *These tests should explicitly identify restrictions on the functionality of model objects in comparison with those of other packages.*
    - **ML7.3b** *These tests should explicitly identify functional advantages and unique abilities of the model objects in comparison with those of other packages.*

#### 7.3 Model Training

- **ML7.4** *ML software should explicitly document the effects of different training rates, and in particular should demonstrate divergence from optima with inappropriate training rates.*
- **ML7.5** *ML software which implements routines to determine optimal training rates (see **ML3.4**, above) should implement tests to confirm the optimality of resultant values.*
- **ML7.6** *ML software which implements independent training “epochs” should demonstrate in tests the effects of lesser versus greater numbers of epochs.*
- **ML7.7** *ML software should explicitly test different optimization algorithms, even where software is intended to implement one specific algorithm.*
- **ML7.8** *ML software should explicitly test different loss functions, even where software is intended to implement one specific measure of loss.*
- **ML7.9** *Tests should explicitly compare all possible combinations of categorical differences in model architecture, such as different model architectures with the same optimization algorithms, the same model architectures with different optimization algorithms, and differences in both.*
    - **ML7.9a** *Such combinations will generally be formed from multiple categorical factors, for which explicit use of functions such as [`expand.grid()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/expand.grid.html) is recommended.*

The following example illustrates:

``` r
architecture <- c ("archA", "archB")
optimizers <- c ("optA", "optB", "optC")
cost_fns <- c ("costA", "costB", "costC")
expand.grid (architecture, optimizers, cost_fns)
```

```
##     Var1 Var2  Var3
## 1  archA optA costA
## 2  archB optA costA
## 3  archA optB costA
## 4  archB optB costA
## 5  archA optC costA
## 6  archB optC costA
## 7  archA optA costB
## 8  archB optA costB
## 9  archA optB costB
## 10 archB optB costB
## 11 archA optC costB
## 12 archB optC costB
## 13 archA optA costC
## 14 archB optA costC
## 15 archA optB costC
## 16 archB optB costC
## 17 archA optC costC
## 18 archB optC costC
```

All possible combinations of these categorical parameters could then be tested by iterating over the rows of that output.

- **ML7.10** *The successful extraction of information on paths taken by optimizers (see **ML4.1**, above) should be tested, including testing the general properties, but not necessarily actual values of, such data.*

#### 7.4 Model Performance

- **ML7.11** *All performance metrics available for a given class of trained model should be thoroughly tested and compared.*
    - **ML7.11a** *Tests which compare metrics should do so over a range of inputs (generally implying differently trained models) to demonstrate relative advantages and disadvantages of different metrics.*
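
As a final, hedged illustration of **ML7.11**–**ML7.11a**, the following sketch compares two common performance metrics across a (trivially small) range of differently specified models; the metrics and models here are purely illustrative stand-ins for package-specific equivalents:

``` r
rmse <- function (obs, est) sqrt (mean ((obs - est) ^ 2))
mae <- function (obs, est) mean (abs (obs - est))
models <- list (m1 = lm (mpg ~ wt, data = mtcars),
                m2 = lm (mpg ~ wt + hp, data = mtcars))
# Compare both metrics across all models; tests could then assert expected
# relative orderings or tolerances rather than exact values.
vapply (models, function (m) {
    est <- predict (m, newdata = mtcars)
    c (rmse = rmse (mtcars$mpg, est), mae = mae (mtcars$mpg, est))
}, numeric (2))
```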