---
title: Unsupervised Learning Software Standards
tags: statistical-software
robots: noindex, nofollow
---

<!-- Edit the .Rmd not the .md file -->

## Dimensionality Reduction, Clustering, and Unsupervised Learning

This sub-section details standards for Dimensionality Reduction, Clustering, and Unsupervised Learning Software – referred to from here on for simplicity as “Unsupervised Learning Software”. Software in this category is distinguished from Regression Software through the latter aiming to construct or analyse one or more mappings between two defined data sets (for example, a set of “independent” data, *X*, and a set of “dependent” data, *Y*), whereas Unsupervised Learning Software aims to construct or analyse one or more mappings between a defined set of input or independent data and a second set of “output” data which are not necessarily known or given prior to the analysis.

A key distinction in Unsupervised Learning Software and Algorithms is between that for which output data represent (generally numerical) transformations of the input data set, and that for which output data are discrete labels applied to the input data. Examples of the former type include dimensionality reduction and ordination software and algorithms; examples of the latter include clustering and discrete partitioning software and algorithms.

Some examples of *Dimensionality Reduction, Clustering, and Unsupervised Learning* software include:

1. [`ivis`](https://joss.theoj.org/papers/10.21105/joss.01596) implements a dimensionality reduction technique using a “Siamese Neural Network” architecture.
2. [`tsfeaturex`](https://joss.theoj.org/papers/10.21105/joss.01279) is a package to automate “time series feature extraction,” which also provides an example of a package for which both input and output data are generally incomparable with most other packages in this category.
3. [`iRF`](https://joss.theoj.org/papers/10.21105/joss.01077) is another example of a generally incomparable package within this category, here one for which the features extracted are the most distinct predictive features extracted from repeated iterations of random forest algorithms.
4. [`compboost`](https://joss.theoj.org/papers/10.21105/joss.00967) is a package for component-wise gradient boosting which may be sufficiently general to allow application to problems addressed by several packages in this category.
5. The [`iml`](https://joss.theoj.org/papers/10.21105/joss.00786) package may offer usable functionality for devising general assessments of software within this category, through offering a “toolbox for making machine learning models interpretable” in a “model agnostic” way.

Click on the following link to view a demonstration [Application of Dimensionality Reduction, Clustering, and Unsupervised Learning Standards](https://hackmd.io/iOZD_oCpT86zoY5z4memaQ).
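The distinction between transformation-type and labelling-type output can be illustrated with the following minimal sketch, which uses base `stats` routines chosen purely for illustration:

``` r
set.seed (1)

# Transformation: dimensionality reduction returns numeric coordinates
z <- prcomp (USArrests)
head (z$x [, 1:2]) # principal component coordinates for each observation

# Discrete labelling: clustering returns a group label for each observation
cl <- kmeans (USArrests, centers = 3)
head (cl$cluster)
```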
### 1 Input Data Structures and Validation

- **UL1.0** *Unsupervised Learning Software should explicitly document expected format (types or classes) for input data, including descriptions of types or classes which are not accepted; for example, specification that software accepts only numeric inputs in `vector` or `matrix` form, or that all inputs must be in `data.frame` form with both column and row names.*
- **UL1.1** *Unsupervised Learning Software should provide distinct sub-routines to assert that all input data is of the expected form, and issue informative error messages when incompatible data are submitted.*

The following code demonstrates an example of a routine from the base `stats` package which fails to meet this standard.

``` r
d <- dist (USArrests) # example from help file for 'hclust' function
hc <- hclust (d) # okay
hc <- hclust (as.matrix (d))
```

    ## Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536"): missing value where TRUE/FALSE needed

The latter call fails, yet issues an uninformative error message; this reflects a failure to provide sufficient checks on the class of input data.

- **UL1.2** *Unsupervised Learning Software which uses row or column names to label output objects should assert that input data have non-default row or column names, and issue an informative message when these are not provided.*

Such messages need not necessarily be provided by default, but should at least be optionally available.

<details>
<summary>
Click here for examples of checks for whether row and column names have generic default values.
</summary>
<p>

The `data.frame` function inserts default row and column names where these are not explicitly specified.

``` r
x <- data.frame (matrix (1:10, ncol = 2))
x
```

    ##   X1 X2
    ## 1  1  6
    ## 2  2  7
    ## 3  3  8
    ## 4  4  9
    ## 5  5 10

Generic row names are almost always simple integer sequences, which the following condition confirms.

``` r
identical (rownames (x), as.character (seq (nrow (x))))
```

    ## [1] TRUE

Generic column names may come in a variety of formats. The following code uses a `grep` expression to match a single alphabetic character plus an optional leading zero, followed by the column number, appropriate for matching column names produced by generic construction of `data.frame` objects.

``` r
all (vapply (seq (ncol (x)), function (i)
             grepl (paste0 ("[[:alpha:]]0?", i), colnames (x) [i]),
             logical (1)))
```

    ## [1] TRUE

Messages should be issued in both of these cases.

</p>
</details>
<br>

The following code illustrates that the `hclust` function does not implement any such checks or assertions, rather it silently returns an object with default labels.

``` r
u <- USArrests
rownames (u) <- seq (nrow (u))
hc <- hclust (dist (u))
head (hc$labels)
```

    ## [1] "1" "2" "3" "4" "5" "6"

- **UL1.3** *Unsupervised Learning Software should transfer all relevant aspects of input data, notably including row and column names, and potentially information from other `attributes()`, to corresponding aspects of return objects.*
    - **UL1.3a** *Where otherwise relevant information is not transferred, this should be explicitly documented.*

An example of a function according with UL1.3 is [`stats::cutree()`](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/cutree.html):

``` r
hc <- hclust (dist (USArrests))
head (cutree (hc, 10))
```

    ##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
    ##          1          2          3          4          5          4

The row names of `USArrests` are transferred to the output object.
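The following minimal sketch illustrates one way of according with UL1.3. The routine (`score_rows()`) and its computation are hypothetical, not drawn from any package discussed here; the only point of the sketch is the explicit transfer of row names to the return value:

``` r
# Hypothetical routine returning one score per observation, transferring
# row names of the input to names of the return value (UL1.3):
score_rows <- function (x) {
    stopifnot (is.data.frame (x) || is.matrix (x)) # minimal input check
    scores <- as.numeric (rowSums (scale (x))) # placeholder computation; drops names
    names (scores) <- rownames (x) # explicitly transfer row names to output
    scores
}
head (score_rows (USArrests))
```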
In contrast to `cutree()`, some routines from the [`cluster` package](https://cran.r-project.org/package=cluster) do not comply with this standard:

``` r
library (cluster)
ac <- agnes (USArrests) # agglomerative nesting
head (cutree (ac, 10))
```

    ## [1] 1 2 3 4 3 4

The case labels are not appropriately carried through to the object returned by [`agnes()`](https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/agnes.html) to enable them to be transferred within [`cutree()`](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/cutree.html). (The labels are transferred to the object returned by `agnes`, just not in a way that enables `cutree` to inherit them.)

- **UL1.4** *Unsupervised Learning Software should document any assumptions made with regard to input data; for example, assumptions about distributional forms or locations (such as that data are centred or on approximately equivalent distributional scales). Implications of violations of these assumptions should be both documented and tested, in particular:*
    - **UL1.4a** *Software which responds qualitatively differently to input data which has components on markedly different scales should explicitly document such differences, and implications of submitting such data.*
    - **UL1.4b** *Examples or other documentation should not use `scale()` or equivalent transformations without explaining why scaling is applied, and explicitly illustrating and contrasting the consequences of not applying such transformations.*

### 2 Pre-processing and Variable Transformation

- **UL2.0** *Routines likely to give unreliable or irreproducible results in response to violations of assumptions regarding input data (see **UL1.4**) should implement pre-processing steps to diagnose potential violations, and issue appropriately informative messages, and/or include parameters to enable suitable transformations to be applied.*

Examples of compliance with this standard are the documentation entries for the `center` and `scale.` parameters of the [`stats::prcomp()`](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/prcomp.html) function.

- **UL2.1** *Unsupervised Learning Software should document any transformations applied to input data, for example conversion of label-values to `factor`, and should provide ways to explicitly avoid any default transformations (with error or warning conditions where appropriate).*
- **UL2.2** *Unsupervised Learning Software which accepts missing values in input data should implement explicit parameters controlling the processing of missing values, ideally distinguishing `NA` or `NaN` values from `Inf` values.*

This standard applies beyond *General Standards* **G2.13**–**G2.16**, through the additional requirement of implementing explicit parameters.
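A minimal sketch of such an explicit parameter follows. The function and its `na_action` parameter are hypothetical illustrations only, as is the placeholder algorithm:

``` r
# Hypothetical routine with an explicit parameter controlling processing of
# missing values, distinguishing NA/NaN from Inf (UL2.2):
cluster_data <- function (x, na_action = c ("fail", "omit")) {

    na_action <- match.arg (na_action)
    x <- as.matrix (x)

    if (any (is.infinite (x))) {
        stop ("'x' contains infinite values; these must be removed or recoded.")
    }
    if (any (is.na (x))) { # is.na() is TRUE for both NA and NaN
        if (na_action == "fail") {
            stop ("'x' contains missing (NA/NaN) values; ",
                  "set na_action = 'omit' to drop incomplete rows.")
        }
        x <- x [stats::complete.cases (x), , drop = FALSE]
    }

    stats::kmeans (x, centers = 3) # placeholder algorithm only
}
```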
- **UL2.3** *Unsupervised Learning Software should implement pre-processing routines to identify whether aspects of input data are perfectly collinear.*

### 3 Algorithms

#### 3.1 Labelling

- **UL3.1** *Algorithms which apply sequential labels to input data (such as clustering or partitioning algorithms) should ensure that the sequence follows decreasing group sizes (so labels of “1”, “a”, or “A” describe the largest group, “2”, “b”, or “B” the second largest, and so on.)*

Note that the [`stats::cutree()` function](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/cutree.html) does not accord with this standard:

``` r
hc <- hclust (dist (USArrests))
table (cutree (hc, k = 10))
```

    ## 
    ##  1  2  3  4  5  6  7  8  9 10 
    ##  3  3  3  6  5 10  2  5  5  8

The [`cutree()` function](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/cutree.html) applies arbitrary integer labels to the groups, yet the order of labels is not related to the order of group sizes.

- **UL3.2** *Dimensionality reduction or equivalent algorithms which label dimensions should ensure that sequences of labels follow decreasing “importance” (for example, eigenvalues or variance contributions).*

The [`stats::prcomp`](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/prcomp.html) function accords with this standard:

``` r
z <- prcomp (eurodist, rank = 5) # return maximum of 5 components
summary (z)
```

    ## Importance of first k=5 (out of 21) components:
    ##                              PC1       PC2       PC3       PC4       PC5
    ## Standard deviation     2529.6298 2157.3434 1459.4839 551.68183 369.10901
    ## Proportion of Variance    0.4591    0.3339    0.1528   0.02184   0.00977
    ## Cumulative Proportion     0.4591    0.7930    0.9458   0.96764   0.97741

The proportion of variance explained by each component decreases with increasing numeric labelling of the components.

- **UL3.3** *Unsupervised Learning Software for which input data does not generally include labels (such as `array`-like data with no row names) should provide an additional parameter to enable cases to be labelled.*

#### 3.2 Prediction

- **UL3.4** *Where applicable, Unsupervised Learning Software should implement routines to predict the properties (such as numerical ordinates, or cluster memberships) of additional new data without re-running the entire algorithm.*

While many algorithms, such as hierarchical clustering, can not (readily) be used to predict memberships of new data, other algorithms can nevertheless be applied to perform this task. The following demonstrates how the output of [`stats::hclust`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html) can be used to predict membership of new data using the [`class::knn()` function](https://stat.ethz.ch/R-manual/R-devel/library/class/html/knn.html). (This is intended to illustrate only one of many possible approaches.)
``` r
library (class)
set.seed (1)

hc <- hclust (dist (iris [, -5]))
groups <- cutree (hc, k = 3)

# function to randomly select part of a data.frame and add some randomness
sample_df <- function (x, n = 5) {
    x [sample (nrow (x), size = n), ] + runif (ncol (x) * n)
}
iris_new <- sample_df (iris [, -5], n = 5)

# use knn to predict membership of those new points:
knnClust <- knn (train = iris [, -5], test = iris_new, k = 1, cl = groups)
knnClust
```

    ## [1] 2 2 1 1 2
    ## Levels: 1 2 3

The [`stats::prcomp()` function](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) implements its own `predict()` method which conforms to this standard:

``` r
res <- prcomp (USArrests)
arrests_new <- sample_df (USArrests, n = 5)
predict (res, newdata = arrests_new)
```

    ##                       PC1        PC2        PC3       PC4
    ## North Carolina  165.17494 -30.693263 -11.682811  1.304563
    ## Maryland        129.44401  -4.132644  -2.161693  1.258237
    ## Ohio            -49.51994  12.748248   2.104966 -2.777463
    ## Colorado         35.78896  14.023774  12.869816  1.233391
    ## Georgia          41.28054  -7.203986   3.987152 -7.818416

#### 3.3 Group Distributions and Associated Statistics

Many unsupervised learning algorithms serve to label, categorise, or partition data. Software which performs any of these tasks will commonly output some kind of labelling or grouping scheme. The above example of principal components illustrates that the return object records the standard deviations associated with each component:

``` r
res <- prcomp (USArrests)
print (res)
```

    ## Standard deviations (1, .., p=4):
    ## [1] 83.732400 14.212402  6.489426  2.482790
    ## 
    ## Rotation (n x k) = (4 x 4):
    ##                 PC1         PC2         PC3         PC4
    ## Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
    ## Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
    ## UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
    ## Rape     0.07515550  0.20071807  0.97408059  0.07232502

``` r
summary (res)
```

    ## Importance of components:
    ##                            PC1      PC2    PC3     PC4
    ## Standard deviation     83.7324 14.21240 6.4894 2.48279
    ## Proportion of Variance  0.9655  0.02782 0.0058 0.00085
    ## Cumulative Proportion   0.9655  0.99335 0.9991 1.00000

Such output accords with the following standard:

- **UL3.5** *Objects returned from Unsupervised Learning Software which labels, categorises, or partitions data into discrete groups should include, or provide immediate access to, quantitative information on intra-group variances or equivalent, as well as on inter-group relationships where applicable.*

The above example of principal components is one where there are no inter-group relationships, and so that standard is fulfilled by providing information on intra-group variances alone. Discrete clustering algorithms, in contrast, yield results for which inter-group relationships are meaningful, and such relationships can generally be meaningfully provided. The [`hclust()` routine](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html), like many clustering routines, simply returns a *scheme* for devising an arbitrary number of clusters, and so can not meaningfully provide variances or relationships between such clusters. The [`cutree()` function](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cutree.html), however, does yield defined numbers of clusters, yet devoid of any quantitative information on variances or equivalent:

``` r
res <- hclust (dist (USArrests))
str (cutree (res, k = 5))
```

    ##  Named int [1:50] 1 1 1 2 1 2 3 1 4 2 ...
    ##  - attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
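The kind of quantitative information UL3.5 requires can nevertheless be constructed post hoc, as in the following minimal sketch; this is an illustration of the required information only, not a suggested interface:

``` r
# intra-group variance (or equivalent): mean squared distance of each
# observation from its group centroid
groups <- cutree (res, k = 5)
centroids <- apply (USArrests, 2, function (j) tapply (j, groups, mean))
vapply (sort (unique (groups)), function (g) {
    d <- sweep (as.matrix (USArrests [groups == g, ]), 2, centroids [g, ])
    mean (rowSums (d ^ 2))
}, numeric (1))
```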
Compare the `cutree()` output with that of a largely equivalent routine, the [`clara()` function](https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/clara.html) from the [`cluster` package](https://cran.r-project.org/package=cluster):

``` r
library (cluster)
cl <- clara (USArrests, k = 10) # direct clustering into specified number of clusters
cl$clusinfo
```

    ##       size  max_diss   av_diss isolation
    ##  [1,]    4 24.708298 14.284874 1.4837745
    ##  [2,]    6 28.857755 16.759943 1.7329563
    ##  [3,]    6 44.640565 23.718040 0.9677229
    ##  [4,]    6 28.005892 17.382196 0.8442061
    ##  [5,]    6 15.901258  9.363471 1.1037219
    ##  [6,]    7 29.407822 14.817031 0.9080598
    ##  [7,]    4 11.764353  6.781659 0.8165753
    ##  [8,]    3  8.766984  5.768183 0.3547323
    ##  [9,]    3 18.848077 10.101505 0.7176276
    ## [10,]    5 16.477257  8.468541 0.6273603

That object contains information on dissimilarities between each observation and cluster medoids, which in the context of **UL3.5** is “information on intra-group variances or equivalent”. Moreover, inter-group information is also available as the [“silhouette”](https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/silhouette.html) of the clustering scheme.

### 4 Return Results

- **UL4.0** *Unsupervised Learning Software should return some form of “model” object, generally through using or modifying existing class structures for model objects, or creating a new class of model objects.*
- **UL4.1** *Unsupervised Learning Software may enable an ability to generate a model object without actually fitting values. This may be useful for controlling batch processing of computationally intensive fitting algorithms.*
- **UL4.2** *The return object from Unsupervised Learning Software should include, or otherwise enable immediate extraction of, all parameters used to control the algorithm used.*

#### 4.1 Reporting Return Results

- **UL4.3** *Model objects returned by Unsupervised Learning Software should implement or appropriately extend a default `print` method which provides an on-screen summary of model (input) parameters and methods used to generate results. The `print` method may also summarise statistical aspects of the output data or results.*
    - **UL4.3a** *The default `print` method should always ensure only a restricted number of rows of any result matrices or equivalent are printed to the screen.*

The [`prcomp` objects](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/prcomp.html) returned from the function of the same name include potentially large matrices of component coordinates which are by default printed in their entirety to the screen. This is because the default print behaviour for most tabular objects in R (`matrix`, `data.frame`, and objects from the `Matrix` package, for example) is to print objects in their entirety (limited only by such options as `getOption ("max.print")`, which determines the maximal number of printed entries). Such default behaviour ought to be avoided, particularly in Unsupervised Learning Software which commonly returns objects containing large numbers of numeric entries; a sketch of a compliant `print` method follows the next standard.

- **UL4.4** *Unsupervised Learning Software should also implement `summary` methods for model objects which should summarise the primary statistics used in generating the model (such as numbers of observations, parameters of methods applied). The `summary` method may also provide summary statistics from the resultant model.*
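The following minimal sketch shows a `print` method which restricts the number of printed rows (**UL4.3a**). The class `"myclass"` and the structure of the object are illustrative assumptions only:

``` r
# print method for a hypothetical 'myclass' model object, which holds a
# potentially large matrix of coordinates in 'x$coords' (UL4.3, UL4.3a):
print.myclass <- function (x, nrows = 6L, ...) {
    cat ("Model fitted with k =", x$k, "groups\n")
    cat ("Coordinates (first", nrows, "of", nrow (x$coords), "rows):\n")
    print (utils::head (x$coords, n = nrows))
    invisible (x) # print methods should return their argument invisibly
}

m <- structure (list (k = 3, coords = matrix (rnorm (200), ncol = 2)),
                class = "myclass")
m # auto-printing dispatches to print.myclass; only 6 of 100 rows shown
```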
### 5 Documentation

### 6 Visualization

- **UL6.0** *Objects returned by Unsupervised Learning Software should have default `plot` methods, either through explicit implementation, extension of methods for existing model objects, through ensuring default methods work appropriately, or through explicit reference to helper packages such as [`factoextra`](https://github.com/kassambara/factoextra) and associated functions.*
- **UL6.1** *Where the default `plot` method is **NOT** a generic `plot` method dispatched on the class of return objects (that is, through an S3-type `plot.<myclass>` function or equivalent), that method dispatch (or equivalent) should nevertheless exist in order to explicitly direct users to the appropriate function.*
- **UL6.2** *Where default plot methods include labelling components of return objects (such as cluster labels), routines should ensure that labels are automatically placed to ensure readability, and/or that appropriate diagnostic messages are issued where readability is likely to be compromised (for example, through attempting to place too many labels).*

### 7 Testing

Unsupervised Learning Software should test the following properties and behaviours:

- **UL7.0** *Inappropriate types of input data are rejected with expected error messages.*

#### 7.1 Input Scaling

The following tests should be implemented for Unsupervised Learning Software for which inputs are presumed or required to be scaled in any particular ways (such as having mean values of zero).

- **UL7.1** *Tests should demonstrate that violations of assumed input properties yield unreliable or invalid outputs, and should clarify how such unreliability or invalidity is manifest through the properties of returned objects.*

#### 7.2 Output Labelling

With regard to labelling of output data, tests for Unsupervised Learning Software should:

- **UL7.2** *Demonstrate that labels placed on output data follow decreasing group sizes (**UL3.1**).*
- **UL7.3** *Demonstrate that labels on input data are propagated to, or may be recovered from, output data (see **UL3.3**).*

#### 7.3 Prediction

With regard to prediction, tests for Unsupervised Learning Software should:

- **UL7.4** *Demonstrate that submission of new data to a previously fitted model can generate results more efficiently than initial model fitting.*

#### 7.4 Batch Processing

For Unsupervised Learning Software which implements batch processing routines:

- **UL7.5** *Batch processing routines should be explicitly tested, commonly via extended tests (see **G4.10**–**G4.12**).*
    - **UL7.5a** *Tests of batch processing routines should demonstrate that equivalent results are obtained from direct (non-batch) processing.*
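As an illustration of the kinds of tests envisaged here, the following `testthat` sketch covers **UL7.0** and **UL7.2** for a hypothetical clustering function, `my_clusters()`; that function, its error message, and the expectation that it orders labels by group size are assumptions for illustration only:

``` r
library (testthat)

test_that ("inappropriate input is rejected with expected errors (UL7.0)", {
    # 'my_clusters()' is hypothetical; substitute the package's own routine,
    # and match against its documented error message:
    expect_error (my_clusters ("not numeric", k = 3), "must be numeric")
})

test_that ("labels follow decreasing group sizes (UL7.2 / UL3.1)", {
    groups <- my_clusters (USArrests, k = 10)
    sizes <- as.integer (table (groups))
    # sizes must be non-increasing with increasing integer label:
    expect_false (is.unsorted (rev (sizes)))
})
```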