Time Series Software Standards

--- title: Time Series Software Standards tags: statistical-software robots: noindex, nofollow ---  ## Time Series Software The category of Time Series software is arguably easier to define that the preceding categories, and represents any software the primary input of which is intended to be temporally structured data. Importantly, while “*temporally structured*” may often imply temporally ordered, this need not necessarily be the case. The primary definition of temporally structured data is that they possess some kind of index which can be used to extract temporal relationships. Time series software is presumed to perform one or more of the following steps: 1. Accept and validate input data 2. Apply data transformation and pre-processing steps 3. Apply one or more analytic algorithms 4. Return the result of that algorithmic application 5. Offer additional functionality such as printing or summarising return results This document details standards for each of these steps, each prefixed with “TS”. ### 1 Input data structures and validation Input validation is an important software task, and an important part of our standards. While there are many ways to approach validation, the class systems of R offer a particularly convenient and effective means. For Time Series Software in particular, a range of class systems have been developed, for which we refer to the section “Time Series Classes” in the CRAN Task view on [Time Series Analysis"](https://cran.r-project.org/web/views/TimeSeries.html), and the class-conversion package [`tsbox`](https://www.tsbox.help/). Software which uses and relies on defined classes can often validate input through affirming appropriate class(es). Software which does not use or rely on class systems will generally need specific routines to validate input data structures. In particular, because of the long history of time series software in R, and the variety of class systems for representing time series data, new time series packages should accept as many different classes of input as possible by according with the following standards: - **TS1.0** *Time Series Software should use and rely on explicit class systems developed for representing time series data, and should not permit generic, non-time-series input* The core algorithms of time-series software are often ultimately applied to simple vector objects, and some time series software accepts simple vector inputs, assuming these to represent temporally sequential data. Permitting such generic inputs nevertheless prevents any such assumptions from being asserted or tested. Missing values pose particular problems in this regard. A simple `na.omit()` call or similar will shorten the length of the vector by removing any `NA` values, and will change the explicit temporal relationship between elements. The use of explicit classes for time series generally ensures an ability to explicitly assert properties such as strict temporal regularity, and to control for any deviation from expected properties. - **TS1.1** *Time Series Software should explicitly document the types and classes of input data able to be passed to each function.* - **TS1.2** *Time Series Software should accept input data in as many time series specific classes as possible.* - **TS1.3** *Time Series Software should implement validation routines to confirm that inputs are of acceptable classes (or represented in otherwise appropriate ways for software which does not use class systems).* - **TS1.4** *Time Series Software should implement a single pre-processing routine to validate input data, and to appropriately transform it to a single uniform type to be passed to all subsequent data-processing functions (the [`tsbox` package](https://www.tsbox.help/) provides one convenient approach for this).* - **TS1.5** *The pre-processing function described above should maintain all time- or date-based components or attributes of input data.* For Time Series Software which relies on or implements custom classes or types for representing time-series data, the following standards should be adhered to: - **TS1.6** *The software should ensure strict ordering of the time, frequency, or equivalent ordering index variable.* - **TS1.7** *Any violations of ordering should be caught in the pre-processing stages of all functions.* #### 1.1 Time Intervals and Relative Time While most common packages and classes for time series data assume *absolute* temporal scales such as those represented in [`POSIX` classes](https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.POSIXlt.html) for dates or times, time series may also be quantified on *relative* scales where the temporal index variable quantifies intervals rather than absolute times or dates. Many analytic routines which accept time series inputs in absolute form are also appropriately applied to analogous data in relative form, and thus many packages should accept time series inputs both in absolute and relative forms. Software which can or should accept times series inputs in relative form should: - **TS1.8** *Accept inputs defined via the [`units` package](https://github.com/r-quantities/units/) for attributing SI units to R vectors.* - **TS1.9** *Where time intervals or periods may be days or months, be explicit about the system used to represent such, particularly regarding whether a calendar system is used, or whether a year is presumed to have 365 days, 365.2422 days, or some other value.* ### 2 Pre-processing and Variable Transformation #### 2.1 Missing Data One critical pre-processing step for Time Series Software is the appropriate handling of missing data. It is convenient to distinguish between *implicit* and *explicit* missing data. For regular time series, explicit missing data may be represented by `NA` values, while for irregular time series, implicit missing data may be represented by missing rows. The difference is demonstrated in the following table. <table> <caption> Missing Values </caption> <tbody> <tr class="odd"> <td style="text-align: left;"> Time </td> <td style="text-align: left;"> value </td> </tr> <tr class="even"> <td style="text-align: left;"> 08:43 </td> <td style="text-align: left;"> 0.71 </td> </tr> <tr class="odd"> <td style="text-align: left;"> 08:44 </td> <td style="text-align: left;"> NA </td> </tr> <tr class="odd"> <td style="text-align: left;"> 08:45 </td> <td style="text-align: left;"> 0.28 </td> </tr> <tr class="odd"> <td style="text-align: left;"> 08:47 </td> <td style="text-align: left;"> 0.34 </td> </tr> <tr class="odd"> <td style="text-align: left;"> 08:48 </td> <td style="text-align: left;"> 0.07 </td> </tr> </tbody> </table> The value for 08:46 is *implicitly missing*, while the value for 08:44 is *explicitly missing*. These two forms of missingness may connote different things, and may require different forms of pre-processing. With this in mind, and beyond the [*General Standards*](#general-standards) for missing data (**G2.13**–**G2.16**), the following standards apply: - **TS2.0** *Time Series Software which presumes or requires regular data should only allow **explicit** missing values, and should issue appropriate diagnostic messages, potentially including errors, in response to any **implicit** missing values.* - **TS2.1** *Where possible, all functions should provide options for users to specify how to handle missing data, with options minimally including:* - **TS2.1a** \*error on missing data; or. - **TS2.1b** *warn or ignore missing data, and proceed to analyse irregular data, ensuring that results from function calls with regular yet missing data return identical values to submitting equivalent irregular data with no missing values; or* - **TS2.1c** *replace missing data with appropriately imputed values.* This latter standard is a modified version of *General Standard* **G2.14**, with additional requirements via **TS2.1b**. #### 2.2 Stationarity Time Series Software should explicitly document assumptions or requirements made with respect to the stationarity or otherwise of all input data. In particular, any (sub-)functions which assume or rely on stationarity should: - **TS2.2** *Consider stationarity of all relevant moments - typically first (mean) and second (variance) order, or otherwise document why such consideration may be restricted to lower orders only.* - **TS2.3** *Explicitly document all assumptions and/or requirements of stationarity* - **TS2.4** *Implement appropriate checks for all relevant forms of stationarity, and either:* - **TS2.4a** *issue diagnostic messages or warnings; or* - **TS2.4b** *enable or advise on appropriate transformations to ensure stationarity.* The two options in the last point (TS2.4b) respectively translate to *enabling* transformations to ensure stationarity by providing appropriate routines, generally triggered by some function parameter, or *advising* on appropriate transformations, for example by directing users to additional functions able to implement appropriate transformations. #### 2.3 Covariance Matrices Where covariance matrices are constructed or otherwise used within or as input to functions, they should: - **TS2.5** *Incorporate a system to ensure that both row and column orders follow the same ordering as the underlying time series data. This may, for example, be done by including the `index` attribute of the time series data as an attribute of the covariance matrix.* - **TS2.6** *Where applicable, covariance matrices should also include specification of appropriate units.* *General Standard* **G3.1** also applies to all Time Series Software which constructs or uses covariance matrices. ### 3 Analytic Algorithms Analytic algorithms are considered here to reflect the core analytic components of Time Series Software. These may be many and varied, and we explicitly consider only a small subset here. #### 3.1 Forecasting Statistical software which implements forecasting routines should: - **TS3.0** *Provide tests to demonstrate at least one case in which errors widen appropriately with forecast horizon.* - **TS3.1** *If possible, provide at least one test which violates TS3.0* - **TS3.2** *Document the general drivers of forecast errors or horizons, as demonstrated via the particular cases of TS3.0 and TS3.1* - **TS3.3** *Either:* - **TS3.3a** *Document, preferable via an example, how to trim forecast values based on a specified error margin or equivalent; or* - **TS3.3b** *Provide an explicit mechanism to trim forecast values to a specified error margin, either via an explicit post-processing function, or via an input parameter to a primary analytic function.* ### 4 Return Results For (functions within) Time Series Software which return time series data: - **TS4.0** *Return values should either:* - **TS4.0a** *Be in same class as input data, for example by using the [`tsbox` package](https://www.tsbox.help/) to re-convert from standard internal format (see 1.4, above); or* - **TS4.0b** *Be in a unique, preferably class-defined, format.* - **TS4.1** *Any units included as attributes of input data should also be included within return values.* - **TS4.2** *The type and class of all return values should be explicitly documented.* For (functions within) Time Series Software which return data other than direct series: - **TS4.3** *Return values should explicitly include all appropriate units and/or time scales* #### 4.1 Data Transformation Time Series Software which internally implements routines for transforming data to achieve stationarity and which returns forecast values should: - **TS4.4** *Document the effect of any such transformations on forecast data, including potential effects on both first- and second-order estimates.* - **TS4.5** *In decreasing order of preference, either:* - **TS4.5a** *Provide explicit routines or options to back-transform data commensurate with original, non-stationary input data* - **TS4.5b** *Demonstrate how data may be back-transformed to a form commensurate with original, non-stationary input data.* - **TS4.5c** *Document associated limitations on forecast values* #### 4.2 Forecasting Where Time Series Software implements or otherwise enables forecasting abilities, it should return one of the following three kinds of information. These are presented in decreasing order of preference, such that software should strive to return the first kind of object, failing that the second, and only the third as a last resort. - **TS4.6** *Time Series Software which implements or otherwise enables forecasting should return either:* - **TS4.6a** *A distribution object, for example via one of the many packages described in the CRAN Task View on [Probability Distributions](https://cran.r-project.org/web/views/Distributions.html) (or the new [`distributional` package](https://pkg.mitchelloharawild.com/distributional/) as used in the [`fable` package](https://fable.tidyverts.org) for time-series forecasting).* - **TS4.6b** *For each variable to be forecast, predicted values equivalent to first- and second-order moments (for example, mean and standard error values).* - **TS4.6c** *Some more general indication of error associated with forecast estimates.* Beyond these particular standards for return objects, Time Series Software which implements or otherwise enables forecasting should: - **TS4.7** *Ensure that forecast (modelled) values are clearly distinguished from observed (model or input) values, either (in this case in no order of preference) by* - **TS4.7a** *Returning forecast values alone* - **TS4.7b** *Returning distinct list items for model and forecast values* - **TS4.7c** *Combining model and forecast values into a single return object with an appropriate additional column clearly distinguishing the two kinds of data.* ### 5 Visualization Time Series Software should: - **TS5.0** *Implement default `plot` methods for any implemented class system.* - **TS5.1** *When representing results in temporal domain(s), ensure that one axis is clearly labelled “time” (or equivalent), with continuous units.* - **TS5.2** *Default to placing the “time” (or equivalent) variable on the horizontal axis.* - **TS5.3** *Ensure that units of the time, frequency, or index variable are printed by default on the axis.* - **TS5.4** *For frequency visualization, abscissa spanning \[ − *π*, *π*\] should be avoided in favour of positive units of \[0, 2*π*\] or \[0, 0.5\], in all cases with appropriate additional explanation of units.* - **TS5.5** *Provide options to determine whether plots of data with missing values should generate continuous or broken lines.* For the results of forecast operations, Time Series Software should - **TS5.6** *By default indicate distributional limits of forecast on plot* - **TS5.7** *By default include model (input) values in plot, as well as forecast (output) values* - **TS5.8** *By default provide clear visual distinction between model (input) values and forecast (output) values.*