# Design document for ptf v2 - Unified Output format
## Introduction
### Aim
The enhancement proposal aims to:
- Standardise the output formats of all time series forecasting models in `pytorch-forecasting` starting with v2.
- Update the existing metrics to handle this standardised output format.
### Background
- After the initial design for the DLinear model in the merged PR [#1874](https://github.com/sktime/pytorch-forecasting/pull/1874), @agobbifbk proposed a standard output format for all models in PTF v2, to ensure a clean interface and reduce the complexity of handling outputs for different types of forecasts and loss functions.
- The idea was welcomed by other members of the team working on `pytorch-forecasting` v2.
### Context
- There has been preliminary work on the potential changes for standardising the outputs and metrics:
    - PR [#1895](https://github.com/sktime/pytorch-forecasting/pull/1895)
    - PR [#1897](https://github.com/sktime/pytorch-forecasting/pull/1897)
- Work on these PRs is paused, pending agreement on the design suggested by this enhancement proposal.
## Description of status quo
### Model outputs
#### High-level summary
* the `predict` function of a model returns a tensor whose shape depends on the following inputs and varies by model:
    * the `loss` chosen when constructing the model, e.g., `QuantileLoss`, a loss from `pytorch-forecasting`
    * an integer `batch_size` passed to the dataloader
* an integer `max_prediction_length` passed in `TimeSeriesDataSet`
* the `predict` return is not currently consistent across models; typical (non-exhaustive) outputs for a single loss are:
* 3D tensor `(batch_size, max_prediction_length, loss_dims)`, where `loss_dims` depends on the loss chosen, e.g., quantile loss with 3 quantiles leads to `loss_dims=3`
* 2D tensor `(batch_size, max_prediction_length)` for models like `NBeats`.
* some models also accept a multi-loss; the `predict` output is then a list of tensors `(batch_size, max_prediction_length, loss_dims)` or `(batch_size, max_prediction_length)`, depending on the chosen model and loss, with `loss_dims` looping over the dimensions implied by each loss
#### Base vignette for `predict`
```python=
# User defines the prediction_length and batch_size
prediction_length = 20 # as an example
batch_size = 32 # as an example
# Setup the dataset and the dataloader
dataset = TimeSeriesDataSet(
...,
max_prediction_length=prediction_length
)
dataloader = dataset.to_dataloader(
...,
batch_size = batch_size
)
# Define Loss
loss = QuantileLoss(quantiles = [0.1, 0.5, 0.7])
# this will imply loss_dims = 3 = len(quantiles)
# other losses can be chosen, will imply different loss_dims
model_cls = TFT
# other model classes can be chosen
# these will differ in "other params"
# and the exact prediction return
model = model_cls.from_dataset(
dataset,
loss=loss,
... # other params
)
raw_predictions = model.predict(
dataloader,
mode="raw",
...
)
# raw_predictions is a 3D tensor
# will be (batch, prediction_length, loss_dims)
```
The exact prediction return depends on the model. Currently there are two possibilities:
* 3D tensor `(batch, prediction_length, loss_dims)`: `TFT`, `Tide`, `NHiTS`, `DeepAR`, `DecoderMLP`, `RecurrentNetwork`, *Special Case: `TimeXer`*
* 2D tensor `(batch, prediction_length)`: `NBEATS`, *Special Case: `TimeXer`*
* these models can only support point prediction losses
* Special Cases:
* `TimeXer`:
* `TimeXer`'s `forward` pass returns a raw tensor that can be a 2D or 3D depending on the type of `loss` used.
* If `QuantileLoss` is used:
The output format of `forward` will be a 3D tensor `(batch, prediction_length, loss_dims)` where `loss_dims` depends upon the type of `loss`
* If point prediction losses like `MAE` are used:
The output format of `forward` will be a 2D tensor `(batch, prediction_length)`
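Below, a minimal sketch of the `TimeXer` special case, reusing `dataset` and `dataloader` from the base vignette above (shapes as described in the gist referenced just below):
```python=
# Hedged sketch: TimeXer's raw forward output shape depends on the loss.
# Assumes `dataset` and `dataloader` from the base vignette above.
model_q = TimeXer.from_dataset(dataset, loss=QuantileLoss(), ...)
raw_q = model_q.predict(dataloader, mode="raw", ...)
# 3D tensor (batch_size, prediction_length, len(quantiles))
model_p = TimeXer.from_dataset(dataset, loss=MAE(), ...)
raw_p = model_p.predict(dataloader, mode="raw", ...)
# 2D tensor (batch_size, prediction_length)
```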
For reference, you can also look at this [gist](https://gist.github.com/phoeenniixx/52ab40913fe4d0858c0beaa25e1b4e8c) that contains the output formats of each model with the loss that model is compatible with.
The following `loss_dims` are implied by metrics:
* point prediction losses imply `loss_dims=1`. Metrics are:
* `PoissonLoss`
* `SMAPE`
* `MAPE`
* `MAE`
* `CrossEntropy`
* `RMSE`
* `MASE`
* `TweedieLoss`
***NOTE: For some models like `NBEATS` and `TimeXer`, this `loss_dims` does not appear in the final raw output tensor of `forward`; in these models the output is a 2D tensor.***
* `QuantileLoss(quantiles=quantile_list)` implies `loss_dims=len(quantiles)`
* distributional loss implies `loss_dims` specific to the loss.
* sub-case 1: `NormalDistributionLoss`, `MultivariateNormalDistributionLoss`, `ImplicitQuantileNetworkDistributionLoss` imply
`loss_dims = len(loss.distribution_arguments) + 2`
* sub-case 2: other losses like `NegativeBinomialDistributionLoss`, `LogNormalDistributionLoss`, `BetaDistributionLoss` imply
`loss_dims = len(loss.distribution_arguments)`
The resulting 3D tensor is the expected input to the loss; see the later section "Metrics".
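The implied `loss_dims` can be read off a loss instance directly; a minimal sketch, with the values documented above:
```python=
from pytorch_forecasting.metrics import (
    MAE,
    QuantileLoss,
    NormalDistributionLoss,
    NegativeBinomialDistributionLoss,
)

mae = MAE()  # point prediction loss: loss_dims = 1
ql = QuantileLoss(quantiles=[0.1, 0.5, 0.9])
assert len(ql.quantiles) == 3  # loss_dims = len(quantiles)
ndl = NormalDistributionLoss()  # sub-case 1: target_scale is concatenated
assert len(ndl.distribution_arguments) + 2 == 4  # ["loc", "scale"] + 2
nbl = NegativeBinomialDistributionLoss()  # sub-case 2: no concatenation
assert len(nbl.distribution_arguments) == 2  # ["mean", "shape"]
```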
***NOTE: Not every model that returns a 3D tensor can be combined with every loss - compatibility depends on the model architecture. For example, `Tide` supports only point prediction losses, so its return is always `(batch, prediction_length, 1)`.***
### Loss Classes
For ease of reference, losses are grouped into four categories. Model compatibility is then defined in terms of these categories:
* `PP-numeric`: Losses that compute point predictions and expect numeric `y` values (regression).
* `PoissonLoss`
* `SMAPE`
* `MAPE`
* `MAE`
* `RMSE`
* `MASE`
* `TweedieLoss`
* `PP-categorical`: The point prediction losses for classification
* `CrossEntropy`
* `Distr`: Distribution Losses
* `NormalDistributionLoss`
* `MultivariateNormalDistributionLoss`
* `ImplicitQuantileNetworkDistributionLoss`
* `NegativeBinomialDistributionLoss`
* `LogNormalDistributionLoss`
* `BetaDistributionLoss`
* `Quantile`: meaning `QuantileLoss`
#### Models and the metrics they are compatible with
The models and the losses they are compatible with:
* `TemporalFusionTransformer`: `PP-numeric`, `PP-categorical`, `Distr`, `Quantile`
* `Tide`: `PP-numeric`
* `DeepAR`: `Distr`
* `DecoderMLP`: `PP-numeric`, `PP-categorical`, `Distr`, `Quantile`
* `NBEATS`: `PP-numeric`
* `NHiTS`: `PP-numeric`, `Distr`, `Quantile`
* `RecurrentNetwork`: `PP-numeric`
* `TimeXer`: `PP-numeric`, `Quantile`
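For reference, the same grouping and compatibility written down as plain Python dicts - a hypothetical helper for discussion and testing, not part of the current code base:
```python=
# Hypothetical helper - not part of the current code base.
from pytorch_forecasting.metrics import (
    MAE, MAPE, MASE, RMSE, SMAPE, PoissonLoss, TweedieLoss, CrossEntropy,
    QuantileLoss, NormalDistributionLoss, MultivariateNormalDistributionLoss,
    ImplicitQuantileNetworkDistributionLoss, NegativeBinomialDistributionLoss,
    LogNormalDistributionLoss, BetaDistributionLoss,
)

LOSS_CATEGORIES = {
    "PP-numeric": [PoissonLoss, SMAPE, MAPE, MAE, RMSE, MASE, TweedieLoss],
    "PP-categorical": [CrossEntropy],
    "Distr": [
        NormalDistributionLoss, MultivariateNormalDistributionLoss,
        ImplicitQuantileNetworkDistributionLoss, NegativeBinomialDistributionLoss,
        LogNormalDistributionLoss, BetaDistributionLoss,
    ],
    "Quantile": [QuantileLoss],
}

MODEL_COMPATIBILITY = {
    "TemporalFusionTransformer": ["PP-numeric", "PP-categorical", "Distr", "Quantile"],
    "Tide": ["PP-numeric"],
    "DeepAR": ["Distr"],
    "DecoderMLP": ["PP-numeric", "PP-categorical", "Distr", "Quantile"],
    "NBEATS": ["PP-numeric"],
    "NHiTS": ["PP-numeric", "Distr", "Quantile"],
    "RecurrentNetwork": ["PP-numeric"],
    "TimeXer": ["PP-numeric", "Quantile"],
}
```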
#### Interpretation of parameters
- `batch_size`: The number of time series datapoints in every "batch" of data returned by the dataloader initialised on a `TimeSeriesDataSet`.
- `prediction_length`: The number of future timesteps the model forecasts; the same as the "forecast horizon" in the more general setting. For encoder-decoder models, `prediction_length` determines the size of the time dimension of the prediction output tensor.
#### Internal private API to infer `loss_dims`
`loss_dims` depends on the `loss` chosen when constructing an instance of the model class. It is inferred by `deduce_default_output_parameters()` of the `BaseModel` class [[Source]](https://github.com/sktime/pytorch-forecasting/blob/b2cfc147d515f4bd10ac87ff524de5a10e3e7b8f/pytorch_forecasting/models/base/_base_model.py#L653-L703).
Exceptions:
* some models do not allow the use of probabilistic losses, e.g., `NBEATS`, so no `loss_dims` needs to be inferred
* the inference logic is different for models with linear layers, e.g., `DeepAR`
#### Further details
***NOTE: The vignettes use `predict(mode="raw")` to inspect the direct output of the model's `forward` pass. The exact shape of this output can depend on other arguments, such as `n_samples`, as detailed in the specific examples.***
There are multiple ways in which the output format of a model is decided:
* **Using `BaseModel`'s functionality** - The most common way is using `deduce_default_output_parameters()` of `BaseModel`; this happens in three steps:
* **Inference from Loss**: The method first inspects the loss function to determine how many output values are needed per time step. For example, a `QuantileLoss` with 7 quantiles will require an `output_size` of 7, while a point-forecasting loss like `RMSE` will require an `output_size` of 1.
* **Propagation to Model**: This inferred `output_size` is then passed as a hyperparameter to the model's constructor, typically when you use the `.from_dataset()` method.
* **Architectural Impact**: Finally, the model uses this `output_size` to build its final `output_layer`, ensuring the layer's dimensions perfectly match what the loss function expects to evaluate.
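A conceptual sketch of these three steps - this is not the actual implementation of `deduce_default_output_parameters()`, only an illustration of its logic:
```python=
from pytorch_forecasting.metrics import DistributionLoss, QuantileLoss

def infer_output_size(loss):
    """Step 1 (conceptual): inspect the loss for the per-timestep width."""
    if isinstance(loss, QuantileLoss):
        return len(loss.quantiles)
    if isinstance(loss, DistributionLoss):
        return len(loss.distribution_arguments)
    return 1  # point prediction losses

output_size = infer_output_size(QuantileLoss())  # 7 for the default quantiles
# Step 2: from_dataset() passes output_size into the model constructor.
# Step 3: the model sizes its final output layer with it, e.g.
#         nn.Linear(hidden_size, output_size).
```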
For `TFT`, the output format is decided based on the chosen `loss`, which `from_dataset()` uses to infer the output format [[source]](https://github.com/sktime/pytorch-forecasting/blob/b2cfc147d515f4bd10ac87ff524de5a10e3e7b8f/pytorch_forecasting/models/temporal_fusion_transformer/_tft.py#L458-L460)
The actual compatibility of a model and a specific `loss` function depends on the internal structure of the model and on whether its output tensor shapes align with the expectations of that `loss`.
For example, the `Tide` model is designed for point prediction, so it is incompatible with loss functions like `QuantileLoss` and `DistributionLoss`.
<!--
TODO: check how many different cases there are and try to minimize the different cases in the explanation. For now, vignettes are ok, but seem redundant (are they, is the question) -->
* **Linear Layers** - Some models like `DeepAR` use `nn.Linear` layers:
***NOTE: The vignettes use the specific combination `predict(mode="raw", n_samples=None)` to directly access the raw parameters of the forecast distribution. This allows us to see the final tensor whose shape is being explained. Omitting `n_samples=None` would result in sampled paths instead of distribution parameters.***
* For `DeepAR`, linear layers for argument projection (`distribution_projector` layer) are used [[source]](https://github.com/sktime/pytorch-forecasting/blob/b2cfc147d515f4bd10ac87ff524de5a10e3e7b8f/pytorch_forecasting/models/deepar/_deepar.py#L189-L200). The model's `distribution_projector` layer is built dynamically based on the arguments of the chosen `loss` function. This process involves two key steps: creating the initial network output and then rescaling it to produce the final raw prediction tensor.
* For a single target, it creates one `nn.Linear` layer whose output dimension is `len(loss.distribution_arguments)`. The final prediction is a single tensor of shape `(batch_size, prediction_length, num_parameters)`.
The `num_parameters` here depends on the type of `DistributionLoss` used.
* if the loss is one of the following losses, the `num_parameters` will be `len(loss.distribution_arguments) + 2`:
* `NormalDistributionLoss`
* `MultivariateNormalDistributionLoss`
* `ImplicitQuantileNetworkDistributionLoss`
This "+2" comes concatenation of `target_scale` with other `distribution_arguments` of the loss in `rescale parameters`. Example can be seen in `MultivariateNormalDistributionLoss` [here](https://github.com/sktime/pytorch-forecasting/blob/b2cfc147d515f4bd10ac87ff524de5a10e3e7b8f/pytorch_forecasting/metrics/distributions.py#L151-L159)
```python=
# User defines the prediction_length and batch_size
prediction_length = 20 # for example
batch_size = 32 # for example
# Setup the dataset and the dataloader
dataset = TimeSeriesDataSet(
...,
max_prediction_length=prediction_length
)
dataloader = dataset.to_dataloader(
...,
batch_size = batch_size
)
#-----------Loss 1------------------------
loss_1 = NormalDistributionLoss()
# has 2 distribution_arguments = ["loc", "scale"]
model_1 = DeepAR.from_dataset(
dataset,
loss=loss_1,
... # other params
)
raw_predictions_1 = model_1.predict(
dataloader,
mode="raw",
n_samples=None,
...
)
# the shape of the raw output of forward pass
# will be (batch_size, prediction_length, 4)
# because len(distribution_arguments) == 2
# so, num_parameters = len(distribution_arguments) + 2 = 4
#-----------Loss 2------------------------
loss_2 = ImplicitQuantileNetworkDistributionLoss()
# has distribution_arguments = list(range(int(input_size)))
# where input_size is 16 by default and can be changed by the
# user if they want
model_2 = DeepAR.from_dataset(
dataset,
loss=loss_2,
... # other params
)
raw_predictions_2 = model_2.predict(
dataloader,
mode="raw",
n_samples=None,
...
)
# the shape of the raw output of forward pass
# will be (batch_size, prediction_length, 18)
# if input_size is 16
# as, num_parameters = len(distribution_arguments) + 2 = 18
# same is the case with MultivariateNormalDistributionLoss
# where len(distribution_arguments) = 2 + rank,
# so the num_parameters will be
# len(loss.distribution_arguments) + 2
# or 2 + (2 + rank)
```
* Other `DistributionLoss` functions like `NegativeBinomialDistributionLoss`, `LogNormalDistributionLoss`, and `BetaDistributionLoss` do not use the `target_scale` concatenation, so `num_parameters` is always `len(loss.distribution_arguments)`.
```python=
# User defines the prediction_length and batch_size
prediction_length = 20 # for example
batch_size = 32 # for example
# Setup the dataset and the dataloader
dataset = TimeSeriesDataSet(
...,
max_prediction_length=prediction_length
)
dataloader = dataset.to_dataloader(
...,
batch_size = batch_size
)
# Define Loss
loss = NegativeBinomialDistributionLoss()
# has 2 distribution_arguments = ["mean", "shape"]
model = DeepAR.from_dataset(
dataset,
loss=loss,
... # other params
)
raw_predictions = model.predict(
dataloader,
mode="raw",
n_samples=None,
...
)
# the shape of the raw output of forward pass
# will be (batch_size, prediction_length, 2)
# because len(distribution_arguments) == 2
# same is the case with LogNormalDistributionLoss and
# BetaDistributionLoss where num_parameters will
# be len(loss.distribution_arguments)
```
* For multiple targets (using a `MultiLoss`), the model creates an `nn.ModuleList` with a separate `nn.Linear` layer for each target. The final prediction is a list of tensors. Each tensor's shape `(batch_size, prediction_length, num_parameters_for_target_i)` depends on its corresponding `loss` function.
```python=
# User defines the prediction_length and batch_size
prediction_length = 20 # for example
batch_size = 32 # for example
# setup dataset and dataloader
dataset = TimeSeriesDataSet(
...,
target = [target_1, target_2],
target_normalizer=MultiNormalizer([
GroupNormalizer(),
GroupNormalizer(transformation="logit")
        # for BetaDistributionLoss, the transformation
        # always has to be "logit"
]),
max_prediction_length=prediction_length
)
dataloader = dataset.to_dataloader(
...,
batch_size = batch_size
)
# Define Loss
loss_1 = NormalDistributionLoss()
loss_2 = BetaDistributionLoss()
multi_loss = MultiLoss([loss_1, loss_2])
model = DeepAR.from_dataset(
dataset,
loss=multi_loss,
... # other params
)
raw_predictions = model.predict(
dataloader,
mode="raw",
    n_samples=None,
...
)
output_list = raw_predictions
# Output for the target_1 (NormalDistributionLoss)
raw_target_1 = output_list[0]
# Shape will be (batch_size, prediction_length, 4)
# 2 from NormalDistributionLoss's arguments
# + 2 from target_scale (concatenated)
# Output for the target_2 (BetaDistributionLoss)
raw_target_2 = output_list[1]
# Shape will be (batch_size, prediction_length, 2)
# 2 from BetaDistributionLoss's arguments.
```
* The `N-BEATS` model is independent of the loss and does not change its output format based on the loss function. The final forecast tensor always has the shape `(batch_size, prediction_length)`.
    * **Compatible Losses**: N-BEATS works seamlessly with any standard regression loss (e.g., `RMSE`, `MAE`). These loss functions expect exactly one prediction per target value, a contract that `N-BEATS` fulfills.
* **Incompatible Losses**: Attempting to use a more complex loss function would cause a `RuntimeError` due to a fundamental mismatch in this contract:
* `QuantileLoss`: Requires a prediction tensor with an additional dimension corresponding to the number of quantiles (e.g., shape (..., `num_quantiles`)), which `N-BEATS` does not provide.
* `DistributionLoss` (and its subclasses): Expects a prediction tensor with an additional dimension for the distribution's parameters (e.g., shape (..., 2) for `loc` and `scale`), which `N-BEATS` is not designed to learn or output.
```python=
# User defines the prediction_length and batch_size
prediction_length = 20 # just an example
batch_size = 32 # just an example
# Setup the dataset and the dataloader
dataset = TimeSeriesDataset(
...,
max_prediction_length=prediction_length
)
dataloader = dataset.to_dataloader(
...,
batch_size = batch_size
)
# Define Loss
loss = MAE()
model = NBeats.from_dataset(
dataset,
loss=loss,
... # other params
)
raw_predictions = model.predict(
dataloader,
mode="raw",
...
)
# shape of raw output of forward pass
# will be (batch, prediction_length)
```
##### Parked
- `timesteps`: In a general sense, `timesteps` refers to the size of the lookback window/encoder length/context length of the model when a time series is input. Every time series is split into a `torch.Tensor` containing `timesteps` samples from the series, and these are stacked in sets of `batch_size` to give a tensor of shape `(batch_size, timesteps)`. Input tensors are assumed to have `timesteps` as the 2nd dimension.
### Metrics
#### High-level summary
* metrics are classes; the constructor defines metric parameters. Examples: `MAE`, `RMSE`
* the `update` method is for data ingestion, `loss` and `compute` for evaluation
* `loss` is single pass, `compute` is aggregate
* internally, the private `Metric` class is used for evaluation, this inherits from `LightningMetric`
There are three types of metrics, differing in API:
* point prediction metrics, like `RMSE`, `MAE`. Point metrics also have a sub-category, "point classification": `CrossEntropy`, which handles multi-horizon prediction on categorical variables.
* quantile prediction metrics, like `QuantileLoss`
* distribution prediction metrics, like `NormalDistributionLoss`, `MQF2DistributionLoss`
All losses used in the code inherit from `Metric`, which inherits from `LightningMetric`.
There are some metrics which inherit from `LightningMetric` (e.g., `CompositeMetric`), which we have not seen used in code.
#### vignette for use of metrics - overview
All metrics follow the same vignette, however the expected dimensions for predictions (`y_pred`) and actual values (`y_actual`) vary; the metric may have constructor arguments, e.g., `quantiles` in the `QuantileLoss`.
Illustrative vignettes follow.
##### Simple use - single batch
Example for MAE loss:
```python=
import torch
from pytorch_forecasting.metrics import MAE
# Sample data: batch_size=2, sequence_length=5
y_pred = torch.randn(2, 5) # Network predictions
y_actual = torch.randn(2, 5) # Ground truth targets
# Initialize loss function - construct instance
mae_loss = MAE()
# Calculate loss
mae_loss.update(y_pred=y_pred, target=y_actual)
final_loss = mae_loss.compute() # Returns: scalar tensor
print(f"Input shape: {y_pred.shape}") # torch.Size([2, 5])
print(f"Output loss: {final_loss}") # tensor(1.2345)
print(f"Loss shape: {final_loss.shape}") # torch.Size([])
```
Example for CrossEntropy loss:
```python=
import torch
from pytorch_forecasting.metrics import CrossEntropy
# Sample data: batch_size=2, sequence_length=5, num_classes=3
y_pred = torch.randn(2, 5, 3) # Network predictions (logits for each class)
y_actual = torch.randint(0, 3, (2, 5)) # Ground truth class indices
# Initialize loss function - construct instance
ce_loss = CrossEntropy()
# Calculate loss
ce_loss.update(y_pred=y_pred, target=y_actual)
final_loss = ce_loss.compute() # Returns: scalar tensor
print(f"Input shape: {y_pred.shape}") # torch.Size([2, 5, 3])
print(f"Target shape: {y_actual.shape}") # torch.Size([2, 5])
print(f"Output loss: {final_loss}") # tensor(1.2345)
print(f"Loss shape: {final_loss.shape}") # torch.Size([])
```
Example for quantile loss:
```python=
import torch
from pytorch_forecasting.metrics import QuantileLoss
# Sample data with quantile predictions
batch_size, seq_len, n_quantiles = 2, 5, 7
y_pred = torch.randn(batch_size, seq_len, n_quantiles) # 7 quantiles per prediction
y_actual = torch.randn(batch_size, seq_len) # Single target value
# Initialize quantile loss
quantile_loss = QuantileLoss(quantiles=[0.1, 0.5, 0.9])
# (default: [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98])
# Calculate loss
quantile_loss.update(y_pred=y_pred, target=y_actual)
final_loss = quantile_loss.compute()
print(f"Prediction shape: {y_pred.shape}") # torch.Size([2, 5, 7])
print(f"Target shape: {y_actual.shape}") # torch.Size([2, 5])
print(f"Output loss: {final_loss}") # tensor(2.3456)
```
Example for distribution loss:
```python=
import torch
from pytorch_forecasting.metrics import NormalDistributionLoss
from pytorch_forecasting.data.encoders import TorchNormalizer
# Sample data: predicting 2 parameters (mean, std) for Normal distribution
batch_size, seq_len, n_params = 2, 5, 2
y_pred = torch.randn(batch_size, seq_len, n_params) # [mean, log_std] parameters
y_actual = torch.randn(batch_size, seq_len) # Actual values
# Dummy target scale and encoder (replace with real ones in practice)
target_scale = torch.ones(batch_size, 2) # e.g., [mean, std] for each sample
encoder = TorchNormalizer() # or your dataset's normalizer
# Initialize distribution loss
normal_loss = NormalDistributionLoss()
# Rescale network outputs to match target scale
y_pred_rescaled = normal_loss.rescale_parameters(
parameters=y_pred, target_scale=target_scale, encoder=encoder
)
# Calculate negative log-likelihood loss
normal_loss.update(y_pred=y_pred_rescaled, target=y_actual)
final_loss = normal_loss.compute()
print(f"Prediction shape: {y_pred.shape}") # torch.Size([2, 5, 2])
print(f"Rescaled shape: {y_pred_rescaled.shape}") # torch.Size([2, 5, 4]) (depends on implementation)
print(f"Target shape: {y_actual.shape}") # torch.Size([2, 5])
print(f"Output loss: {final_loss}") # tensor(1.8765)
```
##### Simple use - multi-batch
The `update` method can be called multiple times, and it is intended - and typically expected - that multiple batches are added to the loss before computation.
```python=
import torch
from pytorch_forecasting.metrics import MAE
# Sample data: batch_size=2, sequence_length=5
y_pred = torch.randn(2, 5) # Network predictions
y_actual = torch.randn(2, 5) # Ground truth targets
y_pred2 = torch.randn(2, 5) # Network predictions
y_actual2 = torch.randn(2, 5) # Ground truth targets
# Initialize loss function - construct instance
mae_loss = MAE()
# Calculate loss
mae_loss.update(y_pred=y_pred, target=y_actual)
mae_loss.update(y_pred=y_pred2, target=y_actual2)
final_loss = mae_loss.compute() # Returns: scalar tensor
print(f"Input shape: {y_pred.shape}") # torch.Size([2, 5])
print(f"Output loss: {final_loss}") # tensor(1.2345)
print(f"Loss shape: {final_loss.shape}") # torch.Size([])
```
##### Simple use - by-entry
Metrics also have a `loss` method, which computes the metric entry-wise. This is not available for all metrics.
```python=
import torch
from pytorch_forecasting.metrics import MAE
# Sample data: batch_size=2, sequence_length=5
y_pred = torch.randn(2, 5) # Network predictions
y_actual = torch.randn(2, 5) # Ground truth targets
# Initialize loss function - construct instance
mae_loss = MAE()
# Calculate loss
entrywise_loss = mae_loss.loss(y_pred=y_pred, target=y_actual)
print(f"Input shape: {y_pred.shape}") # torch.Size([2, 5])
print(f"Loss shape: {entrywise_loss.shape}") # torch.Size([2, 5])
```
#### key methods of `Metric` and child classes
`update(y_pred, target, **more_args)` and `loss(y_pred, target, **more_args)`. Additionally, in distribution metrics, `rescale_parameters` is required for normalising the parameters to the scale required for the output.
* `target` must be one of:
* 2D `(batch, timepoints)` `torch.Tensor` for both methods
    * a tuple of tensors `(tensor, weights)`, where `tensor` is any acceptable (2D) tensor representing the ground truth values and `weights` is a tensor of shape `(batch_size, 1)`, `(1, timesteps)`, `(batch_size,)`, or the same shape as `tensor`. Broadcasting of weights is possible in the first three cases.
    * `rnn.PackedSequence` - a 2D array-like (list of 1D tensors).
* `y_pred` must be one of:
* 3D `(batch, timepoints, loss_dim)` `torch.Tensor`, with `loss_dim` an integer specific to the loss (`loss_dim` >= 2).
* 2D `(batch, timepoints)` - only permitted for point prediction metrics. Interpreted the same as `(batch, timepoints, 1)`.
All metrics used in code inherit from `MultiHorizonMetric`, the lowest level abstract interface class.
The following vignette lists all input conventions for the target dtypes and metric class methods.
```python=
import torch
from torch.nn.utils.rnn import pack_sequence
from pytorch_forecasting.metrics import QuantileLoss

# Sample data
loss = QuantileLoss()  # loss_dim is specific to the metric; 7 for the default quantiles
loss_dim = len(loss.quantiles)
y_pred = torch.randn(2, 5, loss_dim) # Network predictions
y_actual = torch.randn(2, 5) # Ground truth targets - 2D tensor
# option 1 - Simple tensors
loss.update(y_pred=y_pred, target=y_actual)
# option 2 - Weighted targets
weights = torch.ones(2, 5) * 0.8 # Give less weight to some samples
loss.update(y_pred=y_pred, target=(y_actual, weights))
# option 3 - Variable-length sequences (PackedSequence)
sequences = [torch.randn(3), torch.randn(5), torch.randn(2)] # Different lengths
packed_targets = pack_sequence(sequences, enforce_sorted=False)
# predictions would need to be appropriately shaped for the packed format
loss.update(y_pred, packed_targets)
```
#### Inconsistencies in naming
* in most metrics, the "actual value" argument is called `target`, in methods `loss` and `update`.
* however, in the base class, it is called `y_actual` - but the methods are always overridden in concrete child classes, so this does not cause an error
* in the `Metric` class it is `y_actual` but it is an abstract method.
* in the `MultiHorizonMetric` (child class of `Metric`), `target` is the argument name used for representing "actual value" in `loss` and `update`.
* in the concrete metric implementations of point, distribution and quantile loss functions, `target` is used for the "actual value" param in `loss` and `update` (where it exists) for point losses, while distribution losses use `y_actual`.
* Conclusion: there is some inconsistency in naming the "actual value" parameter in these methods. Since concrete metric classes implement their own version of `loss()`, this does not break the code; it is only a naming concern and might cause confusion.
#### secondary methods
Two public methods are exposed - and internally used - to convert between 3D and 2D prediction formats:
- `to_prediction`: This method is a key part of the current efforts in v1 for enforcing standardisation of model outputs with point forecasts, contingent on the loss function. There are two instances of `to_prediction` - in `BaseModel` and in `Metric` (additionally overridden by its child classes for the actual loss functions).
    - `to_prediction` follows this contract for input and output, based on the `metric_type`:
        - "Point metrics": Input - 2D tensor, Output - 2D tensor
        - "Quantile metrics": Input - 3D tensor, Output - 2D tensor (along the median quantile)
        - "Distribution metrics": Input - 3D tensor, Output - 2D tensor
##### Usage vignette of `to_prediction` (AI generated, with above description as context)
```python=
import torch
from pytorch_forecasting.metrics.base_metrics import Metric, MultiLoss
from pytorch_forecasting.metrics.point import PoissonLoss, CrossEntropy
from pytorch_forecasting.metrics.distributions import NormalDistributionLoss, NegativeBinomialDistributionLoss
# =============================================================================
# 1. BASE METRIC - Point Prediction Generation
# =============================================================================
# Case 1A: 2D input -> 2D output (no change)
base_metric = Metric()
y_pred_2d = torch.randn(32, 10) # (batch, time)
result = base_metric.to_prediction(y_pred_2d)
print(f"Base 2D: {y_pred_2d.shape} -> {result.shape}") # [32, 10] -> [32, 10]
# Case 1B: 3D input with quantiles -> 2D output (mean across quantiles)
quantile_metric = Metric(quantiles=[0.1, 0.5, 0.9])
y_pred_3d = torch.randn(32, 10, 3) # (batch, time, quantiles)
result = quantile_metric.to_prediction(y_pred_3d)
print(f"Base 3D quantiles: {y_pred_3d.shape} -> {result.shape}") # [32, 10, 3] -> [32, 10]
# Case 1C: 3D input no quantiles -> 2D output (extract first dim)
single_metric = Metric(quantiles=None)
y_pred_single = torch.randn(32, 10, 1) # (batch, time, 1)
result = single_metric.to_prediction(y_pred_single)
print(f"Base 3D single: {y_pred_single.shape} -> {result.shape}") # [32, 10, 1] -> [32, 10]
# =============================================================================
# 2. DISTRIBUTION LOSS - Distribution-Based Predictions
# =============================================================================
# Case 2A: Normal distribution (uses analytical mean)
normal_loss = NormalDistributionLoss()
y_pred_dist = torch.randn(32, 10, 4) # (batch, time, [scale_loc, scale_scale, mean, std])
result = normal_loss.to_prediction(y_pred_dist)
print(f"Distribution: {y_pred_dist.shape} -> {result.shape}") # [32, 10, 4] -> [32, 10]
# Case 2B: Negative Binomial (direct parameter extraction)
negbin_loss = NegativeBinomialDistributionLoss()
y_pred_negbin = torch.randn(32, 10, 2) # (batch, time, [mean, shape])
result = negbin_loss.to_prediction(y_pred_negbin)
print(f"NegBin: {y_pred_negbin.shape} -> {result.shape}") # [32, 10, 2] -> [32, 10]
# =============================================================================
# 3. POINT PREDICTION LOSSES - Specialized Transformations
# =============================================================================
# Case 3A: Poisson (exponential transformation)
poisson_loss = PoissonLoss()
y_pred_poisson = {"prediction": torch.randn(32, 10)} # log-rates
result = poisson_loss.to_prediction(y_pred_poisson)
print(f"Poisson: dict -> {result.shape}, positive values: {(result > 0).all()}")
# Case 3B: CrossEntropy (argmax for classification)
ce_loss = CrossEntropy()
y_pred_logits = torch.randn(32, 10, 5) # (batch, time, classes)
result = ce_loss.to_prediction(y_pred_logits)
print(f"CrossEntropy: {y_pred_logits.shape} -> {result.shape}") # [32, 10, 5] -> [32, 10]
print(f"Class indices range: {result.min()}-{result.max()}")
# =============================================================================
# 4. MULTI-LOSS - Multiple Targets
# =============================================================================
# Case 4: Different loss types for different targets
multi_loss = MultiLoss([
Metric(), # Target 1: Simple point prediction
NormalDistributionLoss(), # Target 2: Probabilistic
CrossEntropy() # Target 3: Classification
])
y_pred_multi = [
torch.randn(32, 10), # Target 1: Point predictions
torch.randn(32, 10, 4), # Target 2: Distribution parameters
torch.randn(32, 10, 3) # Target 3: Class logits
]
results = multi_loss.to_prediction(y_pred_multi) # list of converted prediction for all target tensors.
print(f"MultiLoss targets: {len(results)} outputs")
for i, result in enumerate(results):
print(f" Target {i+1}: -> {result.shape}") # All -> [32, 10]
```
- On the model side, we call the same functions as above, but wrapped in a `to_prediction` method inside `BaseModel`; same usage vignette as above, substituted by models.
- `to_quantiles`: Converts network predictions into quantile predictions. The function signature takes two arguments: `y_pred`, a 2D/3D tensor, and a list of float values representing the quantiles (if `None`, defaults to `self.quantiles`). The behaviour of `to_quantiles` is metric-dependent, based on `y_pred` inputs and outputs:
    - "Point metrics": Input: 2D, Output: 3D tensor `(batch, timepoints, 1)`
    - "Quantile metrics" and "Point classification metric": Input: 3D, Output: 3D tensor (`y_pred` is output as is). `QuantileLoss` does not accept `quantiles` as an argument and returns `y_pred` unchanged. `CrossEntropy` makes no use of the `quantiles` argument even though it is present in the signature.
    - "Distribution metrics": Input: 3D, Output: 3D tensor `(batch, timepoints, len(quantiles))`.
##### Usage vignette of `to_quantiles` (AI generated, with above description as context)
```python=
import torch
from pytorch_forecasting.metrics.base_metrics import Metric
from pytorch_forecasting.metrics.distributions import NormalDistributionLoss
from pytorch_forecasting.metrics.point import PoissonLoss, CrossEntropy, MAE
# =============================================================================
# BASE METRIC - Tensor Dimension Conversion
# =============================================================================
# 2D input -> 3D output (add quantile dimension)
base_metric = Metric(quantiles=[0.1, 0.5, 0.9])
y_pred_2d = torch.randn(32, 10)
quantiles_from_2d = base_metric.to_quantiles(y_pred_2d)
# [32, 10] -> [32, 10, 1]
# 3D input with multiple values -> quantile conversion
y_pred_samples = torch.randn(32, 10, 100)
quantiles_converted = base_metric.to_quantiles(y_pred_samples, quantiles=[0.25, 0.5, 0.75])
# [32, 10, 100] -> [32, 10, 3] - converts samples to requested quantiles
# =============================================================================
# DISTRIBUTION LOSS - Analytical vs Empirical
# =============================================================================
# Analytical approach (Normal distribution with icdf)
normal_loss = NormalDistributionLoss(quantiles=[0.1, 0.5, 0.9])
y_pred_normal = torch.randn(32, 10, 4) # (batch, time, distribution_params)
quantiles_normal = normal_loss.to_quantiles(y_pred_normal)
# [32, 10, 4] -> [32, 10, 3] using distribution.icdf()
# =============================================================================
# POINT PREDICTION LOSSES - Specialized Cases
# =============================================================================
# MAE (inherits base behavior)
mae_loss = MAE()
y_pred_mae = torch.randn(32, 10)
quantiles_mae = mae_loss.to_quantiles(y_pred_mae, quantiles=[0.1, 0.5, 0.9])
# [32, 10] -> [32, 10, 3] via unsqueeze
# PoissonLoss (custom scipy implementation)
poisson_loss = PoissonLoss()
y_pred_poisson = {"prediction": torch.log(torch.tensor([[2.0, 3.0]]))}
quantiles_poisson = poisson_loss.to_quantiles(y_pred_poisson, quantiles=[0.1, 0.5, 0.9])
# Uses scipy.stats.poisson.ppf() for discrete count quantiles
# Result: tensor([[[0., 1., 3.], [1., 2., 5.]]])
# CrossEntropy (pass-through)
ce_loss = CrossEntropy()
y_pred_logits = torch.randn(32, 10, 5)
quantiles_ce = ce_loss.to_quantiles(y_pred_logits)
# [32, 10, 5] -> [32, 10, 5] unchanged
# =============================================================================
# PRACTICAL SCENARIOS
# =============================================================================
# Uncertainty quantification from point predictions
point_predictions = torch.randn(16, 24)
uncertainty_intervals = Metric().to_quantiles(point_predictions, quantiles=[0.05, 0.5, 0.95])
# Convert daily predictions to 90% confidence intervals
# Converting model samples to business quantiles
model_samples = torch.randn(8, 48, 200) # 200 MC samples
business_quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
converted = Metric().to_quantiles(model_samples, quantiles=business_quantiles)
# [8, 48, 200] -> [8, 48, 5]
# Count data with discrete quantiles for inventory planning
count_output = {"prediction": torch.log(torch.tensor([[5.0, 10.0]]))}
service_levels = [0.8, 0.9, 0.95, 0.99]
inventory_quantiles = PoissonLoss().to_quantiles(count_output, quantiles=service_levels)
# Discrete count values for different service levels
# Key Patterns:
# - Input: 2D/3D tensors or dicts (varies by loss type)
# - Output: Always 3D tensors (batch_size, timesteps, n_quantiles)
# - Methods: dimension manipulation, analytical icdf, or empirical sampling
# - Use cases: uncertainty quantification, risk assessment, service level planning
```
NOTE: on the model side, `BaseModel` implements `to_quantiles`, which uses the individual loss functions' `to_quantiles` to convert network outputs `out["prediction"]` to quantile predictions.
After the pre-processing of the various target input formats by `Metric` and its child classes, the loss functions receive a pure `torch.Tensor`, and the loss is calculated using the `loss` method of `Metric` and its child classes.
#### Point forecast - loss functions vignette
- All point forecast/regression loss functions take in `y_pred` and `y_actual`.
- **MASE**: Mean Absolute Scaled Error requires two additional parameters, `encoder_target` and `encoder_lengths`, to calculate the scaling factor. These parameters are passed into `update()` when updating the metric state (see the sketch after the example below).
- **Accepted input formats**
    - 2D tensors `(batch_size, timesteps)`
    - 3D tensors `(batch_size, timesteps, n_targets/n_quantiles)`.
NOTE: It is not good practice to pass quantiles as part of the input, but a call to `to_prediction` inside the internal implementation of `loss()` ensures a point prediction by taking the mean across the last dimension.
Example:
```python=
mae_loss = MAE(quantiles=[0.1, 0.5, 0.9])
target = data["target"] # 2D tensor with shape (batch_size, timesteps)
# during prediction/inference
# predictions: 3D tensor with shape (batch_size, timesteps, n_quantiles)
mae_loss.update(predictions, target) # predictions from model and ground truth "target"
output_loss = mae_loss.loss(predictions, target) # shape -> (batch_size, timesteps), after `to_prediction` takes the mean across the 3rd dimension
metric = mae_loss.compute() # single metric, torch.Size([])
```
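A hedged sketch for `MASE`, assuming `update()` accepts the extra `encoder_target` and `encoder_lengths` arguments described above:
```python=
import torch
from pytorch_forecasting.metrics import MASE

y_pred = torch.randn(2, 5)           # predictions over the horizon
y_actual = torch.randn(2, 5)         # ground truth over the horizon
encoder_target = torch.randn(2, 10)  # history used to compute the scaling factor
encoder_lengths = torch.full((2,), 10)

mase = MASE()
mase.update(
    y_pred,
    y_actual,
    encoder_target=encoder_target,
    encoder_lengths=encoder_lengths,
)
final_loss = mase.compute()  # scalar tensor, torch.Size([])
```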
- Usage vignette
```python=
import torch
from pytorch_forecasting.metrics import MAE, SMAPE, RMSE
# Sample data: batch_size=2, sequence_length=5
y_pred = torch.randn(2, 5) # Network predictions
y_actual = torch.randn(2, 5) # Ground truth targets
# Initialize loss function
mae_loss = MAE()
# Calculate loss
mae_loss.update(y_pred, y_actual)
final_loss = mae_loss.compute() # Returns: scalar tensor
print(f"Input shape: {y_pred.shape}") # torch.Size([2, 5])
print(f"Output loss: {final_loss}") # tensor(1.2345)
print(f"Loss shape: {final_loss.shape}") # torch.Size([])
```
#### Quantile forecast
- **Accepted input formats**
    - 3D tensor `(batch_size, timesteps, n_quantiles)`
- Usage vignette
```python=
import torch
from pytorch_forecasting.metrics import QuantileLoss
# Sample data with quantile predictions
batch_size, seq_len, n_quantiles = 2, 5, 7
y_pred = torch.randn(batch_size, seq_len, n_quantiles) # 7 quantiles per prediction
y_actual = torch.randn(batch_size, seq_len) # Single target value
# Initialize quantile loss (default: [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98])
quantile_loss = QuantileLoss()
# Calculate loss
quantile_loss.update(y_pred, y_actual)
final_loss = quantile_loss.compute()
print(f"Prediction shape: {y_pred.shape}") # torch.Size([2, 5, 7])
print(f"Target shape: {y_actual.shape}") # torch.Size([2, 5])
print(f"Output loss: {final_loss}") # tensor(2.3456)
# Convert to point prediction (uses 0.5 quantile)
point_pred = quantile_loss.to_prediction(y_pred)
print(f"Point prediction shape: {point_pred.shape}") # torch.Size([2, 5])
# Get all quantiles
quantiles = quantile_loss.to_quantiles(y_pred)
print(f"Quantiles shape: {quantiles.shape}") # torch.Size([2, 5, 7])
```
#### Distribution/probabilistic forecast
- **NormalDistributionLoss** / **NegativeBinomialDistributionLoss**:
    - 3D tensor `(batch_size, timesteps, distribution_params)`
- **MultivariateNormalDistributionLoss**
    - 3D tensor `(batch_size, timesteps, 2+rank)`; works only on a single target.
- **MQF2DistributionLoss**
- Only accepts a 3D tensor.
    - mandatory params:
        - `prediction_length`: explicit definition of the prediction length/horizon up to which the metric can compute the loss.
- **ImplicitQuantileNetworkDistributionLoss**
- 3D tensor
- Summary:
    - Distribution losses are a niche use-case.
    - 3D tensors work well with distribution losses.
- Usage vignette for outputs demo
```python=
import torch
from pytorch_forecasting.metrics import NormalDistributionLoss
from pytorch_forecasting.data.encoders import TorchNormalizer
# Sample data: predicting 2 parameters (mean, std) for Normal distribution
batch_size, seq_len, n_params = 2, 5, 2
y_pred = torch.randn(batch_size, seq_len, n_params) # [mean, log_std] parameters
y_actual = torch.randn(batch_size, seq_len) # Actual values
# Dummy target scale and encoder (replace with real ones in practice)
target_scale = torch.ones(batch_size, 2)
encoder = TorchNormalizer()
# Initialize distribution loss
normal_loss = NormalDistributionLoss()
# Rescale network outputs before computing the loss
y_pred_rescaled = normal_loss.rescale_parameters(
    parameters=y_pred, target_scale=target_scale, encoder=encoder
)
# Calculate negative log-likelihood loss
normal_loss.update(y_pred_rescaled, y_actual)
final_loss = normal_loss.compute()
print(f"Rescaled prediction shape: {y_pred_rescaled.shape}") # torch.Size([2, 5, 4])
print(f"Target shape: {y_actual.shape}") # torch.Size([2, 5])
print(f"Output loss: {final_loss}") # tensor(1.8765)
# Convert to point prediction (distribution mean)
point_pred = normal_loss.to_prediction(y_pred_rescaled)
print(f"Point prediction shape: {point_pred.shape}") # torch.Size([2, 5])
# Sample from distribution
samples = normal_loss.sample(y_pred_rescaled, 100)
print(f"Samples shape: {samples.shape}") # torch.Size([2, 5, 100])
```
NOTE: we do not need to worry about the `MultiLoss` input in this case, because it merely iterates through all the metrics provided to it, delegates a loss function to each target in the set of targets, and returns a scalar after aggregating the losses from all targets, i.e., it ensures that the total aggregate sum of all losses is minimised.
### Metric outputs for different cases
#### High level vignette for metric outputs and usage
```python=
# Internal loss() method returns unreduced losses
raw_losses = loss_fn.loss(predictions, targets)
# Shape: (batch_size, sequence_length) for most losses
# Shape: (batch_size, sequence_length, n_quantiles/distribution_params) for QuantileLoss/DistributionLoss
# Final compute() returns scalar after reduction, used for logging metrics
loss_fn.update(predictions, targets)
final_loss = loss_fn.compute() # Shape: [] (scalar)
# loss computation during backpropagation, this happens in lightning backend during training.
loss = loss_fn(prediction, target)
loss.backward()
```
#### High level vignette to demonstrate the v1 pipeline for setting up datasets before training with a loss function
```python=
data = pd.DataFrame({
'date': pd.date_range('2020-01-01', periods=1000, freq='D'),
'time_idx': range(1000),
'group_id': ['A'] * 500 + ['B'] * 500,
'revenue': np.random.randn(1000) * 100 + 1000,
'sales': np.random.randn(1000) * 50 + 500,
'volume': np.random.randn(1000) * 20 + 100,
'feature_1': np.random.randn(1000),
'feature_2': np.random.randn(1000),
})
single_target_dataset = TimeSeriesDataSet(
data[lambda x: x.date < training_cutoff],
target = "volume",
...
)
multi_target_dataset = TimeSeriesDataSet(
data[lambda x: x.date < training_cutoff],
target = ["revenue", "sales", "volume"],
...
)
trainer = pl.Trainer(...) # enter suitable params
```
This is the typical boilerplate for setting up a simple training pipeline. The above example is assumed as a precursor when demonstrating the usage of loss functions below.
#### Single Target
1) **Single Loss (point)**: 2D Tensor of shape `(batch_size, timesteps)`
```python=
val_dataset = TimeSeriesDataSet.from_dataset(single_target_dataset, data, predict=True)
val_dataloader = val_dataset.to_dataloader(train=False, batch_size=64)
mae_loss = MAE()
model = <Model>.from_dataset(
single_target_dataset,
...,
loss = mae_loss,
...
)
trainer.fit(model)
model.eval()
with torch.no_grad():
sample_batch = next(iter(val_dataloader))
y_pred = model(sample_batch)
output = y_pred["prediction"] # shape -> (batch_size, timesteps)
target = y_pred["target"]
loss_output = mae_loss.loss(output, target) # shape -> (batch_size, timesteps)
final_loss = mae_loss(output, target) # stateless loss computation, returns a single value.
```
2) **Single Quantile/Distribution Loss**: Tensor of shape `(batch_size, timesteps, distribution_params)`
- **Quantile Loss**: `distribution_params` is `n_quantiles`.
```python=
val_dataset = TimeSeriesDataSet.from_dataset(single_target_dataset, data, predict=True)
val_dataloader = val_dataset.to_dataloader(train=False, batch_size=64)
quantile_loss = QuantileLoss(quantiles=[0.1,0.5,0.9])
model = <Model>.from_dataset(
single_target_dataset,
...,
loss = quantile_loss,
...
)
trainer.fit(model)
model.eval()
with torch.no_grad():
sample_batch = next(iter(val_dataloader))
y_pred = model(sample_batch)
output = y_pred["prediction"] # shape -> (batch_size, prediction_length, 3) since len(quantiles)=3.
target = sample_batch["target"]
loss_output = quantile_loss.loss(output,target) # 3d tensor as above.
final_loss = quantile_loss(output, target) # single metric
```
- **Distribution Loss**: There are several kinds:
- **Normal**: `(batch_size, timesteps, 2)` where the params are loc and scale.
- **NegativeBinomialDistribution**: `(batch_size, timesteps, 2)`, where params are mean and shape
- **LogNormalDistribution**: `(batch_size, timesteps, 2)` where params are loc and scale
- **BetaDistributionLoss**: `(batch_size, timesteps, 2)`, where params are mean and shape
- **MultivariateNormalLoss**: `(batch_size, timesteps, 2+rank)` where params are loc, shape, cov_factor
    - In general, this is guaranteed to be `(batch_size, timesteps, distribution_params)`. These loss functions are specialised for models like `DeepAR` and are not very commonly used.
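For completeness, a hedged sketch mirroring the vignettes above, using `DeepAR` with `NormalDistributionLoss` (raw-mode prediction as in the DeepAR examples earlier):
```python=
val_dataset = TimeSeriesDataSet.from_dataset(single_target_dataset, data, predict=True)
val_dataloader = val_dataset.to_dataloader(train=False, batch_size=64)
normal_loss = NormalDistributionLoss()
model = DeepAR.from_dataset(
    single_target_dataset,
    ...,
    loss = normal_loss,
    ...
)
trainer.fit(model)
model.eval()
with torch.no_grad():
    raw_predictions = model.predict(val_dataloader, mode="raw", n_samples=None)
# raw output shape -> (batch_size, prediction_length, 4)
# 2 distribution_arguments (["loc", "scale"]) + 2 from target_scale
```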
3) **MultiLoss**: MultiLoss technically works with a single target, provided only a single loss function is specified inside `MultiLoss`. It will not work if multiple loss functions are given for a single target.
#### Multi targets
1) **MultiLoss**: In the case of multiple targets, there is a one-to-one correspondence between each loss function and the target at the same index.
    * Returns a list of tensors, where the `i`th tensor corresponds to the prediction for the `i`th target, optimised with the `i`th loss in the list used to initialise `MultiLoss`.
Example illustrating a more complicated case:
```python=
# Define three targets.
val_dataset = TimeSeriesDataSet.from_dataset(multi_target_dataset, data, predict=True)
val_dataloader = val_dataset.to_dataloader(train=False, batch_size=64)
multi_loss = MultiLoss([
MAE(),
QuantileLoss(quantiles=[0.1, 0.5, 0.9]),
SMAPE()
])
## <Model> can be any model compatible with Multiloss and the loss functions specified inside MultiLoss.
model = <Model>.from_dataset(
multi_target_dataset,
...,
loss = multi_loss,
...
)
trainer.fit(model, ...)
model.eval()
with torch.no_grad():
sample_batch = next(iter(val_dataloader))
y_pred = model(sample_batch)
output = y_pred["prediction"]
# output is a list of tensor.
# [
# torch.Tensor, -> shape (batch_size, prediction_length)
# torch.Tensor, -> shape (batch_size, prediction_length, 3)
# torch.Tensor -> shape (batch_size, prediction_length)
# ]
target = sample_batch["target"]
targets_for_multiloss = (
[target[:, :, i] for i in range(target.shape[-1])], # List of individual targets
target # Original target tensor for extracting masks, weights etc i.e y_actual
)
    final_loss = multi_loss(output, targets_for_multiloss) # always returns a single value, the aggregate sum of all individual losses.
```
Each loss function passed into `MultiLoss` expects the input format specified in the single-target vignettes above.
### Summary of the testing framework for metrics - PR #1907 and some observations
Refer to the description of PR [#1907](https://github.com/sktime/pytorch-forecasting/pull/1907). We test all metrics in an integration test. We skip:
- MASE (not tested)
- MQF2DistributionLoss (skips the integration test)
- PoissonLoss (skips the integration test)
Reasons for these are listed in the PR.
## Proposed changes (under draft, wip)
#### FK thoughts
* there is a "typical" case for predict and for metrics, but there are inconsistencies and exceptions
* tests are not systematic
* we do not fully understand the status quo even, it is frustrating to investigate it
* instinctively, I would prioritize fixing the testing situation on v1, and using that to gain information and check our understanding
* suggested next working items
* 1. add a "test all metrics" test - Pranav
* use the "should work for all metrics" vignettes as a test!
* loop over all metrics
* need to implement some logic to infer `loss_dims`
* for the test, can this not just be replaced by
* a dict loss -> loss_dim
* or, simply, an attribute added to metrics?
* 2. add a proper test for `predict` - Aryan
        * current `TestAllEstimators` already calls `predict`
* but not all models are covered by a v1 pkg (need to add models)
* but it does not test the expected output type and size of the tensor!
* this should be added
* possibly also with increased coverage for losses - step 2 of this
* try to see if we can vary losses per model
* expected `predict` format should be tested for `loss_dim` - same logic as in 1.
* idea: the same tests can then be used for v2 with minor modifications
        * any requirements that we are unaware of would already come up on v1
* for instance, the funny list-of-tensors output format
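A minimal sketch of working item 1, assuming a hand-maintained `LOSS_DIMS` mapping (the "dict loss -> loss_dim" option above; hypothetical, not in the code base). Distribution losses would additionally need `rescale_parameters()` before `update()`, see the metrics vignettes above:
```python=
import pytest
import torch
from pytorch_forecasting.metrics import MAE, RMSE, QuantileLoss

# Hypothetical hand-maintained mapping: metric class -> loss_dim.
LOSS_DIMS = {
    MAE: 1,
    RMSE: 1,
    QuantileLoss: 7,  # default quantile list has 7 entries
}

@pytest.mark.parametrize("metric_cls,loss_dim", list(LOSS_DIMS.items()))
def test_metric_update_compute(metric_cls, loss_dim):
    metric = metric_cls()
    y_pred = torch.randn(2, 5, loss_dim)
    y_actual = torch.randn(2, 5)
    metric.update(y_pred, y_actual)
    assert metric.compute().shape == torch.Size([])
```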
## Questions:
- AG.
1. *iii) MultiLoss: In case of a single target, we can use MultiLoss.*
    This must be explored better: what if I use quantile loss + RMSE? Should the model give 3 + 1 outputs? It is not clear what happens then in the `predict` method
* FK: in the "list of tensors" design, that would be list of two tensors?
* AG: ok, but from an architectural point of view, the model needs to produce 4 outputs (3 for quantile and 1 for RMSE)?
    2. There was a discussion about using custom losses; we need to figure out the base skeleton of those losses so that the user can easily plug in their own loss, which may depend on other columns of the initial dataset. Not urgent for sure, but the design of the general class should be as easy as possible
* FK: I think that will be "easy" once you have figured out the design for the current losses.
    4. The approach for dealing with multi-target multi-loss is not clear: is it possible with this design to have two targets, one with MSE and the other with MultiLoss [RMSE, Quantile]? Or is it something we do not support?
- PB.
    1. I have updated an example under the single target with multi loss section. It will give two tensors in a list, one for each loss.
        - AG: ok, I see; my question is: what should the model provide as output? Or better: should the model, in your example, return 4 values for each future lag? This can be done, but then it is difficult to weight the two losses; one of the two can be on a different scale and drive the training process.
    3. Yes, this is something we can discuss, but I am not sure from my side.
    4. Good question on this edge case. Will look into it.
        - AG: probably the output will be a list of lists of tensors; on one hand this is ok, on the other hand, if a user wants to plug in their own model, it will be a nightmare. Idea: the model should always return a 4D tensor, where the last dimension is computed by the base class, and the formatting of the output (4D tensor to list of lists) is done by the base model again. I, as a user, only need to be sure that the output is a 4D shape where the last dimension is given by the model layer
- AS
    - Do we really need a 4D output in models? I mean, we are using the list of tensors for `MultiLoss`, so why do we need the extra dim for `num_targets`? Whenever we need more than one target, we can simply use a list of tensors?
    - The main reason for my proposal is that it would make even the current implementation of metrics compatible (Pranav, please correct me if I am wrong), and we would not need to make changes to v1 models then?
- pb:
    - Could you maybe compile what the loss functions currently expect, in case a 4D tensor input requires too many changes.
    - AS: Well, we tried doing so in #1897, but debugging is really hard in metrics and models; also, I do not see any reason to keep the extra dim for `num_targets` if we are using a list of tensors
    - Ok, I get what you are suggesting, but I am not fully sure how efficient it would be to use a list of tensors for every case and restrict the tensor size to 3D for all cases, i.e., the list size will be the 4th dimension, right? I was assuming pure tensors are always better to use in pytorch than looping through targets whenever we have more than one?
    - I understand that, but changing the legacy code is really hard; maybe we should move this discussion to an issue so that others can also comment?
### FK structure suggestion
FK comment: can you please structure the document similar to an enhancement proposal?
Typical structure:
* introduction
* description of the aim
* background/context
* description of status quo
* use case vignettes
* list all important cases!
* isolated and integrated
* model output formats
* metrics
* proposed change
* high-level summary
* model output formats
* metrics
* implementation plan
* pull request list - planned
* testing
* downwards compatibility
### FK 1st review input
* good start, we can make a PR now to enable feedback and discussion
* this document is about a design for model output - so the status quo should actually review the functions or classes that produce outputs, in particular the *models*. Currently no code related to models is shown or reviewed!
* the "proposed change" should not just be a single vague paragraph that mentions a single type. It should describe precisely which methods or class signatures are changed, and to what.
* obviously the methods/classes that get changed should have been previously reviewed already in the "status quo" section, and referenced
* one example question: which methods in the models even produce "outputs" at the moment?
From a practical perspective: to avoid "diffusion of responsibilities" between Aryan and Pranav, please clearly assign who works on what with respect to this document, and do a mutual review at every increment of working on it.
### AG comments
* we need to understand - and review - how the information from the loss (required output format) is propagated to the model
* e.g., losses needing different input format
    * in the case of multiple losses on the same target: how does the model know how many outputs to predict?
* also important: how the loss choice impacts the model after the forward loop:
    * e.g., `LogNormalDistributionLoss` expects the second value to be positive; I think there is a manipulation phase in the loss function computation. If this is the case, how can we reflect this in the inference step?