Refine Retries for TaskRuns and CustomRuns

---
status: implementable
title: Refine Retries for TaskRuns and CustomRuns
creation-date: '2022-09-08'
last-updated: '2022-11-14'
authors:
- '@XinruZhang'
- '@jerop'
- '@pritidesai'
- '@lbernick'
see-also:
- TEP-0069
---

# TEP-0121: Refine Retries for TaskRuns and CustomRuns

- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
  - [Use Cases](#use-cases)
- [Related Work](#related-work)
- [Design Details](#design-details)
  - [Timeout per Retry](#timeout-per-retry)
  - [Retries in TaskRuns and CustomRuns](#retries-in-taskruns-and-customruns)
  - [Conditions.Succeeded](#conditionssucceeded)
  - [RetriesStatus](#retriesstatus)
- [Alternatives](#alternatives)
  - [1. Implement retries in PipelineRun](#1-implement-retries-in-pipelinerun)
  - [2. Implement retries in TaskRun/Run](#2-implement-retries-in-taskrunrun-use-retryattempts-instead-of-retriesstatus)
  - [3. Conditions.RetrySucceeded](#3-conditionsretrysucceeded)
- [References](#references)

## Summary

This TEP proposes to clearly define the behavior of `Retries`:

- Task-level `Timeout` is for each **retry attempt** for both `TaskRun` and `CustomRun`.
- `TaskRun` reconciler implements the `Retries` logic.
- Pipeline Controller MUST ONLY use `Condition.Succeeded` to determine the termination status of a `TaskRun`/`CustomRun`.
- Keep `retriesStatus` in both `TaskRun` and `CustomRun` (though optional) to contain details of the intermediate retries.

## Motivation

Two distinct shortcomings of `Retries` drove this TEP:

- `Retries` on `Timeout` is designed inconsistently between TaskRun and CustomRun.
  - For CustomRun, [the document](https://github.com/tektoncd/community/blob/33ca1d5254a405b1d479f2350443f6c7979a0b72/teps/0069-support-retries-for-custom-task-in-a-pipeline.md#proposal) instructs developers to **set `Timeout` for all retry attempts**, while in the actual implementation it is **set for each retry attempt**. See the [ref](https://github.com/tektoncd/pipeline/issues/5582).
  - For a TaskRun created from a PipelineTask, the `Timeout` is **set for each retry attempt**.
  - For a standalone TaskRun, `Retries` is not implemented.
- Both `PipelineRun` and `TaskRun`|`CustomRun` reconcilers are partially responsible for implementing `Retries` as of today. See https://github.com/tektoncd/pipeline/issues/5248.

### Goals

1. `Timeout` must be set for **each retry attempt** in the four runtime objects (independent `TaskRun`, `TaskRun` part of a `Pipeline`, independent `CustomRun`, `CustomRun` part of a `Pipeline`) that support `Retries`, including the case of no `Timeout` (`Timeout` set to 0).
2. The `TaskRun` reconciler, which is part of the Tekton Pipeline Controller, implements `Retries` for two runtime objects (independent `TaskRun` and `TaskRun` part of a `Pipeline`).

### Non-Goals

1. Define `Retries` behavior for PipelineRuns.
2. The collective timeout for `tasks`, the collective timeout for `finally` tasks, and the `timeout` at the `pipeline` level do not change.

### Use Cases

#### Retry when Timeout

**The current behavior**: say we have a `Pipeline`:

```yaml
spec:
  tasks:
    - name: task-run-example
      taskRef:
        name: task-run-example
      retries: 1
      timeout: "10s"
    - name: custom-run-example
      taskRef:
        apiVersion: example.dev/v1alpha1
        kind: Example
      retries: 1
      timeout: "10s"
```

`TaskRun` `task-run-example` and `CustomRun` `custom-run-example` created out of the Pipeline behave differently:

- `task-run-example` will be **retried** once after 10s.
- `custom-run-example` will **fail on timeout** after 10s if Custom Task authors follow the documentation. But if Custom Task authors apply the `Timeout` **to *each* attempt** (contrary to the documented behavior of one timeout **for all attempts**), then `custom-run-example` is retried once after 10s, working similarly to `task-run-example`.
#### Retry TaskRun Independently

As a standalone runtime object, TaskRuns can be used independently (outside of a PipelineRun) in production environments. Here are several use cases:

- https://github.com/tektoncd/catalog/tree/main/task/send-to-webhook-slack/0.1, which is used in [Tekton CI](https://github.com/tektoncd/plumbing/blob/5c0e8e0e7ac9ceadc14d9a4d8f6957de31b4fca2/tekton/resources/cd/notification-template.yaml)
- https://github.com/tektoncd/catalog/tree/main/task/sendmail/0.1
- Tekton CD: [cleanup runs](https://github.com/tektoncd/plumbing/blob/b5c568cbc794bd4be10b0c09498bc7dcc3d7bb01/tekton/resources/cd/cleanup-template.yaml#L74).

Transient errors are common, especially in cloud environments: services can be down for a short period of time, causing the entire TaskRun to fail. https://learn.microsoft.com/en-us/azure/architecture/best-practices/transient-faults#why-do-transient-faults-occur-in-the-cloud explains how common transient errors are in the cloud. With retries supported, users are able to write robust TaskRuns for such use cases.

## Related Work

In this section, we compare the general retry strategies in the CI/CD industry, in particular **whether they retry on timeout** (where CustomRun and TaskRun deviate), so that we can decide whether the timeout should apply to all retry attempts or to each individual retry in both `CustomRun` and `TaskRun`.

Typically, a retry strategy includes:

1. When to retry
2. The number of attempts
3. Actions to take after a failed attempt
4. Timeout of each attempt
5. Retry until a certain condition is met

| | [Retry Action in GA](https://github.com/marketplace/actions/retry-action) | [GitLab Job](https://docs.gitlab.com/ee/ci/yaml/#retry) | [Ansible Task](https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html#retrying-a-task-until-a-condition-is-met) | [Concourse Step](https://concourse-ci.org/attempts-step.html#attempts-step) |
|:---|:---|:---|:---|:---|
| **When to Retry** | on failure | configurable | [always retry, conditional stop](https://github.com/ansible/ansible/pull/76101) [^ansible-conditional-stop] | configurable |
| **Attempts amount** | supported | supported | supported | supported |
| **Timeout for each attempt** | supported | [supported](https://docs.gitlab.com/ee/ci/yaml/#retrywhen) | supported | supported |
| **Timeout for all attempts** | [supported](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes) | - | - | - |

Several observations regarding the feature table above:

- We can configure **timeout duration per attempt** in all CI systems that support the `retry` functionality.
- GitHub Actions doesn't support retry natively, but because of the flexibility of **customized actions**, some users write their own `retry` action to make it work, and those customized actions even support what to do before retrying a failed attempt.
- Concourse mentions that the reason it [retries per attempt is somewhat arbitrary](https://concourse-ci.org/attempts-step.html#attempts-step).

## Design Details

### Timeout per Retry

Task-level Timeout (`TaskRunSpec.Timeout` and `RunSpec.Timeout`) is set for each `Retry` attempt. The same strategy applies to the `timeout` specified as part of the `pipelineTask` in a `pipeline`.

### Retries in TaskRuns and CustomRuns

Add a new `Retries` field to [TaskRunSpec](https://pkg.go.dev/github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1#TaskRunSpec). Model the `CustomRunSpec` based on the `RunSpec` to use the existing `retries` field.

The `PipelineTask.Retries` value, which is specified at `Pipeline` authoring time, is passed to `TaskRunSpec.Retries` and `CustomRunSpec.Retries` during the execution of a `PipelineRun`.

The `TaskRun` and `CustomRun` controllers handle their own `Retries`. The `PipelineRun` controller does not check `len(retriesStatus)` to determine whether a `TaskRun` or `CustomRun` is done executing. Instead, it uses `ConditionSucceeded` as the only way to decide whether the `TaskRun` or `CustomRun` has completed execution.

Before this change, the `pipelineRun` controller created a `taskRun` for any `pipelineTask` and rescheduled the same `pipelineTask` if it had failed but not exhausted all the `retries`. It was implemented this way because the `taskRun` reconciler marked that particular `taskRun` as `failed`. With this change, the `pipelineRun` controller still schedules a `pipelineTask` and creates a `taskRun`, but the `taskRun` reconciler will not mark a `taskRun` as failed until all the `retries` are exhausted. This way, the `pipelineRun` controller no longer needs to check for any additional clause other than `ConditionSucceeded` set to `failed`.

### Conditions.Succeeded

The `TaskRun` and `CustomRun` controllers MUST set `Conditions.Succeeded` to `False` only upon eventual failure of the `TaskRun` or `CustomRun`, when all the Retries have been exhausted. We will implement this behavior for `TaskRuns` and clearly document this requirement for `CustomRuns`.

This is a change to the meaning of `Conditions.Succeeded` for `TaskRuns`, so this change is a blocker for the V1 **software** release.

### RetriesStatus

Keep `RetriesStatus` for both `CustomRun` and `TaskRun` to hold information about `intermediate` retries, so that users are able to know the current status of the runtime objects: how many retries have been executed so far, and the result and logs of each retry.

Note that this field is optional. Custom Task implementers are free to implement `Retries` as they see fit.
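The division of responsibility described in this section can be sketched in Go. This is a minimal illustration under stated assumptions, not the Tekton implementation: the `Run` and `Status` types are simplified stand-ins for the real API types, `onAttemptFailed` and `isDone` are hypothetical helpers, and the `Retrying` reason is a placeholder. It shows a run controller keeping `Succeeded` at `Unknown` while retries remain and archiving each failed attempt in `RetriesStatus`, so that a pipeline-level check only ever needs to look at the condition.

```go
package main

import "fmt"

// ConditionStatus mirrors the three Kubernetes condition states.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// Status is a simplified stand-in for TaskRunStatus / CustomRunStatus.
type Status struct {
	Succeeded     ConditionStatus
	Reason        string
	RetriesStatus []Status // one entry per failed intermediate attempt
}

// Run is a simplified stand-in for a TaskRun or CustomRun.
type Run struct {
	Retries int // maximum number of retries, from the spec
	Status  Status
}

// onAttemptFailed is a hypothetical helper for the run's own controller.
// Succeeded becomes False only once all retries are exhausted.
func onAttemptFailed(r *Run, reason string) {
	if len(r.Status.RetriesStatus) < r.Retries {
		// Archive the failed attempt and start over with a fresh condition.
		failed := Status{Succeeded: ConditionFalse, Reason: reason}
		r.Status.RetriesStatus = append(r.Status.RetriesStatus, failed)
		r.Status.Succeeded = ConditionUnknown
		r.Status.Reason = "Retrying" // placeholder reason for illustration
		return
	}
	r.Status.Succeeded = ConditionFalse
	r.Status.Reason = reason
}

// isDone is what a pipeline-level controller would check: only the
// condition, never len(RetriesStatus).
func isDone(r *Run) bool {
	return r.Status.Succeeded != ConditionUnknown
}

func main() {
	r := &Run{Retries: 1, Status: Status{Succeeded: ConditionUnknown}}
	onAttemptFailed(r, "TimedOut")
	fmt.Println(isDone(r), len(r.Status.RetriesStatus))
	onAttemptFailed(r, "TimedOut")
	fmt.Println(isDone(r), r.Status.Succeeded)
}
```

Because the retry bookkeeping lives entirely in the run's controller, `RetriesStatus` can stay optional: a Custom Task controller could track attempts differently as long as it honors the `Succeeded` contract.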
## Future Work

## Alternatives

### 1. Implement `retries` in PipelineRun

No matter how we implement the retry functionality, we propose to set `Timeout` for each retry attempt. This is proposed based on the existing behavior and the investigation of other CI/CD systems, see [related work](#related-work).

- Make `retries` a `PipelineRun` concern
- Remove `retries` from `CustomRun` spec
- Move logic for `retries` to PipelineRun reconciler and create new `TaskRun`s and `Runs` at each attempt.
- Remove `retriesStatus` from TaskRun & CustomRun

**Benefits:**

- Consistent interface for `retries`
- Custom task controller developers get a default implementation of retries for free (by embedding in a pipeline)
- "Pipelines in pipeline" can be retried the same as the other resources
- Improve the retries of TaskRuns created from PipelineTasks by using separate TaskRuns for each retry
- No changes to the PipelineRun API (not in the spec at least)
- No changes to the TaskRun API (not in the spec at least)

**Concerns:**

- API Change for `Run` and `CustomRun` (need to remove `retries` & `retriesStatus`)
  - We are moving Custom Task Run from alpha (Run) to beta (CustomRun) (see [TEP-0114](https://github.com/tektoncd/community/blob/main/teps/0114-custom-tasks-beta.md)), which is a good time for us to remove fields from `Run`.
- Dashboard and CLI may need extra work if we remove `retriesStatus`
- Standalone `TaskRun` can't retry on its own.
- It's not quite user-friendly if a CustomRun controller implements its own retry strategy, for example:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: pr-custom-task-
spec:
  pipelineSpec:
    tasks:
      - name: wait
        timeout: "1s"
        retries: 1 # The common retries field in the PipelineTask
        taskSpec:
          specialized-retries: 5 # Specialized retries field in Custom Task Spec.
          other-spec-fields: foobar
```

The custom task users would be confused about which retries field to use in order to retry a Run.

### 2. Implement `retries` in TaskRun/Run, use `retryAttempts` instead of `retriesStatus`

#### Two API Changes

1. New `Retries` field in `TaskRunSpec`

```golang
type TaskRunSpec struct {
	// Retries represents how many times this task should be retried in case of task failure: ConditionSucceeded set to False
	// +optional
	Retries int
}
```

2. New `RetryAttempts` field in `TaskRunStatus`

```golang
type TaskRunStatusFields struct {
	// RetryAttempts records the names of TaskRuns which are created for retry
	// +optional
	RetryAttempts []string
}
```

#### Two New Labels

Label `tekton.dev/retry-count: <retry number>` is attached to every TaskRun. For a TaskRun that's not a retry, the `retry number` will be set to `0`. We'll use this label to decide the value of [`context.task.retry-count`](https://github.com/tektoncd/pipeline/blob/main/docs/variables.md) (instead of using [`len(tr.Status.RetriesStatus)`](https://github.com/tektoncd/pipeline/blob/07bf4702e6d6b35bdff40ed760cf3280b74c4375/pkg/reconciler/taskrun/resources/apply.go#L168) as in the current implementation).

Label `tekton.dev/retry-parent: <parent taskrun name>` is attached to each retry TaskRun.

#### How the `Retries` Works

Say we submit the following TaskRun:

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr
  labels:
    tekton.dev/retry-count: "0"
spec:
  timeout: 1s
  retries: 1
  ...
status:
  conditions:
  - status: True
    reason: Unknown
  retryAttempts:
```

After 1 second elapses, the TaskRun reconciler needs to retry the TaskRun `tr`:

- Create a new TaskRun `tr-attempt-1`
- Attach the following labels to the new TaskRun
  - `tekton.dev/retry-count: 1`
  - `tekton.dev/retry-parent: tr`
- Add the new TaskRun name to `status.retryAttempts` of its parent TaskRun.
- Update the Reason of the Condition to `Retrying`, keep Status as True.

Now we have two TaskRuns:

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr
  labels:
    tekton.dev/retry-count: "0"
spec:
  timeout: 1s
  retries: 1
  ...
status:
  conditions:
  - status: True
    reason: Retrying
  retryAttempts:
  - tr-attempt-1
---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr-attempt-1
  labels:
    tekton.dev/retry-count: "1"
    tekton.dev/retry-parent: tr
spec:
  timeout: 1s
  retries: 1
  ...
status:
  conditions:
  - status: True
    reason: Unknown
  retryAttempts:
```

After another second elapses, `tr-attempt-1` times out. In the reconciliation loop of `tr-attempt-1`, the reconciler sees that the value of `tekton.dev/retry-count` equals `Spec.Retries`, so it updates the Condition of `tr-attempt-1` to `Status=False, Reason=TimedOut`. Then, in the reconciliation loop of `tr`, the reconciler sees that the last attempt in `retryAttempts` is `tr-attempt-1` and that it has already failed with TimedOut, so it updates the condition of `tr` to `Status=False, Reason=TimedOut`.

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr
  labels:
    tekton.dev/retry-count: "0"
spec:
  timeout: 1s
  retries: 1
  ...
status:
  conditions:
  - status: False
    reason: TimedOut
  retryAttempts:
  - tr-attempt-1
---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr-attempt-1
  labels:
    tekton.dev/retry-count: "1"
    tekton.dev/retry-parent: tr
spec:
  timeout: 1s
  retries: 1
  ...
status:
  conditions:
  - status: False
    reason: TimedOut
  retryAttempts:
```

The relationship between the original TaskRun and the TaskRuns created for retry is:

```
          originalTaskRun
         /               \
taskRun-attempt-1 ... taskRun-attempt-n
```

### 3. `Conditions.RetrySucceeded`

For TaskRuns, introduce a new ConditionType `Conditions.RetrySucceeded` to report intermediate status and to send events for failed attempts (instead of using `RetriesStatus` to keep everything managed in one `TaskRun` object):

`status`| Description
:-------|:----------
True | Retry succeeded
False | Retry failed
Unknown | Running a retry attempt

In this way, we are able to easily show the status as follows:

```shell
> tkn tr list
NAME       STARTED          DURATION   STATUS    RETRYSTATUS (new status field)
tr-587rp   30 minutes ago   5s         Failed    RetryFailed
tr-xyzcs   1 minutes ago    ---        Running   Retrying
tr-ffbjg   4 seconds ago    ---        Running   RetryFailed
```

Implementors of Custom Tasks can choose to implement this approach.

Note that although we could use `retriesStatus` to achieve the same goal, using a `ConditionType` is more appropriate for reporting status.

## References

- [TEP-0002: Custom Tasks](https://github.com/tektoncd/community/blob/main/teps/0002-custom-tasks.md)
- [TEP-0069: Custom Tasks Retries](https://github.com/tektoncd/community/blob/main/teps/0069-support-retries-for-custom-task-in-a-pipeline.md)
- [TEP-0100: Slim down PipelineRunStatus](https://github.com/tektoncd/community/blob/main/teps/0100-embedded-taskruns-and-runs-status-in-pipelineruns.md)
- [Issue #5248: Decouple Retries implementation between TaskRun reconciler and PipelineRun reconciler](https://github.com/tektoncd/pipeline/issues/5248)
- [PR #5393: Clarify the behavior of CustomRun retries](https://github.com/tektoncd/pipeline/pull/5393)

[^ansible-conditional-stop]: https://github.com/ansible/ansible/pull/76101
[^retry-strategy]: https://docs.microsoft.com/en-us/azure/architecture/best-practices/transient-faults#challenges
[^transient-errors]: https://learn.microsoft.com/en-us/azure/architecture/best-practices/transient-faults#why-do-transient-faults-occur-in-the-cloud
