KEP-20200921: Subsecond Probe Timeouts

# KEP-20200921: Subsecond Probe Timeouts

Probe timeouts are limited to seconds and that does NOT work well for clients looking for finer and coarser grained timeouts.

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

The Probe struct contains int32 fields that specify seconds for timeouts. Some users would like to have timeouts less than one second.

## Motivation

#### Istio

TODO(???): explain this scenario, presumably around sidecar readiness?

#### Knative

Knative will create Pods (via Deployment) and wait for them to become `Ready` when handling an HTTP request from an end-user. In this case, Pod readiness latency has a direct impact on HTTP request latency. (In the steady state, Knative will re-use an existing Pod, but this situation can happen on scale-up, and is guaranteed to happen on scale-from-zero.)

### Goals

- An ability to specify timeouts that are less than one second.
- Add additional test cases to the timeout test cases. ? link to existing test cases ?

### Non-Goals

- V2 API for existing objects.
- Converting fields from int32 to resource.Quantity.
- Subsecond resolution finer than one millisecond.
## Proposal - Plan-A

Add a new int32 field, `timeoutMilliseconds`, to the existing Probe struct as a counterpart to `timeoutSeconds`. The int32 data type is used for consistency with the existing fields. The maximum int32 value, 2147483647 milliseconds, corresponds to roughly 24.8 days.

If the Milliseconds variant of a field is set, *use it in preference* to the existing Seconds field, and **completely ignore** the value of the existing field. This makes the behavior opt-in: it applies only when a non-zero Milliseconds field is set.

## Proposal - Plan-B

Add a new field: `ReadSecondsAs string` ("seconds" || "milliseconds"), where the default "" translates to "seconds".

If using "milliseconds", the minimums would need to be no less than 100 milliseconds, and defaults will be set to the existing values in seconds.

### User Stories (Optional)

#### Knative

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

- Changing defaults is a strict no-go.
- Accidentally setting a timeout too low could DoS the kubelet if many such probes are in use. Mitigate by rejecting timeout values that are too small. The minimum could be configurable; 10-100 milliseconds is a first guess.
- UX reviewed by existing users of the Probe struct. ? add details on who that is?

#### Overriding an existing field

How does overriding a field affect the users of the existing field?

## Design Details

Potentially proposed changes as implemented code: https://github.com/kubernetes/kubernetes/compare/master...MHBauer:probe-timeouts?expand=1

### Existing Struct

https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go

https://github.com/kubernetes/kubernetes/blob/9983a521149a0de02a052658a9d3665ff7b27708/pkg/apis/core/types.go#L2010-L2029

```
// Probe describes a health check to be performed against a container to determine whether it is
// alive or ready to receive traffic.
type Probe struct {
	// The action taken to determine the health of a container
	Handler
	// Length of time before health checking is activated.  In seconds.
	// +optional
	InitialDelaySeconds int32
	// Length of time before health checking times out.  In seconds.
	// +optional
	TimeoutSeconds int32
	// How often (in seconds) to perform the probe.
	// +optional
	PeriodSeconds int32
	// Minimum consecutive successes for the probe to be considered successful after having failed.
	// Must be 1 for liveness and startup.
	// +optional
	SuccessThreshold int32
	// Minimum consecutive failures for the probe to be considered failed after having succeeded.
	// +optional
	FailureThreshold int32
}
```

### Existing Behavior

Taking the Seconds fields in the order they appear in the struct:

- InitialDelaySeconds
- TimeoutSeconds
- PeriodSeconds

#### Defaulting logic

- InitialDelaySeconds has no defaulting logic, so it gets the Go zero value and defaults to 0. There is no particular reason to have an initial delay that is millisecond based. Timing is not a good way to do sequencing. Whole-second values (1 second, 2 seconds, etc.) seem sufficient here.

- TimeoutSeconds defaults to 1 second if unset (which includes being explicitly set to 0). This is the longest a probe attempt can run before it is treated as failed, so it bounds how slowly a failure is detected (slow failures).

  https://github.com/kubernetes/kubernetes/blob/e19964183377d0ec2052d1f1fa930c4d7575bd50/pkg/apis/core/v1/defaults.go#L224-L226

  ```
  if obj.TimeoutSeconds == 0 {
  	obj.TimeoutSeconds = 1
  }
  ```

- PeriodSeconds is the biggest hurdle, but also the most useful. PeriodSeconds defaults to 10 seconds if unset (which includes being explicitly set to 0). It governs how quickly a success can be observed (fast successes).

  https://github.com/kubernetes/kubernetes/blob/e19964183377d0ec2052d1f1fa930c4d7575bd50/pkg/apis/core/v1/defaults.go#L227-L229

  ```
  if obj.PeriodSeconds == 0 {
  	obj.PeriodSeconds = 10
  }
  ```

#### Validation of fields

Each value must be non-negative (zero or greater).

```
allErrs = append(allErrs, ValidateNonnegativeField(int64(probe.InitialDelaySeconds), fldPath.Child("initialDelaySeconds"))...)
allErrs = append(allErrs, ValidateNonnegativeField(int64(probe.TimeoutSeconds), fldPath.Child("timeoutSeconds"))...)
allErrs = append(allErrs, ValidateNonnegativeField(int64(probe.PeriodSeconds), fldPath.Child("periodSeconds"))...)
```

#### Fields to Add

What fields may be necessary to add?

```
// Length of time before health checking is activated.  In milliseconds.
// +optional
InitialDelayMilliseconds int32
// Length of time before health checking times out.  In milliseconds.
// +optional
TimeoutMilliseconds int32
// How often (in milliseconds) to perform the probe.
// +optional
PeriodMilliseconds int32
```

#### Logic for Added Fields

What is the least logic that could be used?

- Setting a value for PeriodMilliseconds completely overrides the value of PeriodSeconds.
- Setting a value for TimeoutMilliseconds completely overrides the value of TimeoutSeconds.
- Setting a value for InitialDelayMilliseconds completely overrides the value of InitialDelaySeconds.

In isolation, each of the probes can independently be on the scale of either seconds or milliseconds. If a Milliseconds field is set, the corresponding Seconds field is completely ignored.

### Existing use of Probe struct fields.

#### InitialDelaySeconds

https://github.com/kubernetes/kubernetes/blob/e19964183377d0ec2052d1f1fa930c4d7575bd50/pkg/kubelet/prober/worker.go#L225-L228

```
// Probe disabled for InitialDelaySeconds.
if int32(time.Since(c.State.Running.StartedAt.Time).Seconds()) < w.spec.InitialDelaySeconds {
	return true
}
```

#### TimeoutSeconds

https://github.com/kubernetes/kubernetes/blob/e19964183377d0ec2052d1f1fa930c4d7575bd50/pkg/kubelet/prober/prober.go#L156-L201

```
timeout := time.Duration(p.TimeoutSeconds) * time.Second
```

#### PeriodSeconds

https://github.com/kubernetes/kubernetes/blob/e19964183377d0ec2052d1f1fa930c4d7575bd50/pkg/kubelet/prober/worker.go#L127-L160

```
// run periodically probes the container.
func (w *worker) run() {
	probeTickerPeriod := time.Duration(w.spec.PeriodSeconds) * time.Second // XX

	// If kubelet restarted the probes could be started in rapid succession.
	// Let the worker wait for a random portion of tickerPeriod before probing.
	time.Sleep(time.Duration(rand.Float64() * float64(probeTickerPeriod)))

	probeTicker := time.NewTicker(probeTickerPeriod)

	defer func() {
		// Clean up.
		probeTicker.Stop()
		if !w.containerID.IsEmpty() {
			w.resultsManager.Remove(w.containerID)
		}

		w.probeManager.removeWorker(w.pod.UID, w.container.Name, w.probeType)
		ProberResults.Delete(w.proberResultsSuccessfulMetricLabels)
		ProberResults.Delete(w.proberResultsFailedMetricLabels)
		ProberResults.Delete(w.proberResultsUnknownMetricLabels)
	}()

probeLoop:
	for w.doProbe() {
		// Wait for next probe tick.
		select {
		case <-w.stopCh:
			break probeLoop
		case <-probeTicker.C:
			// continue
		}
	}
}
```

### Summary

Depending on the importance of the various Probe settings, it may be best to focus on one field. Probe.Period looks to be the most effective one to focus on.

- Probe.Period describes the 'repeat-rate' for how often a probe will run.
- Probe.Timeout describes the cutoff after which a probe attempt stops.
- Probe.InitialDelay describes how long to wait before starting, but can be set to zero.

### Test Plan

- Existing unit tests of the prober: `k8s.io/kubernetes/pkg/kubelet/prober/prober_manager_test.go`.
- Existing node-e2e test: `k8s.io/kubernetes/test/e2e/common/container_probe.go`.

Both would be enhanced with additional test cases.

### Graduation Criteria

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

_This section must be completed when targeting alpha to a release._

* **How can this feature be enabled / disabled in a live cluster?**
  - [ ] Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name:
    - Components depending on the feature gate:
  - [ ] Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control plane?
    - Will enabling / disabling the feature require downtime or reprovisioning of a node?
      (Do not assume `Dynamic Kubelet Config` feature is enabled).

* **Does enabling the feature change any default behavior?**
  Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?**
  Also set `disable-supported` to `true` or `false` in `kep.yaml`. Describe the consequences on existing workloads (e.g., if this is a runtime feature, can it break the existing applications?).

* **What happens if we reenable the feature if it was previously rolled back?**

* **Are there any tests for feature enablement/disablement?**
  The e2e framework does not currently support enabling or disabling feature gates. However, unit tests in each component dealing with managing data, created with and without the feature, are necessary. At the very least, think about conversion tests if API types are being modified.

### Rollout, Upgrade and Rollback Planning

_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
  Try to be as paranoid as possible - e.g., what if some components will restart mid-rollout?

* **What specific metrics should inform a rollback?**

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
  Describe manual testing that was done and the outcomes. Longer term, we may want to require automated upgrade/rollback tests, but we are missing a bunch of machinery and tooling and can't do that now.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?**
  Even if applying deprecation policies, they may still surprise some users.

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
  Ideally, this should be a metric.
  Operations against the Kubernetes API (e.g., checking if there are objects with field X set) may be a last resort. Avoid logs or events for this purpose.

* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
  - [ ] Metrics
    - Metric name:
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - [ ] Other (treat as last resort)
    - Details:

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
  At a high level, this usually will be in the form of "high percentile of SLI per day <= X". It's impossible to provide comprehensive guidance, but at the very high level (needs more precise definitions) those may be things like:
  - per-day percentage of API calls finishing with 5XX errors <= 1%
  - 99% percentile over day of absolute value from (job creation time minus expected job creation time) for cron job <= 10%
  - 99.9% of /health requests per day finish with 200 code

* **Are there any missing metrics that would be useful to have to improve observability of this feature?**
  Describe the metrics themselves and the reasons why they weren't added (e.g., cost, implementation difficulties, etc.).

### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**
  Think about both cluster-level services (e.g. metrics-server) as well as node-level agents (e.g. specific version of CRI). Focus on external or optional services that are needed. For example, if this feature depends on a cloud provider API, or upon an external software-defined storage or network control plane.

  For each of these, fill in the following, thinking about running existing user workloads and creating new ones, as well as about cluster-level services (e.g. DNS):
  - [Dependency name]
    - Usage description:
      - Impact of its outage on the feature:
      - Impact of its degraded performance or high-error rates on the feature:

### Scalability

_For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them._

_For beta, this section is required: reviewers must answer these questions._

_For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field._

* **Will enabling / using this feature result in any new API calls?**
  Describe them, providing:
  - API call type (e.g. PATCH pods)
  - estimated throughput
  - originating component(s) (e.g. Kubelet, Feature-X-controller)

  focusing mostly on:
  - components listing and/or watching resources they didn't before
  - API calls that may be triggered by changes of some Kubernetes resources (e.g. update of object X triggers new updates of object Y)
  - periodic API calls to reconcile state (e.g. periodic fetching state, heartbeats, leader election, etc.)

* **Will enabling / using this feature result in introducing new API types?**
  Describe them, providing:
  - API type
  - Supported number of objects per cluster
  - Supported number of objects per namespace (for namespace-scoped objects)

* **Will enabling / using this feature result in any new calls to the cloud provider?**

* **Will enabling / using this feature result in increasing size or count of the existing API objects?**
  Describe them, providing:
  - API type(s):
  - Estimated increase in size: (e.g., new annotation of size 32B)
  - Estimated amount of new objects: (e.g., new Object X for every existing Pod)

* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?**
  Think about adding additional work or introducing new steps in between (e.g. need to do X to start a container), etc. Please describe the details.

* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?**
  Things to keep in mind include: additional in-memory state, additional non-trivial computations, excessive access to disks (including increased log volume), significant amount of data sent and/or received over network, etc. Think through this in both small and large cases, again with respect to the [supported limits].

### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may consider splitting it into a dedicated `Playbook` document (potentially with some monitoring details). For now, we leave it here.

_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**

* **What are other known failure modes?**
  For each of them, fill in the following information by copying the below template:
  - [Failure mode brief description]
    - Detection: How can it be detected via metrics? Stated another way: how can an operator troubleshoot without logging into a master or worker node?
    - Mitigations: What can be done to stop the bleeding, especially for already running user workloads?
    - Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? Not required until feature graduated to beta.
    - Testing: Are there any tests for failure mode? If not, describe why.

* **What steps should be taken if SLOs are not being met to determine the problem?**

[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History

## Drawbacks

What do preexisting readers of the field do? They are not aware that they should ignore the seconds and use millis.
To address this, we should set the whole-second probe value to the minimum allowed when the milliseconds field is set.

## Alternatives

### Combine old and new fields

Allow a 0-second Probe value when the milliseconds field is set.

Cons: Loosens validation.

### v2 api for probe.

I think this means a v2 API for Container, and therefore for Pod. This seems too invasive.

All Seconds fields become resource.Quantity instead of int32. This supports subdivision in a single field.

### OffsetMilliseconds

Use a negative offset, and combine it with the existing field.

Example:

- Existing field: int32 PeriodSeconds
- New field: int32 PeriodOffsetMilliseconds

If I want to set the period to 0.5 seconds (500 milliseconds), any of the following works:

- PeriodSeconds = 10 and PeriodOffsetMilliseconds = -9500
- PeriodSeconds = 1 and PeriodOffsetMilliseconds = -500
- PeriodSeconds unset (default of 10) and PeriodOffsetMilliseconds = -9500

Detail: Same number of added fields.

Pros: Uses the existing field, so readers of only the existing field will still be acting on something that has been logically set. Without the compensation of the negative offset, the outcome doesn't make much sense, because a decision is being made on half of the information. Still, there is a logical explanation for the resulting behavior, rather than allowing something to be set and then throwing it out.

Cons: Complicated logic. Multiple ways to get the same resulting time. Changing behavior in the future involves touching more code than other solutions.

### Reconcile seconds field to nearest whole second.

Minimum of 1 second remains.

Extra logic in the defaulter uses the milliseconds field to automatically set the seconds field, allowing those using the seconds field to get something close enough.

This doesn't make much sense for solving rapidity of probes, only for increasing the granularity, such as if I wanted to run

Pros:

Cons:

## Infrastructure Needed (Optional)

Usual infrastructure depending on the complexity of the test cases needed.
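
As an illustration of the Plan-A override rule from the proposal (a non-zero Milliseconds field wins and the Seconds field is ignored), here is a minimal Go sketch. It is not the kubelet implementation: `probeTiming` and `effectiveTimeout` are hypothetical names used only for this example, and `TimeoutMilliseconds` is the proposed field, which does not exist in the current API. It assumes that zero means "unset", matching the existing defaulting logic.

```go
package main

import (
	"fmt"
	"time"
)

// probeTiming is a hypothetical, trimmed-down stand-in for the Probe struct.
// TimeoutMilliseconds is the proposed field; zero is treated as "unset".
type probeTiming struct {
	TimeoutSeconds      int32
	TimeoutMilliseconds int32
}

// effectiveTimeout applies the Plan-A rule: if the Milliseconds field is set,
// use it and completely ignore the Seconds field; otherwise fall back to the
// Seconds field.
func effectiveTimeout(p probeTiming) time.Duration {
	if p.TimeoutMilliseconds != 0 {
		return time.Duration(p.TimeoutMilliseconds) * time.Millisecond
	}
	return time.Duration(p.TimeoutSeconds) * time.Second
}

func main() {
	// Only the Seconds field is set: existing behavior is unchanged.
	fmt.Println(effectiveTimeout(probeTiming{TimeoutSeconds: 1}))
	// Both fields are set: the Milliseconds field takes precedence.
	fmt.Println(effectiveTimeout(probeTiming{TimeoutSeconds: 1, TimeoutMilliseconds: 250}))
}
```

The same pattern would apply to the Period and InitialDelay fields. Validation of a minimum value, as discussed under Risks and Mitigations, is deliberately left out of this sketch.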
