
KEP-NNNN: Modular Cloud Controller Manager Testing

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

To support the migration of in-tree cloud controllers to out-of-tree, Kubernetes
should have a common suite of tests that can be exercised with the external cloud
controller managers to ensure the expected behavior and aid in finding
regressions. Testing cloud controllers requires that infrastructure-specific
operations be performed during the tests to confirm proper execution. To ensure that
testing of external cloud controller managers meets the expectations of the Kubernetes
community, a pattern for allowing dynamic injection of infrastructure-specific
functionality should be created. This enhancement describes an architecture for building
a modular cloud controller manager test interface and workflow for use by cloud
provider implementors in Kubernetes.
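
As an illustration of the injection pattern described above, the sketch below shows one possible shape for a shared package and its registration hook. Every identifier here (the external package, Provider, Register, GetProvider, InstanceExists) is a hypothetical placeholder for discussion, not a settled API.

```go
// Hypothetical sketch of the injection point described above; none of these
// names are settled API.
package external

import (
	"context"
	"fmt"
)

// Provider is the infrastructure-specific behavior a cloud provider supplies
// so the shared e2e tests can verify cloud-side state.
type Provider interface {
	// Name identifies the provider implementation, e.g. for test logging.
	Name() string
	// InstanceExists reports whether the cloud still has an instance backing
	// the named node.
	InstanceExists(ctx context.Context, nodeName string) (bool, error)
}

// registered is the provider injected by the cloud provider's test binary.
var registered Provider

// Register is called once from a provider's test setup, before the shared
// tests run, to inject its implementation.
func Register(p Provider) {
	registered = p
}

// GetProvider returns the injected provider, or an error if none was registered.
func GetProvider() (Provider, error) {
	if registered == nil {
		return nil, fmt.Errorf("no external cloud provider registered for e2e tests")
	}
	return registered, nil
}
```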

Motivation

  • the in-tree cloud providers are being removed, which means their tests must be removed as well
  • these behaviors were previously only tested on AWS, GCE, and GKE; we want to test on more cloud providers
  • testing is limited in functionality due to necessary constraints
  • we want to establish a workflow that can include more cloud controller managers (CCMs)

Goals

  • define an interface for infrastructure-specific actions in the tests
  • describe a workflow for integrating tests in k/k with a cloud-provider repo as part of continuous integration (see the sketch after this list)
  • copy the current tests in test/e2e/cloud, and some in test/e2e/network (TBD), to a new cloud provider package in test/e2e/cloud/external, utilizing the new interface for infrastructure-specific actions
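
To make the integration workflow in the second goal concrete, the sketch below shows how a cloud-provider repository might register its implementation before running the shared suite from its own CI. The package path, the AcmeProvider type, and the commented-out import are assumptions; the registration call is commented out only because the shared package does not exist yet.

```go
// Hypothetical e2e_test.go in a cloud provider repository. Everything here
// (package layout, AcmeProvider, the commented import) is illustrative only.
package e2e_test

import (
	"context"
	"os"
	"testing"
	// external "k8s.io/kubernetes/test/e2e/cloud/external" // hypothetical shared package
)

// AcmeProvider is a stand-in implementation of the shared test interface.
type AcmeProvider struct{}

func (AcmeProvider) Name() string { return "acme" }

// InstanceExists would query the provider's cloud API; here it is stubbed.
func (AcmeProvider) InstanceExists(ctx context.Context, nodeName string) (bool, error) {
	return true, nil
}

// TestMain registers the provider before the shared tests execute, so that
// tests copied from k/k can call back into provider-specific code.
func TestMain(m *testing.M) {
	// external.Register(AcmeProvider{}) // hypothetical injection call
	os.Exit(m.Run())
}
```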

Non-Goals

  • removal of current tests
  • implementation of interface for specific cloud providers
  • addition of new tests (TBD)

Proposal

User Stories (Optional)

Story 1

Story 2

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Design Details

elmiko notes:
things we need to do

  1. enumerate all the current tests that live in k/k test/e2e/cloud
  • Nodes should be deleted on API server if it doesn't exist in the cloud provider
  • Master upgrade should maintain a functioning cluster
  • Cluster upgrade should maintain a functioning cluster
  • Downgrade should maintain a functioning cluster
  • Reboot each node by ordering clean reboot and ensure they function upon restart
  • Ensuring the same number of pods are running and ready after restart
  2. enumerate possible service tests that live in k/k test/e2e/network
  • LoadBalancers should be able to change the type and ports of TCP/UDP service
    When a Service is changed from ClusterIP or NodePort to LoadBalancer, CCM provisions a cloud LoadBalancer.
    If ports change, CCM updates the LoadBalancer’s forwarding rules.
    If service type changes back to ClusterIP or NodePort, CCM ensures the LoadBalancer is deleted.
  • LoadBalancers should have session affinity work for LoadBalancer service with Local traffic policy
    It interacts with cloud APIs to enable session affinity (e.g., AWS ELB's stickiness policy or GCP's SessionAffinity).
  • LoadBalancers should handle load balancer cleanup finalizer for service
    CCM is responsible for ensuring that before the service is deleted, the cloud LoadBalancer and associated resources (IP, firewall rules, etc.) are properly cleaned up.
  • LoadBalancers should be able to create LoadBalancer Service without NodePort and change it
    (Need to figure out)
  • LoadBalancers should not have connectivity disruption during rolling update with externalTrafficPolicy=Cluster
    It updates LoadBalancer backend configurations dynamically as Pods are replaced.
  • LoadBalancers should not have connectivity disruption during rolling update with externalTrafficPolicy=Local
    During rolling updates, CCM updates the backend nodes dynamically to prevent traffic blackholes.
  • LoadBalancers ExternalTrafficPolicy: Local should work for type=LoadBalancer
    Can be covered in earlier cases
  • LoadBalancers ExternalTrafficPolicy: Local should target all nodes with endpoints
    To test it maintains backend target pool
  • LoadBalancers ExternalTrafficPolicy: Local should work from pods
    To confirm CCM internal pod-to-pod traffic follows the correct LoadBalancer rules.
  • Services should be possible to connect to a service via ExternalIP when the external IP is not assigned to a node
    If this IP isn’t assigned to any node, CCM ensures traffic is forwarded to the correct backend nodes.
  • Services should be able to update service type to NodePort listening on same port number but different protocol
    For example, switching from TCP to UDP on the same port may require CCM to reconfigure the LoadBalancer.
  • Services should be able to change the type from ExternalName to ClusterIP
    (Not sure about this)
  • Services should be able to change the type from ExternalName to NodePort
    (Not sure about this)
  • Services should be able to change the type from ClusterIP to ExternalName
    (Not sure about this)
  • Services should be able to change the type from NodePort to ExternalName
    (Not sure about this)
  • Services should support externalTrafficPolicy=Local for type=NodePort
    It updates the LoadBalancer to target only nodes with Pods.
  • Services should fallback to terminating endpoints when there are no ready endpoints with internalTrafficPolicy=Cluster
    This prevents downtime during rolling updates.
  • Services should fallback to local terminating endpoints when there are no ready endpoints with internalTrafficPolicy=Local
    CCM ensures the LoadBalancer only sends traffic to terminating endpoints if no other healthy Pods exist.
  • Services should release NodePorts on delete
    NodePorts (if allocated) are also cleaned up as part of this process.
  • Service should delete collection of services
    If multiple services are deleted together, CCM ensures that all corresponding LoadBalancer resources (external IPs, forwarding rules, firewall rules, etc.) are cleaned up.
  3. distill an API for the test interface from the identified tests (a first sketch follows below)
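
As a hedged starting point for item 3, the interfaces below group the infrastructure-specific checks that the tests enumerated above appear to need. Method names, signatures, and the split between node and load balancer concerns are all assumptions to be revisited when the API is actually distilled from the tests.

```go
// Hypothetical interfaces distilled from the tests above; nothing here is a
// settled API.
package external

import "context"

// NodeInfrastructure covers the node lifecycle tests from test/e2e/cloud:
// deleting and rebooting the cloud instances that back cluster nodes.
type NodeInfrastructure interface {
	// InstanceExists reports whether the cloud still has an instance backing the node.
	InstanceExists(ctx context.Context, nodeName string) (bool, error)
	// DeleteInstance removes the instance backing the node, so a test can
	// verify the node object is removed from the API server.
	DeleteInstance(ctx context.Context, nodeName string) error
	// RebootInstance orders a clean reboot of the instance backing the node.
	RebootInstance(ctx context.Context, nodeName string) error
}

// LoadBalancerInfrastructure covers the service tests from test/e2e/network:
// confirming that the CCM created, updated, or cleaned up cloud load balancer
// resources for a Service.
type LoadBalancerInfrastructure interface {
	// LoadBalancerExists reports whether a cloud load balancer exists for the
	// Service identified by namespace and name.
	LoadBalancerExists(ctx context.Context, namespace, name string) (bool, error)
	// LoadBalancerPorts returns the ports currently configured on the cloud
	// load balancer, so tests can verify port and protocol changes propagated.
	LoadBalancerPorts(ctx context.Context, namespace, name string) ([]int32, error)
	// LoadBalancerBackends returns the node names currently in the load
	// balancer's target pool, for the externalTrafficPolicy tests.
	LoadBalancerBackends(ctx context.Context, namespace, name string) ([]string, error)
}
```

A provider would implement whichever of these interfaces the tests it opts into require; the node methods map to the test/e2e/cloud disruption tests and the load balancer methods map to the test/e2e/network service tests listed above.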

Test Plan

[ ] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

Prerequisite testing updates
Unit tests
  • <package>: <date> - <test coverage>
Integration tests
  • <test>: <link to test coverage>
e2e tests
  • <test>: <link to test coverage>

Graduation Criteria

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name:
    • Components depending on the feature gate:
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control
      plane?
    • Will enabling / disabling the feature require downtime or reprovisioning
      of a node?
Does enabling the feature change any default behavior?
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
What happens if we reenable the feature if it was previously rolled back?
Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?
Will enabling / using this feature result in introducing new API types?
Will enabling / using this feature result in any new calls to the cloud provider?
Will enabling / using this feature result in increasing size or count of the existing API objects?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Infrastructure Needed (Optional)