
KEP-NNNN: Modular Cloud Controller Manager Testing

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

To support the migration of in-tree cloud controllers to out-of-tree, Kubernetes
should have a common suite of tests that can be exercised with the external cloud
controller managers to ensure the expected behavior and aid in finding
regressions. Testing cloud controllers requires that infrastructure-specific
operations be performed during the tests to confirm proper execution. To ensure that
testing of external cloud controller managers meets the expectations of the Kubernetes
community, a pattern for allowing dynamic injection of infrastructure-specific
functionality should be created. This enhancement describes an architecture for building
a modular cloud controller manager test interface and workflow for use by cloud
provider implementors in Kubernetes.
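
As an illustration of the injection pattern described above, the sketch below shows one possible shape for a shared package and its registration hook. Every identifier here (the external package, Provider, Register, GetProvider, InstanceExists) is a hypothetical placeholder for discussion, not a settled API.

```go
// Hypothetical sketch of the injection point described above; none of these
// names are settled API.
package external

import (
	"context"
	"fmt"
)

// Provider is the infrastructure-specific behavior a cloud provider supplies
// so the shared e2e tests can verify cloud-side state.
type Provider interface {
	// Name identifies the provider implementation, e.g. for test logging.
	Name() string
	// InstanceExists reports whether the cloud still has an instance backing
	// the named node.
	InstanceExists(ctx context.Context, nodeName string) (bool, error)
}

// registered is the provider injected by the cloud provider's test binary.
var registered Provider

// Register is called once from a provider's test setup, before the shared
// tests run, to inject its implementation.
func Register(p Provider) {
	registered = p
}

// GetProvider returns the injected provider, or an error if none was registered.
func GetProvider() (Provider, error) {
	if registered == nil {
		return nil, fmt.Errorf("no external cloud provider registered for e2e tests")
	}
	return registered, nil
}
```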

Motivation

  • the in-tree cloud providers are being removed, which means their tests must be removed as well
  • these behaviors were previously only tested on AWS, GCE, and GKE; we want to test on more cloud providers
  • testing is limited in functionality due to necessary constraints
  • we want to establish a workflow that can include more cloud controller managers (CCMs)

Goals

  • define an interface for infrastructure-specific actions in the tests
  • describe a workflow for integrating tests in k/k with a cloud-provider repo as part of continuous integration (see the sketch after this list)
  • copy the current tests in test/e2e/cloud, and some in test/e2e/network (TBD), to a new cloud provider package in test/e2e/cloud/external, utilizing the new interface for infrastructure-specific actions
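
To make the integration workflow in the second goal concrete, the sketch below shows how a cloud-provider repository might register its implementation before running the shared suite from its own CI. The package path, the AcmeProvider type, and the commented-out import are assumptions; the registration call is commented out only because the shared package does not exist yet.

```go
// Hypothetical e2e_test.go in a cloud provider repository. Everything here
// (package layout, AcmeProvider, the commented import) is illustrative only.
package e2e_test

import (
	"context"
	"os"
	"testing"
	// external "k8s.io/kubernetes/test/e2e/cloud/external" // hypothetical shared package
)

// AcmeProvider is a stand-in implementation of the shared test interface.
type AcmeProvider struct{}

func (AcmeProvider) Name() string { return "acme" }

// InstanceExists would query the provider's cloud API; here it is stubbed.
func (AcmeProvider) InstanceExists(ctx context.Context, nodeName string) (bool, error) {
	return true, nil
}

// TestMain registers the provider before the shared tests execute, so that
// tests copied from k/k can call back into provider-specific code.
func TestMain(m *testing.M) {
	// external.Register(AcmeProvider{}) // hypothetical injection call
	os.Exit(m.Run())
}
```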

Non-Goals

  • removal of current tests
  • implementation of interface for specific cloud providers
  • addition of new tests (TBD)

Proposal

User Stories (Optional)

Story 1

Story 2

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Design Details

elmiko notes:
things we need to do

  1. enumerate all the current tests that live in k/k test/e2e/cloud
  • Nodes should be deleted on API server if it doesn't exist in the cloud provider
  • Master upgrade should maintain a functioning cluster
  • Cluster upgrade should maintain a functioning cluster
  • Downgrade should maintain a functioning cluster
  • Reboot each node by ordering clean reboot and ensure they function upon restart
  • Ensuring the same number of pods are running and ready after restart
  2. enumerate possible service tests that live in k/k test/e2e/network
  • LoadBalancers should be able to change the type and ports of TCP/UDP service
    When a Service is changed from ClusterIP or NodePort to LoadBalancer, CCM provisions a cloud LoadBalancer.
    If ports change, CCM updates the LoadBalancer’s forwarding rules.
    If service type changes back to ClusterIP or NodePort, CCM ensures the LoadBalancer is deleted.
  • LoadBalancers should have session affinity work for LoadBalancer service with Local traffic policy
    It interacts with cloud APIs to enable session affinity (e.g., AWS ELB's stickiness policy or GCP's SessionAffinity).
  • LoadBalancers should handle load balancer cleanup finalizer for service
    CCM is responsible for ensuring that before the service is deleted, the cloud LoadBalancer and associated resources (IP, firewall rules, etc.) are properly cleaned up.
  • LoadBalancers should be able to create LoadBalancer Service without NodePort and change it
    (Need to figure out)
  • LoadBalancers should not have connectivity disruption during rolling update with externalTrafficPolicy=Cluster
    It updates LoadBalancer backend configurations dynamically as Pods are replaced.
  • LoadBalancers should not have connectivity disruption during rolling update with externalTrafficPolicy=Local
    During rolling updates, CCM updates the backend nodes dynamically to prevent traffic blackholes.
  • LoadBalancers ExternalTrafficPolicy: Local should work for type=LoadBalancer
    Can be covered in earlier cases
  • LoadBalancers ExternalTrafficPolicy: Local should target all nodes with endpoints
    To test it maintains backend target pool
  • LoadBalancers ExternalTrafficPolicy: Local should work from pods
    To confirm CCM internal pod-to-pod traffic follows the correct LoadBalancer rules.
  • Services should be possible to connect to a service via ExternalIP when the external IP is not assigned to a node
    If this IP isn’t assigned to any node, CCM ensures traffic is forwarded to the correct backend nodes.
  • Services should be able to update service type to NodePort listening on same port number but different protocol
    For example, switching from TCP to UDP on the same port may require CCM to reconfigure the LoadBalancer.
  • Services should be able to change the type from ExternalName to ClusterIP
    (Not sure about this)
  • Services should be able to change the type from ExternalName to NodePort
    (Not sure about this)
  • Services should be able to change the type from ClusterIP to ExternalName
    (Not sure about this)
  • Services should be able to change the type from NodePort to ExternalName
    (Not sure about this)
  • Services should support externalTrafficPolicy=Local for type=NodePort
    It updates the LoadBalancer to target only nodes with Pods.
  • Services should fallback to terminating endpoints when there are no ready endpoints with internalTrafficPolicy=Cluster
    This prevents downtime during rolling updates.
  • Services should fallback to local terminating endpoints when there are no ready endpoints with internalTrafficPolicy=Local
    CCM ensures the LoadBalancer only sends traffic to terminating endpoints if no other healthy Pods exist.
  • Services should release NodePorts on delete
    NodePorts (if allocated) are also cleaned up as part of this process.
  • Service should delete collection of services
    If multiple services are deleted together, CCM ensures that all corresponding LoadBalancer resources (external IPs, forwarding rules, firewall rules, etc.) are cleaned up.
  3. distill an API for the test interface from the identified tests (a first sketch follows below)
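
As a hedged starting point for item 3, the interfaces below group the infrastructure-specific checks that the tests enumerated above appear to need. Method names, signatures, and the split between node and load balancer concerns are all assumptions to be revisited when the API is actually distilled from the tests.

```go
// Hypothetical interfaces distilled from the tests above; nothing here is a
// settled API.
package external

import "context"

// NodeInfrastructure covers the node lifecycle tests from test/e2e/cloud:
// deleting and rebooting the cloud instances that back cluster nodes.
type NodeInfrastructure interface {
	// InstanceExists reports whether the cloud still has an instance backing the node.
	InstanceExists(ctx context.Context, nodeName string) (bool, error)
	// DeleteInstance removes the instance backing the node, so a test can
	// verify the node object is removed from the API server.
	DeleteInstance(ctx context.Context, nodeName string) error
	// RebootInstance orders a clean reboot of the instance backing the node.
	RebootInstance(ctx context.Context, nodeName string) error
}

// LoadBalancerInfrastructure covers the service tests from test/e2e/network:
// confirming that the CCM created, updated, or cleaned up cloud load balancer
// resources for a Service.
type LoadBalancerInfrastructure interface {
	// LoadBalancerExists reports whether a cloud load balancer exists for the
	// Service identified by namespace and name.
	LoadBalancerExists(ctx context.Context, namespace, name string) (bool, error)
	// LoadBalancerPorts returns the ports currently configured on the cloud
	// load balancer, so tests can verify port and protocol changes propagated.
	LoadBalancerPorts(ctx context.Context, namespace, name string) ([]int32, error)
	// LoadBalancerBackends returns the node names currently in the load
	// balancer's target pool, for the externalTrafficPolicy tests.
	LoadBalancerBackends(ctx context.Context, namespace, name string) ([]string, error)
}
```

A provider would implement whichever of these interfaces the tests it opts into require; the node methods map to the test/e2e/cloud disruption tests and the load balancer methods map to the test/e2e/network service tests listed above.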

Test Plan

[ ] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

Prerequisite testing updates
Unit tests
  • <package>: <date> - <test coverage>
Integration tests
  • <test>: <link to test coverage>
e2e tests
  • <test>: <link to test coverage>

Graduation Criteria

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name:
    • Components depending on the feature gate:
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control
      plane?
    • Will enabling / disabling the feature require downtime or reprovisioning
      of a node?
Does enabling the feature change any default behavior?
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
What happens if we reenable the feature if it was previously rolled back?
Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?
Will enabling / using this feature result in introducing new API types?
Will enabling / using this feature result in any new calls to the cloud provider?
Will enabling / using this feature result in increasing size or count of the existing API objects?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Infrastructure Needed (Optional)