This document describes the performance tests run against Ratify and their results.
Previous performance benchmarks: https://hackmd.io/@akashsinghal/rkEZqxxW5
We measured the time Ratify took to process a single external data (ED) request. This ED request had multiple unique subject images for each container, and each unique image had a variable number of signatures attached.
Image and signature creation and push, as well as deployment YAML generation, were done using the tools in this repository: https://github.com/anlandu/ratify-perf
Test Parameters:
For each test deployment, we restarted the Ratify pod to flush all cached signatures and auth credentials.
We applied the deployment YAML to the cluster directly.
100 images * 2 signatures = 200 artifacts verified
Audit time per pod = ~500ms
The initial ED request fails because the registry returns 429 (Too Many Requests) responses. The Deployment retries pod creation, which eventually succeeds after a few attempts. Ratify logs show fewer 429s on each retry until eventually none occur. This is most likely because all signatures are eventually cached, which brings the request rate down.
After adding backoff retry:
Initial request durations: 3297ms, 3155ms, 2958ms (average ~3.1 seconds)
Audit time per pod: 614ms, 669ms, 644ms, 552ms, 682ms (average ~0.6 seconds)
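The backoff retry mentioned above can be illustrated with a minimal sketch. This is not Ratify's actual implementation, which is not shown in this document; it is a generic exponential backoff with jitter. The `RateLimited` exception, the `fetch` callable, and all parameter defaults are hypothetical names and values chosen for illustration.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised when the registry answers with HTTP 429."""


def retry_with_backoff(fetch, max_attempts=5, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call fetch() until it succeeds or max_attempts is exhausted.

    After each RateLimited failure, wait for an exponentially growing,
    randomly jittered delay before trying again.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the 429 to the caller
            # Exponential growth: base, 2*base, 4*base, ... capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads out retries issued by concurrently handled subjects.
            sleep(delay * rng())
```

The jitter matters here because many subjects are verified concurrently; without it, all of them would retry at the same instants and hit the rate limit again together.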
100 images * 5 signatures = 500 artifacts verified
Audit time per pod > 1200ms (rough figure, not an accurate measurement)
Ratify never returns true. The logs follow a pattern similar to the test case above; however, the 429s never completely dissipate.
After adding backoff retry:
Initial request durations: 6137ms, 6289ms, 6159ms (average ~6.2 seconds)
Audit time per pod: 1846ms, 1914ms, 1996ms, 1879ms, 1997ms (average ~1.9 seconds)
The first ED request triggers the largest number of calls to the registry, since credentials and signatures are not yet cached. Each subject is handled concurrently, resulting in multiple simultaneous requests to the same registry host. ACR performs rate limiting per client IP and per registry host, and we are likely hitting both limits in these test cases. However, as some subjects complete validation, subsequent retries result in fewer 429s.
Why does 100img/2sigs eventually succeed and 100img/5sigs fail?
Each subject in an ED request triggers at least one call to the registry to fetch the referrers for that image. One theory is that the larger number of artifacts, plus the 100 referrers API calls made on each audit, always exceeds the ACR rate limit threshold and so stalls any caching progress. (This is not validated and is purely a guess at this point.)
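To make the theory concrete, the sketch below gives a rough lower bound on registry requests for an uncached ED request. It assumes one request per signature artifact plus one referrers call per image; real fetches also involve auth, manifest, and blob requests, so actual counts are higher. The function name and its parameters are hypothetical, and the numbers say nothing about where the ACR threshold actually lies.

```python
def estimated_uncached_requests(images, signatures_per_image,
                                referrers_calls_per_image=1):
    """Rough lower bound on registry requests when nothing is cached.

    Assumption: one request per signature artifact, plus at least one
    referrers call per image. Auth and blob requests are ignored.
    """
    artifacts = images * signatures_per_image
    referrers_calls = images * referrers_calls_per_image
    return referrers_calls + artifacts


for sigs in (2, 5):
    total = estimated_uncached_requests(images=100, signatures_per_image=sigs)
    print(f"100 images * {sigs} signatures: at least {total} requests")
```

Under these assumptions the 5-signature case issues at least twice as many requests as the 2-signature case (600 versus 300), which is consistent with, but does not prove, the guess above.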
10k pods
1 image * 150 signatures = 150 artifacts verified
Audit time per pod = ~400ms
10k pods * 400ms = 4,000 seconds, or about 1hr 7min per audit cycle (Gatekeeper applies the constraint serially per pod)
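The audit cycle estimate is simple arithmetic, sketched below under the stated assumption that Gatekeeper evaluates pods strictly one after another and that every pod takes the same time. The function name `audit_cycle_duration` is illustrative only.

```python
from datetime import timedelta


def audit_cycle_duration(pods, audit_ms_per_pod):
    """Total audit time when pods are evaluated serially at a fixed cost each."""
    return timedelta(milliseconds=pods * audit_ms_per_pod)


print(audit_cycle_duration(10_000, 400))  # 1:06:40
```

This works out to 4,000 seconds, consistent with the rough figure above, and the total scales linearly with both the pod count and the per-pod audit time.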
cc: @AnlanDu for more information
The main bottleneck for Ratify is handling multiple concurrent requests to the same registry. We start seeing 429s if many simultaneous requests are made to the same registry from the same client.
We tested various large-scale pod scenarios:
BOLD = incoming/outgoing http request operation(s) to ACR