Ratify Performance Limits

This document discusses the performance tests run against Ratify and their results.

Previous performance benchmarks: https://hackmd.io/@akashsinghal/rkEZqxxW5

Pod Level Analysis

We measured the time it took for Ratify to process a single external data (ED) request. The ED request contained multiple unique subject images, one per container, and each unique image had a variable number of signatures attached.

Image and signature creation/push, as well as deployment YAML generation, were done using tools in this repository: https://github.com/anlandu/ratify-perf

Test Parameters:

  • Workload Identity enabled AKS cluster
  • Private ACR (Premium SKU) with a managed identity (MI) granted the AcrPull role
  • Ratify deployed with GateKeeper 3.9+
  • Notation v2 verifier configured
  • Ratify Workload Identity Auth Provider
  • Constraint on pod resource creation

For each test deployment, we restarted the Ratify pod to flush all cached signatures and auth credentials, then applied the deployment YAML directly to the cluster.

100 containers/2 Signatures

100 images * 2 signatures = 200 artifacts verified
Audit time per pod = ~500ms

The initial ED request fails due to 429s returned from the registry. The Deployment retries pod creation, which eventually succeeds after a few attempts. Ratify logs show fewer and fewer 429s until eventually none appear. This is most likely due to the eventual caching of all signatures bringing the request rate down.

After adding back-off retry

Initial request timing durations: 3297ms, 3155ms, 2958ms = ~3.1 seconds
Audit time per pod = 614ms, 669ms, 644ms, 552ms, 682ms = ~0.6 seconds
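
The back-off retry added for these runs was roughly of the shape sketched below. This is a minimal illustrative sketch, not Ratify's actual implementation: the name withBackoff, the retry count, and the delay values are assumptions for the example, and op stands in for any single registry call (resolve, referrers, manifest, blob).

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errRateLimited stands in for "the registry responded with HTTP 429".
var errRateLimited = errors.New("registry returned 429")

// withBackoff retries op with exponential back-off plus jitter while the
// registry keeps returning 429s; any other outcome is returned immediately.
func withBackoff(op func() error, maxRetries int) error {
	delay := 100 * time.Millisecond
	for attempt := 0; attempt < maxRetries; attempt++ {
		err := op()
		if err == nil || !errors.Is(err, errRateLimited) {
			return err // success, or a non-retryable error
		}
		// Sleep with jitter, then double the delay before the next attempt.
		jitter := time.Duration(rand.Int63n(int64(delay)))
		time.Sleep(delay + jitter)
		delay *= 2
	}
	return fmt.Errorf("still rate limited after %d attempts", maxRetries)
}

func main() {
	calls := 0
	// Fake registry call: rate limited twice, then succeeds.
	op := func() error {
		calls++
		if calls <= 2 {
			return errRateLimited
		}
		return nil
	}
	fmt.Println(withBackoff(op, 5)) // prints <nil>
}
```

Jitter is included in the sketch so that concurrent goroutines that were rate limited at the same moment do not all retry at the same instant.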

100 containers/5 Signatures

100 images * 5 signatures = 500 artifacts verified
Audit time per pod > 1200ms (not really accurate)

Ratify never returns true. The logs follow a similar pattern to the above test case; however, the 429s never completely dissipate.

After adding back-off retry

Initial request timing durations: 6137ms, 6289ms, 6159ms = ~6.2 seconds
Audit time per pod = 1846ms, 1914ms, 1996ms, 1879ms, 1997ms = ~1.9 seconds

Analysis

The first ED request triggers the largest number of calls to the registry since credentials and signatures are not yet cached. Each subject is handled concurrently, resulting in multiple simultaneous requests to the same registry host. ACR performs rate limiting both per client IP and per registry host, and these are likely the limits we are hitting in these test cases. However, as some subjects complete validation, subsequent retries result in fewer 429s.
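
As a rough illustration of that fan-out, the sketch below launches one goroutine per subject image with no cap on concurrency, which is why an ED request with 100 subjects can generate 100+ simultaneous requests against the same registry host. The names verifySubject and verifyAll are hypothetical simplifications, not Ratify's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// verifySubject is a hypothetical stand-in for the full verification path of
// one subject image: resolve the descriptor, list referrers, fetch and verify
// signatures. Each of those steps is one or more HTTP calls to the registry.
func verifySubject(subject string) error {
	return nil
}

// verifyAll handles every subject in the ED request in its own goroutine,
// with nothing limiting how many registry calls are in flight at once.
func verifyAll(subjects []string) map[string]error {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]error, len(subjects))
	)
	for _, s := range subjects {
		wg.Add(1)
		go func(subject string) {
			defer wg.Done()
			err := verifySubject(subject)
			mu.Lock()
			results[subject] = err
			mu.Unlock()
		}(s)
	}
	wg.Wait()
	return results
}

func main() {
	subjects := []string{"myregistry.azurecr.io/img-1:v1", "myregistry.azurecr.io/img-2:v1"}
	fmt.Println(verifyAll(subjects))
}
```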

Why does 100img/2sigs eventually succeed while 100img/5sigs fails?
Each ED request triggers at least one registry call per image to fetch its referrers. One theory is that the larger number of artifacts, plus the 100 referrer API calls made on every audit, always exceeds the ACR rate-limit threshold, stalling any caching progress (this is not validated and purely a guess at this point).

Large Cluster Analysis

Audit Interval testing

10k pods

1 image * 150 signatures = 150 artifacts verified

Audit time per pod = ~400ms
10k pods * 400ms = ~ 1hr 10min per audit cycle (Gatekeeper applies constraint serially per pod)

cc: @AnlanDu for more information

Admission Testing

The main bottleneck for Ratify is handling multiple concurrent requests to the same registry. We start seeing 429s if many simultaneous requests are made to the same registry from the same client.

We tested various large scale pod scenarios:

  • Deployment with 10k replicas:
    • It seems like the Deployment rollout does not send all 10k pod requests at once. The requests that reach Ratify are staggered, allowing Ratify to successfully verify all pods. After the initial operation, all subsequent verification operations on replicas take ~150-200ms to complete.
    • As a control, we tested on a cluster without Ratify installed. The deployment took ~8min for all pods to be admitted. (pending/starting status)
    • On a cluster with Ratify installed, the deployment took ~13min for all pods to be admitted.
  • Multiple deployments at once:
    • How would we test this? Applying pod specs manually seems to be a serial operation.

ACR Operations

  1. Gatekeeper sends a request to verify an image
  2. Verify Handler processes this request and creates a new goroutine per subject image
    • Executor calls ORAS to resolve the subject descriptor
    • ORAS calls Auth Provider's Provide function to get credentials for registry
      • NOTE: Send request for new AAD token if token has expired
    • Provide sends request to ACR to exchange AAD token
    • ACR sends refresh token
    • Create new ORAS registry client using credentials from Provide
    • If subject descriptor not cached, ORAS sends request to ACR to resolve subject descriptor
      • ACR sends resolved subject descriptor to ORAS
    • ORAS returns to Executor resolved subject descriptor
    • Executor calls ORAS to return referrers to subject
    • ORAS sends ACR request(s) to return referrers to subject
    • ACR sends back referrers to ORAS
    • ORAS returns referrers to Executor
    • Executor creates a new goroutine for each referrer returned
      • Notary calls ORAS to get the reference manifest
      • If not cached, ORAS sends request(s) to ACR to get the reference manifest
        • ACR sends the matching reference manifest to ORAS
      • ORAS returns the reference manifest to Notary
      • Notary calls ORAS to get blob content
      • If not cached, ORAS sends request to ACR for blob content
        • ACR sends blob content to ORAS
      • ORAS returns blob content to Notary
      • Notary returns blob content verification result to Executor
  3. Executor returns blob content verification result to Verify Handler
  4. Handler sends verification payload to Gatekeeper

Note: the steps above that involve sending a request to or receiving a response from ACR are incoming/outgoing HTTP request operations against the registry.
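
To make the sequence above easier to follow, here is a condensed sketch of the same flow in code, with comments marking where ACR round trips and cache checks happen. It is illustrative only; every type and function name (descriptor, getCredentials, resolveSubject, listReferrers, verifyReferrer) is a hypothetical simplification rather than Ratify's or ORAS's real API.

```go
package main

import (
	"fmt"
	"sync"
)

type descriptor struct{ digest string }

// In-memory caches keyed by subject/digest; these are what let repeated
// audits avoid most ACR round trips once a first verification completes.
var (
	descCache sync.Map // subject reference -> descriptor
	blobCache sync.Map // digest            -> blob bytes
)

// getCredentials models the auth provider: exchange an AAD token for an ACR
// refresh token (an HTTP round trip when the cached token has expired).
func getCredentials(registry string) string {
	return "refresh-token"
}

// resolveSubject models resolving the subject descriptor, an HTTP call to ACR
// on a cache miss.
func resolveSubject(subject, cred string) descriptor {
	if d, ok := descCache.Load(subject); ok {
		return d.(descriptor)
	}
	d := descriptor{digest: "sha256:subject"}
	descCache.Store(subject, d)
	return d
}

// listReferrers models the referrers API call(s) to ACR, made on every
// ED request regardless of what is already cached.
func listReferrers(d descriptor, cred string) []descriptor {
	return []descriptor{{digest: "sha256:sig-1"}, {digest: "sha256:sig-2"}}
}

// verifyReferrer models fetching the reference manifest and blob content
// (HTTP calls to ACR on a cache miss) and verifying the signature locally.
func verifyReferrer(ref descriptor, cred string) bool {
	if _, ok := blobCache.Load(ref.digest); !ok {
		blobCache.Store(ref.digest, []byte("signature bytes"))
	}
	return true
}

// verifySubjectImage ties the steps together: one call per subject image,
// then one goroutine per referrer returned for that subject.
func verifySubjectImage(subject string) []bool {
	cred := getCredentials("myregistry.azurecr.io")
	d := resolveSubject(subject, cred)
	refs := listReferrers(d, cred)

	results := make([]bool, len(refs))
	var wg sync.WaitGroup
	for i, ref := range refs {
		wg.Add(1)
		go func(i int, ref descriptor) {
			defer wg.Done()
			results[i] = verifyReferrer(ref, cred)
		}(i, ref)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(verifySubjectImage("myregistry.azurecr.io/app:v1"))
}
```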

Questions

  1. How many unique containers do we anticipate customers will have?
  2. How many artifacts per image will customers have?
  3. Will customers run multiple containers in the same pod with the same image?
    • If true, caching would be necessary to avoid redundant validation