Ratify is at risk of being a single point of failure in the Kubernetes admission process. Gatekeeper fails open by default, so a failure to reconcile the admission result for a particular resource will not block resource creation. However, this provides a weak security guarantee that is directly dependent on the availability of Ratify. For large clusters with constant resource creation, a single Ratify pod is more susceptible to failure. The goal is for Ratify to be deployed as a replicaset.
Many resources, including all in-memory caches, file-based blob stores, and certificate files, must be shared between the replicas to avoid repeating very expensive network operations against the same remote resources. Remote resources like registries have throttling limits; replicas that don't share common resources will almost certainly trigger throttling.
Ratify must:
Please reference this doc for an overview of the current caching state in Ratify.
Ratify has two primary cache categories: in-memory caches and the blob store cache.
There are 4 separate in-memory caches backed by 3 different cache types. This makes it very difficult to standardize cache interactions and emit uniform metrics. Furthermore, supporting multiple cache types makes it difficult to switch between in-memory and distributed caching for high-availability scenarios.
Each of these levels of caching is integral to nearly every operation for Ratify. All verification operations, all registry operations, and all auth-related operations rely on caching for performant execution.
During the design process, we discovered that the existing in-memory caching strategy is not suitable for external verifier/store plugins. Ratify invokes each external plugin as a separate process, so any in-memory cache in the main Ratify process cannot be shared with the plugin sub-processes. This particularly affects external verifier plugins, which instantiate and interact directly with the ORAS store to perform registry operations as required. Without a cache shared with the main process, there could be many redundant and expensive auth and descriptor requests to a remote registry.
Should we use a distributed Redis cache with a persistent OCI blob store?
Should we change the verifier's responsibility so it does not interact with Referrer Store?
Every verifier implementation would only have access to a byte array to validate against. It would be up to the executor to extract the blob content and then invoke the verifier.
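If that split were adopted, the verifier contract could shrink to something like the sketch below (in Go, Ratify's implementation language). The names `ReferenceVerifier`, `VerifierResult`, and the method set are illustrative assumptions, not a proposed final API:

```go
package verifier

import "context"

// VerifierResult is a minimal placeholder for the existing verifier result type.
type VerifierResult struct {
	IsSuccess bool
	Message   string
}

// ReferenceVerifier under the proposed split: the executor resolves the
// referrer store, fetches the blob content, and hands the verifier only raw
// bytes to validate against.
type ReferenceVerifier interface {
	Name() string
	CanVerify(ctx context.Context, artifactType string) bool
	// Verify no longer receives a ReferrerStore; it validates the supplied
	// blob contents for the subject and returns a result.
	Verify(ctx context.Context, subjectReference string, blobs [][]byte) (VerifierResult, error)
}
```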
Each cache implementation will follow the factory/provider paradigm of other Ratify components: a cache implementation will register a factory `Create` method with the global factory provider map. The cache-specific factory will be responsible for creating an instance of the cache and returning it. The `NewCacheProvider` function will be responsible for calling the corresponding `Create` function for the cache implementation name specified. Finally, the static `memoryCache` variable will be set to the created cache instance. An accessor method will be used throughout Ratify packages to retrieve the global cache reference.
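A minimal sketch of that factory/provider wiring, assuming a hypothetical `cache` package; the exact signatures (string-keyed factory map, `Create(cacheEndpoint)`) are illustrative rather than the final API:

```go
package cache

import (
	"context"
	"fmt"
)

// CacheProvider is the unified cache interface (the full method set is
// sketched in the interface section below; trimmed here so this snippet stands alone).
type CacheProvider interface {
	Get(ctx context.Context, key string) (string, bool)
	Set(ctx context.Context, key string, value interface{}) bool
}

// CacheFactory is implemented once per cache type (e.g. redis, ristretto).
type CacheFactory interface {
	Create(cacheEndpoint string) (CacheProvider, error)
}

// global factory provider map: cache implementation name -> factory
var cacheFactories = map[string]CacheFactory{}

// global cache instance, set once during initialization
var memoryCache CacheProvider

// Register is called by each cache implementation (typically from an init()
// function) to add its factory to the global map.
func Register(name string, factory CacheFactory) {
	cacheFactories[name] = factory
}

// NewCacheProvider looks up the factory registered for cacheType, calls its
// Create method, and stores the result in the global memoryCache variable.
func NewCacheProvider(cacheType, cacheEndpoint string) (CacheProvider, error) {
	factory, ok := cacheFactories[cacheType]
	if !ok {
		return nil, fmt.Errorf("unsupported cache type: %s", cacheType)
	}
	c, err := factory.Create(cacheEndpoint)
	if err != nil {
		return nil, err
	}
	memoryCache = c
	return c, nil
}

// GetCacheProvider is the accessor used throughout ratify packages to
// retrieve the global cache reference (nil if the cache was never initialized).
func GetCacheProvider() CacheProvider {
	return memoryCache
}
```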
The initialization of the cache will be done only for the `serve` CLI command. New flags, `cache-type` and `cache-endpoint`, can be specified to override the default values ('redis' and 'localhost:6379'). Cache initialization is a blocking operation that will stall Ratify startup if it is not successful.
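Illustrative wiring of those flags onto the `serve` command (Ratify's CLI uses cobra); the option struct, flag help text, and the import path for the `cache` package sketched above are assumptions:

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"

	// hypothetical package path for the cache sketch above
	"example.com/ratify/pkg/cache"
)

type serveCmdOptions struct {
	cacheType     string
	cacheEndpoint string
}

func newServeCmd() *cobra.Command {
	opts := serveCmdOptions{}
	cmd := &cobra.Command{
		Use: "serve",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Blocking: if the cache cannot be initialized, serve fails to start.
			if _, err := cache.NewCacheProvider(opts.cacheType, opts.cacheEndpoint); err != nil {
				return fmt.Errorf("failed to initialize cache: %w", err)
			}
			// ... start the external data provider / HTTP server ...
			return nil
		},
	}
	cmd.Flags().StringVar(&opts.cacheType, "cache-type", "redis", "cache implementation to initialize")
	cmd.Flags().StringVar(&opts.cacheEndpoint, "cache-endpoint", "localhost:6379", "cache endpoint to connect to")
	return cmd
}
```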
The cache interface returns whether each operation was successful rather than returning the error. All cache operations should be non-blocking and will fall back to non-cached behavior if any errors occur. It is left to the cache implementation to log any errors so users are informed.
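A sketch of what that interface could look like; the method names and the explicit TTL variant are assumptions based on the behavior described above:

```go
package cache

import (
	"context"
	"time"
)

// CacheProvider returns success/failure booleans instead of errors. Callers
// treat any failure as a cache miss; implementations are expected to log the
// underlying error themselves.
type CacheProvider interface {
	// Get returns the cached value (as a marshaled string) and whether the key was found.
	Get(ctx context.Context, key string) (string, bool)
	// Set writes the value with the default TTL and reports success.
	Set(ctx context.Context, key string, value interface{}) bool
	// SetWithTTL writes the value with an explicit TTL and reports success.
	SetWithTTL(ctx context.Context, key string, value interface{}, ttl time.Duration) bool
	// Delete removes the key and reports success.
	Delete(ctx context.Context, key string) bool
	// Clear empties the cache and reports success.
	Clear(ctx context.Context) bool
}
```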
A well-known cache key schema must be defined for each caching purpose. This will avoid key conflicts.
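An illustrative key schema, assuming one prefix per caching purpose; the prefix names below are examples only and would need to be agreed on per cache consumer:

```go
package cache

import "fmt"

// Illustrative: one well-known prefix per caching purpose keeps keys
// namespaced and prevents collisions between consumers sharing a store.
const (
	// subject descriptor lookups, keyed by fully qualified subject reference
	prefixSubjectDescriptor = "cache_ratify_subject_descriptor_%s"
	// ListReferrers results, keyed by subject reference
	prefixListReferrers = "cache_ratify_list_referrers_%s"
	// verification results served by the external data handler
	prefixVerifyHandler = "cache_ratify_verify_handler_%s"
	// registry auth credentials, keyed by registry host
	prefixOrasAuth = "cache_ratify_oras_auth_%s"
)

// CacheKey builds a namespaced key from a prefix format and its identifying value.
func CacheKey(prefixFormat, value string) string {
	return fmt.Sprintf(prefixFormat, value)
}
```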
The `Create` method will initialize a new `go-redis` client with the provided `cacheEndpoint`. Each of the interface methods will implement the corresponding Redis accessor, writer, deleter, and clear operations.
Redis is key/value string typed. Any values passed will be JSON marshaled into string form before being written to cache. This requires all value types to be marshalable. Similarly, all values returned from the cache will be generic string typed. Since the Redis cache implementation is agnostic of the value passed in, it is left to the invoker to unmarshal the `interface{}` into the desired type.
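A sketch of the Redis provider under those rules, continuing the hypothetical `cache` package above. It uses `go-redis` v9 and shows both the provider-side JSON marshaling and the caller-side unmarshal; the zero default TTL is an assumption and factory registration with the provider map is elided:

```go
package cache

import (
	"context"
	"encoding/json"
	"time"

	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
	"github.com/redis/go-redis/v9"
)

type redisCacheFactory struct{}

type redisCache struct {
	client *redis.Client
}

// Create initializes a go-redis client pointed at the configured endpoint.
func (f *redisCacheFactory) Create(cacheEndpoint string) (CacheProvider, error) {
	return &redisCache{client: redis.NewClient(&redis.Options{Addr: cacheEndpoint})}, nil
}

// Set JSON-marshals the value and writes it as a string; errors are swallowed
// and surfaced only as a false return (the implementation would log them).
func (r *redisCache) Set(ctx context.Context, key string, value interface{}) bool {
	return r.SetWithTTL(ctx, key, value, 0)
}

func (r *redisCache) SetWithTTL(ctx context.Context, key string, value interface{}, ttl time.Duration) bool {
	bytes, err := json.Marshal(value)
	if err != nil {
		return false
	}
	return r.client.Set(ctx, key, string(bytes), ttl).Err() == nil
}

// Get returns the raw JSON string; unmarshaling is the caller's job.
func (r *redisCache) Get(ctx context.Context, key string) (string, bool) {
	val, err := r.client.Get(ctx, key).Result()
	if err != nil {
		return "", false
	}
	return val, true
}

func (r *redisCache) Delete(ctx context.Context, key string) bool {
	return r.client.Del(ctx, key).Err() == nil
}

// Clear flushes the database backing the cache.
func (r *redisCache) Clear(ctx context.Context) bool {
	return r.client.FlushDB(ctx).Err() == nil
}

// Caller side: unmarshal the generic string into the concrete type expected,
// treating any unmarshal failure as a cache miss.
func getCachedDescriptor(ctx context.Context, c CacheProvider, key string) (*ocispec.Descriptor, bool) {
	raw, ok := c.Get(ctx, key)
	if !ok {
		return nil, false
	}
	var desc ocispec.Descriptor
	if err := json.Unmarshal([]byte(raw), &desc); err != nil {
		return nil, false
	}
	return &desc, true
}
```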
Please refer to this document for a detailed Redis security analysis.
Dapr (Distributed Application Runtime) is a portable runtime that allows applications to easily integrate with many resources and services. It is built for distributed applications that adopt a microservice architecture. The centralized control of the various integration points allows users to build platform-agnostic microservices.
Dapr in K8s is deployed as a sidecar container on each pod. It intercepts all requests made through the Dapr client. The Dapr Operator pod is responsible for managing the various Dapr integration resource CRDs generated; each CR represents a supported 3rd-party integration. The Dapr Sentry pod is responsible for injecting the trusted certs used by the sidecar for all requests (mTLS). The Dapr Sidecar Injector watches for new pod creations and injects daprd sidecar containers as necessary.
Dapr has robust state-store support. This shim allows the application to be state-store agnostic; there is support for many different state stores, including Redis. The application initializes a Dapr client using the SDK, and a state-store-specific resource is applied on the cluster to configure Dapr to point at the chosen store.
Pros:
Cons:
Cache unification design will remain: all 4 in-memory caches will be unified behind a single `CacheProvider` implementation. There will only be two cache providers:
1. Ristretto: an in-memory cache provider. This provides feature parity with Ratify's current capability. It will be the default mode and should ONLY be used for single-pod Ratify deployments.
2. Dapr: a distributed state store shim. Equivalent Dapr SDK methods will be used in the `CacheProvider` interface implementation. NOTE: installing Dapr and Redis will be documented as prerequisites, but the Ratify chart will NOT handle automatic installation (Gatekeeper is taking the same approach). Dapr will live behind a feature flag and be turned off by default.
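A sketch of the Dapr-backed provider, again in the hypothetical `cache` package. It assumes the Dapr Go SDK state APIs (`SaveState`/`GetState`/`DeleteState`), a state store component named `ratify-state-store`, and TTL expressed through Dapr's `ttlInSeconds` metadata entry; exact signatures can differ across SDK versions:

```go
package cache

import (
	"context"
	"encoding/json"
	"strconv"
	"time"

	dapr "github.com/dapr/go-sdk/client"
)

type daprCacheFactory struct{}

// daprCache implements CacheProvider on top of the Dapr state store API, so
// the actual backing store (Redis here) is configured entirely through the
// Dapr Component resource on the cluster.
type daprCache struct {
	client    dapr.Client
	storeName string
}

// Create connects to the local daprd sidecar; the cache endpoint flag is not
// needed because the sidecar address comes from the environment.
func (f *daprCacheFactory) Create(_ string) (CacheProvider, error) {
	client, err := dapr.NewClient()
	if err != nil {
		return nil, err
	}
	return &daprCache{client: client, storeName: "ratify-state-store"}, nil
}

func (d *daprCache) Set(ctx context.Context, key string, value interface{}) bool {
	bytes, err := json.Marshal(value)
	if err != nil {
		return false
	}
	return d.client.SaveState(ctx, d.storeName, key, bytes, nil) == nil
}

func (d *daprCache) SetWithTTL(ctx context.Context, key string, value interface{}, ttl time.Duration) bool {
	bytes, err := json.Marshal(value)
	if err != nil {
		return false
	}
	// Dapr expresses state TTL through the ttlInSeconds metadata entry.
	meta := map[string]string{"ttlInSeconds": strconv.Itoa(int(ttl.Seconds()))}
	return d.client.SaveState(ctx, d.storeName, key, bytes, meta) == nil
}

func (d *daprCache) Get(ctx context.Context, key string) (string, bool) {
	item, err := d.client.GetState(ctx, d.storeName, key, nil)
	if err != nil || item == nil || len(item.Value) == 0 {
		return "", false
	}
	return string(item.Value), true
}

func (d *daprCache) Delete(ctx context.Context, key string) bool {
	return d.client.DeleteState(ctx, d.storeName, key, nil) == nil
}

// Clear is intentionally a no-op in this sketch: flushing a shared state
// store from one replica would evict entries for every replica.
func (d *daprCache) Clear(ctx context.Context) bool {
	return false
}
```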
Gatekeeper is introducing support for publishing constraint violations to external sources via pub-sub providers. The first provider they have added is Dapr, with Redis as the selected message broker. Ratify should plan to share the same Redis instance as Gatekeeper's pub-sub integration.
Further investigation revealed that there is a hard requirement for caches to be shared across processes. External plugins for verifiers/referrer stores are invoked as separate processes by Ratify, which means an in-memory cache cannot be accessed by the external plugin processes.
Assumptions:
- If the `verify` command on the Ratify CLI requires caching, the user can point Ratify to a preconfigured Redis instance like the `serve` command.
- Ratify does not enable a cache by default. All references to the unified cache would check whether the cache is initialized; if not, any cache interaction is skipped and treated as a cache miss (see the sketch after this list).
- … `serve` command. (Note: we could potentially explore having the `serve` command start up a docker container with redis on it, but we don't think this is ideal.)
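The sketch referenced in the assumptions above: when no cache was configured, the global accessor returns nil and every caller degrades to a cache miss (package path and helper name are hypothetical):

```go
package example

import (
	"context"

	// hypothetical package path for the cache sketches above
	"example.com/ratify/pkg/cache"
)

// getCachedValue guards every cache interaction: a nil provider (cache never
// initialized, e.g. the ratify CLI without a configured cache) is simply a miss.
func getCachedValue(ctx context.Context, key string) (string, bool) {
	c := cache.GetCacheProvider()
	if c == nil {
		return "", false
	}
	return c.Get(ctx, key)
}
```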