Notary V2 - Rescinding Signatures
Notary issue #72
Terminology
- Publisher - User who builds and signs artifacts and publishes them to a registry.
- Consumer - User or system that consumes (and deploys) signed artifacts from a registry. In the Notary Scenarios, Wabbit Networks is a Publisher and ACME Rockets is a Consumer.
- Node - The Consumer owned infrastructure (physical or virtual machine) on which the artifact is pulled and executed.
- Orchestrator - The Consumer owned software that manages aspects like deployment, container placement, and scaling of a node cluster, e.g. Kubernetes, Amazon ECS, EKS, Azure AKS.
- Node Agent - A daemon on the Node that the Orchestrator communicates with to perform actions on the Node (manage container lifecycle, health checks, etc.), e.g. Kubernetes kubelet.
- Trust Store - List of certificates/public keys trusted by the Consumer. A trust store is used for signature validation.
- Trust Policy - Configuration that controls aspects of signature validation, e.g. enable/disable signature validation, require timestamp signature, etc.
- Air Gapped Network - A network which is completely isolated from the public internet. A User operating services in an air gapped network can only rely on infrastructure and services available within this network. For Container based services, it is assumed that the Consumer only has access to the registry and other services local to the air gapped network, and that the registry operator does not have access to the public internet. There are two ways in which customers build software that runs in an air gapped network:
- The artifacts are authored, signed, distributed and consumed within the air gapped network.
- The artifacts are authored and signed outside the air gapped network then "imported" and distributed inside the air gapped network for consumption through a one way data transfer.
- Constrained Network - The infrastructure that consumes artifacts (orchestrators and nodes) may not have access to an external network (internet or air gapped network), depending on the network configuration defined by the Consumer. Again, we assume that the infrastructure has access to the registry from which it pulls artifacts (e.g. a Kubernetes cluster where the control plane and nodes have no access to the public internet).
Context
Why is Revocation required?
Signed artifacts attest to integrity (the artifact was unmodified after it was published) and authenticity (the artifact was published by the party claiming to be the Publisher), but make no guarantees about the quality of the content in the artifacts themselves. There can be multiple conditions in which a Publisher no longer considers the artifact fit for consumption. Though a Consumer can have their own mechanism (e.g. image scanner) to check artifacts, the Publisher may be in the best position to make assertions about the artifacts they distribute.
Unlike large organizations, a majority of Users do not have dedicated resources that monitor their dependencies (through mechanisms like CVE databases). The primary scenario enabled by revocation is that Publishers have a consistent way to signal Consumers about untrusted artifacts, across multiple registries, and to provide Consumers with controls to react to it as they see fit. It's not expected that every user will use this feature: revocation events are rare, and support for revocation usually means additional overhead for Publishers and Consumers. Publishers that vend popular, widely used artifacts and security conscious Consumers are likely to use this feature.
Alternatives in the absence of Revocation
When a Publisher no longer considers an artifact as trusted, a Publisher could make the artifact unavailable by deleting it and the tag associated with it.
- A Consumer would no longer be able to build new software with the artifact.
- If a Consumer already pulled and used the artifact in the past, future pulls will fail, and cause an outage for the Consumer.
- If a Consumer pulled the artifact, pushed it to another repository, and used the artifact from the downstream repository, they could be unaware of the revocation event. This may mean the Consumer continues to operate with a possibly degraded security posture. Additionally, an attacker may have copied the artifact before it was deleted and may distribute it through another repository.
How do Consumers use this information?
Using Trust policies, Consumers can specify whether they want execution of the artifact to proceed or block when an artifact is revoked.
How Consumers react to revocation is highly dependent on a User's specific use case/workload:
- A User who deploys highly available online services may prefer to be notified of a revocation event, evaluate the impact and roll forward with an update, instead of blocking execution (causing an outage). Roll forwards may not be automated, as the newer version may not be compatible or may have different behavioral characteristics, which require verification/testing by the Consumer.
- A different User whose workloads are build nodes that build device drivers for secure devices may prefer that execution fails on revocation events.
- Additionally, a User may have more confidence to fail closed for a revocation event where the User is also the Publisher, compared to a Third Party Publisher.
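The choice between blocking execution and only warning can be sketched as a small policy evaluation. This is a minimal illustration and not the Notary V2 trust policy format; the names `FailureMode`, `TrustPolicy`, and `decide` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class FailureMode(Enum):
    ERROR = "error"  # fail closed: block execution of a revoked artifact
    WARN = "warn"    # fail open: allow execution but surface a warning


@dataclass(frozen=True)
class TrustPolicy:
    # Hypothetical fields; the real trust policy schema is not defined here.
    revocation_checks_enabled: bool
    failure_mode: FailureMode


def decide(policy: TrustPolicy, artifact_revoked: bool) -> Tuple[bool, str]:
    """Return (allow_execution, message) for an artifact under the policy."""
    if not policy.revocation_checks_enabled or not artifact_revoked:
        return True, "ok"
    if policy.failure_mode is FailureMode.WARN:
        return True, "warning: artifact is revoked"
    return False, "error: artifact is revoked"


# A highly available online service prefers to be notified and keep running.
online_service = TrustPolicy(revocation_checks_enabled=True, failure_mode=FailureMode.WARN)
# A build node producing device drivers prefers to fail closed.
driver_build = TrustPolicy(revocation_checks_enabled=True, failure_mode=FailureMode.ERROR)
```

With these two policies, the same revoked artifact keeps running (with a warning) for the online service and is blocked on the build node.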
Scenarios
- A Publisher discovers a critical vulnerability in a previously signed artifact, and revokes the specific artifact. Whether the revoked artifact is subsequently used by a Consumer is a policy decision for the Consumer.
- A Publisher's signing key is compromised. The Publisher wants to indicate that any signed artifacts associated with the signing key are revoked.
- Cross registry usage - a Consumer pulls an artifact from a public repository, validates it and pushes it to an internal registry and uses the artifact. Subsequently the Publisher revokes the artifact, and the Consumer discovers that the artifact was revoked.
- Air gapped network - An artifact authored outside the air gapped network is revoked by the Publisher, and the Consumer discovers that the artifact was revoked.
Requirements
- A Publisher MUST have a mechanism to indicate that a signed artifact is no longer trusted.
- A Publisher MUST have a mechanism to indicate that multiple signed artifacts associated with a signing key are no longer trusted. For blast radius control, a Publisher can OPTIONALLY indicate the time after which the signing key should be considered untrusted.
- The Consumer MUST be able to enable revocation checks and to select the failure mode (error, warn) that applies when revocation checks fail.
- The Consumer SHOULD have mechanisms to discover this information from an air gapped or constrained network.
Alternatives Considered
Allowlist in Repository
- Allowlists explicitly list the trusted artifacts. This can be a list of digests that the publisher considers trusted for consumption.
- The allowlist can be signed, providing protection against a compromised registry.
Pros
- Anything not in the list is implicitly untrusted, no separate revocation mechanism is required. A Publisher removes a digest from the list to indicate that it's no longer trusted.
- If signed allowlists are used, the artifacts themselves may not need to be signed.
Cons
- The allowlist needs to be updated for every artifact being pushed to the repository.
- In scenarios where artifacts move across multiple registries, when a Publisher revokes an artifact in the source repository, downstream repositories are not automatically updated without an additional sync mechanism to propagate the allowlist.
- It does not allow movement of specific artifacts to another registry without moving the complete signed allowlist along with it.
- May need to maintain a large allowlist which may be an overhead to distribute.
- Registries are optimized for inserts, not updates. Updating a single artifact (the allowlist) can lead to inconsistency as registries tend to work with eventual consistency.
- Requires registries to support storage and querying of repository level metadata.
Denylist in Repository
- Denylists explicitly list artifacts which are untrusted.
- The denylist can be signed, providing protection against a compromised registry.
Pros
- Updates to the denylist are less frequent, unlike an allowlist, which must be updated for every new artifact.
Cons
- Similar to allowlists, an additional sync mechanism is required to propagate the denylist across repositories.
- Requires registries to support storage and querying of repository level metadata.
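The difference in update frequency between the two list types can be illustrated with a short sketch. The helper names and the shortened placeholder digests below are assumptions for illustration only; signing of the lists is omitted.

```python
from typing import Set


def trusted_by_allowlist(digest: str, allowlist: Set[str]) -> bool:
    # Anything not listed is implicitly untrusted.
    return digest in allowlist


def trusted_by_denylist(digest: str, denylist: Set[str]) -> bool:
    # Anything not listed is implicitly trusted.
    return digest not in denylist


# Placeholder digests, shortened for readability.
v1, v2 = "sha256:aaaa", "sha256:bbbb"
allowlist: Set[str] = {v1}
denylist: Set[str] = set()

# Publishing a new artifact (v2): the allowlist must be updated before v2 is
# trusted, whereas the denylist needs no change.
v2_trusted_before_update = trusted_by_allowlist(v2, allowlist)
allowlist.add(v2)
v2_trusted_after_update = trusted_by_allowlist(v2, allowlist)
v2_trusted_by_denylist = trusted_by_denylist(v2, denylist)

# Revoking v1: remove it from the allowlist, or add it to the denylist.
allowlist.discard(v1)
denylist.add(v1)
```

In both models revocation is a single list update, but only the allowlist also has to change on every push.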
Certificate Revocation Lists (CRL) and Online Certificate Status Protocol (OCSP)
These are related mechanisms used by PKI to provide revocation information for certificates issued by a CA.
CRL
- Certificate Revocation Lists (CRL) are signed deny lists that contain the list of revoked certificates. The CRL is signed by the CA that issued the certificate. For each revoked certificate, the CRL contains the serial number, revocation time and reason.
- A CRL is published by the issuing CA and periodically refreshed. For certificates issued by public CAs, CRLs that contain signing certificates (end-entity certificates) are refreshed once every 7 days by the issuing CA, and CRLs that contain Subordinate CA certificates are refreshed at least once every 12 months or within 24 hours of a revocation event. Clients that perform signature validation cache CRLs and refresh them periodically before they expire.
- As part of signature validation, clients validate the revocation status of each certificate within the certificate chain of a signature. This may involve multiple CRL downloads if CRLs are not cached.
- CRLs only support certificate revocation and not single artifact/signature revocation.
- For code signing certificates, as timestamping can be used to extend signature expiry, revoked certificates are maintained in the CRL for a longer time (minimum 10 years post expiry of certificate).
- A certificate contains the CRL endpoint where its status can be checked, so no additional mechanism is required to discover the CRL endpoint.
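The per-certificate check described above can be sketched as a walk over the certificate chain, looking up each serial number in the CRL named by that certificate. This is a simplified model under stated assumptions: the `Certificate` and `RevokedEntry` types, the `find_revoked` helper, and the example URLs are hypothetical, CRLs are assumed to be already downloaded and signature-verified, and no real X.509 parsing is performed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RevokedEntry:
    serial_number: int
    revocation_time: datetime
    reason: str


@dataclass(frozen=True)
class Certificate:
    subject: str
    serial_number: int
    crl_url: str  # CRL endpoint carried in the certificate itself


# Maps a CRL URL to its entries, keyed by revoked serial number.
CrlStore = Dict[str, Dict[int, RevokedEntry]]


def find_revoked(chain: List[Certificate], crls: CrlStore) -> Optional[RevokedEntry]:
    """Return the first revocation entry found for any certificate in the chain.

    Raises KeyError if the CRL for a certificate has not been downloaded.
    """
    for cert in chain:
        entry = crls[cert.crl_url].get(cert.serial_number)
        if entry is not None:
            return entry
    return None


leaf = Certificate("CN=Example Signer", 1001, "http://crl.example.com/issuing.crl")
intermediate = Certificate("CN=Example Issuing CA", 42, "http://crl.example.com/root.crl")
crls: CrlStore = {
    "http://crl.example.com/issuing.crl": {
        1001: RevokedEntry(1001, datetime(2024, 1, 15, tzinfo=timezone.utc), "keyCompromise"),
    },
    "http://crl.example.com/root.crl": {},
}
hit = find_revoked([leaf, intermediate], crls)
```

Here the chain needs two CRLs, one per issuer, which is why an uncached validation may involve multiple downloads.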
OCSP
- Online Certificate Status Protocol (OCSP) allows querying the revocation status of a certificate on demand rather than downloading and refreshing CRLs periodically. OCSP responses can be cached, similar to CRLs.
- OCSP responses avoid some of the downsides associated with CRLs, such as the download and parsing of large CRL files. Typical OCSP responses can be ~2.5 KB, whereas large CRLs can be 10s of MBs.
Pros for CRL/OCSP
- Relies on a centralized mechanism to get the revocation status of a certificate. Revocation status can be fetched even when artifacts move across multiple registries.
- Relies on existing mechanisms that are used by Publishers that code sign software they publicly distribute.
Cons for CRL/OCSP
- CRLs for code signing certificates can get large over time, as revoked code signing certificates are not pruned at their expiry. Systems consuming the CRL need to download and parse large files, which can add latency to signature validation. Public CAs use CRL sharding to address this downside: multiple CRL URLs are used and the maximum size of each individual CRL is controlled.
- CRL and related mechanisms do not support artifact level revocation.
- Requires Publishers to use certificates issued through a CA (e.g. a public CA) that maintains publicly accessible CRL/OCSP endpoints. If an internal organization's CA is used, the organization will require the CRL/OCSP endpoints to be public, and highly available, which can add operational burden.
- CRL and OCSP use HTTP endpoints and can be vulnerable to man-in-the-middle attacks, in which the attacker can cause a denial of service (DoS) or mount a replay attack.
- CRL and OCSP endpoints for the issuing CA may not be accessible from an air gapped network or constrained network.
Recommendations
Use CRL/OCSP
For cross registry scenarios, Allowlists and Denylists stored in a registry require either the registry operator or the Consumer to periodically refresh the lists from the upstream registry.
In both cases additional metadata is required in the artifact or signature about the source repository, from where the allow/deny list can be fetched when sync occurs.
In comparison, CRL/OCSP rely on centralized endpoints, which addresses the distribution problem, and therefore do not need an additional sync mechanism. CRL/OCSP endpoints are discoverable as they are present in certificate metadata. This metadata is signed by the certificate issuer.
- Notary V2 tooling performs signature validation based on the local trust store and trust policy. External tooling (e.g. a Node agent) can set up the trust store and policy before a signature validation occurs. The Consumer can enable/disable revocation checks through the trust policy.
- When revocation checks are enabled, the validation component will fetch CRLs or make OCSP requests as part of signature validation.
- Consumers can specify alternate CRL endpoints in the trust policy.
- If unexpired cached CRLs and OCSP responses are available locally, the validation component will use the cached version.
- The tooling must differentiate between a clear response that a revocation occurred and being unable to receive a revocation response, which can be caused either by transient failures (e.g. unavailable, throttled, DNS resolution failure) or by persistent failures (e.g. misconfigured environment causing DNS resolution failure). External tooling (e.g. a Node agent) can prefetch and populate the local CRL/OCSP cache to make revocation checks more reliable.
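The distinction between "revoked" and "could not determine", together with the use of an unexpired local cache, can be sketched as a three-state result. This is a minimal sketch and not the Notary V2 implementation: `RevocationStatus`, `CachedCrl`, and `check_serial` are hypothetical names, and the `fetch` callable stands in for whatever downloads and verifies a CRL.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Callable, Dict, Set


class RevocationStatus(Enum):
    GOOD = "good"        # a valid CRL was consulted and the serial is not listed
    REVOKED = "revoked"  # a valid CRL explicitly lists the serial
    UNKNOWN = "unknown"  # no valid revocation information could be obtained


@dataclass
class CachedCrl:
    revoked_serials: Set[int]
    next_update: datetime  # the CRL must not be trusted at or after this time


def check_serial(
    serial: int,
    crl_url: str,
    cache: Dict[str, CachedCrl],
    fetch: Callable[[str], CachedCrl],
    now: datetime,
) -> RevocationStatus:
    crl = cache.get(crl_url)
    if crl is None or crl.next_update <= now:
        # No usable cached copy: try to fetch a fresh CRL.
        try:
            crl = fetch(crl_url)
        except OSError:
            # Network-level failure (transient or persistent): not a revocation.
            return RevocationStatus.UNKNOWN
        if crl.next_update <= now:
            return RevocationStatus.UNKNOWN  # the endpoint served an expired CRL
        cache[crl_url] = crl
    if serial in crl.revoked_serials:
        return RevocationStatus.REVOKED
    return RevocationStatus.GOOD
```

A caller would map `UNKNOWN` to the failure mode chosen in the trust policy rather than treating it as `REVOKED`. Prefetching by external tooling amounts to populating `cache` before validation runs, so that an endpoint outage does not yield `UNKNOWN`.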
Artifact level revocation
- A new spec for Artifact Revocation List (ARL) and Online Artifact Status Protocol (OASP) (names not finalized) must be provided to support artifact revocation. This spec will be derived from existing standards for CRL/OCSP but is not a standard itself. It should support simple implementations (e.g. a publicly hosted endpoint with a signed ARL).
- In content addressable storage like registries, the digest of the artifact (e.g. image manifest, blob) is its unique identifier. The same artifact may have multiple signatures associated with it. ARL/OASP must use the artifact digest when referring to an artifact instead of its signature.
- Publishers that want to support artifact level revocation must host an endpoint that supports the spec.
- ARL/OASP responses must be signed (similar to CRLs) by a party that is included in the trust store, must have an expiry, and need to be periodically re-signed. The trusted parties for ARL/OASP should be explicitly configured by the Consumer.
- Publishers must include the ARL/OASP endpoint in signed artifacts for signature validation to discover these endpoints.
- The trust policy will contain additional options to enable/disable artifact level revocation checks.
- Cloud service providers can implement these specs as an added service for customers.
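Since the ARL spec does not exist yet, the following is only a speculative sketch of the two properties stated above: entries are keyed by artifact digest rather than by signature, and the list carries an expiry. The `ArtifactRevocationList` type and its fields are hypothetical, and verification of the ARL signature against the trust store is deliberately omitted.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict


def artifact_digest(content: bytes) -> str:
    """Content-addressable identifier in the form used by registries."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


@dataclass
class ArtifactRevocationList:
    # Hypothetical structure; signature fields are left out of this sketch.
    this_update: datetime
    next_update: datetime  # expiry: the ARL must be re-signed before this time
    revoked: Dict[str, str] = field(default_factory=dict)  # digest -> reason


def is_artifact_revoked(arl: ArtifactRevocationList, digest: str, now: datetime) -> bool:
    if now >= arl.next_update:
        raise ValueError("ARL has expired and must be refreshed")
    return digest in arl.revoked


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
bad = artifact_digest(b"example manifest with a critical vulnerability")
good = artifact_digest(b"example manifest of the fixed release")
arl = ArtifactRevocationList(issued, issued + timedelta(days=7), {bad: "vulnerability"})
```

Because the lookup uses the digest, every signature over the same artifact is covered by a single ARL entry.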
Revocation checks on local container
- Developers must be able to configure revocation checks on local artifacts by configuring the local trust policy.
Revocation check at deployment and runtime
- Consumers must be able to configure the trust store (trusted certificates/public keys) and trust policies in artifact execution services that support them (e.g. orchestration frameworks like Kubernetes, Azure AKS, AWS ECS, EKS).
- Users must be able to enable revocation checks in supported execution services.
- Execution services may provide capabilities to run revocation checks at the Orchestrator level, e.g. a Kubernetes admission controller that performs a revocation check before deployment to the cluster, or a Kubernetes component that performs periodic revocation checks.
Revocation checks for air gapped environment/constrained networks
- If the artifact was signed outside the air gap:
- The Consumer must be able to provide alternate CRL endpoints through the trust policy. CRLs are signed, so the authenticity of their content does not rely on secure transport (like TLS) during distribution.
- The Consumer must set up a mechanism by which CRLs are periodically aggregated and replicated within the air gapped environment/constrained network to an accessible endpoint, within the expiry period of the CRLs.
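The alternate-endpoint mechanism can be pictured as a lookup from the CRL URL embedded in a certificate to a replica reachable inside the network, combined with a freshness check on the replicated CRL. The mapping format, helper names, and URLs below are assumptions for illustration and are not defined by the trust policy.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def resolve_crl_endpoint(cert_crl_url: str, overrides: Dict[str, str]) -> str:
    """Prefer a Consumer-configured replica; otherwise use the URL from the certificate."""
    return overrides.get(cert_crl_url, cert_crl_url)


def replica_is_usable(next_update: datetime, now: datetime) -> bool:
    """A replicated CRL is only usable until its own expiry (next update) time."""
    return now < next_update


# Hypothetical trust policy fragment: public CRL URL -> replica inside the air gap.
overrides = {
    "http://crl.example.com/issuing.crl": "http://crl-mirror.airgap.example/issuing.crl",
}

resolved = resolve_crl_endpoint("http://crl.example.com/issuing.crl", overrides)
unmapped = resolve_crl_endpoint("http://crl.example.org/other.crl", overrides)

replicated_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
crl_next_update = replicated_at + timedelta(days=7)
still_fresh = replica_is_usable(crl_next_update, replicated_at + timedelta(days=3))
expired = replica_is_usable(crl_next_update, replicated_at + timedelta(days=8))
```

The freshness check is what bounds the replication schedule: a replica that is not refreshed before `crl_next_update` stops being usable for validation.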
Additional Considerations
SLA for revocation check
- The SLA for revocation check can depend on CRL validity and refresh, client caching and client refresh period. These are considerations for Publishers and Consumers based on how fast they want the revocation event to be published and consumed. The Notary V2
Revocation checks at orchestrator vs node
- Consumers should rely on execution service/orchestrator level revocation checks instead of Node level revocation checks at runtime.
- Revocation checks include calls to multiple external endpoints which may not be available (CRL/OCSP outage) and cause revocation checks to fail.
- Revocation checks may rely on public endpoints that may not be accessible from Nodes.
- Revocation checks at each artifact run may add latency, as Nodes are ephemeral and CRL/OCSP responses may need to be fetched more frequently. A revocation check for a single signature can involve multiple CRL/OCSP calls, as the revocation check is performed on each certificate in the certificate chain of the signature.
- Revocation checks at each run may be overkill if the Consumer only wants warnings/notifications instead of having the artifact run fail on revocation check failures.
Air Gapped/Constrained Networks
- Consumers can validate artifacts before transfer into the air gap, and re-sign artifacts with a PKI managed entirely within the air gapped network.
- Vendors managing constrained networks can provide CRL aggregation and replication capabilities.