GitBOM is neither git nor an SBOM.
It is an application of the git DAG, a widely used merkle tree with a flat-file storage format, to the challenge of creating build artifact trees in today's language-heterogeneous open source environments.
By generating artifact trees at build time, embedding the hash of the tree in produced artifacts, and referencing that hash in the next build step, GitBOM will enable the zero-end-user-effort creation of verifiable build trees. Furthermore, it will enable launch-time comparison of vulnerability data against a complete artifact tree for both open source and proprietary projects (if vuln data is traceable back to source file).
It is desirable to enable efficient launch-time comparison of the verifiable and complete build tree of any executable component [1] against a then-current list of source files known to be undesirable [2], where such a build tree contains unique referents for all sources from which the given executable object was composed.
[1]: binary, dynamically-linked library, container image, etc.
[2]: because vulnerabilities may be discovered between the time an executable is created and the time when it is run, these processes must be decoupled
In an ideal scenario, an open source consumer would have available to them a complete artifact tree, tracing dependencies to their ultimate depth. Even if we do not achieve this ideal, we should seek a solution with the lowest cost of adoption so as to enable the greatest buy-in across all open source ecosystems and communities.
For this reason we propose two areas of work:
Following from (1), this approach will require minimal to no effort on the part of open source project maintainers, thus significantly increasing its chances of widespread adoption as compared to any approach which requires maintainers to perform additional actions (e.g., implementing substantive changes in their CI/CD or package build pipeline to generate an SBOM).
Following from (2), this on-disk format provides an efficient and already well-understood method for cross-referencing artifacts and source files by a deterministically generated unique identifier (a SHA1 or SHA256 hash).
┌─────────────────────────────┐
│ Build-time Tree Generation  │
│                             │
│ ┌────────┐   ┌────────┐     │
│ │ Src A  │   │ Src B  │     │
│ └───┬────┘   └──┬─────┘     │
│     │           │           │
│     ▼           │           │
│   ┌───────┐     │           │
│   │ Obj A │     │           │
│   └─────┬─┘     │           │
│         │       │           │
│         ▼       ▼           │
│        ┌─────────────┐      │
│        │ Compilation │      │
│        │ & Signing   │      │
│        └─────┬───────┘      │
│              │              │
└──────────────┼──────────────┘
               │
            ┌──▼─┐
            ▼    ▼
 ┌──────────┐   ┌──────┐
 │ [header] ├──►│gitbom│
 │executable│   │ tree │
 └──────┬───┘   └┬─────┘
        │        │
┌───────┼────────┼────────────────────────────┐
│       │        │   Launch Time Comparison   │
│       │        │                            │
│       ▼        ▼          ┌────────────┐    │
│   ┌───────────────┐       │ Public     │    │
│   │ Policy        │◄─────►│ Vuln       │    │
│   │ Enforcement   │       │ Database   │    │
│   └─┬─────────────┘       └────────────┘    │
│     │                                       │
│     ▼                                       │
│   ┌─────────────────────┐                   │
│   │ Runtime Environment │                   │
│   └─────────────────────┘                   │
│                                             │
└─────────────────────────────────────────────┘
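The launch-time comparison step in the diagram can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not specify how one GitBOM document links to the GitBOM documents of its inputs, so the sketch assumes a local store (a plain dict here) keyed by gitref, and treats any gitref missing from the store as a leaf such as a source file. The names parse_gitbom, reachable, violations, and the short placeholder refs are hypothetical.

```python
def parse_gitbom(document: bytes):
    """Return the gitrefs listed in a GitBOM document ('blob <gitref>' per line)."""
    refs = []
    for line in document.decode("ascii").splitlines():
        kind, _, ref = line.partition(" ")
        if kind == "blob" and ref:
            refs.append(ref)
    return refs


def reachable(root, store):
    """Collect every gitref reachable from `root`.

    ASSUMPTION: `store` maps the gitref of each known GitBOM document to its
    bytes; a gitref that is absent from `store` is treated as a leaf.
    """
    seen, stack = set(), [root]
    while stack:
        ref = stack.pop()
        if ref in seen:
            continue
        seen.add(ref)
        if ref in store:
            stack.extend(parse_gitbom(store[ref]))
    return seen


def violations(root, store, undesirable):
    """Gitrefs in the tree that appear on the then-current undesirable list."""
    return reachable(root, store) & set(undesirable)


# Placeholder refs stand in for real 40-character hashes.
store = {
    "root": b"blob aaa\nblob bbb\n",
    "bbb": b"blob ccc\n",
}
flagged = violations("root", store, {"ccc", "zzz"})
print(sorted(flagged))
```

Because the check reduces to a graph walk plus a set intersection over opaque hashes, the undesirable list can be refreshed at any time without rebuilding or re-signing the executable.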
GitBOM is an approach which has the following properties:
Two artifacts are equivalent if []byte(artifact1) == []byte(artifact2).
Two artifacts are said to be equivalent if and only if they are byte-for-byte identical. This implies that GitBOM is not concerned with questions of provenance, origination, licensure, or the many other aspects that are encompassed by a software bill of materials and that could differ between byte-equivalent artifacts.
Independent parties, presented with equivalent artifacts, derive the same artifact identity.
This implies that a deterministic hashing function may be used to derive artifact identity, such as SHA256.
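A minimal sketch of these two properties follows. It assumes that an artifact's identity is its git blob object ID, which the git-based design suggests but this passage does not spell out. The function name gitref and the sample byte strings are illustrative, not part of any specification.

```python
import hashlib


def gitref(data: bytes, algo: str = "sha1") -> str:
    """Return the git blob object ID of `data` as a hex string.

    Git hashes a header of the form 'blob <size>' plus a NUL byte,
    followed by the raw content.
    """
    header = b"blob %d\x00" % len(data)
    return hashlib.new(algo, header + data).hexdigest()


artifact1 = b"int main(void) { return 0; }\n"
artifact2 = bytes(artifact1)      # byte-for-byte identical copy
artifact3 = artifact1 + b"\n"     # differs by one trailing byte

# Equivalent artifacts: independent parties derive the same identity.
same_identity = gitref(artifact1) == gitref(artifact2)
# Non-equivalent artifacts: the identities differ.
different_identity = gitref(artifact1) != gitref(artifact3)
print(same_identity, different_identity)
```

Passing algo="sha256" yields a 64-character identifier instead of a 40-character one; the derivation is otherwise unchanged.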
An identified artifact cannot be modified without also changing its identity. Non-equivalent artifacts have distinct identities.
"An identified artifact" means an artifact whose identity has been determined. "Cannot be modified without also changing its identity" means that the deterministic hashing function is treated as having no collisions in practice, and therefore any change to the artifact results in a change to its identity. In this way, the relationship between artifact and identity is immutable.
An artifact can have precisely one artifact identity graph. All equivalent artifacts have the same graph.
This implies that we must not include build tooling in the artifact tree, as doing otherwise would violate the Uniqueness requirement. For example, two reproducible build systems which rely on different auxiliary libraries (e.g., zlib) and result in byte-equivalent outputs must yield identical GitBOMs.
For further exploration of this topic, see Wheeler's work on reproducibility as a means to verify trustability: Countering Trusting Trust through Diverse Double-Compiling
Note the implication that for any artifact, there can only be one artifact identity graph, but the reverse is not true. Each artifact identity graph may generate multiple artifacts (e.g., if different build parameters are used, or it is compiled on a different architecture, or different metadata, such as compile time, were embedded in the built artifact).
Artifacts and associated metadata may be obfuscated when sharing the artifact identity graph, while preserving other properties.
Metadata about artifacts and their associated artifact trees may have varying levels of sensitivity. GitBOM allows the supplier to reveal as little or as much as they, in negotiation with their consumers, choose. The GitBOM tree itself is just a Merkle tree of opaque hashes. This provides transparency about the artifact tree and its structure, while allowing supplier-modulated levels of opaqueness about the metadata.
Artifacts may be associated, through their identity, to independently generated metadata stored outside of the artifact identity graph, such as an SBOM containing license and provenance metadata.
There are a great many use cases that could use GitBOMs. An incomplete list would include:
Undoubtedly, more will arise. Independence of metadata enables independent, permissionless innovation around each use case without the need for cross-domain coordination. This lowers the cost of innovation and thus allows more productive innovation in this space.
GitBOM is not an SBOM standard.
From the GitBOM perspective, any SBOM document is a type of artifact which could be referenced in an artifact tree.
From an SBOM perspective, GitBOM is a common precise way to identify artifacts and their artifact trees, and nothing more. This makes GitBOM incapable of fulfilling many of the objectives of SBOMs, such as recording provenance, origination, build environment information, licensure, and other qualities.
Speaking strictly from an SPDX 3.0-draft perspective, GitBOM is a lossy serialization format that only includes the minimum metadata field of "Identifier".
Current metadata formats, such as SPDX 2.x, as well as current systems to sign and transport metadata documents, do not efficiently support our use case in the general case. They may well, however, support this use case in a specialized case, which we will discuss.
An argument can be made that current metadata formats can enable run-time analysis of the complete artifact tree. Achieving this would require (1) that generation of SBOM metadata be performed using compatible tooling by every project within the tree, (2) that the documents' distribution be consistent, and, crucially, (3) that a separate system exist to recursively fetch and parse metadata documents for all related projects and index them in a manner enabling efficient search.
Let us look briefly at these three adoption requirements in more detail to understand the implications for (and, at least, one motivation for hesitancy in uptake of) volunteer-maintained open source projects.
Current tooling to generate SBOM documents requires effort on the part of every OSS project maintainer to integrate with their build systems. While full SBOM generation could be integrated into compilers and linkers, as we propose for GitBOM, many view the complexity as overly burdensome on small projects, creating a source of friction that has hampered, and may continue to hamper, adoption. On the other hand, due to the pervasiveness of Git itself, we believe a minimalist approach that already feels familiar will be better received by this long tail of OSS projects.
One obstacle in the distribution and adoption of SBOMs has been competing standards (see the "Landscape" document for examples in addition to SPDX). By proposing to capture only the bare minimum metadata necessary to enable this scenario, we believe this proposal will avoid the ongoing debates about competing standards. N.B.: Early socialization of this idea has received fairly wide support for the principle of a minimalist disk-based representation of the artifact tree.
Run-time comparison, as described in the Objective, must be within the capabilities of even small and independent consumers of open source. A proposal which required large investments in infrastructure (e.g., that an operator maintain a database containing complete SBOM documents for the totality of open source) will not be seen as a reasonable requirement for smaller and independent organizations (even though it may make for a very compelling product offering, were someone to build and license it!).
TODO
TODO: Santiago
Imagine we have the following two files:
hello.c has gitref c64efd8bd8bceca8c69f9b5b7647cf0ff61fed59 and includes stdio.h
stdio.h has gitref c0f35b8ae567f5348df3711496fdc0ef6f634169
From these two inputs, we compile hello.o. The resulting GitBOM is a document (text file) containing the lexically ordered sequence of the gitrefs of each input artifact related to this build step:
blob⎵c0f35b8ae567f5348df3711496fdc0ef6f634169\n
blob⎵c64efd8bd8bceca8c69f9b5b7647cf0ff61fed59\n
The gitref of the resulting document is 85322091b1d50a23d1c2a0f5933788a2a958f2ad, and this document is written out to disk in a directory in the build environment, e.g.:
./.bom/object/85/322091b1d50a23d1c2a0f5933788a2a958f2ad
The compiler would also embed this gitref in a new ELF section of the resulting hello.o binary; this adds a total of 89 bytes when accounting for ELF section formatting.
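The build step above can be sketched as follows. This is a rough illustration rather than a reference implementation: it assumes that "gitref" means the SHA-1 git blob object ID and that the ⎵ in the listing stands for a single space. The helper names (gitbom_document, object_path, write_gitbom) are hypothetical, and embedding the result in an ELF section is not shown.

```python
import hashlib
from pathlib import Path


def gitref(data: bytes) -> str:
    """Git blob object ID (SHA-1); ASSUMED to be what 'gitref' denotes."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def gitbom_document(input_refs) -> bytes:
    """One 'blob <gitref>' line per input, lexically ordered, newline-terminated."""
    return "".join(f"blob {ref}\n" for ref in sorted(input_refs)).encode("ascii")


def object_path(ref: str, root: str = ".bom/object") -> Path:
    """Fan out on the first two hex digits, as in the example path above."""
    return Path(root) / ref[:2] / ref[2:]


def write_gitbom(input_refs, root: str = ".bom/object") -> str:
    """Write the document for one build step and return its gitref."""
    doc = gitbom_document(input_refs)
    ref = gitref(doc)
    path = object_path(ref, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(doc)
    return ref


# Inputs to the hello.o build step, using the gitrefs quoted above.
hello_c = "c64efd8bd8bceca8c69f9b5b7647cf0ff61fed59"
stdio_h = "c0f35b8ae567f5348df3711496fdc0ef6f634169"
doc = gitbom_document([hello_c, stdio_h])
print(doc.decode("ascii"), end="")
```

Sorting the input gitrefs before writing them means the document, and therefore its own gitref, does not depend on the order in which the compiler happened to read its inputs.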
Imagine we have the following Dockerfile:
FROM <baseimage>:<release>
RUN <command to install package>
We calculate the hash of <baseimage>:<release>, which is: 000TODO.
Things get a little trickier when we go to calculate the hash of the next layer.
Also, we want to produce an artifact tree that can reference the gitbom of any artifacts added to that layer, not merely a hash of the whole layer. We'll do that by … TODO …
Combining these together, we produce the following GitBOM document:
blob_000TODO
blob_000TODO
… and embed the gitref of this gitbom in the image manifest's annotations field, like so:
{
  "schemaVersion": 2,
  "config": {...},
  "layers": [ {...}, {...} ],
  "annotations": {
    "gitbom": "sha256:abc123TODO"
  }
}
NOTE: The annotation type 'gitbom' is not yet standardized or accepted by OCI. In the above snippet, 'gitbom' is merely an example.
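As a small sketch of the embedding step, a build tool might add the annotation to an already-parsed manifest along these lines. The function name annotate_manifest is hypothetical, the 'gitbom' key is the same non-standard example used above, and the identifier passed in is a placeholder rather than a real hash.

```python
import copy
import json


def annotate_manifest(manifest: dict, gitbom_ref: str, key: str = "gitbom") -> dict:
    """Return a copy of `manifest` with the GitBOM identifier as an annotation.

    ASSUMPTION: `key` is an example annotation name, not an accepted OCI one.
    """
    annotated = copy.deepcopy(manifest)
    annotated.setdefault("annotations", {})[key] = gitbom_ref
    return annotated


# Skeleton manifest; config and layers are left empty for illustration.
manifest = {"schemaVersion": 2, "config": {}, "layers": []}
annotated = annotate_manifest(manifest, "sha256:<gitbom-id>")
print(json.dumps(annotated, indent=2))
```

Working on a copy leaves the original manifest untouched, which matters if its digest has already been computed or recorded elsewhere.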
TODO
TODO
TODO: Replace / reformat examples as a specification
TODO
Identify languages/compilers of initial interest:
What else?
Temporary Channel: Join the 'gitbom' channel on the OpenSSF Slack Instance.
Permanent Community location TBD
Many thanks to Ed Warnicke, who pitched this to me while I was stuck in Puget Sound traffic, and who has graciously accommodated my awkward schedule as we continued to discuss GitBOM while one or both of us were in a car.
Many thanks to everyone who added input and feedback to my "Landscape" document, though I now prefer the metaphor of a backpack: these are the tools one may choose to pack before embarking on a journey into the OSS SSC landscape. This reframing allowed me to identify a tool that was missing from my "supply chain backpack": the GitBOM.
Presentation & Discussion at CNCF STAG Supply Chain Working Group: https://youtu.be/FJRCKQAbhhY
Is SHA1 trustable for this purpose? Yes - https://badhomb.re/git/sha1/rant/2017/03/04/shattered.html