# Git is the (S)BOM

**NOTE: Much of this thinking has evolved into a simpler form [here](https://hackmd.io/aZ7czCDvRl2atAxhtYecrA?view)**

A number of SBOM approaches already exist. This proposal does not seek to replace any of them, and it is agnostic among them. It focuses on tracking the tree of artifacts inherent to an SBOM through the entire chain, and on allowing those artifacts to be associated with metadata. From its perspective, any or all of the other SBOM approaches may be treated as metadata.

# SBOM is a tree

A Software Bill of Materials (SBOM) is fundamentally a tree of artifacts and associated metadata.

## Examples:

### C/C++

![Simple Binary](https://i.imgur.com/xMafCVO.png)

![Binary with Static Library](https://i.imgur.com/dddNNBS.png)

![Binary with Dynamic Library](https://i.imgur.com/LAH4ecC.png)

### Go

![Go Binary](https://i.imgur.com/S6AnWOF.png)

### Java

![Java](https://i.imgur.com/r26fuvx.png)

### Python

![Python](https://i.imgur.com/kb1cLBi.png)

### Generic artifact tree

![Generic artifact tree](https://i.imgur.com/5sxlT1k.png)

# Learning from Git

Git specifies a simple, generalizable 'object' format consisting of ```${object type} ${size of data in bytes written in decimal characters}${nul character '\0'}${data}```

Git allows us to assign an **identifier** to any object by taking the SHA-1 sum of the object in this format.

This proposal is to specify the fundamental objects in the BOM in git object format.

## Every artifact is a 'blob'

Git defines its most fundamental object as a 'blob', which is simply an array of bytes. It is most commonly used to represent the files in a git repo. Any file that is stored in a git repo is stored as a 'blob' object. No matter which repo it is in, and no matter where in the tree of that repo it is located, a file with the same contents is the same blob and has the same identifier.

## Every artifact has a BOM

A BOM is a git object of type 'bom', and thus every BOM has an identifier.
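As a minimal sketch of how such identifiers can be computed, for 'blob' and 'bom' objects alike, the following Python serializes data in git object format and hashes it. The helper names `git_object` and `identifier` are hypothetical, and rendering the identifier as lowercase hex is an assumption borrowed from git's own convention rather than something this proposal specifies.

```python
import hashlib


def git_object(obj_type: str, data: bytes) -> bytes:
    """Serialize data as '<type> <size>' + NUL byte + data (git object format)."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return header + data


def identifier(obj_type: str, data: bytes) -> str:
    """SHA-1 of the serialized object, as lowercase hex (assumed rendering)."""
    return hashlib.sha1(git_object(obj_type, data)).hexdigest()


# The same bytes always produce the same blob identifier, regardless of
# which repo or path they came from. The empty blob is a well-known case.
print(identifier("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# The object type is part of the hashed header, so a 'bom' object and a
# 'blob' object with identical data get different identifiers.
print(identifier("bom", b"") != identifier("blob", b""))  # True
```

For a file on disk, `identifier("blob", open(path, "rb").read())` should match what `git hash-object` reports for the same file.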
The data in a BOM consists of a series of lines separated by a newline character ('\n'):

```
blob ${identifier1} bom ${identifier2}
```

Where ```${identifier1}``` is the identifier of a child artifact of the subject of the BOM and ```${identifier2}``` is the identifier of the BOM for that child artifact.

If the blob is a leaf in an artifact tree (i.e., it has no children), then ```bom``` is omitted.

If the BOM is embedded in the artifact itself (in an ELF section, ar entry, Java class file, tar file, etc.), it will naturally omit the ```blob```.

## Metadata about a BOM

Metadata about a BOM should be expressed in terms of a new git object type, and should reference other git objects (including BOMs) by their identifiers.

## Object Archive

An Object Archive is a gzipped concatenation of git objects.

It is recommended that the BOM for an artifact be collocated in the same Object Archive with the BOMs of its children, and with any relevant metadata about its BOM or its children's BOMs.

## Location Metadata

'location' shall be a new git object type of the format:

```
${type} ${identifier} ${format} ${url}
```

A location specifies the URL containing the referenced git object. The resource specified by the URL may contain other git objects. For BOMs, the format will most commonly be 'objectarchive', but other formats are possible.
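The 'bom', 'location', and Object Archive formats above can be sketched together in Python. This is an illustration under stated assumptions, not a reference implementation: the helper names are hypothetical, identifiers are assumed to be lowercase hex, BOM lines are assumed to be joined without a trailing newline, and the URL is a placeholder.

```python
import gzip
import hashlib


def git_object(obj_type: str, data: bytes) -> bytes:
    """Serialize data as '<type> <size>' + NUL byte + data."""
    return f"{obj_type} {len(data)}".encode("ascii") + b"\0" + data


def identifier(obj_type: str, data: bytes) -> str:
    return hashlib.sha1(git_object(obj_type, data)).hexdigest()


def bom_data(children) -> bytes:
    """Build BOM data from (blob_id, child_bom_id or None) pairs."""
    lines = []
    for blob_id, child_bom_id in children:
        line = f"blob {blob_id}"
        if child_bom_id is not None:  # leaves omit the 'bom' part
            line += f" bom {child_bom_id}"
        lines.append(line)
    return "\n".join(lines).encode("ascii")


def location_data(obj_type: str, obj_id: str, fmt: str, url: str) -> bytes:
    return f"{obj_type} {obj_id} {fmt} {url}".encode("utf-8")


def object_archive(objects) -> bytes:
    """Gzip the concatenation of serialized git objects."""
    # mtime=0 keeps the gzip header reproducible (a choice made here).
    return gzip.compress(b"".join(objects), mtime=0)


# Leaf artifacts: a header and a source file (no children, so no BOM).
header_id = identifier("blob", b"#define ANSWER 42\n")
source_id = identifier("blob", b'#include "answer.h"\nint main(void) { return ANSWER; }\n')

# The object file is built from both leaves; stand-in bytes for its contents.
object_file_id = identifier("blob", b"stand-in bytes for main.o")
object_file_bom = bom_data([(header_id, None), (source_id, None)])
object_file_bom_id = identifier("bom", object_file_bom)

# The binary has one child (the object file), which has its own BOM.
binary_bom = bom_data([(object_file_id, object_file_bom_id)])
binary_bom_id = identifier("bom", binary_bom)

# Placeholder URL; 'objectarchive' is the format named in the text above.
location = location_data("bom", binary_bom_id, "objectarchive",
                         "https://example.com/boms/app.objectarchive")

archive = object_archive([
    git_object("bom", binary_bom),
    git_object("bom", object_file_bom),
    git_object("location", location),
])
print(binary_bom.decode("ascii"))
```

Placing the binary's BOM, its child's BOM, and the location object in one archive follows the collocation recommendation in the Object Archive section.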
## Possible Additional Metadata types

This section contains some ideas on possible optional metadata that might be included in an Object Archive with BOM(s).

### License

Object type 'license'

format: ```${type} ${identifier} ${SPDX License Tag}```

### Copyright

Object type 'copyright'

format:

```
${type} ${identifier} ${one or more Copyright lines}
```

# Embedding BOMs in artifacts

### ELF Files

Embed an Object Archive containing the 'bom' and 'location' git objects in an ELF section named '.bom'.

### ar Files

Embed an Object Archive containing the 'bom' and 'location' git objects in an archive entry named '.bom'.

### General Archive files (tar, gzip, etc.)

Embed an Object Archive containing the 'bom' and 'location' git objects in an archive entry named '.bom'.

### Java class file

Embed an Object Archive containing the 'bom' and 'location' git objects in an annotation in the .class file.

### Python .pyc files

Embed an Object Archive containing the 'bom' and 'location' git objects in a ```__bom__``` in the .pyc file.

# Toolchain integration

## Compiler Integration

Compilers could be augmented relatively simply to include the .bom in their output (ELF files, ar archives, class files). Building BOMs into the compile step greatly increases the reliability of the system.

### Special consideration for languages with #include

C/C++ compilers such as gcc and llvm preprocess C files to 'include' their .h files. As part of this they [routinely garner metadata about those files](https://gcc.gnu.org/onlinedocs/cpp/Preprocessor-Output.html). The preprocessors and compilers can be augmented relatively simply to automatically create the .bom entries and include them in the ELF files and ar files they generate.

# Benefits of Approach

## Git

Git is ubiquitous and has an enormous amount of support across many tools.
Additionally, most code in the modern era is stored in git, so referencing source code artifacts by their git identifier makes it easy to correlate 'leaf' artifacts with their actual source code.

## Separation of artifact tree from metadata

The 'bom' git object itself intentionally contains no metadata. This is because an artifact is defined by the child artifacts that went into its creation. An executable is not different because it has different contacts associated with it. A source file is not different because it resides in a different git repo.

Any needed metadata can be constructed as a git object type, referencing the 'bom' by its identifier, and embedded in the Object Archive with the BOM.

## Make SBOMs part of the build

Making SBOMs just another part of the build ensures their ubiquity.

## Extensibility of Metadata

The fundamental entities of the BOM are simply the 'blob', 'bom', and 'location' object types. All other metadata can be standardized separately, extensibly, and independently.

## Compatibility with other BOM approaches

There are a number of existing approaches that produce SBOMs. Any or all of them could be treated as metadata of their own type and added to the Object Archive either directly or via a 'location' entry.

## Format flexibility

While the canonical format upon which identifiers are computed is always that of a git object, more human-readable non-canonical formats that can be converted to a git object may be used.
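To illustrate the idea, the sketch below converts a purely hypothetical JSON rendering of a BOM into the canonical 'bom' object data and computes its identifier. The JSON shape, the function names, and the placeholder identifiers are all assumptions made for this example; the proposal does not define any particular human-readable format.

```python
import hashlib
import json


def identifier(obj_type: str, data: bytes) -> str:
    """SHA-1 over '<type> <size>' + NUL byte + data, as lowercase hex."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def to_canonical(readable_json: str) -> bytes:
    """Convert the hypothetical JSON form into canonical 'bom' data."""
    lines = []
    for entry in json.loads(readable_json):
        line = f"blob {entry['blob']}"
        if "bom" in entry:  # leaves have no 'bom' key
            line += f" bom {entry['bom']}"
        lines.append(line)
    return "\n".join(lines).encode("ascii")


# A non-canonical, human-readable rendering (hypothetical shape).
readable = json.dumps([
    {"blob": "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"},
    {"blob": "1" * 40, "bom": "2" * 40},  # placeholder identifiers
], indent=2)

canonical = to_canonical(readable)
print(canonical.decode("ascii"))
# The identifier is always computed over the canonical form, never the JSON.
print(identifier("bom", canonical))
```

Whatever readable form is chosen, two renderings of the same BOM must convert to identical canonical bytes, or they will not share an identifier.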