###### tags: `Archived`
# Standardized Metadata
This is an initiative within the [CDF SIG Interoperability](https://github.com/cdfoundation/sig-interoperability).
## Quick links
* [Volunteers](#Volunteers)
* [Introduction](#Introduction)
* [Scope](#Scope)
* [Existing Efforts](#Existing-Efforts)
  * [Jenkins X Releases](#Jenkins-X-Releases)
  * [SPDX](#SPDX)
* [Definitions](#Definitions)
  * [Git Commit SHA](#Git-Commit-SHA)
  * [Artifacts](#Artifacts)
# Volunteers
* Tracy Miranda, CDF
* Steve Taylor, DeployHub
* James Strachan, CloudBees
* Fatih Degirmenci, Ericsson Software Technology
# Introduction
The tools and technologies that are used to construct CI/CD pipelines and the pipelines themselves produce lots of data. Organizations collect, store, and use this data in various ways to get value out of it.
One of the challenges around metadata is that there is no standardized way for CI/CD tools and technologies to describe what metadata they produce and consume. Beyond the question of what to produce and consume, how such metadata could be produced and consumed is yet another topic to think about. One aspect of that is already being looked at within CDF - [Events in CI/CD Workstream](https://github.com/cdfoundation/sig-interoperability/tree/master/workstreams/events_in_cicd).
There are initiatives across various open source projects, such as in-toto, Kubernetes, Jenkins X, and Tekton, to align the way this is done. However, this topic requires a holistic approach: identify the needs and challenges first, then analyse the existing efforts and explore possibilities to streamline how this is done, either by building on one of those efforts or by combining them to arrive at a standard collaboratively.
# Scope
The scope of this work can perhaps be best described using an imaginary pipeline at a relatively high level.
* Commit: Commit data from SCM systems, such as the SHA and commit message
* Build & Packaging: Artifact-related data produced by build systems and artifact repositories, such as the artifact checksum and version
* Test: Test results produced by test tools and frameworks, such as pass/fail rates and logs
* Security and Vulnerability Scanning: Analysis results generated by security and vulnerability tools, such as existing issues and their level of criticality
* Promotion: The data used to decide whether to promote a certain version to the next stage, e.g. staging or production
* Deployment: The data produced upon deployment of the artifact
As one can see, the pipeline itself and the tools and technologies employed by the different stages within the pipeline generate a huge amount of data; a consolidated view of this data is sketched after the list below.
* commit SHA
* commit message
* artifact checksum
* artifact version
* test pass/fail rate
* test logs
* list of vulnerabilities
* confidence level
* promotion metadata
* release metadata
* deployment status
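The following is a minimal, hypothetical sketch of what a consolidated metadata document for a single pipeline run could look like. All field names and values here are illustrative assumptions for discussion, not an agreed schema.
```yaml
# Hypothetical consolidated pipeline-run metadata (all field names are assumptions)
pipelineRun: example-pipeline-run-42
commit:
  sha: 948687c82a19eb0fbaa832c65a753c34da519c58
  message: "fix: handle empty input"
artifact:
  version: 1.0.1
  checksum: sha256:2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
test:
  passRate: 0.98
  logsUrl: https://ci.example.com/logs/42
security:
  vulnerabilities: []
  confidence: high
promotion:
  approvedFor: staging
deployment:
  status: succeeded
```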
As highlighted above, this data can be used in different ways, for both real-time and historical purposes.
Some examples of this are:
* real-time use: taking decisions based on the generated metadata, e.g. to determine whether the next stages within the pipeline should proceed or not
* historical use: generating a bill of materials for a specific version of the deliverable
Based on the examples above, here are some of the topics that could help set the scope:
* types of metadata to consider
* the means to produce/consume/transport the metadata
  * events are one way to transport the metadata; this is currently being worked on in the [Events in CI/CD Workstream](https://github.com/cdfoundation/sig-interoperability/tree/master/workstreams/events_in_cicd)
* how metadata is used
* analysis of existing efforts in order not to reinvent the wheel
# Existing Efforts
* eBay Metadata
* Ortelius Hermetic Manifest/Helm Chart
* [Tekton Chains](https://github.com/tektoncd/chains)
* [in-toto](https://in-toto.io/specs/)
* Jenkins X Metadata Example
* Input from Events in CI/CD workstream
* [SPDX Package Specification](https://spdx.github.io/spdx-spec/3-package-information/)
* [SPDX Relationship Specification](https://spdx.github.io/spdx-spec/7-relationships-between-SPDX-elements/)
* [Eiffel Protocol](https://github.com/eiffel-community/eiffel) with a vocabulary for metadata
* [The SIG Interoperability Vocabulary initiative](https://github.com/cdfoundation/sig-interoperability/blob/master/docs/vocabulary.md)
## Jenkins X Releases
[Jenkins X](https://jenkins-x.io/v3/about/) includes a [Release](https://github.com/jenkins-x/jx-api/blob/master/pkg/apis/jenkins.io/v1/types_release.go#L16) CRD which includes the metadata for an individual release.
It includes details of where in git the release came from, along with the version, SHA, changelog, and release notes, together with the commits, issues, and PRs included.
This CRD is then included in the helm chart that's generated so it can be shipped with the actual application. Longer term we could perhaps look at integrating this into the helm chart metadata too.
e.g.
```bash
kubectl get release
NAME             NAME       VERSION   GIT URL
nodey545-1.0.1   nodey545   v1.0.1    https://github.com/jstrachan/nodey545
```
Here's an example Release resource:
```yaml
apiVersion: jenkins.io/v1
kind: Release
metadata:
  name: nodey545-1.0.1
spec:
  commits:
  - author:
      email: jenkins-x@googlegroups.com
      name: jenkins-x-bot
    branch: master
    committer:
      email: jenkins-x@googlegroups.com
      name: jenkins-x-bot
    message: |
      chore: release 1.0.1
    sha: 5e09f11287f68843759ee3b6c31b88c4f95745b6
  - author:
      email: jenkins-x@googlegroups.com
      name: jenkins-x-bot
    branch: master
    committer:
      email: jenkins-x@googlegroups.com
      name: jenkins-x-bot
    message: |
      chore: add variables
    sha: 5a5926c40cd1482cab4d12136c3b8c4852057769
  - author:
      email: james.strachan@gmail.com
      name: James Strachan
    branch: master
    committer:
      email: james.strachan@gmail.com
      name: James Strachan
    message: |
      chore: Jenkins X build pack
    sha: c7730637f441f9045297c1e20dccf8842e2d14b1
  gitHttpUrl: https://github.com/jstrachan/nodey545
  gitOwner: jstrachan
  gitRepository: nodey545
  name: nodey545
  releaseNotesURL: https://github.com/jstrachan/nodey545/releases/tag/v1.0.1
  version: v1.0.1
```
## SPDX
TBD
# Definitions
## Git Commit SHA
The Git commit SHA is derived by passing the contents of the commit object (the tree, parent commits, author, committer, and commit message) through a hash algorithm to create a 40-character SHA string. Git uses the SHA-1 algorithm by default, and SHA-1 has known security vulnerabilities. The SHA-256 algorithm is available as of Git 2.29 as an [experimental option](https://www.infoq.com/news/2020/10/git-2-29-sha-256/); SHA-256 fixes the security issues and will become the new standard.
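As a quick illustration (a minimal sketch assuming a SHA-1 repository and a shell with GNU `sha1sum` available), the commit SHA can be recomputed by hand from the raw commit object:
```bash
# Full and short SHAs of the current commit
git rev-parse HEAD
git rev-parse --short HEAD

# Recompute the full SHA: Git hashes the header "commit <size>\0"
# followed by the raw commit object (tree, parents, author, committer, message)
size=$(git cat-file -s "$(git rev-parse HEAD)")
{ printf 'commit %s\0' "$size"; git cat-file commit HEAD; } | sha1sum
```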
### SHA Formats
The long SHA is a 40-character hexadecimal string. The string may be prefaced by `sha:`.
For example:
`sha:948687c82a19eb0fbaa832c65a753c34da519c58`
or
`948687c82a19eb0fbaa832c65a753c34da519c58`
It appears that the git client will handle the SHA-1 to SHA-256 translation, so the git server does not have to change. This approach should enable easy migration from SHA-1 to SHA-256.
The short SHA is the first 7 characters of the long version, for example `948687c`. The short SHA does not provide the same uniqueness as the long SHA: two commits can have the same short SHA but different long SHA strings.
### Persistence
The long SHA should be persisted instead of the short SHA, due to the short SHA's lack of uniqueness described above.
## Artifacts
Artifacts represent objects that need to be tracked or deployed. There are five types of artifacts: file, database, container image, end point, and hardware. Artifacts reside in a repository or registry, for example a Maven repository for jar files or a Docker registry for container images.
### File Artifact
These artifacts are typically the result of a compile/link process, such as a Java compile producing a jar file. After the file has been created it is uploaded to a repository with a version tag. Plain, non-compiled files can also be artifacts, for example HTML, GIF, SVG, and CSS files; these artifacts would reside in a git repository for versioning.
Specific versions of File Artifacts are retrieved from the repository and acted upon, such as deploying the File Artifact or security scanning it.
Checksum, MD5, SHA, and size values can be calculated for the File Artifact.
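For example (a minimal sketch; the file name is an assumption, and `stat -c` is the GNU coreutils form):
```bash
f=app-1.0.1.jar          # hypothetical File Artifact
stat -c '%s' "$f"        # size in bytes
md5sum "$f"              # MD5 checksum
sha256sum "$f"           # SHA-256 checksum
```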
#### Attributes
- Name
- Path
- Size
- Create Time
- Modified Time
- Checksum
- MD5
- SHA
- Repository
- Version/Tag/Commit
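Put together, a metadata record for a File Artifact using the attributes above might look like the sketch below; every value is an illustrative assumption.
```yaml
# Hypothetical File Artifact metadata record (all values are assumptions)
name: app-1.0.1.jar
path: com/example/app/1.0.1/app-1.0.1.jar
size: 1048576
createTime: 2021-03-01T10:15:00Z
modifiedTime: 2021-03-01T10:15:00Z
checksum: 3610a686
md5: 9e107d9d372bb6826bd81d3542a419d6
sha: sha256:2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
repository: https://repo.example.com/maven2
version: 1.0.1
```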
### Database Artifact
These artifacts have the same characteristics as the File Artifact but have two parts: the SQL to roll a database object forward (e.g. add a column to a table) and the SQL to roll it back (e.g. drop the column from the table). Database Artifacts are used to move the state of a database incrementally forward or backward. Following the version chain of the artifact enables the incremental sequence to be determined and applied to the database.
#### Roll Forward Attributes
- Name
- Path
- Size
- Create Time
- Modified Time
- Checksum
- MD5
- SHA
- Repository
- Version/Tag/Commit
#### Roll Back Attributes
- Name
- Path
- Size
- Create Time
- Modified Time
- Checksum
- MD5
- SHA
- Repository
- Version/Tag/Commit
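As a hedged sketch, a Database Artifact metadata record could pair the two parts as shown below; all names and values are illustrative assumptions.
```yaml
# Hypothetical Database Artifact record (names and values are assumptions)
rollForward:
  name: 0042_add_email_column.sql
  path: migrations/0042_add_email_column.sql
  sha: <sha-256 of the roll-forward file>
  repository: https://git.example.com/schemas
  version: "0042"
rollBack:
  name: 0042_drop_email_column.sql
  path: migrations/0042_drop_email_column.sql
  sha: <sha-256 of the roll-back file>
  repository: https://git.example.com/schemas
  version: "0042"
```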
### Container Image Artifact
These artifacts are the result of a container image build, which creates a container image made up of multiple layers. The container image is then pushed to a container image registry. Container images can be tagged, but tags are mutable; the image digest SHA is the immutable reference to the image, and it can only be calculated once the image is pushed to the registry. A specific version of the image can be retrieved using either the tag or the digest, but because tags are mutable, retrieving by tag can return different versions of the image over time. To guarantee a specific version of the image, the digest SHA should be used, as illustrated below.
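For example (a sketch assuming the Docker CLI and the public `nginx` image; any image reference works the same way):
```bash
# Pull by (mutable) tag, then look up the immutable repository digest
docker pull nginx:1.21
docker inspect --format '{{index .RepoDigests 0}}' nginx:1.21
# Prints something like: nginx@sha256:<digest>

# Pull by digest to guarantee exactly this image version
docker pull nginx@sha256:<digest-from-above>
```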
#### Attributes
- Registry
- Organization
- Name
- Tag
- Digest
- Create Time
### End Point Artifact
These artifacts are typically a File Artifact that describes an End Point, e.g. a server, virtual machine, EC2 instance, or Kubernetes cluster. Example definition files would be Terraform or AWS CloudFormation json/yaml files. These End Point Artifacts should reside in a repository for versioning. They can be treated like the File Artifact, except when acted upon to provision an End Point.
#### Attributes
- Name
- Path
- Size
- Create Time
- Modified Time
- Checksum
- MD5
- SHA
- Repository
- Version/Tag/Commit
### Hardware Artifact
These artifacts represent a hardware stack, e.g. processor, network card, make, model, speed, firmware, etc. The information in the Hardware Artifact would be persisted in a File Artifact and thus in a repository. The Hardware Artifact is subject to drift, since there is a disconnect between physical changes being made and the file representation of the Hardware Artifact being updated. Hardware Artifacts are mainly used for reporting rather than being acted upon.
#### Attributes
- Name
- Path
- Size
- Create Time
- Modified Time
- Checksum
- MD5
- SHA
- Repository
- Version/Tag/Commit