
Standardized Metadata

This is an initiative within the CDF SIG Interoperability.

Volunteers

  • Tracy Miranda, CDF
  • Steve Taylor, DeployHub
  • James Strachan, CloudBees
  • Fatih Degirmenci, Ericsson Software Technology

Introduction

The tools and technologies that are used to construct CI/CD pipelines and the pipelines themselves produce lots of data. Organizations collect, store, and use this data in various ways to get value out of it.

One of the challenges around metadata is that there is no standardized way for CI/CD tools and technologies to describe what metadata they produce and consume. Beyond the question of what to produce and consume, how such metadata could be produced and consumed is yet another topic to think about. One aspect of this is already being looked at within CDF - the Events in CI/CD Workstream.

It's true that there are initiatives across various open source projects to align the way this is done, such as in-toto, Kubernetes, Jenkins X, and Tekton. However, this topic requires a holistic approach: identify the needs and challenges first, then analyse existing efforts to explore possibilities to streamline how this is done, either by building on existing efforts or by combining them to come up with a standard way collaboratively.

Scope

The scope of this work could perhaps best be described by using an imaginary pipeline at a relatively high level.

  • Commit: Commit data on SCM systems such as SHA, commit message
  • Build & Packaging: Artifact related data produced by build systems and artifact repositories such as artifact checksum, version
  • Test: Test results produced by test tools and frameworks such as pass/fail rate, logs
  • Security and Vulnerability Scanning: Analysis results generated by security and vulnerability tools such as existing issues and level of criticality
  • Promotion: The data used to decide whether to promote a certain version to the next stage, e.g. staging or production
  • Deployment: The data produced upon deployment of the artifact

As one can notice, the pipeline itself and the tools and technologies employed by different stages within the pipeline generate a huge amount of data, such as:

  • commit SHA
  • commit message
  • artifact checksum
  • artifact version
  • test pass/fail rate
  • test logs
  • list of vulnerabilities
  • confidence level
  • promotion metadata
  • release metadata
  • deployment status

As highlighted above, this data could be used in different ways, for both real-time and historical purposes. Some examples of this are

  • real-time use: taking decisions based on generated metadata in order to determine whether the next stages within the pipeline should proceed or not (see the sketch after this list)
  • historical use: generating a bill of materials for a specific version of the deliverable
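As an illustration of the real-time case, here is a minimal sketch of a promotion gate. The scan-results.json file and its field names are assumptions made up for illustration; they are not part of any agreed format.

#!/bin/sh
# Hypothetical gate: count critical findings in a scan report and
# stop the pipeline if any are present.
CRITICALS=$(jq '[.vulnerabilities[] | select(.severity == "critical")] | length' scan-results.json)
if [ "$CRITICALS" -gt 0 ]; then
  echo "Blocking promotion: $CRITICALS critical vulnerabilities found"
  exit 1
fi
echo "No critical vulnerabilities; promotion may proceed"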

Based on the example above, here are some of the questions that could help set the scope:

  • types of metadata to consider
  • the means to produce/consume/transport the metadata
  • how metadata is used
  • analysis of existing efforts in order not to reinvent the wheel

Existing Efforts

Jenkins X Releases

Jenkins X includes a Release CRD which contains the metadata for an individual release.

It includes details of where in git the release came from, the version, SHA, changelog, and release notes, along with the commits, issues, and PRs included.

This CRD is then included in the generated helm chart so it can be shipped with the actual application. Longer term, we could look at integrating this into the helm chart metadata too.

e.g.

kubectl get release
NAME             NAME       VERSION   GIT URL
nodey545-1.0.1   nodey545   v1.0.1    https://github.com/jstrachan/nodey545

Here's an example Release resource:

apiVersion: jenkins.io/v1
kind: Release
metadata:
  name: nodey545-1.0.1
spec:
  commits:
  - author:
      email: jenkins-x@googlegroups.com
      name: jenkins-x-bot
    branch: master
    committer:
      email: jenkins-x@googlegroups.com
      name: jenkins-x-bot
    message: |
      chore: release 1.0.1
    sha: 5e09f11287f68843759ee3b6c31b88c4f95745b6
  - author:
      email: jenkins-x@googlegroups.com
      name: jenkins-x-bot
    branch: master
    committer:
      email: jenkins-x@googlegroups.com
      name: jenkins-x-bot
    message: |
      chore: add variables
    sha: 5a5926c40cd1482cab4d12136c3b8c4852057769
  - author:
      email: james.strachan@gmail.com
      name: James Strachan
    branch: master
    committer:
      email: james.strachan@gmail.com
      name: James Strachan
    message: |
      chore: Jenkins X build pack
    sha: c7730637f441f9045297c1e20dccf8842e2d14b1
  gitHttpUrl: https://github.com/jstrachan/nodey545
  gitOwner: jstrachan
  gitRepository: nodey545
  name: nodey545
  releaseNotesURL: https://github.com/jstrachan/nodey545/releases/tag/v1.0.1
  version: v1.0.1

SPDX

TBD

Definitions

Git Commit SHA

The Git commit SHA is derived by hashing the commit object (the root tree, the parent commits, the author, the committer, and the commit message) with the SHA-1 algorithm to create the 40-character SHA string. SHA-1 has known security vulnerabilities. The SHA-256 algorithm is available as an experimental option as of Git 2.29; it fixes the security issues and will become the new standard.
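To see what is actually hashed, here is a minimal sketch, assuming a git repository in the current directory:

# Show the raw commit object (tree, parents, author, committer, message)
# that the SHA is computed from
git cat-file commit HEAD

# Recompute the commit SHA from that raw content; the output matches
# git rev-parse HEAD
git cat-file commit HEAD | git hash-object -t commit --stdin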

SHA Formats

The long SHA is a 40-character hexadecimal string. The string may be prefaced by sha:. For example:

sha:948687c82a19eb0fbaa832c65a753c34da519c58 or 948687c82a19eb0fbaa832c65a753c34da519c58

It appears that the git client will handle the SHA-1 to SHA-256 translation so that the git server does not have to change. This approach would enable easy migration from SHA-1 to SHA-256.

The short SHA is represented as the first 7 characters of the long version, for example 948687c. The short SHA does not provide the same uniqueness as the long SHA, in that two commits can have the same short SHA but different long SHA strings.
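Both forms can be obtained from the git client; a minimal sketch:

# Long SHA of the current commit
git rev-parse HEAD

# Short SHA; git may print more than 7 characters if needed for uniqueness
git rev-parse --short HEAD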

Persistence

The long SHA should be persisted instead of the short SHA, due to the lack of uniqueness in the short SHA described above.

Artifacts

Artifacts represent an object that needs to be tracked or deployed. There are five types of artifacts: file, database, container image, end point, and hardware. Artifacts reside in a repository or registry, for example a Maven repository for jar files or a Docker registry for container images.

File Artifact

These artifacts are typically the result of a compile/link process, such as a Java compile producing a jar file. After the file has been created it is uploaded to a repository using a version tag. Plain, non-compiled files can also be artifacts, for example html, gif, svg, and css files. These artifacts would reside in a git repository for versioning.

Specific versions of File Artifacts are retrieved from the repository and acted upon, such as deploying the File Artifact or security scanning it.

Checksum, MD5, SHA, and size values can be calculated for the File Artifact.
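A minimal sketch of calculating these values, assuming GNU coreutils and a hypothetical jar file name:

# Size in bytes, MD5, and SHA-256 values for a file artifact
stat --format='%s bytes' myapp-1.0.1.jar
md5sum myapp-1.0.1.jar
sha256sum myapp-1.0.1.jar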

Attributes

  • Name
  • Path
  • Size
  • Create Time
  • Modified Time
  • Checksum
  • MD5
  • SHA
  • Repository
  • Version/Tag/Commit

Database Artifact

These artifacts have the same characteristics as the File Artifact but have two parts: the SQL to roll a database object forward (e.g. add a column to a table) and the SQL to roll it back (e.g. drop the column from the table). Database Artifacts are used to move the state of a database incrementally forward or backward. Following the version chain of the artifact enables the incremental sequence to be determined and applied to the database.
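A minimal sketch of such a two-part artifact, created here as a pair of SQL files; the file names, table, and column are hypothetical:

# Roll-forward script: move the database state one step forward
cat > 042_add_email.rollforward.sql <<'SQL'
ALTER TABLE users ADD COLUMN email VARCHAR(255);
SQL

# Roll-back script: undo the same step
cat > 042_add_email.rollback.sql <<'SQL'
ALTER TABLE users DROP COLUMN email;
SQL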

Attributes

  • Roll Forward Name
  • Path
  • Size
  • Create Time
  • Modified Time
  • Checksum
  • MD5
  • SHA
  • Repository
  • Version/Tag/Commit

  • Roll Back Name
  • Path
  • Size
  • Create Time
  • Modified Time
  • Checksum
  • MD5
  • SHA
  • Repository
  • Version/Tag/Commit

Container Image Artifact

These artifacts are the result of a container image build, which creates a container image made up of multiple layers. The container image is then pushed to a container image registry. Container images can be tagged, but the tags are mutable; the image digest SHA is the immutable reference to the image. The image digest SHA can only be calculated once the image is pushed to the registry. Retrieval of a specific version of the image can be done using the tag or the digest, but because tags are mutable, retrieving by tag may return different versions of the image over time. In order to guarantee a specific version of the image, the digest SHA should be used.
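A minimal sketch of recording and using the digest; the registry, organization, image name, and tag are hypothetical:

# Build and push; the digest only exists once the image is in the registry
docker build -t registry.example.com/myorg/myapp:1.0.1 .
docker push registry.example.com/myorg/myapp:1.0.1

# Read back the immutable digest reference for the pushed image
docker inspect --format='{{index .RepoDigests 0}}' registry.example.com/myorg/myapp:1.0.1

# Pull by digest rather than by mutable tag to guarantee the exact image
docker pull registry.example.com/myorg/myapp@sha256:<digest-from-above>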

Attributes

  • Registry
  • Organization
  • Name
  • Tag
  • Digest
  • Create Time

End Point Artifact

These artifacts are typically a File Artifact that describes an end point, i.e. a server, virtual machine, EC2 instance, or Kubernetes cluster. Example definition files would be Terraform or AWS CloudFormation json/yaml files. These End Point Artifacts should reside in a repository for versioning. They can be treated like the File Artifact, except when acted upon to provision an End Point.
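A minimal sketch of acting on a versioned End Point Artifact with Terraform; the repository URL and tag are hypothetical:

# Check out a specific version of the definition files
git clone https://github.com/example/infra.git && cd infra
git checkout v1.4.0

# Provision the end point described by the checked-out definitions
terraform init
terraform plan -out=tfplan
terraform apply tfplan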

Attributes

  • Name
  • Path
  • Size
  • Create Time
  • Modified Time
  • Checksum
  • MD5
  • SHA
  • Repository
  • Version/Tag/Commit

Hardware Artifact

These artifacts represent a hardware stack, e.g. processor, network card, make, model, speed, firmware, etc. The information in the Hardware Artifact would be persisted in a File Artifact and thus in a repository. The Hardware Artifact is subject to drift, since there is a disconnect between physical changes made and the updating of the file representation of the Hardware Artifact. Hardware Artifacts are therefore mostly used for reporting purposes rather than being acted upon.
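A minimal sketch of capturing such a snapshot on Linux; the output path and host name are hypothetical:

# Snapshot the hardware stack as JSON and version it like any File Artifact
sudo lshw -json > hardware/buildhost-01.json
git add hardware/buildhost-01.json
git commit -m 'chore: refresh hardware snapshot for buildhost-01'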

Attributes

  • Name
  • Path
  • Size
  • Create Time
  • Modified Time
  • Checksum
  • MD5
  • SHA
  • Repository
  • Version/Tag/Commit