---
status: proposed
title: Trusted Artifacts
creation-date: '2023-07-26'
last-updated: '2023-07-26'
authors:
  - '@afrittoli'
collaborators:
  - '@pritidesai'
---

# TEP-XXXX: Trusted Artifacts

---

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
  - [Use Cases](#use-cases)
  - [Requirements](#requirements)
- [Proposal](#proposal)
  - [Notes and Caveats](#notes-and-caveats)
- [Design Details](#design-details)
- [Design Evaluation](#design-evaluation)
  - [Reusability](#reusability)
  - [Simplicity](#simplicity)
  - [Flexibility](#flexibility)
  - [Conformance](#conformance)
  - [User Experience](#user-experience)
  - [Performance](#performance)
  - [Risks and Mitigations](#risks-and-mitigations)
  - [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Implementation Plan](#implementation-plan)
  - [Test Plan](#test-plan)
  - [Infrastructure Needed](#infrastructure-needed)
  - [Upgrade and Migration Strategy](#upgrade-and-migration-strategy)
- [Implementation Pull Requests](#implementation-pull-requests)
- [References](#references)
<!-- /toc -->

## Summary

The Tekton Data Interface working group has been active for about 10 months now; it has identified a number of different problems to solve and proposed a number of different solutions. The sheer number of issues discussed, and their sometimes conflicting requirements, means that only a small fraction of the proposed solutions has actually been implemented in Tekton.

This proposal is an attempt to take one of the problems identified, describe it in a way that is as self-contained as possible, and provide a simple solution to it. The solution proposed does not need to address all the requirements and constraints of adjacent problems, but it should at least not make it harder for them to be addressed in future.

## Motivation

The Tekton runtime model maps the execution of a `Task` (i.e. a `TaskRun`) to a Kubernetes `Pod`, and the execution of a `Pipeline` (i.e. a `PipelineRun`) to a collection of `Pods`. `Tasks` in a `Pipeline` may share data using the `Workspace` abstraction, which can be bound to a persistent volume (or `PV`) in Kubernetes. Because of the nature of `PVs`, a downstream `TaskRun` has no way of knowing whether the content of a `workspace` it receives as input has been tampered with.

Using existing Tekton capabilities, a producer and a consumer task could share artifacts in a workspace, like a file or a folder, as shown by this [demo `PipelineRun`](https://gist.github.com/afrittoli/3e7600eac3172a9f683f294610218635):

<details>
<summary>Demo Pipeline:</summary>

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: trusted-artifacts
spec:
  pipelineSpec:
    workspaces:
      - name: artifactStorage # In this example this is where we store artifacts
    tasks:
      - name: producer
        taskSpec:
          results:
            - name: aFileArtifact
              type: object
              description: An artifact file
              properties:
                path:
                  type: string
                hash:
                  type: string
                type:
                  type: string
            - name: aFolderArtifact
              type: object
              description: An artifact folder
              properties:
                path:
                  type: string
                hash:
                  type: string
                type:
                  type: string
          steps:
            - name: produce-file
              image: bash:latest
              script: |
                #!/usr/bin/env bash

                # Produce some content
                date +%s | tee "$(workspaces.artifactStorage.path)/afile.txt"
            - name: upload-hash-file
              image: bash:latest
              script: |
                #!/usr/bin/env bash

                # Uploads the file somewhere.
                # This is a noop in this case, as the file is passed through
                # the PVC directly. Note that this PVC could be backed
                # by different types of storage via CSI volumes, or we
                # could provide support for direct upload to OCI registries
                # or object storage.

                # Produces a result which makes the file trustable.
                # This step could be injected by the Tekton controller and be
                # transparent to users, except for some syntactic sugar, like
                # a special result kind or an "artifact" API.
                A_FILE_PATH=$(workspaces.artifactStorage.path)/afile.txt
                A_FILE_HASH=$(md5sum "${A_FILE_PATH}" | awk '{ print $1 }')
                cat <<EOF | tee $(results.aFileArtifact.path)
                {
                  "path": "${A_FILE_PATH}",
                  "hash": "${A_FILE_HASH}",
                  "type": "file"
                }
                EOF
            - name: produce-folder
              image: bash:latest
              script: |
                #!/usr/bin/env bash

                A_FOLDER_PATH=$(workspaces.artifactStorage.path)/afolder
                mkdir "$A_FOLDER_PATH"
                date +%s | tee "${A_FOLDER_PATH}/a.txt"
                date +%s | tee "${A_FOLDER_PATH}/b.txt"
                date +%s | tee "${A_FOLDER_PATH}/c.txt"
            - name: upload-hash-folder
              image: bash:latest
              script: |
                #!/usr/bin/env bash

                A_FOLDER_PATH=$(workspaces.artifactStorage.path)/afolder

                # Uploads the folder somewhere.
                # This is a noop in this case, as the folder is passed through.
                # Depending on the storage type, we could upload each file in the
                # folder or some compressed form of the folder.
                A_FOLDER_HASH=$(tar zcf - "$A_FOLDER_PATH" | md5sum | awk '{ print $1 }')
                cat <<EOF | tee $(results.aFolderArtifact.path)
                {
                  "path": "${A_FOLDER_PATH}",
                  "hash": "${A_FOLDER_HASH}",
                  "type": "folder"
                }
                EOF
      - name: consumer
        taskSpec:
          params:
            - name: aFileArtifact
              type: object
              properties:
                path:
                  type: string
                hash:
                  type: string
                type:
                  type: string
            - name: aFolderArtifact
              type: object
              properties:
                path:
                  type: string
                hash:
                  type: string
                type:
                  type: string
          steps:
            - name: download-verify-file
              image: bash:latest
              script: |
                #!/usr/bin/env bash
                set -e

                # Check the md5sum
                if [ "$(params.aFileArtifact.type)" == "file" ]; then
                  echo "$(params.aFileArtifact.hash)  $(params.aFileArtifact.path)" | md5sum -c
                else
                  tar zcf download.tgz $(params.aFileArtifact.path)
                  echo "$(params.aFileArtifact.hash)  download.tgz" | md5sum -c
                fi
            - name: download-verify-folder
              image: bash:latest
              script: |
                #!/usr/bin/env bash
                set -e

                # Check the md5sum
                if [ "$(params.aFolderArtifact.type)" == "file" ]; then
                  echo "$(params.aFolderArtifact.hash)  $(params.aFolderArtifact.path)" | md5sum -c
                else
                  tar zcf download.tgz $(params.aFolderArtifact.path)
                  echo "$(params.aFolderArtifact.hash)  download.tgz" | md5sum -c
                fi
            - name: consume-content
              image: bash:latest
              script: |
                #!/usr/bin/env bash

                # Do something with the verified content.
                # A workspace variable is needed here to trigger propagation of the workspace.
                find $(workspaces.artifactStorage.path) -type f
        params:
          - name: aFileArtifact
            value: $(tasks.producer.results.aFileArtifact)
          - name: aFolderArtifact
            value: $(tasks.producer.results.aFolderArtifact)
    workspaces:
      - name: artifactStorage
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 1Gi
```

</details>
<br/>
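Assuming the manifest above is saved as `trusted-artifacts.yaml` (a hypothetical filename), the demo can be run and observed as follows; the second command assumes the Tekton CLI (`tkn`) is installed:

```bash
# Create the PipelineRun; `create` (rather than `apply`) is needed because of generateName
kubectl create -f trusted-artifacts.yaml

# Follow the logs of the most recent PipelineRun
tkn pipelinerun logs --last -f
```

This produces a log similar to the one below.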
<details>
<summary>Example execution log:</summary>

```log
[producer : produce-file] 1690234279

[producer : upload-hash-file] {
[producer : upload-hash-file]   "path": "/workspace/artifactStorage/afile.txt",
[producer : upload-hash-file]   "hash": "77c5df93c80c4847891407f22c955527",
[producer : upload-hash-file]   "type": "file"
[producer : upload-hash-file] }

[producer : produce-folder] 1690234281
[producer : produce-folder] 1690234281
[producer : produce-folder] 1690234281

[producer : upload-hash-folder] tar: removing leading '/' from member names
[producer : upload-hash-folder] {
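For illustration, this is roughly how the artifact metadata from the demo surfaces in the producer's `TaskRun` status (a trimmed, hypothetical snippet; the values are taken from the execution log above):

```yaml
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: trusted-artifacts-producer # hypothetical generated name
status:
  results:
    - name: aFileArtifact
      type: object
      value:
        path: /workspace/artifactStorage/afile.txt
        hash: 77c5df93c80c4847891407f22c955527
        type: file
    - name: aFolderArtifact
      type: object
      value:
        path: /workspace/artifactStorage/afolder
        hash: ce344e90cd05a43e44db451dc9d91354
        type: folder
```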
[producer : upload-hash-folder]   "path": "/workspace/artifactStorage/afolder",
[producer : upload-hash-folder]   "hash": "ce344e90cd05a43e44db451dc9d91354",
[producer : upload-hash-folder]   "type": "folder"
[producer : upload-hash-folder] }

[consumer : download-verify-file] /workspace/artifactStorage/afile.txt: OK

[consumer : download-verify-folder] tar: removing leading '/' from member names
[consumer : download-verify-folder] download.tgz: OK

[consumer : consume-content] /workspace/artifactStorage/afile.txt
[consumer : consume-content] /workspace/artifactStorage/afolder/a.txt
[consumer : consume-content] /workspace/artifactStorage/afolder/b.txt
[consumer : consume-content] /workspace/artifactStorage/afolder/c.txt
```

</details>
<br/>

The example pipeline shows a few things:

- The solution for sharing content is generic: nothing in it is specific to the pipeline in question
- Even within this single pipeline, the same code is used more than once. Replicating this solution in a pipeline with multiple producers or consumers would lead to a lot of duplication
- The metadata (path, hash and type) required to trust an artifact is stored as a result in the `status` of the `TaskRun` (a trimmed example is sketched above). To complete the chain of trust, we need to be able to trust the `status` of the `TaskRun`, a feature that would be provided by the integration with SPIRE proposed in TEP-0089

This suggests that we may be able to use a combination of API sugar-coating and controller-injected steps to achieve the very same functionality, while providing an API that is very similar to the one users are already familiar with, but more powerful.

### Goals

- Contribute to the chain of trust by allowing consumer `Tasks` to trust artifacts on a `workspace` from producer `Tasks`, as long as results can be trusted.

### Non-Goals

- This proposal is restricted to artifacts in a `workspace`. There is, however, no reason that would prevent this mechanism from being extended to artifacts stored somewhere else. The same step injection mechanism could upload and download artifacts to and from other storage types (like OCI registries or object storage). This feature would be similar to what `PipelineResources` used to do, and it shall be designed in a separate TEP for the very reasons described in the [summary](#summary).
- This proposal does not discuss how to expose artifacts outside of a pipeline, even though it sets foundations that could be used to achieve that.
- This proposal does not discuss how to inject artifacts as inputs to a pipeline, even though it sets foundations that could be used to achieve that. For instance, one could use a workspace pre-provisioned with artifacts and use artifact-typed params as inputs for a pipeline, as sketched below.
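A minimal sketch of that last idea, assuming a hypothetical `consumer-pipeline` with artifact-typed params and a `PVC` named `preprovisioned-artifacts` whose content (and hashes) were produced out-of-band:

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: trusted-artifacts-input-
spec:
  pipelineRef:
    name: consumer-pipeline # hypothetical Pipeline with artifact-typed params
  params:
    - name: aFileArtifact
      value:
        path: /workspace/artifactStorage/afile.txt
        hash: 77c5df93c80c4847891407f22c955527 # computed by whoever provisioned the workspace
        type: file
  workspaces:
    - name: artifactStorage
      persistentVolumeClaim:
        claimName: preprovisioned-artifacts # hypothetical, populated before the run
```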
### Use Cases

- Extend the chain of trust across `Tasks` for provenance produced by Tekton Chains based on the `TaskRuns` and `PipelineRuns` executed by Tekton Pipeline

### Requirements

- TBD

## Proposal

A thorough proposal is not available yet; a rough approximation involves the following:

- extend parameter and result types with a new type `artifact`, an object type with a fixed schema
- extend the `Pod` logic in the controller to inject hashing and checking steps when required

The same pipeline from the first demo, rewritten as it could look once this proposal is implemented, is shown by this [demo pipeline](https://gist.github.com/afrittoli/7236be5fca524b752c221d2346497bb7):

<details>
<summary>Demo Pipeline:</summary>

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: trusted-artifacts-sugar
spec:
  pipelineSpec:
    workspaces:
      - name: artifactStorage # In this example this is where we store artifacts
        artifacts: true # this will result in failed validation if the workspace is bound to a read-only backend, like a secret
    tasks:
      - name: producer
        taskSpec:
          results:
            - name: aFileArtifact
              type: artifact # inbuilt object schema (path, hash, type)
              description: An artifact file
            - name: aFolderArtifact
              type: artifact # inbuilt object schema (path, hash, type)
              description: An artifact folder
          steps:
            - name: produce-file
              image: bash:latest
              script: |
                #!/usr/bin/env bash

                # Produce some content. The result "data.path" will resolve to the
                # workspace marked for artifacts.
                date +%s | tee "$(results.aFileArtifact.data.path)/afile.txt"

                # The controller appends a step that builds the object result JSON
                # and stores it under $(results.aFileArtifact.path).
                # The type is detected from the content of $(results.aFileArtifact.data.path):
                # a single file gives type "file"; one or more files and folders give type "folder".
                # The hash is calculated and added into the JSON.
            - name: produce-folder
              image: bash:latest
              script: |
                #!/usr/bin/env bash

                A_FOLDER_PATH=$(results.aFolderArtifact.data.path)/afolder
                mkdir "$A_FOLDER_PATH"
                date +%s | tee "${A_FOLDER_PATH}/a.txt"
                date +%s | tee "${A_FOLDER_PATH}/b.txt"
                date +%s | tee "${A_FOLDER_PATH}/c.txt"
      - name: consumer
        taskSpec:
          params:
            - name: aFileArtifact
              type: artifact # inbuilt object schema (path, hash, type)
            - name: aFolderArtifact
              type: artifact # inbuilt object schema (path, hash, type)
          steps:
            - name: consume-content
              image: bash:latest
              script: |
                #!/usr/bin/env bash

                # A step is prepended, which will automatically check the hashes
                # and fail the task with a specific reason if there is no match.
                # This behaviour could be enabled via some Pipeline/PipelineRun flag.

                # Do something with the verified content.
                # The path from the object params corresponds to the result's "data.path"
                # and resolves to a path on the workspace.
                echo "File content"
                cat $(params.aFileArtifact.path)
                echo "Folder content"
                find $(params.aFolderArtifact.path) -type f
        params:
          - name: aFileArtifact
            value: $(tasks.producer.results.aFileArtifact)
          - name: aFolderArtifact
            value: $(tasks.producer.results.aFolderArtifact)
    workspaces:
      - name: artifactStorage
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 1Gi
```

</details>
<br/>
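The injected steps referenced in the comments above are not designed yet. As a rough sketch, reusing the `md5`-based hashing from the first demo, the step appended to the producer and the step prepended to the consumer could look roughly like this (step names, image and the exact hashing scheme are illustrative only):

```yaml
# Hypothetical step appended by the controller to the producer Task
- name: artifact-hash-afileartifact
  image: bash:latest
  script: |
    #!/usr/bin/env bash
    set -e
    # Resolved by the controller from the artifact's "data.path"
    DATA_PATH="$(workspaces.artifactStorage.path)/afile.txt"
    if [ -f "${DATA_PATH}" ]; then
      TYPE=file
      HASH=$(md5sum "${DATA_PATH}" | awk '{ print $1 }')
    else
      TYPE=folder
      HASH=$(tar zcf - "${DATA_PATH}" | md5sum | awk '{ print $1 }')
    fi
    # Build the object result JSON and store it under the result path
    printf '{"path": "%s", "hash": "%s", "type": "%s"}' \
      "${DATA_PATH}" "${HASH}" "${TYPE}" | tee "$(results.aFileArtifact.path)"

# Hypothetical step prepended by the controller to the consumer Task
- name: artifact-verify-afileartifact
  image: bash:latest
  script: |
    #!/usr/bin/env bash
    set -e
    # A hash mismatch makes md5sum exit non-zero, failing the TaskRun
    if [ "$(params.aFileArtifact.type)" == "file" ]; then
      echo "$(params.aFileArtifact.hash)  $(params.aFileArtifact.path)" | md5sum -c
    else
      tar zcf /tmp/verify.tgz "$(params.aFileArtifact.path)"
      echo "$(params.aFileArtifact.hash)  /tmp/verify.tgz" | md5sum -c
    fi
```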
### Notes and Caveats

TBD

## Design Details

TBD

## Design Evaluation

### Reusability

Adopting trusted artifacts would require users to make changes to their `Tasks` and `Pipelines`, albeit minimal ones.

### Simplicity

The proposed functionality relies as much as possible on existing Tekton features, and it uses a syntax that users are already familiar with, extending it consistently.

### Flexibility

The proposed functionality relies on workspaces and `PVCs`; however, it could easily be extended to support additional types of storage.

In terms of flexibility of adoption in pipelines, no assumptions are made about the `Tasks` and `Pipelines` where this is used.

The artifact schema could be extended in future, or it could support custom fields specified by users, in the same way they do today for object parameters and results, to allow users to attach additional metadata to their artifacts.

### Conformance

TBD

### User Experience

The API surface change is minimal and consistent with the API that users are familiar with today.

### Performance

Injected steps would impact the execution of `TaskRuns` and `PipelineRuns`; however, the impact should be minimal:

- a single producer step and a single consumer step can handle multiple artifacts, avoiding the overhead of one container per artifact
- steps shall be injected only where needed
- the ability to use `workspaces` means that no extra data I/O is required, apart from that needed to tar/untar folders for hashing purposes

### Risks and Mitigations

N/A

### Drawbacks

N/A

## Alternatives

We could document the demo pipeline and let users apply that approach explicitly in their pipelines.

## Implementation Plan

TBD

### Test Plan

TBD

### Infrastructure Needed

TBD

### Upgrade and Migration Strategy

TBD

## Implementation Pull Requests

TBD

## References

TBD