### Guardians of the Dataverse
Frederick Kautz
Director of R&D
TestifySec Inc.
---
## Introduction
- Co-Authored: NIST SP 800-204D, CNCF Cloud Native Security Whitepaper, The SPIFFE Book, etc...
- Extensive experience working with AI/ML, Linux containers, storage, networking, and security products.
- Architected large AI/ML projects designed to work in HIPAA environments
- Too much to write down here...
Note:
- Co-Chairing: CTA ANSI STANDARD on AI/ML Pipeline Security
- Steering Committee: SPIFFE, workload identity
---
## What is an AI Supply Chain?
Everything in your pipeline: hardware, source code, data, hyperparameters, dependencies, build processes, testing, packaging, deployment, etc...
---
## How do I secure "Everything"?
- Break it down into steps and reason about it across multiple levels:
  - Define what steps you need.
  - Guarantee the defined steps and prevent unauthorized ones.
  - Secure each step in isolation.
  - Secure the interactions between individual steps.
  - Define and secure information flow through the whole pipeline.
- Repeat
Note:
- steps can run in parallel
- data acquisition, normalization, labeling are steps
---
## AI Pipelines and Software Supply Chains
- Similarities with software supply chains: CI/CD pipelines, source control, dependencies.
- Added complexities:
  - Handling large volumes of data
  - Sensitive information in models
  - Data privacy concerns
---
## Complexity in AI Pipelines
Challenges in AI Pipelines:
- Managing and securing large datasets.
- Ensuring data privacy and model integrity.
- Example: Data leakage and model inversion attacks.
---
## Importance of Good Human Processes
Human Factors in Software Supply Chains:
- Role of good practices and policies.
- Training and awareness for developers and operators.
- Example: Incident response plans, regular security audits.
---
## Introduction to in-toto
What is in-toto?
- Framework for securing software supply chains.
- Ensures integrity and authenticity of software artifacts from development to deployment.
- Core concepts: Layouts, link metadata, artifact rules.
---
## How in-toto Secures Software Supply Chains
Mechanisms of in-toto:
- Creating a supply chain layout with authorized functionaries.
- Collecting and verifying metadata at each step.
- Example: Verifying the integrity of a software package before deployment.
---
<img src="https://hackmd.io/_uploads/ryMjRvJHC.png" alt="Description of the image" style="max-height: 60vh; width: auto; height: auto; border: none;">
---
## in-toto Industry Adoption
- SLSA (Supply Chain Levels for Software Artifacts): Providing end-to-end software supply chain security.
- NPM (Node Package Manager): Ensuring the security of packages distributed through the NPM registry.
---
## Extending in-toto for AI Pipelines
in-toto for AI Pipelines:
- Adapting in-toto to handle AI pipeline inputs and outputs.
- Securing data flow and model lifecycle.
- Example: Verifying data provenance and model updates.
---
```json
{
  "_type": "layout",
  "keys": {
    "dev_key": {
      "keyid": "f6e1dbb...5691",
      "keytype": "rsa",
      "scheme": "rsassa-pss-sha256",
      "keyval": {
        "public": "-----BEGIN PUBLIC KEY-----\nMIIBIj...==\n-----END PUBLIC KEY-----"
      }
    }
  },
  "steps": [
    {
      "name": "clone-repo",
      "expected_materials": [],
      "expected_products": [
        ["CREATE", "go-project"]
      ],
      "pubkeys": ["dev_key"],
      "expected_command": ["git", "clone", "https://github.com/example/go-project.git", "go-project"],
      "threshold": 1
    },
    {
      "name": "build",
      "expected_materials": [
        ["MATCH", "go-project/*", "WITH", "PRODUCTS", "FROM", "clone-repo"]
      ],
      "expected_products": [
        ["CREATE", "go-project/bin/go-project"]
      ],
      "pubkeys": ["dev_key"],
      "expected_command": ["go", "build", "-o", "go-project/bin/go-project", "./go-project/..."],
      "threshold": 1
    },
    {
      "name": "run-tests",
      "expected_materials": [
        ["MATCH", "go-project/*", "WITH", "PRODUCTS", "FROM", "clone-repo"],
        ["MATCH", "go-project/bin/go-project", "WITH", "PRODUCTS", "FROM", "build"]
      ],
      "expected_products": [
        ["ALLOW", "go-project/test-reports/*"]
      ],
      "pubkeys": ["dev_key"],
      "expected_command": ["go", "test", "./go-project/..."],
      "threshold": 1
    }
  ],
  "inspect": []
}
```
---
## Wait!
## Aren't datasets and models much larger than software artifacts?
## How do I attest a large dataset?
---
## Introduction to Terrapin
What is Terrapin?
- Tool for creating and verifying data attestations using SHA-256 hashes.
- Ensures data integrity and provenance.
- Example: Managing large datasets with chunk-based hashing.
---
## Terrapin for Data Provenance in AI Pipelines
Using Terrapin for Large Datasets:
- Efficiently verifying large data sets by handling data in chunks.
- Ensuring data integrity throughout the AI pipeline.
- Example: Verifying the integrity of training data and model outputs.
---
## Detailed Process
### 1. Splitting the Dataset
The dataset is split into 2MB chunks. Each chunk is hashed independently using sha256-gitoid.
```mermaid
graph TD
A[Dataset] --> B[Chunk 1: 2MB]
A[Dataset] --> C[Chunk 2: 2MB]
A[Dataset] --> D[...]
A[Dataset] --> E[Chunk N-1: 2MB]
A[Dataset] --> F[Chunk N: <=2MB]
```
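---
A minimal Go sketch of the chunking pass (illustrative only, not Terrapin's actual API; the chunk-size constant and function names are assumptions):
```go
package chunking

import (
	"io"
	"os"
)

const chunkSize = 2 * 1024 * 1024 // 2 MB (assumed exact value)

// readChunks streams the dataset and hands each chunk to fn;
// only the final chunk may be shorter than 2 MB.
func readChunks(path string, fn func(chunk []byte) error) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			if cbErr := fn(buf[:n]); cbErr != nil {
				return cbErr
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // reached end of dataset
		}
		if err != nil {
			return err
		}
	}
}
```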
---
### 2. Hashing Chunks
Each chunk is hashed using sha256-gitoid. The type "blob" and the size of the chunk are prepended to the chunk before hashing.
Example sha256-gitoid:
```
gitoid-sha256("hello world") = sha256("blob 11\0hello world")
```
```mermaid
graph TD
B[Chunk 1: 2MB] --> F[sha256-gitoid Hash 1]
C[Chunk 2: 2MB] --> G[sha256-gitoid Hash 2]
E[Chunk N: <=2MB] --> H[sha256-gitoid Hash N]
```
Design note: Gitoid-style hashing allows us to detect silent short writes to the hasher.
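---
A Go sketch of the sha256-gitoid computation described above (the helper name is illustrative, not Terrapin's API):
```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// gitoidSHA256 prepends the header "blob <size>\x00" to the chunk and
// hashes the result with SHA-256, as in the example above.
func gitoidSHA256(chunk []byte) [32]byte {
	h := sha256.New()
	fmt.Fprintf(h, "blob %d\x00", len(chunk))
	h.Write(chunk)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	// Reproduces the example above: sha256("blob 11\0hello world").
	fmt.Printf("%x\n", gitoidSHA256([]byte("hello world")))
}
```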
---
### 3. Storing Hashes
All hashes from the chunk-hashing step are stored in a file as raw bytes, each hash occupying exactly 32 bytes. Do not encode the bytes (no hex, base64, etc.).
```mermaid
graph TD;
F[sha256-gitoid Hash 1] --> I[Hash File]
G[sha256-gitoid Hash 2] --> I[Hash File]
H[...] --> I[Hash File]
J[sha256-gitoid Hash N-1] --> I[Hash File]
K[sha256-gitoid Hash N] --> I[Hash File]
```
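---
A sketch of the index-file write under the same assumptions (function name illustrative); the file is simply N × 32 bytes of raw digests:
```go
package hashindex

import "os"

// appendRawHash appends one 32-byte sha256-gitoid to the index file,
// unencoded: no hex, no base64.
func appendRawHash(indexPath string, digest [32]byte) error {
	f, err := os.OpenFile(indexPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(digest[:])
	return err
}
```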
---
### 4. Recursive Hashing
The same hashing process is applied recursively to the index of hashes until a single 32-byte sha256-gitoid is obtained.
```mermaid
graph TD;
I[1 Petabyte of data] --> J[16 GB of hashes]
J --> K[256 KB of hashes]
K --> L[Final 32 byte hash]
```
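---
A Go sketch of the recursion (in-memory for clarity; a real implementation would stream layer files from disk, and the helper names are illustrative):
```go
package rollup

import (
	"crypto/sha256"
	"fmt"
)

const chunkSize = 2 * 1024 * 1024 // 2 MB (assumed exact value)

// gitoidSHA256 prepends "blob <size>\x00" before hashing, as in step 2.
func gitoidSHA256(chunk []byte) []byte {
	h := sha256.New()
	fmt.Fprintf(h, "blob %d\x00", len(chunk))
	h.Write(chunk)
	return h.Sum(nil)
}

// hashLayer maps one layer of bytes to the next layer: the concatenation
// of one 32-byte hash per <=2 MB chunk.
func hashLayer(layer []byte) []byte {
	var next []byte
	for off := 0; off < len(layer); off += chunkSize {
		end := off + chunkSize
		if end > len(layer) {
			end = len(layer)
		}
		next = append(next, gitoidSHA256(layer[off:end])...)
	}
	return next
}

// rootHash applies hashLayer until a single 32-byte hash remains.
func rootHash(data []byte) []byte {
	layer := hashLayer(data)
	for len(layer) > 32 {
		layer = hashLayer(layer)
	}
	return layer
}
```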
---
Total layer sizes for 1PB example:
```
Layer 1: 16 GB
Layer 2: 256 KB
Layer 3: 32 B
```
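---
The layer sizes are pure arithmetic; a quick check in Go (using the decimal 2 MB chunk and 32-byte hash figures the round numbers above assume):
```go
package main

import "fmt"

func main() {
	const chunk = 2_000_000 // 2 MB, decimal, to match the figures above
	const hashLen = 32      // bytes per sha256-gitoid

	layer := int64(1_000_000_000_000_000) // 1 PB of data
	for i := 1; layer > hashLen; i++ {
		chunks := (layer + chunk - 1) / chunk // ceil(layer / chunk)
		layer = chunks * hashLen              // one hash per chunk
		fmt.Printf("Layer %d: %d bytes\n", i, layer)
	}
	// Prints 16000000000, 256000, and 32 bytes: 16 GB, 256 KB, 32 B.
}
```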
---
### 5. Efficient Validation
To validate a slice of the dataset, hashes are validated from the root down to the hash representing the target chunk.
```mermaid
graph TD;
N[Final sha256-gitoid] --> O[Intermediate Hash]
O --> P[...]
P --> Q[Target Chunk Hash]
```
---
E.g., with 1 PB of data, validating a 1 GB slice starting at offset 500 TB:
Chunks Needed:
```
Layer 1: Chunk range 250,000,000 - 250,000,499 (500 chunks, 16 KB of hashes)
Layer 2: Chunk 4,000 (1 chunk, 32 bytes)
Layer 3: One 32-byte hash
```
Data transferred:
```
Layer 1: 2 MB (our chunk size)
Layer 2: 256 KB (the total layer is less than our chunk size)
Layer 3: 32 B (Likely provided through a side channel, e.g. signed attestation)
```
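---
The same bookkeeping in Go (decimal 2 MB chunks, matching the numbers above; variable names are illustrative):
```go
package main

import "fmt"

func main() {
	const chunk = 2_000_000 // 2 MB data chunks (decimal)
	const hashLen = 32      // bytes per sha256-gitoid

	offset := int64(500_000_000_000_000) // slice starts 500 TB into the dataset
	length := int64(1_000_000_000)       // slice is 1 GB long

	// Layer 1: which dataset chunks (and so which layer-1 hashes) are needed.
	first := offset / chunk               // 250,000,000
	last := (offset + length - 1) / chunk // 250,000,499
	fmt.Println("layer 1 hashes:", last-first+1, "=", (last-first+1)*hashLen, "bytes") // 500 = 16,000 bytes

	// Layer 2: which 2 MB chunk of the layer-1 hash file contains those hashes.
	fmt.Println("layer 2 chunk:", first*hashLen/chunk, "-", last*hashLen/chunk) // 4,000 - 4,000
}
```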
---
## Integration of in-toto and Terrapin
- Comprehensive security for AI pipelines.
- Ensuring both process and data integrity.
- Example: End-to-end security from data ingestion to model deployment.
---
<img src="https://hackmd.io/_uploads/ryMjRvJHC.png" alt="Description of the image" style="max-height: 60vh; width: auto; height: auto; border: none;">
---
## Conclusion
- Importance of securing software supply chains and AI pipelines.
- Role of in-toto in ensuring process integrity.
- Role of Terrapin in ensuring data provenance and integrity.
- Call to action: Implement these tools and practices to enhance the security of your AI pipelines.
---
## Q&A
and
## References
- NIST SP 800-204D
- in-toto Project <https://in-toto.io/>
- Terrapin Go Project (more mature) <https://github.com/fkautz/terrapin-go>
- Terrapin Rust Project <https://github.com/fkautz/terrapin-rs>