### Guardians of the Dataverse

Frederick Kautz
Director of R&D
TestifySec Inc.

---

## Introduction

- Co-authored: NIST SP 800-204D, CNCF Cloud Native Security Whitepaper, The SPIFFE Book, etc.
- Extensive experience working with AI/ML, Linux containers, storage, networking, and security products.
- Architected large AI/ML projects designed to operate in HIPAA environments.
- Too much to write down here...

Note:

- Co-chairing: CTA ANSI standard on AI/ML pipeline security
- Steering committee: SPIFFE, workload identity

---

## What is an AI Supply Chain?

Everything in your pipeline: hardware, source code, data, hyperparameters, dependencies, build processes, testing, packaging, deployment, etc.

---

## How do I secure "Everything?"

Break it down into steps and reason about it across multiple levels:

- Define the steps you need.
- Define and guarantee authorized steps; prevent unauthorized ones.
- Secure each step in isolation.
- Secure the interactions between individual steps.
- Define and secure the information flow through the whole pipeline.
- Repeat.

Note:

- Steps can run in parallel.
- Data acquisition, normalization, and labeling are steps.

---

## AI Pipelines and Software Supply Chains

- Similarities with software supply chains: CI/CD pipelines, source control, dependencies.
- Added complexities:
  - Handling large volumes of data,
  - sensitive information in models,
  - and data privacy concerns.

---

## Complexity in AI Pipelines

Challenges in AI pipelines:

- Managing and securing large datasets.
- Ensuring data privacy and model integrity.
- Example: data leakage and model inversion attacks.

---

## Importance of Good Human Processes

Human factors in software supply chains:

- The role of good practices and policies.
- Training and awareness for developers and operators.
- Example: incident response plans, regular security audits.

---

## Introduction to in-toto

What is in-toto?

- A framework for securing software supply chains.
- Ensures the integrity and authenticity of software artifacts from development to deployment.
- Core concepts: layouts, link metadata, artifact rules.

---

## How in-toto Secures Software Supply Chains

Mechanisms of in-toto:

- Creating a supply chain layout with authorized functionaries.
- Collecting and verifying metadata at each step.
- Example: verifying the integrity of a software package before deployment.

---

<img src="https://hackmd.io/_uploads/ryMjRvJHC.png" alt="Description of the image" style="max-height: 60vh; width: auto; height: auto; border: none;">

---

## in-toto

Industry adoption:

- SLSA (Supply-chain Levels for Software Artifacts): providing end-to-end software supply chain security.
- NPM (Node Package Manager): ensuring the security of packages distributed through the NPM registry.

---

## Extending in-toto for AI Pipelines

in-toto for AI pipelines:

- Adapting in-toto to handle AI pipeline inputs and outputs.
- Securing data flow and the model lifecycle.
- Example: verifying data provenance and model updates (see the sketch on the next slide).
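---

A minimal sketch (in Go) of the link-metadata idea applied to an AI training step: hash what the step consumed and what it produced, so a verifier can later check those hashes against the layout's artifact rules. The `StepLink` type, step name, and file paths are illustrative assumptions, not the in-toto-golang API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// StepLink is an illustrative stand-in for in-toto link metadata:
// the artifacts a step consumed (materials) and produced (products).
type StepLink struct {
	Name      string            `json:"name"`
	Materials map[string]string `json:"materials"` // path -> sha256 (hex)
	Products  map[string]string `json:"products"`  // path -> sha256 (hex)
}

// hashFiles records the sha256 of each artifact, as a layout's
// artifact rules (MATCH, CREATE, ...) would later check them.
func hashFiles(paths []string) (map[string]string, error) {
	out := make(map[string]string, len(paths))
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			return nil, err
		}
		sum := sha256.Sum256(data)
		out[p] = hex.EncodeToString(sum[:])
	}
	return out, nil
}

func main() {
	// Hypothetical artifacts for a "train-model" step.
	materials, err := hashFiles([]string{"data/train.parquet", "config/hyperparams.yaml"})
	if err != nil {
		log.Fatal(err)
	}
	products, err := hashFiles([]string{"models/model.safetensors"})
	if err != nil {
		log.Fatal(err)
	}

	link := StepLink{Name: "train-model", Materials: materials, Products: products}
	out, err := json.MarshalIndent(link, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out)) // in practice, signed by the step's functionary key
}
```

The real in-toto tooling additionally signs this metadata and records byproducts such as stdout, stderr, and return codes.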
---

An example in-toto layout for a simple Go project:

```json
{
  "_type": "layout",
  "keys": {
    "dev_key": {
      "keyid": "f6e1dbb...5691",
      "keytype": "rsa",
      "scheme": "rsassa-pss-sha256",
      "keyval": {
        "public": "-----BEGIN PUBLIC KEY-----\nMIIBIj...==\n-----END PUBLIC KEY-----"
      }
    }
  },
  "steps": [
    {
      "name": "clone-repo",
      "expected_materials": [],
      "expected_products": [
        ["CREATE", "go-project"]
      ],
      "pubkeys": ["dev_key"],
      "expected_command": ["git", "clone", "https://github.com/example/go-project.git", "go-project"],
      "threshold": 1
    },
    {
      "name": "build",
      "expected_materials": [
        ["MATCH", "go-project/*", "WITH", "PRODUCTS", "FROM", "clone-repo"]
      ],
      "expected_products": [
        ["CREATE", "go-project/bin/go-project"]
      ],
      "pubkeys": ["dev_key"],
      "expected_command": ["go", "build", "-o", "go-project/bin/go-project", "./go-project/..."],
      "threshold": 1
    },
    {
      "name": "run-tests",
      "expected_materials": [
        ["MATCH", "go-project/*", "WITH", "PRODUCTS", "FROM", "clone-repo"],
        ["MATCH", "go-project/bin/go-project", "WITH", "PRODUCTS", "FROM", "build"]
      ],
      "expected_products": [
        ["ALLOW", "go-project/test-reports/*"]
      ],
      "pubkeys": ["dev_key"],
      "expected_command": ["go", "test", "./go-project/..."],
      "threshold": 1
    }
  ],
  "inspect": []
}
```

---

## Wait!

## Aren't datasets and models much larger than software artifacts?

## How do I attest a large dataset?

---

## Introduction to Terrapin

What is Terrapin?

- A tool for creating and verifying data attestations using SHA-256 hashes.
- Ensures data integrity and provenance.
- Example: managing large datasets with chunk-based hashing.

---

## Terrapin for Data Provenance in AI Pipelines

Using Terrapin for large datasets:

- Efficiently verifying large datasets by handling data in chunks.
- Ensuring data integrity throughout the AI pipeline.
- Example: verifying the integrity of training data and model outputs.

---

## Detailed Process

### 1. Splitting the Dataset

The dataset is split into 2 MB chunks. Each chunk is hashed independently using sha256-gitoid.

```mermaid
graph TD
    A[Dataset] --> B[Chunk 1: 2MB]
    A[Dataset] --> C[Chunk 2: 2MB]
    A[Dataset] --> D[...]
    A[Dataset] --> E[Chunk N-1: 2MB]
    A[Dataset] --> F[Chunk N: <=2MB]
```

---

### 2. Hashing Chunks

Each chunk is hashed using sha256-gitoid: the type "blob" and the size of the chunk are prepended to the chunk before hashing.

Example sha256-gitoid:

```
gitoid-sha256("hello world") = sha256("blob 11\0hello world")
```

```mermaid
graph TD
    B[Chunk 1: 2MB] --> F[sha256-gitoid Hash 1]
    C[Chunk 2: 2MB] --> G[sha256-gitoid Hash 2]
    E[Chunk N: <=2MB] --> H[sha256-gitoid Hash N]
```

Design note: gitoid-style hashing allows us to detect silent short writes to the hasher.

---

### 3. Storing Hashes

All hashes from the first phase are stored in a file as raw bytes, each hash occupying 32 bytes. Do not encode the bytes (e.g. no hex, base64, etc.).

```mermaid
graph TD;
    F[sha256-gitoid Hash 1] --> I[Hash File]
    G[sha256-gitoid Hash 2] --> I[Hash File]
    H[...] --> I[Hash File]
    J[sha256-gitoid Hash N-1] --> I[Hash File]
    K[sha256-gitoid Hash N] --> I[Hash File]
```

---

### 4. Recursive Hashing

The same hashing process is applied recursively to the index of hashes until a single 32-byte sha256-gitoid is obtained.

```mermaid
graph TD;
    I[1 Petabyte of data] --> J[16 GB of hashes]
    J --> K[256 KB of hashes]
    K --> L[Final 32 byte hash]
```

---

Total layer sizes for the 1 PB example:

```
Layer 1: 16 GB
Layer 2: 256 KB
Layer 3: 32 B
```

A code sketch of this chunk-and-layer hashing follows.
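---

A minimal sketch of steps 1-4, assuming 2 MB chunks and the gitoid convention shown above. Function names are illustrative, not the terrapin-go API; a real implementation streams chunks from storage rather than holding the dataset in memory.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// chunkSize is the 2 MB chunk size assumed in the slides.
const chunkSize = 2 * 1000 * 1000

// gitoidSHA256 hashes a chunk as sha256("blob <size>\0" + data),
// matching the gitoid example above.
func gitoidSHA256(chunk []byte) [32]byte {
	h := sha256.New()
	fmt.Fprintf(h, "blob %d\x00", len(chunk))
	h.Write(chunk)
	var sum [32]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

// hashLayer splits data into chunks and returns the concatenated
// raw 32-byte chunk hashes (the input to the next layer).
func hashLayer(data []byte) []byte {
	var layer []byte
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		sum := gitoidSHA256(data[off:end])
		layer = append(layer, sum[:]...)
	}
	return layer
}

// rootHash applies hashLayer recursively until one 32-byte hash remains.
func rootHash(data []byte) [32]byte {
	layer := hashLayer(data)
	for len(layer) > 32 {
		layer = hashLayer(layer)
	}
	var root [32]byte
	copy(root[:], layer)
	return root
}

func main() {
	dataset := []byte("hello world") // stand-in for a large dataset
	fmt.Printf("root: %x\n", rootHash(dataset))
}
```

Validation reuses `gitoidSHA256`: recompute the hash of a fetched chunk and compare it with the corresponding 32 bytes in the layer above.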
---

### 5. Efficient Validation

To validate a slice of the dataset, hashes are validated from the root down to the hash representing the target chunk.

```mermaid
graph TD;
    N[Final sha256-gitoid] --> O[Intermediate Hash]
    O --> P[...]
    P --> Q[Target Chunk Hash]
```

---

E.g., with 1 PB of data, if we want to validate 1 GB of data starting at the 500 TB offset:

Chunks needed:

```
Layer 1: Chunk Range: 250,000,000 - 250,000,499 (500 chunks, 16 KB of hashes)
Layer 2: Chunk: 4,000 - 4,000 (1 chunk, 32 bytes)
Layer 3: One 32 byte hash
```

Data transferred:

```
Layer 1: 2 MB (our chunk size)
Layer 2: 256 KB (the total layer is less than our chunk size)
Layer 3: 32 B (likely provided through a side channel, e.g. a signed attestation)
```

---

## Integration of in-toto and Terrapin

- Comprehensive security for AI pipelines.
- Ensuring both process and data integrity.
- Example: end-to-end security from data ingestion to model deployment.

---

<img src="https://hackmd.io/_uploads/ryMjRvJHC.png" alt="Description of the image" style="max-height: 60vh; width: auto; height: auto; border: none;">

---

## Conclusion

- Importance of securing software supply chains and AI pipelines.
- Role of in-toto in ensuring process integrity.
- Role of Terrapin in ensuring data provenance and integrity.
- Call to action: implement these tools and practices to enhance the security of your AI pipelines.

---

## Q&A

## References

- NIST SP 800-204D
- in-toto project: <https://in-toto.io/>
- Terrapin Go project (more mature): <https://github.com/fkautz/terrapin-go>
- Terrapin Rust project: <https://github.com/fkautz/terrapin-rs>
{"title":"CloudNativeSecurityCon NA 2024","description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"bd92da0f-9fd2-4adb-9c69-8f4696031d62\",\"add\":4007,\"del\":3992}]"}
    197 views