# PDF Parsing Feasibility Study in Zero-Knowledge Proofs

Parsing PDFs in zero-knowledge can be helpful when you need to prove data-rich statements about a PDF, such as verifying its digital signature or confirming whether specific content is present.

## Problem with DigiLocker PDFs

> Note: The PDF’s QR code uses symmetric encryption, while the PDF itself is usually signed with asymmetric encryption (often RSA with SHA-256).

One of the main challenges in verifying DigiLocker PDFs and similar documents is that the embedded QR code uses symmetric encryption, which prevents us from verifying it within the PDF. However, because the PDF itself is signed with asymmetric encryption (often RSA with SHA-256), we can validate its digital signature. We have also observed that many of these documents use the same master key for signing.

Currently, the only check we can perform is **whether the PDF has been tampered with**, based on that signature. If we can build a PDF content parser, we can prove statements about the entire PDF content, because all of that content is covered by the signature.

A potential solution is to implement a full PDF content parser in a zero-knowledge DSL. For example, to generate a proof about a specific piece of information, you could use an inspector to define a regex pattern that runs over the raw PDF bytes.

## PDF Parsing

![upload_97ef34e8ae3f93344342607645cb7ec5](https://hackmd.io/_uploads/S1mK_oKckx.png)

PDF files follow a hierarchical data structure, somewhat similar to ASN.1. Here are a few points to consider:

- **PDF Datatypes**: PDFs consist of objects, streams, dictionaries, arrays, and so on, each nested within or referencing one another.
- **Data Volume**: Even small PDFs (10–100 KB) contain multiple objects that need careful parsing.
- **Goal**:
  - Parse the information according to PDF object types (e.g., catalog, pages, annotations).
  - Verify the PDF’s digital signature (if present) using a zero-knowledge circuit.
  - Finally, prove statements about the PDF contents without revealing unnecessary information.

### The Challenge of PDF Parsing

Parsing a PDF is difficult because it contains many different structures, and text content is often compressed. In a zero-knowledge circuit, we need to:

1. **Uncompress (FlateDecode) the data.**
2. **Hash large amounts of data.**
3. **Verify the RSA signature.**
4. **Perform regex or string matching** to extract specific text.

Example DigiLocker PDF file analysis:

| **Property** | **Value** | **Description** |
|---|---|---|
| **PDF Size** | 160.0 KiB | The total size of the PDF file. |
| **Total Array Buffer Size** | 161,422 bytes | The entire byte array when the PDF is read into memory. |
| **Signed Content ByteRange** | `[0 82897 94641 7890]` | Defines which parts of the PDF are covered by the signature. |
| **Segment 1** | `0` to `82,897` | ~82.9 KB of data. |
| **Segment 2** | `94,641` to `94,641 + 7,890` | ~7.9 KB of data. |
| **Hashing** | 2 passes (one per segment) | Both segments are fed into the hash to compute the final message digest. |

### Some Context: ASN.1 and Signature Parsing

Typically, digital signatures in PDFs are **ASN.1-encoded**, which is why we developed an [ASN.1 parser in Circom](https://github.com/zkemail/asn1-parser-circom) to prove statements about ASN.1-encoded content.
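To make this concrete, the sketch below walks a DER-encoded blob’s tag–length–value structure off-circuit. It is a minimal Python reference, assuming single-byte tags and definite lengths only; it is not the Circom parser linked above, just an illustration of the traversal such a parser has to constrain.

```python
# Minimal DER (ASN.1) tag-length-value walker -- off-circuit reference only.
# Assumes single-byte tags and definite-length encodings, which covers the
# SEQUENCE/SET/OCTET STRING nesting found in typical signature blobs.

def read_tlv(data: bytes, offset: int):
    """Return (tag, value_start, value_length, next_offset) for one DER element."""
    tag = data[offset]
    length = data[offset + 1]
    header = 2
    if length & 0x80:  # long form: low bits give the number of length bytes
        num_len_bytes = length & 0x7F
        length = int.from_bytes(data[offset + 2 : offset + 2 + num_len_bytes], "big")
        header = 2 + num_len_bytes
    value_start = offset + header
    return tag, value_start, length, value_start + length

def walk(data: bytes, offset: int = 0, end=None, depth: int = 0):
    """Recursively print the DER tree; constructed types (bit 0x20) contain children."""
    end = len(data) if end is None else end
    while offset < end:
        tag, value_start, length, next_offset = read_tlv(data, offset)
        print("  " * depth + f"tag=0x{tag:02X} len={length}")
        if tag & 0x20:
            walk(data, value_start, value_start + length, depth + 1)
        offset = next_offset
```

Calling `walk()` on a signature blob prints the nested structure that an in-circuit parser would have to traverse with fixed bounds instead of recursion.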
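The `/ByteRange` mechanics from the table above can also be reproduced off-circuit. The sketch below locates the ByteRange entry, concatenates the two signed segments, and computes the SHA-256 digest that the RSA signature is expected to cover. The file name is an assumption, and a real verifier would then compare this digest against the value carried in the signature’s ASN.1 structure before checking the RSA signature itself.

```python
# Off-circuit sketch: recompute the digest over the signed ByteRange segments.
# "digilocker.pdf" is a placeholder file name for the example document.
import re
import hashlib

pdf = open("digilocker.pdf", "rb").read()

# /ByteRange [0 82897 94641 7890] -> two signed segments around the signature hole
m = re.search(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]", pdf)
start1, len1, start2, len2 = map(int, m.groups())

# The digest covers the concatenation of both segments (the gap holds the signature itself).
signed = pdf[start1 : start1 + len1] + pdf[start2 : start2 + len2]
digest = hashlib.sha256(signed).hexdigest()
print(f"segments: {len1} + {len2} bytes, sha256 = {digest}")
```

Inside a circuit, the same two segments would be absorbed by the SHA-256 gadget and the resulting digest checked against the RSA signature.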
### Key Learnings

1. **Nested Parsing**
   Parsing involves multiple nested `if-else` branches to correctly decode the data.
2. **Circom Constraints**
   - We cannot have dynamically sized outputs in Circom, so we set a maximum size for outputs.
   - Our current logic is limited by Circom’s constraints.
   - Parsing ASN.1 arrays into Circom outputs without using unconstrained operators (like `<--`) requires significant extra work or an alternative approach.

---

Since Noir addresses many of these limitations, it would be beneficial to have an ASN.1 parser in Noir as well. By combining **ASN.1 signature parsing** with **PDF parsing**, we can create proofs that certain PDF data is valid while keeping other parts private. This approach enables **zero-knowledge** verification of statements about a digitally signed PDF.

### Current Solution

1. **Extract Signature and Digest (ASN.1 Parsing)**
   - **Signed PDF Content**: Identify the exact byte range in the PDF that is covered by the signature (e.g., via `/ByteRange`).
   - **Signature Data**: Retrieve the public key, signature bytes, and expected digest from the ASN.1 structure.
   - **Digest Check**: Verify that the PDF content (within the specified byte range) produces a digest matching the one in the signature structure.
   - **Algorithm Verification**: Use the stated algorithm (e.g., RSA + SHA-256) to confirm the signature is valid.

2. **Optional Partial PDF Parsing**
   Instead of parsing the entire PDF, you can target only the parts you care about (for example, a specific object identifier).
   - Fully inflating (uncompressing) PDF streams in-circuit can be very expensive. If possible, handle decompression outside the circuit and only bring the necessary data into it.

3. **Text Extraction or Matching**
   Once the signature is verified and you trust the PDF’s content, you can run regex or string matching on the relevant sections.
   - **Reuse the Hashed Byte Arrays**: Use the same PDF segments you hashed for the signature check.
   - **Text Extraction**: Search for specific tokens or patterns in those byte arrays to prove certain fields exist.

## In-Depth Exploration Areas

### 1. **Object Identification**

- Can we parse the PDF content in detail, and do we have a way to identify each object?

```shell
27 0 obj
  embedded CID TrueType font program (ArialMT)
  ╰─ 26 0
<<
  /Length1 18340
  /Length 29 0 R
  /Filter /FlateDecode
```

In this example, there isn’t a specific, easily recognizable object identifier. We can extract the text content through FlateDecode and potentially isolate the data we need based on its length and position. However, determining the best approach may require additional research.

### 2. **Text Compression**

- Should we decompress (inflate) the text data before hashing, or is it better to hash the compressed data as it appears in the PDF?

### 3. **Zlib Compression**

- How feasible is it to build an in-circuit (zero-knowledge) decompression method for zlib? (An off-circuit reference sketch follows area 6 below.)

### 4. **Constraint Analysis**

- How many constraints might these operations require, and how large can the circuit become? Key factors include:
  - RSA-SHA256 verification
  - Digest hashing for large segments
  - Decompressing PDF content
  - Regex or string matching

### 5. **Building Client Proofs**

- Can we generate these proofs client-side using Noir?

### 6. **Data Size**

- How big can the PDF be while still remaining practical for zero-knowledge proofs?
- What is the best input format for feeding this data into the circuit?
- Is Noir suitable for handling this?
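As a reference point for areas 1–3, the sketch below inflates `FlateDecode` streams outside the circuit and runs a simple pattern match over the decompressed bytes. The file name and the token `Aadhaar` are placeholders, and real page content streams may additionally need PDF text-operator decoding before plain string matching is meaningful.

```python
# Off-circuit sketch: inflate FlateDecode streams and look for a target token.
# "digilocker.pdf" and the search token below are placeholders, not fixed values.
import re
import zlib

pdf = open("digilocker.pdf", "rb").read()

# Grab the raw bytes between "stream" and "endstream" for each FlateDecode object.
for m in re.finditer(rb"/FlateDecode.*?stream\r?\n(.*?)\r?\nendstream", pdf, re.DOTALL):
    try:
        data = zlib.decompress(m.group(1))
    except zlib.error:
        continue  # skip streams with predictors, crypt filters, or other encodings
    if re.search(rb"Aadhaar", data):  # hypothetical token we want to prove is present
        print("token found in a decompressed stream")
```

Whether a circuit can trust the output of off-circuit decompression, or must re-derive it in-circuit, is exactly the trade-off raised in areas 2 and 3.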
### 7. **Section-Based Verification**

- Can we parse text in specific parts of the PDF and verify the signature hash just for those sections?

### 8. **Whole-PDF Hash**

- How practical is it to compute and verify a hash of the entire PDF within a zero-knowledge circuit?
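As a rough data point for areas 4 and 8, the sketch below counts SHA-256 compression rounds using the figures from the example file. It says nothing about per-round constraint cost, which depends on the proving system, but it shows how much extra hashing a whole-PDF digest implies compared with the signed ByteRange segments alone.

```python
# Back-of-envelope sizing for the hashing step, based on the example file's numbers.

def sha256_blocks(n_bytes: int) -> int:
    # SHA-256 processes 64-byte blocks; padding adds one 0x80 byte plus an
    # 8-byte length field, rounded up to the next multiple of 64.
    return (n_bytes + 1 + 8 + 63) // 64

signed_bytes = 82_897 + 7_890   # concatenated ByteRange segments
whole_file   = 161_422          # full array buffer size

print("signed content:", sha256_blocks(signed_bytes), "compression rounds")
print("whole PDF     :", sha256_blocks(whole_file), "compression rounds")
```

For this example the signed content needs roughly 1,419 compression rounds versus about 2,523 for the whole file, so hashing the entire PDF in-circuit costs about 1.8× the hashing work of the signed segments alone.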