
# [Working Title] Private Information Retrieval in Practice: A Systemization of Knowledge

:::warning
Work in Progress
:::

## Table of Contents

[TOC]

## Taxonomy

```mermaid
flowchart LR
    PIR --> Q1{Q1. Requires non-colluding servers?}
    Q1 --> |No| SSPIR[Single-Server PIR]
    SSPIR --> Q2{Q2. Requires client interaction during preprocessing?}
    Q2 --> |No| Q3{Q3. Server is stateless w.r.t. clients?}
    Q3 --> |No| A[Group A]
    Q3 --> |Yes| B[Group B]
    Q2 --> |Yes| Q4{Q4. Requires client input during preprocessing?}
    Q4 --> |No| C[Group C]
    Q4 --> |Yes| D[Group D]
    Q1 --> |Yes| MSPIR[Multi-Server PIR]
    MSPIR --> E[Group E]
```

:::info
**Remark.** All practical PIR schemes rely on preprocessing of some kind; for example, FHE-based PIR assumes that the database is already *packed* into an efficient representation.
:::

### Why Each Decision Branch Matters in Practice

Below, we explain why each decision branch in the above taxonomy has concrete implications for deployability.

#### Q1. Requires non-colluding servers?

Multi-server PIR schemes relying on non-colluding servers can achieve strong privacy guarantees, including information-theoretic privacy, and higher efficiency than single-server PIR. However, these benefits hinge on the non-collusion assumption, which is difficult to enforce in practice. There is typically no practical mechanism to prevent or detect collusion, making this assumption hard to validate in real-world deployments.

#### Q2. Requires client interaction during preprocessing?

Some PIR schemes require interactive preprocessing involving the client. While this can reduce online query latency, it introduces practical limitations when the database is updated.

When preprocessing depends on client participation, each database update may require re-running preprocessing for all clients, causing **update costs to scale with the number of clients**. Clients must also **store preprocessed state** and **actively participate in updates** to remain consistent with the database.
In contrast, schemes with client-independent preprocessing allow servers to handle updates without client involvement, so update costs depend only on the database. This enables stateless clients and scales better to large deployments with frequent updates.

#### Q3. Server is stateless w.r.t. clients?

#### Q4. Requires client input during preprocessing?

---

### Group A

Most early FHE-based PIR schemes fall into this category.

### Group B

- WhisPIR
- HintlessPIR
- YPIR
- InsPIRe
    - [Paper](https://eprint.iacr.org/2025/1352.pdf) and [Code](https://github.com/google/private-membership/tree/main/research/InsPIRe)
    - Performance (Intel Xeon CPU @ 2.6 GHz, Single-Thread)
      ![Screenshot 2025-12-17 at 6.39.19 PM](https://hackmd.io/_uploads/S1xut1Wm-e.png)

### Group C

- SimplePIR & DoublePIR: [Paper](https://eprint.iacr.org/2022/949.pdf)

### Group D

- Piano
- RMS24
- Plinko: [Paper](https://eprint.iacr.org/2024/318.pdf) and [Tutorial](https://vitalik.eth.limo/general/2025/11/25/plinko.html)

### Group E

- XOR-PIR
    - cf. [SimplePIR](https://eprint.iacr.org/2022/949.pdf)

## Model

We focus on the most basic setting (single-server, index queries, no DB privacy, no batching). Relatively simple transformations exist for other settings of possible interest.

- [Keyword PIR](https://eprint.iacr.org/2019/1483.pdf):
    - PIR + Cuckoo Hashing -> Keyword PIR
    - How about richer queries?
- [Symmetric PIR](https://www.wisdom.weizmann.ac.il/~naor/PAPERS/ope.pdf):
    - PIR + OPRF -> Symmetric PIR

## Possible Approaches for Better Performance

- Differential Privacy (Apple): https://arxiv.org/pdf/2406.06761
    - split DBs into clusters: better online performance at the cost of reduced privacy
    - anonymous query (Tor?)
      plus fake queries
    - works in epochs and controls timing
- Distributional PIR: https://eprint.iacr.org/2025/132.pdf

## Other Primitives for Private Read

- TEE+ORAM
- MPC+ORAM

## Ethereum Data

| Source | Type | Size | Notable consumers | Notable samples |
|----------|----------|----------|----------|----------|
| **State** ("World State", now MPT, later to be [UBT](https://eips.ethereum.org/eips/eip-7864)) | Mutates with every block | ~100s of GBs | wallets, frontends, light clients | balances, contract code, Merkle proofs |
| **Logs** (receipts) | Append-only | 100s of GBs | wallets, frontends, tax software | ERC20 transfer events, shielded pool events |
| Transactions | Append-only | 100s of GBs | light clients, stateless clients, indexers | transactions; `get_block*` and `get_tx*` RPC calls |
| **Historical state** (archival nodes) | Append-only | 3-20 TBs depending on the node | light clients, stateless clients, indexers | Usage example: _"at block `x` the value of this leaf was `y`, and this is the Merkle root to prove it"_ |
| Block headers | Append-only | ~10 GBs | light clients, stateless clients | Usage example: a light client validates a header against the previous block, then fetches and validates the state leaves of interest |
| **Blobs** (stored as sidecars to CL clients, pruned every ~18 days, but archival nodes are expected to store them long term) | Append-only | 10s of GBs\*, expected to grow as the gas / blob-count ceiling keeps increasing | L2 users in exceptional cases (forced exit), L2 nodes | state diffs/updates of rollups |

\* 128 KiB per blob × currently max 6 blobs per block × ~7.2k blocks per day × 365 days per year ≈ 2 TB per year at the maximum blob count, or roughly 100 GB over an ~18-day retention window.

Note: users fetching historical blob data from an indexer must also query the state to get the KZG commitments from the EL, which is how blob data integrity is verified.
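
The blob-size footnote is a plain product, so it can be checked directly. The sketch below is only a back-of-the-envelope calculation: the function name `blob_bytes` is hypothetical, the parameters are the ones quoted in the footnote and the Blobs row (128 KiB per blob, at most 6 blobs per block, ~7.2k blocks per day, ~18-day pruning), and it assumes every block carries the maximum number of blobs, so the results are upper bounds in decimal units.

```python
KIB = 1024  # bytes per KiB


def blob_bytes(days, blob_size_kib=128, max_blobs_per_block=6, blocks_per_day=7200):
    """Upper bound on blob bytes produced over `days`.

    Assumption: every block carries the maximum number of blobs,
    which overestimates real usage.
    """
    return blob_size_kib * KIB * max_blobs_per_block * blocks_per_day * days


per_year = blob_bytes(days=365)   # archival view: nothing is pruned
per_window = blob_bytes(days=18)  # CL view: ~18-day retention window

print(f"per year:      {per_year / 1e12:.2f} TB")
print(f"18-day window: {per_window / 1e9:.1f} GB")
```

Under these assumptions the script prints about 2.07 TB per year and about 101.9 GB for an 18-day window. Actual blob usage below the per-block maximum would shrink both figures proportionally.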