# Block-level Access Lists (BAL): Benchmark and Overhead

**By [Po @ QuarkChain](https://x.com/sanemindpeace)**

## TL;DR

* Evaluated six BAL designs to trade off data overhead against parallelism. → Type-2/5/6 enable fully parallel execution; Type-3/4 offer maximum I/O reduction; Type-1 has the lowest overhead.
* I/O dominates Geth runtime (~70%), primarily from account and storage reads. → Execution isn't the bottleneck — most CPU time is spent on trie access, keccak hashing, and sender recovery.
* Implemented Type-1, Type-3, and Type-4 BAL in the latest Geth. → Benchmarks show Type-1 cuts total block processing time by ~60%, Type-3 by ~68%, and Type-4 by ~70%.

## Overview of BAL Designs

### Definitions

- Address: Account address in the BAL.
- Slot: Storage key in a contract account referenced in the BAL.
- Account value:
  - Pre-block execution values: [nonce, balance, codeHash]
  - Post-tx execution modified values: [nonce, balance]
- Slot value: Value of a specific storage slot.
- Pre-block execution values: Account or storage values before block execution, represented as:
  - `Map[address/slot, account/storage value]`
- Post-tx execution modified values: Account or storage values modified by each transaction in the block, represented as:
  - `List[tx_index, modified account and storage values]`
- Parallel I/O: Concurrent fetching of account/slot values via Account(addr) and Storage(addr, slot).
- Near-zero I/O: I/O operations are mostly eliminated, except for account/contract code reads and post-execution commits.
- Optimistic parallel execution: Executes transactions in parallel under the assumption of no conflicts, with rollbacks if conflicts are detected.
- Fully parallel execution: Parallel execution without re-executing transactions with conflicting dependencies.
- BAL validation overhead: Cost of verifying that the BAL accurately reflects access patterns and state changes during execution.
### BAL Design Variants

We explored six BAL design variants (excluding contract code for brevity):

1. **Addresses and slots**: Enables parallel I/O.
2. **Addresses and slots + post-tx modified values (EIP-7928)**: Supports parallel I/O and fully parallel execution.
3. **Addresses and slots + pre-block slot values**: Enables near-zero I/O.
4. **Addresses and slots + pre-block account and slot values**: Enables near-zero I/O.
5. **Addresses and slots + pre-block slot values + post-tx modified values**: Combines near-zero I/O, fully parallel execution, and parallel state root computation.
6. **Addresses and slots + pre-tx account and slot values**: Enables near-zero I/O and fully parallel execution.

The key difference between Type-5 and Type-6 lies in BAL validation:

- **Type-5** supports parallel validation of the provided BAL and parallel computation of the new state root. However, it requires merging the pre-block-execution values with the post-tx-execution modified values to reconstruct the prestate needed for fully parallel transaction execution.
- **Type-6** enables near-zero I/O reads but requires sequential validation, since the BAL for a transaction can only be verified after all preceding transactions have been executed.
- We expect the validation overhead difference between Type-5 and Type-6 to be minimal, since it's limited to in-memory operations.

[Overhead table (excludes contract code sizes), 2000 blocks: 22347001-22349000](https://docs.google.com/spreadsheets/d/1W0tWBEoG-hV_REilXyu99F1WBplLBmbQscedoYGOEIA/edit?usp=sharing)

![image](https://hackmd.io/_uploads/rytehB3xlg.png)

> Although contract codes were not included here, for optimistic or fully parallel execution, only newly created contract codes need to be provided. For existing contracts, only the `codeHash` is required, since parallel fetching of existing code is fast enough.
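To make the trade-offs concrete, here is a minimal Go sketch of what the per-account payloads for Type-1, Type-2, and Type-6 could carry. The type and field names are our illustration only, not the structures from the Geth prototype or EIP-7928:

```go
package main

import "fmt"

// Illustrative fixed-size aliases for a 20-byte address and a 32-byte word.
type Address [20]byte
type Hash [32]byte

// Type-1: addresses and storage keys only — enough for parallel I/O,
// since every touched location can be fetched concurrently up front.
type Type1Entry struct {
	Address Address
	Slots   []Hash
}

// Type-2 (EIP-7928 style): additionally records per-transaction
// post-execution writes, enabling fully parallel execution.
type SlotWrite struct {
	TxIndex uint16
	Slot    Hash
	Value   Hash
}
type Type2Entry struct {
	Type1Entry
	Writes []SlotWrite
}

// Type-6: additionally records pre-tx account and slot values,
// enabling near-zero I/O plus fully parallel execution.
type Type6Entry struct {
	Address Address
	Nonce   uint64
	Balance []byte        // big-endian balance bytes
	Slots   map[Hash]Hash // slot -> pre-tx value
}

func main() {
	e := Type2Entry{Writes: []SlotWrite{{TxIndex: 3, Slot: Hash{1}, Value: Hash{2}}}}
	fmt.Println(len(e.Writes), e.Writes[0].TxIndex) // prints: 1 3
}
```

The sketch also shows where the size overhead comes from: Type-2 grows with the number of writes per block, while Type-6 grows with the number of distinct locations read, which is why their relative sizes depend on workload.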
A key observation is that the **average block size overhead of Type-3 and Type-6 BAL is smaller than Type-2 (EIP-7928)**, since most account and storage writes may not be accessed by later transactions within the same block. This makes Type-6 an attractive BAL option, offering a better balance between overhead and performance.

Due to the space efficiency of Type-1, Type-3, and Type-4, as well as the complexity involved in enabling fully parallel execution for Type-2, Type-5, and Type-6 in Geth, we implemented only Type-1, Type-3, and Type-4 in the latest Geth.

## Prototype Implementations of BAL in Geth 1.15.8

### Type-1 BAL

The BAL data structure:

```go
var AllBlockAccessLists = map[uint64]types.AccessList{}
```

During `processBlock`, all BAL values are prefetched before transaction execution.

```go
func (p *StateProcessor) Process(block *types.Block, statedb *state.StateDB, cfg vm.Config) (*ProcessResult, error) {
	...
	statedb.PrefetchAccessList(block.Number().Uint64())
	...
```

Prefetched states are stored into `StateDB.stateObjects[addr].originStorage`. We first fetch accounts in parallel, then fetch storage values in parallel, thus eliminating almost all I/O reads in later execution (except for loading contract codes).
(*Note: This is an initial version; parallel execution and overlapping account/storage fetching optimizations have not yet been applied.*)

Snippet for parallel prefetching of accounts and slot values, and setting slot values:

```go
type StateObject struct {
	obj  *stateObject
	keys []common.Hash
}

type StorageKV struct {
	obj *stateObject
	key *common.Hash
	val *common.Hash
}

tp := NewThreadPool(25)
tp.Start()
lenAccts := 0
lenMaxSlots := 0
accts := make(chan StateObject, len(bals))
accountReadStart := time.Now()
for _, bal := range bals {
	addr := bal.Address
	lenAccts++
	keys := bal.StorageKeys
	lenMaxSlots += len(keys)
	tp.AddTask(func() {
		// s.reader.Account is not thread-safe due to keccak hashing, so it's changed to AccountBAL
		acct, err := s.reader.AccountBAL(addr)
		if err != nil {
			log.Error("fail to fetch account:", addr)
		}
		obj := newObject(s, addr, acct)
		accts <- StateObject{obj, keys}
	})
}
acctsArr := []StateObject{}
// set prefetched accounts in stateDB
for range lenAccts {
	state := <-accts
	acctsArr = append(acctsArr, state)
	// must set it first to avoid accounts read later in storageBAL
	s.setStateObject(state.obj)
}
close(accts)
s.AccountReads += time.Since(accountReadStart)

storages := make(chan *StorageKV, lenMaxSlots)
lenSlots := 0
storageReadStart := time.Now()
for _, state := range acctsArr {
	obj := state.obj
	keys := state.keys
	addr := obj.Address()
	if obj.origin == nil {
		continue
	}
	lenSlots += len(keys)
	acct := obj.origin
	tr, err := trie.NewStateTrie(trie.StorageTrieID(s.originalRoot, obj.addrHash, acct.Root), s.db.TrieDB())
	if err != nil {
		log.Error("fail to create trie for:", addr)
		panic("fail to create trie for:")
	}
	for _, key := range keys {
		key := key
		tp.AddTask(func() {
			val, err := obj.db.reader.StorageBAL(addr, key, tr)
			if err != nil {
				log.Error("fail to fetch storage:", addr, key)
			}
			kv := &StorageKV{obj, &key, &val}
			storages <- kv
		})
	}
}
tp.Stop()
s.StorageReads += time.Since(storageReadStart)
// set prefetched storage values in stateDB
for range lenSlots {
	kv := <-storages
	kv.obj.originStorage[*kv.key] = *kv.val
	s.setStateObject(kv.obj)
}
```

### Type-3 BAL

The Type-3 implementation is largely similar to Type-1, with the key difference being that slot values are directly set in the `stateDB` without fetching them from the database.

The BAL data structure:

```go
type StorageKV struct {
	Key common.Hash `json:"key"`
	Val common.Hash `json:"val"`
}

type AccessListKV struct {
	Address   common.Address `json:"address" gencodec:"required"`
	Nonce     uint64         `json:"nonce"`
	Balance   *uint256.Int   `json:"balance"`
	Root      common.Hash    `json:"root"`
	CodeHash  []byte         `json:"codeHash"`
	StorageKV []StorageKV    `json:"storageKeys" gencodec:"required"`
}
```

During `processBlock`, all BAL values are prefetched before transaction execution. Prefetched states are also stored into `StateDB.stateObjects[addr].originStorage`. We first fetch accounts in parallel, then set the corresponding storage values directly from the BAL.

Snippet for parallel prefetching of accounts and direct setting of storage values:

```go
tp := NewThreadPool(25)
tp.Start()
lenAccts := 0
accts := make(chan *stateObject, len(bals))
for _, bal := range bals {
	addr := bal.Address
	lenAccts++
	kvs := bal.StorageKV
	tp.AddTask(func() {
		acct, err := s.reader.AccountBAL(addr)
		if err != nil {
			log.Error("fail to fetch account:", addr)
		}
		obj := newObject(s, addr, acct)
		if acct != nil {
			for _, kv := range kvs {
				obj.originStorage[kv.Key] = kv.Val
			}
		}
		accts <- obj
	})
}
for range lenAccts {
	obj := <-accts
	s.setStateObject(obj)
}
```

### Type-4 BAL

The Type-4 implementation closely follows Type-3, but both account and slot values are directly set in the `stateDB` without any fetching. The BAL data structure remains the same as in Type-3.
```go
for _, bal := range bals {
	addr := bal.Address
	kvs := bal.StorageKV
	if bal.Balance == nil {
		bal.Balance = ZeroBalance
		bal.Root = types.EmptyRootHash
		bal.CodeHash = types.EmptyCodeHash.Bytes()
	}
	acct := types.StateAccount{
		Nonce:    bal.Nonce,
		Balance:  bal.Balance,
		CodeHash: bal.CodeHash,
		Root:     bal.Root,
	}
	obj := newObject(s, addr, &acct)
	for _, kv := range kvs {
		obj.originStorage[kv.Key] = kv.Val
	}
	s.setStateObject(obj)
}
```

## Benchmark Results and Discussion

### Experimental Setup

- **Software**: Geth 1.15.8 with `state.scheme=path`
- **Hardware**: 32 cores, 7TB SSD, 128GB RAM
- **BAL generation and loading**: Used `debug_traceTransaction` with `prestateTracer` to collect the BAL across 2000 blocks ([22347001, 22349000]), stored it in JSON format, and loaded it at Geth startup.
- **Benchmarking method**: Used Geth's `export` and `import` commands across the same 2000 blocks ([22347001, 22349000]).

```bash
geth export --datadir ./geth_full/snap/ dump.dat 22347001 22349000
geth --pprof.cpuprofile cpu1.prof --datadir ./geth_snap import --nocompaction true dump.dat
```

### Results

The [time cost breakdown](https://docs.google.com/spreadsheets/d/1W0tWBEoG-hV_REilXyu99F1WBplLBmbQscedoYGOEIA/edit?gid=779708818#gid=779708818) covers the I/O, execution, validation, and commit phases:

![image](https://hackmd.io/_uploads/rJ3rGIhxxx.png)

Key findings:

- **I/O costs for account and storage reads** account for nearly **70%** of vanilla Geth's total block execution time.
- Using Type-1 BAL (account addresses and storage keys only), our prototype achieved a ~60% reduction in total execution time by eliminating 85% of I/O reads. Type-4 BAL eliminates all I/O reads and achieved a ~70% reduction in total execution time. Notably, account read time is much lower than storage read time.
- In an earlier prototype (implemented 3 years ago, not listed here), **optimistic parallel execution** reduced total execution time by **80%**.
- As Toni pointed out, including **post-execution storage key values per transaction** would enable **fully parallel execution**, making total block execution time limited only by the slowest transaction.
- Commit time will become the main bottleneck once I/O reads are eliminated and execution is fully parallelized with BAL.

We also profiled CPU time (excluding I/O); the top 10 most time-consuming functions are shown below:

![CPU profiling image](https://hackmd.io/_uploads/r11COAaylx.png)

The detailed call stacks are [attached here](https://hackmd.io/_uploads/rJkuBb1bxg.svg?sanitize=true).

Observations:

- Most CPU time is spent on:
  - **PathDB node fetching: 16%** (`keccak`, `atomic.(*Int32).Add(RLock)`)
  - **Caching operations: 7%** (`maps.ctrlGroup.matchH2`)
  - **Sender recovery: 4%** (`runtime.cgocall`)
- **Transaction execution** itself is not on the critical path for performance.

## Future Work

- **Parallelize trie node fetching**: We observed that the average trie disk read depth during account and storage reads in the stateDB — excluding the depth contributed by intermediate cached nodes — is around 5. By parallelizing trie node fetching, I/O read costs could be reduced to approximately 25% of those in the current Geth implementation without BAL. As a result, **the performance gain from BAL alone may drop to around 15%**. It's therefore worthwhile to benchmark Geth's performance under parallel trie fetching without BAL. It's also worth noting that other clients (e.g., Nethermind) have adopted similar optimization approaches.

![image](https://hackmd.io/_uploads/r1PSPL3egg.png)

- **Experiment with fully parallel execution using BAL**: Investigate the performance impact and feasibility of fully parallel execution enabled by BAL.
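The headline numbers are roughly consistent with a simple back-of-the-envelope model: if I/O is a fixed share of block time, eliminating a fraction of I/O reads saves about that share times that fraction. A small Go sketch of this arithmetic (the 70% I/O share, the 85%/100% elimination rates, and the ~25% residual I/O under parallel trie fetching come from the measurements above; the model itself is our simplification):

```go
package main

import "fmt"

// savings estimates the fraction of total block time saved when a given
// fraction of I/O reads is eliminated, assuming I/O accounts for ioShare
// of the total runtime.
func savings(ioShare, ioEliminated float64) float64 {
	return ioShare * ioEliminated
}

func main() {
	const ioShare = 0.70 // I/O share of vanilla Geth block time (measured)

	// Type-1 eliminates ~85% of I/O reads -> close to the measured ~60%.
	fmt.Printf("Type-1 estimate: ~%.1f%%\n", 100*savings(ioShare, 0.85))

	// Type-4 eliminates essentially all I/O reads -> the measured ~70%.
	fmt.Printf("Type-4 estimate: ~%.1f%%\n", 100*savings(ioShare, 1.00))

	// If parallel trie fetching alone cut I/O to ~25% of today's cost,
	// the residual I/O that BAL could still remove shrinks accordingly,
	// in the same ballpark as the ~15% figure above.
	fmt.Printf("BAL gain after parallel trie fetching: ~%.1f%%\n", 100*savings(ioShare, 0.25))
}
```

This first-order model ignores second-order effects such as commit time, cache hits, and the reduced account-read share, which is why the estimates bracket rather than reproduce the measured figures exactly.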
## References

[1] [EIP-7928: Block-Level Access Lists](https://github.com/ethereum/EIPs/blob/a5cc19cda650213cbb74e20bcbd1e1474a842acc/EIPS/eip-7928.md)

[2] [Code for Reproducing BAL Overhead](https://github.com/dajuguan/lab/blob/main/eth/go/access_list_count.go)

[3] [Type-1, Type-3 and Type-4 BAL Implementations in Geth](https://github.com/dajuguan/go-ethereum/tree/po/bal)

[4] [Script for Generating BAL](https://github.com/dajuguan/lab/blob/main/eth/go/access_list.go)

[5] [ESP Academic Grant 2023: EthStorage/Quarkchain Earlier BAL Submission](https://docs.google.com/document/d/1FoW_iuIPoYAjxy-AUGr_vRbIK_Zon3XDAb6heS_GWL0/edit?usp=sharing)