# Go Ethereum Logs Retrieval and Storage Analysis
Ethereum logs are a powerful tool for analyzing critical events in blockchain data. This note walks through some of the log-processing designs, algorithms, and data structures in the `geth` Ethereum client.
## Overview
Go Ethereum (or `geth`) is a popular Ethereum client written in Go. When we use `web3.js` to interact with the blockchain, the underlying library issues JSON-RPC calls that ask an Ethereum client (a "full node") to execute the queries, transactions, and smart contract calls. The full node periodically syncs the latest blocks from the blockchain's p2p network and stores them in its database. When a high-level library like `web3.js` sends a log query request to the full node, the node searches through its database and returns the filtered results to the library. Therefore, an instance running a full node typically requires a large amount of memory for caching and high-performance disks for database reads and writes; CPU is usually not a critical requirement.
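For illustration, the sketch below posts a raw `eth_getLogs` JSON-RPC request to a node's HTTP endpoint, which is roughly what `web3.js` does under the hood. The node URL and the filter values are placeholders.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Placeholder endpoint: a locally running full node with HTTP-RPC enabled.
	nodeURL := "http://localhost:8545"

	// A minimal eth_getLogs request body; block numbers and address are placeholders.
	body := []byte(`{
		"jsonrpc": "2.0",
		"id": 1,
		"method": "eth_getLogs",
		"params": [{
			"fromBlock": "0x10d4f",
			"toBlock":   "0x10d50",
			"address":   "0x0000000000000000000000000000000000001000"
		}]
	}`)

	resp, err := http.Post(nodeURL, "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out)) // the matching logs, as a JSON-RPC response
}
```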
However, we recently observed that the CPU usage of our BSC full node was constantly hitting 100%, even though the machine has 12 CPU cores. After walking through the code paths of `geth` (specifically, the [Binance BSC fork](https://github.com/binance-chain/bsc) of `geth`), we found some tricks and configurations that can help reduce the CPU usage.
## Walk Through
### Ethereum Block Structure
In an Ethereum block, two fields are related to logs: `receipts` and `logsBloom`. `receipts` contains the logs of all transactions in the block, stored consecutively. `logsBloom` is the Bloom filter representation of those logs. We will discuss what a Bloom filter is and how it works in Ethereum later.
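As a rough mental model (simplified, illustrative types, not geth's actual definitions), the per-block data that log queries rely on looks like this:

```go
// BlockLogData sketches the per-block data used when answering log queries.
type BlockLogData struct {
	LogsBloom [256]byte // 2048-bit Bloom filter over all log addresses and topics
	Receipts  []Receipt // one receipt per transaction, stored consecutively
}

type Receipt struct {
	Logs []Log
}

type Log struct {
	Address [20]byte   // contract that emitted the event
	Topics  [][32]byte // indexed event parameters (up to four)
	Data    []byte     // non-indexed event data
}
```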
### `eth_getLogs` RPC Call
The documentation of the `eth_getLogs` RPC call can be found here: <https://eth.wiki/json-rpc/API#eth_getlogs>. The RPC call takes two categories of parameters: content filters and block range filters.
- Content filters: `address` and `topics`.
- Block range filters: `fromBlock` and `toBlock`, or `blockHash`. If `blockHash` is present in the filter criteria, neither `fromBlock` nor `toBlock` is allowed.
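For illustration, here is how those two categories map onto a query built with the official go-ethereum client library (`ethclient`). The node URL, contract address, block numbers, and topic hash are placeholders; the topic shown is the well-known ERC-20 `Transfer` event signature hash.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math/big"

	ethereum "github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	client, err := ethclient.Dial("http://localhost:8545") // placeholder node URL
	if err != nil {
		log.Fatal(err)
	}

	query := ethereum.FilterQuery{
		// Block range filters; the alternative is to set BlockHash instead,
		// in which case FromBlock/ToBlock must be left unset.
		FromBlock: big.NewInt(1_000_000),
		ToBlock:   big.NewInt(1_000_100),
		// Content filters: emitting contract(s) and indexed event topics.
		Addresses: []common.Address{
			common.HexToAddress("0x0000000000000000000000000000000000001000"),
		},
		Topics: [][]common.Hash{
			// keccak256("Transfer(address,address,uint256)")
			{common.HexToHash("0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef")},
		},
	}

	logs, err := client.FilterLogs(context.Background(), query)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("matched logs:", len(logs))
}
```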
### RPC Call Entry
<https://github.com/binance-chain/bsc/blob/db2eea7fbd06a90480543de595fe1cab193515fe/eth/filters/api.go#L332-L359>
When the client receives a new `eth_getLogs` request, it first checks whether the request supplies the `blockHash` parameter. If only `blockHash` is present, it creates a single-block filter; under the hood, the single-block filter completely skips the Bloom filter-related code. If `fromBlock` and `toBlock` are supplied, it creates a range filter that queries the Bloom filter even when `fromBlock` and `toBlock` are the same.
### Bloom Filter
A [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) is a space-efficient probabilistic data structure that can quickly tell whether an element is definitely not in a collection. It is "probabilistic" because a positive answer may be a false positive; only the negative answer is reliable. Because it is space-efficient and fast, it is widely used in Ethereum.
For Ethereum logs, the Bloom filters are per-block, which means they can only tell that certain logs are definitely not in a given block. The custom Bloom filter implementation in Ethereum uses [Keccak](https://en.wikipedia.org/wiki/SHA-3) as the hash function.
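As a simplified sketch (not the exact geth implementation), this is how the bits for one input are derived and set in the 2048-bit per-block bloom: take the Keccak-256 hash of the input (a contract address or a topic) and use the low 11 bits of each of the first three 2-byte pairs of the hash as bit positions.

```go
package main

import (
	"encoding/binary"
	"fmt"

	"golang.org/x/crypto/sha3"
)

// bloomBits returns the three bit positions (0..2047) set in the 2048-bit
// log bloom for a given input: the low 11 bits of each of the first three
// 2-byte pairs of the input's Keccak-256 hash.
func bloomBits(data []byte) [3]uint {
	h := sha3.NewLegacyKeccak256()
	h.Write(data)
	sum := h.Sum(nil)

	var bits [3]uint
	for i := 0; i < 3; i++ {
		bits[i] = uint(binary.BigEndian.Uint16(sum[2*i:]) & 0x7ff)
	}
	return bits
}

func main() {
	// A 2048-bit (256-byte) bloom, as found in an Ethereum block header.
	var bloom [256]byte

	// Hypothetical input: in a real block this would be a 20-byte contract
	// address or a 32-byte topic hash.
	input := []byte("example input")
	for _, bit := range bloomBits(input) {
		// geth stores the bloom big-endian: bit 0 lives in the last byte.
		bloom[255-bit/8] |= 1 << (bit % 8)
	}
	fmt.Println("bits set:", bloomBits(input))
}
```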
<https://github.com/binance-chain/bsc/blob/db2eea7fbd06a90480543de595fe1cab193515fe/eth/filters/filter.go#L137-L169>
When logs are queried with a range filter, the filter requests a new Bloom filter session and pulls the Bloom filter data from the block data database. Once the data is available, the filter session uses the Bloom filter to check whether the requested contents, such as `address` and `topics`, could be in each block.
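To illustrate the kind of membership test involved, the same check can be run by hand against a block header's `logsBloom` using go-ethereum's public `types.BloomLookup` helper; the node URL and contract address below are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	client, err := ethclient.Dial("http://localhost:8545") // placeholder node URL
	if err != nil {
		log.Fatal(err)
	}
	header, err := client.HeaderByNumber(context.Background(), nil) // latest block header
	if err != nil {
		log.Fatal(err)
	}

	addr := common.HexToAddress("0x0000000000000000000000000000000000001000") // placeholder
	// BloomLookup can only answer "definitely not in this block" (false) or
	// "possibly in this block" (true); a positive answer still has to be
	// confirmed against the block's actual receipts.
	maybe := types.BloomLookup(header.Bloom, addr)
	fmt.Println("block may contain logs from the address:", maybe)
}
```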
### Retrieve Receipts
<https://github.com/binance-chain/bsc/blob/db2eea7fbd06a90480543de595fe1cab193515fe/core/blockchain.go#L1102-L1117>
After the filter determines which blocks may contain the logs we need, it asks the block data database for the related blocks' receipts. The receipts are encoded in the [RLP format](https://eth.wiki/fundamentals/rlp), which is used throughout the Ethereum codebase. Go Ethereum first tries to get the receipts from an LRU cache. The LRU cache is shared across the codebase, and its size is [hard-coded](https://github.com/binance-chain/bsc/blob/db2eea7fbd06a90480543de595fe1cab193515fe/core/blockchain.go#L84-L99).
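The pattern is a plain cache-then-database lookup. Below is a minimal sketch using the same `hashicorp/golang-lru` library that geth relies on; the cache size, key type, and `loadFromDB` helper are illustrative, not geth's actual values or code.

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

// receiptStore sketches the cache-then-database lookup pattern for receipts.
type receiptStore struct {
	cache *lru.Cache
}

func newReceiptStore(size int) *receiptStore {
	c, _ := lru.New(size)
	return &receiptStore{cache: c}
}

func (s *receiptStore) getReceipts(blockHash string) string {
	if v, ok := s.cache.Get(blockHash); ok {
		return v.(string) // cache hit: no disk read, no RLP decoding
	}
	receipts := loadFromDB(blockHash) // expensive: disk read plus RLP decode
	s.cache.Add(blockHash, receipts)
	return receipts
}

// loadFromDB stands in for the freezer/leveldb lookup described below.
func loadFromDB(blockHash string) string { return "receipts for " + blockHash }

func main() {
	s := newReceiptStore(32)
	fmt.Println(s.getReceipts("0xabc"))
	fmt.Println(s.getReceipts("0xabc")) // second call is served from the cache
}
```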
<https://github.com/binance-chain/bsc/blob/db2eea7fbd06a90480543de595fe1cab193515fe/core/rawdb/accessors_chain.go#L570-L599>
If the receipts are not in the LRU cache, the code retrieves them from the `freezer` database. If that retrieval also fails, it asks the `leveldb` database to "freeze" data into the `freezer` database and then tries to read from the `freezer` database again.
### Block Data Database
In Go Ethereum, all blockchain-related data is stored in its database. There are two kinds of databases in Ethereum: `freezer` and `leveldb`. `leveldb` is a key-value database based on [goleveldb](https://github.com/syndtr/goleveldb), a Go implementation of Google's LevelDB. `freezer` is a file-based, append-only database suited to old, read-only data. Both databases use the [snappy](https://github.com/google/snappy) library for data compression. There is a [config section](https://github.com/binance-chain/bsc/blob/db2eea7fbd06a90480543de595fe1cab193515fe/core/rawdb/schema.go#L123-L131) that controls whether `snappy` compression is applied to each `freezer` table. However, the config is hard-coded, so changing it requires recompiling.
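Conceptually, that config is a map from freezer table name to a flag that disables `snappy` for the table. The paraphrased sketch below is illustrative rather than a copy of the real declaration; the table names and flags are assumptions.

```go
// freezerNoSnappy paraphrases the hard-coded per-table compression config:
// true means the table is stored raw, false means snappy compression is used.
var freezerNoSnappy = map[string]bool{
	"headers":  false, // compressed
	"hashes":   true,  // stored raw
	"bodies":   false, // compressed
	"receipts": false, // compressed
	"diffs":    true,  // stored raw
}
```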
### Raw Logs Matching
<https://github.com/binance-chain/bsc/blob/db2eea7fbd06a90480543de595fe1cab193515fe/eth/filters/filter.go#L253-L282>
After retrieving the related blocks' receipts, the code extracts the logs from the receipts. It then runs a parallel for-loop to check whether the `address` and `topics` of each log match the content filters' criteria. Finally, it sends the filtered logs as the response to the RPC call.
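The matching rules themselves are simple: the log's address must be in the requested address set (if one is given), and the topic criteria at each position must match the log's topic at the same position, with an empty position acting as a wildcard. A simplified, self-contained sketch of that check (trimmed-down types, not geth's actual code):

```go
package main

import "fmt"

type logEntry struct {
	Address string
	Topics  []string
}

// matches reports whether a log satisfies the address and topic criteria.
func matches(l logEntry, addresses []string, topics [][]string) bool {
	if len(addresses) > 0 && !contains(addresses, l.Address) {
		return false
	}
	// The log must have at least as many topics as there are criteria positions.
	if len(topics) > len(l.Topics) {
		return false
	}
	for i, alternatives := range topics {
		// An empty position matches any topic.
		if len(alternatives) > 0 && !contains(alternatives, l.Topics[i]) {
			return false
		}
	}
	return true
}

func contains(set []string, v string) bool {
	for _, s := range set {
		if s == v {
			return true
		}
	}
	return false
}

func main() {
	l := logEntry{Address: "0xA", Topics: []string{"Transfer", "from"}}
	fmt.Println(matches(l, []string{"0xA"}, [][]string{{"Transfer"}})) // true
	fmt.Println(matches(l, []string{"0xB"}, nil))                      // false
}
```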
## Suggestions for Optimizing CPU Usage
- Use `blockHash` instead of `fromBlock` and `toBlock` when querying logs (see the sketch after this list).
- Increase the size of block receipts' LRU cache.
- Disable the `snappy` compression of the `freezer` tables.
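As a concrete sketch of the first suggestion, a one-block range query can be replaced by resolving the block's hash first and then filtering by `blockHash`, which avoids the Bloom filter code path. The node URL, block number, and contract address below are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math/big"

	ethereum "github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	client, err := ethclient.Dial("http://localhost:8545") // placeholder node URL
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Resolve the block hash of the block we care about, then query by hash
	// instead of using fromBlock == toBlock.
	header, err := client.HeaderByNumber(ctx, big.NewInt(1_000_000)) // placeholder block number
	if err != nil {
		log.Fatal(err)
	}
	blockHash := header.Hash()

	logs, err := client.FilterLogs(ctx, ethereum.FilterQuery{
		BlockHash: &blockHash,
		Addresses: []common.Address{
			common.HexToAddress("0x0000000000000000000000000000000000001000"), // placeholder
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("matched logs:", len(logs))
}
```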