# Overview
This document addresses the following two questions:
- **What** is the relevant data that should be collected (to get insights about the operational health of the network)?
- **How** to best collect that data?
# Introduction
Analyses of transactions on the Bitcoin blockchain are widespread today because
the data is easily accessible to anyone. Such analyses underpin the many
websites that provide statistics about the Bitcoin network, such as the
network's hashrate, transaction throughput and fees, and many more.
For some investigations, however, the blockchain-based view can be too narrow.
For one thing, blockchain data lacks precise transaction and block timestamps,
rendering it useless for fine-grained analyses of transaction and block
propagation in the network. For another, it misses transactions that did not
make it into the blockchain, which provide crucial clues about the demand for
block space, fee dynamics, replace-by-fee usage, and so on. Also missing are
invalid transactions and blocks sent in the network, which can be a basis for
anomaly detection.
Fortunately, a comprehensive data set that does not suffer from these
shortcomings can be extracted from a Bitcoin Core node. Unfortunately, there is
currently no standardized way to collect the data in a simple, reliable, and
automated fashion. The remainder of this document attempts to reach a consensus
on what data should be collected and how best to collect it.
# Relevant data
This is an initial attempt to list data that could prove useful. Because
historical data cannot be reproduced later, a general philosophy of
overcollection (i.e., err on the side of collecting too much rather than too
little data) should apply. Feel free to extend this list.
- timestamp: The timestamp of the event
- event: The type of event (invalid transaction/block, block added to chain,
  transaction added to mempool, transaction removed from mempool)
- txid: The id of the transaction or block
- raw data: The raw transaction data and, in case of an invalid block, the block
  data. Recording raw transaction data makes it possible to correct, after the
  fact, the (infrequent) mistakes made by the witness detection heuristic.
- metadata (data that can in theory be derived from a full historical data set,
  but in practice is too hard/expensive to derive)
  - invalid: reason for being invalid
  - removed from mempool: reason for being removed
  - some of the transaction information provided by the `getrawmempool` API call
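
To make the list above more concrete, here is a minimal sketch of how one event record could be represented and serialized. The class name `MempoolEvent`, the field names, and the event labels such as `"mempool_removed"` are hypothetical and only mirror the list; they are not an agreed-upon schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class MempoolEvent:
    """One observed event; the fields mirror the list above (hypothetical schema)."""
    timestamp: float            # time the event was observed (Unix seconds)
    event: str                  # hypothetical label, e.g. "mempool_added"
    id: str                     # txid or block hash, hex-encoded
    raw: Optional[str] = None   # raw transaction/block data, hex-encoded
    metadata: dict = field(default_factory=dict)  # e.g. {"reason": "replaced"}

    def to_json_line(self) -> str:
        # One JSON object per line keeps the log append-only and easy to stream.
        return json.dumps(asdict(self), sort_keys=True)
```

A record could then be created with `MempoolEvent(timestamp=time.time(), event="mempool_removed", id=txid, metadata={"reason": "replaced"})` and appended to a log file via `to_json_line()`. Keeping `metadata` as a free-form dictionary fits the overcollection philosophy, since new fields can be added without changing the record layout.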
# Data collection approaches
So far, the following approaches have been identified to collect some or all of
the [relevant data](#relevant-data).
## API-based data collection
This approach is based on taking snapshots of a node's mempool state at regular
intervals using the API call `getrawmempool`. Comparing successive snapshots
yields the lists of transactions added to and removed from the mempool. Detailed
transaction information for transactions added in the interval can be obtained
using the `getrawtransaction` API call.
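
The snapshot comparison described above can be sketched as follows. This is an illustration under stated assumptions, not a finished collector: `fetch_snapshot` is a hypothetical callable that returns the txids currently in the mempool (for example, by wrapping a `getrawmempool` call), and the event labels `"added"` and `"removed"` are placeholders.

```python
import time
from typing import Callable, Iterable, List, Set, Tuple


def diff_snapshots(previous: Iterable[str], current: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) txids between two mempool snapshots."""
    prev_set, curr_set = set(previous), set(current)
    return curr_set - prev_set, prev_set - curr_set


def poll(
    fetch_snapshot: Callable[[], Iterable[str]],  # hypothetical: returns current txids
    interval_s: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> List[Tuple[float, str, str]]:
    """Take rounds + 1 snapshots and record the differences between neighbours."""
    previous = set(fetch_snapshot())
    events: List[Tuple[float, str, str]] = []
    for _ in range(rounds):
        sleep(interval_s)
        current = set(fetch_snapshot())
        # The timestamp is the snapshot time, not the time the change occurred.
        observed_at = clock()
        added, removed = diff_snapshots(previous, current)
        events += [(observed_at, "added", txid) for txid in sorted(added)]
        events += [(observed_at, "removed", txid) for txid in sorted(removed)]
        previous = current
    return events
```

Because only set differences between neighbouring snapshots are recorded, a transaction that enters and leaves the mempool within one interval never appears in `events`, and every timestamp is only as precise as `interval_s`.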
Pros:
- The resulting data set will be self-healing in case of downtime
  - Downtime increases the time between two snapshots, so significant
    information might be missed, but both snapshots always represent valid
    mempool states.
Cons:
- The following relevant data cannot be collected with this approach:
  - exact timestamps: the accuracy of timestamps is limited by the snapshotting
    frequency
  - invalid transactions and blocks are missing
  - the reasons for removal of transactions from the mempool are missing
- All transactions that were both added to and removed from the mempool in the
  time between two successive snapshots will be missing. This includes some RBF
  transactions. Data on transactions that were removed because they were
  included in a block can be reconstructed using blockchain data.
## ZMQ-based data collection
This approach collects data from the various ZMQ notification interfaces
provided by Bitcoin Core.
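
As an illustration, the sketch below parses a single notification from the `sequence` topic. It assumes the multipart layout described in Bitcoin Core's `doc/zmq.md`: a topic frame, a body frame (32-byte hash, a one-byte label, and for mempool events an 8-byte little-endian mempool sequence number), and a 4-byte little-endian message counter. The names `SequenceEvent`, `parse_sequence_message`, and the event labels are hypothetical; the frames would typically come from a subscriber socket, for example pyzmq's `recv_multipart()`.

```python
import struct
from typing import List, NamedTuple, Optional

# Assumed labels of the `sequence` topic, mapped to hypothetical event names.
LABELS = {
    b"C": "block_connected",
    b"D": "block_disconnected",
    b"A": "mempool_added",
    b"R": "mempool_removed",
}


class SequenceEvent(NamedTuple):
    kind: str
    hash_hex: str                     # txid or block hash as hex
    mempool_sequence: Optional[int]   # only present for mempool events
    message_sequence: int             # per-topic message counter


def parse_sequence_message(frames: List[bytes]) -> SequenceEvent:
    """Parse one [topic, body, counter] multipart message of the `sequence` topic."""
    topic, body, counter = frames
    if topic != b"sequence":
        raise ValueError(f"unexpected topic: {topic!r}")
    if len(body) < 33:
        raise ValueError("body too short")
    label = body[32:33]
    kind = LABELS.get(label)
    if kind is None:
        raise ValueError(f"unknown label: {label!r}")
    mempool_sequence = None
    if label in (b"A", b"R"):
        if len(body) != 41:
            raise ValueError("mempool event without 8-byte sequence number")
        (mempool_sequence,) = struct.unpack("<Q", body[33:41])
    (message_sequence,) = struct.unpack("<I", counter)
    return SequenceEvent(kind, body[:32].hex(), mempool_sequence, message_sequence)
```

A collector would attach its own receive timestamp to each parsed event. A jump in `message_sequence` between two consecutive messages suggests that notifications were dropped, which is worth logging alongside the data.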
Pros:
- Event notification is immediate
- Exact timestamps
- No missing data due to inadequate sampling as in the API-based approach
Cons:
- Data set can become inconsistent in case of downtime
  - Example: A transaction removed from the mempool during downtime will remain
    stuck in a reconstruction of the mempool. Removal times can be approximated
    for mined transactions using block timestamps, but not for transactions
    removed for other reasons.
  - For monitoring the operational health of the network, this should not be a
    problem, since a node that is down does not contribute any health
    information. When it comes to the potential creation of a public mempool
    data set, it should not be a problem either: data is collected in a
    decentralized way by multiple nodes, so as long as at least one node is
    running, the resulting data set should be complete.
- The following relevant data cannot be collected out of the box:
  - Invalid transactions and blocks, including the reason for invalidity
  - Reasons for removal of transactions from the mempool
  - Information provided by `getrawmempool`
  - In theory, it should be possible to make all data available via ZMQ by
    adding the necessary functionality to Bitcoin Core. An existing
    [patch](https://github.com/0xB10C/bitcoin-zmq-mempool-chain-events/blob/v23.0-zmce/PATCH.md)
    by 0xB10C already makes some of the relevant data available.
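
The argument that several independent collectors compensate for each other's downtime can be sketched as a merge of per-node event logs. This is a simplified illustration: the tuple layout `(timestamp, event, id)` and the function name `merge_event_logs` are hypothetical, and keying on `(event, id)` collapses repeated events for the same transaction (for example, a transaction that re-enters the mempool after a reorg), which a real merge would need to handle.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (timestamp, event type, txid or block hash)
Event = Tuple[float, str, str]


def merge_event_logs(logs: Iterable[Iterable[Event]]) -> List[Event]:
    """Union the logs of several nodes, keeping the earliest sighting of each event."""
    earliest: Dict[Tuple[str, str], float] = {}
    for log in logs:
        for timestamp, event, ident in log:
            key = (event, ident)
            if key not in earliest or timestamp < earliest[key]:
                earliest[key] = timestamp
    # Sort chronologically so the merged log can be replayed in order.
    return sorted((ts, event, ident) for (event, ident), ts in earliest.items())
```

If one node was down while `tx2` arrived, the merged log still contains that event as long as another node recorded it. Note that nodes can legitimately disagree about mempool contents, so the merged log describes the union of what the collectors saw rather than the state of any single mempool.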
## USDT-and-eBPF-based data collection
This approach does not rely on an external tool to collect data from Bitcoin
Core via some interface. Instead, it is based on adding tracepoints to the
mempool and other subsystems of Bitcoin Core to enable logging all relevant
data directly via Bitcoin Core.
Pros:
- Can collect all relevant data
- If the tracepoints get merged, data collection can occur using vanilla Bitcoin
  Core
Cons:
- Probably the most work to implement
- Data set can become inconsistent in case of downtime
  - The same comments as for the ZMQ-based approach apply