# Optimizing database interactions

This document lists some interrelated ideas that we think will speed up indexing by reducing the time a subgraph spends writing to the database. They all apply to subgraph syncing. We think that database interactions are not a big concern for synced subgraphs, and users generally complain about the time it takes to sync a subgraph, not so much its performance once it's reached the chain head.

The ideas, which are described in more detail below, are:

* pipeline DB writes with the rest of subgraph indexing _(done)_
* use `COPY` instead of `INSERT`, especially when writing a large number of entities to the store
* batch changes by not writing to the database on every block. Instead, write after n minutes have passed or m entity changes have been accumulated
* introduce immutable entities _(done)_
* do something about overindexing _(still important)_
* string interning
* address [aggregations](https://hackmd.io/d1XUlIOUSqulUFbtQbs0iw) _(still important)_
* parallelize writes
* limited history/pruning _(still important)_

Before implementing any of these, it would be good to somehow simulate these changes to better gauge what improvement they can bring.

## Pipeline DB writes

Currently, storing the entity changes for a block happens sequentially with the rest of indexing. We can perform writing to the database in parallel with processing the next block, or in general the next few blocks, until we reach a limit of how many entity changes we keep in memory, at which point indexing will have to pause.

In detail, pipelining would be implemented in the following way:

1. The logic for writing to the database [here](https://github.com/graphprotocol/graph-node/blob/master/core/src/subgraph/instance_manager.rs#L987-L1031) needs to move to a background thread.
1. The `EntityCache` needs to be changed so that it does not evict entities that have not been written to the database yet. Telling the `EntityCache` how far writing has gotten (and which entities are therefore eligible for eviction) introduces a feedback loop between the main indexing thread and the background writer.
1. This [section](https://github.com/graphprotocol/graph-node/blob/master/core/src/subgraph/instance_manager.rs#L1023-L1025) needs to be changed: the correct `vid` for an entity is only needed when we actually go to write the entity, but the `EntityCache` is currently used to track that information. With pipelining, we should move the responsibility for tracking this to the background writer.
1. Sending the entity changes to the background writer should block if the background writer determines that too much work has already accumulated, and should only unblock once the writer has room to queue the new work.

## Use `COPY` instead of `INSERT`

We currently use `INSERT` statements (with multiple values) to add entity changes to the database. Especially for larger inserts, there's a general sense that using `COPY` instead performs better.
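As a rough illustration, assuming a hypothetical `sgd1.token` entity table (column names are made up), the two approaches look like this; whether the switch is worth the extra complexity should be measured:

```sql
-- Current style: one multi-row INSERT per batch of entity changes
insert into sgd1.token (id, symbol, total_supply)
values ('0x01', 'FOO', 100),
       ('0x02', 'BAR', 250);

-- COPY streams rows in Postgres' text (or binary) format instead;
-- fields are tab-separated, and a driver would feed this from memory
-- rather than from a psql script
copy sgd1.token (id, symbol, total_supply) from stdin;
0x01	FOO	100
0x02	BAR	250
\.
```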
## Batch changes/checkpointing

During syncing, there isn't really a need to write data at every block. If we reduce writes to writing every n minutes, we should see a speedup since writing in batches is generally faster than issuing many smaller inserts. The biggest downside to writing only every n minutes is that we might have to redo the work of those n minutes if the index node is restarted.

With the work of the previous two sections in place, checkpointing would be implemented as a change to the behavior of the background writer: instead of writing every time it is asked to write the entity changes for a block, it waits until enough changes have accumulated and then writes them all at once.

Some care must be taken to handle entity updates appropriately. Assume we are writing a batch of changes that covers blocks `n` to `n+k`:

* before inserting any entity changes, the block ranges of the old versions already in the database need to be closed with the appropriate block number from `[n, n+k]`
* if an entity is created or updated in one of the blocks `[n, n+k]`, and then updated again, its block range must be closed while still in memory to avoid creating entity versions with overlapping block ranges

## Introduce immutable entities

If users could indicate that an entity type is immutable, e.g. by declaring it with `type Thing @entity(immutable: true) { .. }` in their GraphQL schema, we would know that any entity will never have more than one version, and that block ranges will never be closed. We can therefore just store the block at which the entity was created with the entity and do not need a range. That would let us avoid the exclusion index on `(id, block_range)` that we use to ensure that each entity has at most one live version at each block, and use a simpler unique index on `(id, block)` instead.

This change would also allow us to simplify queries since we can replace the clause `block_range @> $block`, which checks that the block we are interested in is included in the block range, with `block <= $block`, i.e., a simple comparison of integers that can be supported with a BTree index.
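A minimal sketch of the difference at the DDL and query level, using a hypothetical `sgd1.thing` table (the mutable layout roughly mirrors what entity tables look like today; names are illustrative):

```sql
-- Mutable entity: versions carry a block range, and an exclusion constraint
-- (which needs the btree_gist extension) guarantees at most one live version
-- per id at any given block
create table sgd1.thing (
    vid         bigserial primary key,
    id          text not null,
    name        text not null,
    block_range int4range not null,
    exclude using gist (id with =, block_range with &&)
);

-- Immutable entity: one row per entity, a plain integer creation block,
-- and the simpler unique index suggested above instead of the exclusion constraint
create table sgd1.thing_immutable (
    vid   bigserial primary key,
    id    text not null,
    name  text not null,
    block int4 not null,
    unique (id, block)
);

-- Time-travel query, mutable vs. immutable
select * from sgd1.thing           where id = 'abc' and block_range @> 7000000;
select * from sgd1.thing_immutable where id = 'abc' and block <= 7000000;
```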
## Do something about overindexing

We know that we create way too many indexes, in the hope that they will speed up queries, which they often, but not always, do; manual index creation is still needed from time to time and can dramatically reduce the runtime of slow queries. Unfortunately, there doesn't seem to be much that can be done in an automated fashion. There are tools that try to analyze past queries and come up with index suggestions, but we have no experience with how well they actually work.

But even with such a tool in hand we'd still first need a feature that allows managing indexes for a subgraph, and should start with a manual process. That process would need to be supported by a tool that lets indexers control

* which indexes to create and which ones to omit
* how to add and remove indexes from existing subgraphs

The definition of indexes could go into a simple text file, together with some `graphman` subcommand that drops and creates indexes based on the definitions in that file. A lot of indexes could be expressed at the level of the GraphQL schema, e.g., 'create an index on Token(volume, id)', but we might need an escape hatch that operates at the SQL level, where users can use all the features that Postgres provides.

## String interning

Most entities use primary keys that are strings, and that are therefore repeated in all entity versions, and in all foreign keys in related entities. One approach to string interning is described [here](https://github.com/graphprotocol/graph-node/issues/1164#issuecomment-529052263) where each entity gets an integer `eid` column that is used to identify the entity in indexes and foreign keys.

We could take that one step further and store the `id` strings in a separate table that maps them to the `eid`, avoiding the data duplication that comes from storing the same `id` with each entity. That might pose a problem though: sometimes, slow queries contain an `order by attr, id` clause and can be sped up by creating a BTree index on `(attr, id)`. If the `id` lives in a separate table, it is no longer possible to create such an index.

We add the `id` to the sort key to provide a stable and deterministic order when ordering by non-unique columns. We don't much care how exactly the added `id` column creates a deterministic order, but since it must be deterministic, it can only depend on deterministic data. If we want to change the order to `order by attr, eid`, we need to ensure that all indexers have the `eid` in the same order. It might be possible to assign them in such a way that they are in the order in which entities are created, for example by sorting newly created entities by their `id`.
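A rough sketch of what the interned layout could look like, using a hypothetical `sgd1.token` table; names and types are illustrative, not a settled design:

```sql
-- Maps every entity id string to a small integer, once per deployment
create table sgd1.token_ids (
    eid bigserial primary key,
    id  text not null unique
);

-- Entity versions reference the interned eid instead of repeating the id string
create table sgd1.token (
    vid          bigserial primary key,
    eid          bigint not null references sgd1.token_ids(eid),
    trade_volume numeric not null,
    block_range  int4range not null
);

-- Ordering by (attr, id) would now require a join; ordering by (attr, eid)
-- can use a local BTree index, but only if eids are assigned deterministically
create index token_volume_eid on sgd1.token (trade_volume, eid);
```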
## Parallelize writes

The writes for a subgraph happen sequentially (even with checkpointing, block data is still written in order, albeit in multi-block batches). A single writer is not enough to [saturate the I/O bandwidth of a typical SSD](https://www.percona.com/blog/2017/08/28/looking-disk-utilization-and-saturation/). Saturating modern I/O is therefore only possible if we write in parallel; to some extent we already do that because different subgraphs write in parallel, but on systems where bandwidth is still available, writing the changes for a single subgraph in parallel would be beneficial.

When writing changes in parallel, we need to track the block ranges for which changes have been written as a set of intervals, so that we track, say, that we have written blocks `[0, 1000]`, `[1002, 1005]`, and `[1007, 1008]`. The subgraph head that is reported to queries will be the end of the interval starting at `0`, i.e., `1000`. As blocks are written, these intervals need to be joined when gaps disappear; for example, once writing block `1001` in the example finishes, the block ranges we track for the subgraph become `[0, 1005]` and `[1007, 1008]`, and the subgraph head advances to `1005`. When `graph-node` starts, we need to get to a stable state where we deal only with one block range by rewinding to the subgraph head and removing any data we wrote beyond that.

## Limited history

Subgraphs retain all historical data from genesis, even though in a lot of cases, only limited history is actually needed by subgraph users, e.g., in a subgraph that tracks ERC20 token balances. The unneeded historical data slows down reads and writes (by how much, especially for writes, should be measured).

We could give subgraph users a way to indicate that they are only interested in history for the last `n` blocks, either for a subgraph as a whole, or for individual entities, and delete historical data that is older than `n` blocks going back from the subgraph head. Here, 'subgraph users' could either be subgraph authors or indexers.

Our APIs already have facilities to deal with limited history for an entire subgraph. If limiting by entity type is added, we'll also need to update APIs like the indexing status API.

## Table partitioning

_This is not really workable, and we can not partition entities into separate tables and expect to only need to query one. See the separate writeup on split entity history for why._

Instead of limiting history, we could get similar effects from table partitioning. This will be particularly beneficial for entity types that keep running totals like trade volumes over all time. Users would tell us to partition the table for an entity type like this:

```graphql=
type Token @entity(partition: true) {
  id: ID!
  symbol: String!
  totalSupply: BigInt!
  tradeVolume: BigDecimal!
}
```

When we create the database tables for such entity types, we would partition them by `upper(block_range)` in such a way that we keep each partition to a certain target number of rows, say 10 million.

The tricky part is how we handle a partition getting full. A possible strategy is the following:

- track the number of rows for each partition (or use database estimates, the strategy does not need to be overly precise)
- always have a `default` partition. That's where the latest version of any entity goes
- when the `default` partition gets full, change its range to the blocks it contains and create a new `default` partition (this needs to be tested since it might interfere with queries due to locking)

Another partitioning strategy would be to only use two partitions: one for blocks we encounter during syncing, and one for blocks after syncing has finished.

Since we always do time-travel queries, we always have a clause `block_range @> $block` in our queries. To ensure that Postgres only consults relevant partitions, we would add the redundant clause `upper(block_range) >= $block` to each query (see the sketch at the end of this section).

There are some downsides to using table partitioning:

- we can not use an exclusion constraint to make sure that the block ranges for different versions of an entity do not overlap
- we need to have an explicit column to store `upper(block_range)`, otherwise the table can't have a primary key
- `upper(block_range)` must be part of the primary key
- if the user does not have a reasonable guess for `versions_per_block`, we can make it optional, but would need to substitute a fixed default which may or may not be useful

Before embarking on table partitioning, we should first get more detail on the BRIN indexes we have on tables and understand when they are effective and when they are useless.
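A minimal sketch of such a partitioned table and of a query with the redundant clause, assuming a hypothetical `sgd1.token` layout where the upper bound of the block range is stored in an explicit `block_range_upper` column, with a large sentinel value for versions that are still open:

```sql
-- Partitioned entity table; block_range_upper mirrors upper(block_range),
-- with 2147483647 as a sentinel for versions whose range is still open
create table sgd1.token (
    vid               bigserial,
    id                text not null,
    symbol            text not null,
    total_supply      numeric not null,
    trade_volume      numeric not null,
    block_range       int4range not null,
    block_range_upper int4 not null,
    primary key (vid, block_range_upper)
) partition by range (block_range_upper);

-- Versions whose block range was closed before block 10,000,000
create table sgd1.token_p0
    partition of sgd1.token for values from (0) to (10000000);
-- Open versions and anything that doesn't fit an existing range
create table sgd1.token_default
    partition of sgd1.token default;

-- Time-travel query with the redundant clause so Postgres can prune partitions
select *
  from sgd1.token
 where id = '0xabc'
   and block_range @> 8500000
   and block_range_upper >= 8500000;
```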
## Recent entities tables

A better approach for mutable entities with a high number of versions per entity is introducing a 'recent versions table', which helps queries that access only entity versions close to the subgraph head. That could be implemented in a manner that is entirely transparent to subgraph authors and to queries.

For that, for an entity `Token`, we'd create a `token` table and manage it in exactly the same way as we do today. In addition, we create a table `token_recent` which has the same structure (and indexes, for now) as the `token` table. When writing entities, we write them to both the `token` and the `token_recent` table. In addition, we prune the history of the `token_recent` table quite aggressively, and only keep, say, entity versions for the last 1000 blocks (~ 4 hours on ethereum mainnet). Queries that query at a block within the `token_recent` time range use that table, and only queries for older blocks go to the main `token` table.

Since the cleanup of historical rows can interfere with queries, the entire range of history in `token_recent` needs to be split into a queryable part and a tail part. The tail part needs to be large enough that it covers the maximum time that any query can take. For example, with 1000 blocks of history in `token_recent`, we might make the first 900 blocks queryable and use the last 100 blocks as the tail buffer. With that scheme, we can be sure that any query finishes before the blocks it relies on get cleaned up.

We can defer creating recent tables until after a subgraph is synced, i.e., when the indexing logic calls `deployment_synced`, and use Postgres statistics to determine whether a table is indeed one where maintaining a recent table would be beneficial: we can compare Postgres' estimate for the total number of rows in the table with its estimate of the number of distinct entities in the table. Those estimates generally seem precise enough to make this determination.

To start with, we'll enable (and disable) tracking of recent versions manually through `graphman stats recent sgd123.token`. In more detail, that kicks off the following state changes:

* Initially, the `recent` flag for each table is `off` (or `null`)
* When `graphman stats recent` is run, that flag goes to `requested`
* The index node processing the subgraph in question notices this change. As part of writing entity changes for the subgraph, the index node creates a `token_recent` table and copies data from `token` to `token_recent` to populate it with enough history (this can take a few minutes and will stall indexing). It then sets the `recent` flag to `ready`
* Query nodes notice that the `recent` flag is `ready` and execute appropriate queries against `token_recent` rather than `token`
* The index node writes all changes to both the `token` and `token_recent` table; in addition, it removes unneeded history from `token_recent`
* When `graphman stats recent --clear sgd123.token` is run, the `recent` flag is reset to `off`. Query nodes send all their queries to `token` again, and the index node only writes to the `token` table
* Some cleanup command can then later drop the `token_recent` table. That cleanup should happen long enough after turning off tracking of recent versions that we can be sure that no query node is still using it (e.g., after an hour)

One way to store these state changes in the database that also provides some auditability is to change the `subgraphs.table_stats` table to have the following additional columns:

```sql=
recent_requested timestamptz,
recent_ready timestamptz,
recent_off timestamptz
```

The `recent` flag for a table is `off` if it either has no entry in `subgraphs.table_stats` or its entry has a non-null value in `recent_off`.
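As a sketch, and assuming that `subgraphs.table_stats` identifies rows by a deployment id and a table name (those column names are an assumption here), a query node could derive the current state of the flag like this:

```sql
-- Hypothetical lookup of the recent flag for one table of one deployment;
-- a missing row counts as 'off', as described above
select case
           when ts.recent_off is not null       then 'off'
           when ts.recent_ready is not null     then 'ready'
           when ts.recent_requested is not null then 'requested'
           else 'off'
       end as recent_state
  from subgraphs.table_stats ts
 where ts.deployment = 123
   and ts.table_name = 'token';
```

Using timestamps rather than a plain enum also records when each transition happened, which provides the auditability mentioned above.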