With @kyle looking into regen-registry#1695, I wanted to take some time to explore possible solutions that may involve intermediate indexing of the analytics database to create event-specific tables or views in the analytics db.
In the analytics database we have 3 different relevant tables for getting retirement data:
tx: has one row per transaction
msg: has one row per msg (we may have multiple msgs in a given transaction)
msg_event_attr: contains all event data, and has one row per event-attribute pair. Each Msg has multiple events.
Note: There's currently a problem in our analytics db architecture where we don't actually index events per msg. We currently only do so by "Event Type". So if we have multiple events per Msg (e.g. a group proposal execution with multiple retirements), there is no way to distinguish between the different retirements (e.g. know which retirement location corresponds to which retirement reason or owner). It's unclear if this problem is a problem with our analytics approach, or still an inherent problem of the SDK (see here, or here).
In our use case, we want to find "EventRetire" events which have been emitted as part of any Msg execution, inside of any tx.
If we didn't need to look at any event attribute data, one could imagine a simplified case where we join the msg_event and tx tables. The msg_event table allows us to filter by 'EventRetire' events, and we join on the tx table to get tx memos (we need tx memos to parse legacy retirement reasons).
Example query:
select *
from tx
natural join msg_event
where msg_event.type like '%EventRetire'
limit 100
This query (a simple join with limit 100) takes 3788.376 ms to run.
In reality, we also need event attribute data to get information like retirement jurisdiction, retirement amount, etc.
While theoretically this could be parsed from the Msg arguments, that approach would also require custom Msg parsing based on which Msg a retirement was initiated from (e.g. MsgRetire, MsgTake, MsgBuyDirect), or whether that Msg was itself wrapped in another msg for execution (e.g. group/MsgExec or authz/MsgExec).
For that reason, here we first explore getting most of the important retirement data from retirement attributes.
Our challenge with this approach lies in the fact that our msg_event_attr table stores a row for each key-value pair (i.e. every event attribute) in a given event.
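To make that shape concrete, a single EventRetire would land in msg_event_attr as several rows, one per attribute, along these lines (the values here are hypothetical, just to illustrate the layout):

```
chain_num | block_height | tx_idx | msg_idx | type                                 | key         | value
----------+--------------+--------+---------+--------------------------------------+-------------+---------
        1 |      8441573 |      0 |       0 | regen.ecocredit.v1alpha1.EventRetire | retirer     | "regen1..."
        1 |      8441573 |      0 |       0 | regen.ecocredit.v1alpha1.EventRetire | batch_denom | "C02-..."
        1 |      8441573 |      0 |       0 | regen.ecocredit.v1alpha1.EventRetire | amount      | "1"
        1 |      8441573 |      0 |       0 | regen.ecocredit.v1alpha1.EventRetire | location    | "CA-QC"
```

So to get one row per retirement, we need to pivot these attribute rows into columns.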
Thankfully, we know what an ideal table format would look like. Let's imagine having a single event_retire table with a column for each attribute.
Column | Type | Collation | Nullable | Default
--------------+----------+-----------+----------+---------
chain_num | smallint | | |
block_height | bigint | | |
tx_idx | smallint | | |
msg_idx | smallint | | |
owner | text | | |
batch_denom | text | | |
amount | text | | |
jurisdiction | text | | |
reason | text | | |
To create this table, I've taken an approach leveraging GROUP BY and MAX() aggregate functions:
select
chain_num, block_height, tx_idx, msg_idx,
MAX(CASE WHEN key = 'retirer' THEN value ELSE null END) as retirer,
MAX(CASE WHEN key = 'batch_denom' THEN value ELSE null END) as batch_denom,
MAX(CASE WHEN key = 'amount' THEN value ELSE null END) as amount,
MAX(CASE WHEN key = 'location' THEN value ELSE null END) as location
from msg_event_attr
where type like 'regen.ecocredit.v1alpha1.EventRetire'
group by 1,2,3,4
An alternative approach involved window functions (with FIRST_VALUE()), but currently I believe the GROUP BY & MAX() approach is better for both readability and performance.
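For reference, a window-function version of the same pivot could look roughly like the sketch below. This is a reconstruction rather than the exact query I tried: I've used MAX(...) OVER instead of FIRST_VALUE() since it avoids an ordering trick, and note that a DISTINCT is still needed to collapse the per-attribute rows, which is part of why I find this form less readable:

```sql
-- Window-function version of the pivot: each MAX(...) runs over a
-- window covering all attribute rows of one event, and DISTINCT
-- collapses the resulting duplicate rows back to one per event.
select distinct
  chain_num, block_height, tx_idx, msg_idx,
  MAX(CASE WHEN key = 'retirer' THEN value END)
    OVER (PARTITION BY chain_num, block_height, tx_idx, msg_idx) as retirer,
  MAX(CASE WHEN key = 'batch_denom' THEN value END)
    OVER (PARTITION BY chain_num, block_height, tx_idx, msg_idx) as batch_denom,
  MAX(CASE WHEN key = 'amount' THEN value END)
    OVER (PARTITION BY chain_num, block_height, tx_idx, msg_idx) as amount,
  MAX(CASE WHEN key = 'location' THEN value END)
    OVER (PARTITION BY chain_num, block_height, tx_idx, msg_idx) as location
from msg_event_attr
where type like 'regen.ecocredit.v1alpha1.EventRetire'
```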
event_retire table
Another complicating factor for us is that we actually have two different event types that we are trying to query here:
regen.ecocredit.v1alpha1.EventRetire
regen.ecocredit.v1.EventRetire
These two event types have slightly different attribute key names ("owner" vs "retirer", "jurisdiction" vs "location").
To create a single table that is most usable for our actual use case, I've created two separate views (event_retire_v1 and event_retire_v1alpha1), one for each version of our retire event, and then created a 3rd view called event_retire which uses the column names from event_retire_v1, but unions all rows from both views.
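The view definitions would look roughly like the following sketch (the exact definitions may differ; in particular I'm assuming here that the v1alpha1 event carries no reason attribute, since legacy retirement reasons are parsed from tx memos):

```sql
-- Pivot each event version into one-row-per-retirement, normalizing
-- v1alpha1 attribute keys (retirer, location) to the v1 column names
-- (owner, jurisdiction).
create view event_retire_v1alpha1 as
select
  chain_num, block_height, tx_idx, msg_idx,
  MAX(CASE WHEN key = 'retirer' THEN value END) as owner,
  MAX(CASE WHEN key = 'batch_denom' THEN value END) as batch_denom,
  MAX(CASE WHEN key = 'amount' THEN value END) as amount,
  MAX(CASE WHEN key = 'location' THEN value END) as jurisdiction,
  null::text as reason  -- legacy reasons live in tx memos
from msg_event_attr
where type like 'regen.ecocredit.v1alpha1.EventRetire'
group by 1, 2, 3, 4;

create view event_retire_v1 as
select
  chain_num, block_height, tx_idx, msg_idx,
  MAX(CASE WHEN key = 'owner' THEN value END) as owner,
  MAX(CASE WHEN key = 'batch_denom' THEN value END) as batch_denom,
  MAX(CASE WHEN key = 'amount' THEN value END) as amount,
  MAX(CASE WHEN key = 'jurisdiction' THEN value END) as jurisdiction,
  MAX(CASE WHEN key = 'reason' THEN value END) as reason
from msg_event_attr
where type like 'regen.ecocredit.v1.EventRetire'
group by 1, 2, 3, 4;

-- The combined view uses the v1 column names and includes both versions.
create view event_retire as
select * from event_retire_v1
union all
select * from event_retire_v1alpha1;
```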
With these SQL queries, I've created a set of "VIEWS" which create a virtual event_retire table matching our querying needs.
If we take as a given that we want a specific event_retire view, then the tradeoff to discuss is:
Is using a "VIEW" for this domain-specific table sufficient from a performance perspective? Or do we need to create a separate ETL process by which we actually write rows to a new event_retire table at some regularity (every 10-30 seconds, or likely at each new block height)?
Performance improvements could come from having a table with proper indexes (e.g. index by retirer address).
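As a sketch, that materialized variant could look like the following (table and index names here are hypothetical; the insert step would run per block or on a short timer):

```sql
-- A materialized event_retire table with the same columns as the view,
-- plus indexes for the lookups we expect (by retirer address, by height).
create table event_retire_tbl (
  chain_num    smallint,
  block_height bigint,
  tx_idx       smallint,
  msg_idx      smallint,
  owner        text,
  batch_denom  text,
  amount       text,
  jurisdiction text,
  reason       text
);
create index event_retire_tbl_owner_idx on event_retire_tbl (owner);
create index event_retire_tbl_height_idx on event_retire_tbl (block_height);

-- Incremental ETL step: append any rows from the view that are newer
-- than what we've already materialized.
insert into event_retire_tbl
select * from event_retire
where block_height > (select coalesce(max(block_height), 0) from event_retire_tbl);
```

Postgres's built-in materialized views (create materialized view + refresh materialized view) are another option here, though a refresh recomputes the whole view rather than appending incrementally.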
Currently, querying from our event_retire VIEW takes around 4 seconds:
select *
from event_retire
limit 100
---------
1 | 8441573 | 0 | 0 | "regen1x5mcudx3cwn2skrtpa3uukttqsvm0jkf2f4775" | "C02-001-20180101-20181231-001" | "1" | "CA-QC" |
1 | 7950767 | 0 | 0 | "regen1r6jtydrz9k5g596fyrrwrzendk9y37qhuhvhe3" | "C02-003-20200630-20220629-001" | "10" | "US-MA" |
1 | 8424644 | 0 | 0 | "regen1dpuf6cgk9kukpydh35unc9vdvvahmpr562cevd" | "C02-003-20200630-20220629-001" | "10" | "CA-QC" |
1 | 7624606 | 0 | 0 | "regen19cf8m5rwm9azkseyaufwmgmpur682jfc6ggtpv" | "C02-002-20211012-20241013-001" | "1" | "US-CA 92065" |
(100 rows)
Time: 4073.551 ms (00:04.074)
Given that using views takes a significant amount of time, I think the next set of questions that we should focus on is how we set up our ETL process for this event_retire table.
In the meantime, I would encourage the interfaces team to take the event_retire table as a given, so we can continue working on exposing this to web clients (via postgraphile, or a custom REST endpoint) and on building out interfaces for retirement data, independently from the work on refining our indexing approach (deciding on an ETL architecture).