With @kyle looking into regen-registry#1695, I wanted to take some time to explore possible solutions that may involve intermediate indexing of the analytics database to create event-specific tables or views in the analytics db.
In the analytics database we have 3 different relevant tables for getting retirement data:
tx: has one row per transaction
msg: has one row per msg (we may have multiple msgs in a given transaction)
msg_event_attr: contains all event data, and has one row per event-attribute pair. Each Msg has multiple events.
Note: There's currently a problem in our analytics db architecture where we don't actually index events per msg. We currently only do so by "Event Type". So if we have multiple events per Msg (e.g. a group proposal execution with multiple retirements), there is no way to distinguish between the different retirements (e.g. know which retirement location corresponds to which retirement reason or owner). It's unclear if this problem is a problem with our analytics approach, or still an inherent problem of the SDK (see here, or here).
In our use case, we want to find "EventRetire" events which have been emitted as part of any Msg execution, inside of any tx.
If we didn't need to look at any event attribute data, one could imagine a simplified case where we join the msg_event and tx tables. The msg_event table allows us to filter by 'EventRetire' events, and we join on the tx table to get tx memos (we need tx memos to parse legacy retirement reasons).
Example query:
select *
from tx
natural join msg_event
where msg_event.type like '%EventRetire'
limit 100
This query (a simple join with limit 100) takes 3788.376 ms to run.
In reality, we also need event attribute data to get information like retirement jurisdiction, retirement amount, etc.
While theoretically this could be parsed from the Msg arguments, that approach would also require custom Msg parsing based on which Msg a retirement was initiated from (e.g. MsgRetire, MsgTake, MsgBuyDirect), or whether that Msg was itself wrapped in another msg for execution (e.g. group/MsgExec or authz/MsgExec).
For that reason, here we first explore getting most of the important retirement data from retirement attributes.
Our challenge with this approach lies in the fact that our msg_event_attr table stores a row for each key-value pair (i.e. every event attribute) in a given event.
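To make that shape concrete, a single EventRetire would land in msg_event_attr as several rows, one per attribute, along these lines (the values here are hypothetical, just to illustrate the layout):

```
chain_num | block_height | tx_idx | msg_idx | type                                 | key         | value
----------+--------------+--------+---------+--------------------------------------+-------------+---------
        1 |      8441573 |      0 |       0 | regen.ecocredit.v1alpha1.EventRetire | retirer     | "regen1..."
        1 |      8441573 |      0 |       0 | regen.ecocredit.v1alpha1.EventRetire | batch_denom | "C02-..."
        1 |      8441573 |      0 |       0 | regen.ecocredit.v1alpha1.EventRetire | amount      | "1"
        1 |      8441573 |      0 |       0 | regen.ecocredit.v1alpha1.EventRetire | location    | "CA-QC"
```

So to get one row per retirement, we need to pivot these attribute rows into columns.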
Thankfully, we know what an ideal table format would look like. Let's imagine having a single event_retire table with a column for each attribute.
Column | Type | Collation | Nullable | Default
--------------+----------+-----------+----------+---------
chain_num | smallint | | |
block_height | bigint | | |
tx_idx | smallint | | |
msg_idx | smallint | | |
owner | text | | |
batch_denom | text | | |
amount | text | | |
jurisdiction | text | | |
reason | text | | |
To create this table, I've taken an approach leveraging GROUP BY and MAX() aggregate functions:
select
chain_num, block_height, tx_idx, msg_idx,
MAX(CASE WHEN key = 'retirer' THEN value ELSE null END) as retirer,
MAX(CASE WHEN key = 'batch_denom' THEN value ELSE null END) as batch_denom,
MAX(CASE WHEN key = 'amount' THEN value ELSE null END) as amount,
MAX(CASE WHEN key = 'location' THEN value ELSE null END) as location
from msg_event_attr
where type like 'regen.ecocredit.v1alpha1.EventRetire'
group by 1,2,3,4
An alternative approach involved window functions (with FIRST_VALUE()), but currently I believe the GROUP BY & MAX() approach is better for both readability and performance.
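For reference, a window-function version of the same pivot could look roughly like the sketch below. This is a reconstruction rather than the exact query I tried: I've used MAX(...) OVER instead of FIRST_VALUE() since it avoids an ordering trick, and note that a DISTINCT is still needed to collapse the per-attribute rows, which is part of why I find this form less readable:

```sql
-- Window-function version of the pivot: each MAX(...) runs over a
-- window covering all attribute rows of one event, and DISTINCT
-- collapses the resulting duplicate rows back to one per event.
select distinct
  chain_num, block_height, tx_idx, msg_idx,
  MAX(CASE WHEN key = 'retirer' THEN value END)
    OVER (PARTITION BY chain_num, block_height, tx_idx, msg_idx) as retirer,
  MAX(CASE WHEN key = 'batch_denom' THEN value END)
    OVER (PARTITION BY chain_num, block_height, tx_idx, msg_idx) as batch_denom,
  MAX(CASE WHEN key = 'amount' THEN value END)
    OVER (PARTITION BY chain_num, block_height, tx_idx, msg_idx) as amount,
  MAX(CASE WHEN key = 'location' THEN value END)
    OVER (PARTITION BY chain_num, block_height, tx_idx, msg_idx) as location
from msg_event_attr
where type like 'regen.ecocredit.v1alpha1.EventRetire'
```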
event_retire table
Another complicating factor for us is that we actually have two different event types that we are trying to query here:
regen.ecocredit.v1alpha1.EventRetire
regen.ecocredit.v1.EventRetire
These two event types have slightly different attribute key names ("owner" vs "retirer", "jurisdiction" vs "location").
To create a single table that is most usable for our actual use case, I've created two separate views (event_retire_v1 and event_retire_v1alpha1), one for each version of our retire event, and then created a 3rd view called event_retire which uses the column names from event_retire_v1, but unions all rows from both views.
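The view definitions would look roughly like the following sketch (the exact definitions may differ; in particular I'm assuming here that the v1alpha1 event carries no reason attribute, since legacy retirement reasons are parsed from tx memos):

```sql
-- Pivot each event version into one-row-per-retirement, normalizing
-- v1alpha1 attribute keys (retirer, location) to the v1 column names
-- (owner, jurisdiction).
create view event_retire_v1alpha1 as
select
  chain_num, block_height, tx_idx, msg_idx,
  MAX(CASE WHEN key = 'retirer' THEN value END) as owner,
  MAX(CASE WHEN key = 'batch_denom' THEN value END) as batch_denom,
  MAX(CASE WHEN key = 'amount' THEN value END) as amount,
  MAX(CASE WHEN key = 'location' THEN value END) as jurisdiction,
  null::text as reason  -- legacy reasons live in tx memos
from msg_event_attr
where type like 'regen.ecocredit.v1alpha1.EventRetire'
group by 1, 2, 3, 4;

create view event_retire_v1 as
select
  chain_num, block_height, tx_idx, msg_idx,
  MAX(CASE WHEN key = 'owner' THEN value END) as owner,
  MAX(CASE WHEN key = 'batch_denom' THEN value END) as batch_denom,
  MAX(CASE WHEN key = 'amount' THEN value END) as amount,
  MAX(CASE WHEN key = 'jurisdiction' THEN value END) as jurisdiction,
  MAX(CASE WHEN key = 'reason' THEN value END) as reason
from msg_event_attr
where type like 'regen.ecocredit.v1.EventRetire'
group by 1, 2, 3, 4;

-- The combined view uses the v1 column names and includes both versions.
create view event_retire as
select * from event_retire_v1
union all
select * from event_retire_v1alpha1;
```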
With these SQL queries, I've created a set of "VIEWS" which create a virtual event_retire table matching our querying needs.
If we take as a given that we want a specific event_retire view, then the tradeoff to discuss is:
Is using a "VIEW" for this domain-specific table sufficient from a performance perspective? Or do we need to create a separate ETL process by which we actually write rows to a new event_retire table at some regularity (every 10-30 seconds, or likely at each new block height)?
Performance improvements could come from having a table with proper indexes (e.g. index by retirer address).
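As a sketch, that materialized variant could look like the following (table and index names here are hypothetical; the insert step would run per block or on a short timer):

```sql
-- A materialized event_retire table with the same columns as the view,
-- plus indexes for the lookups we expect (by retirer address, by height).
create table event_retire_tbl (
  chain_num    smallint,
  block_height bigint,
  tx_idx       smallint,
  msg_idx      smallint,
  owner        text,
  batch_denom  text,
  amount       text,
  jurisdiction text,
  reason       text
);
create index event_retire_tbl_owner_idx on event_retire_tbl (owner);
create index event_retire_tbl_height_idx on event_retire_tbl (block_height);

-- Incremental ETL step: append any rows from the view that are newer
-- than what we've already materialized.
insert into event_retire_tbl
select * from event_retire
where block_height > (select coalesce(max(block_height), 0) from event_retire_tbl);
```

Postgres's built-in materialized views (create materialized view + refresh materialized view) are another option here, though a refresh recomputes the whole view rather than appending incrementally.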
Currently, querying from our event_retire VIEW takes around 4 seconds:
select *
from event_retire
limit 100
---------
1 | 8441573 | 0 | 0 | "regen1x5mcudx3cwn2skrtpa3uukttqsvm0jkf2f4775" | "C02-001-20180101-20181231-001" | "1" | "CA-QC" |
1 | 7950767 | 0 | 0 | "regen1r6jtydrz9k5g596fyrrwrzendk9y37qhuhvhe3" | "C02-003-20200630-20220629-001" | "10" | "US-MA" |
1 | 8424644 | 0 | 0 | "regen1dpuf6cgk9kukpydh35unc9vdvvahmpr562cevd" | "C02-003-20200630-20220629-001" | "10" | "CA-QC" |
1 | 7624606 | 0 | 0 | "regen19cf8m5rwm9azkseyaufwmgmpur682jfc6ggtpv" | "C02-002-20211012-20241013-001" | "1" | "US-CA 92065" |
(100 rows)
Time: 4073.551 ms (00:04.074)
Given that using views takes a significant amount of time, I think the next set of questions that we should focus on is how we set up our ETL process for this event_retire table.
In the meantime, I would encourage the interfaces team to take the event_retire table as a given, so we can continue working on exposing this to web clients (via postgraphile, or a custom REST endpoint) and on building out interfaces for retirement data, independently from the work on refining our indexing approach (deciding on an ETL architecture).