EPF Dev Update #7
This past week, I continued working on gathering transaction data through the Indexer (Repo) from a Geth node, using an RPC method on an endpoint provided by Mario, and storing it in a CSV file. I have made progress in several areas, but have also encountered some limitations.
- There were a few logical mistakes & wrong assumptions about the APIs I was working with, for which I had to dive into the Geth codebase. Since I was working with the RPC endpoint, I took the opportunity to explore the internals of ethersJS & web3JS to see how they encapsulate RPC calls into functions. Pretty interesting.
- I have extracted the initial dataset for the deposit contract on Ethereum using the event indexer. The dataset runs from 14th October 2020 (the point the contract was deployed) up until 25th December 2022. That is information for around 500k validators (503,580 to be exact), but I am assuming there are some discrepancies. I will have to double-check the following:
- Make sure all of the validators are captured. I plan to check by summing all of the validator indexes: the sum should equal the sum of the first 503,580 numbers (see the sketch after this list). I will optimize this later.
- Make sure no event is missed.
- Make sure events from uncle blocks are not included. This took me 2 days to figure out, as the captured validators outnumbered the active validators!
- I also encountered some network delays that caused the data retrieval to fail, due to too many requests or bandwidth problems on my end, which led to missed event data (events from some blocks were only partially downloaded). After checking, I realized the indexer had missed about 90k validators! I will optimize this later too (a retry sketch follows below).
- The event data provides most of the information needed, but it doesn't have the to & from fields, which are essential to the analysis. Those are only retrievable from eth_getTransactionByHash (maybe the APIs could get somewhat better? I don't know). Therefore I plan to extract them using the transaction hashes I stored in the CSV file. This also means that for the ~500k validator entries I will have to make around 300k requests, with added delays, to the Ethereum node (there are some duplicate tx hashes in the dataset, probably from deposits batched into a single transaction), and hope I don't break Mario's node.
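To make the index-sum and uncle-block checks concrete, here's a minimal sketch of what I have in mind. It assumes ethers v5 and hypothetical row fields (`validatorIndex`, `blockNumber`, `blockHash`); the actual column names in the indexer's CSV differ, so treat this as an outline rather than the indexer's real code.

```ts
// sanity-checks.ts — a minimal sketch, not the indexer's actual code.
import { ethers } from "ethers";

// Hypothetical shape of one CSV row; real column names may differ.
interface DepositRow {
  validatorIndex: number; // index assigned to the validator
  blockNumber: number;    // block the deposit event was emitted in
  blockHash: string;      // hash of that block, as recorded by the indexer
}

const provider = new ethers.providers.JsonRpcProvider("http://localhost:8545");

// Check 1: if all n validators were captured exactly once and indexes are
// 0-based, the indexes 0..n-1 must sum to n*(n-1)/2 (Gauss' formula), so a
// single pass over the CSV can flag both gaps and duplicates.
function indexSumLooksComplete(rows: DepositRow[]): boolean {
  const n = rows.length;
  const expected = (n * (n - 1)) / 2;
  const actual = rows.reduce((sum, row) => sum + row.validatorIndex, 0);
  return actual === expected;
}

// Check 2: an event logged from an uncle block carries the uncle's hash,
// which will not match the canonical block hash at the same height.
async function findUncleRows(rows: DepositRow[]): Promise<DepositRow[]> {
  const stale: DepositRow[] = [];
  for (const row of rows) {
    const canonical = await provider.getBlock(row.blockNumber);
    if (canonical.hash.toLowerCase() !== row.blockHash.toLowerCase()) {
      stale.push(row); // the block this event came from was reorged out
    }
  }
  return stale;
}
```

The index-sum check is a single pass over the CSV, while the uncle check costs one getBlock call per row as written; grouping rows by block number would cut that down to one call per block.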
Overall, while developing the indexer, I have tried to overcome these limitations as much as possible through trial & error, aiming to keep it robust and efficient. But it's important to keep in mind that working with distributed systems like Ethereum can be challenging, and these limitations can't always be avoided.
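For the network-delay failures mentioned above, one likely fix is wrapping every RPC request in a retry with exponential backoff, so a partially downloaded block range is re-fetched rather than silently dropped. A rough sketch (the attempt count and delays are placeholder values, not tuned for Mario's node):

```ts
// withRetry.ts — retry an async RPC call with exponential backoff.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1_000
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // give up, surface the error
      // Back off 1s, 2s, 4s, ... so a rate-limited node gets room to recover.
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: wrap each getLogs chunk so a failed range is retried as a unit,
// e.g. const logs = await withRetry(() => provider.getLogs(filter));
```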
Next steps
- Double-check the dataset to ensure all validators are captured, no events were missed or duplicated, and events from uncle blocks are not included.
- Extract the to and from fields from the transaction data using the eth_getTransactionByHash method (sketched at the end of this update).
- Optimize the performance of the indexer.
- I have some metrics in mind to match with eth-metrics addresses for comparison.
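Since the to/from extraction is the next big task, here's a rough sketch of the plan, again assuming ethers v5. The `txHashes` array would come from the hash column already stored in the CSV, and the 100 ms delay is a guess at what the node tolerates, not a measured limit.

```ts
// fetchSenders.ts — sketch of the planned to/from extraction.
import { ethers } from "ethers";

const provider = new ethers.providers.JsonRpcProvider("http://localhost:8545");

async function fetchSenders(
  txHashes: string[]
): Promise<Map<string, { from: string; to: string | null }>> {
  // Deduplicate first: batched deposits share one tx hash, which is why
  // the ~500k rows should collapse to roughly 300k requests.
  const unique = [...new Set(txHashes)];
  const out = new Map<string, { from: string; to: string | null }>();
  for (const hash of unique) {
    // ethers wraps the raw RPC method; this is equivalent to calling
    //   await provider.send("eth_getTransactionByHash", [hash]);
    const tx = await provider.getTransaction(hash);
    if (!tx) continue; // tx not found; shouldn't happen for mined deposits
    out.set(hash, { from: tx.from, to: tx.to ?? null });
    // Small fixed delay between requests to avoid hammering the node.
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
  return out;
}
```

Combining this with the retry wrapper above should keep a single failed request from killing a multi-hour run.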