## EPF Dev Update #9

This past week has been a very busy, debugging-heavy, and informative one.

1. I have finalized the indexer for retrieving data and shaped its output into a more useful form. The indexer can now retrieve data in a format that is more easily usable for analysis.
2. I have been in touch with the author of a repository (https://github.com/alrevuelta/eth-metrics/) that statistically determines and associates staking pools with their addresses. However, the conversation revealed that the data is becoming outdated, and manually analyzing the hard-coded addresses in that repo gave me doubts about the association of some addresses with a specific pool (Binance, for example). We are currently discussing this and will see where it goes. Another important point `alrevuelta` mentioned is that asking pools on Discord to confirm their addresses is not a good approach, since getting them to verify apparently doesn't work.
3. While verifying the authenticity of my dataset, I found an anomaly between my data and the validator indexes displayed on the Beaconcha.in website. My dataset matches Etherscan but differs from what Beaconcha.in extracts. After diving into its open-source codebase, I discovered a section that queries the Lighthouse client directly for data retrieval. The anomaly doesn't seem to be an issue in the explorer itself, but rather in the Lighthouse client. This still needs to be verified; I will probably get in touch with the team later.

I also thought of an idea for associating validators with staking pools, similar to how Uniswap maintains its token list (a rough sketch of what such a list could look like is included at the end of this update). However, this approach relies on the optimistic assumption that the associated data is correct. We might also need to attach some incentives for contributors to increase authenticity. This would be very beneficial for the community as a whole; I might pursue it after the cohort.

Currently, I am in the phase of classifying the staking pools and have a few approaches I will follow. This may deviate a bit from the main project, but this groundwork needs to be laid before the analysis can be accurate. I want to note that if we don't have enough data, we may have to conclude that our analysis is incomplete or inaccurate. If I see the complications growing, I will shortlist the major staking pools, accept a few false positives, and continue from there. This will not cover the performance of all pools, but rather a few pools for the performance analysis, which will still give us valuable insights while being more practical.

Here is how I plan to label the data (a rough sketch of this matching in code follows the next-steps list):

- Get the unique `From` addresses from the dataset and try to find labels for them; Etherscan's label cloud (https://etherscan.io/labelcloud) is one such source. I have already shortlisted the labels.
- Get all `To` & `From` addresses from the dataset where the same transaction hash appears more than once. This way we can identify contracts that make batch requests to the deposit contract, which are probably pools we don't know about yet.
- Match my addresses against the `eth-metrics` repository to extract common addresses and remove as many false positives as I can.

I am aiming for a dataset where the majority of `From` addresses are associated with a staking pool label for further analysis.

## Next steps

1. By the end of the week, I will associate at least 2 staking pools; I am aiming for 4, though. Let's see how this goes.
2. Then I will deep dive into the Beacon Chain API, play around with the endpoints that make the most sense for our project, and hopefully get to the final phases of the project.
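Below is a minimal sketch of the labeling steps described above. It assumes the indexer output is a CSV of deposit-contract transactions with `tx_hash`, `from_address`, and `to_address` columns, and that the shortlisted labels live in a `labels.csv` file with `address` and `label` columns; these file and column names are placeholders, not the actual pipeline.

```python
# Sketch of the labeling plan, under the assumptions stated above.
import pandas as pd

deposits = pd.read_csv("deposits.csv")  # hypothetical indexer output
labels = pd.read_csv("labels.csv")      # hypothetical address -> label map (columns: address, label)

# Normalise addresses so string matching is case-insensitive.
deposits["from_address"] = deposits["from_address"].str.lower()
labels["address"] = labels["address"].str.lower()

# Step 1: attach the shortlisted labels to the `From` addresses we already know about.
labelled = deposits.merge(labels, left_on="from_address", right_on="address", how="left")

# Step 2: transaction hashes that appear more than once are batch deposits;
# their `To`/`From` addresses likely belong to a pool contract we don't know yet.
batch_mask = deposits["tx_hash"].duplicated(keep=False)
batch_candidates = deposits.loc[batch_mask, ["tx_hash", "from_address", "to_address"]]
unknown_batchers = batch_candidates[~batch_candidates["from_address"].isin(labels["address"])]

print(f"labelled rows: {labelled['label'].notna().sum()} / {len(labelled)}")
print(f"unlabelled batch depositors: {unknown_batchers['from_address'].nunique()}")
```

The same merge step can then be repeated against the addresses hard-coded in the `eth-metrics` repository to extract the common ones and flag disagreements as potential false positives.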
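Finally, here is a rough sketch of the Uniswap-style pool list idea mentioned earlier. The schema and field names are purely hypothetical, just to illustrate what a community-maintained, versioned list of pool deposit addresses might look like.

```python
# Hypothetical "staking pool list" format, loosely modelled on Uniswap's token list.
import json

pool_list = {
    "name": "Community Staking Pool List",
    "version": {"major": 0, "minor": 1, "patch": 0},
    "pools": [
        {
            "name": "Example Pool",  # placeholder entry, not a real pool
            "depositAddresses": ["0x0000000000000000000000000000000000000000"],
            "website": "https://example.org",
            "submittedBy": "0x0000000000000000000000000000000000000000",
        }
    ],
}

def validate(pools: dict) -> bool:
    """Tiny sanity check: every pool needs a name and at least one deposit address."""
    return all(p.get("name") and p.get("depositAddresses") for p in pools.get("pools", []))

print(validate(pool_list))
print(json.dumps(pool_list, indent=2))
```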