# Scoring infrastructure
## UW API
GET https://underwrite.rociapi.com/score/123 → `{ "creditScore": 10, "id": 123, "timestamp": 1662455521 }`
**creditScore**: 1..10, plus the special values 101 and 102
**id** aka **NFCS_ID**: an integer that maps to a bundle of wallets [wallet1, ..., walletN], e.g. ["0xA44CceF6D966d74f7d91B67796e5EFf861F43EEC", "0x9402F038CcCb9259Abb3d51a44f0EaC0D5241236"]
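A minimal sketch of a client call, using the standard `requests` library; treating the special 101/102 values as errors is an assumption:

```python
import requests

UW_BASE = "https://underwrite.rociapi.com"

def fetch_credit_score(nfcs_id: int) -> dict:
    """Fetch the credit score for an NFCS id from the UW API."""
    resp = requests.get(f"{UW_BASE}/score/{nfcs_id}", timeout=10)
    resp.raise_for_status()
    payload = resp.json()  # {"creditScore": ..., "id": ..., "timestamp": ...}
    if payload["creditScore"] > 10:
        # 101/102 fall outside 1..10; raising here is an assumption.
        raise ValueError(f"special creditScore value: {payload['creditScore']}")
    return payload
```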
## Credit score model
- Model: linear regression and random forest.
- Scraper scripts run on a GCE VM (`scrapper.rociapi.com`): https://github.com/RociFi/Scraper-Scripts/tree/feature/DE-403-aaveV3-polygon/lending
- Run manually via `run_all_flow.py`, monthly.
- Re-training process: https://docs.google.com/document/d/1uJcSl64Usb8vV4gdwHsG4pjRt1aOuIaSO8I-8blw_xE/edit
- Output goes to the folder `https://console.cloud.google.com/storage/browser/protocol/credit_score?authuser=1&cloudshell=false&pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false` (~40 GB per month).
- A DS script merges the data from all of the folders, transforming the raw bucket data into the format the model expects (a sketch follows): https://github.com/RociFi/mvp-data-analytics/blob/main/data/dataAggregator.py
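A rough sketch of what the merge step amounts to, assuming the bucket holds monthly folders of CSV dumps with a shared schema (folder layout and file format are assumptions; `dataAggregator.py` is the source of truth):

```python
from pathlib import Path

import pandas as pd

def merge_monthly_dumps(root: Path) -> pd.DataFrame:
    """Concatenate raw monthly scraper dumps into one frame.

    Assumes one subfolder per month, each holding CSVs with the
    same columns; the real logic lives in dataAggregator.py.
    """
    frames = [pd.read_csv(path) for path in sorted(root.glob("*/*.csv"))]
    merged = pd.concat(frames, ignore_index=True)
    # Months can overlap at the edges; drop exact duplicate rows.
    return merged.drop_duplicates()
```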
Open questions:
- Where the features come from
- Where the changes in https://github.com/RociFi/CreditRisk-Service/pull/43/files#diff-be128e51bb0a21c72290632d809e580ecaa72d8a6e18e84a2057935fdb359a43 come from
**Features** (all `X_to_Y` ratios; a derivation sketch follows):
`count_repays_to_count_borrows`, `avg_repay_to_avg_borrow`, `net_outstanding_to_total_borrowed`, `net_outstanding_to_total_repaid`, `count_redeems_to_count_deposits`, `total_redeemed_to_total_deposits`, `avg_redeem_to_avg_deposit`, `net_deposits_to_total_deposits`, `net_deposits_to_total_redeemed`
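Judging by the names, every feature is a ratio of per-wallet aggregates. A sketch of the derivation for a few of them; the input column names (`count_repays`, `total_borrowed`, etc.) are assumptions:

```python
import numpy as np
import pandas as pd

def ratio(num: pd.Series, den: pd.Series) -> pd.Series:
    """Elementwise num/den, with 0 where the denominator is 0."""
    return (num / den.replace(0, np.nan)).fillna(0.0)

def build_features(agg: pd.DataFrame) -> pd.DataFrame:
    """agg: one row per wallet with raw counts/sums (assumed columns)."""
    out = pd.DataFrame(index=agg.index)
    out["count_repays_to_count_borrows"] = ratio(agg["count_repays"], agg["count_borrows"])
    out["avg_repay_to_avg_borrow"] = ratio(agg["avg_repay"], agg["avg_borrow"])
    out["net_outstanding_to_total_borrowed"] = ratio(
        agg["total_borrowed"] - agg["total_repaid"], agg["total_borrowed"]
    )
    # ...the remaining six ratios follow the same pattern.
    return out
```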
Data validation: https://docs.google.com/document/d/1ayP8y7sm7_5R48A-zjWdMkdAECl4zCL-fjoXdHPVoYg/edit
Live model (https://github.com/RociFi/RociFi-microservices/):
- step1: retrieve lending tx data from the data sources (subgraphs)
- step2: n/a
- step3: retrieve dex txs
- step4: aggregate the data from step1-step3 independently of the particular lending and dex data providers, e.g. `count_repays_to_count_borrows: 12`
- the FakeDate param allows skipping fresh data to simulate the training-data period (sketch below)
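A sketch of step4 with the FakeDate cutoff applied before aggregation; the function shape, tx schema, and parameter spelling are all assumptions:

```python
from datetime import datetime

Tx = dict  # assumed shape: {"type": "borrow" | "repay" | ..., "timestamp": int}

def aggregate_step4(lending_txs: list[Tx], dex_txs: list[Tx],
                    fake_date: datetime | None = None) -> dict:
    """step4: provider-independent aggregation over step1/step3 output.

    fake_date (assumed reading of the FakeDate param): drop txs newer
    than the cutoff so a live run reproduces the training-data period.
    """
    txs = lending_txs + dex_txs
    if fake_date is not None:
        cutoff = fake_date.timestamp()
        txs = [t for t in txs if t["timestamp"] <= cutoff]
    borrows = [t for t in txs if t["type"] == "borrow"]
    repays = [t for t in txs if t["type"] == "repay"]
    return {"count_repays_to_count_borrows": len(repays) / max(len(borrows), 1)}
```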
Inconsistencies:
- code differences between step1..step4 and the scraper scripts
- different chain inputs for the scraper scripts
- different chain inputs in the DS script
APIs used:
- The Graph hosted service (full list: https://docs.google.com/document/d/1js0PFUfzb-LrtZ4d4_4yiCLfgan5FgI8-3lCJw9yb1w/edit); example query below
- Etherscan-family explorers
- Bitquery (not in use)
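A sketch of pulling lending txs from a hosted subgraph; the subgraph name, entity, and fields are hypothetical (the doc above has the actual list):

```python
import requests

SUBGRAPH_URL = "https://api.thegraph.com/subgraphs/name/aave/protocol-v2"  # example

QUERY = """
query ($user: String!) {
  borrows(first: 100, where: {user: $user}) { amount timestamp }
}
"""  # hypothetical entity/fields

def fetch_borrows(wallet: str) -> list[dict]:
    """POST a GraphQL query to the hosted service."""
    resp = requests.post(
        SUBGRAPH_URL,
        json={"query": QUERY, "variables": {"user": wallet.lower()}},
        timeout=30,
    )
    resp.raise_for_status()
    body = resp.json()
    if "errors" in body:
        raise RuntimeError(body["errors"])
    return body["data"]["borrows"]
```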
Requirements:
- Re-train automatically
- Validate results against a predefined set (sketch below)
- Use the same pipeline for training and live data
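For the validation requirement, a sketch of a regression check against a frozen wallet → expected-score set; the file format and the mismatch threshold are assumptions:

```python
import json
from typing import Callable

def validate_model(predict: Callable[[str], int], golden_path: str,
                   max_mismatch: float = 0.05) -> bool:
    """Compare model output against a predefined validation set.

    golden_path: JSON like {"0xA44C...": 7, ...} (assumed format);
    passes if at most max_mismatch of the wallets disagree.
    """
    with open(golden_path) as fh:
        golden = json.load(fh)
    mismatches = sum(1 for wallet, expected in golden.items()
                     if predict(wallet) != expected)
    return mismatches / len(golden) <= max_mismatch
```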
Problems:
- Training and live data mismatch
- Missing auto-tests
## Fraud model
Re-training process: https://docs.google.com/document/u/1/d/14OcpXROOek4WTdD8Haq7VOL2n6IQ22yw6o741GysCqA/edit
Components:
- Fraud data adapter
- Fraud API
## Coin prices
https://github.com/RociFi/coin-price-loader
- A bunch of Java scripts that transform the data
- CoinGecko free plan (via proxies) → MySQL (sketch below)
- Daily granularity
- Read-only; many customers
- Runs manually now; should run on cron in the future
- 800 GB of data
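A Python sketch of one loader iteration (the scripts themselves are described above as Java): CoinGecko's `/coins/{id}/history` endpoint is real, while the proxy rotation, MySQL schema, and credentials are assumptions:

```python
import requests
import mysql.connector  # assumption: loader writes straight into MySQL

def load_daily_price(coin_id: str, date: str, proxy: str | None = None) -> None:
    """Fetch one day's USD price from CoinGecko and upsert it into MySQL.

    date is dd-mm-yyyy, as the /history endpoint expects; rotating
    proxies work around the free-plan rate limit.
    """
    resp = requests.get(
        f"https://api.coingecko.com/api/v3/coins/{coin_id}/history",
        params={"date": date},
        proxies={"https": proxy} if proxy else None,
        timeout=30,
    )
    resp.raise_for_status()
    usd = resp.json()["market_data"]["current_price"]["usd"]
    conn = mysql.connector.connect(  # assumed DSN and table schema
        host="localhost", database="prices", user="loader", password="..."
    )
    cur = conn.cursor()
    cur.execute(
        "REPLACE INTO coin_prices (coin_id, day, usd) VALUES (%s, %s, %s)",
        (coin_id, date, usd),
    )
    conn.commit()
    cur.close()
    conn.close()
```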
TODO: research upgrade plan
TODO: deprecate Bitquery
TODO: negative prices in old JSON files
TODO: 1 day shift in coin prices
TODO: Stablecoin prices in step1