# 20th May
### General:
- Meeting on teams:
- Discussed what is left to be done:
- Complete API gateway integration
- Complete adding logging to all services
- Improve icao24 check against database injection
- Complete web app (to work correctly on all services)
- Improve look and usability of web app
- Implement unit testing for flink
- Perform load testing
- Perform some stress testing
- Do costs analysis
- Do scalability analysis
- Do reliability analysis
- Create presentation
- Al + Grik: complete logging and fixes for Go services.
- Nico + enri + Grik: work on web app.
- Nico + enri: implement unit testing for flink.
- Gio-R: finalize deploy pipeline
- Al+Gio-R: test deployment as a different user and perform load testing; consider stress testing (cost limitations) etc.
### Al & Grik:
- Implemented better icao24 check in aircraft info and positions.
- Implemented logs in aircraft list
- Implemented logs in 30d history
- Implemented logs in realtime history
- Implemented logs in history calculator
- Changed 30d history to also return the values used for the interval and the resolution.
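A minimal sketch of the kind of icao24 validation involved (the actual check in the services may differ): an ICAO 24-bit address is exactly six hex characters, so anything else can be rejected before the value gets near a database query.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// icao24Pattern matches exactly six hexadecimal characters,
// the textual form of a 24-bit ICAO aircraft address.
var icao24Pattern = regexp.MustCompile(`^[0-9a-f]{6}$`)

// validIcao24 normalizes the input to lower case and checks it against
// the pattern, so anything else (SQL fragments, paths, over-long
// strings) is rejected before it reaches the database.
func validIcao24(s string) bool {
	return icao24Pattern.MatchString(strings.ToLower(s))
}

func main() {
	fmt.Println(validIcao24("4CA1D2"))          // well-formed address -> true
	fmt.Println(validIcao24("abc12"))           // too short -> false
	fmt.Println(validIcao24("abc123; DROP --")) // injection attempt -> false
}
```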
### nico:
### Grik:
### enri:
# 21st May
### Al & Grik:
- Solved conflicts between log changes and develop
- Tested code after conflict solve.
- Removed the ADC env var check; now always use ADC for authorization.
- Merged logs and icao check changes into develop.
- Removed all log related env vars to simplify deployment configuration.
# 23rd May
### Al (with help from Gio-R):
- Set up terraform and everything needed to deploy on personal gcloud.
- Followed the instructions to test deployment on a different cloud from scratch.
- Diagnosed various bugs and issues with Gio-R that were encountered along the way.
- Managed to get the infrastructure up but there were still a couple of issues to fix.
- The airspace positions icao24 check was incorrect and was refusing valid ICAOs.
# 24th May
### Al (with help from the others):
- Refactored aircraft positions code to use different files for each structure (client, pubsublistener, hub, timer)
- Fixed incorrect Icao24 check in aircraft positions that was rejecting valid icaos.
- Fixed a terraform issue that wasn't pulling the correct image for flink
- Fixed a terraform API gateway configuration issue that was blocking access to static resources of web app.
- Tested Web app interface with the others, found some issues and improvements to handle:
- Show the selected icao in the aircraft list dropdown
- Don't use alert when the websocket is disconnected.
- Show some sign of life when a request fails for some reason.
- Show some indicator that the websocket is listening for positions.
- Fix realtime history viewer to use timestamp instead of starttime.
- Fix 30d history viewer to access "history": field of the json response.
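A minimal Go sketch of what a hub of this shape can look like; the names and types are illustrative, not the actual service code. A read-write lock lets many concurrent broadcasts proceed while registration briefly takes the write lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans position updates out to websocket clients, keyed by icao24.
// This is an illustrative sketch, not the real positions service.
type Hub struct {
	mu      sync.RWMutex
	clients map[string]map[chan string]struct{} // icao24 -> set of client channels
}

func NewHub() *Hub {
	return &Hub{clients: make(map[string]map[chan string]struct{})}
}

// Register adds a client channel for an icao24; takes the write lock.
func (h *Hub) Register(icao string, ch chan string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.clients[icao] == nil {
		h.clients[icao] = make(map[chan string]struct{})
	}
	h.clients[icao][ch] = struct{}{}
}

// Broadcast delivers a position to every client listening on that icao24.
// Reads dominate, so the read lock lets many broadcasts run concurrently.
func (h *Hub) Broadcast(icao, msg string) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for ch := range h.clients[icao] {
		select {
		case ch <- msg: // deliver
		default: // drop instead of blocking on a slow client
		}
	}
}

func main() {
	h := NewHub()
	ch := make(chan string, 1)
	h.Register("4ca1d2", ch)
	h.Broadcast("4ca1d2", `{"lat":45.5,"lon":9.2}`)
	fmt.Println(<-ch) // prints the broadcast position
}
```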
# 26th May
### Al (with help from Gio-R and enri):
- Tested changes to the web app.
- Tested changes in terraform api enabling
- Started learning how to use gatling to write load tests
# 27th May
### Al & Grik:
- Wrote and tried some load testing scenarios using gatling:
- Aircraft list
- Aircraft list + Aircraft info
- Airspace realtime history
- Fought a lot with Gatling syntax, and had to recompile each time to check whether it actually worked (no IntelliSense).
- Performed stress testing for aircraft list.
- Performed stress testing for aircraft list locally after changing:
- Storing the list in bytes to avoid conversion on request time
- Avoid the conversion step inside the state.Write() call (shouldn't matter)
- Found some issues with container scaling and fixed it.
- Moved max container concurrency back to 100 based on stress-testing results.
- Gatling seems to run into issues with a high volume of requests (400 concurrent).
- Using hey (https://github.com/rakyll/hey) for stress testing seemed to work fine, on the other hand.
- Further testing from Al found that aircraft list instances on cloud would still not scale under heavy load.
# 28th May
### Al:
- Local testing of Gatling shows that it apparently runs out of virtual TCP ports, causing errors and incorrect results.
- Using hey instead did not hit the problem.
- hey local testing showed aircraft list able to sustain 500 rps with a 99th-percentile response time under 0.031 s and an average of 0.0023 s.
- Changed aircraft info to avoid converting string to []byte in request handler. Store the list directly as byte array.
- Redeployed and tested on the deployed aircraft info using hey.
- Discussed with Gio-R about ability to clean up some left over things from deployment.
- Discussed with everyone about running and keeping up a production deployment on one of our accounts, using that account only for the deployment and hoping it would last.
- Tried registering for the 300 dollar free trial, but it requires a credit card (prepaid not ok), and the PayPal option failed with an unspecified error. Gio-R also had the same issue.
- Looked into perfKit to understand whether it was something useful to us, but it seems not at this stage since we already locked on the cloud provider.
- Identified an issue when trying to update a single service in the deployment (Gio-R found solution and fixed it).
- Continued with aircraft list and info stress tests to see whether they would start scaling:
- Info service does seem to start scaling under heavy load (14 instances of 15 max at certain points), but not reliably.
- List went to a max of 2 instances, and very unreliably. It might be a problem of CPU utilization.
- The CPU utilization is low even under heavy load so something else is the bottleneck?
- Further local testing can see if bottleneck is app or something else.
- Alternative is that the second service is failing on startup but there aren't logs suggesting it.
- Testing without rate limiting requests to 500 rps got better response time results, which makes no sense.
- 100 threads, 40 000 requests:
- 10 seconds, 3670rps, 100% of requests < 0.779s response time, average of 0.026s.
- For some reason rate limiting requests to 1 per second per thread is taking more time?
- Perhaps because all the 500 requests happen at the exact same time so some are delayed, but still it doesn't make sense.
- Maybe the 500 concurrent were triggering the scaling which slowed overall performance due to instance creation in the background.
- Talked with enri about cost estimation to model number of requests and accurately quantify resource usage for cloud run apps.
- Most important factors that influence the cost are total number of requests and the CPU time for single request.
- A service under constant 200rps 24/7 at 20ms per request would cost ~190$ a month. Those are very pessimistic numbers.
- Both 200rps and 20ms are higher than the model and cpu time actual values.
- Kubernetes, on the other hand, managed to cost more money today doing nothing than the Cloud Run services being stress tested.
- Tried stress testing realtime history service as well but found that the backend cluster had actually failed at some point:
- No data was generated on the db for today
- Looking at the log showed errors about IP_SPACE_EXHAUSTED.
- Gio-R had already encountered this before but measures should have been in place to avoid it so we don't know what went wrong yet.
- Spent time trying to destroy the deployment but it failed and had to be cleaned up manually.
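The cost model above reduces to a small calculation. The per-vCPU-second price below is an assumed placeholder (check the actual Cloud Run rate card); with the pessimistic 200 rps and 20 ms figures it lands near the ~190$/month estimate.

```go
package main

import "fmt"

func main() {
	// Pessimistic load assumptions from the discussion: 200 rps around
	// the clock, 20 ms of CPU per request (both above measured values).
	const (
		rps          = 200.0
		cpuPerReqSec = 0.020
		secPerMonth  = 30 * 24 * 3600.0
	)
	// Assumed per-vCPU-second price -- a placeholder, not the real rate.
	const pricePerCPUSec = 0.0000183

	requests := rps * secPerMonth      // total requests in a month
	cpuSeconds := requests * cpuPerReqSec // total CPU time billed
	fmt.Printf("%.0fM requests, %.1fM vCPU-s, ~$%.0f/month\n",
		requests/1e6, cpuSeconds/1e6, cpuSeconds*pricePerCPUSec)
	// -> 518M requests, 10.4M vCPU-s, ~$190/month
}
```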
# 29th May
### Al:
- Looked into high kubernetes costs from the day before
- Found potential ways to reduce the costs a bit, but ended up discussing on-premise hosting.
- Gio-R created a docker-compose for flink and history-calculator to run locally
- Discussed with Gio-R about using single image registry and implementing automatic image publishing on github push.
- Al tested running back end on spare machine.
- Encountered and fixed various issues to start it.
- When it finally started, it crashed with a socket timeout.
- Diagnosed the issue in various ways until we found that the OpenSky API was responding after 60 seconds or giving bad gateway.
- Nico changed flink code to survive exceptions when calling OpenSky.
- Tested and iterated on the changes with Nico until the backend was able to survive without opensky.
- Liaised with Enri and Nico for their cost analysis regarding Firestore data access patterns and other values.
- Deployed front facing services on cloud for stress and load testing.
- Tried helping Grik to get Gatling tests to work.
- Grik managed to set up websocket load testing, but there were still issues with the ws not closing, so the test never ended.
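The actual "survive without OpenSky" change is in Nico's Flink code; the pattern, sketched in Go with a stubbed-out fetch, is simply to treat an upstream failure as an empty batch instead of letting the exception kill the pipeline.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchStates stands in for the OpenSky API call; here it always fails
// to demonstrate the wrapper. The real client and types differ.
func fetchStates() ([]string, error) {
	return nil, errors.New("502 bad gateway")
}

// fetchStatesSafe swallows upstream failures and returns an empty
// batch, so the pipeline keeps running instead of crashing when
// OpenSky times out or returns a bad gateway.
func fetchStatesSafe() []string {
	states, err := fetchStates()
	if err != nil {
		fmt.Println("opensky unavailable, skipping batch:", err)
		return nil
	}
	return states
}

func main() {
	batch := fetchStatesSafe()
	fmt.Println("batch size:", len(batch)) // prints: batch size: 0
}
```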
# 30th May
### General:
- Meeting on teams:
- Discussed what is left to do:
- Complete loadtesting
- Complete costs analysis
- complete stress testing
- Deploy app on main
- Start producing the presentations.
- Discussed what to bring to presentation and how to split things to talk about
- Focus on things that are important for the course, reliability, scalability, security, costs etc
- Nico:
- Presentation of project objectives and architecture (requirements, expected load, architecture)
- Explain limitations of time and budget
- give overview of what every person did
- Start explaining the backend
- Why flink
- Choices during flink implementation
- scalability of backend
- reliability of backend (fault resistant)
- Costs of backend (Flink is chunky)
- Al:
- Front facing services (using microservices rather than using monolithic application)
- Objectives of front facing services
- Language decision (use scala or something else?) (gin)
- Distroless images
- Aircraft info service
- Database embedded and considered alternatives
- Icao 24 check for database
- Immutability
- Positions service
- Considered alternatives to the architecture and why we choose this.
- Cloud Pub Sub and why
- Limitations of Pub/Sub
- How websockets are handled
- listener for each icao
- icao check
- Subscription UUIDs
- Gracefully handle errors on ws read, ws write, pb error, pb timeout.
- Usage of readWrite locks to allow high concurrency while avoiding inconsistent state.
- scalability of positions
- reliability of positions
- Logging
- Load testing
- stress testing
- Grik:
- The remaining services have to do with history and the aircraft list.
- Big query to compute history. Very good on paper. Low complexity of code, db does all the work.
- Testing big query bad perf
(Al): Well, it's 3:00 AM; you all can decide later whether you have the will or the time to finish this part.
### Al, Grik:
- Completed implementation of loadtesting:
- 2 new users per second using the app for 3 minutes, using all the available services. 250 concurrently active, 600 users in 5 minutes.
- Performed testing, found 95% of results optimal, mean result below 100 ms.
- Very few outlier errors (9 ws errors in 20k), some long response times.
### Al
- Continued monitoring local backend deployment for issues using logs.
- Identified issue with 5m data having timestamps that weren't aligned to 5 minutes (probably clock drift in Flink)
- This created issues with the history calculator that would not be able to correctly pinpoint the documents later for subtraction operations.
- Notified Nico, who immediately managed to find a solution and fix it (subtract the previous timestamp to see if the difference is 5 m; if not, approximate based on whether the difference is higher or lower than 5)
- Identified potential issue with airplanes distance values (too high based on personal approximations)
- Waited until 2 AM to observe whether the daily value and db cleanup worked correctly at the end of the day.
- Identified an issue with database cleanup, old values were not deleted correctly.
- Found and fixed the bug, which was using time.Time values instead of string dates in the database query.
# 31st May
### Al:
- Consulted with Nico about the incorrect distance calculation.
- Nico found the issue and implemented a fix.
- Tested the fix and verified that the new values were realistic.
- Deployed services on cloud for stress testing.
### Al & Grik:
- Identified all the endpoints to test (can't test ws due to most stress test tools not supporting them)
- Researched best practices and tools for stress testing
- Started performing stress testing using Hey on cloud deployed services:
- 100 threads, 60s, from cable connected 8 core 16 threads machine.
- Results for most services were around 4000 rps, with 3-12 replicated instances on the cloud side.
- Recorded result data for each endpoint using the same test parameters.
- Monitoring resource usage on test source machine we found that the uplink bandwidth was being a bottleneck (10MBps)
- Increasing the number of threads would only increase the latency due to more requests waiting for bandwidth.
- endpoints/positions/url showed higher latency and around 100 504-response errors. Highest latency was 15 s.
- Looking into the issue, it appears that the gateway was not able to find the service instance in those cases, for some reason.
- This was highly repeatable for only that endpoint; all other endpoints never showed this problem, even with multiple tests.
- Bandwidth consumption was very low for /url endpoint, and the code serving it is extremely simple.
- Tested the endpoint locally with impeccable performance and 0 errors.
- We looked into potential configuration differences of the endpoint but found nothing.
- We have no idea what the issue might be.
- Surprisingly monthly history performed exceptionally well considering it requires access to the db for each request.
- We looked into cloud based solutions to workaround the bandwidth limitation (all other group members have laptops and no cabled connection).
- Basically all of them require credit card registration and cost money, or their free tier offers too low a load.
- Gatling has a 2-week enterprise trial. Still have to see if it asks for a credit card.
- Found that with Hey, if one request was stalling, the thread would stall and overall rps would drop.
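The bandwidth ceiling above is simple arithmetic; the average per-request wire size below is a hypothetical figure, not something we measured.

```go
package main

import "fmt"

// With the test machine's link capped at ~10 MB/s, throughput tops out
// at bandwidth / bytes-per-request, regardless of how many threads run.
func main() {
	const bandwidthBps = 10_000_000.0 // ~10 MB/s observed cap
	const bytesPerReq = 2_500.0       // assumed average wire size per request
	fmt.Printf("max ~%.0f rps before the link saturates\n", bandwidthBps/bytesPerReq)
	// -> max ~4000 rps before the link saturates
}
```

Adding more threads past that point only queues requests behind the saturated link, which matches the latency increase we observed.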
### Al:
- Performed stress test with Hey at the same time as Gio-R to see whether the services would be able to hold (approx) double the load.
- Both uplinks were saturated, and the rps remained as high as in tests from a single machine.
- So the services seem able to scale much further than our bandwidth.
- Later asked another friend with 20 MBps up, and we managed to get 12000 cumulative requests per second.
- Looked for hey alternatives. Found wrk cited in various places by other developers.
- wrk needs to be compiled and is Linux-only.
- Used Windows Subsystem for Linux to compile it (took 30 m).
- With wrk you can define the number of connections each thread handles; if one connection is stalling, the thread continues to work on all the other ones.
- Ran wrk on aircraft list with slightly better results than hey, but still bandwidth limited.
- Ran wrk on endpoints/positions/url. Same errors, but overall better rps (800->2800).
- Consulted with Gio-R about finally pushing everything onto main.
- Did the PR and found that apparently logging also needs to have parameters validated.
- Ignored the problem for now to see if everything else worked correctly and it did.
# 1st June
### Al:
- Tried creating a docker-compose for front facing services to more easily test them locally.
- Docker compose was getting stuck on attaching to the running containers; they are distroless, so they don't have a shell.
- After some struggling finally gave up and started all the services manually.
- Used wrk to test all the services locally:
- info, realtime history, and endpoints/url were exceptional, at around 100 000 rps
- list was a bit slower at 65 000 rps, likely due to the bigger message to send.
- web UI was at 6k rps, we hypothesize due to the size of the data and the fact that it is serving files from the fs.
- monthly history was predictably the slowest at 2k rps.
- This is expected since it needs to query the db on each request, read the db data, and re-process it for the user.
- Spent time trying to get Gatling to work for testing the websockets.
- When it finally worked it registered that no messages were received
- Dubiously assumed it to be ok since manual testing shows that it actually receives the messages but only after a few seconds.
- Gatling is sadly unable to work at high loads (500 rps) even in a local environment; the connections are simply refused outright.
- Tested up to 200rps and it worked fine.
- We don't have any tool to test this well.
- Only solution would be to develop our own tool and hope we don't mess up.
- Time budget will not allow it.
- As it is we are unable to stress test the websocket endpoint past 200rps. We have no reason to suspect issues but testing would be much better.
# 2nd June
### Al:
- Started deployment of app from github to start producing history values and watch for any issues.
- Used Gio-R's architecture image from Excalidraw to make a high-res image of the architecture (small refinements).
- Added image to repo readme.
- Changed the minimum number of service instances from 0 to 1 to provide a more responsive experience to new users.
## Al & Grik:
- Started working together on presentation to split what to talk about each.
- Already started working on the slides.
# 3rd June
### General:
- Meeting on teams:
- Shared planned topics for each person and discussed how to optimize the presentation to avoid splitting up the talking too much.
- Did a small preview for each person for the presentation.
- Discussed which topics to leave as additional slides, what to add or remove etc.
- Gio-R identified an issue with 30d-history 1d-bucket values recording hourly values instead of daily values.
- Al found and fixed the bug in the code.
## Al & Grik:
- Continued work on slides and presentation.
# xxth June
### General:
### Al:
### Gio-R:
### nico:
### Grik:
### enri: