# 20th May
### General:
- Meeting on teams:
- Discussed what is left to be done:
- Complete API gateway integration
- Complete adding logging to all services
- Improve icao24 check against database injection
- Complete web app (to work correctly on all services)
- Improve look and usability of web app
- Implement unit testing for flink
- Perform load testing
- Perform some stress testing
- Do costs analysis
- Do scalability analysis
- Do reliability analysis
- Create presentation
- Al + Grik: complete logging and fixes for Go services.
- Nico + enri + Grik: work on web app.
- Nico + enri: implement unit testing for flink.
- Gio-R: finalize deploy pipeline
- Al+Gio-R: test deployment as a different user and perform load testing; consider stress testing (cost limitations) etc.
### Al & Grik:
- Implemented better icao24 check in aircraft info and positions.
- Implemented logs in aircraft list
- Implemented logs in 30d history
- Implemented logs in realtime history
- Implemented logs in history calculator
- Changed 30d history to also return the values used for the interval and the resolution.
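A minimal sketch of the kind of icao24 validation involved (the actual check in the services may differ): an ICAO 24-bit address is exactly six hex characters, so anything else can be rejected before the value gets near a database query.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// icao24Pattern matches exactly six hexadecimal characters,
// the textual form of a 24-bit ICAO aircraft address.
var icao24Pattern = regexp.MustCompile(`^[0-9a-f]{6}$`)

// validIcao24 normalizes the input to lower case and checks it against
// the pattern, so anything else (SQL fragments, paths, over-long
// strings) is rejected before it reaches the database.
func validIcao24(s string) bool {
	return icao24Pattern.MatchString(strings.ToLower(s))
}

func main() {
	fmt.Println(validIcao24("4CA1D2"))          // well-formed address -> true
	fmt.Println(validIcao24("abc12"))           // too short -> false
	fmt.Println(validIcao24("abc123; DROP --")) // injection attempt -> false
}
```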
### nico:
### Grik:
### enri:
# 21st May
### Al & Grik:
- Solved conflicts between log changes and develop
- Tested code after conflict solve.
- Removed the ADC env var check; now always use ADC for authorization.
- Merged logs and icao check changes into develop.
- Removed all log related env vars to simplify deployment configuration.
# 23rd May
### Al (with help from Gio-R):
- Set up terraform and everything needed to deploy on personal gcloud.
- Followed the instructions to test deployment on a different cloud from scratch.
- Diagnosed various bugs and issues with Gio-R that were encountered along the way.
- Managed to get the infrastructure up but there were still a couple of issues to fix.
- The airspace positions icao24 check was incorrect and was refusing valid ICAOs.
# 24th May
### Al (with help from the others):
- Refactored aircraft positions code to use different files for each structure (client, pubsublistener, hub, timer)
- Fixed incorrect Icao24 check in aircraft positions that was rejecting valid icaos.
- Fixed a terraform issue that wasn't pulling the correct image for flink
- Fixed a terraform API gateway configuration issue that was blocking access to static resources of web app.
- Tested Web app interface with the others, found some issues and improvements to handle:
- Show the selected icao in the aircraft list dropdown
- Don't use alert when the websocket is disconnected.
- Show some sign of life when a request fails for some reason.
- Show some indicator that the websocket is listening for positions.
- Fix realtime history viewer to use timestamp instead of starttime.
- Fix 30d history viewer to access "history": field of the json response.
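A minimal Go sketch of what a hub of this shape can look like; the names and types are illustrative, not the actual service code. A read-write lock lets many concurrent broadcasts proceed while registration briefly takes the write lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans position updates out to websocket clients, keyed by icao24.
// This is an illustrative sketch, not the real positions service.
type Hub struct {
	mu      sync.RWMutex
	clients map[string]map[chan string]struct{} // icao24 -> set of client channels
}

func NewHub() *Hub {
	return &Hub{clients: make(map[string]map[chan string]struct{})}
}

// Register adds a client channel for an icao24; takes the write lock.
func (h *Hub) Register(icao string, ch chan string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.clients[icao] == nil {
		h.clients[icao] = make(map[chan string]struct{})
	}
	h.clients[icao][ch] = struct{}{}
}

// Broadcast delivers a position to every client listening on that icao24.
// Reads dominate, so the read lock lets many broadcasts run concurrently.
func (h *Hub) Broadcast(icao, msg string) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for ch := range h.clients[icao] {
		select {
		case ch <- msg: // deliver
		default: // drop instead of blocking on a slow client
		}
	}
}

func main() {
	h := NewHub()
	ch := make(chan string, 1)
	h.Register("4ca1d2", ch)
	h.Broadcast("4ca1d2", `{"lat":45.5,"lon":9.2}`)
	fmt.Println(<-ch) // prints the broadcast position
}
```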
# 26th May
### Al (with help from Gio-R and enri):
- Tested changes to the web app.
- Tested changes in terraform api enabling
- Started learning how to use gatling to write load tests
# 27th May
### Al & Grik:
- Wrote and tried some load testing scenarios using gatling:
- Aircraft list
- Aircraft list + Aircraft info
- Airspace realtime history
- Fought a lot with Gatling syntax, and had to recompile each time to check whether it actually worked (no IntelliSense).
- Performed stress testing for aircraft list.
- Performed stress testing for aircraft list locally after changing:
- Storing the list in bytes to avoid conversion on request time
- Avoid the conversion step inside the state.Write() call (shouldn't matter)
- Found some issues with container scaling and fixed it.
- Moved max container concurrency back to 100 based on stress-testing results.
- Gatling seems to run into issues with a high volume of requests (400 concurrent).
- Using hey (https://github.com/rakyll/hey) for stress testing seemed to work fine, on the other hand.
- Further testing from Al found that aircraft list instances on cloud would still not scale under heavy load.
# 28th May
### Al:
- Local testing of Gatling shows that it apparently runs out of virtual TCP ports, causing errors and incorrect results.
- Using hey instead did not hit the problem.
- hey local testing showed aircraft list able to sustain 500 rps with a 99th-percentile response time under 0.031 s and an average of 0.0023 s.
- Changed aircraft info to avoid converting string to []byte in request handler. Store the list directly as byte array.
- Redeployed and tested on the deployed aircraft info using hey.
- Discussed with Gio-R about ability to clean up some left over things from deployment.
- Discussed with everyone about running and keeping up a production deployment on one of our accounts, using that account only for the deployment and hoping it would last.
- Tried registering for the 300 dollar free trial, but it requires a credit card (prepaid not ok), and the PayPal option failed with an unspecified error. Gio-R also had the same issue.
- Looked into perfKit to understand whether it was something useful to us, but it seems not at this stage since we already locked on the cloud provider.
- Identified an issue when trying to update a single service in the deployment (Gio-R found solution and fixed it).
- Continued with aircraft list and info stress tests to see whether they would start scaling:
- Info service does seem to start scaling under heavy load (14 instances of 15 max at certain points), but not reliably.
- List went to a max of 2 instances, and very unreliably. It might be a problem of CPU utilization.
- The CPU utilization is low even under heavy load so something else is the bottleneck?
- Further local testing can see if bottleneck is app or something else.
- Alternative is that the second service is failing on startup but there aren't logs suggesting it.
- Testing without rate limiting requests to 500 rps got better response time results, which makes no sense.
- 100 threads, 40 000 requests:
- 10 seconds, 3670rps, 100% of requests < 0.779s response time, average of 0.026s.
- For some reason rate limiting requests to 1 per second per thread is taking more time?
- Perhaps because all the 500 requests happen at the exact same time so some are delayed, but still it doesn't make sense.
- Maybe the 500 concurrent were triggering the scaling which slowed overall performance due to instance creation in the background.
- Talked with enri about cost estimation to model number of requests and accurately quantify resource usage for cloud run apps.
- Most important factors that influence the cost are total number of requests and the CPU time for single request.
- A service under constant 200rps 24/7 at 20ms per request would cost ~190$ a month. Those are very pessimistic numbers.
- Both 200rps and 20ms are higher than the model and cpu time actual values.
- Kubernetes, on the other hand, managed to cost more money today doing nothing than the Cloud Run services being stress tested.
- Tried stress testing realtime history service as well but found that the backend cluster had actually failed at some point:
- No data was generated on the db for today
- Looking at the log showed errors about IP_SPACE_EXHAUSTED.
- Gio-R had already encountered this before but measures should have been in place to avoid it so we don't know what went wrong yet.
- Spent time trying to destroy the deployment but it failed and had to be cleaned up manually.
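The cost model above reduces to a small calculation. The per-vCPU-second price below is an assumed placeholder (check the actual Cloud Run rate card); with the pessimistic 200 rps and 20 ms figures it lands near the ~190$/month estimate.

```go
package main

import "fmt"

func main() {
	// Pessimistic load assumptions from the discussion: 200 rps around
	// the clock, 20 ms of CPU per request (both above measured values).
	const (
		rps          = 200.0
		cpuPerReqSec = 0.020
		secPerMonth  = 30 * 24 * 3600.0
	)
	// Assumed per-vCPU-second price -- a placeholder, not the real rate.
	const pricePerCPUSec = 0.0000183

	requests := rps * secPerMonth      // total requests in a month
	cpuSeconds := requests * cpuPerReqSec // total CPU time billed
	fmt.Printf("%.0fM requests, %.1fM vCPU-s, ~$%.0f/month\n",
		requests/1e6, cpuSeconds/1e6, cpuSeconds*pricePerCPUSec)
	// -> 518M requests, 10.4M vCPU-s, ~$190/month
}
```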
# 29th May
### Al:
- Looked into high kubernetes costs from the day before
- Found potential ways to reduce the costs a bit, but ended up discussing on-premise hosting.
- Gio-R created a docker-compose for flink and history-calculator to run locally
- Discussed with Gio-R about using single image registry and implementing automatic image publishing on github push.
- Al tested running back end on spare machine.
- Encountered and fixed various issues to start it.
- When it finally started, it crashed with a socket timeout.
- Diagnosed the issue in various ways until we found that the OpenSky API was responding after 60 seconds or giving bad gateway.
- Nico changed flink code to survive exceptions when calling OpenSky.
- Tested and iterated on the changes with Nico until the backend was able to survive without opensky.
- Liaised with Enri and Nico for their cost analysis regarding Firestore data access patterns and other values.
- Deployed front facing services on cloud for stress and load testing.
- Tried helping Grik to get Gatling tests to work.
- Grik managed to set up websocket load testing, but there were still issues with the ws not closing, so the test never ended.
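The actual "survive without OpenSky" change is in Nico's Flink code; the pattern, sketched in Go with a stubbed-out fetch, is simply to treat an upstream failure as an empty batch instead of letting the exception kill the pipeline.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchStates stands in for the OpenSky API call; here it always fails
// to demonstrate the wrapper. The real client and types differ.
func fetchStates() ([]string, error) {
	return nil, errors.New("502 bad gateway")
}

// fetchStatesSafe swallows upstream failures and returns an empty
// batch, so the pipeline keeps running instead of crashing when
// OpenSky times out or returns a bad gateway.
func fetchStatesSafe() []string {
	states, err := fetchStates()
	if err != nil {
		fmt.Println("opensky unavailable, skipping batch:", err)
		return nil
	}
	return states
}

func main() {
	batch := fetchStatesSafe()
	fmt.Println("batch size:", len(batch)) // prints: batch size: 0
}
```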
# 30th May
### General:
- Meeting on teams:
- Discussed what is left to do:
- Complete loadtesting
- Complete costs analysis
- complete stress testing
- Deploy app on main
- Start producing the presentations.
- Discussed what to bring to presentation and how to split things to talk about
- Focus on things that are important for the course, reliability, scalability, security, costs etc
- Nico:
- Presentation of project objectives and architecture (requirements, expected load, architecture)
- Explain limitations of time and budget
- give overview of what every person did
- Start explaining the backend
- Why flink
- Choices during flink implementation
- scalability of backend
- reliability of backend (fault resistant)
- Costs of backend (Flink is chunky)
- Al:
- Front facing services (using microservices rather than using monolithic application)
- Objectives of front facing services
- Language decision (use scala or something else?) (gin)
- Distroless images
- Aircraft info service
- Database embedded and considered alternatives
- Icao 24 check for database
- Immutability
- Positions service
- Considered alternatives to the architecture and why we choose this.
- Cloud Pub Sub and why
- Limitations of Pub/Sub
- How websockets are handled
- listener for each icao
- icao check
- Subscription UUIDs
- Gracefully handle errors on ws read, ws write, pb error, pb timeout.
- Usage of readWrite locks to allow high concurrency while avoiding inconsistent state.
- scalability of positions
- reliability of positions
- Logging
- Load testing
- stress testing
- Grik:
- The remaining services have to do with history and the aircraft list.
- Big query to compute history. Very good on paper. Low complexity of code, db does all the work.
- Testing big query bad perf
(Al): Well, it's 3:00 AM; you all can decide later whether you have the will or the time to finish this part.
### Al, Grik:
- Completed implementation of loadtesting:
- 2 new users per second using the app for 3 minutes, using all the available services. 250 concurrently active, 600 users in 5 minutes.
- Performed testing, found 95% of results optimal, mean result below 100 ms.
- Very few outlier errors (9 ws errors in 20k), some long response times.
### Al
- Continued monitoring local backend deployment for issues using logs.
- Identified issue with 5m data having timestamps that weren't aligned to 5 minutes (probably clock drift in Flink)
- This created issues with the history calculator that would not be able to correctly pinpoint the documents later for subtraction operations.
- Notified Nico, who immediately managed to find a solution and fix it (subtract the previous timestamp to see if the difference is 5 m; if not, approximate based on whether the difference is higher or lower than 5)
- Identified potential issue with airplanes distance values (too high based on personal approximations)
- Waited until 2 AM to observe whether the daily value and db cleanup worked correctly at the end of the day.
- Identified an issue with database cleanup, old values were not deleted correctly.
- Found and fixed the bug, which was using time.Time values instead of string dates in the database query.
# 31st May
### Al:
- Consulted with Nico about the incorrect distance calculation.
- Nico found the issue and implemented a fix.
- Tested the fix and verified that the new values were realistic.
- Deployed services on cloud for stress testing.
### Al & Grik:
- Identified all the endpoints to test (can't test ws due to most stress test tools not supporting them)
- Researched best practices and tools for stress testing
- Started performing stress testing using Hey on cloud deployed services:
- 100 threads, 60s, from cable connected 8 core 16 threads machine.
- Results for most services were around 4000 rps, with 3-12 replicated instances on the cloud side.
- Recorded result data for each endpoint using the same test parameters.
- Monitoring resource usage on test source machine we found that the uplink bandwidth was being a bottleneck (10MBps)
- Increasing the number of threads would only increase the latency due to more requests waiting for bandwidth.
- endpoints/positions/url showed higher latency and around 100 504-response errors. Highest latency was 15 s.
- Looking into the issue, it appears that the gateway was not able to find the service instance in those cases, for some reason.
- This was highly repeatable for only that endpoint; all other endpoints never showed this problem, even with multiple tests.
- Bandwidth consumption was very low for /url endpoint, and the code serving it is extremely simple.
- Tested the endpoint locally with impeccable performance and 0 errors.
- We looked into potential configuration differences of the endpoint but found nothing.
- We have no idea what the issue might be.
- Surprisingly monthly history performed exceptionally well considering it requires access to the db for each request.
- We looked into cloud based solutions to workaround the bandwidth limitation (all other group members have laptops and no cabled connection).
- Basically all of them require credit card registration and cost money, or their free tier offers too low a load.
- Gatling has a 2-week enterprise trial. Still have to see if it asks for a credit card.
- Found that with Hey, if one request was stalling, the thread would stall and overall rps would drop.
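The bandwidth ceiling above is simple arithmetic; the average per-request wire size below is a hypothetical figure, not something we measured.

```go
package main

import "fmt"

// With the test machine's link capped at ~10 MB/s, throughput tops out
// at bandwidth / bytes-per-request, regardless of how many threads run.
func main() {
	const bandwidthBps = 10_000_000.0 // ~10 MB/s observed cap
	const bytesPerReq = 2_500.0       // assumed average wire size per request
	fmt.Printf("max ~%.0f rps before the link saturates\n", bandwidthBps/bytesPerReq)
	// -> max ~4000 rps before the link saturates
}
```

Adding more threads past that point only queues requests behind the saturated link, which matches the latency increase we observed.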
### Al:
- Performed stress test with Hey at the same time as Gio-R to see whether the services would be able to hold (approx) double the load.
- Both uplinks were saturated, and the rps remained as high as in tests from a single machine.
- So the services seem able to scale much further than our bandwidth.
- Later asked another friend with 20 MBps up, and we managed to get 12000 cumulative requests per second.
- Looked for hey alternatives. Found wrk cited in various places by other developers.
- wrk needs to be compiled and is Linux-only.
- Used Windows Subsystem for Linux to compile it (took 30 m).
- With wrk you can define the number of connections each thread handles; if one connection is stalling, the thread continues to work on all the other ones.
- Ran wrk on aircraft list with slightly better results than hey, but still bandwidth limited.
- Ran wrk on endpoints/positions/url. Same errors, but overall better rps (800->2800).
- Consulted with Gio-R about finally pushing everything onto main.
- Did the PR and found that apparently logging also needs to have parameters validated.
- Ignored the problem for now to see if everything else worked correctly and it did.
# 1st June
### Al:
- Tried creating a docker-compose for front facing services to more easily test them locally.
- Docker compose was getting stuck on attaching to the running containers; they are distroless, so they don't have a shell.
- After some struggling finally gave up and started all the services manually.
- Used wrk to test all the services locally:
- info, realtime history, and endpoints/url were exceptional, at around 100 000 rps
- list was a bit slower at 65 000 rps, likely due to the bigger message to send.
- web UI was at 6k rps, we hypothesize due to the size of the data and the fact that it is serving files from the fs.
- monthly history was predictably the slowest at 2k rps.
- This is expected since it needs to query the db on each request, read the db data, and re-process it for the user.
- Spent time trying to get Gatling to work for testing the websockets.
- When it finally worked it registered that no messages were received
- Dubiously assumed it to be ok since manual testing shows that it actually receives the messages but only after a few seconds.
- Gatling is sadly unable to work at high loads (500 rps) even in a local environment; the connections are simply refused outright.
- Tested up to 200rps and it worked fine.
- We don't have any tool to test this well.
- Only solution would be to develop our own tool and hope we don't mess up.
- Time budget will not allow it.
- As it is we are unable to stress test the websocket endpoint past 200rps. We have no reason to suspect issues but testing would be much better.
# 2nd June
### Al:
- Started deployment of app from github to start producing history values and watch for any issues.
- Used Gio-R's architecture image from Excalidraw to make a high-res image of the architecture (small refinements).
- Added image to repo readme.
- Changed the minimum number of service instances from 0 to 1 to provide a more responsive experience to new users.
## Al & Grik:
- Started working together on presentation to split what to talk about each.
- Already started working on the slides.
# 3rd June
### General:
- Meeting on teams:
- Shared planned topics for each person and discussed how to optimize the presentation to avoid splitting up the talking too much.
- Did a small preview for each person for the presentation.
- Discussed which topics to leave as additional slides, what to add or remove etc.
- Gio-R identified an issue with 30d-history 1d-bucket values recording hourly values instead of daily values.
- Al found and fixed the bug in the code.
## Al & Grik:
- Continued work on slides and presentation.
# xxth June
### General:
### Al:
### Gio-R:
### nico:
### Grik:
### enri: