# Projects at Starlizard
## Microsoft Orleans DynamoDB clustering provider
On my own initiative, I investigated a series of seemingly unrelated performance and scalability issues affecting our Microsoft Orleans-based solution on Kubernetes, including grain activation problems, communication delays between silos and slow horizontal scaling during peak loads.
Utilizing EKS logs, I identified a significant number of delayed and timed-out Kubernetes API requests during peak demand (some requests took more than 4 seconds to complete). I then examined the relevant Kubernetes API server metrics and found that, during these peak periods, CPU usage across the two API servers averaged 1500%. Returning to the EKS audit logs with this information, I determined that most of the load was being generated by the Microsoft Orleans Kubernetes-based clustering provider.
Through their dedicated Discord channel, the Microsoft Orleans team confirmed that the Kubernetes-based clustering provider was placing unnecessary strain on the Kubernetes API servers. They also had concerns regarding the compatibility of their library with future Kubernetes versions.
To address these challenges, I wrote an Architecture Decision Record (ADR) detailing the problem, the evaluated solutions, the migration procedures and the rationale for disqualifying certain alternatives, alongside the effort required to implement the chosen solutions. I proposed transitioning to the DynamoDB-based clustering provider offered by Microsoft as a temporary fix, despite concerns over its use of SCAN operations. Furthermore, I designed a new solution leveraging the DynamoDB single-table design approach to enhance efficiency, reduce DynamoDB requests and decrease cost.
After building consensus among senior management for adopting the Microsoft Orleans DynamoDB provider as an interim measure and developing the single-table design as a long-term solution, I executed the necessary changes across our infrastructure.
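For illustration, the interim switch amounts to replacing the Kubernetes membership provider with the DynamoDB one at silo configuration time. The sketch below assumes Orleans' `Microsoft.Orleans.Clustering.DynamoDB` package; exact option names vary between Orleans versions, so treat the table name and region values as placeholders rather than the actual production configuration.

```csharp
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        // Previously: Kubernetes-based membership, which hammered the
        // Kubernetes API servers. Replaced with a DynamoDB membership table.
        silo.UseDynamoDBClustering(options =>
        {
            options.TableName = "OrleansMembership"; // placeholder table name
            options.Service = "eu-west-2";           // AWS region, per Orleans' option naming
        });
    })
    .Build();

await host.RunAsync();
```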
This change successfully addressed the encountered issues, cutting API server CPU usage from 1500% to under 100% and bringing Kubernetes API request latency below 1 second.
## Content-Logging
We had a critical issue where a customer claimed they had not received certain data items.
To address this concern, I proposed and developed **Content-Logging**, a solution enabling the cost-effective recording of .NET objects.
I used Content-Logging to record all data sent to the customer via the subscription process and conclusively determined that the discrepancies were not caused by our software, leading the customer to identify the problem on their end.
To formalize this process, I wrote an Architecture Decision Record (ADR) that laid out requirements such as data storage capability, extensibility for future data types, context-specific logging, near-zero latency and minimal operational costs. In the ADR I detailed three potential solutions, along with an assessment of each against these requirements.
The chosen solution used a fluent-bit-based forwarder-aggregator ETL process to capture the data directly from the Kubernetes logs, transform it to the desired shape and store it in its final form in AWS S3 as gzipped JSON, with AWS Glue providing table definitions and AWS Athena providing SQL-based querying. An overnight Lambda-based process seamlessly converts the ingested data to gzipped Parquet for efficient, cost-effective querying.
The main reason for disqualifying the other two solutions, based on Amazon Data Firehose and Datadog, was cost. The fluent-bit-based solution had an estimated operational cost of only $12 per data type, per environment, per year, significantly lower than the other evaluated options, which exceeded $400 on the same basis. This cost-effectiveness was crucial given the extensive variety of data types and environments that could benefit from **Content-Logging** at Starlizard.
I developed the solution end-to-end, including a reusable Terraform module for streamlined deployment to AWS and the .NET content-logging library with support for transport plugins. I also wrote comprehensive documentation for DevOps and developers.
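As a flavour of the developer-facing API, a hypothetical usage sketch is shown below; all names here (interface, methods, payload type) are illustrative rather than the actual Starlizard library, but the shape is the same: the object is written as a structured stdout log line, which the fluent-bit forwarder then ships through the ETL pipeline.

```csharp
// Hypothetical API surface, sketched inline so the example is self-contained.
public interface IContentLogger
{
    void Log(string dataType, object content);
}

public sealed record CustomerPayload(string Id, string Body); // illustrative data type

public sealed class SubscriptionPublisher
{
    private readonly IContentLogger _contentLogger;

    public SubscriptionPublisher(IContentLogger contentLogger) =>
        _contentLogger = contentLogger;

    public void Publish(CustomerPayload payload)
    {
        // Record the exact object sent to the customer, tagged with a data
        // type so the pipeline can route it to the matching S3 prefix,
        // Glue table and Athena view.
        _contentLogger.Log(dataType: "customer-subscription", content: payload);
        // ... send the payload to the customer ...
    }
}
```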
I presented the solution to team leads and senior management, and consequently **Content-Logging** is now used by most teams at Starlizard.
## Microsoft Orleans storage in DynamoDB
### Migrating the storage from MariaDB to DynamoDB
I proposed and led the migration of Microsoft Orleans data storage from MariaDB to DynamoDB, enhancing the solution performance and scalability.
I initiated the change by writing a comprehensive document that outlined the current challenges and explored potential solutions.
The potential solutions were evaluated using assessment criteria for latency, throughput, reliability, scalability, developer experience, security, cost, manageability, upgradeability, availability and integration capabilities with AWS services.
The analysis covered the performance characteristics of MariaDB, DynamoDB with both provisioned and on-demand capacity, and DynamoDB with on-demand capacity augmented by DAX. Benchmarking showed significant advantages when using DynamoDB, particularly under peak load conditions, presenting a compelling case for its adoption.
My presentation to the team, backed by detailed performance data and cost-benefit analysis, led to the consensus to (1) transition to DynamoDB with on-demand capacity for its simplified capacity provisioning and (2) as a long-term plan to transition to provisioned capacity for reduced cost.
I led the migration by integrating the necessary code changes into a common code base at Starlizard (making the DynamoDB-based storage option available across all teams), implementing the required changes to the AWS infrastructure, and then coordinating the release of this feature to our environments.
This change not only optimised our system's performance during high-demand scenarios but also streamlined the management and scalability of our data storage solution.
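For a sense of what the code-level change involved: making DynamoDB available as a grain storage option is a matter of registering the provider when building the silo. A minimal sketch, assuming Orleans' `Microsoft.Orleans.Persistence.DynamoDB` package (option names vary slightly between Orleans versions, and the table name and region are placeholders):

```csharp
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        // Register DynamoDB as the "Default" grain state store in place of
        // the previous MariaDB (ADO.NET) provider.
        silo.AddDynamoDBGrainStorage("Default", options =>
        {
            options.TableName = "OrleansGrainState"; // placeholder table name
            options.Service = "eu-west-2";           // AWS region, per Orleans' option naming
        });
    })
    .Build();

await host.RunAsync();
```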
### Microsoft Orleans reminder storage bug fix
I addressed a critical issue encountered after migrating the Microsoft Orleans storage from MariaDB to DynamoDB.
Microsoft Orleans needs to store two types of data: grain state and reminders.
Despite the successful transition for grain state storage, a significant bug in the reminders DynamoDB storage provider caused daily costs to surge to over $1000 due to excessive read capacity unit (RCU) consumption.
I quickly diagnosed the issue and developed a bug fix that dramatically reduced resource usage.
After deploying the fix, we observed a cost reduction of more than 100x. To enhance the Microsoft Orleans framework, I submitted a [pull request](https://github.com/dotnet/orleans/pull/7658) to the official GitHub repository, detailing the fix along with additional improvements, which was then promptly merged by the Microsoft Orleans team.
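The PR has the full details; as a general illustration of this class of fix (not the literal change in the PR), the difference between a filtered `Scan` and a key-based `Query` in the AWS SDK for .NET looks like this. A `Scan` consumes RCUs for every item it reads before filtering, while a `Query` only reads items under the requested key. The table and attribute names below are illustrative:

```csharp
using System.Collections.Generic;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

var client = new AmazonDynamoDBClient();
var keyValues = new Dictionary<string, AttributeValue>
{
    [":h"] = new AttributeValue { N = "12345" } // "GrainHash" is an illustrative key attribute
};

// Costly: bills RCUs for the entire table, then filters the results.
var scanned = await client.ScanAsync(new ScanRequest
{
    TableName = "OrleansReminders",
    FilterExpression = "GrainHash = :h",
    ExpressionAttributeValues = keyValues,
});

// Cheap: bills RCUs only for the items under the matching partition key.
var queried = await client.QueryAsync(new QueryRequest
{
    TableName = "OrleansReminders",
    KeyConditionExpression = "GrainHash = :h",
    ExpressionAttributeValues = keyValues,
});
```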
### DynamoDB compression for large grain states
In certain scenarios, the state of some grains can expand significantly, leading to potential storage challenges such as:
- Increased Read Capacity Units (RCU) and Write Capacity Units (WCU) usage which in turn leads to significant cost increases.
- Increased risk of breaching the 400 KB max item size in DynamoDB.
- Performance concerns related to large items and the maximum DynamoDB partition performance.
While DynamoDB has mechanisms in place to mitigate some concerns, it's crucial to understand their limitations. Specifically, DynamoDB employs two strategies:
- Leveraging adaptive capacity to dynamically manage the allocation of throughput.
- Moving high throughput items into their own dedicated partitions.
However, there are inherent constraints with these strategies. Each partition has a strict upper limit of 1,000 WCU, which caps throughput at between 2.5 writes per second (for 400 KB items) and 1,000 writes per second (for 1 KB items), depending on item size and the additional parameters used during the write.
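The quoted 2.5 to 1,000 writes-per-second range follows directly from the WCU arithmetic: a standard write consumes one WCU per KB of item size (rounded up), and a partition caps out at 1,000 WCU:

```csharp
// Max standard (non-transactional) writes/second a single DynamoDB
// partition can sustain for a given item size, given the 1,000 WCU cap.
static double MaxWritesPerSecond(double itemSizeKb) =>
    1000.0 / Math.Ceiling(itemSizeKb);

// MaxWritesPerSecond(400) == 2.5    (400 KB items, the DynamoDB maximum)
// MaxWritesPerSecond(1)   == 1000   (1 KB items)
```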
To address these challenges, on my own initiative I built consensus for, implemented and deployed to all environments a change allowing large grain states to be compressed, without incurring any downtime. The change was implemented in a common code base and therefore became available to all teams at Starlizard using DynamoDB as the storage mechanism in Microsoft Orleans.
With this mechanism in place, an initial grain state of 16 KB can potentially be reduced to 1 KB or even less through compression (a 93% reduction). Such an approach offers multiple benefits for a Microsoft Orleans based solution:
- Reduced Write Capacity Units (WCU) usage by more than tenfold for larger items, especially considering that 75% of actions on DynamoDB items pertain to writes.
- A corresponding decrease in storage costs by over 10 times for sizable items.
- A reduction in network latency during the transfer of items to and from the storage solution.
To support this functionality, I introduced a new record schema incorporating a `Metadata` map attribute to hold key-values such as `Compression` type, `Serialization` method and `SchemaVersion`, ensuring seamless data handling and compatibility.
The migration strategy, which detected the record version upon read and updated records to the latest schema version upon write, ensured backward compatibility and facilitated a smooth transition to the optimised storage format.
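A minimal sketch of the write path, assuming GZip compression and the AWS SDK for .NET; the attribute layout follows the schema described above, but the helper and attribute names are illustrative rather than the actual implementation:

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using Amazon.DynamoDBv2.Model;

static Dictionary<string, AttributeValue> BuildGrainStateItem(
    string grainReference, byte[] serializedState)
{
    // Compress the serialized grain state before storage.
    using var buffer = new MemoryStream();
    using (var gzip = new GZipStream(buffer, CompressionLevel.Fastest))
        gzip.Write(serializedState, 0, serializedState.Length);

    return new Dictionary<string, AttributeValue>
    {
        ["GrainReference"] = new AttributeValue { S = grainReference },
        ["State"] = new AttributeValue { B = new MemoryStream(buffer.ToArray()) },
        // The Metadata map records how the payload was encoded, so the read
        // path can decode any record regardless of when it was written.
        ["Metadata"] = new AttributeValue
        {
            M = new Dictionary<string, AttributeValue>
            {
                ["Compression"]   = new AttributeValue { S = "GZip" },
                ["Serialization"] = new AttributeValue { S = "OrleansBinary" },
                ["SchemaVersion"] = new AttributeValue { N = "2" },
            }
        },
    };
}

// On read, a record without the Metadata attribute is treated as legacy
// (schema v1, uncompressed) and is rewritten in the new format on the next
// state write, which is what makes the migration downtime-free.
```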
Testing the solution in isolation using data from the UAT environment showed a potential compression ratio of more than 5:1 and a negligible 40 microseconds of added compression latency, a cost substantially outweighed by the decreased data transfer latency from the smaller payloads.
Once deployed to the LIVE environment, the benefits proved even greater, with a 94% reduction in RCU usage and a 97% reduction in WCU usage.
### DynamoDB single-table design
On my own initiative, I spearheaded a significant effort to optimize the Microsoft Orleans DynamoDB storage system, which holds grain state, reminders and clustering provider data.
Recognizing DynamoDB as the third largest expense after EC2 and MSK, I proposed a single-table design approach to optimize the use of DynamoDB. This proposed solution was designed to:
- Lower the number of requests to DynamoDB (thus reducing latency and cost).
- Simplify monitoring and management since we'd have to manage a single table.
The single-table design also unlocks the use of DynamoDB provisioned capacity, with reserved capacity covering the steady-state load, replacing the on-demand provisioning strategy currently employed. Using provisioned capacity without the single-table design was not feasible because of the spiky workload.
To gather support for this major change, I wrote a comprehensive design document detailing the existing challenges, the proposed solution, the data access patterns along with example data using NoSQL Workbench for DynamoDB, the migration strategy and a comparison with alternative solutions (explaining why they fell short of our objectives).
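As a hypothetical illustration of the access-pattern modelling (the real key scheme lives in the design document, and the prefixes below are assumptions), one generic partition/sort key pair can serve all three record types, so related items land in the same partition and can be fetched with a single `Query`:

```csharp
// Illustrative single-table key layout; prefixes and structure are
// assumptions, not the production scheme.
static (string Pk, string Sk) StateKey(string serviceId, string grainId) =>
    ($"SVC#{serviceId}#GRAIN#{grainId}", "STATE");

static (string Pk, string Sk) ReminderKey(string serviceId, string grainId, string reminderName) =>
    ($"SVC#{serviceId}#GRAIN#{grainId}", $"REMINDER#{reminderName}");

static (string Pk, string Sk) MembershipKey(string serviceId, string siloAddress) =>
    ($"SVC#{serviceId}#CLUSTER", $"SILO#{siloAddress}");

// A single Query on PK = $"SVC#{serviceId}#GRAIN#{grainId}" returns a grain's
// state and all of its reminders in one request, instead of one request per
// record type against separate tables.
```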
This document, complete with development timelines and cost-benefit analyses, was presented to senior management and the team, receiving overwhelmingly positive feedback.
This proposal not only promises to streamline operational processes and reduce development and maintenance efforts but also forecasts an impressive cost saving of $70,000 annually from this change alone, equating to a 90% reduction in DynamoDB-related expenses.
The savings are expected to increase substantially once implemented across the entire organization. Additionally, the solution's extensibility paves the way for further cost reductions by potentially replacing Aurora and MSK, highlighting my ability to drive impactful financial and operational improvements.
## Architecture Diagram
### Solution diagram
I've authored the end-to-end solution architecture diagrams using draw.io.
These diagrams detailed the essential architecture components (networking, compute, storage, integration with SaaS solutions such as OKTA and Datadog, and integration with each client's on-premise data centres and services), using the latest AWS architecture icons and AWS best practices.
I've leveraged the solution architecture diagram to drive solution architecture presentations to external customers, facilitating clear communication and understanding of the AWS cloud-based SaaS solution.
### CI/CD solution diagram
I've authored a detailed architecture diagram showcasing the integration and connectivity between the CI/CD pipeline architecture components (BitBucket, Jenkins, Vault, JFrog, OKTA, Flux and AWS cloud).
The CI/CD diagram was used to explain the CI/CD processes and tools to new and existing team members.
## Inner Sourcing
I was tasked with reviewing a PR for repository documentation that contained multiple errors and omissions. I took the initiative not only to rewrite the existing documentation but also to create a comprehensive repository template for future use. This template, designed to establish a standard for both new and existing repositories, includes a suite of tools and guidelines aimed at enhancing documentation quality and consistency:
- **VSCode standard plugins** for spellchecking, markdown assistance, linting, and draw.io integration, coupled with VSCode settings to streamline the development environment.
- A detailed `README.md` containing a table of contents, architecture diagrams, an introduction, and references to the release process and contributing guide.
- `ARCHITECTURE.md` showcases the system's design through [draw.io](https://app.diagrams.net/) based diagrams, documents functional and non-functional requirements, the stakeholders, security considerations, key decisions, and potential risks along with mitigation strategies.
- `CHANGES.md` adopts the [keep a changelog](https://keepachangelog.com/en/1.1.0/) format for documenting changes and [Semantic Versioning](https://semver.org/) to determine the next version, and includes examples to guide contributors.
- `CONTRIBUTING.md` outlines the process for issue reporting, required permissions, development environment setup, coding guidelines emphasizing high code coverage, a dependency update guide, and the feature branch pattern to be used.
- Templates for **issues** and **pull requests** that streamline the submission process, ensuring consistency in reporting bugs or feature requests and in detailing pull requests with checks for tests, documentation, and successful builds.
- `MAINTAINERS.md` and `ROADMAP.md` provide information on repository maintenance, the pull request review process, release procedures, and future planning with milestones and goals.
After creating this template, I applied it to document the repository template itself, thereby demonstrating its utility and effectiveness. The repository template has since been adopted as a standard at Starlizard and is being used for new repositories. This initiative not only enhanced the quality and consistency of our project documentation but also streamlined the onboarding process for new contributors.
## Miscellaneous
I led and contributed to a wide range of projects related to cloud engineering, chaos engineering, unified storage solutions, and training. Some of my key achievements include:
1. **Chaos Engineering**: I've authored non-functional requirements and a detailed test plan focusing on operational excellence, security, reliability, performance efficiency, cost optimization and sustainability, leveraging the AWS Well-Architected Framework. I've implemented a Terraform module for deploying AWS Fault Injection Service experiments to the solution's AWS account and automated daily experiment runs via a Jenkins pipeline.
2. **Unified Storage Solution**: I've proposed a system and architecture for a centralized, timeline-based view across diverse data storage mediums (e.g. Kafka, MySQL, DynamoDB, S3), enhancing observability and issue resolution. The proposal is being actively worked on.
3. **Training Initiatives**:
- Delivered in-depth training on **Amazon Aurora** and **DynamoDB**, covering their architecture, security, best practices, pricing, and provided practical demonstrations.
   - Conducted a comprehensive **Terraform training series**, highlighting its importance and application within our projects, with detailed usage demonstrated through the content-logging feature as a worked example.
4. **Subscription Monitoring**: Developed a generalized solution for aggregating metrics across pods, facilitating precise monitoring of the data flow and processing stages, utilizing Prometheus and Datadog for advanced metric analysis.
5. **Orleans Grain State Storage Date Format Fix**: Identified and resolved a critical date storage issue in the Orleans grain state storage, implementing a storage provider fix and a migration tool without incurring downtime.
6. **MSK Optimizations**: Implemented several optimizations for Kafka (MSK), including server- and producer-side compression (a producer-side sketch follows this list), rebalancing partitions, and strategic parameter adjustments to enhance performance and reliability.
7. **RDS Enhancements**: Implemented performance optimizations, connection limit solutions (via RDS Proxy), and migration to Aurora 3, contributing to significant performance improvements and cost reductions.
8. **Kafka Schema Registry**: Successfully integrated and configured Kafka Schema Registry for high availability, including comprehensive documentation covering the feature, the rollout and rollback strategies, risks, concerns and mitigating actions.
9. **DevOps**: I've enhanced operational efficiency and early problem detection using Terraform and AWS DevOps Guru, derived the solution's SLA from the solution architecture, and actively contributed to the DevOps function.
10. **SRE**: I've implemented observability improvements, helped with capacity planning, and made significant enhancements to system availability and performance through strategic infrastructure improvements.
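As a flavour of the MSK producer-side compression mentioned in item 6, a minimal sketch using the Confluent.Kafka .NET client is shown below; the broker address, topic and chosen codec are placeholders rather than the production values:

```csharp
using Confluent.Kafka;

var config = new ProducerConfig
{
    BootstrapServers = "msk-broker-1:9092", // placeholder broker address
    CompressionType = CompressionType.Zstd, // compress batches before sending
    LingerMs = 20,                          // brief delay so batches fill up and compress better
};

using var producer = new ProducerBuilder<string, string>(config).Build();
await producer.ProduceAsync("example-topic",
    new Message<string, string> { Key = "key", Value = "value" });
```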