# Projects at AWS
## ONS
Project Description: Serverless Data-Lake Solution for the UK Office for National Statistics (ONS)
In this project, I designed, co-developed, and helped deploy a data-lake solution for the UK Office for National Statistics (ONS). The solution enabled users to create datasets, define access-control policies, and spin up on-demand compute environments for data analysis using Python, NumPy, and Spark on both Linux and Windows platforms.
Our serverless architecture let users save results between sessions, giving researchers the flexibility to switch their compute environments off and on as needed and reducing operational costs. We designed the system so that administrative users could restrict access to datasets, compute environment types, and costs, while retaining audit capabilities for system usage.
The data-lake solution focused on serverless technologies, enabling the customer to benefit from a usage-based cost structure. The technologies used in this project included API Gateway, AWS Lambda, S3, CloudFront, React, IAM, Spark, EC2, and DynamoDB.
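As a flavour of the API Gateway, Lambda, and DynamoDB pattern, here is a minimal sketch of a dataset-registration handler. The table name, request shape, and authorizer field are illustrative assumptions, not the actual ONS implementation:

```python
import json
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
# Table name is an illustrative placeholder.
datasets = dynamodb.Table("datasets")

def handler(event, context):
    """API Gateway -> Lambda handler sketching dataset registration."""
    body = json.loads(event["body"])
    item = {
        "datasetId": str(uuid.uuid4()),
        # principalId is populated by a custom authorizer (assumed setup).
        "owner": event["requestContext"]["authorizer"]["principalId"],
        "name": body["name"],
        # Storing the access policy with the dataset lets administrators
        # restrict who can read it later.
        "allowedGroups": body.get("allowedGroups", []),
    }
    datasets.put_item(Item=item)
    return {"statusCode": 201, "body": json.dumps({"datasetId": item["datasetId"]})}
```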
In addition to my involvement in the development process, I was part of the team that presented the solution on-site to the ONS. This project demonstrates my proficiency in serverless computing, data-lake implementation, and collaboration within a team to deliver a successful solution to a high-profile customer.
## FAA
Project Description: Near-Realtime Air Traffic Dashboard for the Federal Aviation Administration (FAA)
In this project, I designed and co-developed a near-realtime dashboard for the FAA that displayed airplane information derived from both public and private data streams. Our primary goal was to build an end-to-end, scalable data ingestion pipeline capable of streaming real-time data to downstream clients, such as user dashboards.
The solution we built was able to ingest data at scale, transform it, store the desired information, and then stream it in near-realtime to the user dashboard. This allowed for improved situational awareness and data-driven decision-making in the aviation industry.
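On the ingestion side, producers simply write records into the delivery stream and let the managed service handle batching and delivery. A minimal sketch of that pattern with boto3; the stream name and record shape are illustrative:

```python
import json

import boto3

firehose = boto3.client("firehose")

def ingest(position_report: dict) -> None:
    """Push one aircraft position report into the delivery stream."""
    firehose.put_record(
        DeliveryStreamName="air-traffic-stream",  # placeholder name
        # Newline-delimited JSON keeps downstream Athena queries simple.
        Record={"Data": (json.dumps(position_report) + "\n").encode("utf-8")},
    )
```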
Our technology stack included Amazon Kinesis Data Firehose for data ingestion, Amazon Kinesis Data Analytics for processing and analysis, Amazon S3 and DynamoDB for storage, React for the front-end user interface, Athena for querying data, and WebSockets for real-time communication. This project showcases my expertise in developing scalable data ingestion pipelines, real-time data processing, and dashboard development, as well as my ability to collaborate effectively within a team.
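The near-realtime push to the dashboard is the WebSocket half of that stack. A hedged sketch of the fan-out step, assuming connection IDs are tracked in a DynamoDB table; the endpoint URL, table, and key names are placeholders:

```python
import json

import boto3

dynamodb = boto3.resource("dynamodb")
connections = dynamodb.Table("dashboard-connections")  # placeholder name

# Management endpoint of the API Gateway WebSocket API (placeholder URL).
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.us-east-1.amazonaws.com/prod",
)

def broadcast(update: dict) -> None:
    """Push one aircraft update to every connected dashboard client."""
    for item in connections.scan()["Items"]:
        try:
            apigw.post_to_connection(
                ConnectionId=item["connectionId"],
                Data=json.dumps(update).encode("utf-8"),
            )
        except apigw.exceptions.GoneException:
            # The client disconnected; drop its stale connection record.
            connections.delete_item(Key={"connectionId": item["connectionId"]})
```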
## DfE
Project Description: Accelerated Certification Path Solution for Aspiring Professors
In this project, I designed and co-developed an innovative solution to help aspiring professors achieve their professional certifications through the shortest possible development path. Using a graph database, our system recorded the certification paths of previous candidates and determined the most efficient route to certification for new candidates based on their individual circumstances.
In the proof-of-concept (PoC), I implemented simulated customer data silos, data generators that produced continuous and realistic data changes (inserts, updates, and deletes), a scalable data ingestion pipeline to the graph database, the graph-based data model, and the algorithm used to determine the shortest certification path for candidates. The algorithm was implemented using Apache TinkerPop and the Gremlin query language.
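The traversal itself followed the standard TinkerPop shortest-path recipe for unweighted graphs. A minimal sketch using gremlin_python against Neptune; the endpoint, vertex IDs, and edge label are illustrative:

```python
from gremlin_python.driver import client
from gremlin_python.driver.serializer import GraphSONSerializersV2d0

# Placeholder Neptune endpoint.
gremlin = client.Client(
    "wss://example.cluster-abc.eu-west-2.neptune.amazonaws.com:8182/gremlin",
    "g",
    message_serializer=GraphSONSerializersV2d0(),
)

# The first simple path that reaches the target certification is the
# shortest route for the candidate ("progressed_to" is an assumed label).
query = (
    "g.V('candidate-start')"
    ".repeat(out('progressed_to').simplePath())"
    ".until(hasId('target-certification'))"
    ".path().limit(1)"
)
shortest_path = gremlin.submit(query).all().result()
gremlin.close()
```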
Throughout the project, I delivered weekly demos to C-level stakeholders, showcasing the value added by our solution. I also built consensus within the team to adopt GraphQL as the query language for downstream consumers of the results.
The technology stack for this project included SQL Server, Amazon Neptune, Apache TinkerPop, the Gremlin query language, AWS Glue Jobs (Spark), AWS Glue Workflows, AWS Glue Crawlers, S3, Jupyter notebooks, DynamoDB, CloudFront, CloudFormation, API Gateway, Lambda, React, and GraphQL. This project highlights my expertise in graph databases, data modeling, algorithm development, and effective communication with stakeholders at all levels.
## UCL
Project Description: Resource Usage Optimization and Data Reconciliation for University College London (UCL)
As the sole software development engineer (SDE) on this project, I designed and implemented a solution to optimize and predict resource usage at UCL while bridging data silos whose records were linked only by fuzzy names the customer had been unable to reconcile.
I developed simulated customer data silos based on Oracle databases and implemented data generators that reliably produced continuous, realistic data changes (inserts, updates, and deletes). I also designed a scalable data migration and ingestion pipeline to move data from on-premises Oracle databases to AWS data stores such as S3, Redshift, and DynamoDB, optimizing for each access pattern.
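The ongoing replication from Oracle is the kind of job AWS DMS (listed in the stack below) covers with a full-load-plus-CDC task. A minimal sketch of starting one with boto3; the ARN is a placeholder, and the task's source and target endpoints are assumed to be configured already:

```python
import boto3

dms = boto3.client("dms")

response = dms.start_replication_task(
    # Placeholder ARN for a task whose Oracle source and AWS targets
    # are assumed to exist already.
    ReplicationTaskArn="arn:aws:dms:eu-west-2:123456789012:task:EXAMPLE",
    StartReplicationTaskType="start-replication",
)
print(response["ReplicationTask"]["Status"])
```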
To unify the data silos, I used AWS Glue ML Transforms (the FindMatches transform) to link records across the fuzzy names, and then employed Amazon Forecast to predict future demand for individual resources. This allowed UCL to allocate resources efficiently based on predicted usage patterns and streamline operations.
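Once a predictor had been trained and a forecast generated, downstream code could pull per-resource predictions. A hedged sketch using boto3's forecastquery client; the forecast ARN and item id are illustrative:

```python
import boto3

forecastquery = boto3.client("forecastquery")

result = forecastquery.query_forecast(
    # Placeholder ARN; one item id per bookable resource is assumed.
    ForecastArn="arn:aws:forecast:eu-west-2:123456789012:forecast/EXAMPLE",
    Filters={"item_id": "lecture-theatre-01"},
)

# Predictions are returned per quantile; p50 is the median forecast.
for point in result["Forecast"]["Predictions"]["p50"]:
    print(point["Timestamp"], point["Value"])
```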
The technology stack for this project included DynamoDB, Lambda, API Gateway, CloudFront, React, AWS Glue Jobs (Spark & Python), AWS Glue Workflows, Redshift, Oracle, AWS DMS, and Glue ML Transforms. This project demonstrates my ability to work independently, develop efficient data migration pipelines, and use machine learning techniques to optimize resource management for large institutions.
## Merseyside
Project Description: Intelligent Environmental and Online Activity Monitoring System
In this project, I designed and developed a proof-of-concept (PoC) device capable of capturing both sensor data (noise, humidity, light, temperature) and online activity information (connection metadata, DNS queries, Bluetooth, and Wi-Fi broadcast packets). The goal was to create a comprehensive monitoring solution that could optimize visiting schedules by predicting activity levels at the capture location with 99% accuracy.
I implemented the on-device software as an ETL (Extract, Transform, Load) process that kept operating through connectivity disruptions and unscheduled power cuts. To achieve this, I used AWS IoT Greengrass with a long-lived Lambda function for each ETL stage, coupled with on-device persisted Redis queues for data buffering between stages.
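A minimal sketch of one such stage, assuming illustrative queue names and a trivial transform; in the real deployment each stage ran as a long-lived Greengrass Lambda:

```python
import json

import redis

# The local Redis instance persists queued records across power cuts.
r = redis.Redis(host="localhost", port=6379)

def transform(record: dict) -> dict:
    # Illustrative transform: drop empty sensor readings.
    return {k: v for k, v in record.items() if v is not None}

while True:
    # blpop blocks until an item arrives, so the stage idles cheaply
    # through connectivity gaps; queue names are placeholders.
    _, payload = r.blpop("raw")
    record = transform(json.loads(payload))
    r.rpush("transformed", json.dumps(record))
```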
To ensure a reliable and consistent setup process, I created a method for building and pre-configuring the OS and device for both sensor and networking capture devices. The sensor device was based on Raspbian, while the networking capture device used OpenWRT.
To handle the data coming from thousands of devices, I designed a scalable ingestion method that processed the information and fed it into a machine learning model, which predicted activity levels at the capture location with 99% accuracy.
The activity-level predictions were then used to optimize the visiting schedule, minimizing wasted effort and time while maximizing efficiency.
The technology stack for this project included OpenWRT, Raspbian, Redis, AWS IoT Greengrass, Lambda, DynamoDB, React, AWS Glue Jobs (Python), AWS Glue Workflows, and Redshift. This project showcases my ability to develop innovative solutions for IoT device deployment, data processing, and machine learning model implementation.
## Staff Uni
Project Description: Student Dropout Detection and Prevention System
As the solution's designer and one half of a two-person development team, I implemented a system for detecting and preventing student dropout. The solution leveraged student activity, grades, and other relevant data and metadata to calculate each student's dropout risk.
Our solution featured an ingestion pipeline that demonstrated data migration from on-premises to the cloud. The customer exported data daily to an S3 bucket; a Glue Workflow, triggered on a cron schedule, then orchestrated Glue Tables, Glue Crawlers, Glue Jobs (Python and PySpark), and Athena to transform the data and store it in a data lake.
The transform stage partitioned the data according to the desired query scenarios, converted it to Parquet format, and applied gzip compression for faster, cheaper querying. I designed a student-level predictive model that considered all student attributes and calculated the dropout risk for each individual.
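A minimal PySpark sketch of that transform step; the paths, columns, and partition keys are illustrative assumptions (the real job read from the Glue Data Catalog):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dropout-transform").getOrCreate()

# Placeholder input path; event_date is assumed to parse as a date.
df = spark.read.json("s3://example-raw/daily-export/")

(df.withColumn("year", F.year("event_date"))
   .withColumn("month", F.month("event_date"))
   .write
   .partitionBy("year", "month")       # matches the expected query patterns
   .option("compression", "gzip")      # gzip-compressed Parquet
   .mode("overwrite")
   .parquet("s3://example-lake/student-activity/"))
```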
If a student's risk exceeded a configurable threshold, the insight was written to EventBridge, where rules determined the appropriate actions: an alert was raised in the application, the responsible professor was notified by email via Amazon SNS, and professors could contact at-risk students through messages. The ingestion pipeline's flexibility allowed it to be extended to other data sources through configuration changes alone.
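A hedged sketch of the alerting step; the event source, detail-type, and bus name are illustrative, with EventBridge rules assumed to route matching events to Amazon SNS and the application:

```python
import json

import boto3

events = boto3.client("events")

def raise_dropout_alert(student_id: str, risk: float, threshold: float) -> None:
    """Publish an at-risk event for EventBridge rules to act on."""
    if risk < threshold:
        return
    events.put_events(
        Entries=[{
            "Source": "dropout.detector",        # placeholder source
            "DetailType": "StudentAtRisk",       # placeholder detail-type
            "Detail": json.dumps({"studentId": student_id, "risk": risk}),
            "EventBusName": "default",
        }]
    )
```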
The technology stack for this project included React, CloudFront, API Gateway, Lambda, DynamoDB, EventBridge, S3, Glue Tables, Glue Crawlers, Glue Jobs (Python), AWS Glue Workflows, Athena, CloudFormation, and Amazon Forecast. This project highlights my expertise in data pipeline design, predictive modeling, and developing flexible, scalable solutions for complex data-driven challenges.