# Pyramid Develops Data Pipeline to Drive ML/AI Models
Written by [Chen-Kang "Kevin" Lee](https://www.linkedin.com/in/chen-kang-lee-64058a18b/)
## 1. Introduction
The explosion of information generated by individuals and organizations over the last 40 years has created the need for new data analysis methods and techniques. Increasingly powerful hardware is required to manage these massive datasets; however, not all individuals and businesses are equipped with the resources necessary to host and maintain a machine learning pipeline. Cloud-based machine learning removes this barrier by eliminating the user’s responsibility to maintain hardware, providing greater flexibility and scalability.
Over the course of a four-month summer program, Pyramid interns took on the project of developing an end-to-end data pipeline in the Google Cloud Platform (GCP) that automatically ingests, prepares, and analyzes open-source data. The pipeline adapts to various datasets by allowing users to plug in different machine learning (ML) models and preprocessing code. Additionally, leveraging cloud services like auto-scaling allows the pipeline to process large datasets. This project serves as an architectural blueprint to aid Pyramid data scientists during future tech challenges when required to quickly prototype a data pipeline that can process data and draw insights.
## 2. Project Overview
Before creating the pipeline, we defined a data science/ML problem to solve. The focus of this project was the pipeline structure itself, so we selected a simple problem that would allow us to remain focused on exploring tools to implement each stage of the pipeline. We chose the task of identifying banks located in disaster **hot zones**, or areas most vulnerable to natural disasters, using the public [Bank Data from the Federal Deposit Insurance Corporation (FDIC)](https://banks.data.fdic.gov/docs/). A map of disaster hot zones was generated by running clustering algorithms on the [disaster dataset made public by the Federal Emergency Management Agency (FEMA)](https://www.fema.gov/about/openfema/data-sets).
We divided the pipeline design into four stages: **Ingestion, Preparation, Analysis, and Visualization**, creating an end-to-end workflow from raw data to easy-to-read graphs presented on a website we developed. Google Cloud Storage (GCS) buckets were used as interfaces between stages, storing the intermediate representation of the data at each stage.
We chose GCP as the host for its comprehensive suite of services and documentation, allowing Pyramid to expand this type of solution beyond AWS to a new platform. The following chart shows which GCP services were used in each stage. In the next section, we dive into the details of implementation.
![](https://i.imgur.com/0HeCXR4.png)
Our development team followed a one-week sprint cycle, and the entire project was completed over a 10-week summer internship in 2022. The focus of each sprint roughly aligned with the stages of the pipeline, though the first four weeks were dedicated to research and planning. The following graphic shows the work completed during each sprint.
![](https://i.imgur.com/gYtQ58X.png)
## 3. Deep Dive
### 3.1 Ingestion
Ingestion was implemented using an Apache Airflow operator running in Google Cloud Composer. The operator downloaded data from the source (public data from the FDIC and FEMA websites) into a GCS bucket. More details on Apache Airflow and Cloud Composer can be found in section 3.5.
Parallelizing ingestion to speed it up was not an option for our pipeline because both the FDIC and FEMA datasets were downloaded as a single file; however, ingestion can benefit from parallelization in problems with larger datasets that are sourced from multiple files. An Airflow operator can be set up for each source file, and Cloud Composer will schedule them to run in parallel when possible.
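Below is a minimal sketch of how such an ingestion task could be expressed as an Airflow `PythonOperator` that downloads a source file over HTTP and writes it to a GCS bucket. The URL, bucket name, and DAG settings are placeholders rather than the project's actual configuration, and the DAG is shown standalone for illustration; scheduling is covered in section 3.5.

```python
# Hypothetical sketch of an ingestion task; the source URL, bucket name,
# and DAG settings are placeholders, not the project's exact values.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import storage

RAW_BUCKET = "raw-data-bucket"                          # placeholder bucket
SOURCE_URL = "https://example.com/fdic_locations.csv"   # placeholder source URL


def ingest_to_gcs(source_url: str, bucket_name: str, blob_name: str) -> None:
    """Download a single source file and store it in a GCS bucket."""
    response = requests.get(source_url, timeout=300)
    response.raise_for_status()
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_string(
        response.content
    )


with DAG(
    dag_id="ingest_open_data",
    start_date=datetime(2022, 6, 1),
    schedule_interval=None,   # standalone for illustration; see section 3.5
    catchup=False,
) as dag:
    ingest_fdic = PythonOperator(
        task_id="ingest_fdic",
        python_callable=ingest_to_gcs,
        op_kwargs={
            "source_url": SOURCE_URL,
            "bucket_name": RAW_BUCKET,
            "blob_name": "raw/fdic_locations.csv",
        },
    )
```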
### 3.2 Preparation
After data ingestion was complete, we prepared the raw data for analysis. Our three primary preprocessing subtasks were as follows: **filter incomplete or extraneous FEMA disaster entries; convert the Federal Information Processing Standard (FIPS) codes used by FEMA; and convert the street address data used by FDIC.** Preprocessing was completed using Google Dataflow, a cloud-based data processing service that optimizes performance by automatically identifying opportunities for parallelization and data shuffling.
FEMA has been recording disasters since 1969. More recent entries include more detailed information, whereas older entries have missing fields. Preprocessing removed the older entries with missing values. Additionally, we omitted data related to disasters that occurred outside of the contiguous United States (e.g., in Alaska or Hawaii) because these disasters formed their own separate clusters, making the presentation less clear.
Because FEMA identifies locations by FIPS code, we replaced each FIPS code with the geographic midpoint of the corresponding county according to data from the US Census Bureau. This strategy enabled geographic clustering based on latitude and longitude. FDIC data also required conversion because branch locations are listed as street addresses. We utilized ArcGIS, a geocoder available through the GeoPy library, to map the branches' street addresses to a latitude and longitude.
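As a rough illustration of that conversion, the sketch below geocodes branch addresses with GeoPy's ArcGIS geocoder; the input file, column names, and rate limit are assumptions rather than the project's exact setup.

```python
# Minimal sketch of geocoding FDIC branch addresses with GeoPy's ArcGIS
# geocoder; the input file, column names, and rate limit are assumptions.
import pandas as pd
from geopy.geocoders import ArcGIS
from geopy.extra.rate_limiter import RateLimiter

geocoder = ArcGIS()
# Throttle requests so the geocoding service is not overwhelmed.
geocode = RateLimiter(geocoder.geocode, min_delay_seconds=1)

branches = pd.read_csv("fdic_branches.csv")          # assumed input file
full_address = (
    branches["address"] + ", " + branches["city"] + ", " + branches["state"]
)
locations = full_address.apply(geocode)
branches["latitude"] = locations.apply(lambda loc: loc.latitude if loc else None)
branches["longitude"] = locations.apply(lambda loc: loc.longitude if loc else None)
```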
Google Dataflow allowed us to configure the preprocessing operation using Apache Beam, which provides a unified programming model for both batch and streaming data processing. With Apache Beam, users define the data processing pipeline in either Java or Python. **PTransforms** are the basic building blocks of a Beam pipeline, and they perform a wide range of data processing tasks, such as filtering, mapping, aggregating, and joining. By chaining PTransforms together, users can perform complex data processing operations. Each PTransform takes one or more parallel collections (**PCollections**) as input, then produces one or more PCollections as output. PCollections are immutable, distributed datasets that can be partitioned and processed in parallel across multiple machines or nodes, which enables high-performance, scalable data processing.
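To make the chaining concrete, here is a minimal sketch of a Beam pipeline in Python that reads raw data from a GCS bucket, filters incomplete rows, and writes the result to a second bucket; the bucket paths, the completeness check, and the pipeline options are placeholders, not the project's exact code.

```python
# Minimal sketch of chaining PTransforms in an Apache Beam (Python) pipeline;
# bucket paths, the completeness check, and the pipeline options are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def has_required_fields(row: str) -> bool:
    """Placeholder filter: keep rows whose comma-separated fields are all populated."""
    return all(field.strip() for field in row.split(","))


options = PipelineOptions(
    runner="DataflowRunner",              # use "DirectRunner" to test locally
    project="my-gcp-project",             # placeholder project id
    region="us-central1",
    temp_location="gs://raw-data-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadRawData" >> beam.io.ReadFromText("gs://raw-data-bucket/fema_disasters.csv")
        | "FilterIncomplete" >> beam.Filter(has_required_fields)
        | "WriteCleanData" >> beam.io.WriteToText("gs://clean-data-bucket/fema_disasters")
    )
```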
The figure below illustrates the transformations defined in our preprocessing stage. The raw data downloaded during ingestion into a GCS bucket is passed through the PTransforms, which are depicted as orange rectangles. The gray dots represent the intermediate PCollections generated by the PTransforms. Finally, the preprocessed data is stored in a second GCS bucket titled "Clean Data."
![](https://i.imgur.com/EsYqmLi.png)
### 3.3 Analysis
When asked about natural disasters in the US, people instinctively associate certain disaster types with specific areas (e.g., hurricanes in the Southeast and tornadoes in the Midwest). A clustering algorithm applied to FEMA disaster data gives a more accurate map of hot zones for all disaster types. For this project, we applied three different algorithms to cluster the FEMA data: **kNN, DBSCAN, and Filtered DBSCAN**. After generating the hot zones with these clustering algorithms, we used k-means clustering to map bank locations to nearby clusters. In this section, we explore the rationale behind each clustering algorithm.
[kNN](https://ieeexplore.ieee.org/document/1053964) is a classic algorithm often favored for its simplicity; however, FEMA disaster data includes county-level details, so the data points were much denser on the eastern half of the US, where counties are geographically smaller. This posed a challenge for kNN because the same threshold that generated well-formed clusters on the West Coast would have produced one large cluster encompassing the entire East Coast.
![](https://i.imgur.com/Ub35fFV.png)
[DBSCAN (Density-Based Spatial Clustering of Applications with Noise)](https://dl.acm.org/doi/10.5555/3001460.3001507) is a density-based clustering algorithm that groups together data points that are close to each other in the feature space. DBSCAN can identify and ignore outliers, which are data points that do not belong to any cluster. It also allows the creation of clusters with varying densities, which gives us more insightful clustering results across the whole country.
![](https://i.imgur.com/0YktXeP.png)
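A minimal sketch of density-based clustering on the cleaned disaster coordinates with scikit-learn's `DBSCAN` follows; the input file, column names, and the `eps`/`min_samples` values are illustrative rather than the parameters we tuned for the project.

```python
# Minimal sketch of clustering disaster locations with scikit-learn's DBSCAN;
# the input file, column names, and eps/min_samples values are illustrative.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

disasters = pd.read_csv("clean_fema_disasters.csv")     # assumed input file
coords = disasters[["latitude", "longitude"]].to_numpy()

# The haversine metric expects radians; eps here corresponds to roughly 100 km.
earth_radius_km = 6371.0
clustering = DBSCAN(
    eps=100.0 / earth_radius_km,
    min_samples=25,
    metric="haversine",
).fit(np.radians(coords))

# Label -1 marks points DBSCAN treats as noise (outliers).
disasters["cluster"] = clustering.labels_
```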
Some disasters are highly correlated with the time of year (e.g., blizzards occur during winter, hurricanes occur during summer). This correlation initially led us to the [ST-DBSCAN](https://www.sciencedirect.com/science/article/pii/S0169023X06000218) algorithm, an extension of DBSCAN that clusters along both the spatial and temporal dimensions. However, disasters in the FEMA dataset vary greatly in length, which made it challenging to select a specific timestamp to represent each disaster entry. We settled on a middle-ground solution: group disasters by the month they begin, then run an updated DBSCAN to obtain clusters of varying density. We refer to this process as "Filtered DBSCAN."
![](https://i.imgur.com/bxZKspz.gif)
The analysis code takes the form of Python scripts. The automation performed by Cloud Composer triggered Google Compute Engine instances and executed the scripts, which loaded the cleaned data from GCS buckets onto the local machine and ran the algorithms. Using Plotly, a graphing library, we plotted the bank branch locations on an interactive map of the US, then exported the plot as HTML elements stored in a GCS bucket. The HTML elements are subsequently embedded into the website we created in the visualization stage.
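The sketch below shows one way to render such a map with Plotly and export it as an embeddable HTML fragment to a GCS bucket; the input file, column names, bucket name, and styling are assumptions rather than the project's exact code.

```python
# Minimal sketch of plotting bank branches on a US map with Plotly and
# exporting an embeddable HTML fragment to GCS; file, column, and bucket
# names are assumptions.
import pandas as pd
import plotly.express as px
from google.cloud import storage

banks = pd.read_csv("banks_with_clusters.csv")          # assumed analysis output
fig = px.scatter_geo(
    banks,
    lat="latitude",
    lon="longitude",
    color="cluster",                                    # hot-zone cluster label
    hover_name="branch_name",
    scope="usa",
)

# full_html=False produces an embeddable <div> rather than a standalone page.
html_fragment = fig.to_html(full_html=False, include_plotlyjs="cdn")
storage.Client().bucket("viz-bucket").blob("bank_map.html").upload_from_string(
    html_fragment, content_type="text/html"
)
```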
### 3.4 Visualization
To present the results, we built a [demo website](https://internship-355618.uk.r.appspot.com/) using the React library and hosted the web server on Google App Engine. Our website directly embeds the HTML elements, so updating the graphs only requires updating the HTML elements stored in the cloud.
### 3.5 Automation
Knowing that the FDIC and FEMA datasets update periodically, we designed the pipeline to execute every six months so that it always displays up-to-date disaster hot zones. Automation of the pipeline was achieved with Cloud Composer, which allowed us to define the dependencies between tasks in an Apache Airflow Directed Acyclic Graph (DAG). Airflow provides operators that interact directly with GCP, such as starting Beam jobs and starting Compute Engine instances. Users define their own DAG according to the tasks in their pipeline.
![](https://i.imgur.com/nYUuP0t.png)
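For reference, here is a minimal sketch of how the stage dependencies and the six-month schedule can be expressed in an Airflow DAG; the task names, operators, and cron expression are placeholders standing in for the project's actual operators.

```python
# Minimal sketch of the pipeline DAG and a twice-yearly schedule; the task
# names, operators, and cron expression are placeholders, not the project's
# exact configuration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_stage(stage: str) -> None:
    print(f"running {stage}")           # stand-in for the real stage logic


with DAG(
    dag_id="disaster_hot_zone_pipeline",
    start_date=datetime(2022, 6, 1),
    schedule_interval="0 0 1 1,7 *",    # 1 Jan and 1 Jul: every six months
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=run_stage,
                            op_kwargs={"stage": "ingest"})
    prepare = PythonOperator(task_id="prepare", python_callable=run_stage,
                             op_kwargs={"stage": "prepare"})
    analyze = PythonOperator(task_id="analyze", python_callable=run_stage,
                             op_kwargs={"stage": "analyze"})
    publish = PythonOperator(task_id="publish", python_callable=run_stage,
                             op_kwargs={"stage": "publish"})

    # Airflow's >> operator declares the dependency chain between stages.
    ingest >> prepare >> analyze >> publish
```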
## 4. Conclusion
Our team of interns assembled a pipeline on GCP that is designed to handle large volumes of data. This deliverable can be extended easily to accommodate new data sources and processing requirements. With this pipeline as a blueprint, Pyramid data scientists can focus on analysis and generation of insights rather than spending time on the technicalities of building a data pipeline.
## Contributors
- [Lilia Hsueh](https://www.linkedin.com/in/lilia-hsueh-268b55191/)
- [Chen-Kang "Kevin" Lee](https://www.linkedin.com/in/chen-kang-lee-64058a18b/)
- [Rachel Massey](https://www.linkedin.com/in/rachel-massey-02483312/)
- [Drew Nguyen-Phan](https://www.linkedin.com/in/drewnguyenphan/)
- [Jonah Osband](https://www.linkedin.com/in/jonah-osband/)
- [Sayim Shazlee](https://www.linkedin.com/in/sayim-shazlee/)