# Pyramid Develops Data Pipeline to Drive ML/AI Models

Written by [Chen-Kang "Kevin" Lee](https://www.linkedin.com/in/chen-kang-lee-64058a18b/)

## 1. Introduction

The explosion of information generated by individuals and organizations over the last 40 years has created the need for new data analysis methods and techniques. Increasingly powerful hardware is required to manage these massive datasets; however, not all individuals and businesses are equipped with the resources necessary to host and maintain a machine learning pipeline. Cloud-based machine learning removes this barrier by eliminating the user's responsibility to maintain hardware, providing greater flexibility and scalability.

Over the course of a four-month summer program, Pyramid interns took on the project of developing an end-to-end data pipeline on Google Cloud Platform (GCP) that automatically ingests, prepares, and analyzes open-source data. The pipeline adapts to various datasets by allowing users to plug in different machine learning (ML) models and preprocessing code. Additionally, leveraging cloud services like auto-scaling allows the pipeline to process large datasets. This project serves as an architectural blueprint to aid Pyramid data scientists during future tech challenges when they need to quickly prototype a data pipeline that can process data and draw insights.

## 2. Project Overview

Before creating the pipeline, we defined a data science/ML problem to solve. The focus of this project was the pipeline structure itself, so we selected a simple problem that would allow us to remain focused on exploring tools to implement each stage of the pipeline. We chose the task of identifying banks located in disaster **hot zones**, or areas most vulnerable to natural disasters, using the public [Bank Data from the Federal Deposit Insurance Corporation (FDIC)](https://banks.data.fdic.gov/docs/). A map of disaster hot zones was generated by running clustering algorithms on the [disaster dataset made public by the Federal Emergency Management Agency (FEMA)](https://www.fema.gov/about/openfema/data-sets).

We divided the pipeline design into four stages: **Ingestion, Preparation, Analysis, and Visualization**, creating an end-to-end workflow from raw data to easy-to-read graphs presented in a website that we developed. Google Cloud Storage (GCS) buckets were used as interfaces between stages, storing the intermediate representation of the data at each stage. By choosing GCP, with its comprehensive suite of services and documentation, as the host, Pyramid was able to expand this solution beyond AWS to a new platform. The following chart shows which GCP services were used in each stage. In the next section, we will dive into the details of implementation.

![](https://i.imgur.com/0HeCXR4.png)

Our development team followed a one-week sprint cycle, and the entire project was completed over a 10-week summer internship in 2022. The focus of each sprint roughly aligned with the stages of the pipeline, though the first four weeks were dedicated to research and planning. The following graphic shows the work completed during each sprint.

![](https://i.imgur.com/gYtQ58X.png)

## 3. Deep Dive

### 3.1 Ingestion

Ingestion was implemented using an Apache Airflow operator on Google Cloud Composer. The operator downloaded data from the source (public data on the FDIC and FEMA websites) into a GCS bucket, as sketched below. More details on Apache Airflow and Cloud Composer can be found in section 3.5.
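As a rough illustration of this setup, here is a minimal sketch of what a single ingestion task might look like in an Airflow 2.x DAG on Cloud Composer. This is not the project's actual code: the bucket name, source URL, task IDs, and the six-month interval expressed as a `timedelta` are placeholder assumptions.

```python
# Minimal ingestion sketch (not the project's actual DAG): download a public
# file over HTTP and land it in a GCS bucket. Names and URLs are placeholders.
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import storage

RAW_BUCKET = "raw-data-bucket"                                  # hypothetical bucket name
FEMA_SOURCE_URL = "https://example.com/fema_disasters.csv"      # placeholder for the public download URL


def download_to_gcs(url: str, bucket_name: str, blob_name: str) -> None:
    """Fetch a public file and store it as an object in a GCS bucket."""
    response = requests.get(url, timeout=300)
    response.raise_for_status()
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_string(response.content)


with DAG(
    dag_id="ingest_open_data",
    start_date=datetime(2022, 6, 1),
    schedule_interval=timedelta(days=182),  # roughly every six months, matching the project's cadence
    catchup=False,
) as dag:
    ingest_fema = PythonOperator(
        task_id="ingest_fema_disasters",
        python_callable=download_to_gcs,
        op_kwargs={
            "url": FEMA_SOURCE_URL,
            "bucket_name": RAW_BUCKET,
            "blob_name": "raw/fema_disasters.csv",
        },
    )
```

Defining one such operator per source file is also what lets Cloud Composer schedule downloads in parallel, as discussed next.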
Parallelization to speed up ingestion was not an option for our pipeline, because both the FDIC and FEMA datasets were each downloaded as a single file; however, ingestion can benefit from parallelization in problems with larger datasets that are sourced from multiple files. An Airflow operator can be set up for each source file, and Cloud Composer will schedule them to run in parallel when possible.

### 3.2 Preparation

After data ingestion was complete, we prepared the raw data for analysis. Our three primary preprocessing subtasks were as follows: **filter incomplete or extraneous FEMA disaster entries; convert the Federal Information Processing Standard (FIPS) codes used by FEMA; and convert the street address data used by FDIC.** Preprocessing was completed using Google Dataflow, a cloud-based data processing service that optimizes performance by automatically identifying opportunities for parallelization and data shuffling.

FEMA has been recording disasters since 1969. More recent entries include more detailed information, whereas older entries have missing fields. Preprocessing effectively removed older entries with missing values. Additionally, we omitted data related to disasters that occurred outside of the contiguous United States (e.g., in Alaska or Hawaii) because these disasters formed their own separate clusters, making the presentation less clear.

Upon discovering that FEMA identifies counties by FIPS codes, we replaced each FIPS code with the midpoint of the corresponding county according to data from the US Census Bureau. This strategy enabled geographic clustering based on latitude and longitude. FDIC data also required conversion because branch locations are listed as street addresses. We utilized ArcGIS, a geocoder in the GeoPy library, to map each branch's street address to a latitude and longitude.

Google Dataflow allowed us to configure the preprocessing operation using Apache Beam, which provides a unified programming model for both batch and streaming data processing. With Apache Beam, users can define the data processing pipeline in either Java or Python. **PTransforms** are the basic building blocks of a Beam pipeline; they perform a wide range of data processing tasks, such as filtering, mapping, aggregating, and joining, and by chaining PTransforms together, users can express complex data processing operations. Each PTransform takes one or more parallel collections (**PCollections**) as input, then produces one or more PCollections as output. PCollections are immutable, distributed datasets that can be partitioned and processed in parallel across multiple machines or nodes, which enables high-performance, scalable data processing.

The figure below illustrates the transformations defined in our preprocessing stage. Beginning with the raw data downloaded during ingestion into a GCS bucket, the data is passed through the PTransforms, which are depicted as orange rectangles. The gray dots represent the intermediate PCollections generated by the PTransforms. Finally, the preprocessed data is stored in a second GCS bucket titled "Clean Data."

![](https://i.imgur.com/EsYqmLi.png)
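To make the chaining of PTransforms concrete, the following is a condensed, hypothetical Beam pipeline (Python SDK) that mirrors the FEMA preprocessing steps described above. The column names, bucket paths, FIPS lookup, and state set are illustrative stand-ins rather than the project's real values.

```python
# Condensed Beam sketch of the FEMA preprocessing steps (illustrative only).
import csv
import io

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumed column subset; the real FEMA export has many more fields.
FIELDS = ["state", "fipsCountyCode", "incidentType", "incidentBeginDate"]

# Hypothetical lookup built from US Census county midpoints: FIPS -> (lat, lon).
FIPS_TO_LATLON = {"06037": (34.31, -118.23)}  # truncated for brevity

# Truncated for brevity; the real set covers the lower 48 states plus DC.
CONTIGUOUS_US = {"CA", "TX", "FL", "NY"}


def parse_row(line: str) -> dict:
    """Parse one CSV line into a dict keyed by the assumed column names."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(FIELDS, values))


def attach_coordinates(row: dict) -> dict:
    """Replace the county FIPS code with the county midpoint coordinates."""
    lat, lon = FIPS_TO_LATLON[row["fipsCountyCode"]]
    return {**row, "lat": lat, "lon": lon}


def run() -> None:
    # For Dataflow, project, region, and temp_location options are also required.
    options = PipelineOptions(runner="DataflowRunner")
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadRawFema" >> beam.io.ReadFromText(
                "gs://raw-data-bucket/raw/fema_disasters.csv", skip_header_lines=1)
            | "ParseCsv" >> beam.Map(parse_row)
            | "DropIncomplete" >> beam.Filter(lambda r: all(r.get(f) for f in FIELDS))
            | "KeepContiguousUS" >> beam.Filter(lambda r: r["state"] in CONTIGUOUS_US)
            | "KeepKnownFips" >> beam.Filter(lambda r: r["fipsCountyCode"] in FIPS_TO_LATLON)
            | "FipsToLatLon" >> beam.Map(attach_coordinates)
            | "FormatCsv" >> beam.Map(lambda r: ",".join(str(r[k]) for k in [*FIELDS, "lat", "lon"]))
            | "WriteCleanData" >> beam.io.WriteToText("gs://clean-data-bucket/fema_clean", file_name_suffix=".csv")
        )


if __name__ == "__main__":
    run()
```

Each labeled step corresponds to one of the orange PTransform boxes in the figure above, with the gray intermediate PCollections implied between them.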
### 3.3 Analysis

When asked about natural disasters in the US, people instinctively associate certain disaster types with specific areas (e.g., hurricanes in the Southeast and tornadoes in the Midwest). A clustering algorithm applied to FEMA disaster data gives a more accurate map of hot zones for all disaster types.

For this project, we applied three different algorithms to cluster the FEMA data: **kNN, DBSCAN, and Filtered DBSCAN.** After generating the hot zones with these clustering algorithms, we used k-means clustering to map bank locations to nearby clusters. In this section, we explore the rationale behind each clustering algorithm.

[kNN](https://ieeexplore.ieee.org/document/1053964) is a classic algorithm often favored for its simplicity; however, FEMA disaster data includes county-level details, so the data points were much denser in the eastern half of the US, where counties are geographically smaller. This posed a challenge for kNN because the same threshold that generated well-formed clusters on the West Coast would have produced one big cluster encompassing the entire East Coast.

![](https://i.imgur.com/Ub35fFV.png)

[DBSCAN (Density-Based Spatial Clustering of Applications with Noise)](https://dl.acm.org/doi/10.5555/3001460.3001507) is a density-based clustering algorithm that groups together data points that are close to each other in the feature space. DBSCAN can identify and ignore outliers, which are data points that do not belong to any cluster. It also allows the creation of clusters with varying densities, which gives us more insightful clustering results across the whole country.

![](https://i.imgur.com/0YktXeP.png)

Some disasters are highly correlated with the time of year (e.g., blizzards occur during winter and hurricanes occur during summer). This correlation initially led us to the [ST-DBSCAN](https://www.sciencedirect.com/science/article/pii/S0169023X06000218) algorithm, an extension of DBSCAN that clusters not only in the spatial dimension but also in the temporal one. However, disasters in the FEMA dataset vary greatly in length, which made it challenging to select a single timestamp to represent each disaster entry. We decided on a middle-ground solution: group disasters by the month they begin, then run an updated DBSCAN to obtain clusters of varying density. We refer to this process as "Filtered DBSCAN."

![](https://i.imgur.com/bxZKspz.gif)

The analysis code takes the form of Python scripts. Cloud Composer triggered Google Compute Engine instances that executed the scripts, which loaded the cleaned data from GCS buckets onto the local machine and ran the algorithms, roughly as sketched below. Using Plotly, a graphing library, we plotted the bank branch locations on an interactive map of the US, then exported the plot as HTML elements stored in a GCS bucket. These HTML elements are embedded in the website we created in the visualization stage.
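As a point of reference, here is a minimal sketch of the core spatial clustering step. The article does not state which library implemented DBSCAN, so scikit-learn is assumed here; the `eps`, `min_samples`, file path, and column names are illustrative placeholders, and the month-by-month grouping used by Filtered DBSCAN is omitted.

```python
# Illustrative DBSCAN clustering of disaster locations (not the project's tuned script).
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Assumes the cleaned FEMA data was copied locally from the GCS bucket beforehand,
# e.g. with `gsutil cp "gs://clean-data-bucket/fema_clean*.csv" fema_clean.csv`.
disasters = pd.read_csv("fema_clean.csv")  # expects "lat" and "lon" columns

# Haversine distance works on radians, so convert degrees first.
coords = np.radians(disasters[["lat", "lon"]].to_numpy())

EARTH_RADIUS_KM = 6371.0
eps_km = 50.0  # placeholder neighborhood radius; the real value needs tuning

clustering = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,   # eps expressed in radians for the haversine metric
    min_samples=10,                 # placeholder density threshold
    metric="haversine",
    algorithm="ball_tree",
).fit(coords)

# Label -1 marks noise points; every other label is a disaster hot zone.
disasters["hot_zone"] = clustering.labels_
print(disasters["hot_zone"].value_counts())
```

Filtered DBSCAN would repeat roughly this step once per starting month, and the resulting cluster labels then feed the bank-to-cluster mapping and the Plotly map exported to HTML.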
### 3.4 Visualization

To present the results, we built a [demo website](https://internship-355618.uk.r.appspot.com/) using the React library and hosted the web server on Google App Engine. Because the website directly embeds the HTML elements, updating the graphs only requires updating the HTML elements in the cloud.

### 3.5 Automation

Knowing that the FDIC and FEMA datasets are updated periodically, we designed the pipeline to execute every six months so that it always displays up-to-date disaster hot zones. Automation of the pipeline was achieved with Cloud Composer, which allowed us to describe the dependencies between tasks with an Apache Airflow Directed Acyclic Graph (DAG). Airflow provides operators that interact directly with GCP, from starting Beam jobs to launching instances; users define their own DAG according to the tasks at hand.

![](https://i.imgur.com/nYUuP0t.png)

## 4. Conclusion

Our team of interns assembled a pipeline on GCP that is designed to handle large volumes of data. This deliverable can be easily extended to accommodate new data sources and processing requirements. With this pipeline as a blueprint, Pyramid data scientists can focus on analysis and the generation of insights rather than spending time on the technicalities of building a data pipeline.

## Contributors

- [Lilia Hsueh](https://www.linkedin.com/in/lilia-hsueh-268b55191/)
- [Chen-Kang "Kevin" Lee](https://www.linkedin.com/in/chen-kang-lee-64058a18b/)
- [Rachel Massey](https://www.linkedin.com/in/rachel-massey-02483312/)
- [Drew Nguyen-Phan](https://www.linkedin.com/in/drewnguyenphan/)
- [Jonah Osband](https://www.linkedin.com/in/jonah-osband/)
- [Sayim Shazlee](https://www.linkedin.com/in/sayim-shazlee/)
