# Pyramid Develops Data Pipeline to Drive ML/AI Models

Written by [Chen-Kang "Kevin" Lee](https://www.linkedin.com/in/chen-kang-lee-64058a18b/)

## 1. Introduction

The explosion of information generated by individuals and organizations over the last 40 years has created the need for new data analysis methods and techniques. Increasingly powerful hardware is required to manage these massive datasets; however, not all individuals and businesses are equipped with the resources necessary to host and maintain a machine learning pipeline. Cloud-based machine learning removes this barrier by eliminating the user's responsibility to maintain hardware, providing greater flexibility and scalability.

Over the course of a four-month summer program, Pyramid interns took on the project of developing an end-to-end data pipeline in the Google Cloud Platform (GCP) that automatically ingests, prepares, and analyzes open-source data. The pipeline adapts to various datasets by allowing users to plug in different machine learning (ML) models and preprocessing code. Additionally, leveraging cloud services like auto-scaling allows the pipeline to process large datasets. This project serves as an architectural blueprint to aid Pyramid data scientists during future tech challenges when required to quickly prototype a data pipeline that can process data and draw insights.

## 2. Project Overview

Before creating the pipeline, we defined a data science/ML problem to solve. The focus of this project was the pipeline structure itself, so we selected a simple problem that would allow us to remain focused on exploring tools to implement each stage of the pipeline. We chose the task of identifying banks located in disaster **hot zones,** or areas most vulnerable to natural disasters, using the public [Bank Data from the Federal Deposit Insurance Corporation (FDIC)](https://banks.data.fdic.gov/docs/).
A map of disaster hot zones was generated by running clustering algorithms on the [disaster dataset made public by the Federal Emergency Management Agency (FEMA)](https://www.fema.gov/about/openfema/data-sets). We divided the pipeline design into four stages: **Ingestion, Preparation, Analysis, and Visualization,** creating an end-to-end workflow from raw data to easy-to-read graphs presented on a website that we developed. Google Cloud Storage (GCS) buckets were used as interfaces between stages, storing the intermediate representation of the data at each stage. We chose GCP as the host for its comprehensive suite of services and documentation; this also allowed Pyramid to expand its solutions beyond AWS to a new platform. The following chart shows which GCP services were used in each stage. In the next section, we will dive into the details of implementation.

![](https://i.imgur.com/0HeCXR4.png)

Our development team followed a one-week sprint cycle, and the entire project was completed over a 10-week summer internship in 2022. The focus of each sprint roughly aligned with the stages of the pipeline, though the first four weeks were dedicated to research and planning. The following graphic shows the work completed during each sprint.

![](https://i.imgur.com/gYtQ58X.png)

## 3. Deep Dive

### 3.1 Ingestion

Ingestion was implemented using an Apache Airflow operator on Google Cloud Composer. The operator downloaded data from the source (public data from the FDIC and FEMA websites) into a GCS bucket. More details on Apache Airflow and Cloud Composer can be found in Section 3.5. Parallelizing the ingestion process to make it run faster was not an option for our pipeline because both the FDIC and FEMA datasets were downloaded as single files; however, ingestion can benefit from parallelization in problems with larger datasets that are sourced from multiple files.
An Airflow operator can be set up for each source file, and Cloud Composer will schedule them to run in parallel when possible.

### 3.2 Preparation

After data ingestion was complete, we prepared the raw data for analysis. Our three primary preprocessing subtasks were as follows: **filter incomplete or extraneous FEMA disaster entries; convert the Federal Information Processing Standard (FIPS) codes used by FEMA; and convert the street address data used by the FDIC.** Preprocessing was completed using Google Dataflow, a cloud-based data processing service that optimizes performance by automatically identifying opportunities for parallelization and data shuffling.

FEMA has been recording disasters since 1969. More recent entries include more detailed information, whereas older entries have missing fields. Preprocessing removed older entries with missing values. Additionally, we omitted data related to disasters that occurred outside of the contiguous United States (e.g., in Alaska or Hawaii) because these disasters formed their own separate clusters, making the presentation less clear. Because FEMA data identifies locations by FIPS code, we replaced the FIPS codes with the midpoint of each county according to data from the US Census Bureau. This strategy enabled geographic clustering based on latitude and longitude data. FDIC data also required conversion because branch locations are listed as street addresses. We utilized ArcGIS, a geocoder in the GeoPy library, to map the branches' street addresses to a longitude and latitude.

Google Dataflow allowed us to configure the preprocessing operation using Apache Beam, which provides a unified programming model for both batch and streaming data processing. With Apache Beam, users can define the data processing pipeline using either Java or Python.
**PTransforms** are the basic building blocks of a Beam pipeline; they are used to perform a wide range of data processing tasks, such as filtering, mapping, aggregating, and joining. By chaining PTransforms together, users can perform complex data processing operations. Each PTransform takes one or more parallel collections (**PCollections**) as input, then produces one or more PCollections as output. PCollections are immutable, distributed datasets that can be partitioned and processed in parallel across multiple machines or nodes, which enables high-performance, scalable data processing.

The figure below illustrates the transformations defined in our preprocessing stage. Beginning with the raw data downloaded during ingestion into a GCS bucket, the data is passed through the PTransforms, which are depicted as orange rectangles. The gray dots represent the intermediate PCollections generated by the PTransforms. Finally, the preprocessed data is stored in a second GCS bucket titled "Clean Data."

![](https://i.imgur.com/EsYqmLi.png)

### 3.3 Analysis

When asked about natural disasters in the US, people instinctively associate certain disaster types with specific areas (e.g., hurricanes in the Southeast and tornadoes in the Midwest). A clustering algorithm using FEMA disaster data gives a more accurate map of hot zones for all disaster types. For this project, we applied three different algorithms to cluster the FEMA data: **kNN, DBSCAN, and Filtered DBSCAN.** After generating the hot zones with these clustering algorithms, we used k-means clustering to map bank locations to nearby clusters. In this section, we explore the rationale behind each clustering algorithm.

[kNN](https://ieeexplore.ieee.org/document/1053964) is a classic algorithm often favored for its simplicity; however, FEMA disaster data includes county-level details, so the data points were much denser on the eastern half of the US, where counties are geographically smaller.
This observation posed a challenge to kNN because the same threshold that generated well-formed clusters on the West Coast would have produced one big cluster encompassing the entire East Coast.

![](https://i.imgur.com/Ub35fFV.png)

[DBSCAN (Density-Based Spatial Clustering of Applications with Noise)](https://dl.acm.org/doi/10.5555/3001460.3001507) is a density-based clustering algorithm that groups together data points that are close to each other in the feature space. DBSCAN can identify and ignore outliers, which are data points that do not belong to any cluster. It also allows the creation of clusters with varying densities, which gives us more insightful clustering results across the whole country.

![](https://i.imgur.com/0YktXeP.png)

Some disasters are highly correlated with the time of year (e.g., blizzards occur during winter and hurricanes during summer). This correlation initially led us to the [ST-DBSCAN](https://www.sciencedirect.com/science/article/pii/S0169023X06000218) algorithm, an updated version of DBSCAN that clusters not only in the spatial dimension but also in the temporal one. However, disasters in the FEMA dataset vary greatly in length, which made it challenging to select a specific timestamp to represent each disaster entry. We settled on a middle-ground solution: group disasters by the month they begin, then run an updated DBSCAN to obtain clusters of varying density. We denote this process "Filtered DBSCAN."

![](https://i.imgur.com/bxZKspz.gif)

The code for the analysis is in the form of Python scripts. The automation done by Cloud Composer triggered Google Compute Engine instances and executed the scripts, which loaded the cleaned data from GCS buckets onto the local machine and ran the algorithms. Using Plotly, a graphing library, we plotted the bank branch addresses onto an interactive map of the US, then exported our plot into HTML elements stored in a GCS bucket.
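As a toy illustration of this analysis step (with invented coordinates and parameters, and using scikit-learn's DBSCAN rather than the project's actual scripts), the hot-zone clustering and branch assignment might be sketched as:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Invented (lat, lon) county midpoints with recorded disasters.
disasters = np.array([
    [29.7, -95.3], [29.9, -95.1], [30.0, -95.4],  # dense Gulf-coast area
    [40.7, -74.0], [40.8, -73.9],                 # dense Northeast area
    [45.0, -110.0],                               # isolated point -> noise
])

# eps is in degrees purely for illustration; the real pipeline also grouped
# disasters by starting month before clustering ("Filtered DBSCAN").
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(disasters)
# DBSCAN marks outliers with the label -1 instead of forcing them into a cluster.

# Map a bank branch to the nearest hot-zone centroid.
centroids = {
    label: disasters[labels == label].mean(axis=0)
    for label in set(labels) if label != -1
}
branch = np.array([29.8, -95.2])  # hypothetical Gulf-coast bank branch
nearest = min(centroids, key=lambda c: np.linalg.norm(centroids[c] - branch))
```

With these toy points, the three Gulf-coast entries form one cluster, the two Northeast entries another, the isolated point is labeled noise, and the branch is assigned to the Gulf-coast cluster.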
The HTML elements are subsequently embedded into the website we created in the visualization stage.

### 3.4 Visualization

To present the results, we built a [demo website](https://internship-355618.uk.r.appspot.com/) using the React library and hosted the web server on Google App Engine. Our website directly embeds the HTML elements, so to update a graph, we only need to update the HTML elements on the cloud.

### 3.5 Automation

Knowing that the FDIC and FEMA datasets update periodically, we designed the pipeline to execute every six months so that it always displays up-to-date disaster hot zones. Automation of the pipeline was achieved with Cloud Composer, which allowed us to describe the dependencies between tasks with an Apache Airflow Directed Acyclic Graph (DAG). Airflow supports operators that interact directly with GCP, from starting Beam jobs to launching Compute Engine instances. Depending on the task, users define their own DAG.

![](https://i.imgur.com/nYUuP0t.png)

## 4. Conclusion

Our team of interns assembled a pipeline on GCP that is designed to handle large volumes of data. This deliverable can easily be extended to accommodate new data sources and processing requirements. With this pipeline as a blueprint, Pyramid data scientists can focus on analysis and the generation of insights rather than spending time on the technicalities of building a data pipeline.

## Contributors

- [Lilia Hsueh](https://www.linkedin.com/in/lilia-hsueh-268b55191/)
- [Chen-Kang "Kevin" Lee](https://www.linkedin.com/in/chen-kang-lee-64058a18b/)
- [Rachel Massey](https://www.linkedin.com/in/rachel-massey-02483312/)
- [Drew Nguyen-Phan](https://www.linkedin.com/in/drewnguyenphan/)
- [Jonah Osband](https://www.linkedin.com/in/jonah-osband/)
- [Sayim Shazlee](https://www.linkedin.com/in/sayim-shazlee/)
