Raphael Hagen
At CarbonPlan we frequently use terabyte-scale public climate datasets in our projects. This can be difficult due to the huge data volumes. Instead of trying to download massive datasets, we use cloud resources that allow us to perform computations close to the data. However, the data format can also cause difficulty if it isn't optimized for cloud-based computation. People often work around this by creating a separate, analysis-ready, cloud-optimized (ARCO) dataset, but that requires storing a second copy of the data, which can be expensive. By creating a reusable Kerchunk reference of the NASA-NEX-CMIP6 dataset, we avoided duplicating 36 terabytes of data, and our work on extreme heat was sped up by an order of magnitude. Creating and sharing Kerchunk references of public datasets in a data commons can help speed up everyone's work and enable open science.

### Two models for Scientific Computing

#### Download Model

Many remote-sensing, model, and assimilation datasets live in NASA DAACs or similar data centers. Gridded datasets are commonly stored in NetCDF or GeoTIFF formats. In the case of NetCDF, a dataset is usually composed of hundreds or thousands of individual files. Working with an entire dataset commonly involves downloading it to a local university cluster, or downloading a subset of it to a laptop. In some archival systems, you enter a global queue, where you wait for a robot arm to retrieve and read data from a tape archive. This **download -> subset -> analyze** model moves the data to your compute resources. It generally costs a lot of transfer and processing time and storage space (since data is duplicated), and it makes reproducibility difficult.

#### Data-proximate model

##### Cloud-optimized Datasets

Another model is to store datasets in *Analysis-Ready Cloud-Optimized (ARCO)* formats.
This model switches the paradigm and moves the computation to the data, allowing quicker analysis thanks to the nature of ARCO data: asynchronous reads of individual chunks, data-proximate computing, and easy scalability. Zarr, cloud-optimized GeoTIFF, and Parquet are examples of these formats.

##### Archival Datasets

One can also access archival data formats, such as NetCDF, with data-proximate computing. This provides some flexibility around your compute resources, but may have performance drawbacks compared to an ARCO dataset.

[Data Access Modes in Science (figure)](https://figshare.com/articles/figure/Data_Access_Modes_in_Science/11987466?file=22017009)

### Kerchunk

This is all great if your data is already in an ARCO format, but there are many reasons why it might not be. Perhaps you don't have the expertise or resources to process and store an ARCO copy of the archival dataset. Here is where Kerchunk comes in. Kerchunk allows you to create a reusable reference file for an archival dataset so that it can be read as if it were an ARCO format such as Zarr. Not only does this offer significant performance benefits, it also lets you create a "virtual reference dataset" by merging across variables and concatenating along a time dimension. By using Kerchunk, you can have the best of both worlds\*: data providers keep doing the work of maintaining, updating, storing, and publishing stable, common formats of the dataset, while you get the cloud-optimized read performance of an ARCO dataset. The Kerchunk references themselves are tiny and easily shared, which helps enable open science and reproducibility. In this model, data providers can keep their stable, time-tested data formats, and users still get cloud-native access patterns.

\* *depending on your chunking schema and use case*

### Real-world analysis using Kerchunk

Our work at CarbonPlan on climate impacts relies on access to high-quality climate datasets.
As part of our recent work on [extreme heat](https://carbonplan.org/research/extreme-heat-explainer), we created a global dataset of Wet Bulb Globe Temperature (WBGT) for multiple climate Shared Socioeconomic Pathways (SSPs). To calculate WBGT, we used the NASA-NEX-CMIP6 dataset, a spatially downscaled version of the CMIP6 archive composed of over 7,000 NetCDF files and nearly 36 terabytes in size. Instead of creating an ARCO copy of the NASA-NEX-CMIP6 dataset, we created a [repo](https://github.com/carbonplan/kerchunk-NEX-GDDP-CMIP6) to demonstrate how Kerchunk can be used to build a virtual reference dataset, which allowed us to speed up our WBGT calculation.

As shown in this [notebook](https://github.com/carbonplan/kerchunk-NEX-GDDP-CMIP6/blob/main/generation/parallel_reference_generation.ipynb), Kerchunk references can be generated in parallel if you have access to a distributed computing framework such as Dask. Here, we spun up 500 tiny cloud workers to create references for all the NetCDF files. With that many workers, it took about 30 minutes to create Kerchunk reference files for the entire 36-terabyte NASA-NEX-CMIP6 dataset. The resulting references take up only about 290 megabytes. The beauty of this approach is that the reference files only have to be created once. Now anyone can use them to read the NASA-NEX-CMIP6 dataset as if it were an ARCO dataset.

## Performance

Once we had created our references, we wanted to see if it was all worth it. With this virtual ARCO dataset, could we speed up our WBGT calculation? In [this section of the repo](https://github.com/carbonplan/kerchunk-NEX-GDDP-CMIP6/tree/main/comparison) we have two notebooks: [heat_datatree.ipynb](https://github.com/carbonplan/kerchunk-NEX-GDDP-CMIP6/blob/main/comparison/heat_datatree.ipynb) and [heat_open_mfdataset.ipynb](https://github.com/carbonplan/kerchunk-NEX-GDDP-CMIP6/blob/main/comparison/heat_openmfdataset.ipynb).
These detail two approaches to calculating WBGT: one loading all of the NetCDF files directly, and one using the Kerchunk references to read the dataset as if it were an ARCO dataset.

| Method                  | # of Input Datasets | Temporal Extent | # of Workers | Worker Instance Type | Time                  |
| ----------------------- | ------------------- | --------------- | ------------ | -------------------- | --------------------- |
| Archival dataset        | 20                  | 365 days        | 10           | m7i.xlarge           | 20 minutes 24 seconds |
| Cloud-optimized dataset | 20                  | 365 days        | 10           | m7i.xlarge           | 2 minutes 49 seconds  |

As the table shows, the ARCO/Kerchunk method took under three minutes, compared to 20 minutes for the download model. While these differences might not seem wildly important, keep in mind that this timing was measured on a small subset of the entire dataset. A nearly 10x speedup can take a computation from weeks to days.

### Try it yourself!

Feel free to use or redistribute the Kerchunk references we made. Be aware that the Kerchunk project is under active development, so you might find some sharp edges and breaking changes. Additionally, if your use case requires a different chunking schema than the underlying file chunking, you will want to look at projects such as [pangeo-forge-recipes](https://pangeo-forge.readthedocs.io/en/latest/) and [xarray-beam](https://xarray-beam.readthedocs.io/en/latest/), which are ETL tools for creating ARCO datasets in which the chunking can be modified. At the time of writing, Kerchunk supports NetCDF 3 and 4, GRIB2, TIFF/GeoTIFF, and FITS. Examples of using Kerchunk can be found in the [official docs](https://fsspec.github.io/kerchunk/) as well as the [Project Pythia cookbook](https://projectpythia.org/kerchunk-cookbook/README.html). We hope this example shows how useful Kerchunk can be for large-scale analysis of earth-science data.
## Attribution

This work was possible because of the development efforts of:

- [Martin Durant (Kerchunk)](https://github.com/martindurant)
- [Tom Nicholas (xarray-datatree)](https://github.com/TomNicholas)
- [Ori Chegwidden (WBGT heat-risk analysis)](https://github.com/orianac)
- [Max Jones](https://github.com/maxrjones)
- [Andrew Huang](https://github.com/ahuang11)

Pangeo-ML Kerchunk grant augmentation
