# indexing chunks etc

A sketch of what we might want to do in the yt stuff for dask, async, etc.

Basically, we currently have index objects on every dataset. I think the obvious next steps for getting us to work with dask involve first identifying the operations that are important, which I would put as derived fields (including spatial derived fields), generation of input to visualization routines, and high-level operations like max, mean, etc. These include dimensional reduction (projections).

* Make our dataset object, which currently is a 1:1 mapping to index objects, instead have the option to have multiple index objects. The first step would be to simply turn the index into a list and iterate over it.
* Modify our chunking system to operate asynchronously: eliminate the ability to access an array implicitly via `__getitem__` on a dataset, and make this an explicit operation.
* With that first step, for instance, we could have multiple particle indices associated with a single dataset -- say, co-registered halos and particles, or co-registered fluid/geographic datasets from different sources.

If we assume that each field type is associated with an index, this would simplify the process -- so we could have 1:N mappings from index to field types, but 1:1 from field types to index objects. Alternately, if we had non-overlapping index objects (or multiple discrete objects) we could allow multiple indexes for a single field type.

```
class Index:
    data_files: [DataFile] = []

    _identify_chunks(SelectorObject: selector) -> iterable[Chunk]
```

```
class Dataset:
    indexes: [Index] = []

    _add_index(Index: index)
    select(SelectorObject: selector) -> iterable[Chunk]
    execute_operation(iterable[Chunk]) -> async Result
```

Kacper suggests that maybe we should be looking at Chunks somewhat differently. If we separate out into coarse/refined operations, we could move our selection operations into the chunks, and have the index just yield the high-level chunks -- or even just have the index yield *all* of the chunks.

*(copying over some more notes from the slack discussion)*

Possible organization for constructing a set of objects:

* Index: provides "coarse" indexing, to determine which chunks get emitted for IO and fine-grained indexing
* Chunk: a set of data that has some fixed overhead for reading, and is not so big that we can't parallelize. This will also control fine-grained indexing, so only returning the data that is relevant to a selector.
* IO Handler: not sure we need this, but if we did, it could manage the state of IO.
* Dataset: has one or more indexes

So the flow for trivially parallel operations would be: the selector object goes to the index, which coarsely indexes and emits chunks (and if there are not enough chunks it could potentially subchunk them, dunno), and then each chunk becomes a dask or async future that reads from disk when IO occurs on it.
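As a rough illustration of that flow, here is a minimal sketch, assuming a hypothetical `identify_chunks` method on the index and a hypothetical `read_chunk` IO helper (neither is the real yt API), with a simple sum as the stand-in operation:

```python
import dask
import numpy as np


def read_chunk(chunk, field):
    # Hypothetical chunk IO: open the files backing this chunk, apply
    # fine-grained selection, and return a numpy array for `field`.
    return np.asarray(chunk[field])


def parallel_sum(index, selector, field):
    # Coarse indexing: the index emits only the chunks touched by the selector.
    chunks = index.identify_chunks(selector)
    # Each chunk read becomes a delayed future; no IO happens yet.
    partials = [dask.delayed(read_chunk)(ch, field) for ch in chunks]
    # Reduce across chunks; IO is triggered only when the graph is computed.
    return dask.delayed(sum)(
        [dask.delayed(np.sum)(p) for p in partials]
    ).compute()
```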
Right now, for a grid frontend, it goes something like this:

* Emit chunks that intersect with the data object, but also count the number of cells; this is expensive!
* Each chunk is read, then the grid objects select the data that hits the object (grid objects: `yt/data_objects/grid_patch.py`)
* The chunk is `yield`ed and operated on
* Chunks are discarded

## Things to Keep in Mind

For particle reading, there are a handful of specific things that need to be kept in mind:

* Particle unions are mostly managed in the IO handler, which suggests perhaps there should be a centralized "dependency" resolver. (We have a few of them.) This suggests that maybe we should have a routine that breaks down and re-assembles things from the IO handlers, whether that's in chunks or IO, etc.

For grids:

* Any grid that has a notion of "levels" needs to be able to ensure that the selection routines take into account "child" grids as well as "min"/"max" levels in the selector.

For both:

* There can be very expensive setup and teardown costs for IO, which we manage by pushing the setup/teardown into the individual IO handlers. For instance, even just navigating groups in an HDF5 file can take a lot of time.

## Particle IO

If we were to start afresh, here's what we could do:

* Each set of particles *pre-selection* could be a dask chunk.
* Each set of particles *post-selection* could be a dask chunk.
* Instead of managing the arrays internally using *any* chunking from yt, we instead utilize dask-native operations. This will break down very rapidly, but we can get some functionality back over time.

It's also possible that what we might want to do is skip the first item above and just do post-selection particles.

By the time the chunks get to the IO handler, it often is *just* something like this:

```python
def _read_particle_fields(self, chunks, ptf, selector):
    # Now we have all the sizes, and we can allocate
    data_files = set([])
    for chunk in chunks:
        for obj in chunk.objs:
            data_files.update(obj.data_files)
```

and then the chunks are not seen again -- only the `data_files`. *But*, the data files do get sorted and whatnot to attempt to reduce IO overhead, because the chunks themselves are sized according to some heuristics.

```python
for data_file in sorted(data_files, key=lambda x: (x.filename, x.start)):
    si, ei = data_file.start, data_file.end
    f = h5py.File(data_file.filename, mode="r")
    for ptype, field_list in sorted(ptf.items()):
        if data_file.total_particles[ptype] == 0:
            continue
        g = f[f"/{ptype}"]
        if getattr(selector, "is_all_data", False):
            mask = slice(None, None, None)
            mask_sum = data_file.total_particles[ptype]
            hsmls = None
        else:
            coords = g["Coordinates"][si:ei].astype("float64")
            if ptype == "PartType0":
                hsmls = self._get_smoothing_length(
                    data_file, g["Coordinates"].dtype, g["Coordinates"].shape
                ).astype("float64")
            else:
                hsmls = 0.0
            mask = selector.select_points(
                coords[:, 0], coords[:, 1], coords[:, 2], hsmls
            )
            if mask is not None:
                mask_sum = mask.sum()
            del coords
        if mask is None:
            continue
        for field in field_list:
            ...
```

So a few things. The ultimate result of this function is nothing more than an iterable of key/value pairs that associates *selected* particle types and fields with the numpy arrays. Note that it will still iterate over the chunks, but there's no attempt to line up a chunk ID with the particles. The particles are read in a consistent order (because they are assumed to be sorted internally, and we sort by a consistent sorting method) but they are otherwise not keyed to the chunks, just the data files. The result here is `yield (ptype, field), data`.

## Some things that will be a lot easier

* Pixelization! We can distribute this and sum the buffers: map the pixelization to an output buffer and a set of input buffers, and then reduce (a sketch of this follows the list).
* Derived quantities
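To make that map/reduce structure concrete, here is a minimal sketch; `pixelize_chunk` and the fixed buffer shape are assumptions for illustration, not yt's actual pixelizer interface:

```python
import dask
import numpy as np


def pixelize_chunk(chunk_data, buff_shape=(800, 800)):
    # Hypothetical per-chunk pixelization: deposit one chunk's data into a
    # fixed-size output buffer.
    buff = np.zeros(buff_shape)
    # ... deposit chunk_data into buff ...
    return buff


def distributed_pixelize(chunk_futures, buff_shape=(800, 800)):
    # Map: one delayed buffer per chunk.
    buffers = [dask.delayed(pixelize_chunk)(c, buff_shape) for c in chunk_futures]
    # Reduce: sum the per-chunk buffers into the final image.
    return dask.delayed(sum)(buffers).compute()
```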
## Some things that will maybe be a lot harder

* Spatially relevant fields

## Notes from a daskified particle IO attempt

The [dxl yt-dask-experiment repo](https://github.com/data-exp-lab/yt-dask-experiments) includes some initial chunking experiments with dask in the `dask_chunking` directory. That folder primarily experiments with using dask to do some manual IO on a gadget dataset. In this experiment, the gadget file reading is split into a series of dask-delayed operations that include file IO and selector application. The notebook [here](https://nbviewer.jupyter.org/github/data-exp-lab/yt-dask-experiments/blob/master/dask_chunking/gadget_dask_delayed.ipynb) gives an overview of that attempt, so I won't go into details here, but will instead focus on an attempt to use dask for file IO natively within *yt*.

The focus of this attempt falls within `BaseIOHandler._read_particle_selection()`. In this function, we pre-allocate arrays for all the particle fields that we're reading and then read in the fields for each chunk using the `self._read_particle_fields(chunks, ptf, selector)` generator. In the case of gadget, `_read_particle_fields` follows the code samples in the above section on [Particle IO](https://hackmd.io/pzGhEBGLSsakU6NLwsxkhA#Particle-IO). Also note that an open PR removes the preallocation of the particles ([PR 2416](https://github.com/yt-project/yt/pull/2416)).

One conceptually straightforward way to leverage dask is by building a `dask.dataframe` from delayed objects. Dask dataframes, unlike dask arrays, do not need to know the total size a priori, but we do need to specify a metadata dictionary to declare the column names and datatypes. So to build a `dask.dataframe`, we can do the following (from within `BaseIOHandler._read_particle_selection()`):

```python
ptypes = list(ptf.keys())
delayed_dfs = {}
for ptype in ptypes:
    # build a dataframe from delayed for each particle type
    this_ptf = {ptype: ptf[ptype]}
    delayed_chunks = [
        dask.delayed(self._read_single_ptype)(
            ch, this_ptf, selector, ptype_meta[ptype]
        )
        for ch in chunks
    ]
    delayed_dfs[ptype] = ddf.from_delayed(delayed_chunks, meta=ptype_meta[ptype])
```

In this snippet, we build up a dictionary of dask dataframes organized by particle type. First, let's focus on the actual application of the `dask.delayed` decorator:

```
dask.delayed(self._read_single_ptype)(
    ch, this_ptf, selector, ptype_meta[ptype]
)
```

This decorator wraps a `_read_single_ptype` function, which takes a single chunk, a single particle type dictionary (with multiple fields), and a column-datatype metadata dictionary, and returns a normal pandas dataframe:

```python
def _read_single_ptype(self, chunk, ptf, selector, meta_dict):
    # read a single chunk and single particle type into a pandas dataframe so that
    # we can use dask.dataframe.from_delayed! fields within a particle type should
    # have the same length?
    chunk_results = pd.DataFrame(meta_dict)
    # each particle type could be a different dataframe...
    for field_r, vals in self._read_particle_fields([chunk], ptf, selector):
        chunk_results[field_r[1]] = vals
    return chunk_results
```

We can see this function just calls our normal `self._read_particle_fields` and pulls out the field values as usual. We are storing the fields in columns of our single-chunk, single-particle-type pandas dataframe, `chunk_results`.
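Both snippets above rely on a `ptype_meta` dictionary that is not shown in this note; a minimal sketch of how it might be built, assuming every field is read as `float64` (as in the gadget reads above), is one empty, correctly-typed DataFrame per particle type:

```python
import pandas as pd

# Hypothetical construction of ptype_meta: an empty DataFrame per particle
# type, carrying only the column names and dtypes so that both
# pd.DataFrame(meta_dict) and ddf.from_delayed(..., meta=...) know the schema
# without knowing the number of rows.
ptype_meta = {}
for ptype, field_list in ptf.items():
    ptype_meta[ptype] = pd.DataFrame(
        {field: pd.Series(dtype="float64") for field in field_list}
    )
```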
So for a single particle type, we build our delayed dataframe:

```python
delayed_chunks = [
    dask.delayed(self._read_single_ptype)(
        ch, this_ptf, selector, ptype_meta[ptype]
    )
    for ch in chunks
]
delayed_dfs[ptype] = ddf.from_delayed(delayed_chunks, meta=ptype_meta[ptype])
```

The `delayed_dfs` object is a dictionary with a delayed dataframe for each particle type. The reason we're organizing by particle type is that different particle types may have a different number of records in a given chunk (otherwise we'd have to store a large number of null values).

In order to return the expected in-memory dict that `_read_particle_selection()` should return, we can very easily pull all the records from across our chunks and particle types with:

```python
rv = {}
for ptype in ptypes:
    for col in delayed_dfs[ptype].columns:
        rv[(ptype, col)] = delayed_dfs[ptype][col].values.compute()
```

So this is a fairly promising approach, particularly since the dataframes do not need to know the expected array size. And it does indeed work to read data from our particle frontends, with some modifications to the `ParticleContainer` class. The notebook [here](https://github.com/data-exp-lab/yt-dask-experiments/blob/master/dask_chunking/native_gadget_read.ipynb) shows the above approach in action, with notes on the required changes to the `ParticleContainer` class.
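As a self-contained reference for the pattern (outside of yt entirely), the toy example below shows the key property: `ddf.from_delayed` never needs the per-chunk sizes, which here stand in for per-chunk particle counts that are unknown until read time. The chunk reader, field names, and particle type label are all made up for illustration:

```python
import dask
import dask.dataframe as ddf
import numpy as np
import pandas as pd


def read_chunk(seed):
    # Toy stand-in for reading one chunk of one particle type: the number of
    # "selected" particles per chunk is not known until the read happens.
    rng = np.random.default_rng(seed)
    n = int(rng.integers(0, 1000))
    return pd.DataFrame({"Mass": rng.random(n), "Density": rng.random(n)})


# Schema-only metadata: column names and dtypes, no sizes.
meta = pd.DataFrame(
    {"Mass": pd.Series(dtype="float64"), "Density": pd.Series(dtype="float64")}
)

delayed_chunks = [dask.delayed(read_chunk)(seed) for seed in range(8)]
df = ddf.from_delayed(delayed_chunks, meta=meta)

# Pull columns back into plain numpy arrays, keyed like _read_particle_selection's
# return value.
rv = {("PartType0", col): df[col].values.compute() for col in df.columns}
print({k: v.shape for k, v in rv.items()})
```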
