chavlin (@chavlin)
  • volume rendering intro for geodata in yt
    • this presentation: https://bit.ly/ytgeostuff
    • yt's overview on volume rendering: link
    • Chris's 2020 AGU poster: citeable archive, direct repo link
    • Other code/repos:
      • yt: https://yt-project.org/
      • ytgeotools: https://github.com/chrishavlin/ytgeotools
      • yt_idv: https://github.com/yt-project/yt_idv
  • Table of Contents: Overview · Experiments in daskifying yt · Development plan · Some Final Notes

    yt and Dask: an overview

    In the past months, I've been investigating and working on integrating Dask into the yt codebase. This document provides an overview of my efforts to date, but it is also meant as a preliminary YTEP (or pYTEP?) to solicit feedback from the yt community at an early stage, before getting too far into the weeds of refactoring.
  • return to main post

    1. (particle) data IO

    Full notebook available here. yt reads particle and grid-based data by iterating across chunks with frontend-specific IO functions. For gridded data, each frontend implements a _read_fluid_selection (e.g., yt.frontend.amrvac.AMRVACIOHandler._read_fluid_selection) that iterates over chunks and returns a flat dictionary with numpy arrays concatenated across chunks. For particle data, frontends must implement a similar function, _read_particle_fields, that is typically invoked within the BaseIOHandler._read_particle_selection function. In both cases, the read functions accept the chunks iterator, the fields to read, and a selector object (a minimal sketch of this read pattern follows below):

    ```python
    def _read_particle_fields(self, chunks, ptf, selector): ...
    def _read_fluid_selection(self, chunks, selector, fields, size): ...
    ```
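A minimal sketch of the read pattern described above, under the assumption that `chunks` is an iterable of chunk objects and `selector` applies a spatial selection; `ExampleIOHandler` and its `_read_chunk_data` helper are hypothetical stand-ins, not yt's actual internals:

```python
import numpy as np

class ExampleIOHandler:
    """Illustrative sketch of a frontend IO handler; not yt's real class."""

    def _read_fluid_selection(self, chunks, selector, fields, size):
        # accumulate per-chunk reads, then concatenate into flat numpy arrays
        data = {field: [] for field in fields}
        for chunk in chunks:
            for field in fields:
                # _read_chunk_data is a hypothetical helper that reads this
                # chunk's data from disk and applies the selector to it
                data[field].append(self._read_chunk_data(chunk, field, selector))
        return {field: np.concatenate(arrays) for field, arrays in data.items()}
```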
  • return to main post

    3. dask-unyt arrays

    yt uses unyt to track and convert units, so if we are using dask for IO and want to return delayed arrays, we need some level of dask-unyt support. In the notebook working with unyt and dask, I demonstrate an initial prototype of a dask-unyt array. In this notebook, I create a custom dask collection by subclassing the primary dask.array class and adding some unyt functionality in hidden sidecar attributes. This custom class is handled automatically by the dask scheduler, so that if we have a large dask array with a dask client running and we create our new dask-unyt array (a simplified sketch of the sidecar idea follows below), e.g.:

    ```python
    import dask.array as da
    from dask.distributed import Client

    client = Client(threads_per_worker=2, n_workers=2)
    ```
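A simplified sketch of the sidecar idea, assuming only that unyt and dask.array are installed; to stay short, this wrapper carries a unyt.Unit alongside a dask array rather than subclassing dask.array.Array as the notebook's prototype does:

```python
import dask.array as da
import unyt

class DaskUnytSketch:
    """Simplified stand-in for the prototype dask-unyt collection."""

    def __init__(self, darr, units):
        self._darr = darr  # the underlying (lazy) dask array
        self.units = unyt.Unit(units)  # the sidecar unit attribute

    def to(self, new_units):
        # fold the unit conversion factor into the dask graph lazily
        factor, _ = self.units.get_conversion_factor(unyt.Unit(new_units))
        return DaskUnytSketch(self._darr * factor, new_units)

    def compute(self):
        # materialize the array and attach units to the concrete result
        return unyt.unyt_array(self._darr.compute(), self.units)

# usage: build lazily, convert units, only compute at the end
x = DaskUnytSketch(da.ones(1000, chunks=100), "km")
print(x.to("m").compute())  # a unyt_array with values of 1000.0 and units of m
```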
  • return to main post

    2. profile calculation: operations on chunked data

    The prototype dask-Gadget reader above returns the flat numpy arrays expected by _read_particle_selection(), but if we could instead return dask objects, we could easily build up parallel workflows. In the notebook here, I explored how to create a daskified profile calculation under the assumption that our yt IO returns dask arrays. The notebook demonstrates two approaches. In the first, I use intrinsic dask array functions to recreate a yt profile using dask's implementation of np.digitize and then summing over the digitized bins. The performance isn't great, though, and it's not obvious how the approach could be extended to reproduce all of the functionality of ds.create_profile() in yt (multiple binning variables, returning multiple statistics at once, etc.). In the second approach, I instead work out how to apply the existing yt profile functions directly to the dask arrays. When you have a delayed dask array, you can loop over its chunks, apply a function to each chunk, and then reduce the results across the chunks. In pseudo-code, creating a profile looks like the sketch below:
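A minimal sketch of that map-and-reduce pattern, assuming `field` and `binfield` are dask arrays with matching chunking; the histogram-based binned mean here is a stand-in for yt's actual profile functions:

```python
import dask
import numpy as np

@dask.delayed
def profile_chunk(field_chunk, bin_chunk, bin_edges):
    # per-chunk accumulation: weighted sums and counts in each bin
    sums, _ = np.histogram(bin_chunk, bins=bin_edges, weights=field_chunk)
    counts, _ = np.histogram(bin_chunk, bins=bin_edges)
    return sums, counts

def daskified_profile(field, binfield, bin_edges):
    # one delayed task per chunk of the (lazy) input arrays
    partials = [
        profile_chunk(f, b, bin_edges)
        for f, b in zip(field.to_delayed().ravel(), binfield.to_delayed().ravel())
    ]
    results = dask.compute(*partials)  # chunks are processed in parallel
    sums = sum(r[0] for r in results)  # reduce the intermediate results
    counts = sum(r[1] for r in results)
    return sums / np.maximum(counts, 1)  # binned mean of the field
```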
  • Results from initial explorations in leveraging dask in yt

    There are a number of possible ways for yt to leverage dask to simplify existing parallel and lazy operations, but thus far we have focused on two related areas in which dask may prove particularly useful:

    • data IO: reading data from disk
    • processing data after reading

    Central to both of these areas is the chunk iterator that yt uses to index a dataset. The exact implementation depends on both the frontend and the datatype, but in general a single chunk is a range of data indices that may span datafiles or be a subset of a single datafile. Thus, data can be read in by separate processes in parallel, or sequentially on a single processor, by iterating and aggregating over chunks separately. In the case of IO, data can also be subselected by selector objects (e.g., slices, regions) before aggregation across chunks into yt arrays. And in the case of data processing, some computations, such as finding statistical measures of a data field (max, min, weighted means), can process chunks separately and return intermediate results that are then aggregated. In terms of dask, one straightforward conceptual approach, requiring fairly minimal refactoring of yt, is to use the dask.delayed() decorator to construct graphs that execute across chunks. In pseudocode this could look like the sketch below:
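A hedged pseudocode sketch of that approach; `chunk.read` stands in for whatever frontend-specific read-and-select happens per chunk:

```python
import dask
import numpy as np

@dask.delayed
def read_chunk(chunk, field, selector):
    # read one chunk's worth of the field and apply the selector;
    # chunk.read is a hypothetical stand-in for a frontend's chunk read
    return chunk.read(field, selector)

def read_field(chunks, field, selector):
    # build a graph with one delayed read per chunk
    delayed_reads = [read_chunk(chunk, field, selector) for chunk in chunks]
    arrays = dask.compute(*delayed_reads)  # executes the reads in parallel
    return np.concatenate(arrays)  # aggregate across chunks into a flat array
```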
  • Leveraging Dask in yt

    Chris Havlin (U. Illinois DXL, chavlin@illinois.edu)
    RHytHM Workshop, 2020-12-10
    with: Matt Turk, Madicken Munk, Kacper Kowalik

    ![ytdask](https://i.imgur.com/NjOEJhy.png =250x)

    What's to come: