# SciPy 2023 Talk Planning
## Notes
- Questions for Brian
  - What data is available?
  - Can we actually scale across heliocloud with Dask?
## Overall messages to convey
- Solar physics is cool
- Solar data has its own particular foibles
- sunpy was built before cloud computing was generally available; can it adapt to a cloud future?
- Show what works and what does not work!
## Tasks
- Fido client for searching AIA files on heliocloud
  - Can we take advantage of the registry that is being developed?
- Identify a dataset of interest
## Slides
- Intro to solar physics
  - as illustrated by data products
  - movies of eruptive events
  - composites of AIA+LASCO for extended corona
  - and here is an eclipse image (connect to eclipse that will pass over Austin next year)
  - "first multi-messenger astronomy"
- Intro to SunPy
  - as motivated by images shown in intro to solar
  - every image you saw has a different coordinate system, so we need a way to bundle image and coordinate system/metadata
  - also need to search for and download this data
  - enter sunpy!
  - strongly coupled to astropy: we depend on this package for most of our core functionality
  - introduce affiliated packages/ecosystem
- Difficulties of solar image data
  - because every image has a different coordinate system, we don't have ready-to-stack data; datacubes need to be built
  - Multi-PB of solar data, heterogeneous types
  - Do we want to discuss the problems of getting so much data from JSOC/VSO?
- Outstanding pain points
  - Problems for SunPy
    - WCS/xarray incompatibility (a chasm to build a bridge over)
    - Astropy Units and Dask not playing nicely
  - Pydata vs astro
- Live demo
  - search for the data we want via the new Fido client
  - load all of these images into maps
  - reproject all of these images to the same coordinate system (accounting for differential rotation)
  - show time series slices through data, show off ndcube+dask
  - do time lag analysis; show successful integration between dask and sunpy/ndcube
- Conclusions
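
The time-lag step in the live demo outline can be illustrated with a minimal sketch, assuming the lag is estimated as the peak of the cross-correlation between two light curves. The function name `time_lag`, the synthetic Gaussian pulses, and the 12-second cadence are illustrative assumptions, not the analysis code planned for the talk. It uses only numpy.

```python
import numpy as np


def time_lag(signal_a, signal_b, cadence=1.0):
    """Estimate how much signal_b lags signal_a, in units of `cadence`.

    A positive result means signal_b peaks later than signal_a.
    Hypothetical helper for illustration only.
    """
    # Normalize each series so the result does not depend on amplitude.
    a = (signal_a - signal_a.mean()) / signal_a.std()
    b = (signal_b - signal_b.mean()) / signal_b.std()
    # np.correlate(b, a, "full")[i] is sum_n b[n + k] * a[n] for lag k = lags[i].
    cross_corr = np.correlate(b, a, mode="full")
    lags = np.arange(-(a.size - 1), b.size)
    return lags[np.argmax(cross_corr)] * cadence


# Synthetic light curves: the same pulse, with the second delayed by 15 samples.
t = np.arange(200)
channel_a = np.exp(-0.5 * ((t - 80) / 10.0) ** 2)
channel_b = np.exp(-0.5 * ((t - 95) / 10.0) ** 2)

# Assumed cadence of 12 seconds per sample, chosen only for this example.
lag_seconds = time_lag(channel_a, channel_b, cadence=12.0)
```

In the demo, the same kind of calculation would be repeated for every pixel of an image stack, which is the part Dask is meant to scale.
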
## Abstract
https://cfp.scipy.org/2023/talk/DZBF7K/
### Summary
Over the last decade, the SunPy ecosystem, a Python solar data analysis environment, has evolved organically to serve the needs of scientists analyzing solar physics data, mostly on desktop and laptop computers. However, modern solar observatories are producing data volumes in the tens of petabytes, necessitating parallelized and out-of-core computation. HelioCloud is a cloud computing environment tailored for heliophysics research and colocated with many terabytes of solar physics data. In this talk, we will show how the SunPy ecosystem, combined with Dask on HelioCloud, can be used to efficiently process high-resolution solar data.
### Details
The SunPy ecosystem is a set of community-developed, free and open-source Python packages for solar data analysis. The ecosystem consists of the core sunpy package, which provides general capabilities such as data download, data structures, and coordinate transformations, as well as a growing set of affiliated packages that provide more application-specific functionality such as image processing techniques. The entire SunPy ecosystem depends heavily on the broader scientific Python ecosystem, including numpy, scipy, and scikit-image, and especially on astropy, a community Python package for astronomy.
Over the last decade, the SunPy ecosystem has evolved organically to serve the needs of scientists analyzing solar physics data. Analysis of observational solar data has traditionally been carried out on desktop or laptop computers or small compute clusters (see Bobra et al., 2020). This limitation is partly due to the solar physics community's longstanding reliance on the proprietary Interactive Data Language (IDL), whose scalability is limited in part by licensing restrictions. However, modern space- and ground-based solar observatories are producing data volumes in the tens of petabytes, necessitating parallelized and out-of-core computation. The surge in popularity of Python within the broader astronomy community, as well as the growing availability of computing resources, has led many solar researchers to use Python in cloud environments. All of these factors have propelled the development of HelioCloud. Inspired by similar science platforms for other disciplines, such as Pangeo, HelioCloud is a NASA-funded, AWS-backed cloud computing environment tailored for heliophysics research. HelioCloud provides both a dashboard for creating custom virtual machines and a JupyterLab interface. The latter allows for interactive, scalable computation enabled by Dask across many compute nodes. Most importantly, HelioCloud is colocated with nearly 1 petabyte of solar physics data, so researchers can perform their analysis without the added latency of downloading the data.
In this talk, we will demonstrate how the SunPy ecosystem, combined with Dask on HelioCloud, can be used to efficiently process high-resolution solar data. First, we will provide a brief description of the SunPy project with particular emphasis on the ndcube and sunkit-image affiliated packages. Next, we will provide a brief description of the JupyterLab interface of the HelioCloud platform. Finally, we will demonstrate a typical scientific workflow on HelioCloud by efficiently analyzing many hours' worth of solar active region evolution using sunpy, ndcube, sunkit-image, and Dask to scale out our computation over many workers. Additionally, we will discuss existing incompatibilities between Dask and the astropy ecosystem and how collaboration with the broader scientific Python community could resolve these frictions.
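
One way to picture the out-of-core workflow described above is the sketch below. It assumes each image is loaded lazily and stacked into a Dask array before a reduction along the time axis. `load_frame`, `build_lazy_cube`, and the small frame size are hypothetical stand-ins. A real workflow would read image files from HelioCloud storage and wrap the result with sunpy and ndcube, which this sketch does not attempt.

```python
import dask
import dask.array as da
import numpy as np

# Assumption: tiny frames keep the example fast (real AIA images are 4096 x 4096).
FRAME_SHAPE = (64, 64)


def load_frame(index):
    """Hypothetical stand-in for reading one image from storage."""
    return np.full(FRAME_SHAPE, float(index))


def build_lazy_cube(n_frames):
    """Stack lazily loaded frames into a (time, y, x) Dask array.

    Nothing is loaded until .compute() is called; each frame is one chunk,
    so different workers can load and reduce different frames.
    """
    frames = [
        da.from_delayed(dask.delayed(load_frame)(i), shape=FRAME_SHAPE, dtype=float)
        for i in range(n_frames)
    ]
    return da.stack(frames, axis=0)


cube = build_lazy_cube(10)           # lazy: only a task graph so far
time_average = cube.mean(axis=0)     # still lazy
result = time_average.compute()      # triggers loading and the reduction
```

With the default local scheduler this runs on a single machine. Running the same graph on a cluster would involve connecting a distributed scheduler first, which is outside the scope of this sketch.
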