# Tidy Cubes (or Tidy Analysis-Ready Cubes)
**Authors** Emma Marshall, Deepak Cherian, Scott Henderson
[alphabetical from second author on]
**Notebook** https://github.com/dcherian/tidy-xarray/blob/main/tidy-xarray.ipynb
## Status
See questions from Emma's [notes](https://docs.google.com/document/d/1am3Axx699cOAq8V7ShzRHADGSWUbHeCa4B-zX-fMqjc/edit):
> 1. When should I use dimensional v. non-dimensional coordinates?
> 1. How should I organize/store metadata that I want to use for indexing/selecting data?
I think the following somewhat answers them.
We need more examples, and more conceptual thinking.
## Abstract (SciPy Proposal)
We attempt a definition of "tidy data" for labelled array objects.
"Data Tidying", for dataframes, is conceptualized as structuring datasets to facilitate analysis (Wickham, 2014).
For practical purposes, we focus on the Xarray data model.
**Sentence about Xarray data model**
A common source of confusion: which variables should be coordinate variables vs. which should be data variables.
Some ties to "Analysis Ready Data": implications for data creators and implications for usability.
## Background/Context
The concept of "tidy data" has been defined for dataframes.
> To get a handle on the problem, this paper focuses on a small, but important, aspect of data cleaning that I call data tidying: structuring datasets to facilitate analysis. - [Hadley Wickham](https://tidyr.tidyverse.org/articles/tidy-data.html)
These principles are:
1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.
## Axioms
1. Data analysts primarily work with logical cubes rather than single assets. The following recommendations stem from optimizing for cube construction (see the sketch below).
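A minimal sketch of what "optimizing for cube construction" looks like in practice, assuming three single-time assets (built in memory here; on disk they would be separate files) that share a grid. All names and values are hypothetical.

```python
import numpy as np
import xarray as xr

# Three hypothetical assets, each holding one time slice with a scalar
# "time" coordinate.
assets = [
    xr.Dataset(
        {"vx": (("y", "x"), np.random.rand(3, 4))},
        coords={"time": year},
    )
    for year in [1992, 1993, 1994]
]

# One logical cube from many assets: the scalar "time" coordinates become
# the concatenation dimension.
cube = xr.concat(assets, dim="time")
```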
## Questions
1. Attribute or variable?
2. Data variable or coordinate variable?
- independent observables should be data variables
- does every other variable provide context to these observables?
3. Coordinate variable or dimension?
- For paired images, do we use extra variables or an "image" dimension with values `img1`, `img2`? (see the sketch after this list)
- this suggests a new function
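To make the paired-image question concrete, here is a sketch of the two candidate layouts; the names (`img1`, `img2`, the "image" dimension) are hypothetical.

```python
import numpy as np
import xarray as xr

data = np.zeros((2, 10, 10))

# Option A: one data variable per image.
as_variables = xr.Dataset(
    {"img1": (("y", "x"), data[0]), "img2": (("y", "x"), data[1])}
)

# Option B: a single variable with an "image" dimension.
as_dimension = xr.Dataset(
    {"img": (("image", "y", "x"), data)},
    coords={"image": ["img1", "img2"]},
)
```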
## Principles
1. Every variable is a field.
- different observables should be different variables
- example:
- InSAR: a `vx1992` data variable conflates the field `vx` with the time value `1992`; the tidy form is a single `vx` variable with a `time` coordinate (see the sketch after this list)
1. If present, a dimension's coordinate values must be associated with a variable of the same name.
- example:
- InSAR: `nx`, `ny` dims whose coordinate values live in separate `xaxis`, `yaxis` data variables
- Aquarius: a `phony_dim0` dimension(!) with its coordinate values stored in an attribute
1. Scalar physical fields are best represented as scalar coordinate variables rather than attributes.
- example:
- InSAR
- Aquarius
- One ill-formed principle:
- While data creators may prefer to create multiple assets (files) on disk, data analysts work with cubes.
- Here we prioritize the need to construct tidy cubes.
- If the scalar metadata field is invariant across the assets, store it as an attribute.
- If it varies across assets that may be concatenated to form a cube, that scalar metadata should be a coordinate variable (see the sketch after this list).
- E.g., time bounds or a "granule_id" stored as an attribute on a single asset is not a useful organization of the data once assets are combined into a cube.
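A minimal sketch of the three principles applied to an InSAR-like dataset. The layout (`vx1992`/`vx1993`, `xaxis`/`yaxis`, `phony`-style dims) follows the examples above, but the specific values and the `GRANULE_ID` attribute are hypothetical.

```python
import numpy as np
import xarray as xr

ny, nx = 3, 4
messy = xr.Dataset(
    {
        "vx1992": (("ny", "nx"), np.random.rand(ny, nx)),
        "vx1993": (("ny", "nx"), np.random.rand(ny, nx)),
        "xaxis": ("nx", np.linspace(0.0, 300.0, nx)),
        "yaxis": ("ny", np.linspace(0.0, 200.0, ny)),
    },
    attrs={"GRANULE_ID": "A123"},
)

# Principle 2: a dimension's coordinate values live in a variable with the
# dimension's name.
tidy = (
    messy.rename_dims({"nx": "x", "ny": "y"})
    .rename_vars({"xaxis": "x", "yaxis": "y"})
    .set_coords(["x", "y"])
)

# Principle 1: one field per variable; the year moves out of the variable
# name and into a "time" coordinate.
vx = xr.concat(
    [
        tidy[name].rename("vx").expand_dims(time=[year])
        for name, year in [("vx1992", 1992), ("vx1993", 1993)]
    ],
    dim="time",
)
tidy = tidy.drop_vars(["vx1992", "vx1993"])
tidy["vx"] = vx

# Principle 3: scalar metadata that varies across assets becomes a scalar
# coordinate rather than an attribute.
tidy = tidy.assign_coords(granule_id=messy.attrs["GRANULE_ID"])
```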
## Things that are not handled yet.
- Avoid unnecessary broadcasting: a 1D `x` coordinate should not be represented by a 2D `x` variable (see the sketch after this list)
- minimize dimensions.
- "what other shapes/throughlines does the data have that are important but maybe not crucial to its shape?"
- Tom Augspurger pointed out that there are multiple ways to translate STAC to Xarray:
- ODC (Dataset) & stackstac (DataArray) use different ideas: [github discussion](https://github.com/opendatacube/odc-stac/issues/54#issuecomment-1103313511)
- Bands as dimensions or data_vars?
- is each band an observable? Yes, they are independent measurements.
- `to_dataset(BAND_DIMENSION)` converts a band dimension into per-band data variables (see the sketch after this list)
- band dimension is good for visualization
- but this is a software limitation
- What about polarizations?
- Key metadata is stored in file names :face_palm:
- Analysis-ready data
- HLS is an interesting case; COG
- has a bunch of pre-processing / atmospheric corrections
- but still needs work to assemble into a cube;
- breaking out the bands
- analysis ready values vs analysis ready structure
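A small sketch of collapsing an unnecessarily broadcast 2D `x` variable back to 1D, assuming every row holds the same values; the names are hypothetical.

```python
import numpy as np
import xarray as xr

x2d = xr.DataArray(np.tile(np.arange(4.0), (3, 1)), dims=("y", "x"))
x1d = x2d.isel(y=0, drop=True)  # 1D coordinate values, no broadcasting
```

And a sketch of the two STAC-to-Xarray layouts discussed above; the band names and the variable name `reflectance` are hypothetical.

```python
# stackstac-style: a single DataArray with a "band" dimension.
da = xr.DataArray(
    np.zeros((2, 3, 3)),
    dims=("band", "y", "x"),
    coords={"band": ["red", "nir"]},
    name="reflectance",
)

# odc-stac-style: each band becomes an independent data variable.
ds = da.to_dataset(dim="band")
```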
## Things that are very Xarray specific
- Set as coordinate variables those variables that you want propagated when a DataArray is extracted from a Dataset (see the sketch below).
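A small sketch of that propagation behavior; the variable and coordinate names are hypothetical.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "velocity": (("y", "x"), np.zeros((3, 4))),
        "error": (("y", "x"), np.zeros((3, 4))),
    },
    coords={"satellite": "sentinel-1"},
)

# Pulling out one data variable keeps the coordinates ("satellite")
# but not the other data variables ("error").
velocity = ds["velocity"]
assert "satellite" in velocity.coords
```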
-----
## Notes
helpful links:
- https://tidyr.tidyverse.org/articles/tidy-data.html
- https://vita.had.co.nz/papers/tidy-data.pdf
### notes on tidyr paper
- tidying: structuring datasets to facilitate analysis
- tidy framework lets you focus on interesting domain problem, not data logistics
- every messy dataset messy in its own way
- Tidy data is a standard way of mapping the meaning of a dataset to its structure.
- Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).
- "A general rule of thumb is that it is easier to describe functional relationships between variables (e.g., z is a linear combination of x and y, density is the ratio of weight to volume) than between rows, and it is easier to make comparisons between groups of observations (e.g., average of group a vs. average of group b) than between groups of columns."
- "standard way of mapping meaning of dataset to its structure"
- for dfs:
+ every col is variable
+ every row is an obs
+ every cell is a single value
### tidy data for xarray - what would it look like?
think about fundamental nature, shape of data
- how many core dimensions (x,y,time, others?)
- what other shapes/throughlines does the data have that are important but maybe not crucial to its shape? (ie. )
+ minimize dimensions
+ store 'data' as vars, not as additional coords/dims
+ one var per observation type
* i.e. you don't want a separate variable for each year of a dataset (InSAR ice velocity dataset cleaning example)
#### Examples of messy data: what is messy about it? How do we 'tidy' it?
##### insar ice velocity
- too many variables
- what is the min. # of variables you can have to represent your data?
- coordinate info is stored as a variable
- diff. between a coordinate and a variable
- **Key principle?: does it give structure/context to an obs or is it an obs itself?**
- not enough dimensions
- to start has x,y dimensions, but you can see there is a time dim represented in the data vars
- if there is a pattern or connection between variables, probably want to represent as a coord or dim
- **Key principle?: independent data vars**
- no coordinates
- the 'data' representing the shape of the dataset and the observations the dataset is trying to convey are treated the same currently
- need to have a separation between metadata describing a (velocity) observation and the observation itself
- if it is metadata you want to use in processing/analysis (e.g. location of the observation (x, y coords), orbital direction of acquisition, imaging mode, etc.), represent it as a coordinate that can be selected on; if it is relevant but won't be used directly in the pipeline, store it as an attr? (see the sketch below)
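A sketch of "selectable metadata becomes a coordinate"; the acquisition metadata (`orbit_direction`) and its values are hypothetical.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"v": (("time", "y", "x"), np.random.rand(3, 2, 2))},
    coords={"time": [1992, 1993, 1994]},
)

# Promote per-acquisition metadata to a non-dimension coordinate along "time".
ds = ds.assign_coords(
    orbit_direction=("time", ["ascending", "descending", "ascending"])
)

# The coordinate can now drive selection directly.
ascending = ds.where(ds.orbit_direction == "ascending", drop=True)
```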
##### s1 rtc images