Beginning with netCDF version 4.8.0, the Unidata NetCDF group
has extended the netcdf-c library to provide access to cloud
storage (e.g. Amazon S3 [1]) by providing a mapping from a
subset of the full netCDF Enhanced (aka netCDF-4) data model to
a variant of the Zarr [4] data model that already has mappings
to key-value pair cloud storage systems.
The NetCDF version of this storage format is called NCZarr [2].
NCZarr uses a data model [2] that is, by design, similar to,
but not identical with, the Zarr Version 2 Specification [4].
Briefly, the data model supported by NCZarr is netCDF-4 minus
user-defined types and the String type. As with netCDF-4, it
supports chunking. Eventually it will also support filters in a
manner similar to the way filters are supported in netCDF-4.
Specifically, the model supports the following.
With respect to full netCDF-4, the following concepts are
currently unsupported.
NCZarr support is enabled if the --enable-nczarr option
is used with './configure'. If NCZarr support is enabled, then
a usable version of libcurl must be specified
using the LDFLAGS environment variable (similar to the way
that the HDF5 libraries are referenced).
Refer to the installation manual for details.
NCZarr support can be disabled using the --disable-nczarr option.
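For example, a hypothetical configure invocation (the library path is
illustrative and will vary by installation) might look like this:

    LDFLAGS="-L/usr/local/lib" ./configure --enable-nczarr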
In order to access an NCZarr data source through the netCDF API, the
file name normally used is replaced with a URL with a specific
format.
The URL follows the usual scheme://host:port/path?query#fragment format.
There are some details that are important.
The fragment part of a URL is used to specify information
that is interpreted to specify what data format is to be used,
as well as additional controls for that data format.
For NCZarr support, the following key=value pairs are allowed.
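As an illustration, here is a minimal C sketch showing how such a URL
is passed to the ordinary nc_open() call; the bucket, region, and path
names are hypothetical.

    #include <stdio.h>
    #include <netcdf.h>

    int main(void) {
        int ncid, stat;
        /* Hypothetical dataset URL; the fragment selects the NCZarr S3 map. */
        const char* url =
            "https://s3.us-east-1.amazonaws.com/examplebucket/dataset#mode=nczarr,s3";
        if ((stat = nc_open(url, NC_NOWRITE, &ncid))) {
            fprintf(stderr, "nc_open failed: %s\n", nc_strerror(stat));
            return 1;
        }
        /* ... read dimensions, variables, and attributes as with any netCDF file ... */
        nc_close(ncid);
        return 0;
    }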
Internally, the nczarr implementation has a map abstraction
that allows different storage formats to be used.
This is closely patterned on the approach used in
the Python Zarr implementation, which relies on the Python
MutableMapping class [3].
In NCZarr, the corresponding type is called zmap.
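To make the abstraction concrete, here is a conceptual C sketch of such
a key-value map interface. The names and signatures below are purely
illustrative; they are not the actual libnczarr zmap declarations.

    #include <stddef.h>

    /* Illustrative only: a minimal key-value map interface in the spirit
       of zmap; the real libnczarr API differs in names and details. */
    typedef struct ZMapExample {
        /* Read the object stored under key into buf (at most buflen bytes). */
        int (*read)(struct ZMapExample* map, const char* key,
                    void* buf, size_t buflen);
        /* Create or overwrite the object stored under key. */
        int (*write)(struct ZMapExample* map, const char* key,
                     const void* buf, size_t buflen);
        /* Report the size of the object stored under key. */
        int (*len)(struct ZMapExample* map, const char* key, size_t* sizep);
        void* state; /* implementation-specific state: S3 client, directory root, etc. */
    } ZMapExample;

Each storage format (s3, nz4, nzf) then supplies its own implementation
of these operations.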
The primary zmap implementation is s3 (i.e. mode=nczarr,s3), which
indicates that Amazon S3 cloud storage is to be used. Other storage
formats use a structured NetCDF-4 file (mode=nczarr,nz4) or a
directory tree (mode=nczarr,nzf).
The latter two are used mostly for debugging and testing.
However, the nzf format is important because it is intended
to match a corresponding storage format used by the Python
Zarr implementation. Hence it should serve to provide
interoperability between NCZarr and Python Zarr.
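For illustration, the three maps might be selected with URLs such as
the following; the paths and bucket name are hypothetical.

    https://s3.us-east-1.amazonaws.com/examplebucket/dataset#mode=nczarr,s3
    file:///home/user/dataset.nz4#mode=nczarr,nz4
    file:///home/user/dataset.nzf#mode=nczarr,nzf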
The NCZarr format extends the pure Zarr format by adding
extra objects such as .nczarr and .nczvar. It is possible
to suppress the use of these extensions so that the netcdf
library can read and write a pure zarr formatted file.
This is controlled by using the mode=nczarr,zarr combination.
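As a sketch (the local path is hypothetical), a pure-Zarr dataset can
be written through the standard netCDF C API by combining the two mode
values:

    #include <stdio.h>
    #include <netcdf.h>

    int main(void) {
        int ncid, stat;
        /* mode=nczarr,zarr suppresses the .ncz* extension objects, producing
           a dataset readable by other Zarr implementations. Hypothetical path. */
        if ((stat = nc_create("file:///tmp/dataset.nzf#mode=nczarr,zarr",
                              NC_NETCDF4, &ncid))) {
            fprintf(stderr, "nc_create failed: %s\n", nc_strerror(stat));
            return 1;
        }
        /* ... define dimensions, variables, and attributes as usual ... */
        nc_close(ncid);
        return 0;
    }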
The NCZarr support has a logging facility.
Turning on this logging can
sometimes give important information. Logging can be enabled by
using the client parameter "log" or "log=filename", or by
setting the environment variable NCLOGGING.
The first case will send log output to standard error and the
second will send log output to the specified file. The environment
variable is equivalent to log.
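For example (hypothetical path and log file name, and assuming the
usual '&' separator between fragment key=value pairs), appending the
log parameter to the URL fragment directs log output to the named file:

    ncdump "file:///home/user/dataset.nzf#mode=nczarr,nzf&log=nczarr.log"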
The Amazon AWS S3 storage driver currently uses the Amazon
AWS S3 Software Development Kit for C++ (aws-sdk-cpp).
In order to use it, the client must provide some configuration
information. Specifically, the ~/.aws/config file should
contain something like this.
[default]
output = json
aws_access_key_id=XXXX...
aws_secret_access_key=YYYY...
The notion of "addressing style" may need some expansion.
Amazon S3 accepts two forms for specifying the endpoint
for accessing the data: virtual-host style and path style, respectively.

    https://<bucketname>.s3.<region>.amazonaws.com/
    https://s3.<region>.amazonaws.com/<bucketname>/

The NCZarr code will accept either form, although internally
it is standardized on path style.
The NCZarr storage format is almost identical to that of
the standard Zarr version 2 format.
The data model differs as follows.
Consider both NCZarr and Zarr, and assume S3 notions of bucket
and object. In both systems, Groups and Variables (Arrays in Zarr)
map to S3 objects. Containment is modelled using the fact that
the container's key is a prefix of the variable's key.
So, for example, if variable v1 is contained in the top-level group g1
(whose key is /g1), then the key for v1 is /g1/v1.
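Under that hypothetical layout, an S3 listing of the dataset's keys
might include entries such as the following (the .z-prefixed metadata
objects are described next, and chunk keys use the Zarr row-major
naming):

    /g1/.zgroup
    /g1/v1/.zarray
    /g1/v1/0.0
    /g1/v1/0.1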
Additional information is stored in special objects whose names
start with ".z".
In Zarr, the following special objects exist.
The NCZarr format uses the same group and variable (array) objects
as Zarr. It also uses the Zarr special .zXXX objects.
However, NCZarr adds some additional special objects.
.nczarr -- this is in the top-level group (key /.nczarr).
It is in effect the "superblock" for the dataset and contains
any netCDF-specific dataset-level information.
.nczgroup -- this is a parallel object to .zgroup and contains
any netCDF-specific group information. Specifically, it contains the following.
These lists allow walking the NCZarr dataset without having to use
the potentially costly S3 list operation.
.nczvar -- this is a parallel object to .zarray and contains
netCDF-specific variable information. Specifically, it contains the following.
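As a purely illustrative sketch (the authoritative key names are given
by the NCZarr specification [2], not by this example), a .nczgroup
object might hold JSON along these lines, listing the dimensions,
variables, and subgroups of the group:

    {
        "dims": {"d1": 10, "d2": 20},
        "vars": ["v1"],
        "groups": ["g2"]
    }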
With some constraints, it is possible for an NCZarr library to read
Zarr and for a Zarr library to read the NCZarr format.
The latter case, Zarr reading NCZarr, is possible if the Zarr library
is willing to ignore objects whose names it does not recognize,
specifically anything beginning with .ncz.
The former case, NCZarr reading Zarr, is also
possible if the NCZarr library can simulate or infer the contents of
the missing .nczXXX objects. As a rule this can be done as follows.
Here are a couple of examples using the ncgen and ncdump utilities.

    ncgen -4 -lb -o "file:///home/user/dataset.nzf#mode=nczarr" dataset.cdl
    ncdump "file:///home/user/dataset.nzf#mode=nczarr"
    ncgen -4 -lb -o "s3://datasetbucket" dataset.cdl
    ncgen -4 -lb -o "s3://datasetbucket#mode=zarr" dataset.cdl
[1] Amazon Simple Storage Service Documentation
[2] NetCDF ZARR Data Model Specification
[3] Python Documentation: collections -- High-performance container datatypes
[4] Zarr Version 2 Specification
In order to use the S3 storage driver, it is necessary to
install the Amazon aws-sdk-cpp library.
As a starting point, here are the CMake options used by Unidata
to build that library. It is assumed that the command is executed
in a build directory, build say, and that build/../CMakeLists.txt exists.
cmake -DFORCE_CURL=ON -DBUILD_ONLY=s3 -DMINIMIZE_SIZE=ON -DBUILD_DEPS=OFF -DCMAKE_CXX_STANDARD=14 ..
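After configuration succeeds, the usual CMake build-and-install
sequence applies; the exact commands depend on the generator and
platform, but on a Unix system with Makefiles this is typically:

    make all
    sudo make install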
The Amazon S3 cloud storage imposes some significant limits
that are inherited by NCZarr (and Zarr also, for that matter).
Some of the relevant limits are as follows:
Author: Dennis Heimbigner
Email: dmh at ucar dot edu
Initial Version: 4/10/2020
Last Revised: 4/12/2020