# Project: Single File Catalog
[TOC]
## Overview
The current spec requires that the `catalog_file` key point to a CSV file. In some cases, it would be useful to embed the catalog "table" directly in the catalog itself, yielding a so-called single-file catalog. STAC has an extension that does this (see [here](https://github.com/radiantearth/stac-spec/tree/master/extensions/single-file-stac)).
## Deliverables - Functionality
- Make the `catalog_file` key optional and support a new `catalog_dict` key, a JSON dictionary that holds the data that would otherwise be in the CSV. Exactly one of the two keys is required; the catalog creator chooses which.
The `catalog_dict` value can be expressed in one of two layouts:
**Option 1) dict**: dict like ```{column -> {index -> value}}```:
```yaml
{
"esmcat_version":"0.1.0",
"id":"aws-cesm1-le",
"description":"This is an ESM collection for CESM1 Large Ensemble Zarr dataset publicly available on Amazon S3 (us-west-2 region)",
"catalog_dict":{
'component':{
0:'atm',
1:'atm',
2:'atm',
3:'atm',
4:'atm'
},
'frequency':{
0:'daily',
1:'daily',
2:'daily',
3:'daily',
4:'daily'
},
'experiment':{
0:'20C',
1:'20C',
2:'20C',
3:'20C',
4:'20C'
},
'variable':{
0:'FLNS',
1:'FLNSC',
2:'FLUT',
3:'FSNS',
4:'FSNSC'
},
'path':{
0:'s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.zarr',
1:'s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC.zarr',
2:'s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT.zarr',
3:'s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS.zarr',
4:'s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC.zarr'
}
},
"attributes":[
{
"column_name":"component",
"vocabulary":""
},
{
"column_name":"frequency",
"vocabulary":""
},
{
"column_name":"experiment",
"vocabulary":""
},
{
"column_name":"variable",
"vocabulary":""
}
],
"assets":{
"column_name":"path",
"format":"zarr"
},
"aggregation_control":{
"variable_column_name":"variable",
"groupby_attrs":[
"component",
"experiment",
"frequency"
],
"aggregations":[
{
"type":"union",
"attribute_name":"variable",
"options":{
"compat":"override"
}
}
]
}
}
```
or
**Option 2) records**: ```[{column -> value}, ... , {column -> value}]```
```yaml
{
"esmcat_version":"0.1.0",
"id":"aws-cesm1-le",
"description":"This is an ESM collection for CESM1 Large Ensemble Zarr dataset publicly available on Amazon S3 (us-west-2 region)",
"catalog_dict":[
{
'component':'atm',
'frequency':'daily',
'experiment':'20C',
'variable':'FLNS',
'path':'s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.zarr'
},
{
'component':'atm',
'frequency':'daily',
'experiment':'20C',
'variable':'FLNSC',
'path':'s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC.zarr'
},
{
'component':'atm',
'frequency':'daily',
'experiment':'20C',
'variable':'FLUT',
'path':'s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT.zarr'
},
{
'component':'atm',
'frequency':'daily',
'experiment':'20C',
'variable':'FSNS',
'path':'s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS.zarr'
},
{
'component':'atm',
'frequency':'daily',
'experiment':'20C',
'variable':'FSNSC',
'path':'s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC.zarr'
}
],
"attributes":[
{
"column_name":"component",
"vocabulary":""
},
{
"column_name":"frequency",
"vocabulary":""
},
{
"column_name":"experiment",
"vocabulary":""
},
{
"column_name":"variable",
"vocabulary":""
}
],
"assets":{
"column_name":"path",
"format":"zarr"
},
"aggregation_control":{
"variable_column_name":"variable",
"groupby_attrs":[
"component",
"experiment",
"frequency"
],
"aggregations":[
{
"type":"union",
"attribute_name":"variable",
"options":{
"compat":"override"
}
}
]
}
}
```
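The two layouts above correspond to what pandas produces with `DataFrame.to_dict(orient='dict')` and `DataFrame.to_dict(orient='records')`. The sketch below uses a toy two-row table (not the full catalog) to illustrate that both layouts can be turned back into the same table with `pd.DataFrame(...)`. It is only an illustration of the data shapes, not part of the spec.

```python
import pandas as pd

# Toy two-row table mirroring some of the columns used in the examples above.
df = pd.DataFrame(
    {
        "component": ["atm", "atm"],
        "variable": ["FLNS", "FLNSC"],
        "path": [
            "s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.zarr",
            "s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC.zarr",
        ],
    }
)

# Option 1: {column -> {index -> value}}
as_dict = df.to_dict(orient="dict")

# Option 2: [{column -> value}, ..., {column -> value}]
as_records = df.to_dict(orient="records")

# Either layout can be passed straight to the DataFrame constructor.
from_dict = pd.DataFrame(as_dict)
from_records = pd.DataFrame(as_records)
```

Option 2 repeats the column names in every row, while Option 1 repeats the row index in every column, so neither is obviously more compact for wide or long catalogs.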
### esm-collection-spec side
- [ ] Update the [specification file](https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md)
- [ ] Update the [validator script](https://github.com/NCAR/esm-collection-spec/blob/master/esmcol_validator/validator.py)
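On the validator side, the "exactly one of the two keys" rule could be checked roughly as follows. This is a minimal sketch: `check_catalog_keys` is a hypothetical helper name, not an existing function in `esmcol_validator`, and it assumes the collection JSON has already been loaded into a Python dict.

```python
def check_catalog_keys(col_data):
    """Return the catalog key in use: 'catalog_file' or 'catalog_dict'.

    Hypothetical helper: raises ValueError unless exactly one of the
    two keys is present in the loaded collection dictionary.
    """
    present = [k for k in ("catalog_file", "catalog_dict") if k in col_data]
    if len(present) != 1:
        raise ValueError(
            "expected exactly one of 'catalog_file' or 'catalog_dict', "
            f"found {present or 'neither'}"
        )
    return present[0]


# Usage: a collection that embeds its table passes the check.
key = check_catalog_keys({"id": "aws-cesm1-le", "catalog_dict": []})
```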
### intake-esm side
- [ ] Add a new `catalog_type` argument to the [`serialize()` method](https://github.com/NCAR/intake-esm/blob/master/intake_esm/core.py#L131) (open question).
- Accepted values for `catalog_type` would be `dict` (write the `catalog_dict` key) and `file` (write the `catalog_file` key).
- Raise an error if `catalog_type` is not in `{"dict", "file"}`.
- Open question: should we add an `orient` argument to control the layout of the dictionary? For instance, `orient='dict'` would yield Option 1 and `orient='records'` would yield Option 2 described above.
```python
def serialize(self, name, catalog_type='file',
              orient='records', directory=None):
    ...
```
- [ ] Update the [`_fetch_catalog()` method](https://github.com/NCAR/intake-esm/blob/master/intake_esm/core.py#L126) so that it can read the catalog from the dictionary specified in `catalog_dict`:
```python
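from functools import lru_cache

import pandas as pd


@lru_cache(maxsize=None)
def _fetch_catalog(self):
    """Get the catalog content and cache it."""
    if 'catalog_file' in self._col_data:
        # External table: read the CSV that `catalog_file` points to.
        return pd.read_csv(self._col_data['catalog_file'])
    # Embedded table: build the DataFrame from `catalog_dict`.
    return pd.DataFrame(self._col_data['catalog_dict'])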
```
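The two intake-esm changes above can be exercised together as a write/read round trip. The sketch below is an assumption-laden illustration, not the actual intake-esm implementation: `serialize_sketch` and `fetch_catalog_sketch` are hypothetical free functions standing in for the methods, and the output file naming (`<name>.json`, `<name>.csv`) is assumed for the example.

```python
import json
from pathlib import Path

import pandas as pd


def serialize_sketch(col_data, df, name, catalog_type="file",
                     orient="records", directory=None):
    """Hypothetical stand-in for `serialize()`; writes `<name>.json`."""
    if catalog_type not in {"dict", "file"}:
        raise ValueError(
            f"catalog_type must be 'dict' or 'file', got {catalog_type!r}"
        )
    directory = Path(directory or ".")
    # Drop any existing catalog key so exactly one is written back.
    out = {k: v for k, v in col_data.items()
           if k not in {"catalog_file", "catalog_dict"}}
    if catalog_type == "file":
        csv_path = directory / f"{name}.csv"
        df.to_csv(csv_path, index=False)
        out["catalog_file"] = str(csv_path)
    else:
        # orient='dict' gives Option 1, orient='records' gives Option 2.
        out["catalog_dict"] = df.to_dict(orient=orient)
    json_path = directory / f"{name}.json"
    json_path.write_text(json.dumps(out, indent=2))
    return json_path


def fetch_catalog_sketch(json_path):
    """Read back either variant of the collection as a DataFrame."""
    col_data = json.loads(Path(json_path).read_text())
    if "catalog_file" in col_data:
        return pd.read_csv(col_data["catalog_file"])
    return pd.DataFrame(col_data["catalog_dict"])
```

One point that may bear on the `orient` question: JSON object keys are always strings, so with `orient='dict'` the integer row index is written as `"0"`, `"1"`, ... and comes back as a string index when the file is reloaded. The `records` layout carries no index and does not have this issue.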
## Milestones - Metrics (TODO)
## Timeline (TODO)
## References
- [https://github.com/NCAR/esm-collection-spec/issues/13](https://github.com/NCAR/esm-collection-spec/issues/13)
- [https://github.com/NCAR/intake-esm/pull/179#issuecomment-553630201](https://github.com/NCAR/intake-esm/pull/179#issuecomment-553630201)
- [https://github.com/NCAR/intake-esm/issues/166](https://github.com/NCAR/intake-esm/issues/166)
###### tags: `ncar` `pangeo`