# MetaLad Hackathon

## Computing playground (for those with a ``juseless`` account)

```
ssh -J <user>@juseless.inm7.de <user>@lad1
cd /playground
```

## Initial demo by Christian (9:30-11:00 AM CET)

```
# create a nested dataset structure
datalad create ds1 && cd ds1
echo '"3456"' > x.txt
echo '"xxxxxx"' > f1.txt
datalad create -d . sub1
echo '"1234567"' > sub1/x.txt
datalad create -d . sub2
echo '"1234567"' > sub2/sub2_f1.txt
datalad save -r

# extract dataset-level metadata
datalad meta-extract -d . metalad_core | jq

# try to dump metadata; returns "NoMetadataStoreFound"
# because nothing has been added yet
datalad meta-dump .

# add metadata by piping the extractor output into meta-add
datalad meta-extract -d . metalad_core | \
    datalad meta-add -

# add more metadata from another extractor
datalad meta-extract -d . metalad_example_dataset | \
    datalad meta-add -

datalad meta-dump -d .

# repeat the extraction with metalad_example_dataset; this changes
# the metadata, as it contains a timestamp
datalad meta-extract -d . metalad_example_dataset | \
    datalad meta-add -

# add metadata for a single file
datalad meta-extract -d . metalad_core x.txt | datalad meta-add -

# there are no metadata stores in the subdatasets yet
cd sub1
datalad meta-dump .

# add metadata to the subdataset
datalad meta-extract -d . metalad_core x.txt | datalad meta-add -

# dump the file-level metadata. THIS REQUIRES -r!
# (otherwise it would return dataset-level metadata, of which the subdataset has none)
datalad meta-dump . -r
# or address the file directly with a dump pattern
datalad meta-dump '.:x.txt'

# aggregate the subdataset metadata into the superdataset
cd ..
datalad meta-aggregate sub1
datalad meta-dump -r

# one can automate the subdataset extraction and superdataset aggregation with meta-conduct
datalad meta-conduct extract_metadata --pipeline-help

# save the current metadata
datalad meta-dump -r > before.json

# removing metadata (a complete removal would also need a git gc)
rm -rf .git/refs/datalad

# re-extract and aggregate file-level metadata across the dataset hierarchy
datalad meta-conduct extract_metadata \
    traverser.top_level_dir=$(pwd) \
    traverser.traverse_sub_datasets=True \
    traverser.item_type=file \
    extractor.extractor_type=file \
    extractor.extractor_name=metalad_core \
    adder.aggregate=True
```

## Discussions

### Topic 1: Metadata has no history - performance discussion

Metadata records are keyed by the chain of properties ``dataset-id``, ``dataset-version``, ``extractor-name``, ``extractor-version``. If metadata is added with exactly the same dataset-id, dataset-version, extractor-name, and extractor-version, the new metadata overwrites the old one.

Stephan asks whether it would be more performant to skip ``meta-add`` when the metadata already exists; he has a use case involving the institute archive with large amounts of file-level metadata, where ``meta-conduct`` takes days. Christian cautions that the look-up itself would be a larger performance hit and suspects that the slow behavior is due to slow executor performance. A rough sketch of the skip-if-present idea is at the end of these notes.

### Topic 2: the --recursive flag

- better name?
- under which circumstances should it return file-level metadata?

## TODO

- Add to docs:
  - suggestions for improving performance when handling large amounts of metadata
  - a thorough explanation of the `-r` switch and dump patterns
  - `meta-conduct`
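
## Appendix: sketch for Topic 1 (skipping redundant meta-add)

A minimal, hypothetical sketch of Stephan's suggestion from Topic 1: only run the extract/add pipeline for a file if no metadata record for it can be dumped yet. It assumes that ``datalad meta-dump`` exits non-zero when nothing matches the given ``dataset:path`` pattern, and the file names are simply the ones from the demo. Whether such a lookup is actually cheaper than an unconditional ``meta-add`` is exactly the question Christian raised, so this illustrates the idea rather than recommending it.

```
# Sketch only: skip extraction/addition for files that already have a
# metadata record. Assumes meta-dump exits non-zero when nothing matches.
for f in x.txt f1.txt; do
    if ! datalad meta-dump ".:${f}" > /dev/null 2>&1; then
        datalad meta-extract -d . metalad_core "${f}" | datalad meta-add -
    fi
done
```

Because records with an identical dataset-id/dataset-version/extractor-name/extractor-version key overwrite each other anyway, skipping the add can only save work; it does not change the resulting metadata store.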