## Meeting notes
**Background:** QC processing of microscopy images in a DataLad-based workflow takes considerably more time since it was moved from ``fastdata`` to ``largedata``. The aim of this meeting is to understand the analysis setup, the cause of the time losses, and ways to improve the situation.
**Data**: 29 microscopy TIFF files per brain section, at roughly 12 GB per file. 100 such files will be put into a subdataset, and the subdatasets will subsequently be organized into superdatasets, one per brain.
Individual data files move between three locations with different processing:

* ``ime262`` is the lab server taking images from the microscope
* ``largedata`` and ``fastdata`` are at the JSC. The ``/incoming`` directory is on ``fastdata`` and is the place where individual files are saved into datasets
* An SSH connection from the lab server to the JSC is slow, mostly due to a long (one-time) authentication and a ~600 ms latency to obtain a remote shell per process
**Process**: store the data at the JSC, and, in the process, perform QC analysis
**Two larger problem spaces were discussed**:
1) Most of the analysis was supposed to be done on ``fastdata``, but the JSC urged the group to do the analysis on ``largedata`` (via compute nodes that mount ``largedata``). Sadly, the analysis now takes much longer:
- ``datalad save`` and ``datalad push`` take a significant amount of time (~20 minutes). There is a logfile with some insights that was discussed in earlier meetings. Some of the causes discussed in this meeting are:
- ``save`` needs to evaluate the state of the working tree completely. Depending on the location and its parametrization, it may need to evaluate a substantial tree of files. On GPFS (used for both ``fastdata`` and ``largedata``), metadata queries are costly and likely contribute a large share of the time demands (though only for the initial retrieval from a central location; subsequently, metadata is cached by the file system). A ``datalad clone`` will be faster than a ``datalad save`` on this filesystem (see the first sketch after this list for a way to limit what ``save`` has to inspect).
- ``push`` will talk to the remote on ``largedata`` and perform additional metadata queries on the other GPFS system. Because ``push`` has to sync the annex state between remotes, at least two ``git push``es need to be performed. The upcoming ``0.16`` release comes with a switch to clone datasets in git-annex's "private mode", which can help with performance by preventing changes to the ``git-annex`` branch that would need to be propagated to remotes (the second sketch after this list shows the underlying mechanism). From Roman's observations, it seems they are running into a bug where parallelization via ``--jobs`` is not propagated properly. Michael will investigate the logs.
- In conjunction: interactions between the filesystem, the Git and git-annex commands, and their parametrization compound each other
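
Since an unconstrained ``save`` has to determine the state of the entire working tree, one mitigation is to pass the paths that are known to have changed, so the status evaluation is limited to them. A minimal sketch using the Python API that the profiling script below already uses; the dataset path and file names are placeholders modeled on that script:

```python
# Sketch: restrict what `save` inspects by passing explicit paths,
# instead of letting it walk the whole working tree on GPFS.
# Dataset path and file names are placeholders taken from the script below.
from datalad.distribution.dataset import Dataset

ds = Dataset('/p/fastdata/bigbrains/rmn/datalad_test/repo')
new_files = [f'B20_2931_Slice{i:02d}.tif' for i in range(1, 30)]
# Only these paths are status-checked and annexed; the rest of the tree is not walked.
ds.save(path=new_files, message='add slices for section 2931')
```

How much this helps depends on how much of the tree ``save`` currently has to stat; the logfile mentioned above should show that.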
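
For the ``push``/annex-branch side, the mechanism behind the mentioned "private mode" can also be used directly with plain git/git-annex. This is a sketch of that mechanism, not necessarily the switch the ``0.16`` release will provide; the source URL and target directory are placeholders, and a reasonably recent git-annex (>= 8.20210428) is assumed:

```python
# Sketch: clone a dataset's Git repository and initialize its annex in
# "private mode" (annex.private=true), so the clone records no location
# information in the git-annex branch and there is nothing extra to sync back.
import subprocess

def clone_private(source: str, dest: str) -> None:
    subprocess.run(['git', 'clone', source, dest], check=True)
    # Must be set before `git annex init`, which would otherwise record
    # this repository in the git-annex branch.
    subprocess.run(['git', '-C', dest, 'config', 'annex.private', 'true'], check=True)
    subprocess.run(['git', '-C', dest, 'annex', 'init'], check=True)

clone_private('<git URL or path of the dataset>', '/tmp/qc-clone')
```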
* Streaming access (file transfer operations) is fast and does not seem to be a bottleneck; the file system operations (``datalad save`` on ``fastdata``), however, are slow. One proposed solution, based on INM-7 experience with slow per-inode processing on a GPFS filesystem with UK Biobank data (solved back then by moving filesystem-heavy processes to temporary RAM disks), is to move the ``datalad save`` step to the lab server.
* If the dataset generation is moved to the lab server, the flow chart would change as follows:
* Create datasets on the lab server, then create a RIA store structure via ``datalad create-sibling-ria`` from the lab server ``ime262`` to the JSC:
```
datalad create <brain1>
cd <brain1>
datalad create -d . <subdataset-for-slices>
# Copy brain slice TIFFs into the subdataset
datalad save ...
# If this subdataset is a new one, set up the RIA sibling once
datalad create-sibling-ria -s <sibling-name> <local URL over NFS: ria+file://judac-largedata...>
datalad push --to <sibling-name>
```

* Commands like ``export-archive-ora`` can be used to bundle contents into archives (see the first sketch after this list)
* Michael will remember permission settings
* An alternative to using DataLad commands to push to the JSC is to place a store (or store contents) created on the lab server into the RIA store location via ``rsync`` or similar means (see the second sketch after this list)
* Clone the datasets out of the store to the place where the QC analysis happens
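
A minimal sketch of how ``export-archive-ora`` could be invoked; only its positional target (the archive file to create) is used, and both paths are placeholders:

```python
# Sketch: pack a dataset's annexed contents into an archive via the
# `datalad export-archive-ora` command mentioned above.
import subprocess

subprocess.run(
    ['datalad', 'export-archive-ora', '/tmp/brain1_archive.7z'],  # placeholder target
    cwd='/path/to/dataset',  # placeholder dataset location
    check=True,
)
```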
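
A sketch of the ``rsync``-then-clone route from the last two points; the host prefix, store paths, and the ``#~brain1`` alias are placeholders (an alias has to be registered in the store, otherwise the dataset ID goes into the clone URL):

```python
# Sketch: mirror a RIA store created on the lab server into the store location
# on largedata2 with rsync, then clone a dataset out of it for QC.
# Host, paths and the dataset alias are placeholders.
import subprocess
from datalad.api import clone

local_store = '/data/ria-store/'   # store created on ime262 (placeholder)
remote_store = 'judac:/p/largedata2/bigbrains/rmn/datalad_test/datastore/'

# A RIA store is a plain directory tree, so a straight rsync mirrors it.
subprocess.run(['rsync', '-a', '--partial', local_store, remote_store], check=True)

# On the JSC side, clone the dataset from the store for the QC analysis.
clone(
    source='ria+file:///p/largedata2/bigbrains/rmn/datalad_test/datastore#~brain1',
    path='/tmp/brain1-qc',
)
```

After such a plain copy, the permission settings mentioned above would likely need to be applied to the store.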
**Action items**
- Perform a comparison of timings of the current script on the lab server
- Test with 0.16/current ``master``
- Try it with an artificial dataset with large files (see the sketch below)
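
For the last two action items, a sketch of what a timing test with an artificial dataset of large files could look like; file count and size, paths, and the sibling name are placeholders, and sparse files keep disk usage low while still exercising hashing and the metadata-heavy ``save``/``push`` paths (transfer of sparse zeros is of course not representative of real TIFFs). Run it once with the installed release and once with current ``master`` to compare:

```python
# Sketch: time `save` and `push` on an artificial dataset with large files.
# Counts, sizes, paths and the sibling name below are placeholders.
import time
from pathlib import Path
from datalad.api import create
from datalad.distribution.dataset import Dataset

repo_path = Path('/p/fastdata/bigbrains/rmn/datalad_test/synthetic')
store_path = Path('/p/largedata2/bigbrains/rmn/datalad_test/synthetic_store')
n_files, file_size = 5, 2 * 1024**3          # 5 sparse files of 2 GB each

create(path=str(repo_path))
ds = Dataset(str(repo_path))
store_path.mkdir(parents=True, exist_ok=True)
ds.create_sibling_ria(url='ria+file://' + str(store_path),
                      name='datastore', existing='skip')

for i in range(n_files):
    with open(repo_path / f'synthetic_{i:02d}.bin', 'wb') as f:
        f.seek(file_size - 1)                 # create a sparse file of the target size
        f.write(b'\0')

t0 = time.monotonic()
ds.save(message='add synthetic files')
print(f'save took {time.monotonic() - t0:.1f} s')

t0 = time.monotonic()
ds.push(to='datastore', jobs=32)
print(f'push took {time.monotonic() - t0:.1f} s')
```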
2) Permission management and setup
* [Didn't get around to discussing this]
The script currently in use is shown below (the timestamps at the end illustrate the problem: ``save`` and ``push`` each take roughly 12 minutes, and the 32-job and 1-job runs barely differ):
```python
# run on JUDAC: (source /p/fastdata/bigbrains/pipeline/code/setupJurecaLogin.sh; python3 /p/fastdata/bigbrains/pipeline/code/zstk_core/discussions/datalad_profiling.py)
# run on Worker: (source /p/fastdata/bigbrains/pipeline/code/setupJurecaWorker.sh; python3 /p/fastdata/bigbrains/pipeline/code/zstk_core/discussions/datalad_profiling.py)
import os, sys
from datetime import datetime
from pathlib import Path
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))))
import zstk_core.pipey as pipey
from datalad.distribution.dataset import Dataset
from datalad.api import create # pylint: disable=no-name-in-module
path_from = '/p/fastdata/bigbrains/pipeline/zstack-view/B20/2900_2999/B20_2931_Slice%02i.tif'
path_to = '/p/fastdata/bigbrains/rmn/datalad_test/repo/B20_2931_Slice%02i.tif'
path_repo = '/p/fastdata/bigbrains/rmn/datalad_test/repo'
path_datastore = '/p/largedata2/bigbrains/rmn/datalad_test/datastore'
jobs = 32
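# parallelization level handed to save()/push() below; note that the 32-job and
# 1-job timings at the bottom barely differ (cf. the suspected --jobs issue above)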
def delDir(path):
os.system(f'chmod -R 770 {path}')
os.system(f'rm -fr {path}')
print('cleaning :', datetime.now())
delDir(path_repo)
delDir(path_datastore)
print("init begin :", datetime.now())
create(path=path_repo,description='init', dataset=None, cfg_proc='text2git')
ds = Dataset(path_repo)
Path(path_datastore).mkdir(parents=True, exist_ok=True)
ds.create_sibling_ria(url='ria+file://' + path_datastore, name='datastore', existing='skip')
ds.save(message='save init', recursive=False, jobs=jobs)
ds.push(to='datastore', jobs=jobs, force='checkdatapresent')
print('copy files :', datetime.now())
def copyFile(file_index: int):
os.system('cp -L ' + (path_from % file_index) + ' ' + (path_to % file_index))
list(pipey.Pipeline.map(copyFile, list(range(1,30))))
print('save files :', datetime.now())
ds.save(message='save init', recursive=False, jobs=jobs)
print('push files :', datetime.now())
ds.push(to='datastore', jobs=jobs, force='checkdatapresent')
print('done :', datetime.now())
# JUDAC 32 jobs
# cleaning : 2022-01-24 14:59:57.071188
# init begin : 2022-01-24 14:59:58.096311
# copy files : 2022-01-24 15:00:44.366280
# save files : 2022-01-24 15:01:56.692186
# push files : 2022-01-24 15:13:35.740835
# done : 2022-01-24 15:26:16.112858
# JUDAC 1 job
# cleaning : 2022-01-24 15:28:39.365718
# init begin : 2022-01-24 15:28:41.047388
# copy files : 2022-01-24 15:29:27.272002
# save files : 2022-01-24 15:30:42.828638
# push files : 2022-01-24 15:42:34.918862
# done : 2022-01-24 15:54:50.295356
```