# Script to split a SAM dataset definition
Here is the [gist of the script](https://gist.github.com/dingp/36aa66247fd3d4fa70e307a491a17e7a).
The script takes an existing SAM dataset definition as input and produces a list of datasets, each containing a subset of the files in the input dataset.
It does so by following these steps:
1. Count the total number of files in the input dataset. If `--max-files-per-set` is provided, calculate the minimum number of subsets needed to respect that limit; `--nsubsets` is ignored if it is smaller than this minimum.
2. If `--nsubsets` is specified and the resulting number of files per subset does not exceed `--max-files-per-set`, create exactly `--nsubsets` subsets.
3. Take a snapshot of the input dataset and create a new dataset definition constrained on the snapshot ID. This definition is the superset of the new subsets.
4. Create each subset by specifying the snapshot ID and a range of snapshot file numbers.
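
The subset arithmetic in steps 1, 2 and 4 can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `plan_subsets` is hypothetical, and it assumes that snapshot file numbers start at 1 and that any remainder is spread over the first subsets, which is consistent with the file counts shown in the examples further down.

```python
def plan_subsets(total_files, nsubsets=None, max_files_per_set=None):
    """Return one (first, last) snapshot file number range per subset.

    Hypothetical helper; ranges are inclusive and assumed to be 1-based.
    """
    if total_files <= 0:
        raise ValueError("the input dataset has no files")
    if max_files_per_set is not None:
        # Integer ceiling division: fewest subsets that respect the limit.
        min_subsets = -(-total_files // max_files_per_set)
        # The limit takes precedence over a too-small --nsubsets.
        if nsubsets is None or nsubsets < min_subsets:
            nsubsets = min_subsets
    if nsubsets is None:
        raise ValueError("specify nsubsets or max_files_per_set")

    # Spread the remainder over the first `extra` subsets.
    base, extra = divmod(total_files, nsubsets)
    ranges = []
    first = 1
    for i in range(nsubsets):
        size = base + 1 if i < extra else base
        ranges.append((first, first + size - 1))
        first += size
    return ranges


if __name__ == "__main__":
    for first, last in plan_subsets(1065, nsubsets=10, max_files_per_set=100):
        print(f"files {first}-{last} ({last - first + 1} files)")
```

Each range would then be combined with the snapshot ID to define one subset.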
## Environment setup and authentication
You can do the setup in either of the following ways on a DUNE GPVM node.
### Method 1 (if you are using `dunesw`)
```bash
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
export DUNESW_VERSION=v09_72_01d00 # change this to the version available on cvmfs.
setup dunesw $DUNESW_VERSION -q e20:prof
setup_fnal_security
```
### Method 2 (if you are setting up `samweb_client` only)
```bash
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
setup sam_web_client
export SAM_EXPERIMENT=dune
kx509
export ROLE=Analysis
voms-proxy-init -rfc -noregen -voms=dune:/dune/Role=$ROLE -valid 120:00
```
## Usage of the script
### Script helper
The script's help message is shown below. The script first prints the details of the input dataset and the list of new datasets to be created. By default, it asks for confirmation before creating them; use `-y`/`--yes` to bypass the prompt.
```bash
-bash-4.2$ ./split_sam_dataset.py -h
usage: split_sam_dataset.py [-h] [--dataset-name DATASET_NAME] [--prefix PREFIX]
                            [--suffix SUFFIX] [-y] [--max-files-per-set MAX_FILES_PER_SET]
                            [--nsubsets NSUBSETS]

Split SAM dataset into subsets.

optional arguments:
  -h, --help            show this help message and exit
  --dataset-name DATASET_NAME
                        SAM dataset name;
  --prefix PREFIX       Prefix to the dataset name after the split;
  --suffix SUFFIX       Suffix to the dataset name after the split;
  -y, --yes             batch mode, bypassing the prompt;
  --max-files-per-set MAX_FILES_PER_SET
                        Maixum number of files per subset (takes precendence over --nsubsets
                        if speficied);
  --nsubsets NSUBSETS   Number of subsets to be split into.
```
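
For readers who want to see how such an interface might be declared, here is a minimal `argparse` sketch reconstructed from the help message above. It is not taken from the script itself: the function name `build_parser` is hypothetical, and the `int` types for `--nsubsets` and `--max-files-per-set` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the help message."""
    parser = argparse.ArgumentParser(
        prog="split_sam_dataset.py",
        description="Split SAM dataset into subsets.",
    )
    parser.add_argument("--dataset-name", help="SAM dataset name")
    parser.add_argument("--prefix", help="Prefix to the dataset name after the split")
    parser.add_argument("--suffix", help="Suffix to the dataset name after the split")
    parser.add_argument("-y", "--yes", action="store_true",
                        help="batch mode, bypassing the prompt")
    # Assumption: both numeric options are parsed as integers.
    parser.add_argument("--max-files-per-set", type=int,
                        help="Maximum number of files per subset")
    parser.add_argument("--nsubsets", type=int,
                        help="Number of subsets to be split into")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```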
### Example 1 - Split a dataset into 10 subsets
```bash
./split_sam_dataset.py --prefix=dingpf --nsubsets 10 --dataset-name prodcosmics_corsika_protodunedp_mcc10
```
```bash
-bash-4.2$ ./split_sam_dataset.py --prefix=dingpf --nsubsets 10 --dataset-name prodcosmics_corsika_protodunedp_mcc10
--------------------------------------------------------------------------------
Definition Name: prodcosmics_corsika_protodunedp_mcc10
Definition Id: 74222
Creation Date: 2018-08-12T05:56:38+00:00
Username: dunepro
Group: dune
Dimensions: data_tier simulated and file_format artroot and application detsim and version v06_70_01 and dune.campaign mcc10 and file_name cosmics_protodunedp_%.root
--------------------------------------------------------------------------------
Total file count: 1065
Number of subsets: 10
--------------------------------------------------------------------------------
The following datasets will be created:
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_all_10_subsets (1065 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_1_out_of_10 (107 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_2_out_of_10 (107 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_3_out_of_10 (107 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_4_out_of_10 (107 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_5_out_of_10 (107 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_6_out_of_10 (106 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_7_out_of_10 (106 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_8_out_of_10 (106 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_9_out_of_10 (106 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_10_out_of_10 (106 files)
Do you want to continue? (y/n): y
Continuing...
--------------------------------------------------------------------------------
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_all_10_subsets -- 1065 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_1_out_of_10 -- 107 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_2_out_of_10 -- 107 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_3_out_of_10 -- 107 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_4_out_of_10 -- 107 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_5_out_of_10 -- 107 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_6_out_of_10 -- 106 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_7_out_of_10 -- 106 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_8_out_of_10 -- 106 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_9_out_of_10 -- 106 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052057_10_out_of_10 -- 106 files
```
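
The naming pattern visible in the output above can be reproduced with a short sketch. The function name `subset_names` is hypothetical, the timestamp is treated as an opaque string because its format is not documented here, and `--suffix` handling is left out because the output does not show it.

```python
def subset_names(prefix, dataset, stamp, nsubsets):
    """Return (superset_name, [subset_names]) following the pattern in the output.

    Hypothetical helper; `stamp` is whatever timestamp string the script embeds.
    """
    stem = f"{prefix}_{dataset}_{stamp}"
    superset = f"{stem}_all_{nsubsets}_subsets"
    subsets = [f"{stem}_{i}_out_of_{nsubsets}" for i in range(1, nsubsets + 1)]
    return superset, subsets


if __name__ == "__main__":
    superset, subsets = subset_names(
        "dingpf", "prodcosmics_corsika_protodunedp_mcc10", "202330052057", 10
    )
    print(superset)
    for name in subsets:
        print(name)
```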
### Example 2 - Split a dataset into subsets with at most 100 files per subset
```bash
-bash-4.2$ ./split_sam_dataset.py --prefix=dingpf --max-files-per-set 100 --nsubsets 10 --dataset-name prodcosmics_corsika_protodunedp_mcc10
--------------------------------------------------------------------------------
WARNING: --max-files-per-set is set.
WARNING: need to create at least 11 subsets.
WARNING: ignore --nsubsets=10.
--------------------------------------------------------------------------------
Definition Name: prodcosmics_corsika_protodunedp_mcc10
Definition Id: 74222
Creation Date: 2018-08-12T05:56:38+00:00
Username: dunepro
Group: dune
Dimensions: data_tier simulated and file_format artroot and application detsim and version v06_70_01 and dune.campaign mcc10 and file_name cosmics_protodunedp_%.root
--------------------------------------------------------------------------------
Total file count: 1065
Number of subsets: 11
--------------------------------------------------------------------------------
The following datasets will be created:
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_all_11_subsets (1065 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_1_out_of_11 (97 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_2_out_of_11 (97 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_3_out_of_11 (97 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_4_out_of_11 (97 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_5_out_of_11 (97 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_6_out_of_11 (97 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_7_out_of_11 (97 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_8_out_of_11 (97 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_9_out_of_11 (97 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_10_out_of_11 (96 files)
dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_11_out_of_11 (96 files)
Do you want to continue? (y/n): y
Continuing...
--------------------------------------------------------------------------------
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_all_11_subsets -- 1065 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_1_out_of_11 -- 97 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_2_out_of_11 -- 97 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_3_out_of_11 -- 97 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_4_out_of_11 -- 97 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_5_out_of_11 -- 97 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_6_out_of_11 -- 97 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_7_out_of_11 -- 97 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_8_out_of_11 -- 97 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_9_out_of_11 -- 97 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_10_out_of_11 -- 96 files
Created -- dingpf_prodcosmics_corsika_protodunedp_mcc10_202330052100_11_out_of_11 -- 96 files
```