# Data Submission & Exploration Workshop
April 3 and 4, 2023 1-4pm ET
###### tags: `Spring 2023` `Brain Image Library` `workshop`
## GitHub
<img src="https://i.imgur.com/hMES3Xp.png" width="10%" /><br>
Presentations and scripts for both days of the workshop are available here
* https://github.com/brain-image-library/workshops
## Resources
As a member of the [Brain Image Library](https://www.brainimagelibrary.org/) project you have access to
* A login node (`login.brainimagelibrary.org`) for accessing the computing resources (not for running computations)
* A virtual machine (`workshop.brainimagelibrary.edu`) for applications that require visualization, such as Fiji and MATLAB
* An Open OnDemand instance, `ondemand.brainimagelibrary.org`, with access to JupyterLab and RStudio
* A set of compute nodes managed by a scheduler (SLURM)
* 8 large-memory compute nodes that can be accessed using SLURM from within the virtual machine
* 1 GPU-enabled compute node (access upon request)
:::info
:bulb: The VM `workshop.brainimagelibrary.edu` is generally online for use by members of the Brain Image Library. If a resource is unavailable, or becomes unavailable for updates or upgrades, you will receive a notification from the team.
:::
## Connecting to the `workshop` VM
Open a terminal and run the command
```bash
ssh <your-username>@workshop.brainimagelibrary.edu
```
For example,
```bash
ssh icaoberg@workshop.brainimagelibrary.edu
icaoberg@workshop.brainimagelibrary.edu's password:
Last login: Mon Jan 24 10:46:38 2023 from pool-71-162-2-190.pitbpa.fios.verizon.net
********************************* W A R N I N G ********************************
You have connected to workshop.bil.psc.edu
This computing resource is the property of the Pittsburgh Supercomputing Center.
It is for authorized use only. By using this system, all users acknowledge
notice of, and agree to comply with, PSC polices including the Resource Use
Policy, available at http://www.psc.edu/index.php/policies. Unauthorized or
improper use of this system may result in administrative disciplinary action,
civil charges/criminal penalties, and/or other sanctions as set forth in PSC
policies. By continuing to use this system you indicate your awareness of and
consent to these terms and conditions of use.
LOG OFF IMMEDIATELY if you do not agree to the conditions stated in this warning
Please contact support@psc.edu with any comments/concerns.
********************************* W A R N I N G ********************************
```
If you can see the message above when you connect, then you should be ready to start using the resources.
Tools you can use to connect to any of these resources:
| Tool name | Operating System |
| -------- | -------- |
| [iTerm2](https://iterm2.com) | macOS |
| [Terminal](https://support.apple.com/guide/terminal/welcome/mac) | macOS |
| [PuTTY](https://putty.org/) | Windows |
| [OpenSSH/PowerShell](https://learn.microsoft.com/en-us/windows-server/administration/openssh/openssh_install_firstuse?tabs=gui) | Windows |
| Terminal | Linux |
| OnDemand | Web browser |
| X2Go | macOS/Windows/Linux |
### Exercise. Open a Terminal on X2Go.
Open X2Go and create a `New Session`

Fill in the new session dialog with the information below

Notice that we are selecting a `Single application`. Click `Ok`, then double-click the `Applications` box and log in

## LMOD
<img src='https://i.imgur.com/TiNg8y8.png' width="25%" />
Lmod is a Lua-based module system that easily handles the `MODULEPATH` hierarchical problem.
Environment Modules provide a convenient way to dynamically change the users’ environment through modulefiles.
In a nutshell, we use Lmod to manage software that can be used on the VM as well as on the compute nodes. Software available as modules should be accessible on both resources.
This document only lists a few commands. For complete documentation click [here](https://lmod.readthedocs.io/en/latest/010_user.html).
:::info
:bulb: If you would like us to install a piece of software on our resources, please submit software installation requests to `bil-support@psc.edu`.
:::
### Listing available modules
To list all available software modules use the command
```bash
module avail
```

The command above will list all available software.
:::info
:envelope: Cannot find the software you need to explore the collections? Then please send a request to `bil-support@psc.edu`.
:::
### Listing specific modules
To list specific modules use the command
```bash
module avail <package-name>
```
For example,
```bash
module avail matlab
-------------- /bil/modulefiles ---------------
matlab/2019a matlab/2021a
```
### Listing useful information
To list useful info about a module use the command
```bash
module help <package-name>
```
For example,
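a minimal sketch using the `matlab` module listed above (the version string may differ on your system):
```bash
module help matlab/2021a
```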

### Loading modules
To load a module use the command
```bash
module load <package-name>
```
For example,
```bash
module load matlab/2021a
```
Running the command above will make the matlab binary available in the current session
```bash
which matlab
/bil/packages/matlab/R2021a/bin/matlab
```
In this example, you can simply type `matlab` to start MATLAB.

#### Loading a specific version of a module
There are times when there are multiple versions of the same software.
For example,
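to see which versions of a package are installed, as a sketch (using the `bioformats` module that is loaded below):
```bash
module avail bioformats
```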

If you wish to load a specific version of a package use the command
```bash
module load <package>/<version>
```
For example,
```bash
module load bioformats/6.4.0
```
### Listing loaded modules
To list the loaded modules use the command
```bash
module list
```
For example,

### Unloading modules
To unload a module use the command
```bash
module unload <package-name>
```
For example,
```bash
module unload matlab/2021a
```
### Using modules in scripts
When building scripts that use more than one tool available as a module, simply add a `module load` command for each tool, as in the sketch below.
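A minimal sketch of such a script; the module versions shown are the ones listed earlier in this document and may differ on your system:
```bash
#!/bin/bash
# load every module the script depends on before calling the tools they provide
module load matlab/2021a
module load bioformats/6.4.0

# ...the rest of the script can now use the tools made available by these modules
```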

## SLURM

[Slurm](https://slurm.schedmd.com/documentation.html) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
This document only lists a few commands. For complete documentation click [here](https://slurm.schedmd.com/documentation.html).
### sinfo
```bash
sinfo - View information about Slurm nodes and partitions.
SYNOPSIS
sinfo [OPTIONS...]
```
For example,
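a couple of typical invocations, as a sketch (the `compute` partition name appears later in this document):
```bash
# list all partitions and their node states
sinfo

# restrict the output to the compute partition
sinfo -p compute
```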

### squeue
```bash
squeue - view information about jobs located in the Slurm scheduling queue.
SYNOPSIS
squeue [OPTIONS...]
```
For example,
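a couple of typical invocations, as a sketch (`icaoberg` is the example username used throughout this document):
```bash
# show every job currently in the queue
squeue

# show only your own jobs
squeue -u icaoberg
```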

### scontrol
```bash
scontrol - view or modify Slurm configuration and state.
SYNOPSIS
scontrol [OPTIONS...] [COMMAND...]
```
As a regular user you can view information about the nodes and jobs but won't be able to modify them.
To view information about all the nodes, use the `scontrol show` command. For example,
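as a minimal sketch:
```bash
# display information about every node
scontrol show node
```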

To view information about a specific node, pass the node name. For example,
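assuming a node named `l001` (the node name that appears in the `squeue` output later in this document):
```bash
# display detailed information about a single node
scontrol show node l001
```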

Because there is only one partition, you can run `sinfo` or `sinfo -p compute` to gather basic information about it.
For example

### sbatch
```bash
sbatch - Submit a batch script to Slurm.
SYNOPSIS
sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]
```
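For example, a minimal sketch that submits a batch script named `script.sh` (such as the ones used in the exercises below) to the `compute` partition:
```bash
# request the compute partition and 8 GB of memory for the job
sbatch -p compute --mem=8Gb script.sh
```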
### srun
```bash
srun - Run parallel jobs
SYNOPSIS
srun [OPTIONS(0)... [executable(0) [args(0)...]]] [ : [OPTIONS(N)...]] executable(N) [args(N)...]
```
For example, to allocate an 8-hour interactive debugging session you can type
```bash
srun -p compute --mem=16Gb --time=08:00:00 --pty /bin/bash
```
#### interact
The `interact` command is an in-house script for starting interactive sessions.

* At the moment, there is only one partition, named `compute`, so running
```bash
interact
```
or
```bash
interact -p compute
```
gives the same result.
* To specify the amount of memory, use the option `--mem=<size>`. For example, `interact --mem=1Tb`.
* This is a shared partition. If you wish to get all the resources on a compute node, use the option `--nodes` (`-N`). For example, `interact -N 1`. Since this is a shared resource, please be considerate when requesting a full node.
### scancel
```bash
scancel - Used to signal jobs or job steps that are under the control of Slurm.
SYNOPSIS
scancel [OPTIONS...] [job_id[_array_id][.step_id]] [job_id[_array_id][.step_id]...]
```
* To cancel a specific job use the command `scancel <job_id>`. For example `scancel 00001`
* To cancel all your running jobs use the command `scancel -u <username>`. For example `scancel -u icaoberg`.
## Exercises
### Exercise. Opening an image.
Let's download an image
```
wget https://download.brainimagelibrary.org/33/9b/339bbe4c4d1bbe2f/20210520_CJLoreleai_Ex17_S2_SST_NPY_Hippo_Region010Overview.jpg
```
and open it with Eye of GNOME (`eog`)
```
eog 20210520_CJLoreleai_Ex17_S2_SST_NPY_Hippo_Region010Overview.jpg
```

### Exercise. Using Napari.
<img src="https://imgur.com/gkDCsMd.gif" /><br>
In terminal type
```bash
module load miniconda3
conda activate napari
napari
```

This should open Napari.
:::warning
:warning: Please be patient when starting applications with a user interface over X2Go.
:::

### Exercise. Extracting files.
Consider the following `script.sh`
```
#!/bin/bash
#dummy example
if [ ! -d images ]; then
mkdir images
fi
cd images
wget -nc --no-check-certificate http://murphylab.web.cmu.edu/data/Hela/3D/multitiff/cellorganizer_full_image_collection.zip
unzip cellorganizer_full_image_collection.zip
rm -f cellorganizer_full_image_collection.zip
```
We can, however, submit this script as a job to the scheduler to avoid waiting for it to finish interactively. You can do this by adding some `#SBATCH` headers to your script
```
#!/bin/bash
#SBATCH -p compute
#SBATCH -n 4
#SBATCH --mem=8Gb
#dummy example
if [ ! -d images ]; then
mkdir images
fi
cd images
wget -nc --no-check-certificate http://murphylab.web.cmu.edu/data/Hela/3D/multitiff/cellorganizer_full_image_collection.zip
unzip cellorganizer_full_image_collection.zip
rm -f cellorganizer_full_image_collection.zip
```
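With the `#SBATCH` headers in place, the script can be submitted directly; a minimal sketch, assuming it is saved as `script.sh` in the current directory:
```bash
# submit the script; the partition and resources are taken from the #SBATCH headers
sbatch script.sh
```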
However, for each job, the scheduler will create a temporary folder in `/local` (unique to each job). We can use it to store temporary files. Like any shared resource, this folder has some I/O constraints, i.e. it is not necessarily faster to read from or write to this folder.
```
#!/bin/bash
#SBATCH -p compute
#SBATCH -n 4
#SBATCH --mem=8Gb
set -x
#dummy example
#whenever you request a job a temp folder named /local (unique to each job) will be created automatically
DIRECTORY=/local/images2
if [ ! -d $DIRECTORY ]; then
mkdir $DIRECTORY
fi
cd $DIRECTORY
wget -nc --no-check-certificate http://murphylab.web.cmu.edu/data/Hela/3D/multitiff/cellorganizer_full_image_collection.zip && \
unzip cellorganizer_full_image_collection.zip && \
rm -f cellorganizer_full_image_collection.zip
mv $DIRECTORY ~/Desktop/zip
```
However, there also exists a shared space named `/scratch` that you can write to. As a user, you are responsible for deleting your files after using this space. I/O operations are generally faster on `/scratch` than on `/local`.
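As a sketch only, the previous job script could be pointed at the scratch space instead; the exact location used here, `/scratch/$USER`, is an assumption and may differ on this system:
```bash
#!/bin/bash
#SBATCH -p compute
#SBATCH -n 4
#SBATCH --mem=8Gb

# NOTE: /scratch/$USER is an assumed location; adjust to the actual scratch path
DIRECTORY=/scratch/$USER/images
mkdir -p "$DIRECTORY"
cd "$DIRECTORY" || exit 1

wget -nc --no-check-certificate http://murphylab.web.cmu.edu/data/Hela/3D/multitiff/cellorganizer_full_image_collection.zip
unzip cellorganizer_full_image_collection.zip
rm -f cellorganizer_full_image_collection.zip

# move the results to a permanent location and clean up after yourself
mv "$DIRECTORY" ~/Desktop/zip
```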
### Exercise. Computing checksums.
Consider the files from the previous exercise, which live in `~/Desktop/zip/images1/HeLa/3D`

There are many files in this folder that I want to compute checksums for

Computing checksums for each file could take some time. So consider this script
```
#!/bin/bash
DIRECTORY=/bil/users/icaoberg/Desktop/zip/images1/HeLa/3D
lfs find -type f $DIRECTORY | xargs -I {} md5sum {}
```
This script will compute MD5 checksums for every file in the folder. Running this script takes about
```
real 2m31.819s
user 0m11.418s
sys 1m4.443s
```
on the workshop VM.
Following the previous example, you can turn the script above into a SLURM script by adding headers
```
#!/bin/bash
#SBATCH -p compute
#SBATCH -n 2
#SBATCH --mem=4Gb
DIRECTORY=/bil/users/icaoberg/Desktop/zip/images1/HeLa/3D
lfs find -type f $DIRECTORY | xargs -I {} md5sum {}
```
However, the script above computes these checksums serially. Can we do it in parallel?
```
#!/bin/bash
#SBATCH -p compute
#SBATCH -n 20
#SBATCH --mem=10Gb
DIRECTORY=/bil/users/icaoberg/Desktop/zip/images1/HeLa/3D
lfs find -type f $DIRECTORY | xargs -I {} -P 20 md5sum {}
```
The script above now allocates 20 cores and asks `xargs` to run 20 of these checksums in parallel. Submitting this job to the scheduler makes it run faster.
```
real 0m29.984s
user 0m11.806s
sys 0m17.861s
```
It ran significantly faster. But what if I want to be greedy?
```
#!/bin/bash
#SBATCH -p compute
#SBATCH -N 1
#SBATCH -n 70
DIRECTORY=/bil/users/icaoberg/Desktop/zip/images1/HeLa/3D
lfs find -type f $DIRECTORY | xargs -I {} -P 70 md5sum {}
```
However, even though I allocated a full node for this task, it did not run much faster, probably due to I/O limits.
```
real 0m27.744s
user 0m11.726s
sys 0m15.678s
```
### Exercise. Contrast-stretching with ImageMagick.
This exercise ties together all the concepts discussed in this workshop.
Imagine we are interested in collection `84c11fe5e4550ca0`, which I found in the portal

:::info
:bulb: There is no need to download the data locally because the data is already available when you use our resources.
:::

*I can navigate to `/bil/data/84/c1/84c11fe5e4550ca0/` to see the contents of the collection.*
Unfortunately, it is difficult to visually inspect the images because they are not contrast-stretched.

*The images are not contrast stretched and cannot be visually inspected.*
Fortunately, there are tools like Fiji that can contrast-stretch the images. However, I want to do this in batch mode as a job, since this process can be automated.

[ImageMagick](https://imagemagick.org/index.php) is a robust library for image manipulation. The `convert` tool in this library has a [contrast-stretch](https://imagemagick.org/script/command-line-options.php#contrast-stretch) option.
The format is
```
convert <input-file> -contrast-stretch <black-point>% <output-file>
```
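As a quick sanity check, a hypothetical single-file example (the file names are placeholders):
```bash
# apply -contrast-stretch with a 15% argument, as in the script below
convert image.tif -contrast-stretch 15% image_stretched.tif
```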
Next, I will create a file called `script.sh` and place it in a folder on my Desktop.
```
#!/bin/bash
#this line is needed to be able to use modules on the compute nodes
source /etc/profile.d/modules.sh
#this command loads the ImageMagick library
module load ImageMagick/7.1.0-2
#this for loop finds all the images in the sample folder and contrast-stretches them
for FILE in /bil/data/84/c1/84c11fe5e4550ca0/SW170711-04A/*tif
do
  convert "$FILE" -contrast-stretch 15% "$(basename "$FILE")"
done
```
:::info
:bulb: For simplicity, you can find the script above in
```
/bil/workshops/2022/data_submission
```
To copy the script to your Desktop, run the following command in the terminal
```
cp /bil/workshops/2022/data_submission/script.sh ~/Desktop/
```
:::
Next I can submit my script using the command
```
sbatch -p compute --mem=64Gb script.sh
```
Since I am doing this serially I don't need much memory, but if I were to do this in parallel I might.
To monitor your job's progress, use the command `squeue -u <username>`. For example,
```
squeue -u icaoberg
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14243 compute script.s icaoberg R 15:34 1 l001
```
This produces contrast-stretched copies of the images that can now be visually inspected.

---
The Brain Image Library is supported by the National Institute of Mental Health of the National Institutes of Health under award number R24-MH-114793. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.