# Data Submission & Exploration Workshop
April 3 and 4, 2023 1-4pm ET
###### tags: `Spring 2023` `Brain Image Library` `workshop`
## GitHub
<img src="https://i.imgur.com/hMES3Xp.png" width="10%" /><br>
Presentations and scripts for both days of the workshop are available here
* https://github.com/brain-image-library/workshops
## Resources
As a member of the [Brain Image Library](https://www.brainimagelibrary.org/) project you have access to
* A login node (`login.brainimagelibrary.org`) for accessing the computing resources (not for running computations)
* A virtual machine (`workshop.brainimagelibrary.edu`) for applications that require visualization, such as Fiji and MATLAB
* An Open OnDemand instance, `ondemand.brainimagelibrary.org`, with access to JupyterLab and RStudio
* A set of compute nodes managed by a scheduler (SLURM)
* 8 large-memory compute nodes that can be accessed using SLURM from within the virtual machine
* 1 GPU-enabled compute node (access upon request)
:::info
:bulb: The VM `workshop.brainimagelibrary.edu` is generally online for use by members of the Brain Image Library. If a resource is unavailable, or becomes unavailable for updates or upgrades, you will receive a notification from the team.
:::
## Connecting to the `workshop` VM
Open a terminal and run the command
```bash
ssh <your-username>@workshop.brainimagelibrary.edu
```
For example,
```bash
ssh icaoberg@workshop.brainimagelibrary.edu
icaoberg@workshop.brainimagelibrary.edu's password:
Last login: Mon Jan 24 10:46:38 2023 from pool-71-162-2-190.pitbpa.fios.verizon.net
********************************* W A R N I N G ********************************
You have connected to workshop.bil.psc.edu
This computing resource is the property of the Pittsburgh Supercomputing Center.
It is for authorized use only. By using this system, all users acknowledge
notice of, and agree to comply with, PSC polices including the Resource Use
Policy, available at http://www.psc.edu/index.php/policies. Unauthorized or
improper use of this system may result in administrative disciplinary action,
civil charges/criminal penalties, and/or other sanctions as set forth in PSC
policies. By continuing to use this system you indicate your awareness of and
consent to these terms and conditions of use.
LOG OFF IMMEDIATELY if you do not agree to the conditions stated in this warning
Please contact support@psc.edu with any comments/concerns.
********************************* W A R N I N G ********************************
```
If you can see the message above when you connect, then you should be ready to start using the resources.
Tools you can use to connect to any of these resources:
| Tool name | Operating System |
| -------- | -------- |
| [iTerm2](https://iterm2.com) | macOS |
| [Terminal](https://support.apple.com/guide/terminal/welcome/mac) | macOS |
| [PuTTY](https://putty.org/) | Windows |
| [OpenSSH/PowerShell](https://learn.microsoft.com/en-us/windows-server/administration/openssh/openssh_install_firstuse?tabs=gui) | Windows |
| Terminal | Linux |
| OnDemand | Web browser |
| X2Go | macOS/Windows/Linux |
### Exercise. Open a Terminal on X2Go.
Open X2Go and create a `New Session`

Fill in the new session dialog with the information below

Notice that we are selecting a `Single application`. Click `Ok`, then double-click the `Applications` box and log in

## LMOD
<img src='https://i.imgur.com/TiNg8y8.png' width="25%" />
Lmod is a Lua-based module system that easily handles the `MODULEPATH` hierarchical problem.
Environment Modules provide a convenient way to dynamically change the users’ environment through modulefiles.
In a nutshell, we use Lmod to manage software that can be used on the VM as well as on the compute nodes. Software available as modules should be accessible on both resources.
This document only lists a few commands. For complete documentation click [here](https://lmod.readthedocs.io/en/latest/010_user.html).
:::info
:bulb: If you would like us to install a piece of software on our resources, please submit software installation requests to `bil-support@psc.edu`.
:::
### Listing available modules
To list all available software modules use the command
```bash
module avail
```

The command above will list all available software.
:::info
:envelope: Cannot find the software you need to explore the collections? Then please send a request to `bil-support@psc.edu`.
:::
### Listing specific modules
To list specific modules use the command
```bash
module avail <package-name>
```
For example,
```bash
module avail matlab
-------------- /bil/modulefiles ---------------
matlab/2019a matlab/2021a
```
### Listing useful information
To list useful info about a module use the command
```bash
module help <package-name>
```
For example,
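a minimal sketch using the `matlab` module listed above (the version string may differ on your system):
```bash
module help matlab/2021a
```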

### Loading modules
To load a module use the command
```bash
module load <package-name>
```
For example,
```bash
module load matlab/2021a
```
Running the command above will make the matlab binary available in the current session
```bash
which matlab
/bil/packages/matlab/R2021a/bin/matlab
```
In this example, you can simply type `matlab` to start MATLAB.

#### Loading a specific version of a module
There are times when there are multiple versions of the same software.
For example,
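to see which versions of a package are installed, as a sketch (using the `bioformats` module that is loaded below):
```bash
module avail bioformats
```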

If you wish to load a specific version of a package use the command
```bash
module load <package>/<version>
```
For example,
```bash
module load bioformats/6.4.0
```
### Listing loaded modules
To list the loaded modules use the command
```bash
module list
```
For example,

### Unloading modules
To unload a module use the command
```bash
module unload <package-name>
```
For example,
```bash
module unload matlab/2021a
```
### Using modules in scripts
When building scripts that use more than one tool available as a module, simply add a `module load` command for each tool, as in the sketch below.
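A minimal sketch of such a script; the module versions shown are the ones listed earlier in this document and may differ on your system:
```bash
#!/bin/bash
# load every module the script depends on before calling the tools they provide
module load matlab/2021a
module load bioformats/6.4.0

# ...the rest of the script can now use the tools made available by these modules
```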

## SLURM

[Slurm](https://slurm.schedmd.com/documentation.html) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
This document only lists a few commands. For complete documentation click [here](https://slurm.schedmd.com/documentation.html).
### sinfo
```bash
sinfo - View information about Slurm nodes and partitions.
SYNOPSIS
sinfo [OPTIONS...]
```
For example,
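a couple of typical invocations, as a sketch (the `compute` partition name appears later in this document):
```bash
# list all partitions and their node states
sinfo

# restrict the output to the compute partition
sinfo -p compute
```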

### squeue
```bash
squeue - view information about jobs located in the Slurm scheduling queue.
SYNOPSIS
squeue [OPTIONS...]
```
For example,
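a couple of typical invocations, as a sketch (`icaoberg` is the example username used throughout this document):
```bash
# show every job currently in the queue
squeue

# show only your own jobs
squeue -u icaoberg
```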

### scontrol
```bash
scontrol - view or modify Slurm configuration and state.
SYNOPSIS
scontrol [OPTIONS...] [COMMAND...]
```
As a regular user you can view information about the nodes and jobs but won't be able to modify them.
To view information about all the nodes, use the `scontrol show` command. For example,
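as a minimal sketch:
```bash
# display information about every node
scontrol show node
```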

To view information about a specific node, pass the node name. For example,
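assuming a node named `l001` (the node name that appears in the `squeue` output later in this document):
```bash
# display detailed information about a single node
scontrol show node l001
```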

Because there is only one partition, you can run `sinfo` or `sinfo -p compute` to gather basic information about it.
For example

### sbatch
```bash
sbatch - Submit a batch script to Slurm.
SYNOPSIS
sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]
```
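For example, a minimal sketch that submits a batch script named `script.sh` (such as the ones used in the exercises below) to the `compute` partition:
```bash
# request the compute partition and 8 GB of memory for the job
sbatch -p compute --mem=8Gb script.sh
```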
### srun
```bash
srun - Run parallel jobs
SYNOPSIS
srun [OPTIONS(0)... [executable(0) [args(0)...]]] [ : [OPTIONS(N)...]] executable(N) [args(N)...]
```
For example, to allocate an 8-hour interactive debugging session you can type
```bash
srun -p compute --mem=16Gb --time=08:00:00 --pty /bin/bash
```
#### interact
The `interact` command is an in-house script for starting interactive sessions.

* At the moment, there is only one partition, named `compute`, so running
```bash
interact
```
or
```bash
interact -p compute
```
gives the same result.
* To specify the amount of memory, use the option `--mem=<size>`. For example, `interact --mem=1Tb`.
* This is a shared partition. If you wish to get all the resources on a compute node, use the option `--nodes` (`-N`). For example, `interact -N 1`. Since this is a shared resource, please be considerate when requesting a full node.
### scancel
```bash
scancel - Used to signal jobs or job steps that are under the control of Slurm.
SYNOPSIS
scancel [OPTIONS...] [job_id[_array_id][.step_id]] [job_id[_array_id][.step_id]...]
```
* To cancel a specific job use the command `scancel <job_id>`. For example `scancel 00001`
* To cancel all your running jobs use the command `scancel -u <username>`. For example `scancel -u icaoberg`.
## Exercises
### Exercise. Opening an image.
Let's download an image
```
wget https://download.brainimagelibrary.org/33/9b/339bbe4c4d1bbe2f/20210520_CJLoreleai_Ex17_S2_SST_NPY_Hippo_Region010Overview.jpg
```
and open it with Eye of GNOME (`eog`)
```
eog 20210520_CJLoreleai_Ex17_S2_SST_NPY_Hippo_Region010Overview.jpg
```

### Exercise. Using Napari.
<img src="https://imgur.com/gkDCsMd.gif" /><br>
In terminal type
```bash
module load miniconda3
conda activate napari
napari
```

This should open Napari.
:::warning
:warning: Please be patient when starting applications with a user interface over X2Go.
:::

### Exercise. Extracting files.
Consider the following `script.sh`
```
#!/bin/bash
#dummy example
if [ ! -d images ]; then
mkdir images
fi
cd images
wget -nc --no-check-certificate http://murphylab.web.cmu.edu/data/Hela/3D/multitiff/cellorganizer_full_image_collection.zip
unzip cellorganizer_full_image_collection.zip
rm -f cellorganizer_full_image_collection.zip
```
We can, however, submit this script as a job to the scheduler to avoid waiting for it to finish interactively. You can do this by adding some `#SBATCH` headers to your script
```
#!/bin/bash
#SBATCH -p compute
#SBATCH -n 4
#SBATCH --mem=8Gb
#dummy example
if [ ! -d images ]; then
mkdir images
fi
cd images
wget -nc --no-check-certificate http://murphylab.web.cmu.edu/data/Hela/3D/multitiff/cellorganizer_full_image_collection.zip
unzip cellorganizer_full_image_collection.zip
rm -f cellorganizer_full_image_collection.zip
```
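With the `#SBATCH` headers in place, the script can be submitted directly; a minimal sketch, assuming it is saved as `script.sh` in the current directory:
```bash
# submit the script; the partition and resources are taken from the #SBATCH headers
sbatch script.sh
```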
However, for each job, the scheduler will create a temporary folder in `/local` (unique to each job). We can use it to store temporary files. Like any shared resource, this folder has some I/O constraints, i.e. it is not necessarily faster to read from or write to this folder.
```
#!/bin/bash
#SBATCH -p compute
#SBATCH -n 4
#SBATCH --mem=8Gb
set -x
#dummy example
#whenever you request a job a temp folder named /local (unique to each job) will be created automatically
DIRECTORY=/local/images2
if [ ! -d $DIRECTORY ]; then
mkdir $DIRECTORY
fi
cd $DIRECTORY
wget -nc --no-check-certificate http://murphylab.web.cmu.edu/data/Hela/3D/multitiff/cellorganizer_full_image_collection.zip && \
unzip cellorganizer_full_image_collection.zip && \
rm -f cellorganizer_full_image_collection.zip
mv $DIRECTORY ~/Desktop/zip
```
However, there also exists a shared space named `/scratch` that you can write to. As a user, you are responsible for deleting your files after using this space. I/O operations are generally faster on `/scratch` than on `/local`.
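As a sketch only, the previous job script could be pointed at the scratch space instead; the exact location used here, `/scratch/$USER`, is an assumption and may differ on this system:
```bash
#!/bin/bash
#SBATCH -p compute
#SBATCH -n 4
#SBATCH --mem=8Gb

# NOTE: /scratch/$USER is an assumed location; adjust to the actual scratch path
DIRECTORY=/scratch/$USER/images
mkdir -p "$DIRECTORY"
cd "$DIRECTORY" || exit 1

wget -nc --no-check-certificate http://murphylab.web.cmu.edu/data/Hela/3D/multitiff/cellorganizer_full_image_collection.zip
unzip cellorganizer_full_image_collection.zip
rm -f cellorganizer_full_image_collection.zip

# move the results to a permanent location and clean up after yourself
mv "$DIRECTORY" ~/Desktop/zip
```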
### Exercise. Computing checksums.
Consider the files from the previous exercise, which live in `~/Desktop/zip/images1/HeLa/3D`

There are many files in this folder that I want to compute checksums for

Computing checksums for each file could take some time. So consider this script
```
#!/bin/bash
DIRECTORY=/bil/users/icaoberg/Desktop/zip/images1/HeLa/3D
lfs find -type f $DIRECTORY | xargs -I {} md5sum {}
```
This script will compute MD5 checksums for every file in the folder. Running this script takes about
```
real 2m31.819s
user 0m11.418s
sys 1m4.443s
```
on the workshop VM.
Following the previous example, you can turn the script above into a SLURM script by adding headers
```
#!/bin/bash
#SBATCH -p compute
#SBATCH -n 2
#SBATCH --mem=4Gb
DIRECTORY=/bil/users/icaoberg/Desktop/zip/images1/HeLa/3D
lfs find -type f $DIRECTORY | xargs -I {} md5sum {}
```
However, the script above computes these checksums serially. Can we do it in parallel?
```
#!/bin/bash
#SBATCH -p compute
#SBATCH -n 20
#SBATCH --mem=10Gb
DIRECTORY=/bil/users/icaoberg/Desktop/zip/images1/HeLa/3D
lfs find -type f $DIRECTORY | xargs -I {} -P 20 md5sum {}
```
The script above now allocates 20 cores and asks `xargs` to run 20 of these checksums in parallel. Submitting this job to the scheduler makes it run faster.
```
real 0m29.984s
user 0m11.806s
sys 0m17.861s
```
It ran significantly faster. But what if I want to be greedy?
```
#!/bin/bash
#SBATCH -p compute
#SBATCH -N 1
#SBATCH -n 70
DIRECTORY=/bil/users/icaoberg/Desktop/zip/images1/HeLa/3D
lfs find -type f $DIRECTORY | xargs -I {} -P 70 md5sum {}
```
However, even though I allocated a full node for this task, it did not run much faster, probably due to I/O limits.
```
real 0m27.744s
user 0m11.726s
sys 0m15.678s
```
### Exercise. Contrast-stretching with ImageMagick.
This exercise ties together all the concepts discussed in this workshop.
Imagine we are interested in collection `84c11fe5e4550ca0`, which I found in the portal

:::info
:bulb: There is no need to download the data locally because the data is already available when you use our resources.
:::

*I can navigate to `/bil/data/84/c1/84c11fe5e4550ca0/` to see the contents of the collection.*
Unfortunately, it is difficult to visually inspect the images because they are not contrast-stretched.

*The images are not contrast stretched and cannot be visually inspected.*
Fortunately, there are tools like Fiji that can contrast-stretch the images. However, I want to do this in batch mode as a job, since this process can be automated.

[ImageMagick](https://imagemagick.org/index.php) is a robust library for image manipulation. The `convert` tool in this library has a [contrast-stretch](https://imagemagick.org/script/command-line-options.php#contrast-stretch) option.
The format is
```
convert <input-file> -contrast-stretch <black-point>% <output-file>
```
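As a quick sanity check, a hypothetical single-file example (the file names are placeholders):
```bash
# apply -contrast-stretch with a 15% argument, as in the script below
convert image.tif -contrast-stretch 15% image_stretched.tif
```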
Next, I will create a file called `script.sh` and place it in a folder on my Desktop.
```
#!/bin/bash
#this line is needed to be able to use modules on the compute nodes
source /etc/profile.d/modules.sh
#this command loads the ImageMagick library
module load ImageMagick/7.1.0-2
#this for loop finds all the images in the sample folder and contrast-stretches them
for FILE in /bil/data/84/c1/84c11fe5e4550ca0/SW170711-04A/*tif
do
  convert "$FILE" -contrast-stretch 15% "$(basename "$FILE")"
done
```
:::info
:bulb: For simplicity, you can find the script above in
```
/bil/workshops/2022/data_submission
```
To copy the script to your Desktop, run the following command in the terminal
```
cp /bil/workshops/2022/data_submission/script.sh ~/Desktop/
```
:::
Next I can submit my script using the command
```
sbatch -p compute --mem=64Gb script.sh
```
Since I am doing this serially I don't need much memory, but if I were to do this in parallel I might.
To monitor your job's progress, use the command `squeue -u <username>`. For example,
```
squeue -u icaoberg
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14243 compute script.s icaoberg R 15:34 1 l001
```
This produces contrast-stretched copies of the images that can now be visually inspected.

---
The Brain Image Library is supported by the National Institute of Mental Health of the National Institutes of Health under award number R24-MH-114793. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.