# Baskerville Virtual Training

<center><img src="https://hackmd.io/_uploads/BJhCpazt2.png" alt="baskerville-logo" width="200"/></center>

## Month of July Online session

Every Friday in July, from 13:00 to 15:00, we will run a Baskerville virtual session. We will go through interactive examples, with tips for optimum usage from the Baskerville team. The key points covered in each session are listed below. If you would like to attend these sessions, please add your **Baskerville username** to the attendance list at the bottom of the page.

## Introduction and Navigating Baskerville

:calendar: **05/07/2024**
:watch: 13:00 - 15:00 with a questions and answer session from 15:00 - 16:00

### Areas covered

- Baskerville Access
- Baskerville Portal
- Baskerville Modules
- Globus and our file system

### Questions

- [name=Dimitrios_Bellos] For Rosalind Franklin users: if you want to learn more about working in a Linux OS, we can offer access to PluralSight, which has courses you can learn from. Here is the link to the Linux Fundamentals path (https://app.pluralsight.com/paths/skill/linux-fundamentals-1)
- [name=David_Llewellyn-Jones] `my_quota` gives free space in your home directory, but usually I'm working in a project directory. Is there an equivalent for finding free space in a project directory?
- [name=James_Allsopp] Go to admin.baskerville.ac.uk, then Projects, and click on your project. This will tell you your quota and the amount the project has used.
- [name=Gavin_Yearwood] `/proc` provides information about your current Linux system; there's also `meminfo`, for instance. Read more here (https://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html)
- [name=Dimitrios_Bellos] Can all users access <https://admin.baskerville.ac.uk/>? Do only the representatives of the institutions have further permissions, e.g. to change the storage quota of a project or add users to a project?
- [name=Simon_Hartey] Users can view their own projects and see the storage quota, but cannot change the storage quotas. A Project PI/Project Manager will be able to add and remove people from their own projects.
- [name=Dimitrios_Bellos] Ah, I see. I guess only the Project PI/Project Managers can do these things (change the storage quota of a project, add and remove users). So for these operations users will have to contact their institution's representative(s).
- [name=James_Allsopp] You can see all available modules here https://apps.baskerville.ac.uk/
- [name=Nathan_Simpson] Not sure when the right time for questions is, apologies if this is off-topic: what's the recommended way to open a persistent SSH tunnel? I've had to re-enter my OTP due to inactivity during this talk!
- [name=James_Allsopp] The answer is tmux! That will make your connection more stable and preserve state if you do lose it.
- [name=Gavin_Yearwood] I also like to use `watch -n 30 echo "waiting"` if I have an idle window.
- [name=Gavin_Yearwood] Globus can be accessed here: https://app.globus.org/
- [name=David_Llewellyn-Jones] If I want to transfer material from my local machine using Globus, is it necessary to install Globus Connect?
- [name=Dimitrios_Bellos] Yes, you need to install Globus Connect Personal https://docs.globus.org/globus-connect-personal/install/windows/ but, as step 3b in the link shows, you do not need administrative permissions to install it.
- [name=Dimitrios_Bellos] Also, at this year's GlobusWorld conference the Globus team offered a pre-conference session that was free and open to all (https://www.globusworld.org/conf/program), with training on how to use Globus. Institution representatives might want to point their users to this pre-conference session in the future. GlobusWorld is typically held every year in May. They also have open calls for submissions, if you want to present work you have done with the help of Globus and how you used the tools Globus offers (Globus Compute, Globus SDK, Globus Flows, Globus Search, etc.) to take your work to the next level.
- [name=David_Llewellyn-Jones] Just a note about Globus for Turing users: we usually log in using SSO and SSH keys, so using a password can be uncommon. I found this confusing at first: do I have a password? But yes, we do, and you'll need it if you want to connect your Baskerville account with your Globus account. The docs explain how to [reset your password](https://docs.baskerville.ac.uk/logging-on/#login-nodes) in case you forget it.
- [name=David_Llewellyn-Jones] Some institutions have a subscription for Globus, but it's also possible to log in using an individual account. From a user's perspective, are there any benefits to having an institutional subscription over logging in as an individual user? Would either have any impact on functionality for using Globus on Baskerville?
- [name=Dimitrios_Bellos] When users log in to Globus, they can log in with their user account and link identities to their existing account. By doing this, and also using your account to ask to join the institution's subscription (i.e. asking to be invited to the institution's Globus Plus group), you get access to Globus Flows that may be shared with the whole institution, and you may be able to see certain Globus groups of the institution and ask to be added to them. Regarding the impact question: there is no impact, and transfer speeds are not affected by the subscription level. Access to more advanced Globus features is unlocked with a Globus subscription.
- [name=David_Llewellyn-Jones] Thanks Dimitrios, that's really helpful!
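
The idle-timeout question above (answered with tmux and a periodic `watch`) can also be approached from the client side. The following is a minimal sketch rather than an official Baskerville recommendation: it assumes an OpenSSH client, and the function name `bask_keepalive_stanza` and the `Host` alias `baskerville` are hypothetical names chosen for illustration. `ServerAliveInterval` and `ServerAliveCountMax` are standard OpenSSH client options; whether they are enough to avoid re-entering an OTP depends on how the login nodes enforce their timeout, which is not stated in these notes.

```shell
# Hypothetical helper: print an SSH client config stanza that sends a
# keep-alive message every 30 seconds, similar in spirit to running
# `watch -n 30 echo "waiting"` in an idle window.
bask_keepalive_stanza() {
    cat <<'EOF'
Host baskerville
    HostName login.baskerville.ac.uk
    ServerAliveInterval 30
    ServerAliveCountMax 4
EOF
}

# Preview the stanza; append it to ~/.ssh/config yourself if it looks right.
bask_keepalive_stanza
```

If the stanza is appended (for example with `bask_keepalive_stanza >> ~/.ssh/config`, plus a `User` line if your Baskerville username differs from your local one), `ssh baskerville` connects with keep-alives enabled. On the login node, `tmux new-session -A -s work` then attaches to a session named `work`, creating it if it does not already exist, so the shell state survives a dropped connection.
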

## Baskerville Jobs

:calendar: **12/07/2024**
:watch: 13:00 - 15:00 with a questions and answer session from 15:00 - 16:00

### Areas covered

- Batch jobs
- SLURM
- Array jobs
- Dependent jobs
- Interactive Jobs
- `nvidia-smi`

### Resources

https://github.com/baskerville-hpc/basicSlurmScripts
https://github.com/hibagus/CUDA_Bench

- [name=SimonH] Slurm
    - [Slurm Docs](https://slurm.schedmd.com/documentation.html)
    - [Software Carpentry Linux Shell Basics](https://swcarpentry.github.io/shell-novice/)
    - [Python Modules Available on Baskerville](https://apps.baskerville.ac.uk/search?search=python)
    - Baskerville currently has three login nodes named `bask-pg-login01`, `bask-pg-login02` and `bask-pg-login03`
    - [Baskerville Login Nodes](https://docs.baskerville.ac.uk/logging-on/#login-nodes)
- [name=DimitriosBellos] Here is the link to the [Basic Slurm Scripts GitHub repo](https://github.com/baskerville-hpc/basicSlurmScripts)

### Questions

- [name=DimitriosBellos] Regarding the time argument, what is the recommended value when running something for the first time? Overestimate, then see how much time it took and try to set a better estimate for the time argument?
- [name=GavinYearwood] It depends. I would normally go under, as a shorter job will start earlier, and see how much it has completed. It is very job-dependent: I would start with 10 minutes to confirm it will run, then 1 - 10 hours to see how far it will progress.
- [name=DimitriosBellos] That makes sense, especially the first 10-minute run, to at least check that the `#SBATCH` and `module load` commands are correct. In general we recommend that our users overestimate, so that a run is not wasted and has to be repeated. Ideally, if they have run their process elsewhere, they can extrapolate how much time it may take.
- [name=GavinYearwood] That is fine as long as they overestimate within reason. Going over by a couple of hours is fine, but a couple of days will probably not be worth it due to the increased time in the queue.
- [name=DimitriosBellos] 👍
- [name=DimitriosBellos] Use the `my_baskerville` command to find which projects you are a member of. To access the project storage directory run `cd /bask/projects/<first-letter-of-project-name>/<project-name>` (e.g. `/bask/projects/m/my-project`). By default you log in to your home directory. You need to navigate to the project directory in order to start working and/or downloading data. Home directory storage is very limited, and most of your data and code is expected to live in a directory within a project directory of which you are a member. Read more here https://docs.baskerville.ac.uk/storage/
- [name=GavinYearwood] Use `scontrol show res` to see available reservations on Baskerville.
- [name=DavidLlewellyn-Jones] Is there any way to stipulate a directory relative to the original location of the batch file, rather than where it's run from? (I understood that there isn't, but just checking)
- [name=GavinYearwood] You can set the environment variable `$WORKDIR`; I think that will enable it to work from a different location than the batch file.
- [name=SimonH] The Slurm variable for the path of the job submission is `$SLURM_SUBMIT_DIR`.
- [name=DavidLlewellyn-Jones] Thank you both, but I think I'm after the "opposite" of `$SLURM_SUBMIT_DIR`. My use case is developing tutorials, where the batch file has to access other files relative to itself and I don't want to have to worry about where the user runs the script from. But I think adding instructions to explicitly set a `$WORKDIR`, or stating that execution should start from a specific directory, are the only/best ways to go.
- [name=SimonH] The current working directory is the calling process's working directory unless the `--chdir` argument is passed, which will override the current working directory.
- [name=GavinYearwood] We might be able to go through your example in the Q&A section.
- [name=DavidLlewellyn-Jones] Okay; that'd be great. Your comments here are really helpful and I think I've got a better idea about things now. Thank you!
- [name=zxy239] How do I use cross-node cluster computing?
- [name=GavinYearwood] To use multi-node jobs, your software/code must be designed to run on multiple nodes, typically using something like MPI (Message Passing Interface).
- [name=zxy239] Thanks, I will check. Single node with multi-GPU works for my code. So only requesting multiple nodes in Slurm is not enough, and the code needs to be changed?
- [name=GavinYearwood] Possibly. We can either talk about this during the Q&A section, or you can raise a ticket using an Other BEAR request as an advice session and we can examine your code.
- [name=SimonH] Software like PyTorch has it enabled. See [PyTorch](https://pytorch.org/docs/stable/distributed.html)
- [name=DimitriosBellos] For PyTorch, read <https://gist.github.com/TengdaHan/1dd10d335c7ca6f13810fff41e809904>. You need to write your code using the DistributedDataParallel module in PyTorch (<https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>), and when loading Baskerville modules you need to load the NCCL Baskerville module.
- [name=zxy239] Thanks, I will check one by one!
- [name=DimitriosBellos] The a100_40 partition is the default one, correct? Meaning that if a constraint argument is not set, the job will be set to use A100 40GB VRAM GPUs, i.e. it will wait even if there are free A100 80GB VRAM GPUs?
- [name=GavinYearwood] Not always, but the A100-80s are typically in more demand and there might be a queue for them even if they appear free. For example, one A100-80 might be free, but someone has asked for two A100-80 GPUs with the constraint argument. It appears free, but it is scheduled for that user; unless you also explicitly request A100-80, the scheduler will probably make you wait if you have no constraint. If no one is requesting the A100-80s then you are almost as likely to get one as an A100-40. Also, if you run lots of short jobs (10 minutes to an hour) you might get fitted in to these A100-80s.
- [name=SimonH] More info at:
    - [tmux cheatsheet](https://tmuxcheatsheet.com/)
    - [Baskerville Docs Interactive Jobs](https://docs.baskerville.ac.uk/interactive-jobs/)
- [name=DavidLlewellynJones] Do interactive jobs have different priority compared to `sbatch`-submitted jobs?
- [name=James Allsopp] Should be the same queue.
- [name=James Allsopp] You can use proxy jump in the SSH config to always go to the same login node for tmux sessions. This is the equivalent of logging into any node, then logging into the node you want.
- [name=SimonH] The QoS and Account are used to determine priority for a job, so with the same QoS and Account an interactive job would have the same priority. Usually interactive jobs will not be as large in either CPU or GPU, so would be easier to schedule.
- [name=DimitriosBellos] Also, for interactive terminal jobs (using `srun`) you can use X11 forwarding and display GUIs. Simply ssh to Baskerville using the `-X` argument like this: `ssh -X <your-Baskerville-ID>@login.baskerville.ac.uk` and when you start your interactive job add the `--x11` argument like this: `srun --account _projectname_ --qos _qos_ --gpus _count_ --time 5 --export=USER,HOME,PATH,TERM --pty --x11 /bin/bash`. If you do both, then when you execute a command that spins up a GUI, the GUI will open on your machine via X11 forwarding. Warning: it might not be very responsive, as the signal hops from the compute node to the login node to your machine.
- [name=DLJ] Super nice, thank you! Maybe worth adding that you also need an X11 server running locally (e.g. X410 on Windows or XQuartz on macOS)
- [name=DimitriosBellos] Yes, for Windows and Mac this is needed too. For Linux, no. But software that runs on Baskerville is software that runs on Linux (minus containers). So in most cases people probably already have access to a Linux machine, and from there they can ssh to Baskerville and use GUIs easily via X11.

## Baskerville Self installs

:calendar: **19/07/2024**
:watch: 13:00 - 15:00 with a questions and answer session from 15:00 - 16:00

[**Virtual Training Zoom Link** :link:](https://bham-ac-uk.zoom.us/j/88336075146?pwd=dOU5P7D47Yk5dDGEd9tot3IndykoyY.1) :arrow_left: Here

### Areas covered

- Pip
- Conda
- Containers

### Resources

[Interactive Jobs documentation](https://docs.baskerville.ac.uk/interactive-jobs/)
[Software (self-)Installation documentation](https://docs.baskerville.ac.uk/self-install/)
[User Conda Environments Documentation](https://docs.baskerville.ac.uk/portal/jupyter-conda/)
[Containerisation with Apptainer documentation](https://docs.baskerville.ac.uk/containerisation/)

### Questions

Any questions, please ask below:

- [name=DimitriosBellos] As far as I know, module commands are for display purposes on the login nodes. To truly load and use modules you have to do this on a compute node.
- [name=GavinYearwood] That is correct, anything that needs the GPUs needs to be loaded on a compute node.
- [name=DimitriosBellos] For RFI users, if you want to start e.g. a 30-minute interactive terminal job with 1 GPU, here is the command to run: `srun --account jgms5830-rfi-train --reservation=jgms5830-rfi-train-july19 --qos rfi --ntasks=1 --cpus-per-task=8 --gpus-per-task=1 --time 30 --export=USER,HOME,PATH,TERM --pty /bin/bash` This in essence allows you to "ssh" to a compute node with 1 GPU allocated (and 8 logical CPU cores) for 30 minutes. Please also `cd /bask/projects/j/jgms5830-rfi-train` and then `mkdir <your-name>` and `cd <your-name>` to create code within your project directory.
- [name=David L-J] Just in case it's helpful, on my system here, before I've installed any version of Python, I have to use `pip2` or `pip3` rather than `pip`.
- [name=DimitriosBellos] To see available versions of a module you may also type `module load <name-of-module>/` and press Tab twice, and it will print the available versions.
- [name=David L-J] Is it always safe to delete `~/.cache` in full?
- [name=GavinYearwood] There should be nothing too vital in this area. It might mean that installing something again takes longer, but you should not lose anything.
- [name=DimitriosBellos] For Miniconda and Miniforge, as mentioned here <https://docs.baskerville.ac.uk/portal/jupyter-conda/#conda-cache>, you can also change the cache directory to `/tmp` by executing the command `export CONDA_PKGS_DIRS=/tmp`. Do this before the `eval` command that activates either mamba or conda.
- [name=David L-J] Is there an equivalent for pip? Do you recommend creating a symlink from `~/.cache` into a project folder, say?
- [name=TomNeep] See also [here](https://pip.pypa.io/en/stable/cli/pip_cache/) for commands on inspecting/clearing your pip cache
- [name=GavinYearwood] I think both the symlink and changing your cache will work. The symlink may make your code easier to share, but you have the limit of your home space to consider.
- [name=DimitriosBellos] The commands to activate mamba from within the Miniforge module are `eval "$(${EBROOTMINIFORGE3}/bin/conda shell.bash hook)"` and `source "${EBROOTMINIFORGE3}/etc/profile.d/mamba.sh"`. I mention these because they are not in the [User Conda Environments Documentation](https://docs.baskerville.ac.uk/portal/jupyter-conda/) for the time being.
- [name=GavinYearwood] We will be updating our documentation soon about using Miniforge and Mamba
- [name=DimitriosBellos] I am getting the following error when I try to install Python in the mamba environment. Why?
- [name=GavinYearwood] What were all the commands you used?
- [name=DimitriosBellos] I solved it. In Miniforge, `export CONDA_PKGS_DIRS=/tmp` is not optional: you have to run it, otherwise a "no plugins" error shows up.
- :+1:
- [name=DimitriosBellos] For Apptainer, you do not need to `module load` it.
- [name=DimitriosBellos] Also, with Apptainer, if you pass not a `.sif` file but e.g. a Docker image (`docker://`) to the `apptainer run` or `apptainer exec` command, it will build the Apptainer container in the cache dir. As described here <https://docs.baskerville.ac.uk/containerisation/#apptainers-cache-and-temporary-directories>, using the command `export APPTAINER_CACHEDIR=/path/to/preferred/cache/directory` you can set this directory to a location, preferably within one of your project directories, so you will not fill your HOME dir.
- [name=TomNeep] I think the docs might be outdated here? I think `APPTAINER_CACHEDIR` might be set to `/tmp/container_data` by default?
- [name=GavinYearwood] We will update the docs with the new location. I always like to check my environment variables just to be sure, using `printenv | grep _name_`
- [name=DimitriosBellos] Thank you. I just mentioned this because some people may, instead of creating the `.sif` files and then using them to run containers, go directly to building and running containers. For instance, the two commands `apptainer pull docker://python:3.8.11` and `apptainer exec --nv python_3.8.11.sif nvidia-smi` can be executed as a single command: `apptainer exec --nv docker://python:3.8.11 nvidia-smi`. Having the cache pointing to the correct location is then important.

## RELION and Doppio

:calendar: **26/07/2024**
:watch: 13:00 - 15:00 with a questions and answer session from 15:00 - 16:00

### Areas covered

- RELION
    - GUI
    - Job options
    - Batch script
    - Efficient use
- Doppio
    - Explanation about Doppio
    - How to access and use Doppio

### Resources

- <https://docs.baskerville.ac.uk/support/#relion>
- <https://relion.readthedocs.io/en/release-4.0/SPA_tutorial/Introduction.html>
- <https://www.ssh.com/academy/ssh/tunneling-example>

### Questions

- [name=SimonH] See Apps page [RELION](https://apps.baskerville.ac.uk/applications/RELION/)
- [name=SimonH] [RELION 4.0](https://relion.readthedocs.io/en/release-4.0/)
- [name=SimonH] Examples use Baskerville Module [RELION 4.0.0](https://apps.baskerville.ac.uk/applications/live/RELION/4.0.0-foss-2021a-CUDA-11.3.1/)
- [name=SimonH] Data Folder <code>$BASK_APPS_DATA/RELION/Particle_Tutorial/</code>
- [name=DimitriosBellos] The number of hours in the Baskerville Portal applies only to the RELION GUI. Any jobs submitted from within the GUI (which may take many hours to complete) will not die after you close the GUI. You can restart the GUI in the future, pointing it at the same RELION project directory. The RELION GUI will read all available files and it will appear as if it was never closed.
- [name=RhiannaRowland] I understand RELION 5 is still in beta testing; would it be possible to get this onto Baskerville anyway?
- [name=DimitriosBellos] Once it becomes stable, i.e. available as a release here <https://github.com/3dem/relion/releases>, the process of adding it to Baskerville will begin.
- [name=SimonH] [Doppio module](https://apps.baskerville.ac.uk/applications/doppio/)
- [name=SimonH] [Wikipedia on port forwarding](https://en.wikipedia.org/wiki/Port_forwarding) for information
- [name=SimonH] Previous video on port forwarding on Baskerville: [2023-07-26-remote-drop-in](https://github.com/baskerville-hpc/2023-07-26-remote-drop-in/blob/main/presentations/04a-vscode-remote-development/video/05-port-forwarding-tensorboard.mp4)
- Are there any written instructions on how to launch Doppio on Baskerville?
- It's ok now, I see they're on the slides, thanks
- [name=JohnC] Please can you remind us how to do this?
- From my notes, I have the following:
    - `module load RELION`
    - `module load doppio-web`
    - `doppio-web`
    - Note down the port forwarding address 127.0.0.1:nnnnn
    - New terminal: `ssh -L nnnnn:127.0.0.1:nnnnn login.baskerville.ac.uk`
    - `squeue` to identify the compute node
    - But then what? What do we do once we know the compute node?
    - `ssh -L nnnnn:compute-node:nnnnn login.baskerville.ac.uk` appears to work, but I get a "Cannot assign requested address" error, and then if I try to connect to the port forwarding address in a browser, I get "channel 3: open failed: connect failed: Connection Refused". Please advise?
- [name=RhiannaRowland] Is it still the case that the Import job must be done through the RELION GUI rather than the DOPPIO interface?
- [name=SimonH] At present, yes.
- [name=RhiannaRowland] This is the beta version of DOPPIO; are there plans to install the 1.0.1 version? I think it has some bug fixes/new installs like morhen etc.
- [name=GavinYearwood] Yes, we will install it in the future. I have not yet because, when testing, I did not find any significant differences.

# Attendance List

Add your **Baskerville username** to the list below:

- yearwoog
- allsoppj
- cocv8214
- fspo1218
- diic5302
- cxx075
- sxz363
- fscd8270
- tgir5769
- nfhs4095
- jros3986
- qvgx3109
- cobf3567
- lkuu3118
- rybf4168
- butchena
- sxz115
- hartleys
- ovau2564
- zxy239
- jnxh9672
- xrcc5170
- lksm0291
- dzdf7145
- wixi8809
- gemk0317
- urkq3451
- natm2376
- gigg5481
- pjqr0607
- lugh0174
- lzsi3389
- umgv8106
- vehn7110
- lybl0052
- dyyx0410
- tydk9687
- xgmf7182
- zyna5591
- lowz2179
- ewof8133
- crdt2323
- unzi2487
- zowl114
- otau1500
- fedu7800
- zowl1145
- garciaay
- gigg5481
- jkck9187
- exd949
- kimbe
- cfan9722
- ihuh9690
- vguv3184
- vwnh0603
- wowt1140
- dzia8957
- lpje5033
- ihcb9867
- exd949
