# TTT4HPC 23/04/2024 ARCHIVE

## Tuesday Tools & Techniques for High Performance Computing - Episode 2 :: Day-to-Day working with HPC

:::danger
## Infos and important links

- This is the archive document
- Program for the day: https://coderefinery.github.io/TTT4HPC_Interactive/
:::

*Please do not edit above this*

# Tuesday Tools & Techniques for High Performance Computing - Episode 2 :: Day-to-Day working with HPC

## Icebreaker

What's your favorite free time activity?

- baking o
- reading oooo
- crafts oo
- running o
- swimming oooo
- skiing
- gymming o
- hiking ooo
- crossfit
- cycling o

What's the weather where you are?

- snow: ooooooooo
- rain: o
- sun: oooo
- cloudy: oooooo
- cold: o
- warm:

Where are you + weather:

- Helsinki / heavy snow: ooooooo
- The Netherlands / sunny
- Geneva / sun but cold for the season
- Kirkkonummi (Southern Finland) / 20 cm of snow
- Tromsø / freezing but blue skies
- Modena, Italy / rainy
- Lohja / going towards 30 cm of snow
- Gothenburg, Sweden
- Ski, Norway / super cloudy
- Bergen, Norway / cloudy, peeping sun
- Gothenburg, Sweden / cloudy
- Athens, Greece / cloudy

When doing development or debugging on HPC or for your work, what do you do?

- Editor in terminal on cluster: oooooooo
- Development on laptop and then transferring it to the HPC: oooooo
  - Either SFTP into it, or just sync via a local git
- VS Code, occasionally: oo
- Any comments on "mosh"?
  - I'd love to learn this method!
  - How does it deal with SSH agent forwarding? I had heard there were problems with it.
    - Good question. I'm not sure; it's a different protocol, but I think it might start the connection with SSH. (Update: a quick search says agent forwarding is not possible, which I understand.)

## Questions practice

- Test this shared notes document!
  - This is a question
    - here an answer!
  - This is more of a comment...
- something else
  - Not me!
- Where are you?
  - I'm in Helsinki, currently in my office at Aalto, loving the snowfall!
## Intro

## Motivation

https://coderefinery.github.io/TTT4HPC_Interactive/Motivation/

- Does the word "scratch" come from scratchpad?
  - Good question, I'm really not sure. How many clusters have their big fast space labeled "scratch"?
  - Maybe the name is chosen to imply "this is not backed up", and scratchpad would be a logical origin
  - As far as I know, it comes from "scratch space", from the analogy with scratch paper - basically an (intentionally) unreliable, temporary space that's highly dynamic and faster than network filesystems.
  - When I first started on a cluster I found the name "scratch" funny/confusing, since I thought of DJs who scratch records/disks, but I guess that's not what we want to do with the file system
- Is it ok to do small runs on the login node? What is the limitation?

## Project arrangement

https://coderefinery.github.io/TTT4HPC_Interactive/Project_arrangement/

- Should we use English only? Is a local language allowed? Is only the "8.3" format allowed, or are long file names allowed too? Are all characters allowed?
  - Long file names are OK, but I would avoid spaces in file names
    - not good: "My first experiment.txt"
      - reason: if I am not careful putting quotes around the file name, commands might be confused whether the argument is "My" or "My first experiment.txt"
      - Devil's advocate: always put spaces and non-ASCII characters in your file names, to catch very obvious bugs early where inputs are not sanitized and escaped properly. :-)
    - better: "my-first-experiment.txt" or "my_first_experiment.txt"
  - About language: choose the language which your present and future collaborators and group members understand, which these days is likely English. Avoid characters in file names/directories which are not in the ASCII set; in other words, avoid special characters.
    - but ASCII is only for English.
      - good point
      - On one hand it is a bit sad that we are losing local languages in computing and programming, possibly making them less accessible to those who don't speak English, but on the other hand I have spent some time fixing problems where I could not run something because of special characters that my computer did not understand
- I prefer a recursive structure so that everything related to one project is in one place; maybe the paper won't be a subdirectory of the code, but the code and paper will be subfolders of one big project directory!
  - The issue is that everyone in the research group will have read/write access to everything. Sometimes you want fine-grained permissions, and multiple folders (with different group ownership) can help with that
    - Great point...
  - Sounds like a good setup!
- Is there any way to search not only the file/folder name but also the "meta" data in different kinds of files? For example, metadata (EXIF) in pictures.
  - To my knowledge, tools like `find` and `grep` do not know about EXIF, but there are specialized tools, including command line tools, which do understand EXIF, and one could use those in combination with `find`: first locate all files with a certain file ending, then extract the meta information from each.
- Why would I put a paper directory on the cluster? In my head, I was thinking of a project directory just for the computation and results that I get from running experiments on the cluster.
  - It is also personal/group preference. I personally agree with you. I would keep computations on the cluster and keep the paper manuscript somewhere else where the whole group can access it.
  - I think it is more of a conceptual distinction: (sub)project == paper == folder, with the paper being the minimum unit of project management.

## Data sync

https://coderefinery.github.io/TTT4HPC_Interactive/Data_sync/

- How can I check if I have ssh set up correctly?
  - One option is to try to connect to the cluster and see if you can use it without a password. Another option is to use `ssh-keygen -F <cluster-address>` to check if there is an entry for that host in your `known_hosts`.
  - Yeah: just try it. If you get the features you need (short alias, no password, etc.), then it works.
- How do I make sure the files and folders I transfer between two different locations and file systems are complete and correct?
  - Rsync does it for you! This is why it is much better than scp. Rsync can perform a check based on the file metadata (size, date, etc.) or - more strictly - it can check the fingerprint of the file (a hash string computed from the actual bits of the file). Richard is just talking about this.
  - If I only want to check, not transfer, how do I do it with rsync?
    - without rsync: `md5sum <filename>` is the command that generates the checksum
    - with rsync: add the `--dry-run` option (a "simulated" run of rsync without any transfer). Remember to include the `-v` option (verbose) so that you get some feedback from rsync.
- How useful is it to use the `-z` option for compressing data with `rsync`?
  - Context: on laptops etc. it's a bad idea because the hardware connection over USB/etc. is fast enough and it would just waste CPU cycles, but what about cluster(s) and networked storage?
  - I would consider it only if I wanted to transfer a huge amount of data. I mostly need to transfer many small files rather than a few gigantic files, so in my own work I almost never use it.
  - Also consider that SSH might be compressed itself.
- Can it automatically transfer the files if anything is changed?
  - That is basically the default mode for rsync (check for differences, transfer only that). It can check this via a full file checksum, or by only looking at the timestamps.
- When I transfer files from a unix to a windows file system, what should I do with the hard and soft links?
  - Very good question.
    I personally gave up, and instead mount the remote on Windows when needed and use only Linux-based storage systems.
  - I just never bothered and used WSL to transfer to a local Windows-accessible folder. It provides a good unix-to-unix interface and one can follow hard/soft links as on Linux.
- Why would I see different file sizes for the same file on different file systems? For example: EXT4, NTFS, ZFS, etc.
  - There are multiple reasons; I would say the most relevant is the metadata about the file, which might be stored differently on different systems. When it comes to the actual disk space your file is taking, with the Lustre filesystem (usually the /scratch in an HPC system is Lustre) the minimal size of a block can be 4M, so anything smaller than 4M actually takes 4M.
  - "du" is "disk usage", which rounds up to the block size. Possibly the same for other commands. Leading to the above.
  - Is there a command to know the real "actual" size of a file (which should be universally consistent among different file systems)?
    - `du -b` (which is `du --bytes`) prints the size in bytes.
    - Note there are also [sparse files](https://en.wikipedia.org/wiki/Sparse_file) which have holes that don't take up disk space, which is another possible complexity - depending on your needs.
- Any good tools to generate and check the "md5" recursively at the target?
  - find with the `-exec` option. For example `find . -exec md5sum {} \; -print` (I like to always add a `-print` when running find with `-exec`)

:::info
If you want to try this later, replace triton.aalto.fi with whatever your cluster entry point is. Check your cluster documentation. Feel free to ask here if you are unsure; there are admins of most of the other nordic clusters around.
:::

- Is anyone using git-annex in their work? There is also a nice wrapper on top of git-annex called [datalad](https://www.datalad.org/)
  - I've used Git LFS. Can you compare? +1
    - with git lfs you have a single place for storage.
      Git-annex is truly distributed and can store redundant copies in multiple places - and knows where they are.
    - git lfs works on GitLab as well!
  - Is `git annex` a git command?
    - No, it is a separate tool which needs to be installed; it does not come with "normal" git
  - In order to use `git annex`, does the machine where the large files are stored have to be online 24/7?
    - It keeps a copy; you only need access when you get/push.
- sshfs installation on Mac M1 processors is a big trouble, and it is frustrating to find a hacky solution with each update of the operating system version. Do you have any suggestion for an alternative?
  - Good point, I did not know. Is it specific to the M1/M2 processors?
    - In my case, I only know about the M1
  - An alternative (if your cluster supports it) is to use a samba mount. At Aalto for example: https://scicomp.aalto.fi/triton/tut/remotedata/#remote-mounting-using-smb
    - Thanks for this! I will try.
  - SSHFS requires installing OSXFuse (for ext support), which requires you to enable legacy extension support, which is essentially backwards compatibility for x86 kernel extensions
    - Yeah, unfortunately M1/M2 requires custom compilation for performance, so not everything is supported.
- ...

## Code Sync

Materials: https://coderefinery.github.io/TTT4HPC_Interactive/Code_sync/

- Is git the only option for code syncing? In our group we use Google Drive; would that work on the cluster?
  - git has very nice options for resolving possible conflicts, and also keeps track of all the file changes, which makes reverting to a previous version very easy. So it's not the only option, but it is the best practice.
  - Google Drive access on remote computers can sometimes be tricky. And what if it goes wrong and gets desynced?

## Graphical User Interfaces on cluster

Materials: https://coderefinery.github.io/TTT4HPC_Interactive/GUIs/

- What is the difference between SSH `-X` and `-Y`?
  - There is a concept of "untrusted" X forwarding (what `-X` requests, as opposed to "trusted" with `-Y`), which is more limited but safer.
    But it doesn't work for many applications, so Debian/Ubuntu effectively alias `-X` to `-Y`. https://askubuntu.com/questions/35512/what-is-the-difference-between-ssh-y-trusted-x11-forwarding-and-ssh-x-u

:::success
Do you use GUIs over the cluster? Which program?

- No: ooo (the latency is always bad even when I am near the cluster)
- Tensorboard
- QGIS (to look at spatial data results): o
- RStudio
- Is JupyterLab a GUI over the cluster? :)
:::

## Working Interactively (with Command Line Interface)

Materials: https://coderefinery.github.io/TTT4HPC_Interactive/Working_interactively/

Share your workflows for others to learn!

How often do you use interactive jobs?

- Daily: ooo
- Sometimes: o
- Basically never:

- If I get disconnected during an interactive job session, can I get it back? Will I waste the computing time for that session?
  - If you use screen/tmux the job will stay alive; otherwise it will be canceled. But it also depends on the cluster settings; this is how it is at Aalto.
  - If the job is still running and idling, (depending on your cluster) you should be able to ssh to the node the job was allocated to, but you won't be able to resume the actual session (e.g. you had vim open and the connection died but the job is still running)
  - In my experience, when the connection dies with `srun --pty bash`, the job also dies. With `sinteractive` (not available on all clusters) things can be different
  - ...

Do you normally use tmux/screen?

- Daily user: ooo (go tmux!)
- Rarely:
- I have never used them: o

- A comment: screen/tmux was absolutely one of the most important things I learned with Linux, back in the days, to make sure you could stay connected to IRC chats via terminal.
- A note for better understanding: it's like opening a 'terminal' inside the cluster, so if you lose your connection, the terminal is still there.
- Is it better to run tmux on the login node, or on some other node (e.g. my workstation)?
  - I usually use multiple terminals on my local computer, and use tmux on the login node.
    You can divide the screen etc. in tmux, which is nice to have.
  - For experienced tmux users, a program called [tmuxinator](https://github.com/tmuxinator/tmuxinator) is very useful for managing multiple different tmux environments/sessions and reopening the same sessions after they have been closed.

## Using VSCode on HPC

Materials: https://coderefinery.github.io/TTT4HPC_Interactive/VSCode/

Have you used VSCode or other IDEs with a remote connection to the cluster?

- Daily: oo
- Sometimes: oo
- I had no idea this is possible: oo

- How large files can you drag and drop? Is it safe?
  - Everything would be transferred over SSH, so it should be safe. In theory SSH can transfer a lot, but I don't know if VSCode can resume failed transfers, so I wouldn't use it for anything too big.
- I don't understand: where is VSCode running? If I have a Jupyter notebook in VSCode, would the kernel run on the login node or on my local laptop?
  - VSCode is running locally, but it has a server on the cluster. Not sure about Jupyter, but probably on the cluster.
  - If you start a terminal, for example, it runs on the cluster.
  - VSCode is split into two parts: a user interface and a backend server that accesses files. They can be split, via Remote SSH, to run on two different computers. That's what is happening here.
- So a job was submitted from the VSCode terminal. But if I want to run bits of what I have open, it would run on the login node, unless I set it up differently?
  - Yes. But if you can ssh to a compute node you have reserved, you could connect to it and run the whole VSCode server there.
    - Do I first request a compute node, and then open VSCode and point it to that node?
      - Yes, at least that works on Triton and probably on many other systems. Reserve a node with `srun --pty bash`. Run `hostname` and then point VSCode to that hostname. You might need to tunnel through the login node.
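A minimal sketch of that tunneling setup as an SSH client config fragment (all hostnames and the username here are placeholders, not from the course material; check your cluster's documentation for the real entry point):

```
# ~/.ssh/config -- hostnames/usernames are placeholders, adapt to your cluster
Host mycluster-login
    HostName login.example-cluster.org
    User myusername

# Jump through the login node to reach a reserved compute node;
# point VSCode Remote-SSH at "mycluster-node"
Host mycluster-node
    HostName node123            # the name printed by `hostname` on the reserved node
    User myusername
    ProxyJump mycluster-login
```

With this in place, `ssh mycluster-node` (and hence VSCode Remote-SSH to `mycluster-node`) tunnels through the login node automatically.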
- I tried to figure out a way to go straight to nodes: <https://scicomp.aalto.fi/triton/apps/vscode/#vscode-remote-ssh-host-directly-to-interactive-job> - which seems to work. It could use more eyes to make sure it works, though.

## Feedback

:::info
Daily notes:

- **Lunch break now; after lunch there are Zoom sessions for exercises**
- If you are registered, come by the exercise session about an hour after we finish
- If you need the 1 ECTS, come mark your presence
- Many of the partners listed on the website have support also: they would be *very happy* to receive your questions and help you with anything discussed today.
- For CSC users: weekly user support session tomorrow https://ssl.eventilla.com/usersupportcoffee
:::

Today was:

- too fast:
- too slow:
- right speed: ooooo
- too advanced:
- too basic:
- right level:

One good thing about today:

- sshfs
- vscode +1o
- rsync over ssh
- ...

One thing to improve for next time:

- debugging with VSCode a bit slower please
- Need written material for the "debugging with VSCode" session.
- ...

Any other feedback?

- great practical approach and lots of tools to learn and use
- ...

# Welcome to the hands-on zoom session!

- We start at 12:00 Oslo time / 13:00 Helsinki time.
- This session will **not** be recorded so that everyone can interact freely.
- Usual zoom etiquette. Please mute yourself if you don't want to talk.
- Ask questions in this document or raise your zoom hand to ask live (we will write down live questions into this document anyway for documentation purposes)
- We follow the CodeRefinery Code of Conduct: https://coderefinery.org/about/code-of-conduct/
  - If you feel that the Code of Conduct was violated, please report it to any of the instructors (Host/Co-host in this zoom) or via scip@aalto.fi (which also reaches people who are not here)
- If you would like to receive the credit, please send a direct message to the host (Enrico Glerean) to mark your presence. Enrico will confirm that your presence is marked.
## Exercises

### For those who need the credit: try at least 4 exercises and document what you did and its output

- Try rsync to/from the cluster. See section "Homework: rsync" https://coderefinery.github.io/TTT4HPC_Interactive/Data_sync/#transferring-data Document your results with a simple textual copy-paste from the terminal
- sshfs. See section "Homework: sshfs" https://coderefinery.github.io/TTT4HPC_Interactive/Data_sync/#mounting-data-from-place-to-place
- One of the two GUI exercises at https://coderefinery.github.io/TTT4HPC_Interactive/GUIs/
- Interactive jobs on your cluster (first homework) https://coderefinery.github.io/TTT4HPC_Interactive/Working_interactively/

### Other optional exercises:

- Unison. See section "Homework: unison" https://coderefinery.github.io/TTT4HPC_Interactive/Data_sync/#syncing-data-two-ways Note: it might not work on some clusters.
- Git annex exercise https://coderefinery.github.io/TTT4HPC_Interactive/Data_sync/#git-annex Note: it might not work on some clusters.
- Code sync exercises (two homeworks) at https://coderefinery.github.io/TTT4HPC_Interactive/Code_sync/
- Any other homework listed at https://coderefinery.github.io/TTT4HPC_Interactive/Working_interactively/
- Try out VS Code https://coderefinery.github.io/TTT4HPC_Interactive/VSCode/

### Poll: which cluster are you using?

Some of the exercises do not work on some clusters (e.g. Unison and git-annex are not available)

- Aalto University Triton cluster: oo
- CSC (Puhti and Mahti):
- Dardel at PDC, Stockholm:
- Leonardo Booster (Italy):
- LUMI:
- NRIS/Sigma2 clusters:
- Tetralith at NSC, Linköping:
- Uppsala University's UPPMAX: o
- DelftBlue, The Netherlands:

### Questions/comments

- sshfs is nice! How come I did not know this existed?
  - We should add to the material how to unmount again (issue: https://github.com/coderefinery/TTT4HPC_Interactive/issues/3)
- I am trying to run the VSCode remote exercise on Triton. However, I am not on the Aalto network and currently not on VPN.
  I am trying to establish the connection without VPN, which seems like it should be possible. However, kosh requires both an ssh key and a password (I use kosh as a ProxyJump to triton). My problem is that the VSCode extension fails to provide a password prompt when it's connecting over SSH. Do you have any solution for this? (MacOS + Aalto laptop)
  - Another option would be using ProxyJump from kosh to triton. So you would have an ssh config entry for kosh in your local config file, and then a `triton_via_kosh` entry. Look here for more info: https://scicomp.aalto.fi/scicomp/ssh/#proxyjump
    - Yes, this is what I am doing. It's a VSCode+MacOS issue!
  - Do you also have an ssh key?
  - Multiplexing would allow you to make one SSH connection from the terminal, and then future connections (like from VSCode) reuse it (no key or password needed): https://scicomp.aalto.fi/scicomp/ssh/#multiplexing Assuming VSCode supports the multiplexing, which it actually might not...
    - I think this is what VSCode does by default.
  - The issue is that for some reason VSCode fails to show a password prompt. A workaround is to use the VPN to avoid the password prompt.
  - Note: the git version on Triton is 1.8.3.1, which makes VSCode complain...
  - Another issue (official solution?): https://code.visualstudio.com/docs/setup/linux#_visual-studio-code-is-unable-to-watch-for-file-changes-in-this-large-workspace-error-enospc
  - Ok, for me it's not worth debugging this issue any longer :-) There seems to be another problem: kosh is refusing connections 90% of the time outside the VPN.
    - In my personal experience lyta is better than kosh
- (out of scope discussion) Anyone use `typst` instead of `LaTeX`? Or any good replacement for LaTeX?
  - Not using `typst` yet but experimenting a bit with quarto
  - Thread in the CodeRefinery chat: https://coderefinery.zulipchat.com/#narrow/stream/303751-TIL/topic/Typst (welcome to the chat!
    You need to register first)
- Do you know DVC (Data Version Control, https://dvc.org/) as an alternative to Git Annex or Git LFS?
  - It has telemetry, but you can opt out
    - https://dvc.org/doc/user-guide/analytics
    - `dvc config --global core.analytics false` or `export DVC_NO_ANALYTICS`
- How can I run an interactive job on a GPU node? Do I modify something in `srun --pty bash`?
  - You have to ask for GPU nodes :) You can ask for a specific partition by adding `-p <partition>`, or set a specific GPU requirement with `--gres=gpu:n`, where n is the number of GPUs you want.
  - At Aalto the gpushort partition is a good one for these quick interactive tests with GPU nodes.
  - CSC Mahti (this is new): https://docs.csc.fi/support/wn/comp-new/#mahti-has-small-gpus-available-for-interactive-work-532024
- I didn't get exactly how git annex finds out where to put files (point 6 of the exercise). Is there a configuration file for each remote?
  - `git annex wanted` will show it:
    - `$ git annex wanted allas include=*/out/*`
    - `$ git annex wanted triton anything`
  - ...

## Were you able to run anything? Which exercise?

- Still trying to install rsync on my Windows laptop :( Thanks for the tips. I do have admin rights, but it's now a point of principle :)
- How can we submit the exercises?
  - For the credit? Just write a report with what you did, some copy-paste from the terminal, etc., and send it with all the other days' exercises to scip _at_ aalto.fi. See more instructions at: https://scicomp.aalto.fi/training/scip/ttt4hpc-2024/#credits
- Can we get certificates? I am not currently a student so I don't need the credit
  - If you are affiliated with a research institution (e.g. a university) then you can get a certificate that can be converted into a credit. If you are at Aalto, please mention this when you submit the homework
    - I am affiliated with a govt research body
      - It's fine then.
        Basically we don't give credits to people working at companies / self-employed, just for fairness towards our own goals, not for any other reason.
- Should we email the exercises after each course day, or all at the end?
  - As long as you send them all by the end of May. Maybe it is easier to send them all at once? I can clarify this at https://scicomp.aalto.fi/training/scip/ttt4hpc-2024/#credits
- ...