# Summer interns

https://wiki.ugent.be/pages/viewpage.action?spaceKey=HPC&title=Student+interns

Michiel Lachaert
- https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=119510
- 3 Aug-1 Sept (21 days)
- https://hackmd.io/EE--831BRe-CQS4qrTMzXQ

Xander Bil
- https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=114727
- 7 Aug-1 Sept (19 days)
- https://hackmd.io/311zbOWgQCiuwJaszJnLpw

## Ticket list

@ Xander/Michiel: to indicate you're working on a ticket:
- make yourself owner in OTRS (via People -> Owner) in the ticket
- add [[YourName]] in the corresponding bullet below

**Solved tickets**: move them to the end of this HackMD note!

## Planning

*Kenneth will take a couple of days of leave in week of 14-18 Aug (Thu+Fri 17+18 Aug, maybe also Wed 16 Aug)*

- week 0
    - Thu 3 Aug (only Michiel): KH, ML @ S9
    - Fri 4 Aug (only Michiel): KH, ML @ home
- week 1
    - Mon 7 Aug: KH, ML, XB @ S9
        - Kenneth leave ~3pm
    - Tue 8 Aug: KH, XB, ML @ home
        - Kenneth @ home
    - Wed 9 Aug: all @ S9
        - sync meeting at 10:00 (Miami)
    - Thu 10 Aug: KH, ML @ home
        - Kenneth @ home
    - Fri 11 Aug: all @ S9
        - sync meeting at 10:00 (Miami)
- week 2
    - Mon 14 Aug: all @ S9
        - sync meeting at 10:00 (Miami)
        - FIXME meeting clash for Kenneth, move sync meeting to 11:00?
    - Tue 15 Aug: public holiday! \o/
    - Wed 16 Aug: all @ S9
        - sync meeting at 11:30 (Miami)
        - maybe half-day (pm) leave Kenneth, working @ home
    - Thu 17 Aug: all w@h
        - maybe leave Kenneth, maybe working remotely
    - Fri 18 Aug: all w@h
        - leave Kenneth (afk)
- week 3
    - Mon 21 Aug: all @ S9
        - sync meeting at 10:00 (Miami)
    - Tue 22 Aug: all w@h
    - Wed 23 Aug: all @ S9
        - sync meeting at 10:00 (Miami)
    - Thu 24 Aug: all w@h
    - Fri 25 Aug: all w@h
        - sync meeting at 10:00 (Teams)
- week 4
    - Mon 28 Aug: ML, XB w@h / KH leave (AM or all day)
    - Tue 29 Aug: ML, XB @ S9 / KH w@h
        - sync meeting at 10:00 (Miami)
    - Wed 30 Aug: all @ S9
    - Thu 31 Aug: all @ S9
        - sync meeting at 11:00 (Miami)
    - Fri 1 Sept: all w@h / KH leave PM

## Getting started

- VSC account
    - https://www.ugent.be/hpc/en/access/faq/access
- HPC-UGent website: https://ugent.be/hpc
- HPC-UGent intro
    - slides: https://www.ugent.be/hpc/en/training/2023/introhpc202306
    - recording: https://www.ugent.be/hpc/en/training/introhpcugent-recording
- HPC-UGent docs: https://docs.hpc.ugent.be
    - sources at https://github.com/hpcugent/vsc_user_docs (mkdocs/docs/HPC subdirectory)
- access to HPC-UGent helpdesk
    - OTRS (https://otrsdict.ugent.be)
    - mails to hpc@ugent.be or compute@vscentrum.be or ...
    - chat via Teams
        - no longer possible, need to resort to Teams? :-/
- shared HackMD notes
    - keep track of work done and work-in-progress
    - list of feasible helpdesk tickets
    - one shared note + separate notes for Michiel + Xander
- semi-daily quick stand-up meeting in the morning (10:00?)

## Tasks

### Helpdesk

- https://otrsdict.ugent.be
- answer incoming (easy) tickets
- create additional templates to make replying to common questions easier
- cleanup of old open tickets
    - figure out what's blocking them
    - reply to ask if it's still relevant
    - compose template reply for old software installation requests

### HPC-UGent docs

- improve existing documentation
    - based on incoming helpdesk tickets (could the question have been avoided?)
    - make progress towards answering with pointer to the documentation
- add chapters on specific use case
    - basic Python
    - basic R
        - via Rscript
        - Bioconductor R packages are in R-bundle-Bioconductor module
- review/improve "best practices" chapter
    - in http://localhost:8000/HPC/Gent/account/
- consistently use `[link](example.md)` instead of `[link](../example)` to link to other pages
- [[Michiel]] generate overview of available software that can be added to HPC-UGent docs
    - Python script
        - based on "module avail"
    - table with software as rows, clusters as columns
    - software name as link to page with more information on that software: software versions, usage info, etc.
    - also add extensions to overview (when Lmod includes them in output of "ml --terse avail")
    - auto-update overview of available software
        - for example via cron job that runs every week that makes a pull request to HPC-UGent docs?
        - can also be used in EESSI project, to auto-update page with available software in EESSI docs
    - Javascript to create search box to filter overview based on query
        - sort of like https://packages.spack.io (see also https://github.com/spack/packages.spack.io)
        - or via MkDocs plugin, see https://bwmarrin.github.io/MkDocsPlus/datatables
            - This does not work anymore, BUT other solution found:
                - https://github.com/squidfunk/mkdocs-material/discussions/2584
                - Just add HTML to the markdown and add custom javascript
    - PRs:
        - https://github.com/hpcugent/vsc_user_docs/pull/532

### Software installation requests

- make it easier to create an issue in vsc-software-stack repo for a software installation request
    - extract data via OTRS API, or scrape the page?
    - create Python script that takes content of installation request as input and produces (MarkDown) content for issue
    - implement auto-generating template easyconfig file (PythonBundle)
- [[Xander]] look into better support for installing software with conda (cfr. Stijn's suggestions)
    - conda toolchain
    - can conda environments be stacked on top of each other?
    - can software installed with conda be combined with software installed from source with EasyBuild in a safe way?
    - example of pure conda install with EasyBuild: https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/q/QIIME2/QIIME2-2022.8.eb (compare with https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/q/QIIME2/QIIME2-2023.5.1-foss-2022a.eb)
        - basically just feeds https://raw.githubusercontent.com/qiime2/environment-files/master/2022.8/release/qiime2-2022.8-py38-linux-conda.yml to `conda install`
        - 100k files, includes Python, FFTW, etc.
        - QIIME2/2022.8 can't be combined with modules installed from source with EasyBuild
    - can we install stuff with conda on top of modules installed from source?
        - https://conda.io/projects/conda/en/latest/user-guide/configuration/use-condarc.html#searching-for-condarc
        - quick & dirty notes Stijn:
            ```
            if I understand it correctly now, we have to build up the pkgs_dirs in the condarc:
            that can be done with lots of tricks, but we need to look into how we can adjust the search_path itself
            so that it gets picked up for each conda module (and here I again mean conda as a toolchain).
            worst case is a patch in conda that, for example, scans the environment for EBROOT variables,
            checks for each EB whether an EBROOTHUPPEL/condarc exists, and builds it up that way.
            the alternative is that a conda module itself becomes the only CONDA_ROOT,
            and that the eb install of that module creates the condarc for all module deps,
            and that the lmod commands can also adjust it somewhere?
            but I think that would mean writing dirty lmod lua code, and then a python patch in the conda package may be simpler
            ```

### Other

- training LLM to help with HPC-UGent support?
    - how realistic is it to get a good enough "chatbot" to offload HPC-UGent helpdesk?
    - can we extract training data from OTRS tickets to finetune existing open source LLMs?
    - links
        - https://beebom.com/how-train-ai-chatbot-custom-knowledge-base-chatgpt-api
        - https://lightning.ai/pages/blog/how-to-finetune-gpt-like-large-language-models-on-a-custom-dataset
        - https://hacks.mozilla.org/2023/07/so-you-want-to-build-your-own-open-source-chatbot
        - https://simonwillison.net/2023/Aug/3/weird-world-of-llms/
        - https://agi-sphere.com/llama-models/
        - Python library:
            - https://llm.datasette.io
            - https://github.com/simonw/llm
        - Document based LLM-Powered Chatbot (https://medium.com/@abonia/document-based-llm-powered-chatbot-bb316009de93)
            - langchain (framework to work with LLMs): https://python.langchain.com
            - embedding (vector database): https://github.com/chroma-core/chroma

### Work done

#### Michiel

- [[Michiel]] Ticket#2023080460000546 - Problem with worker module for non-default clusters (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=126530)
    - ticket already answered by KH, common `Illegal instruction` problem with worker
    - docs can be improved, see https://github.com/hpcugent/vsc_user_docs/issues/535
    - I will also follow up on the ticket (Michiel)
- [[Michiel]] Ticket#2023080760000086 - Problems when running in parallel (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=126566)
- [[Michiel]] Ticket#2023081160000103 - swapping login nodes (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=126959)
    - answer: yes, use `ssh gligar08` (or `ssh gligar07` to go from 08 to 07)
        - I think that always works (it does for me, but I have some key forwarding setup), but check it first
    - maybe also good to cover in the docs, along with mentioning `screen` and `tmux` to create "persistent" sessions?
- [[Michiel]] Ticket#2023080860000164 - Downloading data directly onto the HPC (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=126687)
    - ask what "a secured transweb link" is exactly, and how much data this concerns
    - suggest one option to set up desktop session via web portal (https://docs.hpc.ugent.be/web_portal/), start firefox (via terminal if needed), and use that browser session to do the download
    - suggest to use donphan interactive cluster for this (https://docs.hpc.ugent.be/interactive_debug/)
- [[Michiel]] Ticket#2023081060000721 - mem alloc (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=126930)
    - default memory limit is determined by # requested cores (quarter of the cores => quarter of the memory)
    - more memory can be requested via "-l mem", see https://docs.hpc.ugent.be/fine_tuning_job_specifications/#pbs_mem
    - error suggests 256GB of memory is needed (?!), suggest using gallade large-memory cluster?
        - MB not GB right? => yup :)
    - mem limit is per node, shared over cores
- [[Michiel]] Ticket#2023081660001317 - 回复: VSC: personal storage VSC_HOME usage at 93.69% (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=127247)
    - explain that storage quota in home directory is fixed, is not increased for anyone (also because filesystem on which home directories are located is not a good fit for large volumes of data or large amount of files)
    - => should use symlink to directories in other filesystems like $VSC_DATA, $VSC_SCRATCH
- [[Michiel]] Ticket#2023081660001201 - github copilot op vsc (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=127236)
    - only look into this if you're familiar enough with VS Code
    - This doesn't work for me either (Michiel)
- [[Michiel]] Ticket#2023082060000042 - Use of CPLEX (user vsc42286) (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=127480)
    - wrong GCC version
- [[Michiel]] Ticket#2023082360000868 - Storage for large files (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom&TicketID=127859)
- [[Michiel]] More tickets, see OTRS...
- [[Michiel]] Docs about `Illegal instruction`
- [[Michiel]] Changing login nodes with `ssh gligar07`, along with mentioning `screen` and `tmux` to create "persistent" sessions
- [[Michiel]] add section in troubleshooting "Why does my job not run faster when using more cores and/or nodes?"
    - (see also HPC-UGent intro slides)
    - software does not run magically faster when more cores are available, the software must be able to actually use multiple cores (and may need to be told to use N>1 cores), if it supports this at all;
    - for multi-node jobs: using resources of the other N-1 nodes requires using something like MPI (https://mpitutorial.com/tutorials), which usually involves starting the software via a command like `mpirun` or `mympirun` (see https://docs.hpc.ugent.be/mympirun), discovering which nodes are available (and how many cores per node), etc.
    - PR: https://github.com/hpcugent/vsc_user_docs/pull/546
- [[Michiel]] Docs about cross cluster job submission

#### Xander

- [[Xander]] Ticket#2023080960000377 - Error message: Illegal instruction (core dumped) (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=126807)
- [[Xander]] Ticket#2023080760001111 - Connecting to the HPC infrastructure (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=126669)
    - point to https://docs.hpc.ugent.be/troubleshooting/#sec:connecting-issues (use OS-less link!)
- [[Xander]] Ticket#2023081460000447 - problem with worker (https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=127085)
    - [KH] maybe `module purge` in job script helps (to avoid incompatibilities between worker module and software modules being loaded), but it's probably a different problem in this case...
- [[Xander]] Ticket#2023081760000745 - unable to start a jupter notebook job
    - https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=127325
- [[Xander]] Ticket#2023081760000951 - Issues with worker framework jobs
    - https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=127346
- [[Xander]] set up linux-tutorial navigation
    - https://github.com/hpcugent/vsc_user_docs/pull/539
- [[Xander]] look into broken macros (https://github.com/hpcugent/vsc_user_docs/issues/537)
    - PR https://github.com/hpcugent/vsc_user_docs/pull/540
- [[Xander]] dark mode
    - inspiration: https://github.com/easybuilders/easybuild-docs/pull/117
    - PR https://github.com/hpcugent/vsc_user_docs/pull/544
- [[Xander]] add "getting started" chapter (https://github.com/hpcugent/vsc_user_docs/issues/534)
    - could be useful as example workload: https://github.com/EESSI/eessi-demo/tree/main/TensorFlow
    - PR https://github.com/hpcugent/vsc_user_docs/pull/542
- [[Xander]] improve landing page (see also https://github.com/hpcugent/vsc_user_docs/issues/533)
- [[Xander]] clarify that SSH key pair is only needed for connecting with SSH (not needed for web portal)
- [[Xander]] AlphaFold: https://otrsdict.ugent.be/otrs/index.pl?Action=AgentTicketZoom;TicketID=57912 + https://www.vscentrum.be/alphafold
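
#### Sketch: software overview table

For the "overview of available software" task listed under HPC-UGent docs, a minimal sketch of the parsing and table-building step could look like the code below. It is not the script from the PR. It assumes that the terse `module avail` output has already been captured as text per cluster, and that this output consists of module path headers ending in `:`, bare name entries ending in `/`, and `name/version` lines. The function names `parse_terse_avail` and `overview_table` and the sample data are hypothetical.

```python
from collections import defaultdict


def parse_terse_avail(text):
    """Map software name -> sorted list of versions, from terse 'module avail' text."""
    versions = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        # skip blank lines, module path headers ("/path/to/modules:")
        # and bare name entries ("Python/")
        if not line or line.endswith(":") or line.endswith("/"):
            continue
        # drop markers such as "(default)" that may follow the module name
        line = line.split("(")[0].strip()
        name, sep, version = line.partition("/")
        if sep and version:
            versions[name].add(version)
    return {name: sorted(vers) for name, vers in versions.items()}


def overview_table(avail_per_cluster):
    """Build a Markdown table: software as rows, clusters as columns."""
    clusters = sorted(avail_per_cluster)
    parsed = {c: parse_terse_avail(avail_per_cluster[c]) for c in clusters}
    names = sorted({n for p in parsed.values() for n in p}, key=str.lower)
    lines = [
        "| software | " + " | ".join(clusters) + " |",
        "|" + " --- |" * (len(clusters) + 1),
    ]
    for name in names:
        cells = ["x" if name in parsed[c] else "-" for c in clusters]
        lines.append("| " + name + " | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# made-up sample output, for illustration only (not real cluster contents)
SAMPLE = {
    "donphan": "/apps/modules/all:\nPython/3.10.8-GCCcore-12.2.0\nR/4.2.1-foss-2022a\n",
    "gallade": "/apps/modules/all:\nPython/\nPython/3.10.8-GCCcore-12.2.0\nPython/3.11.3-GCCcore-12.3.0\n",
}

table = overview_table(SAMPLE)
print(table)
```

Keeping the parser separate from the step that collects the `module avail` output means it can be tried on saved output without access to a cluster. The per-software pages with versions and usage info could be generated from the same `parse_terse_avail` result.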