# ASC Tech audience notes
## General links
* Schedule: https://scicomp.aalto.fi/tech/
* Zoom: https://aalto.zoom.us/j/65371910202
* This page: https://hackmd.io/la6yjoEcREyWxLLoT45SAA
### Questions here
## Session Friday 28.5.2021, Research Software Engineer Service
:::info
We start at 10:15
:::
**This talk will be recorded and published CC-BY, please leave your video off. Instead of voice, you can write your questions here:**
- [Slides](https://docs.google.com/presentation/d/1Ti4TvjAilnElk9ITBZVsMnR0g7pfgPg8t5HHe2YOQs4/edit)
- Do you help anyone at Aalto, or do I need to be a member of some group, project, or the like?
- We'll talk about this at the end, but right now anyone in the School of Science, since that is where our funding comes from.
- But if anyone at Aalto comes, we try to set them on the right track no matter who they are.
- Is developing a database for the web part of the RSE profile? The data in question: materials and experiment results.
- We are working on projects like this and have some expertise in it.
- Different RSEs have different expertise; we try to cover a wide range.
- Maybe we can talk about your work a bit more in the garage (or set up a meeting at another time?)
- It sounded like the RSEs are going to be a permanent part of the joint HUS+Aalto "research group"? What is the reality/plan?
- We'll take it as it comes; we don't need to decide now. It is possible that the group permanently funds a fraction of an RSE who is then always available (this is the case with others in Science-IT and works great).
- It could also be on-and-off consultation, with occasional deeper projects.
- A lot of this depends on what funding model works (see future)
- At the moment, just having the PIs involved know where to find us, so they can send students with questions, is already a good thing.
- What is MAGICS?
- A new Aalto infrastructure for behavioral data collection from humans
- The one Jarno mentioned?
- Yes
- https://magics.fi/
- If the question is not exactly a computational project, but rather the data that comes from an experiment, its collection, and its analysis, is that still something an RSE can be involved in? Can you help with LabVIEW?
- RSEs can help you with any software development and we have experts in data management as well.
- Data collection and analysis is definitely something we do.
- I'm not sure we have a LabVIEW expert, but it seems like something we can take a look at.
- No current expert yet, but give us a LabVIEW project and at the end of it we will have one.
- I think it is worth mentioning your effort towards Nordic RSE collaboration.
- https://nordic-rse.org
- It was through these people that I first learned of the RSE concept!
- How much time can you spend on a project? What is the guaranteed minimum and the absolute maximum?
- There is no exact limit. We can always take a couple of hours in the garage.
- Upper limit would be a couple of months (we only have about 2 people)
- No upper limit if you fund an RSE :)
- It also depends on full-time/part-time considerations. One advantage of being permanent staff is that we can offer support for a long time, working on it from time to time.
- As feedback: great work, guys! I guess Aalto is the first one to do this; strange that no one else in Finland does the same.
- Thanks!
- There are plans at other institutions I believe. Aalto was the first to put it into practice. Hopefully others will follow!
- What about a general survey of the development work being done at Aalto, e.g. students/postdocs building, contributing to, and maintaining open source projects? This would help a lot in understanding the circumstances and in building the network of people.
- e.g. maturity level, audience, outside collaborators, ...
- +1. Also a survey of existing code repos would be good, like [name=A] suggested, to complement [name=B]'s searches.
- ...
---
## Session Friday 14.5.2021, scicomp-docs and Sphinx
:::spoiler
- [Slides](https://scicomp.aalto.fi/tech/sphinx-docs/)
- I've noticed the search also works in the "local preview", so it works on the client side somehow. Can you fill in any details?
- The search index is generated as a JSON payload at site build time, and the search itself runs in client-side JavaScript. (See the sketch at the end of this list.)
- Sphinx vs. LaTeX: can you contrast them, or say what is common between them?
- Can you comment on the licensing model used at scicomp.aalto.fi? CC-BY was used, right?
- The authors are not really visible! (How to cc-**BY** ?)
- Is there a way to make part of the documentation private, protected by a password or accessible only to known signed-in users? Kind of internal docs vs. public.
- Might be a bit tricky, as the documentation is a git repository. Perhaps for internal docs you can use something else (we use Confluence for Triton internal admin docs), or a separate, access-limited git repository with the same docs format.
- Also, the issue tracker at version.aalto.fi is a closed place where information is stored.
- Where is it hosted, then?
- readthedocs.org
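
A minimal sketch of where that client-side index lives, assuming a standard Sphinx project (the `docs/` paths are illustrative):

```bash
# Build the HTML site; Sphinx writes the search index next to the pages.
sphinx-build -b html docs/ docs/_build/html
# searchindex.js is a small JavaScript wrapper around JSON mapping terms
# to documents; the in-browser search code loads and queries it.
head -c 300 docs/_build/html/searchindex.js
```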
:::
## Session Fri 7.5.2021; Anaconda
:::spoiler
- [Slides for this session](https://users.aalto.fi/~tuomiss1/fgci-tech-2021-05-07/anaconda-presentation.html)
- RD: This is an attempt at conda activation in Lmod, but it can't fully unload. It's unpolished because it was not really useful; our system already does everything needed: https://github.com/AaltoSciComp/lmod-conda
- I learned of mamba from Anne in Research Software Hour
- Does mamba use an actually different solver algorithm?
- At least it is implemented in C++ (as opposed to conda, which is in Python), and it uses the libsolv dependency solver. (See the sketch at the end of this list.)
- But is the main difference C++ vs. Python, or an actually better algorithm?
- Maybe it uses unit propagation :)
- `module use ...` kind of enables additional packages that have been pre-installed but not made accessible. In practice, it just prepends to the `$MODULEPATH` variable.
- Have you heard about https://python-poetry.org/ ? It can't do non-Python packages like conda can, but I think it handles conflicting / multiple versions of packages at the same time, like npm or yarn.
- What is Aalto's take on Anaconda becoming commercial? Will Triton still have it?
- Anaconda is still free for academic use (universities), so we will still have it on Triton. For other uses, individual components (e.g. the conda tool) are under open-source licenses, so using those in non-academic cases is also possible without Anaconda's prebuilt repositories.
- Should I set my conda env (Anaconda/Miniconda etc.) into HOME or WRKDIR on Triton?
- Currently HOME is the faster, SSD-based storage area, so home is a good option. The only issue might be if you are building many conda envs that take a lot of space; in that case we recommend contacting Triton support, and we can think of a good solution together. (See the prefix sketch at the end of this list.)
- Do you have any recommendations to people doing their own development, to make stuff easier in the future?
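
Following up on the mamba thread above, a hedged sketch of trying it out. Installing via conda-forge into the base environment is the commonly documented route; the environment name and packages are illustrative.

```bash
# Install mamba into the base environment from conda-forge:
conda install -n base -c conda-forge mamba
# mamba mirrors conda's CLI but resolves dependencies with its
# C++/libsolv-based solver, which is usually much faster:
mamba create -n demo-env -c conda-forge python=3.9 numpy
```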
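
And a minimal sketch for the HOME vs. WRKDIR question: putting an environment under an explicit prefix instead of the default location under `$HOME`. The path and package are illustrative; `$WRKDIR` is assumed to point at your Triton work directory.

```bash
# Create the environment under WRKDIR rather than the (faster but smaller)
# SSD-backed home directory:
conda create --prefix "$WRKDIR/conda-envs/myproject" python=3.9
# Prefix-based environments are activated by path, not by name:
conda activate "$WRKDIR/conda-envs/myproject"
```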
:::
## Session Fri 30.4.2021; Jupyter
:::spoiler
Outline:
* Basic concepts (slides, ~20 min?)
* Tour of our JH (live demo, ~20 min?)
* Further discussion
Icebreaker: What is JupyterHub?
* Put your questions here in these dots
* You can also follow up on a previous discussion
* It's a way of running programs (the kernels) from a web browser on a remote system.
* Web interface to Triton or the like
* A hub where I can use jupyter notebooks
* .
* A web interface to ipython
* A versatile programming environment that provides a "click-and-play" user interface to the computing cluster.
Questions:
* (ST) The [pull request](https://github.com/jupyterhub/jupyterhub/pull/2726) that Richard alluded to is a good read.
* .
Tour of our JH:
* jupyter01.triton.aalto.fi
* Cluster node
* PAM for user authentication
* Slurm submit node
* Config files at /share/apps/jupyterhub/
* Secret config at jupyter01:/etc/jupyterhub/
* Deployment config public: https://github.com/AaltoSciComp/triton-jupyterhub (secrets separated)
* Project environment: /share/apps/jupyterhub/
* live/, dev/ subdirs
* miniconda/ within there that is the Python environment
* Makefile to semi-automate basic setup tasks (more of a hackish script)
* create miniconda environment to run the hub
* jupyterhub_config.py
* singleuser server
* Runs in the same /share/apps/jupyterhub environment.
* Submitted via batchspawner
* Submission script in jupyterhub_config.py
* Creates /scratch/work/$USER/.jupyterhub-tree as the runtime dir (via `/share/apps/jupyterhub/setup_tree.sh`). This has ./home, ./scratch, ./work, etc. links (see the sketch at the end of this section)
* cd to that
* execute `jupyterhub-singleuser` (actually `batchspawner-singleuser`)
* The singleuser server runs from ``/share/apps/jupyterhub/live/miniconda/``
* Kernels (this could be a whole other topic)
* Installed at `/share/apps/jupyterhub/live/miniconda/share/jupyter/kernels/python3`
* In general, these all integrate with existing Triton modules; I don't install any software myself
* Or per-user
* EXAMPLE: figure out and debug a kernelspec (spelled out in the sketch at the end of this section)
* `module load jupyterhub/live`
* `jupyter --paths`
* `jupyter kernelspec list`
* [envkernel](https://github.com/NordicHPC/envkernel) helps with installing kernels
* Provides operations like "Load this module and then execute the kernel's command"
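
A hedged sketch of what a runtime-tree script like the `setup_tree.sh` above might do; the real script is `/share/apps/jupyterhub/setup_tree.sh`, and the exact set of links below is illustrative.

```bash
# Create the per-user runtime directory the single-user server starts in:
tree="/scratch/work/$USER/.jupyterhub-tree"
mkdir -p "$tree"
# Symlink the common storage areas into it, since the notebook file
# browser cannot navigate above its root directory:
ln -sfn "$HOME"               "$tree/home"
ln -sfn "/scratch/work/$USER" "$tree/work"
ln -sfn /scratch              "$tree/scratch"
```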
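
And the kernelspec-debugging example spelled out; the kernel path comes from the notes above, the rest is standard Jupyter CLI.

```bash
module load jupyterhub/live
jupyter --paths          # config/data/runtime directories Jupyter searches
jupyter kernelspec list  # which kernels exist and where each one lives
# Each kernel directory holds a kernel.json describing the command that
# starts the kernel process:
cat /share/apps/jupyterhub/live/miniconda/share/jupyter/kernels/python3/kernel.json
```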
:::
## Session Fri 9.4.2021; Software installations with CI build system
:::spoiler Click to expand
* [Slides](https://users.aalto.fi/~tuomiss1/fgci-tech-2021-04-09/spack-presentation.html#/fgci-tech-software-installations-with-ci-build-system)
* Your question or comment / feedback
* Answer comes here
* How to best handle software that is not (currently) available in Spack?
* That depends: Triton has an NFS-shared directory cross-mounted over the whole cluster, and software can go there. Who builds it depends on the community. If more than one user is interested, we do a central installation, create the Lmod config for the module, and maintain the software further. If only one user needs it, we usually assist the user with a WRKDIR setup, but we can also create a common directory and let the user do the install.
* Then anaconda/pip and singularity come on top of this.
* One can also make their own Spack config; it is a well-documented procedure (see the sketch at the end of this list).
* This is a bit of a history talk...what was the timescale of the steps?
* This was partially organic development of different features. But roughly:
* 2016-2017: EasyBuild (already with CI). Worked OK at the beginning, but then more and more trouble => found out about Spack
* 2018: moved to Spack. We noticed that we wanted better automation on top of it.
* 2019: CI for Spack. On the side, anaconda + singularity still manual
* 2020: Better CI for Spack, anaconda included
* 2020/12: system crashed and new people started => simplified the setup
* Singularity is still on the way. There is automation for it, but it will be added to science-builder
* Q2/Q3 2021: getting things into CVMFS to make everything available to the FCCI community. Open question: where will CVMFS be running in the future?
* How do you know when packages have newer versions available and should be updated? Is there a way to "update all"?
* The normal way is to install software when requested by users, so we do not necessarily keep everything at the latest version. But every now and then we do the analysis ourselves and do a major update (see the version-checking sketch at the end of this list).
* In addition, we use the ELK stack to analyze our software usage. This helps in removing old, unused software to keep the stack more manageable.
* How could other FCCI sites participate right now?
* A
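
A hedged sketch of the "own Spack config" route mentioned above: chaining a personal Spack to a central installation through `upstreams.yaml`, so centrally built dependencies are reused. The upstream path is hypothetical, not Triton's actual tree.

```bash
git clone https://github.com/spack/spack.git ~/spack
mkdir -p ~/.spack
cat > ~/.spack/upstreams.yaml <<'EOF'
upstreams:
  central:
    install_tree: /share/apps/spack/opt/spack   # hypothetical central tree
EOF
source ~/spack/share/spack/setup-env.sh
spack install zlib   # dependencies already present upstream are reused
```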
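
And for the "newer versions" question: Spack can report what it knows about each package, though there is no built-in "update all". The package name is just an example.

```bash
spack versions python   # versions Spack has checksums for, plus remote ones
spack info python       # preferred version, variants, dependencies
```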
:::
## Session Mon 29.3.2021; User support
:::spoiler Click to expand
* Question
* Should some staff work on support most of their time, or should most staff work on support some of their time?
* EG (my opinion): most staff is better from the users' point of view (a faster "we are on it" reply, and the feeling that there is a team helping them, not just the same person)
* We need to support people with a variety of problems. Not all of us can be experts in everything we do, so it makes sense for many people to do support.
* Should a support person be an expert in the infrastructure, or closer to the customer (so, not an expert)?
* EG: closer to the customer is better in my experience (it is better to actually use the infra in a non-perfect way and get something done, rather than getting the perfect answer and having no time or no idea how to implement it)
* How should I respond in this scenario: someone asks me how to do something really difficult without having the proper background (e.g. implement a custom deep learning model without any programming experience). I reply that what they're trying to do is complicated and it's going to be a long journey for them. I point them to materials to help them get started on this journey as best I can. The person responds: I don't have time to read all of that. Can't you just tell me how to do it? Why are you not helping me?
* maybe both ("give them the fish but also teach them how to fish next time"): give the solution but also point to material to understand more
* "give them the fish" as in create the model for them? That is a huge job on my end :
* good point. they may think that what they ask takes 30 minutes and it can be good to communicate that they are asking for 2 weeks of work and that is outside of possibility without creating dedicated project
* I have one example when a user asked me to do the whole thing form him. I redirected him to Kaggle to learn the basics rather than building the model. In about one week he asked already more detailed hints and in two weeks already very high level questions. So, it turned out to be good approch in that case. Pushing him to learn himself.
* That is really comforting to hear!
* EG: you give them a quick working solution (e.g. a tutorial that works on the infra you support). They can get started even though it is not optimal, and they can decide later if they want to go deeper (= I agree with the Kaggle example above)
* How to balance the amount of information passed to users:
* More information allows proficient users to do more
* More information confuses and misleads others
* Should we keep up an "aura of mystique" about how things simply work, professional secrets of a sort?
* (ST) To me, peeking behind the curtains serves a purpose when the user can utilize the information. For example, an advanced MPI user might benefit from knowing how SLURM distributes MPI workers, but a novice does not benefit from that information. However, if a user is interested in how things work, that is usually to be encouraged.
* EG: never keep an aura of mystique, I just say if I know something or not, I think there are no professional secrets
* When it comes to scicomp, I do feel we may have too much text there. I have no good suggestion for what should be taken out, but keeping it compact would help users find the relevant "short" bit of information. I do feel we sometimes get a tl;dr reaction from users.
* (ST) Writing documentation is like playing the blues: it's the notes that you don't play that count. Adding too much exposition is definitely a problem, but it is very hard to know how to write something concisely. See: this answer as an example of this problem.
* EG: I think we need more "TL;DR pages" under scicomp so that users can start small and go deeper if they need to
* And don't get me wrong, scicomp is great. But maybe some customer-journey work would help here: see how users really use our documentation and help them navigate it better.
* JR: There should be both a detailed manual with all the information and a "tutorial" or "TL;DR" that gives a good basic model. These have different purposes and different target audiences.
* True!
* Have you really experienced the "strategic risk" of the middle support layer vanishing?
* Comments
* Often it is faster to do something yourself than to teach others.
* EG: The best is to write a how-to while you do it the first time; the next time, you hope the user can follow the documentation (and if they can't, you still do it yourself)
* ST: I usually write down all of the commands I used to solve a problem. This, in conjunction with the GitLab setup that we have (easy search), makes it quite easy to find the solution the next time.
* EG: Is it good that we (the helpers) are up to date with how things are done outside academia? I always have a more motivated user when I can tell them that this way of solving their problem is also how it is done outside academia. In 98% of cases people will leave academia, and often what we can help them learn might be more important than what is learned in formal courses.
:::
## Session Mon 22.3.2021; Ansible workflow
:::spoiler Click to expand
* I had the same issue with making a new login2: there are a lot of little things that are managed outside of Ansible.
* AWX
* RD: I didn't know of it
* Question2
* strategy:free could be useful (see the sketch below): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/free_strategy.html
* Ansible forks: https://docs.ansible.com/ansible/latest/cli/ansible-playbook.html#cmdoption-ansible-playbook-f
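
A minimal sketch tying the two links above together: `strategy: free` lets each host move through the play at its own pace instead of waiting at every task, and `--forks` raises how many hosts are contacted in parallel (default 5). The playbook content is illustrative and assumes a working inventory.

```bash
cat > demo-playbook.yml <<'EOF'
- hosts: all
  strategy: free            # hosts do not wait for each other per task
  tasks:
    - name: Check connectivity
      ansible.builtin.ping:
EOF
ansible-playbook --forks 20 demo-playbook.yml
```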
:::
## Session Fri 12.3.2021; Networking
:::spoiler Click to expand
Material: the speaker's [reference page on Google Drive](https://docs.google.com/document/d/1YFxS7XbPWGPa0EVs-5LuyVvo65BmFz-utXmIJc6x5-o/edit?usp=sharing)
* Your comments for today go here
* Happily we are not using IPv6 :-)
* What are the considerations for IPv6?
* Well, it has been coming for years, and nowadays we kind of support it as a university. Basic networking is fully IPv6 compliant, and that could be taken into external use easily. Internally there really is no clear reason to do it yet.
* What about clusters? Is there any motivation to use it at all, or does IPv4 stay indefinitely?
* For now, there is no clear incentive to move to IPv6. Many tools, Ansible roles, and the networking fully support IPv4. If changing, we would need to verify deployment (PXE) etc. before going to IPv6 fully. I'd say it's still better to stick with IPv4, as that is battle-tested and there is nothing "missing".
* Question: has anyone used/considered RoCE instead of infiniband (https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet)?
*
:::
## Session Wed 3.3.2021; Triton hardware-wise
:::spoiler Click to expand
Material: the speaker's [reference page on Google Drive](https://docs.google.com/document/d/1OMwkpIjKWxGeKvvojojc62lbOH2JJsnHHO5Ns5_ZceE/edit?usp=sharing)
* Ask questions like this
* get answers this way
* or we'll see at the end and answer then
* You can ask anonymously at any time
* Don't be afraid to ask via voice, too
* Is there a rule of thumb or something saying:
* power from below and network from the top of racks?
* Regarding sensitive data, we need to meet certain VAHTI regulations for the machine room (user access, video surveillance, etc.).
* (RD) How practical is it to run your own machine room these days?
* About doors: I once heard a story of how they weren't tall enough when a machine room was in a normal building, causing major trouble
* Do the racks change? Should we save money/resources by keeping old racks instead of buying new ones when buying only nodes?
* Do the PDUs come with the racks?
* (MH) Yes. At Triton we most commonly get racks fully installed, including PDUs
* (Ivan's "factory setup")
* (RD) About heterogeneous clusters: any philosophical comments on "recycling and buying a brand new cluster" vs incrementally upgrading like we do?
* I (Simppa) can say that jumping into the game at a later stage is more demanding when there is a long history
* (MH) CSC does it the "full new cluster" way. In Aalto's case, with smaller amounts of money, we have found it better to keep older nodes for capacity. But we also need to upgrade systems for e.g. new needs. Maintenance is more complex that way, but if done properly and clearly, we can provide better resources to users.
* (ST) There can be problems with software installations when dealing with heterogeneous clusters. Basically, software either needs to be built for the lowest common denominator of available CPU optimizations or built separately for each architecture. Both can cause problems. We use the first approach.
* (MH) But the aim is to move towards the latter.
* Do we do performance tests (LINPACK) or something? Stress-testing?
* (MH) Yes. Not really LINPACK, but for general CPU testing we check SPECint etc., and for comparing different CPUs internally we have our own benchmark. It runs commonly used FCCI software inside Docker. It was used in our last procurement and was written by Simo.
* (ST) Here's a [link to that benchmark](https://github.com/AaltoSciComp/docker-fgci-benchmark). Ivan also recently used [hpl](http://www.netlib.org/benchmark/hpl/) to run some LINPACK benchmarks; numbers on those are still pending. Spack can install hpl quite easily (see the sketch at the end of this list).
* There was some serial-interface setup in the Ansible for OS installation. Is this tied to IPMI and similar remote terminals?
* (MH) Yes. We tell the kernel to forward serial console output to IPMI, so you can access nodes via IPMI. In addition to this, there are ways to connect using iDRAC/iLO etc., but those may require licensing. The IPMI serial console works similarly regardless of the vendor. (See the sketch at the end of this list.)
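
A hedged sketch of the Spack + HPL route ST mentions: the `hpl` package exists in Spack, but the launch line is illustrative; a real run needs a properly sized `HPL.dat` in the working directory.

```bash
spack install hpl
spack load hpl
# xhpl reads the problem size and process grid from HPL.dat; the MPI
# process count must match the P x Q grid defined there:
mpirun -np 4 xhpl
```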
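
And a sketch of the IPMI serial-console workflow MH describes; the kernel parameters and BMC address/credentials are placeholders (the serial port and speed vary by vendor).

```bash
# Kernel side: boot parameters like these mirror console output to the
# serial port the BMC exposes (example values):
#   console=tty0 console=ttyS1,115200
# Client side: attach to the node's serial-over-LAN console via the BMC:
ipmitool -I lanplus -H node42-bmc.example.com -U admin -P secret sol activate
```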
:::