Use wiki for future updates/edits
Checklist for the daily worker
Duty days
- Monday: Richard
- Tuesday: Ivan
- Wednesday: Simo
- Thursday: Enrico
- Friday: Mikko
Day-to-day customer support
- Check tracker issues https://version.aalto.fi/gitlab/AaltoScienceIT/triton/issues
- Assign anything that is unassigned. If unsure, ask on IRC. If an issue has had no reply for more than 24 hours, post a "We are looking into this" message.
- Some default assignment rules:
- Enrico: new accounts, quotas, matlab, NBE
- Richard: jupyter, python, CS
- Simo: containers, GPU stuff
- Mikko: installations
- Ivan: DELL, DDN, PHYS & MATH
- Some common responses when the case is unclear:
- Please provide more information so that we can understand what is going on - you haven't given us enough information to quickly understand the situation. Please see https://scicomp.aalto.fi/triton/help.html#give-enough-information
- This software is/is not in spack, so it is/is not able to be installed quickly for you.
- Check esupport https://esupport.aalto.fi/itsm/EfecteFrameset.do#/workspace/report/BOOKMARK/76256
- Same as with the issue tracker above: assign anything that is unassigned. If unsure, ask on IRC. If a ticket has had no reply for more than 24 hours, post a "We are looking into this" message.
- Zoom meeting links: Admins: https://aalto.zoom.us/j/982314225 ; Garage: https://aalto.zoom.us/j/177010905
Day-to-day dev review
- Check if there are open pull requests at https://github.com/pulls?q=is%3Aopen+is%3Apr+archived%3Afalse+user%3AAaltoScienceIT+user%3ANordicHPC+user%3Afgci-org
Triton clusters health
- Check daily email
- Check activity:
- Check Lustre status (on rivendell): http://tcont1.nbe.aalto.fi:3000/d/tEPSRztik/triton-stats?orgId=1&from=now-48h&to=now . Good for seeing whether the load is very high on some servers, which helps identify issues (usually users with an Anaconda environment in scratch). Also check that "servers up" is 6 and "sfa-errors" is 0.
- Check disk usage on critical servers. Act if usage is above 75%.
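The disk-usage check can be scripted; a minimal sketch, assuming standard `df -P` output (the 75% threshold comes from the point above; the helper name and where to report the warning are my own choices):

```shell
# check_usage: read `df -P` output on stdin and print every
# filesystem whose usage percentage exceeds the given threshold.
check_usage() {
    awk -v limit="$1" \
        'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > limit) print $6 ": " $5 "%" }'
}

# Typical invocation on a critical server:
#   df -P | check_usage 75
```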
Deeper user/job inspection
Did they request many nodes but are only using one?
Again, from scontrol show job JOBID you can check MinMemoryCPU and the values under TRES. You can also look at the Slurm history for this user to see whether their memory usage is much lower than requested. Slurm history gives memory-wise statistics and how long jobs have been running versus what they asked for. There is also the seff command:
seff JOBID
If the memory per core is more than 4GB, the job reserves more slots and leaves idle CPUs on the node, so we should follow up with users who request MORE than 4GB per core.
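The requested-versus-used comparison above can also be pulled from the Slurm accounting database with sacct; a sketch using standard sacct fields (JOBID and USERNAME are placeholders):

```shell
# Fields that show requested vs. actually used memory and time.
FMT=JobID,ReqMem,MaxRSS,Elapsed,Timelimit,State

# Single finished job (JOBID is a placeholder):
#   sacct -j JOBID --units=G --format="$FMT"

# One user's last week, to spot chronic over-requesting
# (USERNAME is a placeholder):
#   sacct -u USERNAME -S "$(date -d '7 days ago' +%F)" --units=G --format="$FMT"
```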
In triton-bin-private there is the slurm_sherrif.sh script; it is run manually (this is needed, for example, for processes started via ssh):
pdsh -g compute /share/apps/bin/slurm_sherrif.sh