
MOVED TO WIKI https://wiki.aalto.fi/display/Triton/Admin

Use the wiki for future updates/edits

Checklist for the daily worker

Duty days

  • Monday: Richard
  • Tuesday: Ivan
  • Wednesday: Simo
  • Thursday: Enrico
  • Friday: Mikko

Day-to-day customer support

  1. Check tracker issues https://version.aalto.fi/gitlab/AaltoScienceIT/triton/issues
  2. Assign what is not assigned. If unsure, ask on IRC. If an issue has had no reply for more than 24 hours, write a "We are looking into this" response. (A command-line sketch for listing unassigned issues follows this list.)
    • Some default assignment rules:
      • Enrico: new accounts, quotas, matlab, NBE
      • Richard: jupyter, python, CS
      • Simo: containers, GPU stuff
      • Mikko: installations
      • Ivan: DELL, DDN, PHYS & MATH
    • Some common responses if the case is unclear
      • Please provide more information so that we can understand what is going on - you haven't given us enough information to quickly understand what the situation is. Please see https://scicomp.aalto.fi/triton/help.html#give-enough-information .
      • This software is / is not in Spack, so it can / cannot be installed quickly for you.
  3. Check esupport https://esupport.aalto.fi/itsm/EfecteFrameset.do#/workspace/report/BOOKMARK/76256
  4. Same as point #2 above: assign what is not assigned. If unsure, ask on IRC. If an issue has had no reply for more than 24 hours, write a "We are looking into this" response.
  5. Zoom meeting links: Admins https://aalto.zoom.us/j/982314225 , Garage https://aalto.zoom.us/j/177010905
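
A minimal sketch for steps 1-2 (listing open, unassigned tracker issues from a terminal). This is not the official workflow; it assumes the GitLab API is reachable under the same /gitlab prefix, that a personal access token is available in GITLAB_TOKEN, and that jq is installed. The project path is taken from the tracker URL above.

# List open, unassigned issues in the tracker via the GitLab API (sketch; adjust the instance prefix and token as needed)
curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://version.aalto.fi/gitlab/api/v4/projects/AaltoScienceIT%2Ftriton/issues?state=opened&assignee_id=None&per_page=100" \
  | jq -r '.[] | "#\(.iid)\t\(.title)"'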

Day-to-day dev review

  1. Check if there are open pull requests at https://github.com/pulls?q=is%3Aopen+is%3Apr+archived%3Afalse+user%3AAaltoScienceIT+user%3ANordicHPC+user%3Afgci-org
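
The same query can be run from a terminal against the GitHub search API if the browser view is inconvenient; a minimal sketch (needs jq, and unauthenticated requests are rate limited):

# List open PRs across the three organisations via the GitHub search API (sketch)
curl -s "https://api.github.com/search/issues?q=is:open+is:pr+archived:false+user:AaltoScienceIT+user:NordicHPC+user:fgci-org" \
  | jq -r '.items[] | "\(.html_url)\t\(.title)"'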

Triton cluster health

  1. Check daily email
  2. Check activity:
ssh admin2        # go to admin2
sudo su -         # become root
slurm dead        # Check which nodes have problems
sinfo -Rl         # same as above with timestamps
ssh login2        # go to login node
slurm top|head    # check most active users

htop              # check activity on login2
perf top          # spy on CPU activity if needed
  3. Check Lustre status (on rivendell): http://tcont1.nbe.aalto.fi:3000/d/tEPSRztik/triton-stats?orgId=1&from=now-48h&to=now . Good for seeing whether the load is very high on some servers, so we can identify issues (usually users with an anaconda env in scratch). Also, "servers up" should be 6 and "sfa-errors" should be 0.
  4. Check disk usage on critical servers. Act if usage is above 75%.
pdsh -w home2,admin2,ohpcadm,install2,avustaja,ssd1 "df -hl"|sort -n -k6|column -t
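
A rough filter for the 75% rule, a sketch that assumes the default "df -hl" column order (the use% lands in field 6 once pdsh prefixes each line with the hostname):

# Print only filesystems above the 75% threshold on the critical servers (sketch)
pdsh -w home2,admin2,ohpcadm,install2,avustaja,ssd1 "df -hl" 2>/dev/null \
  | awk '{gsub("%","",$6); if ($6+0 > 75) print}' | column -t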

Deeper user/job inspection

squeue -u USERNAME	# check one of the top users
scontrol show job JOBID 	# check job details for that job ID
ssh NODE_WHERE_USER_JOB_IS_RUNNING 	# go to one node of that jobID
htop
perf top

Did they request many nodes but are only using one?
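
One way to check, as a minimal sketch (JOBID is a placeholder): expand the job's node list and look at the load on each node to spot allocated-but-idle ones.

# Loop over the job's nodes and show the load on each (sketch)
for node in $(scontrol show hostnames "$(squeue -h -j JOBID -o %N)"); do
    echo "== $node =="
    ssh "$node" uptime
done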

perf record -p PROCESSID	# record a process
                               # wait about 30 seconds, then press CTRL+C
perf report

Again, from scontrol show job JOBID you can check MinMemoryCPU and the requested resources (the TRES values). You can also look at slurm history for this user to see whether their memory usage is much lower than what they requested. slurm history gives memory-wise stats and the time the jobs have been running versus what they asked for. There is also the seff command:

seff JOBID
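
Roughly the same numbers can be pulled with plain sacct if the wrapper output is not enough: requested memory versus peak usage, and elapsed time versus CPU time actually used.

# Compare requested vs. used resources for one job (JOBID is a placeholder)
sacct -j JOBID -o JobID,ReqMem,MaxRSS,AllocCPUS,Elapsed,TotalCPU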

If the memory per core is more than 4 GB, the job reserves extra CPU slots and leaves idle CPUs on the node, so we should go to the users who use MORE than 4 GB per core. For example, on nodes with 4 GB of memory per core, a single-CPU task that requests 12 GB ties up three cores' worth of memory while using only one CPU.
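
A rough sketch for spotting memory-heavy requests across the cluster; the exact meaning of %m (per node or per CPU) depends on how the job asked for memory, so treat this only as a first pass before inspecting the job with scontrol.

# List running jobs with requested memory (%m) and CPU count (%C), largest requests first (sketch)
squeue -t R -h -o "%i %u %m %C" | sort -k3 -h -r | head -30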

In triton-bin-private there is the slurm_sherrif.sh script, which checks the compute nodes for processes that are not part of a Slurm job, i.e. manual jobs (this is for example the case with processes started over ssh):

pdsh -g compute /share/apps/bin/slurm_sherrif.sh