# MOVED TO WIKI: https://wiki.aalto.fi/display/Triton/Admin

Use the wiki for future updates/edits.

# Checklist for the daily worker

## Duty days

* Monday: Richard
* Tuesday: Ivan
* Wednesday: Simo
* Thursday: Enrico
* Friday: Mikko

## Day-to-day customer support

1. Check tracker issues: https://version.aalto.fi/gitlab/AaltoScienceIT/triton/issues
2. Assign what is not assigned. If unsure, ask on IRC. If an issue has had no reply for more than 24h, write a "We are looking into this".
   * Some default assignment rules:
      * Enrico: new accounts, quotas, matlab, NBE
      * Richard: jupyter, python, CS
      * Simo: containers, GPU stuff
      * Mikko: installations
      * Ivan: DELL, DDN, PHYS & MATH
   * Some common responses if the case is unclear:
      * "Please provide more information so that we can understand what is going on - you haven't given us enough information to quickly understand what the situation is. Please see https://scicomp.aalto.fi/triton/help.html#give-enough-information ."
      * "This software is/is not in spack, so it is/is not able to be installed quickly for you."
3. Check esupport: https://esupport.aalto.fi/itsm/EfecteFrameset.do#/workspace/report/BOOKMARK/76256
4. Same as point 2 above: assign what is not assigned. If unsure, ask on IRC. If an issue has had no reply for more than 24h, write a "We are looking into this".
5. Zoom meeting links:
   * Admins: https://aalto.zoom.us/j/982314225
   * Garage: https://aalto.zoom.us/j/177010905

## Day-to-day dev review

1. Check if there are open pull requests at https://github.com/pulls?q=is%3Aopen+is%3Apr+archived%3Afalse+user%3AAaltoScienceIT+user%3ANordicHPC+user%3Afgci-org

## Triton clusters health

1. Check the daily email.
2. Check activity:

   ```
   ssh admin2         # go to admin2
   sudo su -          # become root
   slurm dead         # check which nodes have problems
   sinfo -Rl          # same as above, with timestamps
   ssh login2         # go to the login node
   slurm top | head   # check the most active users
   htop               # check activity on login2
   perf top           # spy on CPU activity if needed
   ```

3. Check Lustre status (on rivendell): http://tcont1.nbe.aalto.fi:3000/d/tEPSRztik/triton-stats?orgId=1&from=now-48h&to=now
   Good for seeing whether the load is very high on some servers; then we can identify issues (usually users having an anaconda env in scratch). Also, "servers up" should be 6 and "sfa-errors" 0.
4. Check disk usage on critical servers. Act if usage is above 75% (a filtering sketch is at the end of this page).

   ```
   pdsh -w home2,admin2,ohpcadm,install2,avustaja,ssd1 "df -hl" | sort -n -k6 | column -t
   ```

## Deeper user/job inspection

```
squeue -u USERNAME                   # check one of the top users
scontrol show job JOBID              # check a job ID from that user
ssh NODE_WHERE_USER_JOB_IS_RUNNING   # go to one node of that job
htop
perf top
```

## Did they request many nodes but are only using one?

```
perf record -p PROCESSID   # record a process
# wait some 30 seconds, then CTRL+C
perf report
```

(A per-node load-check sketch is at the end of this page.)

Again, from `scontrol show job JOBID` you can check MinMemoryCPU and compare it against the RES column in htop. You can also look at the Slurm history for this user to see whether their memory usage is much lower. Slurm history gives you memory-wise statistics and the time jobs have been running versus what they asked for. There is also the seff command (a looping sketch is at the end of this page):

```
seff JOBID
```

If the memory per core is more than 4GB, the job reserves more slots and leaves idle CPUs on the node, so we should contact the users who use MORE than 4GB.

In triton-bin-private there is the slurm_sherrif.sh script, which spots processes started as manual jobs rather than through Slurm (this is for example the case with ssh):

```
pdsh -g compute /share/apps/bin/slurm_sherrif.sh
```
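
## Sketches referenced above

The disk-usage check under "Triton clusters health" prints every filesystem. The minimal sketch below (not an existing script, just a filter on the same pdsh output) shows only the filesystems above the 75% action threshold; it relies on pdsh prepending the hostname, which makes Use% field 6, the same field the original `sort -k6` uses.

```bash
# Minimal sketch: same host list as the disk-usage check above, but only print
# filesystems whose Use% is above the 75% action threshold.
pdsh -w home2,admin2,ohpcadm,install2,avustaja,ssd1 "df -hl" \
  | awk '$6+0 > 75' | column -t
```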
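For the "Did they request many nodes but are only using one?" check, the sketch below scans all nodes of a job at once instead of ssh-ing into them one by one. It assumes only standard Slurm and pdsh behaviour (`squeue -o "%N"` prints the job's nodelist, and `pdsh -w` accepts hostlist ranges); JOBID is a placeholder.

```bash
# Minimal sketch: show the load average on every node of a running job, so a
# job that reserved many nodes but keeps only one busy stands out immediately.
JOBID=123456                                # placeholder: the job to inspect
NODELIST=$(squeue -h -j "$JOBID" -o "%N")   # compressed nodelist, e.g. csl[1-4]
pdsh -w "$NODELIST" uptime | sort
```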
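To follow up on the memory discussion (MinMemoryCPU versus actual usage), the sketch below loops seff over one user's recently completed jobs and pulls out the memory-efficiency lines. It assumes GNU date and the usual seff output ("Job ID" and "Memory Efficiency" lines); USERNAME is a placeholder.

```bash
# Minimal sketch: summarise memory efficiency of one user's completed jobs
# from the last 7 days, to see whether they routinely ask for far more
# memory per core than they actually use.
USERNAME=someuser                                 # placeholder
sacct -u "$USERNAME" --state=COMPLETED -X -n -o JobID \
      -S "$(date -d '7 days ago' +%F)" \
  | while read -r jobid; do
      seff "$jobid" | grep -E 'Job ID|Memory Efficiency'
    done
```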