
Teoroo cluster: admin guide

Here you can find information about the cluster setup and how to maintain the cluster. This information should be open to all group members, but only admins should need to read it. For administration of the WordPress page, please refer to this note.

Routine task cheatsheet

These are the tasks that admins have to do routinely.

Manual backup

cd borg_backup      # on clooney, switch to root and run inside a tmux session
source borg.env
# replace 20xx-xx-xx with today's date (yyyy-mm-dd)
borg create --list --exclude-from borg.exclude ::20xx-xx-xx /homefs/ --progress
# On rackham, the archive password is needed.
# List all versions
borg list /proj/uppoff2019008/Backups/clooney
# List the files in an archive
borg list /proj/uppoff2019008/Backups/clooney::2021-05-03

There is a way to limit the bandwidth used by borg; borg.env configures the borg command according to this link. Check the link to see how to set the limit.
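As an illustration only (the actual mechanism is the one set up in borg.env), recent borg versions can also rate-limit the transfer directly on the create command; the flag is --remote-ratelimit on borg 1.1 and --upload-ratelimit on borg 1.2+, with the value in kiB/s:

# illustration only - cap the upload bandwidth at roughly 5 MB/s
borg create --remote-ratelimit 5000 --exclude-from borg.exclude ::20xx-xx-xx /homefs/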

To restore, use borg mount.
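A minimal restore sketch using borg mount (the archive name below is just an example taken from the listing above; FUSE must be available on the machine doing the mount):

# mount an archive read-only, copy out what is needed, then unmount
mkdir -p /mnt/borg-restore
borg mount /proj/uppoff2019008/Backups/clooney::2021-05-03 /mnt/borg-restore
# ... copy the files you need ...
borg umount /mnt/borg-restore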

Reducing the amount of backups

borg prune --list --keep-weekly 12 --keep-monthly -1

This command keeps one weekly backup for the past 12 weeks and, further back, one backup for every month. Note that --keep-monthly needs to be -1, which means keeping one archive per month indefinitely.

Setup of borg backup

Borg backups are stored on rackham and made over SSH using SSH keys. This means that the user who hosts the backup on rackham needs to have the SSH key in their authorized_keys file. Note that this only needs to be changed when ownership of the backups is transferred. Uppmax will always ask for the user password first; set up an SSH key pair to avoid this. After this is done, backups can always be made on clooney as long as the correct password is used. Note that this password is specific to the backup and is not the same as the root password.
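A sketch of the key setup, assuming the key is generated on clooney as root and {username} is the rackham user who will host the backup:

# on clooney, as root - generate a key pair if none exists yet
ssh-keygen -t ed25519
# append the public key to {username}'s authorized_keys on rackham
ssh-copy-id {username}@rackham.uppmax.uu.se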

When transferring the backup to a new user, make sure that they have been added to the correct uppoff project on Uppmax. Afterwards, one must give them permission to modify the backup folder. To give permission to everyone in the group:

DIR=/proj/uppoff2019008/Backups/
GROUP=uppoff2019008
find $DIR -type d ! -perm -2070 -print0 | xargs -r -n 10 -0 chmod -v +2070
find $DIR -type f ! -perm -g+rw -print0 | xargs -r -n 10 -0 chmod -v g+rw
find $DIR ! -group $GROUP -print0 | xargs -r -n 10 -0 chgrp -v $GROUP

Adding user

# ssh to router@router and switch to root
adduser --no-create-home --ingroup teoroo {username}
vim /etc/auto.home # here, add the NFS location of the new user to the file
cd /var/yp/
make # this adds the new user to the NIS server
# ssh to clooney as root
mkdir /homefs/{username}
chown -R {username}:teoroo /homefs/{username}
vim /etc/exports #add the user's home folder to NFS exports
service nfs-server reload
su - {username} #test the new user, it should land in the home folder now
# ssh to brosnan, in the new user's home folder
cp /etc/skel/.* .   # copy the default dotfiles into the new home

Boilerplate to send to the user

Your password is: xxxxxxxxxx
Login with `ssh {username}@teoroo2.kemi.uu.se`, you will need a VPN or the university network to access teoroo2.
Once logged in, you can and should change your password with `yppasswd`

Some information about software, storage and hardware can be found here:
https://hackmd.io/@teoroo-cluster/user-guide

Booting/shutdown sequence

shutdown

  1. shutdown jackie and w nodes
  2. shutdown clooney and aberlour
  3. shutdown brosnan
  4. shutdown router
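A rough sketch of the same order as shell commands, assuming they are run from brosnan as root and that root SSH to the nodes is allowed (the w-node names are placeholders):

for n in jackie w1 w2 w3; do ssh root@$n 'shutdown -h now'; done   # jackie and the w nodes
ssh root@clooney 'shutdown -h now'
ssh root@aberlour 'shutdown -h now'
shutdown -h now   # brosnan itself
# finally shut down the router from its console or web interface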

boot teoroo2

  1. boot router
  2. boot clooney, and aberlour
  3. boot brosnan, jackie, and w machines

Disk failure on Clooney

Outdated documentation
  • On the new file server (Clooney) I do not know the procedure. The controller is an LSI 3108 (look in the document I sent you), and reading on the net some time ago I found that one can use the MegaCli64 tool, which I use to monitor the health. Also, it seems that the failed disk is not marked by a blinking LED, so others suggested using
./MegaCli64 -PDList -aALL

# Cause the front LED of the drive to blink to help locate a particular drive:
./MegaCli64 -PdLocate -start -physdrv\[E:S\]  -aALL

# Stop the blinking:
./MegaCli64 -PdLocate -stop  -physdrv\[E:S\]  -aALL

Status report

As the admin, you will receive an email every day from logwatch updating you on the status of Clooney's disks. This email is entitled "Logwatch for RAID" and contains the following important information:

                Device Present
                ================
  Virtual Drives    : 1
  Degraded        : 0
  Offline         : 0
  Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 0
  Failed Disks    : 0

As of now, this is the only part of the email that contains useful information. A failed disk will only be visible here. It is also important to check that Degraded is 0; any other status requires further investigation.

Clooney uses a RAID 6 setup consisting of 12 physical hard drives. This means that, in theory, two disks can fail before data loss occurs. However, in the case of a disk failing, it is important to replace it as soon as possible to reduce stress on the system.

Disk management tool

The updated version of the MegaCli64 disk management tool is called storcli and is also available on Clooney. Both tools can be found in the /bin folder.

Storcli is recommended for admin tasks, as it is more recent and documentation is easier to find. The reference manual can be found here, and a useful list of commands here.

An alternative approach to replacing the failed disk using MegaCli64 can be found here, but this has never been tried.

Replacing a failed disk

In the case of a failed disk, the first thing to do is to check the disk statuses using

./storcli /c0 /eall /sall show

This command returns the status of all disks in the RAID array, and ideally the output will look something like this

Drive Information :
=================

-------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model               Sp Type
-------------------------------------------------------------------------------
0:0       1 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
0:1       2 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
0:2       3 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
0:3       4 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
0:4       5 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
0:5       6 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
0:6       7 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
0:7       8 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
0:8       9 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
0:9      10 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
0:10     11 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
0:11     12 Onln   0 7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -
-------------------------------------------------------------------------------

The state of the drive is the important part here: if anything other than Onln (online) is shown, further investigation is needed.

To get more in-depth information on all drives use

./storcli /c0 /eall /sall show all

If the disk has failed, it should be replaced with a new one. These, as is the case for aberlour's disks, can be found in Chao's office, in a box underneath his bookcase. The box that contains a disk for Clooney is marked 'Cloney', so it should be easy to find. The exact kind of disk is

Seagate Enterprise Exos 7E8 8TB
SATA3 3.5" 7k2 256MB 512e
HDD-T8000-ST8000NM0055

Note that this model number is the same as the one reported in the Drive Information block above.

When you have found the replacement disk, go down to Clooney and locate the red LED which marks the failed disk. Make sure to double-check that this is actually the failed disk by turning on the blinking-LED functionality. This can be done with

./storcli /c0 /e0 /sx start locate # Here x is the slot number

/sx is the slot number of the failed disk and can be found in the EID:Slt column in the Drive Information block above.

To stop the LED blinking

./storcli /c0 /e0 /sx stop locate # Here x is the slot number

Now that you have located the failed disk the procedure (ref. and ref.) is as follows

# Set the failed drive as Offline 
./storcli /c0 /e0 /sx set offline
#  x = Controller defined slot number

# Set the failed drive as Missing
./storcli /c0 /e0 /sx set missing

#Spindown the failed drive 
./storcli /c0 /eall /sx spindown # NOTE Use /eall here

Then remove the failed drive and replace it with the new one. Documentation for this can be found here (see section 2-2). You will have to unscrew the old disk from its drive-handle chassis and mount the new disk in it, as the new disks do not come with one.

The rebuild then starts automatically and can be monitored using

./storcli /c0 /eall /sall show rebuild

It is possible to change how fast the rebuild occurs using storcli /c0 set rebuildrate=<value>. The value should be between 0 and 100, and is 30 by default on Clooney. Raising it can slow down I/O speeds significantly and might make it difficult to log in to Brosnan, so do this with caution. With a rate of 30, a rebuild of one drive should take less than 24 hours.
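For example, to inspect the current rate and temporarily raise it (the value 60 is only illustrative):

./storcli /c0 show rebuildrate
./storcli /c0 set rebuildrate=60   # remember to set it back to 30 afterwards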

Make sure to order a new drive for when this occurs again in the future, and that's it.

Disk failure - HP Server

For HP servers aberlour (and mackmyra): The mail will have something like this in the report

> /usr/sbin/hpacucli controller slot=1 logicaldrive all show status

   logicaldrive 1 (40.0 GB, 6): Interim Recovery Mode
   logicaldrive 2 (16.3 TB, 6): Interim Recovery Mode

> /usr/sbin/hpacucli controller slot=1 physicaldrive all show status

   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 3 TB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 3 TB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 3 TB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 3 TB): OK
   physicaldrive 1I:1:5 (port 1I:box 1:bay 5, 3 TB): OK
   physicaldrive 1I:1:6 (port 1I:box 1:bay 6, 3 TB): Failed
   physicaldrive 1I:1:7 (port 1I:box 1:bay 7, 3 TB): OK
   physicaldrive 1I:1:8 (port 1I:box 1:bay 8, 3 TB): OK
  • Find a replacement disk from the inventory.
  • On the old HP machines (mackmyra and aberlour) the failed disk will flash. Just remove the failed disk and insert the replacement. Check with the command below; it should say Rebuilding.
$ /usr/sbin/hpacucli controller slot=1 logicaldrive all show status

   logicaldrive 1 (40.0 GB, 6): OK
   logicaldrive 2 (16.3 TB, 6): Recovering, 36% complete

$ /usr/sbin/hpacucli controller slot=1 physicaldrive all show status

   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 3 TB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 3 TB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 3 TB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 3 TB): OK
   physicaldrive 1I:1:5 (port 1I:box 1:bay 5, 3 TB): OK
   physicaldrive 1I:1:6 (port 1I:box 1:bay 6, 3 TB): Rebuilding
   physicaldrive 1I:1:7 (port 1I:box 1:bay 7, 3 TB): OK
   physicaldrive 1I:1:8 (port 1I:box 1:bay 8, 3 TB): OK
  • Job done. To increase the priority of the rebuild you can run hpacucli controller slot=1 modify rebuildpriority=[high|medium|low]

Adding node to TEOROO2

  1. Install the OS on the system and hook it to the node
    • Default username, does it matter?
    • Disk partition (use LVM)?
    • Automatic update? (perhaps no)
    • The following clients/services have to run on the node: openssh-server, autofs, nis, nfs-common (a consolidated install/config sketch follows after this list).
  2. Configuring DHCP and DNS on TEOROO2
    • Visit https://router:10000, login with root and password
    • Create the host in DHCP server
    • Choose the hostname and IP address
  3. Configure DHCP on the node. You should only need to configure the network so that it looks for a DHCP service (ref.). Check with ifconfig that the IP address is correct, and check that you can access the node from the login node.
  4. Configure NIS. If it is the first install on Ubuntu there will be a prompt for you to input the domain. You will still need to edit /etc/yp.conf and /etc/nsswitch.conf (ref.): add nis to the relevant lines (including automount), then restart the services: systemctl restart rpcbind nis autofs. This works around a known issue for Ubuntu up to 16.04[1].
  5. Configuring NIS on the node
    • For Ubuntu machines: edit /etc/yp.conf and /etc/defaultdomain to match the domain and IP.
    • For CentOS/Scientific Linux machines: run system-config-authentication (GUI tool), or edit /etc/yp.conf and /etc/sysconfig/network.
  6. Configure NTP. Change the following line in /etc/ntp.conf from
    server mackmyra.cluster.mkem.uu.se
    
    to
    server 10.1.10.1
    
    then restart the NTP service
    /etc/init.d/ntpd restart
    
    then test with
    ntpq -p
    
  7. Configure the software mount
    # add this to /etc/fstab
    clooney:/homefs/sw            /sw                     nfs     defaults        0 0
    # mount the directory
    mount -a
    
  8. Configure the mail client. The mail client config differs between Ubuntu (Postfix) and SL (sendmail).
    • On SL machines: edit /etc/mail/genericdomain and set it to {node}.cluster.
      Edit /etc/mail/sendmail.mc and set MASQUERADE_AS to kemi.uu.se.
      Test as root that sendmail works.
  9. Result: after the configuration the node will function as a normal compute node in teoroo2. Users can log in with their teoroo2 account and password, and they will have access to their teoroo2 files on this node. Old, local files on this node will still be in /homefs.
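A consolidated sketch of the install and NIS/autofs steps above, assuming an Ubuntu node; the NIS domain is a placeholder and 10.1.10.1 is the server address used elsewhere in this guide:

# install the required clients/services
apt install openssh-server autofs nis nfs-common
# point the NIS client at the server (placeholder domain)
echo "domain <nisdomain> server 10.1.10.1" >> /etc/yp.conf
# in /etc/nsswitch.conf, append "nis" to the passwd, group, shadow and automount lines,
# then restart the services so accounts and home folders become available
systemctl restart rpcbind nis autofs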

Setup details

Service information

Both teoroo and teoroo2 are built from the following services:

  • Router: exposes the cluster to the outside world
  • DNS: assigning domain names for nodes in the subnet
  • DHCP: assigning ip addresses for nodes in the subnet
  • NFS: serving files
  • NIS: serving account and home folder information
  • SSH (login node): the default place where people land
Service   teoroo2              teoroo (not accessible)
Router    router               router
DNS       router               mackmyra
DHCP      router               mackmyra
NFS       clooney & aberlour   mackmyra
NIS       router               mackmyra
SSH       brosnan              teoroologin

These services are configured in a similar way, so that the migration of compute nodes is easy by design. The difference is that in teoroo2 most services run on router rather than on a dedicated machine in the subnet. To migrate machines from one cluster to the other, one only needs to modify the configurations (for the service and for the compute node).

Fail2ban

fail2ban bans suspicious IP addresses after a certain number of failed attempts (e.g. with a wrong password). The config files are located in /etc/fail2ban on brosnan. To check the banned IPs manually, run fail2ban-client status sshd on brosnan.
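If a legitimate user gets locked out, the ban can be lifted manually; a quick sketch (the IP address below is a placeholder):

# list the jails and the currently banned IPs
fail2ban-client status
fail2ban-client status sshd
# unban a specific address
fail2ban-client set sshd unbanip 203.0.113.42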

Monitoring

A new monitoring system has been set up using Zabbix. This includes a Zabbix server running on jackie and Zabbix agents running on all currently running nodes of teoroo2. In a nutshell, the agents run (customizable) scripts and report to the server so that everything can be monitored in one place.

A dashboard should be running on jackie:19999, to which the admin has access. This dashboard should cover most of the information an admin is interested in, and upon problems Zabbix should notify the sysadmin through mail notifications (configured in Zabbix->Users->Users->teoroosys).

As admin you can add the port forward below to your .ssh/config so that the dashboard is available at localhost:19999 once you log in to brosnan.

Host BROSNAN
  HostName teoroo2.kemi.uu.se
  ForwardAgent yes
  LocalForward 19999 jackie:19999
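With this in place, running `ssh BROSNAN` sets up the tunnel and the dashboard should be reachable at http://localhost:19999 in a local browser.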

The old system based on cron jobs is still running; its setup is described below in the archived notes. The admin might want to switch off those cron jobs once the new system is deemed stable.

Zabbix Setup details

The system is set up according to the Zabbix installation doc. Specifically:

  • Official or borrowed templates:
  • Extra scripts (in /etc/zabbix/zabbix_agent*.d):
    • Aberlour and Clooney run their respective RAID checks defined in raid_info.conf.
    • Clooney reports the latest borg backup state, defined in borginfo.conf.
    • Aberlour reports the ambient temperature, defined in ambient_temp.conf.

Troubleshooting

This is an incomplete list of bugs that past admins have solved before:

autofs and /home directories

On all nodes the /home/user directories are controlled by the autofs service. Sometimes the service might hang, preventing users from accessing their home directories. To fix that, try:

service autofs restart

Restarting services

After large fixes, such as repairing the filesystem, several important services might hang. These can be restarted using:

for s in {nis,rpcbind,autofs}; do service $s restart; done

DB lock on NFS fix

sudo systemctl enable rpc-statd  # Enable statd on boot
sudo systemctl start rpc-statd  # Start statd for the current session

Failed disk on Clooney

Drive s1 failed. I marked the drive as Offline and Missing, but the rebuild started automatically without needing to replace the failed disk. The rebuild finished successfully, and the drive seems operational for now.

./storcli /c0 /e0 /s1 set offline
./storcli /c0 /e0 /s1 set missing

File system corruption

On two different occasions Clooney's file system got corrupted, entering a read-only state. This can be fixed by restarting Clooney and then running e2fsck. Make sure that the disk is unmounted first.

umount /dev/sda
e2fsck -fp /dev/sda

Past admins

Lisanne Knijff (lisanne.knijf@kemi.uu.se)

Yunqi Shao (yunqi.shao@chalmers.se)

Pavlin Mitev (pavlin.mitev@uppmax.uu.se)


Archived notes


Copying password across clusters

Copy the corresponding lines in /etc/shadow to the NIS server of the other cluster.
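A sketch of the copy, with {username} as a placeholder; rebuilding the maps matches the adduser procedure above:

# on the source NIS server, grab the user's password line
grep '^{username}:' /etc/shadow
# paste that line into /etc/shadow on the target NIS server, then rebuild the maps
cd /var/yp && make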

Granting access from outside

By default the computers can only be accessed from within Sweden. If one wants to access the cluster from elsewhere (without a VPN), one can do this by editing the /etc/hosts.allow file on the login node. (You can get the IP address of that person using https://ipv6-test.com.)
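For example, to allow SSH from one external address (placeholder IP), a line like the following can be added to /etc/hosts.allow on the login node:

sshd: 203.0.113.42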

Setup of mail alert

Most alerts and monitoring information are sent to the admin(s) by email. This requires a working mail-sending service; below is the setup for each node.

teoroo

0  1 * * * /root/ProLiant_Status.awk

aberlour

0  1 * * * /root/ProLiant_Status.awk
*/15 * * * * /root/Ambient_Temp.awk

clooney

7 7 * * *  /root/bin/RAID-status.sh | /usr/sbin/sendmail teoroosys@kemi.uu.se

W nodes

Most nodes use the Ubuntu or Scientific Linux package logwatch - just install it, the rest is done by the package (set the mail address to teoroosys@kemi.uu.se during installation). Some nodes have smartmontools running to monitor the HDD health and send a mail in case of pre-failure, high temperature, etc.

Disk failure report

The disk status on the file servers is monitored by cron jobs that send mails every night. On some nodes, the smartmontools daemon is running and sends mails in case of error, failure or high temperature of the disks.

On disk failure, see here, you can probably find a spare disk in the inventory.

QNAP

The QNAP NAS can be accessed at https://qnap:443, with the username admin and the admin password.


  1. See https://askubuntu.com/questions/771319/in-ubuntu-16-04-not-start-rpcbind-on-boot