Here you can find information about the cluster setup and how one maintains the cluster. This information should be open to all group members, but only admin(s) should need to read it. For administration of the WordPress page, please refer to this note.
These are the tasks that admins have to do routinely.
There is a way to limit the bandwidth used by borg: the borg.env file configures the borg command according to this link. Check the link to see how to set the limit.
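A hedged sketch of what such a limit could look like; the actual mechanism used in borg.env is described in the linked page, and the flag name depends on the Borg version, so treat this as an assumption:

```bash
# Hypothetical example: cap the upload bandwidth to roughly 5 MB/s (the rate is in KiB/s).
# Borg 1.1 calls the option --remote-ratelimit; Borg 1.2+ renamed it to --upload-ratelimit.
# The repository path and source directory are placeholders.
borg create --remote-ratelimit 5000 \
    user@rackham.uppmax.uu.se:/path/to/repo::'{hostname}-{now}' /data
```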
To restore, use borg mount.
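A minimal sketch, with hypothetical repository and mount paths:

```bash
# Mount an archive read-only, copy out what is needed, then unmount (paths are placeholders)
mkdir -p /mnt/borg-restore
borg mount user@rackham.uppmax.uu.se:/path/to/repo::some-archive /mnt/borg-restore
# ... copy files out of /mnt/borg-restore ...
borg umount /mnt/borg-restore
```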
This command keeps one backup per week for the past 12 months, and after that keeps one backup per month. Note that the monthly value needs to be -1, which means keeping one for every month indefinitely.
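A minimal sketch of such a prune policy, assuming 52 weekly archives cover the 12 months and that the repository path matches the one used by the backup script:

```bash
# Keep one weekly archive for ~12 months, and one monthly archive forever (-1 = unlimited)
borg prune --keep-weekly 52 --keep-monthly -1 user@rackham.uppmax.uu.se:/path/to/repo
```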
Borg backups are made through Rackham using SSH keys. This means that the user who hosts the backup on Rackham needs to have the SSH key in their authorized_keys file. Note that this only needs to be changed when ownership of the backups is transferred. Uppmax will always ask for the user password first; set up an SSH key pair to avoid this. After this is done, backups can always be made on clooney as long as the correct password is being used. Note that this password is specific to the backup and is not the same as the root password.
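A minimal sketch of the key setup, run as the backup user on clooney; the Rackham username is a placeholder:

```bash
# Generate a key pair (skip if one already exists) and install the public key
# into the Rackham user's ~/.ssh/authorized_keys
ssh-keygen -t ed25519
ssh-copy-id backupuser@rackham.uppmax.uu.se
```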
When transferring the backup to a new user, make sure that they have been added to the correct uppoff project on Uppmax. Afterwards, one must give them permission to modify the backup folder. To give permission to everyone in the group:
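A hedged sketch of what this could look like on Rackham; the project path and group name are placeholders, not the exact command used previously:

```bash
# Give the project group read/write access to the backup folder (placeholder path and group)
chgrp -R <uppoff-group> /proj/<uppoff-project>/backup
chmod -R g+rwX /proj/<uppoff-project>/backup
```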
Boilerplate to send to the user:
shutdown
boot teoroo2
The RAID controller is an LSI 3108. From the controller documentation (look in the document I sent you) and from reading online, I found that one can use the MegaCli64 tool to monitor its health. Also, it seems that the failed disk is not marked by a blinking LED, so locating it with the tool was suggested.

As the admin, you will receive an email every day from logwatch updating you on the status of Clooney's disks. This email is entitled "Logwatch for RAID" and contains the following important information:
As of now, this is the only part of the email that contains information. In the case of a failed disk, this will only be visible here. It is also important to check that it reports Degraded: 0. Any other status will require further investigation.
Clooney uses a RAID6 data storage approach that consists of 12 physical hard drives. This means that, in theory, two disks can fail before data loss occurs. However, in the case of a disk failing, it is important to replace it as soon as possible to reduce stress on the system.
The updated version of the MegaCli64 disk management tool is called storcli and is also available on Clooney. Both tools can be found in the /bin folder.
Storcli is recommended for admin tasks, as it is more recent and documentation is easier to find. The reference manual can be found here, and a useful list of commands here.
An alternative approach to replacing the failed disk using MegaCli64 can be found here, but this has never been tried.
In the case of a failed disk, the first thing to do is to check the disk statuses with storcli.
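A minimal sketch, assuming the controller is /c0:

```bash
# List all physical drives on controller 0 together with their state (EID:Slt, State, ...)
storcli /c0/eall/sall show
```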
This command returns the status of all disks in the RAID array. The state of the drive is the important part here; if anything other than Onln (online) is shown, further investigation is needed.
To get more in-depth information on all drives, use the detailed form of the same command.
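A minimal sketch, again assuming controller /c0:

```bash
# Detailed information (model, serial number, errors, ...) for every drive
storcli /c0/eall/sall show all
```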
If the disk has failed, it should be replaced with a new one. These, as is the case for aberlour's disks, can be found in Chao's office. They are in a box underneath his bookcase. The box that contains a disk for Clooney is marked with 'Cloney', so it should be easy to find. The exact kind of disk is
Note that this number is the same as is reported in the Drive Information block above.
When you have found the replacement disk, go down to Clooney and locate the red LED light which marks the failed disk. Make sure to double-check that this is actually the failed disk by turning on the blinking LED functionality. This can be done with storcli's locate command, as sketched below.
The slot number /sx of the failed disk can be found in the EID:Slt column in the Drive Information block above.
To stop the LED blinking, use the corresponding stop command.
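A minimal sketch, assuming controller /c0; replace /ex and /sx with the enclosure and slot from the EID:Slt column:

```bash
# Start blinking the locate LED on the suspect drive
storcli /c0/ex/sx start locate
# Stop the blinking once the drive has been identified
storcli /c0/ex/sx stop locate
```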
Now that you have located the failed disk, the procedure (ref. and ref.) is as follows:
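A hedged sketch of how the drive is typically prepared for removal with storcli; this matches the Offline/Missing marking mentioned in the troubleshooting notes further down, but verify against the linked references first:

```bash
# Take the failed drive offline, mark it as missing, and spin it down so it can be pulled safely
storcli /c0/ex/sx set offline
storcli /c0/ex/sx set missing
storcli /c0/ex/sx spindown
```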
Then remove the failed drive and replace it with a new one. Documentation for this can be found here (see section 2-2). You will have to unscrew the old disk from the drive handle (chassis) and mount the new disk in it, as the new disks do not come with one.
The rebuild then starts automatically and can be monitored with storcli, as sketched below.
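A minimal sketch, assuming controller /c0 and the enclosure/slot of the newly inserted drive:

```bash
# Show the rebuild progress (percentage and estimated remaining time) for the new drive
storcli /c0/ex/sx show rebuild
```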
It is possible to change how fast the rebuild occurs using storcli /c0 set rebuildrate=<value>. This value should be between 0 and 100, and is 30 by default on Clooney. Changing this value can slow down I/O speeds significantly and might make it difficult to log in to Brosnan, so do this with caution. With a rate of 30, a rebuild of one drive should take less than 24 hours.
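A short sketch for checking and adjusting the rate; remember to set it back to the default of 30 afterwards:

```bash
# Show the current rebuild rate, then temporarily raise it
storcli /c0 show rebuildrate
storcli /c0 set rebuildrate=60
```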
Make sure to order a new drive for when this occurs again in the future, and that's it.
For HP servers aberlour (and mackmyra): The mail will have something like this in the report
hpacucli controller slot=1 modify rebuildpriority=[high|medium|low]
- Install openssh-server, autofs, nis, and nfs-common (a sketch of the install command follows this list).
- Check with ifconfig that the IP address is correct.
- Check that you can access the node from the login node.
- Edit /etc/yp.conf and /etc/nsswitch.conf (ref.): add nis to the lines (?automount).
- Restart the services: systemctl restart rpcbind nis autofs
- Known issue for Ubuntu until 16[1]: edit /etc/yp.conf and /etc/defaultdomain to match the domain and IP (or /etc/yp.conf and /etc/sysconfig/network).
- Edit /etc/ntp.cfg from … to …, then restart the ntp service, then test.
- In /etc/mail/genericdomain, set it to {node}.cluster. In /etc/mail/sendmail.mc, set MASQUERADE_AS to kemu.uu.se. Check as root that sendmail works.
- Check that /homefs is mounted.
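A minimal sketch of the package installation step above, assuming an Ubuntu node:

```bash
# Install the services needed on a compute node (SSH, automounter, NIS client, NFS client)
sudo apt install openssh-server autofs nis nfs-common
```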
Both teoroo and teoroo2 are built up with the following services:
Service | teoroo2 | teoroo (not accessible) |
---|---|---|
Router | router | router |
DNS | router | mackmyra |
DHCP | router | mackmyra |
NFS | clooney & aberlour | mackmyra |
NIS | router | mackmyra |
SSH | brosnan | teoroologin |
These services are configured in a similar way such that the migration of compute nodes is designed to be easy. The difference is that in teoroo2 most services run on router rather than on a dedicated machine in the subnet. To migrate machines from one server to the other, one only needs to modify the configurations (for the service and the compute node).
fail2ban bans suspicious IP addresses if they fail a certain number of times (e.g. with a wrong password). The config files are located at /etc/fail2ban on brosnan. To manually check the banned IPs, run fail2ban-client status sshd on brosnan.
A new monitoring system has been set up using zabbix. This includes a zabbix server running on jackie and zabbix agents running on all currently running nodes of teoroo2. In a nutshell, the agents run (customizable) scripts and report to the server so that all nodes can be monitored in one place.

A dashboard should be running on jackie:19999, to which the admin has access. This dashboard should cover most of the information an admin is interested in, and upon problems zabbix should notify the sysadmin through mail notifications (configured in Zabbix->Users->Users->teoroosys).
As admin you can add the port forward to your .ssh/config to have the dashboard available at localhost:19999 once you log into brosnan.
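A sketch of such an entry; the HostName is a placeholder, so adjust it to however you normally reach brosnan:

```
# ~/.ssh/config
Host brosnan
    HostName brosnan.example.se      # placeholder; use the real address of brosnan
    LocalForward 19999 jackie:19999
```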
The old system based on cron jobs is still running. Its setup is described below in the archived notes. The admin might want to switch off those cron jobs once the new system is deemed stable.
The system is set up according to the zabbix installation doc. Specifically:
- the Linux by Zabbix agent template;
- the Zabbix Smartctl template for disk self-checks;
- the Nvidia GPUs Performance template;
- custom checks (/etc/zabbix/zabbix_agent*.d):
  - raid_info.conf
  - borginfo.conf
  - ambient_temp.conf
This is an incomplete list of bugs that past admins have solved before:
autofs and /home directories
On all nodes the /home/user directories are controlled by the autofs service. Sometimes the service might hang, preventing users from accessing their home. To fix that, try:
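A minimal sketch, run on the affected node:

```bash
# Restart the automounter and check that it came back up
sudo systemctl restart autofs
sudo systemctl status autofs
```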
After large fixes, such as repairing the filesystem, several important services might hang. These can be restarted using:
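A hedged sketch; the exact services depend on what hung, but on the file servers the NFS/NIS stack is the usual suspect, so the service names below are placeholders to adjust:

```bash
# Placeholder example: restart the NFS/NIS-related services and verify their status
sudo systemctl restart rpcbind nfs-kernel-server autofs
sudo systemctl status rpcbind nfs-kernel-server autofs
```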
Drive s1 failed. I marked the drive as Offline and Missing, but the rebuild started automatically without needing to replace the failed disk. The rebuild finished successfully, and it seems operational for now.
On two different occasions Clooney's file system got corrupted, entering a read-only state. This can be fixed by restarting Clooney and then running e2fsck. Make sure that the disk is unmounted first.
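A hedged sketch with a placeholder device and mount point; make sure the filesystem really is unmounted before running the check:

```bash
# Replace /dev/sdX1 and the mount point with the actual data volume on Clooney
sudo umount /path/to/mountpoint
sudo e2fsck -f /dev/sdX1
```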
Lisanne Knijff (lisanne.knijf@kemi.uu.se)
Yunqi Shao (yunqi.shao@chalmers.se)
Pavlin Mitev (pavlin.mitev@uppmax.uu.se)
copy the corresponding lines in /etc/shadow
to the NIS server of the other
cluster.
By default the computers can only be accessed from within Sweden. If one wants to access the cluster (without a VPN), one can do this by editing the /etc/hosts.allow file on the login node. (You can get the IP address of that person using https://ipv6-test.com.)
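A sketch of such an entry; the IP address is a placeholder from the documentation range:

```
# /etc/hosts.allow on the login node: allow SSH from one external address
sshd: 203.0.113.45
```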
Most alerts and monitoring information is sent to the admin(s) by email. This requires a working mail-sending service; below is the setup for each node.
teoroo
aberlour
clooney
W nodes
Most nodes use the Ubuntu or Scientific Linux package logwatch; just install it and the rest is done by the package (set the mail address to teoroosys@kemi.uu.se during installation). Some nodes have smartmontools running to monitor the HDD health and send a mail in case of pre-failure, high temperature, etc.
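A hedged sketch of the two pieces on an Ubuntu node; the smartd.conf line is an assumption about how the mail address is wired up, not a copy of the actual configuration:

```bash
# Install logwatch (it prompts for the mail address) and smartmontools
sudo apt install logwatch smartmontools
# In /etc/smartd.conf, a line like the following makes smartd mail the admins on problems:
#   DEVICESCAN -a -m teoroosys@kemi.uu.se
```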
The disk status on the file servers is monitored by cron jobs that send mails every night. On some nodes, the smartmontools daemon is running and sends mails in case of error, failure, or high temperature of the disks.
On disk failure, see here; you can probably find a spare disk in the inventory.
The QNAP NAS can be accessed at https://qnap:443, with the username admin and the admin password.