PLUG Server Upgrade Plan

Current State

We currently have two cloud VMs up and running:

  1. power.plug.org.au
    • Running in AWS
    • 1GiB RAM, 4GiB swap (1.4GiB in use)
    • Debian 7/wheezy (released in 2013, support ended in 2018)
    • Runs most services
    • Configured by hand over time
  2. edison.plug.org.au
    • Running in DigitalOcean
    • 2GiB RAM
    • Debian 9/stretch (released in 2017, support ended in 2022)
    • Runs DNS and Let's Encrypt certificate requests
    • Configured via Ansible playbook

Goals

  1. Decommission edison.
  2. Replace with a server running Debian 12/bookworm (released in 2023, support ends in 2028)
    • preferably using a version of our Ansible playbook.
  3. Decommission power.
  4. Have a plan to upgrade the new server to Debian 13 later this year.
    • and beyond: we can't let our infrastructure get so out of date.

Blockers

  1. The PLUG DNS zone has an NS record, ns1.plug.org.au, that points at edison. We will need to coordinate with Tim White to update the glue records and point his secondary DNS servers elsewhere.
  2. power currently uses TLS certificates obtained through edison's Let's Encrypt setup.
    • We will need a new system capable of requesting certificates before the last certificates from edison reach their 90-day expiry.
  3. We need a testing plan covering web/postfix/mailman/ugmm/PHP/fail2ban.
  4. ugmm may run unchanged on the new server, or may produce new errors and warnings.
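
To keep an eye on the 90-day deadline in blocker 2, something like the following could report how long a host's current certificate has left. This is a minimal sketch using only the Python standard library; the function names are made up for illustration and it is not part of the existing playbook.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and a certificate's notAfter.

    `not_after` uses the format returned by ssl's getpeercert(),
    e.g. "Jan  1 00:00:00 2025 GMT". The result is negative once expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443) -> str:
    """Connect to host:port over TLS and return the certificate's notAfter.

    Uses default verification, so an already expired or otherwise invalid
    certificate raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

For example, assuming power serves TLS on port 443, `days_until_expiry(fetch_not_after("power.plug.org.au"), datetime.now(timezone.utc))` would give the number of days we have left to get certificate requests working on the new server.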

Plan

  1. Back up all the things

  2. Stand up a new VM at binarylane.com.au:

    • Image: latest official Debian 12 image
    • Instance type: 2GB at $7.50 + GST (AUD $8.25)
    • Hostname: usb, chosen for its brevity
  3. Run the Ansible playbook on the new server until it hits an error.

  4. Try to manually configure certbot, with something like the following?

    • The Ansible playbook won't bootstrap a system from scratch. It does not configure certbot to request certificates, so config files for nginx etc. are not created. On a fresh machine those services then fail to start, and the playbook fails.
      # certbot --dns-rfc2136 --preferred-challenges dns -d 'plug.org.au' -d '*.plug.org.au' -m admin@plug.org.au certonly
      # certbot --nginx -d mail.plug.org.au -m admin@plug.org.au certonly
      # cp /usr/lib/python3/dist-packages/certbot/ssl-dhparams.pem /etc/letsencrypt
      # cp /usr/lib/python3/dist-packages/certbot_nginx/_internal/tls_configs/options-ssl-nginx.conf /etc/letsencrypt
      # ln -s plug.org.au /etc/letsencrypt/live/mail.plug.org.au

      mail.plug.org.au on edison is set up with the nginx authenticator, so we can't set it up on the new server until we migrate that hostname. Maybe this is what created the options-ssl-nginx.conf and ssl-dhparams.pem files?
  5. Run playbook again, hopefully to completion.

    • This should bring up the services listed below, under "Start migrating services"
  6. Nick: Testing and iteration

    • Services with web endpoints can be tested quickly and easily with our webcheck.py or similar, comparing the results against power
    • Mail and mailing lists might need more manual (but repeatable and repeated) testing. Previous testing was started on the subdomain po1.plug.org.au
      • swaks(1) is handy
      • postsuper -h ALL (puts all queued mail on hold)
      • Don't send floods of testing/duplicate/membership-expiry email to our mailman subscribers/UGMM members
      • Clear the held-for-moderation messages in our mailing lists before cutover
    • Look for errors and warnings in the logs, especially from mail/postfix/mailman/ugmm. Iterate, and record, script or add to the playbook the "manual" steps until we've fixed enough of them
  7. Update DNS zone files in the playbook to point ns1.plug.org.au at the new server's IP.

  8. Contact Tim White, and request that he update the glue records and point his secondary DNS servers at the new ns1.

  9. Verify that the new server can successfully request certificates.

  10. Turn off edison. If nothing breaks, consider closing the DigitalOcean account.

  11. Start migrating services from power, updating CNAMEs as we go and using power's data backups:

    • website
    • mailing lists
      • postfix
      • mailman
      • pipermail archives
    • LDAP plus UGMM
    • Nick: backups on the new machine
    • Nick: fail2ban on login services: ssh, ugmm, mailman, dovecot?
      • sshd: MaxAuthTries/MaxStartups/PerSourceMaxStartups/PerSourceNetBlockSize
    • Other low priority services like mumble
  12. Turn off power.
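
For the web endpoint testing in step 6, the comparison against power could look roughly like the sketch below. This is not our actual webcheck.py, just an illustration of the idea using the standard library: fetch the same paths from both hosts and report any that differ in status or body. All names here are hypothetical.

```python
import hashlib
import urllib.error
import urllib.request
from typing import NamedTuple


class Response(NamedTuple):
    status: int
    body_sha256: str


def summarise(status: int, body: bytes) -> Response:
    """Reduce a response to its status code and a hash of its body."""
    return Response(status, hashlib.sha256(body).hexdigest())


def fetch(base_url: str, path: str) -> Response:
    """GET base_url + path; HTTP error statuses are recorded, not raised."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=10) as resp:
            return summarise(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        return summarise(err.code, err.read())


def differences(old: dict[str, Response], new: dict[str, Response]) -> list[str]:
    """Describe each path whose status or body differs between the two hosts."""
    problems = []
    for path in sorted(old):
        if path not in new:
            problems.append(f"{path}: not checked on new server")
        elif old[path].status != new[path].status:
            problems.append(f"{path}: status {old[path].status} -> {new[path].status}")
        elif old[path].body_sha256 != new[path].body_sha256:
            problems.append(f"{path}: body differs")
    return problems
```

Usage would be along the lines of building `{p: fetch("https://power.plug.org.au", p) for p in paths}` and the same for the new server, then printing `differences(old, new)`. The new server's URL depends on how it is reachable before cutover, so it is left as a parameter. Pages with dynamic content (timestamps, session tokens) will show up as "body differs" even when they are fine, so those need a look by eye.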

Future

  1. Once mailing lists have been migrated, attempt migration to Mailman 3.
  2. When Debian 13 is released or close to release, snapshot the new machine's storage and perform the upgrade on the live system. Roll back if it fails.

Mail problems

Postfix complained about missing aliases because mailman hadn't yet created /var/lib/mailman/data/aliases.

clamav-daemon.service hadn't started due to missing virus definitions, and hadn't retried. This seems to have been a bootstrap issue, as the definitions have since been updated.
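
Both problems come down to a file not existing yet on a fresh machine, so a preflight check before starting the mail services might catch them earlier. A minimal sketch follows. The mailman aliases path is the one postfix complained about; the ClamAV directory /var/lib/clamav and the *.cvd/*.cld file patterns are assumptions based on Debian's defaults and should be checked against our config. The function name is made up.

```python
from pathlib import Path


def missing_mail_prerequisites(root: Path = Path("/")) -> list[str]:
    """Return a description of each mail prerequisite absent under `root`."""
    missing = []

    # Postfix refers to this file, which mailman only creates once it has run.
    aliases = root / "var/lib/mailman/data/aliases"
    if not aliases.is_file():
        missing.append("mailman aliases file")

    # Assumption: definitions live here as *.cvd or *.cld files.
    clamav_dir = root / "var/lib/clamav"
    has_definitions = clamav_dir.is_dir() and (
        any(clamav_dir.glob("*.cvd")) or any(clamav_dir.glob("*.cld"))
    )
    if not has_definitions:
        missing.append("clamav virus definitions")

    return missing
```

An empty list from `missing_mail_prerequisites()` would mean both files are in place; anything else names what still needs bootstrapping before postfix and clamav-daemon are started.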