# IAD2 to RDU3 datacenter move overview
This is an overview of the 2025 datacenter move from IAD2 to RDU3.
The move was completed as of 2025-07-05. Please use this document for historical reference only.
Remaining cleanup tasks are being tracked in https://pagure.io/fedora-infrastructure/issue/12620 and in
separate infra tickets.
The move will be done in a number of phases.
Refreshed hardware in RDU3 will be installed, data and applications
will be set up there and switched over, then newer hardware from IAD2
will be shipped to RDU3 and the staging environment will be rebuilt.
## Considerations
* From phase 4 (switcharoo) to the end of phase 6 (staging setup/rebalance),
there will be a limited STAGING environment to test with. For this reason, during
this time we should strive to make as few changes as possible.
* 2025-05-01 - expected access to out-of-band mgmt on new hosts
* 2025-06-05 - Flock in Prague
* 2025-06-26 - staging switcharoo
* 2025-06-30 - switcharoo week!
* 2025-07-07 <- We are HERE - **MOVE complete, cleanup underway**
* 2025-07-09 - old iad2 hardware ships to rdu3
## Communications
* 2025-01-23: Initial community blog post / devel-announce post sent in January:
* https://lists.fedoraproject.org/archives/list/devel-announce@lists.fedoraproject.org/thread/NJKLAGMCTDD2YVINWEAT4CSZLXXEYFSL/
* https://communityblog.fedoraproject.org/fedora-datacenter-move-later-this-year-2025-version/
* 2025-04-21: Update blog post / announcement to be sent this week
* https://communityblog.fedoraproject.org/2025-fedora-datacenter-move-update/
* 2025-05-14: Update blog post / announcement to be sent this week
* 2025-06-09: reminder blog post / announcement about moving coming in the next week
* 2025-06-23: announcement post to end users just to let them know it's happening.
* 2025-06-29: reminder email that this is happening tomorrow; list what services may be down temporarily
* 2025-06-30: announce it's happening & what services may be out temporarily
* 2025-07-09 <- We are HERE - **MOVE done, cleanup underway**
* 2025-07-14: status update / everything good?
## Internal touchpoints
TODO: we will need to coordinate with storage, networking, and IT at various steps
so they can complete work that lets us move on to the next phase. We should denote those touchpoints.
## Phases
:::spoiler These phases have been completed, click to expand and see them
### 0: planning / footprint reduction / changes
Planning includes finishing/refining this document, as well as finishing
any infrastructure work that will help us in the new datacenter or that
would cause disruptions if done during the move.
#### work items to be done before the move
These are only ideas; we need to discuss and finalize them
1. Use zabbix in RDU3 to replace nagios (might wait)
2. Investigate moving the wiki into OpenShift
3. Complete moving registries to quay.io so we don't have to move ours
4. Use RHEL9/new rabbitmq in RDU3
5. nftables replacing iptables (might wait)
### 1: pre-install RDU3 setup
These are things we want to do as soon as we have access to RDU3 hardware
but before we start bootstrapping instances/installs.
- [x] Get access to mgmt on one each of the "small" and "large" servers in RDU3
- [x] Adjust firmware settings and create an xml template to use
- [x] Test raid with hardware/software/encrypted to know tradeoffs
- [x] Set up DNS zones as soon as the info is known from networking
- [x] Set up DHCP information as soon as the machines' MAC addresses are known
- [-] Install and configure power10 boxes when available
- [x] sort out the ipv6 story at RDU3.
- [x] make sure ipv6 firewalling is correct (since rdu3 will have ipv6)
mgmt/OOB/BMC setup steps:
1. log in with the initial password and reset it to a new one
2. add to the fedora-prod drac group (or fedora-stage)
3. group add user / group firmware update drac
4. sort out one machine, export its config in json/xml, apply to the rest
5. bios and other firmware updates
6. set default disk status for newly-added disks (figure out where...)
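
For steps 1 and 4, a minimal sketch of how the password reset and config export/apply could look with remote racadm, assuming Dell iDRAC BMCs, placeholder credentials, and that the subcommand syntax matches this iDRAC generation (verify before use):
```
# reset the initial password (step 1); user index 2 is usually root on iDRAC
racadm -r autosign01.mgmt.rdu3.fedoraproject.org -u root -p 'INITIALPASS' set iDRAC.Users.2.Password 'NEWPASS'

# export the tuned config from the reference machine as a Server Configuration Profile (step 4)
racadm -r autosign01.mgmt.rdu3.fedoraproject.org -u root -p 'NEWPASS' get -t xml -f rdu3-template.xml

# apply the same template to the next host
racadm -r buildhw-x86-01.mgmt.rdu3.fedoraproject.org -u root -p 'NEWPASS' set -t xml -f rdu3-template.xml
```
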
autosign01.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
sign-vault01.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
buildhw-x86-01.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
buildhw-x86-02.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
autosign02.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
sign-vault02.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
buildhw-x86-03.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
qvmhost-x86-01.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
backup01.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
bvmhost-x86-01.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
bvmhost-x86-02.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
bvmhost-x86-03.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
bvmhost-x86-04.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
bvmhost-x86-05.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
bvmhost-x86-06.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
bvmhost-x86-01-stg.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
bvmhost-x86-02-stg.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
bvmhost-x86-03-stg.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
worker01.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
worker02.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
worker03.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
worker04.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
worker05.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
worker01-stg.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
worker02-stg.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
worker03-stg.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
vmhost-x86-01.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
vmhost-x86-02.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
vmhost-x86-03.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
vmhost-x86-04.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
vmhost-x86-05.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
vmhost-x86-01-stg.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
vmhost-x86-02-stg.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
vmhost-x86-03-stg.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
vmhost-x86-04-stg.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
vmhost-x86-05-stg.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
vmhost-x86-riscv01.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
openqa-x86-worker01.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
openqa-x86-worker02.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
openqa-x86-worker03.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
openqa-x86-worker04.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
openqa-x86-worker05.mgmt.rdu3.fedoraproject.org 1 2 3 4 5
bvmhost-a64-01.mgmt.rdu3.fedoraproject.org kevin - done
bvmhost-a64-02.mgmt.rdu3.fedoraproject.org kevin - done
bvmhost-a64-03.mgmt.rdu3.fedoraproject.org kevin - done
bvmhost-a64-04.mgmt.rdu3.fedoraproject.org kevin - done
buildhw-a64-01.mgmt.rdu3.fedoraproject.org kevin - done
buildhw-a64-02.mgmt.rdu3.fedoraproject.org kevin - done
openqa-a64-worker01.mgmt.rdu3.fedoraproject.org kevin - done
openqa-a64-worker02.mgmt.rdu3.fedoraproject.org kevin - done
bvmhost-a64-01-stg.mgmt.rdu3.fedoraproject.org kevin - done
power10 01
power10 02
### 2: bootstrap in RDU3
In this phase we want to manually set up enough things that we can then
leverage ansible to install the rest.
- [x] Install one large host as a vmhost (manually via mgmt interface)
- [x] Install a bastion01.rdu3 vm (batcave01.iad2 can reach in via it)
- [x] Install a noc01.rdu3 vm (dhcp/tftp/pxe)
- [x] Install a tang01.rdu3
- [x] Install a ns01.rdu3 vm (dns)
- [x] Install a batcave01.rdu3
- [x] Install a proxy01.rdu3
- [x] Install a log01.rdu3
- [x] Install os-control01.rdu3
- [x] Install OpenShift cluster
- [?] Install ipa01.rdu3 and set to replicate from iad2 - zlopez in progress
- [?] Install new rhel9 / newer rabbitmq server cluster - james/aurelian - done?
- [x] Install the rest of vmhosts/hardware (add list/table when available)
- [x] HSM setup and configuration and testing in RDU3 - kevin - in progress
autosign01.rdu3.fedoraproject.org - installed - ansiblized
sign-vault01.rdu3.fedoraproject.org - installed - ansiblized
buildhw-x86-01.rdu3.fedoraproject.org - installed, but had to install f41 and upgrade, f42 GA kernel reboots on boot
buildhw-x86-02.rdu3.fedoraproject.org - installed. ^ ditto
autosign02.rdu3.fedoraproject.org - would like to repurpose to buildhw-x86-04 -
sign-vault02.rdu3.fedoraproject.org
buildhw-x86-03.rdu3.fedoraproject.org
qvmhost-x86-01.rdu3.fedoraproject.org - rhel9 6disk deployed & booted - ansiblized
backup01.rdu3.fedoraproject.org - rhel9 6disk - built, ansible needs a vpn cert - cert made, ansibilized
bvmhost-x86-01.rdu3.fedoraproject.org - installed - ansiblized
bvmhost-x86-02.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
bvmhost-x86-03.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
bvmhost-x86-04.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
bvmhost-x86-05.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
bvmhost-x86-06.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
bvmhost-x86-01-stg.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
bvmhost-x86-02-stg.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
bvmhost-x86-03-stg.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
worker01.rdu3.fedoraproject.org - kevin done
worker02.rdu3.fedoraproject.org - kevin done
worker03.rdu3.fedoraproject.org - kevin done
worker04.rdu3.fedoraproject.org - kevin done
worker05.rdu3.fedoraproject.org - kevin done
worker01-stg.rdu3.fedoraproject.org - kevin - done
worker02-stg.rdu3.fedoraproject.org - kevin - done
worker03-stg.rdu3.fedoraproject.org - kevin - done
vmhost-x86-01.rdu3.fedoraproject.org - installed - ansiblized
vmhost-x86-02.rdu3.fedoraproject.org - installed - ansiblized
vmhost-x86-03.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
vmhost-x86-04.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
vmhost-x86-05.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
vmhost-x86-01-stg.rdu3.fedoraproject.org - installed - ansiblized
vmhost-x86-02-stg.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
vmhost-x86-03-stg.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
vmhost-x86-04-stg.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
vmhost-x86-05-stg.rdu3.fedoraproject.org - rhel9 8disk - ansiblized
vmhost-x86-riscv01.rdu3.fedoraproject.org - rhel9 8disk - ansibilized
openqa-x86-worker01.rdu3.fedoraproject.org
openqa-x86-worker02.rdu3.fedoraproject.org
openqa-x86-worker03.rdu3.fedoraproject.org
openqa-x86-worker04.rdu3.fedoraproject.org
openqa-x86-worker05.rdu3.fedoraproject.org
(Greg: all five installed with f41 and upgraded to f42 via dnf; RAID looks ok, but the bond device still needs setting up)
bvmhost-a64-01.rdu3.fedoraproject.org - kevin - done
bvmhost-a64-02.rdu3.fedoraproject.org kevin - done
bvmhost-a64-03.rdu3.fedoraproject.org - kevin - done
bvmhost-a64-04.rdu3.fedoraproject.org - kevin - done
buildhw-a64-01.rdu3.fedoraproject.org - kevin - done
buildhw-a64-02.rdu3.fedoraproject.org - kevin - done
openqa-a64-worker01.rdu3.fedoraproject.org - kevin - done
openqa-a64-worker02.rdu3.fedoraproject.org - kevin - done
bvmhost-a64-01-stg.rdu3.fedoraproject.org - kevin - done
power10 01 (78d1291) - in progress -
power10 02 (78d1281) - provisioned on centos side
### 2.5: reboot cycle
As per the host list above, except for not-yet-installed hosts and the OCP workers. A quick per-host sanity sketch follows the prod list below.
##### stg
bvmhost-x86-01-stg.rdu3.fedoraproject.org - done
bvmhost-x86-02-stg.rdu3.fedoraproject.org - done
bvmhost-x86-03-stg.rdu3.fedoraproject.org - done
vmhost-x86-01-stg.rdu3.fedoraproject.org - done
vmhost-x86-02-stg.rdu3.fedoraproject.org - done
vmhost-x86-03-stg.rdu3.fedoraproject.org - done
vmhost-x86-04-stg.rdu3.fedoraproject.org - done
vmhost-x86-05-stg.rdu3.fedoraproject.org - done
bvmhost-a64-01-stg.rdu3.fedoraproject.org - done
##### prod
autosign01.rdu3.fedoraproject.org
sign-vault01.rdu3.fedoraproject.org - done - kevin
buildhw-x86-01.rdu3.fedoraproject.org
buildhw-x86-02.rdu3.fedoraproject.org
qvmhost-x86-01.rdu3.fedoraproject.org - done - kevin
backup01.rdu3.fedoraproject.org - done - kevin
bvmhost-x86-01.rdu3.fedoraproject.org - done - kevin
bvmhost-x86-02.rdu3.fedoraproject.org - done - kevin
bvmhost-x86-03.rdu3.fedoraproject.org - done - kevin
bvmhost-x86-04.rdu3.fedoraproject.org - done - kevin
bvmhost-x86-05.rdu3.fedoraproject.org - done - kevin
bvmhost-x86-06.rdu3.fedoraproject.org - done - kevin
vmhost-x86-01.rdu3.fedoraproject.org - done - kevin
vmhost-x86-02.rdu3.fedoraproject.org - done - kevin
vmhost-x86-03.rdu3.fedoraproject.org - done - kevin
vmhost-x86-04.rdu3.fedoraproject.org - done - kevin
vmhost-x86-05.rdu3.fedoraproject.org - done - kevin
vmhost-x86-riscv01.rdu3.fedoraproject.org - done - kevin
bvmhost-a64-01.rdu3.fedoraproject.org
bvmhost-a64-02.rdu3.fedoraproject.org
bvmhost-a64-03.rdu3.fedoraproject.org
bvmhost-a64-04.rdu3.fedoraproject.org
buildhw-a64-01.rdu3.fedoraproject.org
buildhw-a64-02.rdu3.fedoraproject.org
openqa-a64-worker01.rdu3.fedoraproject.org
openqa-a64-worker02.rdu3.fedoraproject.org
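
After each host comes back from its reboot, a quick sanity pass might look like this (a sketch; which checks apply depends on the host type):
```
cat /proc/mdstat     # software raid arrays resynced / all members present
ip -br link show     # bond and NICs are up
virsh list --all     # on vmhosts: guests marked autostart are running again
```
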
### 3: prep / early movers
In this phase we will find applications/things that we can move
in advance of the switcharoo. These are things that don't have
databases or nfs mounts or the like and can get the information
they need to operate from IAD2 until those datasources also move to RDU3.
<strike>- [ ] consider moving /srv/git on pkgs01 to a netapp volume in IAD2</strike>
Possible early move applications in openshift (needs discussion):
application-monitoring
asknot
badges
bugzilla2fedmsg (needs access to BZ STOMP endpoint)
cloud-image-uploader
compose-tracker
discourse2fedmsg
elections
languages
maubot
ipsilon-website
kanban
kerneltest
openvpn (switch to wireguard)
planet
webhook2fedmsg
websites
zezere - dead: doesn't need to move
fedora coreos can move their build pipeline anytime we have cluster ready - done
Possible move applications not in openshift:
Openqa can move anytime we have their machines setup/installed - will move next week during main move
possibly move download vm's to the r/o snapmirror version in rdu3
riscv secondary hub and builders and composers - done at the same time as staging
new rabbitmq cluster (rhel9 + centos messaging rabbitmq-server) - done
:::
### 3.5 staging
2025-06-26 (or so): take down iad2 staging and move it all to rdu3.
This will be a good 'dress rehearsal' for doing the actual switcharoo the following week.
It will allow us to improve processes and learn.
- [x] build OpenShift Cluster in stg.rdu3 - in progress (Greg)
At 10:00 UTC (6am EDT, 11am UK)
- [x] Shutoff nagios alerts ( ansible/scripts/shutup-nagios )
Disable the following staging services:
- [x] proxy01.stg.iad2.fedoraproject.org httpd
- [x] proxy02.stg.iad2.fedoraproject.org httpd
- [x] shutdown wiki01.stg.iad2.fedoraproject.org vm
- eg `ansible bvmhost-x86-01.stg.iad2.fedoraproject.org -m command -a "virsh shutdown wiki01.stg.iad2.fedoraproject.org"` or ssh in and shut it down
- [x] scale openscanhub pods to 0 in iad2 staging openshift
- eg `oc scale --replicas=0 d/fedora-osh-hub`
- `[greg@topaz]$ oc get deployments`
  NAME              READY   UP-TO-DATE   AVAILABLE   AGE
  fedora-osh-hub    0/0     0            0           440d
  redis             0/0     0            0           440d
  resalloc-server   0/0     0            0           440d
- [x] shutdown bodhi-backend01.stg.iad2.fedoraproject.org vm
- [x] shutdown buildvm's in iad2.stg:
- [x] buildvm-x86-01.stg.iad2.fedoraproject.org
- [x] buildvm-x86-02.stg.iad2.fedoraproject.org
- [x] buildvm-x86-03.stg.iad2.fedoraproject.org
- [x] buildvm-x86-04.stg.iad2.fedoraproject.org
- [x] buildvm-x86-05.stg.iad2.fedoraproject.org
- [x] buildvm-a64-01.stg.iad2.fedoraproject.org
- [x] buildvm-a64-02.stg.iad2.fedoraproject.org
- [x] buildvm-ppc64le-01.stg.iad2.fedoraproject.org
- [x] buildvm-ppc64le-02.stg.iad2.fedoraproject.org
- [x] buildvm-ppc64le-03.stg.iad2.fedoraproject.org
- [x] buildvm-ppc64le-04.stg.iad2.fedoraproject.org
- [x] buildvm-ppc64le-05.stg.iad2.fedoraproject.org
- [x] shutdown riscv-koji01.iad2.fedoraproject.org
- [x] shutdown buildvm-x86-riscv01.iad2.fedoraproject.org
- [x] shutdown buildvm-x86-riscv02.iad2.fedoraproject.org
- [x] shutdown compose-x86-riscv01.iad2.fedoraproject.org
- [x] shutdown value01.iad2.fedoraproject.org
- looks like value02? also we have a value02.stg, assume we shut this down
Scale all these things in the stg iad2 cluster down to 0 pods:
- [x] flatpak-indexer
- [x] languages
- [x] docsbuilding
- [x] mote
- [x] maubot
- [x] mdapi
- [x] openshift-image-registry
- scaling didn't work because the registry operator recreated the pods
- removed the deployment via editing the operator
- `oc edit configs.imageregistry.operator.openshift.io` and change `managementState:` to `Removed`
- [x] review-stats
- [x] websites
Other non-system projects could also be scaled down since we are moving everything, but the above are the things with rw nfs volumes. (A scripted equivalent of the image-registry `oc edit` step above is sketched below.)
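
A non-interactive way to make the same image-registry change, assuming you are logged in to the stg iad2 cluster with cluster-admin rights:
```
oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
  -p '{"spec":{"managementState":"Removed"}}'
```
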
At 11:00 UTC (7am EDT, 12noon UK)
RHIT storage folks will flip all the staging rw volumes to ro in iad2 and rw in rdu3
If there's a reason this needs to be delayed, notify them on slack
At 13-14 UTC (9-10am EDT, 2-3pm UK)
- [x] switch dns:
diff --git a/fedoraproject.org.template b/fedoraproject.org.template
index 6df727dd..77ac9dc1 100644
--- a/fedoraproject.org.template
+++ b/fedoraproject.org.template
@@ -191,14 +191,14 @@ fasjson.stg IN A 10.3.166.74
fasjson.stg IN A 10.3.166.75
{% else %}
; staging wildcard/proxies (external view)
-wildcard.stg IN A 38.145.60.32
-wildcard.stg IN A 38.145.60.33
+wildcard.stg IN A 38.145.32.32
+wildcard.stg IN A 38.145.32.33
;; id.stg can not be a CNAME
-id.stg IN A 38.145.60.32
-id.stg IN A 38.145.60.33
+id.stg IN A 38.145.32.32
+id.stg IN A 38.145.32.33
;fasjson.stg IN A 38.145.60.32
-fasjson.stg IN A 38.145.60.33
-flatpak-cache.stg IN A 38.145.60.32
+fasjson.stg IN A 38.145.32.33
+flatpak-cache.stg IN A 38.145.32.32
{% endif %}
;
This will point stg.fedoraproject.org to the proxy01.stg.rdu3 and proxy02.stg.rdu3 proxies.
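
A quick way to confirm the new records are being served once the change is pushed (the expected addresses are the rdu3 proxy IPs from the diff above):
```
dig +short wildcard.stg.fedoraproject.org
dig +short id.stg.fedoraproject.org
# expect 38.145.32.32 / 38.145.32.33 after caches expire
```
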
- [x] migrate databases and data - sync data over and upgrade postgres versions. Once all the dbs are up...
- [x] db-datanommer01.stg.rdu3.fedoraproject.org - done - kevin
- [x] db-fas01.stg.rdu3.fedoraproject.org - done
- [x] db-koji01.stg.rdu3.fedoraproject.org - done - kevin
- [x] db01.stg.rdu3.fedoraproject.org - done - kevin
- [x] db03.stg.rdu3.fedoraproject.org - done - kevin
- [x] db-koji-riscv01.rdu3.fedoraproject.org - done - kevin
- [x] Bring up in rdu3
- [x] finish configuring staging openshift cluster. - mostly done
- [x] openshift services need their playbooks modified to change os_control_stg to point to rdu3 stg os-control
In the easiest case it should be just running the playbook with -l staging.
nfs based pv's are added
- [x] badges - abompard
- depends on datanommer
- [ ] blockerbugs - zlopez
- needs packager-dashboard to be deployed
- sync cron job throws error because of bodhi
- https://qa.stg.fedoraproject.org/blockerbugs/ is working, but it's slow - will wait for bodhi to test out
- blockerbugs page is much quicker, but https://qa.stg.fedoraproject.org/ landing page doesn't load for me
- [x] bodhi
- Also requires `playbooks/groups/bodhi-backend.yml`
- [x] bugzilla2fedmsg - abompard
- [x] cloud-image-uploader
- [x] compose-tracker - jnsamyak (it is deployed but due to pagure load it is giving an error)
- compose-tracker-build-3-build 1/1 Running
- [x] coreos-ci
- [x] datagrepper - abompard
- depends on datanommer
- [x] datanommer - abompard
- Blocked by an error accessing the datanommer database:
```
[root@db-datanommer01 ~][STG]# sudo -u postgres psql
could not change directory to "/root": Permission denied
psql: error: FATAL: cache lookup failed for relation 6100
```
- [x] discourse2fedmsg - abompard
- [x] docsbuilding - darknao
- [x] docstranslation - darknao
- [x] easyfix - abompard
- [x] elections - abompard
- [x] fas2discourse-operator - zlopez
- /root/bin/oc doesn't exist - changed to /bin/oc
- git package is missing - added
- make package is missing - added
- needs to wait for kojipkgs to install it - done
- Deployed it
- [x] fasjson - abompard
- [x] fedocal
- [ ] fedora-coreos-pipeline
- [x] fedora-packages-static
- [x] firmitas - zlopez
- Error when running cronjob `[FMTS] [2025-07-02 02:16:22 +0000] [ERROR] Please set the directory containing X.509 standard TLS certificates properly`
- There probably needs to be some job that is syncing them to local PVC
- Fixed by dkirwan
- [ ] flask-oidc-dev - I think we can drop this one
- [ ] flatpak-indexer
- [x] fmn - abompard
- [x] forgejo - dkirwan
- [x] greenwave
- [x] ipsilon-website - abompard
- [ ] kanban
- [x] kerneltest - abompard
- [x] koschei
- [ ] languages
- [ ] mattdm-knative-dev
- [x] maubot - greg
- [x] mdapi - greg - storage issue, flagged to BWalker - storage fixed, thanks to darknao for help
- [x] mirrormanager - abompard
- the centos proxy is unavailable, it'll require a DNS change when we switch prod over
- couldn't test the propagation of the mirrorlist db to the proxies because they're only in prod
- [x] mote - greg
- playbook ran but failed to build, poetry error, darknao to investigate - fixed?
- Build is fixed, but app not yet reachable on https://meetbot.stg.fedoraproject.org/ - darknao
- [x] noggin - abompard
- [x] noggin-centos - abompard
- [x] openscanhub
- [x] oraculum - zlopez
- "Failed to download metadata for repo 'infrastructure-tags': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried" when trying to install psycopg2 package on db01.stg.rdu3.fedoraproject.org
- Will try later when kojipkgs will be migrated
- Pull image still failed due to error: initializing source docker://registry.access.redhat.com/ubi9/python-312:latest: reading manifest latest in registry.access.redhat.com/ubi9/python-312: StatusCode: 403 - fixed
- Deployed and running
- [x] planet - abompard
- `stg.fedoraplanet.org` presents the wrong TLS certificate (`*.stg.fedoraproject.org`), not sure where to fix that
- [x] poddlers - abompard
- [x] release-monitoring - zlopez
- db for release-monitoring is hardcoded in vars.yml in ansible-private
- App is deployed and working fine, but https://stg.release-monitoring.org/ is unreachable (fixed now, but the cert is wrong)
- Fedora messaging publish is failing as well (working now)
- Common name on cert is not correct *.stg.fedoraproject.org is not stg.release-monitoring.org - fixed
- [x] resultsdb - greg
- db was set to point to iad2 in private/vars
- fixed, pushed, and playbook ran - but now it doesn't seem to want to deploy new pods
- restarted the deployment (oc rollout latest dc/resultdb-api) and all pods are up and running - darknao
- App not reachable yet on https://resultsdb.stg.fedoraproject.org
- [x] resultsdb-ci-listener
- [x] review-stats
- [ ] stackrox
- [x] testdays - patrikp
- [x] the-new-hotness - zlopez
- [x] transtats
- [ ] valkey-operator-system
- [x] waiverdb
- [x] webhook2fedmsg - abompard
- [x] websites - darknao
- [ ] check nagios for down things and fix them
### 3.9 staging retrospective
- DNS resolving was a problem, where we got prod rdu3 instead of stg.rdu3
- ntap was firewalled, or not routed, for a bit
- **need** to make sure we co-ordinate DB migration with service bring up
- It gets a bit confusing when the playbook is changing often, e.g. the default ipa servers changing from ipa.stg.iad2 to ipa.stg.rdu3 (maybe less of a problem with prod)
- Couple of places where data is duplicated (likely a copy-and-paste error) in ansible, and there are no warnings and the second version wins (so you change the first and nothing changes)
- Hard coded iad2 things in playbook tasks (hopefully all found with staging rollout)
- Weird issue with `db01.stg.rdu3` prompt showing PROD-RDU3 -- hostname wasn't in the staging group?
- Same happened with `pkgs01.stg.rdu3`, it was missing in staging group and thus using production variables - zlopez
- Everyone try to rest on Sunday
- we should drop some old databases we still have (like pdc, which is MASSIVE!) before the move (I will do that in the next few days)
- Don't forget to move also IPA CA renewal and CRL server to new machines (already done for both staging and production, but let's keep this in mind for next migration). See https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html-single/migrating_to_identity_management_on_rhel_9/index#assigning-the-ca-renewal-server-role-to-the-rhel-9-idm-server_assembly_migrating-your-idm-environment-from-rhel-8-servers-to-rhel-9-servers
- openshift-image-registry is an operator, disabling pods will just roll them back. Do `oc edit configs.imageregistry.operator.openshift.io` and change `managementState:` to `Removed` instead
- db for release-monitoring is hardcoded in vars.yml in ansible-private
- (Greg) openqa.stg was broken - the openqa-lab hosts are still in iad2, and port 80 is not open RDU3->IAD2 for the reverseproxy
- AdamW was keen to restore access rather than wait til next week, so I restored the stg.iad2 proxies for *just openqa*
- `cd /etc/httpd/conf.d/ ; mkdir disabled ; mv * disabled/ ; mv disabled/openqa* . ; systemctl restart httpd`
- DNS updated (I couldn't do this, I think I need to be in sysadmin-dns. Asked DKirwan to help)
- the RDU3 proxies don't have the reverseproxy info for openqa - they just return 421. I assume this will be fixed later by "something"
- this will need to be reverted once the lab has moved
- (Greg) there's a lot of `hosts: db01.iad2.fedoraproject.org:db01.stg.iad2.fedoraproject.org` in the openshift playbooks
- Doesn't need to change now as the DBs exist, but we should clean that up
### 4: switcharoo week
This is the main outage of the move. During this one week we will
switch things from being in IAD2 to being in RDU3. We want to try
to have everything moved Mon/Tue/Wed/Thu and save Friday as
a day to fix problems.
We will need to work with storage and networking RHIT folks during this week
as we may need vlan/port assignments and storage snapmirror changes.
#### Prereq (sunday or before)
We want to have a few things done the week before the move if
at all possible.
- [x] sync data for db01, db03, db-koji01, db-datanommer02, db-fas01, pkgs01, batcave01, and log01 from iad2 to rdu3 (see the sync sketch after this list) - in progress - kevin
- [x] stop rawhide composes sunday night (us time)(after rawhide compose starts) (will start it at 4:15utc and then disable - kevin)
- [x] stop backups and grokmirror sunday night (us time) and umount /fedora_backups on backup01
- [x] stop updates pushes sunday night (us time)
- [x] stop cloud / container image nightly composes (us time)
- [x] add pinned message in a few matrix rooms (admin, devel, releng) about the move
- [x] mail devel-announce / announce one final reminder
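
A rough sketch of the per-database sync pattern referenced above (hostnames and paths are illustrative, runs were driven from the rdu3 db hosts; note `--inplace`, which the Monday retrospective found necessary for db03's large files, and the RHEL `postgresql-setup --upgrade` step for the postgres major-version bump):
```
# on the rdu3 db host, with postgresql stopped on both ends
systemctl stop postgresql
rsync -aHAX --delete --inplace db01.iad2.fedoraproject.org:/var/lib/pgsql/data/ /var/lib/pgsql/data/
# upgrade the data directory to the new postgres major version, then start it
dnf install -y postgresql-upgrade
postgresql-setup --upgrade
systemctl start postgresql
```
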
#### Monday
Core apps day: non-build core applications and databases.
At 10:00 UTC (6am EDT, 11am UK)
- [x] update status that the dc move week is started!
- [x] Shutoff nagios alerts ( ansible/scripts/shutup-nagios )
- [x] umount /mnt/fedora_stats on log01.iad2.fedoraproject.org, disable cron jobs that refer to it
- [x] remount ro the openshift_prod_codecs nfs mount on sundries01/02.iad2: ( 'mount -o remount,ro /srv/web/codecs.fedoraproject.org')
(we need this mount to still exist because it syncs to proxies, but we need it ro for the storage switch)
- [x] remount ro the fedora_apps nfs mount on batcave01.iad2.
- the mount is actually called fedora_app:
- `mount | grep fedora_app` shows `ntap-iad2-c02-fedora01-nfs01a:/fedora_app/app`
- [x] Shut down the following iad2 vms and also 'virsh autostart --disable' on them:
- [x] openqa01
- virsh autostart --disable
- virsh shutdown
- [x] openqa-lab01
- virsh autostart --disable
- virsh shutdown
- [x] value02
- virsh shutdown
- virsh autostart --disable
- [x] oci-candidate-registry01
- virsh autostart --disable
- virsh shutdown
- [x] Scale the following openshift apps pods in production iad cluster to 0:
- [x] openscanhub
- deployment.apps/fedora-osh-hub scaled to 0
- deployment.apps/redis scaled to 0
- deployment.apps/resalloc-server scaled to 0
- no cron resource
- [x] fedora-packages-static
- deploymentconfig.apps.openshift.io/fedora-packages-static scaled to 0
- deploymentconfig.apps.openshift.io/solr scaled to 0
- no cron resource
- [x] The cluster registry (to free the ocp_prod_registry volume)
- `oc edit configs.imageregistry.operator.openshift.io` and change `managementState:` to `Removed`
- [x] docsbuilding
- cronjob.batch/cron, cronjob.batch/cron-translated patched to true
- [x] maubot
- deployment.apps/maubot scaled to 0
- no cron resource
- [x] mdapi
- deploymentconfig.apps.openshift.io/mdapi scale to 0
- cronjob.batch/mdapi patched to suspend as True
- [x] reviewstats
- [x] websites
- [x] bodhi (this is a build type app, but it's using db01 for database and stopping it will prevent problems with updates during the move)
Confirm to storage team internally that all the above is done and ready for them.
At 11:00 UTC (7am EDT, 12noon UK)
- [x] Storage team will reverse the snapmirror for those volumes (iad2 ro, rdu3 rw)
At 14 UTC (7am PDT, 10am EDT, 2-3pm UK)
(kevin awake will start in on these)
In progress (13:50UTC):
- [x] sync and bring up batcave01.rdu3; put a warning on batcave01.iad2 / remove its ssh key. At this point everyone should move to using batcave01.rdu3
- [x] stop db01/03/db-datanommer/db-fas01/db-openqa01 - kevin
- [x] db-fas01.rdu3
- [x] db01.rdu3
- [x] db03.rdu3
- [x] db-datanommer02.rdu3
- [x] db-openqa01
- [x] migrate all db's over to rdu3 - kevin (except the build ones... tomorrow)
- [x] Change 'gateway' in dns from bastion01.iad2 to bastion01.rdu3
- [x] Change ocp, apps.ocp, *.apps.ocp in dns to point to rdu3
- [x] Change id.fedoraproject.org to point to rdu3
- [x] change rabbitmq.fedoraproject.org to point to rdu3
- [x] restart openvpn-client@openvpn on most everything (this will switch proxies to route to rdu3 for apps; see the ad-hoc ansible sketch after this list)
- [x] scale down to 0 and back up the openvpn project in prod rdu3 openshift to get it to reconnect
- [x] Change fasjson.fedoraproject.org to point to rdu3 (should be already deployed/ready in rdu3)
- [x] Change debuginfod.fedoraproject.org to point to rdu3 (should be already deployed/ready in rdu3)
- [x] Change nagios.fedoraproject.org to point to rdu3 (should be ready)
- [x] Change registry.fedoraproject.org (and registry-no-cdn) to point to rdu3 (should be ready)
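
For the openvpn-client restart step above, an ad-hoc ansible run from batcave01 is one way to fan it out (a sketch; the host pattern and any exclusions would need checking against the real inventory):
```
ansible 'all:!*.stg.*' -m systemd -a "name=openvpn-client@openvpn state=restarted"
```
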
bring up openshift apps:
- [x] asknot - jnsamyak
- asknot-build-1 1/1 Running
- [x] badges - abompard
- [ ] blockerbugs - zlopez
- needs packager-dashboard deployed as well
- sync cron job throws error because of bodhi
- https://qa.fedoraproject.org/blockerbugs/ is working, but it's slow - will wait for bodhi to test out
- blockerbugs page is much quicker, but https://qa.stg.fedoraproject.org/ landing page doesn't load for me
- [x] bodhi (hold on this one until tuesday)
- [x] bugzilla2fedmsg
- [x] datagrepper - abompard
- [x] datanommer - abompard
- [x] discourse2fedmsg - abompard
- [x] fas2discourse-operator - zlopez
- fas2discourse operator is deployed and running
- [x] docsbuilding - jnsamyak
- A cron job was setup, and is now scheduled (suspend is set to false)
- cronjob.batch/cron 50 * * * * False
- cronjob.batch/cron-translated 0 23 * * * False
- [x] docstranslation - jnsamyak
- A cron job was setup, and is now scheduled (suspend is set to false)
- cronjob.batch/cron 0 21 * * * <none> False
- [x] elections - jnsamyak
- pod/elections-1-5s6x5 1/1 Running
- [x] fasjson - deployed and hopefully ready - kevin
- [x] fedocal - jnsamyak
- [x] fedora-packages-static - jnsamyak, abompard
- [x] firmitas - dkirwan
- [x] fmn - abompard
- [x] greenwave - abompard
- [x] ipsilon-website - abompard
- [ ] kanban (is it still used? no change in 6 months) - qa folks were using this?
- [x] kerneltest - abompard
- [ ] languages (is it still used? no change in 16 months)
- [x] maubot - kevin - in progress
Getting: Output: mount.nfs: Failed to resolve server ntap-rdu3-c02-fedora01-nfs01a: Name or service not known when trying to mount nfs volume
fixed by making the pv use the fqdn... other pv's may need recreating.
- [x] mdapi - abompard
- [x] mirrormanager - abompard
- [x] mote - abompard
- [x] noggin - deployed and hopefully ready - kevin
- [x] noggin-centos - deployed and hopefully ready - kevin
- [x] openscanhub - abompard
- [x] oraculum - abompard
- [x] planet - abompard
- [x] poddlers - abompard
- [x] release-monitoring - zlopez
- https://release-monitoring.org/ is returning 503 - DNS issue on my side
- [x] resultsdb - abompard
- [x] resultsdb-ci-listener - abompard
- [ ] review-stats - jnsamyak
- This is deployed, which means the cronjobs has been set
- cronjob.batch/review-stats-make-html-pages 0 * * * * <none> False 0 <none> 19s
cronjob.batch/review-stats-work-on-bugs 45 0 * * * <none> False 0 <none> 19s
- Does anyone know how to test it? I want to close it but need to know if it is working
- check https://fedoraproject.org/PackageReviewStatus/
- [x] testdays - abompard
- [x] the-new-hotness - zlopez
- [ ] transtats (is it still used?)
- [x] waiverdb - abompard
- [x] webhook2fedmsg - abompard
- [x] websites - abompard
- [x] rename openshift cluster back to ocp.fedoraproject.org
Confirm other apps are all working:
- [x] lists / mailman (web interface and email) - zlopez
- mailman playbook fixed
- gunicorn doesn't serve anything - it doesn't show redirects, which is annoying
- tested out sending the e-mail to infrastructure@lists.fedoraproject.org :-)
- [x] wiki
- [x] rabbitmq cluster working as expected
- [x] dl.fedoraproject.org downloads/rsync - kevin / zlopez / james
- [x] log01 is logging
- [x] noc01 is nocing
- [x] ipsilon works to login to other services with oidc/openid - kevin
- [x] ipa web ui works - kevin
- [x] sundries works to sync content to proxies
- [ ] zodbot working on value01.rdu3 - needs kevin to build limnoria for epel9
- [x] debuginfod serves debuginfo
- [x] registry serves containers - zlopez
- [x] your app not listed above below here....
Monday retrospective items:
- pagure.io was being hammered by AI bots, which made doing commits annoying
- kevin forgot to drop the pdc db so it made moving db01 not so great
- db03 took forever and ever to sync, not sure why.
- gunicorn doesn't return redirects, instead it just returns empty output so testing http://localhost:8000/archives didn't work, but testing http://localhost:8000/archives/ did
- db03 did not want to rsync. Turns out I had to use --inplace or rsync would copy a ~300GB binary file and run out of disk space.
Things that still need to be fixed:
- done now - kevin: override openshift workers in proxy balancers on proxy01/10/*rdu3
- done I think? - kevin: fix acls on dns repo
- fix cloud dns zone that still has a SHA1 signing setup somehow, and switch batcave to DEFAULT crypto policy
- update the ansible ssh root key comment to be more descriptive and drop the older key
-
#### Tuesday
Build pipeline day. Things needed to build, sign, compose
At 10:00 UTC (6am EDT, 11am UK)
- [x] stop postgresql on db-koji01.iad2
- systemctl stop postgresql.service
- [x] shutdown build machines in iad2:
- [x] koji01.iad2
- [x] koji02.iad2
- [x] compose-x86-01.iad2
- [x] compose-rawhide01.iad2
- [x] compose-branched01.iad2
- [x] compose-iot01.iad2
- [x] compose-eln01.iad2
- [x] oci-registry01
- [x] oci-registry02
- [x] oci-candidate-registry
- [x] secondary01
- [x] buildvm-x86-01.iad2.fedoraproject.org - zlopez
- [x] buildvm-x86-02.iad2.fedoraproject.org - zlopez
- [x] buildvm-x86-03.iad2.fedoraproject.org - zlopez
- [x] buildvm-a64-01.iad2.fedoraproject.org
- [x] buildvm-a64-02.iad2.fedoraproject.org
- [x] buildvm-a64-03.iad2.fedoraproject.org
- [x] buildvm-x86-01.rdu3.fedoraproject.org - zlopez
- [x] buildvm-x86-02.rdu3.fedoraproject.org - zlopez
- [x] buildvm-x86-03.rdu3.fedoraproject.org - zlopez
- [x] buildvm-a64-01.rdu3.fedoraproject.org
- [x] buildvm-a64-02.rdu3.fedoraproject.org
- [x] buildvm-a64-03.rdu3.fedoraproject.org
- [x] buildvm-ppc64le-01.iad2.fedoraproject.org - jnsamyak
- [x] buildvm-ppc64le-09.iad2.fedoraproject.org - jnsamyak
- [x] buildvm-ppc64le-18.iad2.fedoraproject.org - jnsamyak
- [x] buildvm-ppc64le-27.iad2.fedoraproject.org - jnsamyak
- [x] buildvm-ppc64le-33.iad2.fedoraproject.org - jnsamyak
- [x] buildvm-s390x-11.s390.fedoraproject.org (these are not in iad2, but sshfs mount fedora_koji volume rw) - zlopez
- [x] buildvm-s390x-12.s390.fedoraproject.org (these are not in iad2, but sshfs mount fedora_koji volume rw) - zlopez
- [x] buildvm-s390x-13.s390.fedoraproject.org (these are not in iad2, but sshfs mount fedora_koji volume rw) - zlopez
- [x] bodhi-backend01
- [x] Scale down the following openshift apps in iad2 prod cluster:
- [x] flatpak-indexer - jnsamyak
- NAME REVISION DESIRED CURRENT TRIGGERED BY
deploymentconfig.apps.openshift.io/flatpak-indexer 128 0 0 config,image(flatpak-indexer:latest)
deploymentconfig.apps.openshift.io/flatpak-indexer-differ 123 0 0 config,image(flatpak-indexer:latest)
deploymentconfig.apps.openshift.io/redis 211 0 0
- [x] remount ro fedora_sourcecache on pkgs01.iad2. 'mount -o remount,ro /srv/cache/lookaside'
- [x] add nftables rules on pkgs01 to block ssh (tcp/22) and web (80, and 443 was added as well) while still allowing 10.3.x.x and 10.16.x.x (see the sketch after this list)
(we need to do this to prevent people from committing more changes, but the machine must stay reachable for syncing from)
- [x] confirm to storage folks internally that everything should be ready to switch storage
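
A hedged sketch of what the pkgs01 lockdown rules could look like, assuming an `inet filter`/`input` chain and /16 masks for the allowed networks (in practice the rules live in the ansible-managed nftables config, and ordering relative to existing accepts matters):
```
# inserted in reverse so the accept ends up ahead of the drop at the head of the chain
nft insert rule inet filter input tcp dport { 22, 80, 443 } drop
nft insert rule inet filter input ip saddr { 10.3.0.0/16, 10.16.0.0/16 } tcp dport { 22, 80, 443 } accept
```
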
At 11:00 UTC (7am EDT, 12noon UK)
- [x] Storage team will reverse the snapmirror for those volumes (iad2 ro, rdu3 rw)
At 14 UTC (7am PDT, 10am EDT, 2-3pm UK)
(kevin awake to take on the below)
- [x] db-koji01 sync and upgrade and bringup in rdu3 - kevin
- [x] pkgs01 sync from iad2. - kevin
- [x] bring up autosign/sign-bridge/vault
- [x] check koji working (via /etc/hosts override)
- [x] openqa bringup?
- [x] sync db-openqa01 over to rdu3 and upgrade and bring up
- [x] deploy openqa01/openqa-lab01 - adamw
- [x] deploy workers - adamw
- deploy/fix openshift apps
- continue to fix issues as they are found
- get nagios happy
Retrospective for tuesday:
- networking outage was a big drag
- mtu issue was annoying to track down and is still not fixed
- firewall issue from proxies -> koji01/02 took a very long time to debug/figure out.
#### Wednesday
- [ ] Bugfix and validate
- [x] bring up bodhi openshift app and test rawhide builds / retag pending ones
- [x] test build/sign/update pipeline with a build / bodhi update / signing - in progress
- [x] switch pkgs.fedoraproject.org / koji.fedoraproject.org in dns
- [x] switch dns for koji and kojipkgs.fedoraproject.org to rdu3
- [x] bring up bodhi-backend01 and test pushes
- [x] remove replication agreements from ipa.iad2/rdu3 - zlopez
- [x] switch all rdu3 hosts to use the rdu3 ipa cluster in /etc/ipa/default.conf (see the sketch after this day's list)
- Remove *iad2* from ansible
- [x] break ipa replications to iad2 and take down iad2 clusters - zlopez
- Replication is broken and the takedown will be done by Greg
- Servers are now down
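
For the /etc/ipa/default.conf switch above, an ad-hoc sketch of the kind of edit involved (hypothetical host pattern and regexp; the real change went out via the ipa/client ansible plays):
```
ansible 'all:&rdu3' -m replace -a "path=/etc/ipa/default.conf regexp='iad2\.fedoraproject\.org' replace='rdu3.fedoraproject.org'"
```
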
#### Thursday
- [ ] Bugfix and validate
- [x] test updates push if ready
- [ ] test rawhide compose if ready
- [x] cloud-image-uploader
- [ ] compose-tracker
- [ ] flatpak-indexer
- [ ] koschei
- [ ] re-enable backups - kevin
- [x] tell people to start reporting issues they find
Start shutting down machines in iad2 that we are done with:
- [ ] set vm's to not autostart ( virsh autostart --disable)
- [ ] set hardware to boot single user mode
- [ ] power off
#### Friday
- [x] us 4th of july holiday
- [ ] Bugfix and validate
- [ ] Confirm that all machines on the list moving to rdu3 are set to boot single user (or otherwise not come up on the network) and are powered off
- [ ] confirm all other machines are powered off.
### 5: stabilize / validate
Friday and the following week we will stabilize and validate that
everything (except staging) is working in RDU3. Additionally, we
will make sure all the old IAD2 instances are powered off and ready
for retirement or shipping. Once shipped assets arrive at rdu3, we will
need to validate that they are ok and install/configure them.
current IAD2 name | notes about RDU3 usage
aarch64:
bvmhost-a64-01 - for openqa in rdu3
bvmhost-a64-02 - for openqa in rdu3
bvmhost-a64-03 - for openqa in rdu3
bvmhost-a64-04 - for openqa in rdu3
buildhw-a64-03 - will be builder in rdu3
buildhw-a64-04 - will be builder in rdu3
buildhw-a64-05 - will be builder in rdu3
buildhw-a64-06 - will be builder in rdu3
openqa-a64-worker04 - for openqa in rdu3
bvmhost-a64-01.stg - will run staging builders
power9:
bvmhost-p09-01.stg - newer - for openqa in rdu3
bvmhost-p09-01 - ibm loaner - builders in rdu3
bvmhost-p09-02 - ibm loaner - builders in rdu3
bvmhost-p09-03 - ibm loaner - builders in rdu3
bvmhost-p09-04 - ibm loaner - builders in rdu3
bvmhost-p09-05 - newer - for openqa in rdu3
openqa-p09-worker01 - ibm loaner - for openqa in rdu3
openqa-p09-worker02 - ibm loaner - for openqa in rdu3
x86_64:
autosign02 128/64 - possible builder in RDU3
bkernel01 64/48 - builder in RDU3
bkernel02 64/48 - builder in RDU3
bvmhost-x86-riscv01 256/128 - will be still riscv01 in RDU3
kernel02 256/128 - will still be kernel01 in RDU3
new sign-vault01 (not in service yet) 64/48 - for openqa in rdu3
sign-vault02 64/48 - possible builder
vmhost-x86-01.stg 256/128 - for openqa in rdu3
vmhost-x86-08 256/128 - for openqa in rdu3
worker04.ocp 256/128 - for openqa in rdu3
worker05.ocp 256/128 - for openqa in rdu3
vmhost-x86-09.stg 256/112 - for openqa in rdu3
bvmhost-x86-02.stg 256/112 - for openqa in rdu3
worker02.ocp 256/192 - for openqa in rdu3 (warranty expires in jan 2025)
worker01.ocp 256/192 - for openqa in rdu3 (warranty expires in jan 2025)
worker01.ocp.stg 256/192 - for openqa in rdu3 (warranty expires in jan 2025)
### 5.5 IAD2 poweroff
See https://hackmd.io/eIIZdZAKSOGPmhhgtAYYaQ?edit
### 6: staging setup / rebalance
Once the above machines are set up and active, we can reclaim
some openqa space and finish building out staging.
### Post move parking lot
This section is for things we would love to do before/during the move, but
realize we have no time to do so or they would otherwise cause disruption.
* Change our ansible repo setup and use a web frontend and/or gitops thing
* Change backups from rdiff-backup
* Use wireguard in RDU3 to replace the openvpn setup from IAD2
* Change from using network_connections to network_state in linux-system-roles/network
* Unrelated to the move, but ANUBIS
*