# Fedora IPA outage

## State of machines

* ipa01.iad2.fedoraproject.org: Running
* ipa02.iad2.fedoraproject.org: Running
* ipa03.iad2.fedoraproject.org: Running
* noggin: pointing at all 3 servers (`/etc/openshift_apps/noggin/configmap.yml`) and registered with ipa01 (`/etc/openshift_apps/noggin/configmap-ipa-client.yml`)
* fasjson: pointing at and registered with ipa01 (`/etc/openshift_apps/fasjson/configmap-ipa-client.yml`)

### ipa01

Accidentally removed from the topology by running `ipa server-del ipa01.iad2.fedoraproject.org`.

Partially restored by creating an entry for ipa01 in LDAP:

```
$ ipa server-show ipa03.iad2.fedoraproject.org --all --raw > 90-ipa01.update
# After update to ipa01
$ cat 90-ipa01.update
dn: cn=ipa01.iad2.fedoraproject.org,cn=masters,cn=ipa,cn=etc,dc=fedoraproject,dc=org
default:cn: ipa01.iad2.fedoraproject.org
default:iparepltopomanagedsuffix: dc=fedoraproject,dc=org
default:iparepltopomanagedsuffix: o=ipaca
default:ipamindomainlevel: 1
default:ipamaxdomainlevel: 1
default:objectClass: top
default:objectClass: nsContainer
default:objectClass: ipaReplTopoManagedServer
default:objectClass: ipaConfigObject
default:objectClass: ipaSupportedDomainLevelConfig
$ ipa-ldap-updater ./90-ipa01.update
```

But `ipactl status` now fails with:

```
[root@ipa01 ~][PROD-IAD2]# ipactl status
Failed to get list of services to probe status!
Configured hostname 'ipa01.iad2.fedoraproject.org' does not match any master server in LDAP:
ipa03.iad2.fedoraproject.org
ipa01.iad2.fedoraproject.org
```

And the services pointing to ipa01 don't work (https://accounts.fedoraproject.org).

Shut down from the vmhost: `virsh shutdown ipa01.iad2.fedoraproject.org`

Tried to re-initialize from ipa02, which didn't work:

```
[root@ipa01 ~][PROD-IAD2]# ipa-replica-manage re-initialize --from ipa02.iad2.fedoraproject.org
Re-run /usr/sbin/ipa-replica-manage with --verbose option to get more information
Unexpected error: cannot connect to 'ldaps://ipa01.iad2.fedoraproject.org:636': Transport endpoint is not connected
```

Backed up the machine and started the replication process from ipa02.

Playbook run finished without error and the ipa server seems to be running.

Everything was redirected back to ipa01.

CA renewal role was moved to ipa01.

### ipa02

After unsuccessfully trying to update to RHEL9 (this is where the accident on ipa01 happened), restored from backup on the vmhost:

```
$ virsh define ipa02.iad2.fedoraproject.org-2024-01-25.xml
$ lvrename /dev/vg_guests/ipa02.iad2.fedoraproject.org_2024-01-25_-el8 /dev/vg_guests/ipa02.iad2.fedoraproject.org
```

It seems to be working without issue.

https://accounts.fedoraproject.org got redirected to ipa02 and started working.

fasjson is now redirected as well.

Did a backup: `ipa-backup --online --data`

CA renewal role was moved from ipa02 to ipa01.

Backed up the machine and started the migration to RHEL9.

Playbook run finished without error and the ipa server seems to be running fine.

### ipa03

kinit doesn't work, failing with the error:

```
kinit: Generic error (see e-text) while getting initial credentials
```

`ipactl status` hangs indefinitely.

`reboot` didn't help.

Error in the journal: `Jan 25 14:42:49 ipa03.iad2.fedoraproject.org ns-slapd[1597]: GSSAPI Error: No credentials were supplied, or the credentials were unavailable or inaccessible (Cannot contact any KDC for realm 'FEDORAPROJECT.ORG')`

Shut down from the vmhost: `virsh shutdown ipa03.iad2.fedoraproject.org`

Backed up the machine and started the replication process from ipa02.

Playbook run finished without error and the ipa server seems to be running.

## Plan of action

1. Redirect everything to ipa02 - Done
2. Backup ipa01 on vmhost-x86-02

   ```
   $ virsh dumpxml ipa01.iad2.fedoraproject.org > ipa01.iad2.fedoraproject.org_YYYY-MM-DD.xml
   $ lvrename /dev/vg_guests/ipa01.iad2.fedoraproject.org /dev/vg_guests/ipa01.iad2.fedoraproject.org_YYYY-MM-DD
   ```

3. Remove ipa01 from replication agreement on ipa02

   `ipa server-del ipa01.iad2.fedoraproject.org`

4. Replicate ipa01 from ipa02

   ```
   $ ansible-playbook /srv/web/infra/ansible/playbooks/destroy_virt_inst.yml -e target=ipa01.iad2.fedoraproject.org
   $ ansible-playbook /srv/web/infra/ansible/playbooks/groups/ipa.yml -l ipa01.iad\*
   ```

5. Backup ipa03 on vmhost-x86-06

   ```
   $ virsh dumpxml ipa03.iad2.fedoraproject.org > ipa03.iad2.fedoraproject.org_YYYY-MM-DD.xml
   $ lvrename /dev/vg_guests/ipa03.iad2.fedoraproject.org /dev/vg_guests/ipa03.iad2.fedoraproject.org_YYYY-MM-DD
   ```

6. Remove ipa03 from replication agreement on ipa02

   `ipa server-del ipa03.iad2.fedoraproject.org`

7. Replicate ipa03 from ipa02

   ```
   $ ansible-playbook /srv/web/infra/ansible/playbooks/destroy_virt_inst.yml -e target=ipa03.iad2.fedoraproject.org
   $ ansible-playbook /srv/web/infra/ansible/playbooks/groups/ipa.yml -l ipa03.iad\*
   ```

8. Redirect everything to ipa01: fasjson, noggin, haproxy
9. Assign CA Renewal role to ipa01

   https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/migrating_to_identity_management_on_rhel_9/index#assigning-the-ca-renewal-server-role-to-the-rhel-9-idm-server_assembly_migrating-your-idm-environment-from-rhel-8-servers-to-rhel-9-servers

10. Backup ipa02 on vmhost-x86-03

    ```
    $ virsh dumpxml ipa02.iad2.fedoraproject.org > ipa02.iad2.fedoraproject.org_YYYY-MM-DD.xml
    $ lvrename /dev/vg_guests/ipa02.iad2.fedoraproject.org /dev/vg_guests/ipa02.iad2.fedoraproject.org_YYYY-MM-DD
    ```

11. Remove ipa02 from replication agreement on ipa01

    `ipa server-del ipa02.iad2.fedoraproject.org --force`

12. Replicate ipa02 from ipa01 on RHEL9

    ```
    $ ansible-playbook /srv/web/infra/ansible/playbooks/destroy_virt_inst.yml -e target=ipa02.iad2.fedoraproject.org
    $ ansible-playbook /srv/web/infra/ansible/playbooks/groups/ipa.yml -l ipa02.iad\*
    ```

## Post plan actions

New accounts couldn't log in; this was caused by a missing ID range and SID. Solved in https://pagure.io/fedora-infrastructure/issue/11740
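
The backup steps in the plan of action repeat the same pair of commands for each host, which makes it easy to mistype a hostname or date. A minimal sketch of a helper that only prints the two commands for review is shown below. The function name `backup_cmds` and the print-only approach are assumptions for illustration and were not part of the original procedure; the `/dev/vg_guests` path and the `_YYYY-MM-DD` suffix follow the plan of action.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original procedure): print, but do not
# run, the vmhost backup commands for one IPA guest so they can be reviewed
# before being pasted on the right vmhost.
backup_cmds() {
    local host="$1"
    # Use the given date, or today's date in YYYY-MM-DD form if none is given.
    local stamp="${2:-$(date +%Y-%m-%d)}"
    # Logical volume path, following the naming used in the plan of action.
    local lv="/dev/vg_guests/${host}"

    # Save the guest definition, then rename the LV so a fresh guest can be built.
    printf 'virsh dumpxml %s > %s_%s.xml\n' "$host" "$host" "$stamp"
    printf 'lvrename %s %s_%s\n' "$lv" "$lv" "$stamp"
}

# Example: show the commands for ipa01 with an explicit date.
backup_cmds ipa01.iad2.fedoraproject.org 2024-01-25
```

Printing instead of executing keeps the destructive `lvrename` a deliberate manual step; the output still has to be run on the vmhost that actually hosts the guest.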