# Fedora IPA outage
## State of machines
* ipa01.iad2.fedoraproject.org: Running
* ipa02.iad2.fedoraproject.org: Running
* ipa03.iad2.fedoraproject.org: Running
* noggin: pointing at all 3 servers (`/etc/openshift_apps/noggin/configmap.yml`) and registered with ipa01 (`/etc/openshift_apps/noggin/configmap-ipa-client.yml`)
* fasjson: pointing at and registered with ipa01 (`/etc/openshift_apps/fasjson/configmap-ipa-client.yml`)
### ipa01
Accidentally removed from the topology by running `ipa server-del ipa01.iad2.fedoraproject.org`. Partially restored by creating an entry for ipa01 in LDAP:
```
$ ipa server-show ipa03.iad2.fedoraproject.org --all --raw > 90-ipa01.update
# After update to ipa01
$ cat 90-ipa01.update
dn: cn=ipa01.iad2.fedoraproject.org,cn=masters,cn=ipa,cn=etc,dc=fedoraproject,dc=org
default:cn: ipa01.iad2.fedoraproject.org
default:iparepltopomanagedsuffix: dc=fedoraproject,dc=org
default:iparepltopomanagedsuffix: o=ipaca
default:ipamindomainlevel: 1
default:ipamaxdomainlevel: 1
default:objectClass: top
default:objectClass: nsContainer
default:objectClass: ipaReplTopoManagedServer
default:objectClass: ipaConfigObject
default:objectClass: ipaSupportedDomainLevelConfig
$ ipa-ldap-updater ./90-ipa01.update
```
But `ipactl status` now throws:
```
[root@ipa01 ~][PROD-IAD2]# ipactl status
Failed to get list of services to probe status!
Configured hostname 'ipa01.iad2.fedoraproject.org' does not match any master server in LDAP: ipa03.iad2.fedoraproject.org
ipa01.iad2.fedoraproject.org
```
And the services pointing to ipa01 don't work (https://accounts.fedoraproject.org)
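
To compare that error against what LDAP actually holds, the master entries can be listed from a replica that still works. A minimal sketch, assuming a valid Kerberos ticket on that replica; `master_dn` is a hypothetical helper that only rebuilds the DN pattern used in the update file above:

```shell
#!/bin/sh
# Build the DN of a server's entry under cn=masters.
# $1 = server FQDN, $2 = LDAP suffix
master_dn() {
    printf 'cn=%s,cn=masters,cn=ipa,cn=etc,%s\n' "$1" "$2"
}

# Possible usage (assumption: run on a healthy replica after kinit):
#   ldapsearch -Y GSSAPI -LLL -s base \
#       -b "$(master_dn ipa01.iad2.fedoraproject.org dc=fedoraproject,dc=org)" dn
```

If the entry exists but `ipactl status` still complains, the problem is more likely in the entry's content or its child service entries than in the DN itself.
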
Shut down from vmhost `virsh shutdown ipa01.iad2.fedoraproject.org`
Tried to re-initialize from ipa02, which didn't work:
```
[root@ipa01 ~][PROD-IAD2]# ipa-replica-manage re-initialize --from ipa02.iad2.fedoraproject.org
Re-run /usr/sbin/ipa-replica-manage with --verbose option to get more information
Unexpected error: cannot connect to 'ldaps://ipa01.iad2.fedoraproject.org:636': Transport endpoint is not connected
```
Backed up the machine and started the replication process from ipa02
Playbook run finished without error and the ipa server seems to be running
Everything was redirected back to ipa01
CA renewal role was moved to ipa01
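
One way to confirm where the CA renewal role currently lives is to read it from `ipa config-show`. A minimal sketch, assuming the output contains a line of the form `IPA CA renewal master: <fqdn>`; `ca_renewal_master` is a hypothetical helper name:

```shell
#!/bin/sh
# Print the CA renewal master from `ipa config-show` output on stdin.
# Assumes a line of the form "  IPA CA renewal master: <fqdn>".
ca_renewal_master() {
    sed -n 's/^[[:space:]]*IPA CA renewal master:[[:space:]]*//p'
}

# Possible usage (assumption: enrolled host, valid admin ticket):
#   ipa config-show | ca_renewal_master
# Moving the role:
#   ipa config-mod --ca-renewal-master-server ipa01.iad2.fedoraproject.org
```
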
### ipa02
After unsuccessfully trying to update to RHEL9 (this is where the accident on ipa01 happened), restored from backup on vmhost:
```
$ virsh define ipa02.iad2.fedoraproject.org-2024-01-25.xml
$ lvrename /dev/vg_guests/ipa02.iad2.fedoraproject.org_2024-01-25_-el8 /dev/vg_guests/ipa02.iad2.fedoraproject.org
```
It seems to be working without issue
https://accounts.fedoraproject.org got redirected to ipa02 and started working
fasjson is now redirected as well
Took a backup with `ipa-backup --online --data`
CA renewal role was moved from ipa02 to ipa01
Backed up the machine and started the migration to RHEL9
Playbook run finished without error and the ipa server seems to be running fine
### ipa03
`kinit` fails with the error:
```
kinit: Generic error (see e-text) while getting initial credentials
```
`ipactl status` hangs indefinitely
`reboot` didn't help
Error in journal `Jan 25 14:42:49 ipa03.iad2.fedoraproject.org ns-slapd[1597]: GSSAPI Error: No credentials were supplied, or the credentials were unavailable or inaccessible (Cannot contact any KDC for realm 'FEDORAPROJECT.ORG')`
Shut down from vmhost `virsh shutdown ipa03.iad2.fedoraproject.org`
Backed up the machine and started the replication process from ipa02
Playbook run finished without error and the ipa server seems to be running
## Plan of action
1. Redirect everything to ipa02 - Done
2. Backup ipa01 on vmhost-x86-02
```
$ virsh dumpxml ipa01.iad2.fedoraproject.org > ipa01.iad2.fedoraproject.org_YYYY-MM-DD.xml
$ lvrename /dev/vg_guests/ipa01.iad2.fedoraproject.org /dev/vg_guests/ipa01.iad2.fedoraproject.org_YYYY-MM-DD
```
3. Remove ipa01 from replication agreement on ipa02 `ipa server-del ipa01.iad2.fedoraproject.org`
4. Replicate ipa01 from ipa02
```
$ ansible-playbook /srv/web/infra/ansible/playbooks/destroy_virt_inst.yml -e target=ipa01.iad2.fedoraproject.org
$ ansible-playbook /srv/web/infra/ansible/playbooks/groups/ipa.yml -l ipa01.iad\*
```
5. Backup ipa03 on vmhost-x86-06
```
$ virsh dumpxml ipa03.iad2.fedoraproject.org > ipa03.iad2.fedoraproject.org_YYYY-MM-DD.xml
$ lvrename /dev/vg_guests/ipa03.iad2.fedoraproject.org /dev/vg_guests/ipa03.iad2.fedoraproject.org_YYYY-MM-DD
```
6. Remove ipa03 from replication agreement on ipa02 `ipa server-del ipa03.iad2.fedoraproject.org`
7. Replicate ipa03 from ipa02
```
$ ansible-playbook /srv/web/infra/ansible/playbooks/destroy_virt_inst.yml -e target=ipa03.iad2.fedoraproject.org
$ ansible-playbook /srv/web/infra/ansible/playbooks/groups/ipa.yml -l ipa03.iad\*
```
8. Redirect everything to ipa01
fasjson, noggin, haproxy
9. Assign CA Renewal role to ipa01
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/migrating_to_identity_management_on_rhel_9/index#assigning-the-ca-renewal-server-role-to-the-rhel-9-idm-server_assembly_migrating-your-idm-environment-from-rhel-8-servers-to-rhel-9-servers
10. Backup ipa02 on vmhost-x86-03
```
$ virsh dumpxml ipa02.iad2.fedoraproject.org > ipa02.iad2.fedoraproject.org_YYYY-MM-DD.xml
$ lvrename /dev/vg_guests/ipa02.iad2.fedoraproject.org /dev/vg_guests/ipa02.iad2.fedoraproject.org_YYYY-MM-DD
```
11. Remove ipa02 from replication agreement on ipa01 `ipa server-del ipa02.iad2.fedoraproject.org --force`
12. Replicate ipa02 from ipa01 on RHEL9
```
$ ansible-playbook /srv/web/infra/ansible/playbooks/destroy_virt_inst.yml -e target=ipa02.iad2.fedoraproject.org
$ ansible-playbook /srv/web/infra/ansible/playbooks/groups/ipa.yml -l ipa02.iad\*
```
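
The backup, removal and rebuild cycle in the plan above is the same for each server, so it can be generated rather than retyped. A minimal sketch that only prints the commands and runs nothing; `plan_host` is a hypothetical helper, and the commands it prints still have to be run in the right place (`virsh` and `lvrename` on the vmhost, `ipa server-del` on a remaining replica, the playbooks from wherever they are normally run):

```shell
#!/bin/sh
# Print (do not run) the backup and rebuild commands for one IPA host.
# $1 = host FQDN, $2 = backup date (YYYY-MM-DD)
plan_host() {
    host=$1
    stamp=$2
    short=${host%%.*}    # e.g. ipa01
    printf '%s\n' "virsh dumpxml $host > ${host}_${stamp}.xml"
    printf '%s\n' "lvrename /dev/vg_guests/$host /dev/vg_guests/${host}_${stamp}"
    # Step 11 of the plan adds --force here; it is left out of this sketch.
    printf '%s\n' "ipa server-del $host"
    printf '%s\n' "ansible-playbook /srv/web/infra/ansible/playbooks/destroy_virt_inst.yml -e target=$host"
    printf '%s\n' "ansible-playbook /srv/web/infra/ansible/playbooks/groups/ipa.yml -l ${short}.iad\\*"
}

# Possible usage:
#   plan_host ipa03.iad2.fedoraproject.org "$(date +%F)"
```

Printing instead of executing keeps a human in the loop for each destructive step, which matters here because the outage started with a `server-del` run against the wrong host.
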
## Post plan actions
New accounts couldn't log in; this was caused by a missing ID range and SID. Solved in https://pagure.io/fedora-infrastructure/issue/11740