Post-mortem for broken RHEL 9 mockbuild in osbuild CI

What happened

Today, the RHEL 9 mockbuild broke in osbuild CI. Basically, installing mock pulled in a lot of package updates, and during those updates the SSH connection dropped and couldn't be re-established.

Moar context

Earlier this week, a mass RHEL 9 rebuild happened. As RHEL 9 is still in development, a rebuild like this can break backward compatibility and updates.

Also earlier this week, we redeployed composer.osbuild.org. There was a kinda big change for the official RHEL 9 guest images: we now build them using v2 manifests (so basically the whole definition was rewritten from scratch).

Our CI runs guest images in AWS (EC2 images are not yet available for RHEL 9). Yes, we build them on composer.osbuild.org. Yes, these are the images from the previous paragraph.

My first (I think correct) assumption

My first assumption was that the mass rebuild indeed broke updates. As it doesn't make much sense to care about older RHEL 9 content, the best fix is to just update the images that Schutzbot uses.

So Jakub and I tried exactly that. Unfortunately, we couldn't SSH into the EC2 instances booted from the updated images. Shortly after that, I discovered that the images weren't booting at all.

My second (and fatally wrong) assumption

"OMFG, I think we broke something with the v1->v2 manifest migration." Yeah, I panicked a little (j/k, I panicked a lot), declared our new image definitions for RHEL 9 broken, spammed several people, and left my computer because I sadly had some errands.

Things got even weirder soon after that, because Christian tried booting the latest RHEL 9 images and they worked fine for him.

The realization

After I read the message from Christian, I started to think about what we were doing differently in EC2. Then I realized the whole truth: We uploaded qcow2 images into EC2! But it doesn't work that way! EC2 cannot consume qcow2! We always (conditions apply) have to use raw images! So I tried converting the qcow2 into raw and uploading it, and voila - the image happily booted!
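For the record, the forgotten step can be guarded against with a small check before the upload. This is only a minimal sketch, not what our CI actually does: the helper names (`is_qcow2`, `convert_command`, `ensure_raw`) are made up for illustration, and it assumes `qemu-img` is installed on the machine doing the upload. The only hard facts it relies on are that qcow2 files start with the four magic bytes `QFI\xfb` and that `qemu-img convert -f qcow2 -O raw` does the conversion.

```python
import subprocess
from pathlib import Path

# Every qcow2 file starts with these four bytes ("QFI" followed by 0xfb).
QCOW2_MAGIC = b"QFI\xfb"


def is_qcow2(path):
    """Return True if the file at `path` starts with the qcow2 magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == QCOW2_MAGIC


def convert_command(src, dst):
    """Build the qemu-img command line that converts qcow2 to raw."""
    return ["qemu-img", "convert", "-f", "qcow2", "-O", "raw", str(src), str(dst)]


def ensure_raw(src):
    """Hypothetical helper: return a path to a raw image, converting if needed.

    Files that are not qcow2 are returned untouched; this sketch does not
    try to recognize any other image format.
    """
    src = Path(src)
    if not is_qcow2(src):
        return src
    dst = src.with_suffix(".raw")
    # Assumes qemu-img is on PATH; raises CalledProcessError on failure.
    subprocess.run(convert_command(src, dst), check=True)
    return dst
```

Calling `ensure_raw("disk.qcow2")` right before the upload would then hand back `disk.raw`, so the missing conversion can't silently slip through again.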

So the images are still bootable and we are fine. \o/ I just forgot one step. Sorry.

Narrator: This wasn't the end though.

The changed image definition strikes back

Unfortunately, even when the image booted, it failed to install s3cmd. s3cmd isn't available in RHEL 9, so we have to install it from our private COPR. The important thing is that s3cmd requires python-magic. Because python-magic is also not available in RHEL 9, it also lives in our internal COPR.

However, after the update, we weren't able to install s3cmd because python-file-magic was already installed and it conflicts with python-magic.

My third (and also wrong) assumption

Ah! This must be something with the mass-rebuild, right? They surely changed something in the Python packaging and dependencies are now broken. So I started looking at all brew builds of s3cmd, python-magic, and python-file-magic. But nothing changed during the mass rebuild.

The second realization

After some time I realized what changed: I was the one who added python-file-magic into the guest image! Actually, I didn't add python-file-magic specifically; I added insights-client, which requires python-file-magic. I added it for parity with the 8.5 images (it also makes sense to include it, and the reason it wasn't there before is that it didn't exist when Jozef originally created the RHEL 9 definition). So that's how python-file-magic got in there and conflicted with python-magic, which s3cmd requires.

The fix

After some digging, I found out that python-file-magic and python-magic provide basically the same API (that's why they conflict with each other). I also found out that s3cmd in Fedora turns off the dependency generator and just pulls in python-file-magic directly. But it does that only for Fedora. So I extended the condition to cover RHEL 9 as well, and voila! s3cmd now requires python-file-magic instead of python-magic, so there's no conflict anymore and everybody's happy.
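To make the before/after easier to see, here is a toy model of the dependency situation. It's a sketch under loud assumptions: it is not how dnf resolves anything, the package names are just the ones used in this post, the conflict is assumed to be declared on both sides, and `make_repo`, `closure`, and `find_conflicts` are invented for illustration.

```python
def make_repo(s3cmd_requires):
    """Toy package metadata; `s3cmd_requires` is the magic library s3cmd pulls in."""
    return {
        "insights-client": {"requires": {"python-file-magic"}, "conflicts": set()},
        # Assumption: both magic packages declare the conflict.
        "python-file-magic": {"requires": set(), "conflicts": {"python-magic"}},
        "python-magic": {"requires": set(), "conflicts": {"python-file-magic"}},
        "s3cmd": {"requires": {s3cmd_requires}, "conflicts": set()},
    }


def closure(name, packages):
    """Return `name` plus everything it transitively requires."""
    seen = set()
    stack = [name]
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue
        seen.add(pkg)
        stack.extend(packages[pkg]["requires"])
    return seen


def find_conflicts(installed, name, packages):
    """List conflicting pairs that installing `name` on top of `installed` would cause."""
    wanted = set(installed) | closure(name, packages)
    return sorted(
        (a, b)
        for a in wanted
        for b in packages[a]["conflicts"]
        if b in wanted and a < b  # report each pair only once
    )


# Before the fix: s3cmd requires python-magic.
before = make_repo("python-magic")
# After the fix: s3cmd requires python-file-magic, as it already did on Fedora.
after = make_repo("python-file-magic")

# The guest image ships insights-client, which drags in python-file-magic.
image = closure("insights-client", before)

print(find_conflicts(image, "s3cmd", before))
print(find_conflicts(image, "s3cmd", after))
```

The first call reports the python-file-magic / python-magic pair, and the second reports nothing, which is the whole fix in a nutshell: same API, one package fewer to fight over.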

Summary

So yeah, in the end, both the mass-rebuild and the change of the image definition indeed broke us, but in a slightly different way than I originally thought. Images are fun. :)