
Fedora Zabbix planning doc

Situation

There is a desire to migrate from Nagios to Zabbix in the Fedora Infrastructure. Ticket 11393 has some details and history.

What happened last time

The last effort here got as far as building out Zabbix in staging and having it working in principle. However (talking with David, James, Kevin and Steve), it seems that a few issues got in the way of completing the migration:

  • The sheer volume of alerts caused people to ignore Zabbix
  • Adoption of Zabbix by applications was low to none
  • New things were still added to Nagios instead
  • The sheer number of items that needed to be migrated once the prototype was built

Risks

There are a few things we'll want to be careful of:

  • Repeating the mistakes of last time
  • Missing important checks during migration
  • Splitting attention between Nagios and Zabbix

Challenges

Size-wise, we have ~500 hosts and ~2000 services in Nagios. That's not too bad, especially as all this will belong to hostgroups. It is likely that each change we make to Zabbix/Ansible will migrate significant chunks of that, at least initially.

We've also got way too many problems listed in the Staging Zabbix, so we need to get on top of that - but perhaps the way to do so is to nuke it from orbit?

On top of the above, I think there's some aspirational stuff to think about:

Koji Builders (added 19/3)

Apparently we need a much lighter touch on the Koji builders, because they are often maxed out. Because Zabbix is hierarchical, that probably means a really basic auto-register template, and then a "proper" base role that is applied by Ansible to most hosts, but not Koji.
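
One way to get that light touch: a minimal sketch, assuming we use Zabbix's HostMetadata-based auto-registration to link only a tiny template to the builders (the drop-in path and handler name below are illustrative, not existing code):

```yaml
# Tag Koji builders via HostMetadata so an auto-registration action can
# link just the lightweight template to them (names are placeholders).
- name: Set Zabbix agent host metadata on Koji builders
  ansible.builtin.copy:
    dest: /etc/zabbix/zabbix_agentd.d/metadata.conf
    content: |
      HostMetadata=koji-builder
    mode: "0644"
  notify: restart zabbix-agent
```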

Harmonizing with CentOS

Fabian has done a lot of work to automate Zabbix in the CentOS infrastructure. I've spent the last couple of weeks working with that, and I think it would make a lot of sense to lift as much of that structure as we can into this project, if only to save us time.

This would mostly be about how to structure the Ansible code rather than the specific scripts/checks to execute, as those will be Fedora-specific.

Example:
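
A minimal sketch of the kind of structure we could lift - registering a host into its hostgroups and linking templates from Ansible via the community.zabbix collection (the group/template names are invented, and the API connection settings are omitted):

```yaml
# Illustrative only; parameter names follow the community.zabbix.zabbix_host
# module docs, while the group and template names are placeholders.
- name: Register host in Zabbix with its hostgroups and templates
  community.zabbix.zabbix_host:
    host_name: "{{ inventory_hostname }}"
    host_groups:
      - Fedora/Staging
    link_templates:
      - Template Fedora Base
    state: present
```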

Going further with Ansible

We could go even further than CentOS. According to the Zabbix Ansible Collection docs, there are modules for Template, Discovery, and Prototype objects. Currently CentOS handles those by hand in the UI, but as a stretch goal we could look at automating this too.

EDIT 19/3: We do have some code for adding the templates to git, so perhaps we can expand on this.
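
A rough sketch of what expanding that could look like, assuming we use community.zabbix.zabbix_template_info (the template name and destination path are placeholders, and the exact format/return-key names should be checked against the installed collection version):

```yaml
- name: Export a template definition from Zabbix
  community.zabbix.zabbix_template_info:
    template_name: Template Fedora Base  # placeholder
    format: yaml
    omit_date: true
  register: exported

- name: Write the export into the git-tracked directory
  ansible.builtin.copy:
    content: "{{ exported.template_yaml }}"  # return key varies with format
    dest: "{{ templates_repo_dir }}/Template_Fedora_Base.yaml"
```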

Plan / Phases

We'll want to break this down. I'm assuming a "normal" migration style here, where we move checks over piecemeal. See below for notes on a possible alternative.

This is pretty high-level; I'm not going into the detail of specific applications or templates here. In general my principles are (a) reuse as much as possible from CentOS, since they already have a fully functioning Zabbix setup, and (b) reuse the Staging work in Prod. Both can be done via Import/Export in Zabbix, with any needed edits along the way.

Phase 0 - Setup Staging

Here we build out Staging with the Zabbix Server and any necessary Proxies (I'm not yet familiar enough with the network layout to see what we need here - input welcome!). Much of this already exists, and Mark has been working to update Zabbix to 7.0 in Staging.

  • Build/update Zabbix Server playbook and run it (see the sketch after this list)
  • Build/update Zabbix Proxy playbook and run it
  • Build/update Zabbix Agent playbook and run it
  • Copy basic host templates from CentOS because they work
  • Setup notification channels
    • We may want multiple Matrix channels for info/alert/critical so that we can get the most eyes on critical things.
    • Or perhaps multiple bots in one channel, so one can mute the noisier ones
  • Test & review basic host monitoring
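
As a sketch, the server/proxy/agent playbooks above could lean on the roles shipped in the community.zabbix collection (the inventory group names here are placeholders):

```yaml
- hosts: zabbix_server_stg
  roles:
    - role: community.zabbix.zabbix_server

- hosts: zabbix_proxies_stg
  roles:
    - role: community.zabbix.zabbix_proxy

- hosts: all
  roles:
    - role: community.zabbix.zabbix_agent
```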

EDIT 19/3 - Looking through staging, we have a LOT of unused templates that make it hard to see what's what. We might want to clean that up. Ideally I'd like to see only the templates we're actually using, rather than 7 pages of empty defaults :)

We also need to get the notification spam down here; that's probably part of reviewing the base templates.

Phase 1 - Application monitoring in Staging

Identify services we want to set up monitoring on (help wanted here, my familiarity with our stack is low), and port their checks from Nagios to Zabbix (or write new ones).

Per application:

  • Zabbix template created, either by hand or by Ansible
    • Item / Trigger prototypes for the service to monitor
  • Ansible code to add a server to that template
  • Ansible code to deploy LLD config for the prototypes (see the sketch after this list)
  • Ansible code to deploy checks & report metrics
  • Test notifications get to the right channels on Matrix (and elsewhere?)
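
To make the LLD step concrete, a minimal sketch of the pieces Ansible would deploy per application (the script name, key, and paths are all invented for illustration):

```yaml
- name: Install the discovery script for the application
  ansible.builtin.copy:
    src: discover_workers.sh
    dest: /usr/local/bin/discover_workers.sh
    mode: "0755"

- name: Register the discovery key with the Zabbix agent
  ansible.builtin.copy:
    dest: /etc/zabbix/zabbix_agentd.d/app_discovery.conf
    content: |
      # The script must print LLD JSON, e.g. {"data": [{"{#WORKER}": "w1"}]}
      UserParameter=app.workers.discovery,/usr/local/bin/discover_workers.sh
    mode: "0644"
  notify: restart zabbix-agent
```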

At this point we can repeat this for enough applications that we have confidence in the approach and the Zabbix-Ansible codebase.

Note that we can't remove Nagios in Staging, because during this project we'll still want to test changes to Nagios before rolling them out to Prod.

Phase 2 - Initial rollout in Prod

Basically repeat Phase 0 but in Prod. This should be straightforward changes to the Ansible codebase to manage both environments.

  • Update Ansible to enable Prod
  • Roll out Zabbix Server & Proxies
    • Dump templates from Staging and import in Prod (see the sketch after this list)
    • Notifications - Create new Matrix channels / bots
  • Roll out at least one Agent and test
    • Add the test agent server to the necessary hostgroups
  • Roll out agents everywhere to enable basic monitoring
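
The dump/import step could be scripted roughly like this, assuming the templates were exported to files as described earlier (community.zabbix.zabbix_template accepts an exported definition; the path is a placeholder):

```yaml
- name: Import a template exported from Staging into Prod
  community.zabbix.zabbix_template:
    template_json: "{{ lookup('file', 'templates/Template_Fedora_Base.json') }}"
    state: present
```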

Phase 3 - Application rollout in Prod

Notably, to avoid the issues we encountered last time, we should freeze the Nagios config when we reach this phase. Once we start migrating applications to Zabbix, no new things should be added to Nagios. As applications are migrated in Prod, their Ansible/Hosts/Services can be entirely removed from Nagios in both envs.

EDIT: After some discussion, it seems the consensus is not to remove things from Nagios, but rather to add things to Zabbix until it is largely complete, and then shut down Nagios entirely.

Again, largely a repeat of Phase 1. This should always be tested in Staging first, per application, so it's mostly about enabling Ansible for Prod and copying config from the Staging Zabbix.

  • Dump/import Template from Staging Zabbix
    • Add host to hostgroup
    • Update Ansible to deploy LLD to Prod
    • Update Ansible to deploy checks to Prod
  • Check deployment looks the same as Staging
  • Test notifications are working
  • Remove application / host from Nagios

Definition of Done

As Phase 3 rolls out, Nagios will get smaller and quieter. Thus the natural Definition of Done is when Nagios is empty :)

Open Questions

Datacenter move

Kevin has suggested that we could consider not deploying Nagios in the new datacenter.

That would change Phases 2 & 3, and would require us to have a confident execution of Phases 0 & 1 in Staging before the move begins, but would mean we have significantly less work to do in getting rid of Nagios inertia later.

This feels like a tight goal given the current dates of the migration - but they may slip. It's a bit of a race, but perhaps worth aiming for?

Ansible Structure

CentOS uses a separate Git repo per role, which we can't do, but I think it should still be possible to use a similar and consistent pattern of putting monitoring tasks in monitoring.yml for each role. This makes it easy to identify and maintain them in the future. It will also make it clear when we've done a role, since we don't use this pattern currently (AFAICT we have monitoring tasks in main.yml).
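
A sketch of the pattern, using standard Ansible role layout (nothing here exists yet):

```yaml
# roles/<role>/tasks/main.yml
- name: Include this role's monitoring tasks
  ansible.builtin.import_tasks: monitoring.yml

# roles/<role>/tasks/monitoring.yml would then hold only the Zabbix
# checks / LLD / template wiring for that role.
```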

I haven't explored the use of Ansible for Zabbix templates/lld/etc yet, and given the datacenter move timeline in the previous point, I'd suggest we revisit this once the core build is done. It should be easy to export the hand-made templates and verify Ansible generates the same thing. We could at least commit the exported templates to a git repo, to keep a record of them.

Openshift? (added 19/3)

How do we monitor Openshift? As more work moves to clustered approaches, we want to make sure we have templates for this. Needs a research spike.

Distribution of work / SIG

We already have some people interested, and we should work out how to allocate things. These are:

  • Definitely interested in working on this:
    • Greg - new to the infra, but has full-time availability to work on this (minus time needed for CentOS, but assume a minimum of 3 days/week on average)
    • Mark - knows most of the infra but has a day job to do, time limited
  • Maybe working on it a bit?
    • James & David - full-time, were involved in the last attempt, but now have other workloads to deal with too. They have lots of background, so useful as advisors at least?
  • Advisory:
    • Kevin - Knows everything :)

Steve suggested we might consider forming a SIG for this, as it's a big effort, and it would formalise some of the structure. Thoughts on this are welcome; I'm not sure how I feel about it yet.

Next Steps

Mark has created PR#2504 for upgrading the Zabbix setup in Staging from 6.0 LTS to 7.0 LTS. Reviewing the base checks for all hosts and matching them to Nagios is probably next. We also probably want to sort the notification channels/bots out and bring the spam levels down a bit.

In terms of comms, we probably want to get this posted to Discourse once we're happy with it, I can do that.

From a tracking perspective, once we agree on a plan, we'll probably need to track a bunch of this in Pagure tickets - doing it all in the one ticket might be a bit much.

Meeting logs

2025-03-14

Action items

General Tasks and Notes

Tasks:

  • Need to give the zabbix user NOPASSWD sudo over /usr/bin/nmap (see the sketch below)
  • Update community.zabbix to 3.0.0 to fix playbook issues
  • Clean up templates (dump them first?) as most are unused
  • Work on simple koji-acceptable auto-register template
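
For the first task above, a minimal sketch (the drop-in filename is an assumption):

```yaml
- name: Allow the zabbix user to run nmap via sudo without a password
  ansible.builtin.copy:
    dest: /etc/sudoers.d/zabbix-nmap
    content: |
      zabbix ALL=(ALL) NOPASSWD: /usr/bin/nmap
    mode: "0440"
    validate: /usr/sbin/visudo -cf %s
```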

Notes:

  • Any config backups to be in /srv/backups/

  • To back up the DB, run sudo -u zabbix pg_dump -U zabbix -d zabbix -f /srv/backups/zabbix-$(date +%s).sql

  • /run/zabbix did not exist in prod after reboot (28/3) although the one in staging did, weirdly. Greg created it by hand for now. Might want to Ansible it, just to be sure.

  • Mark had to create Greg's user in prod by hand, needs Ansible-ing
