
Fedora Zabbix planning doc

Situation

There is a desire to migrate from Nagios to Zabbix in the Fedora Infrastructure. Ticket 11393 has some details and history.

What happened last time

The last effort here got as far as building out Zabbix in staging and having it working in principle. However (talking with David, James, Kevin and Steve), it seems that a few issues got in the way of completing the migration:

  • The sheer volume of alerts caused people to ignore Zabbix
  • Adoption of Zabbix by applications was low to none
  • New things were still added to Nagios instead
  • The sheer number of items that needed to be migrated once the prototype was built

Risks

There are a few things we'll want to be careful of:

  • Repeating the mistakes of last time
  • Missing important checks during migration
  • Splitting attention between Nagios and Zabbix

Challenges

Size-wise, we have ~500 hosts and ~2000 services in Nagios. That's not too bad, especially as all this will belong to hostgroups. It is likely that each change we make to Zabbix/Ansible will migrate significant chunks of that, at least initially.

We've also got way too many problems listed in the Staging Zabbix, so we need to get on top of that - but perhaps the way to do so is to nuke it from orbit?

On top of the above, I think there's some aspirational stuff to think about:

Koji Builders (added 19/3)

Apparently we need a much lighter touch on the Koji builders, because they are often maxed out. Because Zabbix is hierarchical, that probably means a really basic auto-register template, and then a "proper" base role that is applied by Ansible to most hosts, but not Koji.
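
One way to get that light touch: a minimal sketch, assuming we use Zabbix's HostMetadata-based auto-registration to link only a tiny template to the builders (the drop-in path and handler name below are illustrative, not existing code):

```yaml
# Tag Koji builders via HostMetadata so an auto-registration action can
# link just the lightweight template to them (names are placeholders).
- name: Set Zabbix agent host metadata on Koji builders
  ansible.builtin.copy:
    dest: /etc/zabbix/zabbix_agentd.d/metadata.conf
    content: |
      HostMetadata=koji-builder
    mode: "0644"
  notify: restart zabbix-agent
```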

Harmonizing with CentOS

Fabian has done a lot of work to automate Zabbix in the CentOS infrastructure. I've spent the last couple of weeks working with that, and I think it would make a lot of sense to lift as much of that structure as we can into this project, if only to save us time.

This would mostly be about how to structure the Ansible code rather than the specific scripts/checks to execute, as those will be Fedora-specific.

Example:
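
A minimal sketch of the kind of structure we could lift - registering a host into its hostgroups and linking templates from Ansible via the community.zabbix collection (the group/template names are invented, and the API connection settings are omitted):

```yaml
# Illustrative only; parameter names follow the community.zabbix.zabbix_host
# module docs, while the group and template names are placeholders.
- name: Register host in Zabbix with its hostgroups and templates
  community.zabbix.zabbix_host:
    host_name: "{{ inventory_hostname }}"
    host_groups:
      - Fedora/Staging
    link_templates:
      - Template Fedora Base
    state: present
```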

Going further with Ansible

We could go even further than CentOS. According to the Zabbix Ansible Collection docs, there are modules for Template, Discovery, and Prototype objects. Currently CentOS handles those by hand in the UI, but as a stretch goal we could look at automating this too.

EDIT 19/3: We do have some code for adding the templates to git, so perhaps we can expand on this.
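
A rough sketch of what expanding that could look like, assuming we use community.zabbix.zabbix_template_info (the template name and destination path are placeholders, and the exact format/return-key names should be checked against the installed collection version):

```yaml
- name: Export a template definition from Zabbix
  community.zabbix.zabbix_template_info:
    template_name: Template Fedora Base  # placeholder
    format: yaml
    omit_date: true
  register: exported

- name: Write the export into the git-tracked directory
  ansible.builtin.copy:
    content: "{{ exported.template_yaml }}"  # return key varies with format
    dest: "{{ templates_repo_dir }}/Template_Fedora_Base.yaml"
```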

Plan / Phases

We'll want to break this down. I'm assuming a "normal" migration style here, where we move checks over piecemeal. See below for notes on a possible alternative.

This is pretty high-level; I'm not going into the detail of specific applications or templates here. In general my principles are (a) reuse as much as possible from CentOS, since they already have a fully functioning Zabbix setup, and (b) reuse the Staging work in Prod. Both can be done via Import/Export in Zabbix, with any needed edits along the way.

Phase 0 - Setup Staging

Here we build out Staging with the Zabbix Server and any necessary Proxies (I'm not yet familiar enough with the network layout to see what we need here - input welcome!). Much of this already exists, and Mark has been working to update Zabbix to 7.0 in Staging.

  • Build/update Zabbix Server playbook and run it (see the sketch after this list)
  • Build/update Zabbix Proxy playbook and run it
  • Build/update Zabbix Agent playbook and run it
  • Copy basic host templates from CentOS because they work
  • Setup notification channels
    • We may want multiple Matrix channels for info/alert/critical so that we can get the most eyes on critical things.
    • Or perhaps multiple bots in one channel, so one can mute the noisier ones
  • Test & review basic host monitoring
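
As a sketch, the server/proxy/agent playbooks above could lean on the roles shipped in the community.zabbix collection (the inventory group names here are placeholders):

```yaml
- hosts: zabbix_server_stg
  roles:
    - role: community.zabbix.zabbix_server

- hosts: zabbix_proxies_stg
  roles:
    - role: community.zabbix.zabbix_proxy

- hosts: all
  roles:
    - role: community.zabbix.zabbix_agent
```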

EDIT 19/3 - Looking through staging, we have a LOT of unused templates that make it hard to see what's what. We might want to clean that up. Ideally I'd like to see only the templates we're actually using, rather than 7 pages of empty defaults :)

We also need to get the notification spam down here; that's probably part of reviewing the base templates.

Phase 1 - Application monitoring in Staging

Identify services we want to set up monitoring on (help wanted here, my familiarity with our stack is low), and port their checks from Nagios to Zabbix (or write new ones).

Per application:

  • Zabbix template created, either by hand or by Ansible
    • Item / Trigger prototypes for the service to monitor
  • Ansible code to add a server to that template
  • Ansible code to deploy LLD config for the prototypes (see the sketch after this list)
  • Ansible code to deploy checks & report metrics
  • Test notifications get to the right channels on Matrix (and elsewhere?)
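
To make the LLD step concrete, a minimal sketch of the pieces Ansible would deploy per application (the script name, key, and paths are all invented for illustration):

```yaml
- name: Install the discovery script for the application
  ansible.builtin.copy:
    src: discover_workers.sh
    dest: /usr/local/bin/discover_workers.sh
    mode: "0755"

- name: Register the discovery key with the Zabbix agent
  ansible.builtin.copy:
    dest: /etc/zabbix/zabbix_agentd.d/app_discovery.conf
    content: |
      # The script must print LLD JSON, e.g. {"data": [{"{#WORKER}": "w1"}]}
      UserParameter=app.workers.discovery,/usr/local/bin/discover_workers.sh
    mode: "0644"
  notify: restart zabbix-agent
```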

At this point we can repeat this for enough applications that we have confidence in the approach and the Zabbix-Ansible codebase.

Note that we can't remove Nagios in Staging, because during this project we'll still want to test changes to Nagios before rolling them out to Prod.

Phase 2 - Initial rollout in Prod

Basically repeat Phase 0 but in Prod. This should be straightforward changes to the Ansible codebase to manage both environments.

  • Update Ansible to enable Prod
  • Roll out Zabbix Server & Proxies
    • Dump templates from Staging and import in Prod (see the sketch after this list)
    • Notifications - Create new Matrix channels / bots
  • Roll out at least one Agent and test
    • Add the test agent server to the necessary hostgroups
  • Roll out agents everywhere to enable basic monitoring
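
The dump/import step could be scripted roughly like this, assuming the templates were exported to files as described earlier (community.zabbix.zabbix_template accepts an exported definition; the path is a placeholder):

```yaml
- name: Import a template exported from Staging into Prod
  community.zabbix.zabbix_template:
    template_json: "{{ lookup('file', 'templates/Template_Fedora_Base.json') }}"
    state: present
```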

Phase 3 - Application rollout in Prod

Notably, to avoid the issues we encountered last time, we should freeze the Nagios config when we reach this phase. Once we start migrating applications to Zabbix, no new things should be added to Nagios. As applications are migrated in Prod, their Ansible/Hosts/Services can be entirely removed from Nagios in both envs.

EDIT: After some discussion, it seems the consensus is not to remove things from Nagios, but rather to add things to Zabbix until it is largely complete, and then shut down Nagios entirely.

Again, largely a repeat of Phase 1. This should always be tested in Staging first, per application, so it's mostly about enabling Ansible for Prod and copying config from the Staging Zabbix.

  • Dump/import Template from Staging Zabbix
    • Add host to hostgroup
    • Update Ansible to deploy LLD to Prod
    • Update Ansible to deploy checks to Prod
  • Check deployment looks the same as Staging
  • Test notifications are working
  • Remove application / host from Nagios

Definition of Done

As Phase 3 rolls out, Nagios will get smaller and quieter. Thus the natural Definition of Done is when Nagios is empty :)

Open Questions

Datacenter move

Kevin has suggested that we could consider not deploying Nagios in the new datacenter.

That would change Phases 2 & 3, and would require us to have a confident execution of Phases 0 & 1 in Staging before the move begins, but would mean we have significantly less work to do in getting rid of Nagios inertia later.

This feels like a tight goal given the current dates of the migration - but they may slip. It's a bit of a race, but perhaps worth aiming for?

Ansible Structure

CentOS uses a separate Git repo per role, which we can't do, but I think it should still be possible to use a similar and consistent pattern of putting monitoring tasks in monitoring.yml for each role. This makes it easy to identify and maintain them in the future. It will also make it clear when we've done a role, since we don't use this pattern currently (AFAICT we have monitoring tasks in main.yml).
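
A sketch of the pattern, using standard Ansible role layout (nothing here exists yet):

```yaml
# roles/<role>/tasks/main.yml
- name: Include this role's monitoring tasks
  ansible.builtin.import_tasks: monitoring.yml

# roles/<role>/tasks/monitoring.yml would then hold only the Zabbix
# checks / LLD / template wiring for that role.
```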

I haven't explored the use of Ansible for Zabbix templates/lld/etc yet, and given the datacenter move timeline in the previous point, I'd suggest we revisit this once the core build is done. It should be easy to export the hand-made templates and verify Ansible generates the same thing. We could at least commit the exported templates to a git repo, to keep a record of them.

Openshift? (added 19/3)

How do we monitor Openshift? As more work moves to clustered approaches, we want to make sure we have templates for this. Needs a research spike.

Distribution of work / SIG

We already have some people interested, and we should work out how to allocate things. These are:

  • Definitely interested in working on this:
    • Greg - new to the infra, but has full-time availability to work on this (minus time needed for CentOS, but assume a minimum of 3 days/week on average)
    • Mark - knows most of the infra but has a day job to do, time limited
  • Maybe working on it a bit?
    • James & David - full-time, were involved in the last attempt, but now have other workloads to deal with too. They have lots of background, so useful as advisors at least?
  • Advisory:
    • Kevin - Knows everything :)

Steve suggested we might consider forming a SIG for this, as it's a big effort, and it would formalise some of the structure. Thoughts on this are welcome; I'm not sure how I feel about it yet.

Next Steps

Mark has created PR#2504 for upgrading the Zabbix setup in Staging from 6.0 LTS to 7.0 LTS. Reviewing the base checks for all hosts and matching them to Nagios is probably next. We also probably want to sort the notification channels/bots out and bring the spam levels down a bit.

In terms of comms, we probably want to get this posted to Discourse once we're happy with it, I can do that.

From a tracking perspective, once we agree on a plan, we'll probably need to track a bunch of this in Pagure tickets - doing it all in the one ticket might be a bit much.

Meeting logs

2025-03-14

Action items

General Tasks and Notes

Tasks:

  • Need to give the zabbix user NOPASSWD sudo over /usr/bin/nmap (see the sketch below)
  • Update community.zabbix to 3.0.0 to fix playbook issues
  • Clean up templates (dump them first?) as most are unused
  • Work on simple koji-acceptable auto-register template
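
For the first task above, a minimal sketch (the drop-in filename is an assumption):

```yaml
- name: Allow the zabbix user to run nmap via sudo without a password
  ansible.builtin.copy:
    dest: /etc/sudoers.d/zabbix-nmap
    content: |
      zabbix ALL=(ALL) NOPASSWD: /usr/bin/nmap
    mode: "0440"
    validate: /usr/sbin/visudo -cf %s
```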

Notes:

  • Any config backups to be in /srv/backups/

  • To back up the DB, run sudo -u zabbix pg_dump -U zabbix -d zabbix -f /srv/backups/zabbix-$(date +%s).sql

  • /run/zabbix did not exist in prod after reboot (28/3) although the one in staging did, weirdly. Greg created it by hand for now. Might want to Ansible it, just to be sure.

  • Mark had to create Greg's user in prod by hand, needs Ansible-ing
