There is a desire to migrate from Nagios to Zabbix in the Fedora Infrastructure. Ticket 11393 has some details and history.
The last effort here got as far as building out Zabbix in staging, and having it working in principle. However (talking with David, James, Kevin and Steve), it seems that we had a couple of issues that got in the way of completing the migration:
There are a few things we'll want to be careful of:
Size-wise, we have ~500 hosts and ~2000 services in Nagios. That's not too bad, especially as all this will belong to hostgroups. It is likely that each change we make to Zabbix/Ansible will migrate significant chunks of that, at least initially.
We've also got way too many problems listed in the Staging Zabbix, so we need to get on top of that - but perhaps the way to do so is to nuke it from orbit?
On top of the above, I think there's some aspirational stuff to think about:
Apparently we need a much lighter touch on the Koji builders, because they are often maxed out. Because Zabbix is hierarchical, that probably means a really basic auto-register template, and then a "proper" base role that is applied by Ansible to most hosts, but not Koji.
Fabian has done a lot of work to automate Zabbix in the CentOS infrastructure. I've spent the last couple of weeks working with that, and I think it would make a lot of sense to lift as much of that structure as we can into this project, if only to save us time.
This would mostly be about how to structure the Ansible code rather than the specific scripts/checks to execute, as they will be Fedora specific.
Example:
We can go even further than CentOS, perhaps… According to the Zabbix Ansible Collection there are modules for Template, Discovery, and Prototype objects. Currently CentOS handles that by hand in the UI, but as a stretch goal we could look at this too.
EDIT 19/3: We do have some code for adding the templates to git, so perhaps we can expand on this.
We'll want to break this down. I'm assuming a "normal" migration style here, where we move checks over piecemeal. See below for notes on a possible alternative…
This is pretty high-level; I'm not going into the detail of specific applications or templates here. In general my principles are (a) reuse as much as possible from CentOS, since they already have a fully functioning Zabbix setup, and (b) reuse Staging config in Prod. Both can be done via Import/Export in Zabbix, with any needed edits along the way.
Here we build out Staging with the Zabbix Server and any necessary Proxies (I'm not yet familiar enough with the network layout to see what we need here, please input!). Much of this already exists, and Mark has been working to update Zabbix to 7.0 in Staging.
EDIT 19/3 - Looking through staging, we have a LOT of unused templates that make it hard to see what's what. We might want to clean that up… Ideally I'd like to see templates we're actually using, rather than 7 pages of empty defaults :)
We also need to get the notification spam down here, that's probably part of reviewing the base templates.
Identify services we want to set up monitoring on (help wanted here, my familiarity with our stack is low), and port their checks from Nagios to Zabbix (or write new ones).
Per application:
At this point we can repeat this for enough applications that we have confidence in the approach and the Zabbix-Ansible codebase.
Note that we can't remove Nagios in Staging, because during this project we'll still want to test changes to Nagios before rolling them out to Prod.
Basically repeat Phase 0 but in Prod. This should be straightforward changes to the Ansible codebase to manage both environments.
Notably, to avoid the issues we encountered last time, we should freeze the Nagios config when we reach this phase. Once we start migrating applications to Zabbix, nothing new should be added to Nagios. As applications are migrated in Prod, the corresponding Ansible/Hosts/Services can be entirely removed from Nagios in both envs.
EDIT: After some discussion, it seems the consensus is not to remove things from Nagios, but rather to add things to Zabbix until it is largely complete, and then shutdown Nagios entirely.
Again, largely a repeat of Phase 1. This should always be tested in Staging first, per application, so it's mostly about enabling Ansible for Prod and copying config from the Staging Zabbix.
As Phase 3 rolls out, Nagios will get smaller and quieter. Thus the natural Definition of Done is when Nagios is empty :)
Kevin has suggested that we could consider not deploying Nagios in the new datacentre.
That would change Phases 2 & 3, and would require us to have a confident execution of Phases 0 & 1 in Staging before the move begins, but would mean we have significantly less work to do in getting rid of Nagios inertia later.
This feels like a tight goal given the current dates of the migration - but they may slip. It's a bit of a race, but perhaps worth aiming for?
CentOS uses a separate Git repo per role, which we can't do, but I think it should still be possible to use a similar & consistent pattern of putting monitoring tasks in monitoring.yml for each role. This makes it easy to identify and maintain in the future. It will also make it clear when we've done a role, since we don't use this pattern currently (AFAICT we have monitoring tasks in main.yml).
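To make "when we've done a role" easy to check, something like the following could list the roles that don't yet have a monitoring.yml. This is only a rough sketch: it assumes a flat roles/<role>/tasks/monitoring.yml layout, and the function name and the ROLES_DIR variable are made up for illustration, so nested role directories would need extra handling.

```shell
#!/usr/bin/env bash
# Sketch: list roles that do not yet follow the proposed monitoring.yml pattern.
# Assumes a flat layout of roles/<role>/tasks/monitoring.yml (an assumption,
# not checked against the real repo). ROLES_DIR is a hypothetical override.

roles_missing_monitoring() {
    local roles_dir="$1" role
    for role in "$roles_dir"/*/; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -d "$role" ] || continue
        if [ ! -f "${role}tasks/monitoring.yml" ]; then
            basename "$role"
        fi
    done
}

roles_missing_monitoring "${ROLES_DIR:-roles}"
```

Run from the top of the Ansible checkout, this prints one role name per line; an empty result would mean every role has been converted.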
I haven't explored the use of Ansible for Zabbix templates/lld/etc yet, and given the datacenter move timeline in the previous point, I'd suggest we revisit this once the core build is done. It should be easy to export the hand-made templates and verify Ansible generates the same thing. We could at least commit the exported templates to a git repo, to keep a record of them …
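For the "verify Ansible generates the same thing" step, a comparison along these lines might be enough. It's a sketch under assumptions: the two directories of YAML exports (one from the hand-made templates, one from an Ansible-built instance) are hypothetical, as is the function name, and it does a plain byte-for-byte comparison, so any fields that legitimately vary between exports would show up as differences.

```shell
#!/usr/bin/env bash
# Sketch: compare hand-made template exports against Ansible-generated ones.
# Both directories are assumed to hold one <template>.yaml export per template;
# how the exports get there is outside the scope of this sketch.

compare_template_exports() {
    local hand_dir="$1" gen_dir="$2" f name status=0
    for f in "$hand_dir"/*.yaml; do
        [ -f "$f" ] || continue
        name="$(basename "$f")"
        if [ ! -f "$gen_dir/$name" ]; then
            echo "MISSING  $name"
            status=1
        elif ! diff -q "$f" "$gen_dir/$name" >/dev/null; then
            echo "DIFFERS  $name"
            status=1
        else
            echo "OK       $name"
        fi
    done
    # Non-zero when anything is missing or different, so it can gate a CI job.
    return "$status"
}
```

Calling `compare_template_exports exports/hand exports/ansible` (hypothetical paths) prints one status line per template, and `diff -u` on any file flagged as DIFFERS shows the detail.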
How do we monitor Openshift? As more work moves to clustered approaches, we want to make sure we have templates for this. Needs a research spike.
We already have some people interested, and we should work out how to allocate things. These are:
Steve suggested we might consider forming a SIG for this, as it's a big effort, and would formalise some of the structure. Thoughts on this are welcome, I'm not sure how I feel on this yet.
Mark has created PR#2504 for upgrading the Zabbix setup in Staging from 6.0LTS to 7.0LTS. Reviewing the base checks for all hosts and matching them to Nagios is probably next. We also probably want to sort the notification channels/bots out and bring the spam levels down a bit.
In terms of comms, we probably want to get this posted to Discourse once we're happy with it, I can do that.
From a tracking perspective, once we agree on a plan, we'll probably need to track a bunch of this in Pagure tickets - doing it all in the one ticket might be a bit much.
Meeting logs
Action items
Tasks:
Notes:
Any config backups to be in /srv/backups/
To back up the DB, run: sudo -u zabbix pg_dump -U zabbix -d zabbix -f /srv/backups/zabbix-$(date +%s).sql
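If we end up running that often, it might be worth wrapping it so that a failed or empty dump is noticed. A minimal sketch, assuming the same user, database name and /srv/backups location as the command above; the function names and the BACKUP_DIR override are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: wraps the pg_dump command from the notes above.
# BACKUP_DIR defaults to /srv/backups as per the notes; the function names
# are made up for illustration.

zabbix_backup_path() {
    # Epoch-seconds suffix, matching the zabbix-$(date +%s).sql naming above.
    printf '%s/zabbix-%s.sql\n' "${BACKUP_DIR:-/srv/backups}" "$(date +%s)"
}

backup_zabbix_db() {
    local target
    target="$(zabbix_backup_path)"
    # Same invocation as in the notes, with the path held in a variable.
    sudo -u zabbix pg_dump -U zabbix -d zabbix -f "$target" || return 1
    # Treat an empty dump file as a failure rather than a valid backup.
    [ -s "$target" ] || { echo "backup $target is empty" >&2; return 1; }
    echo "$target"
}
```

Calling `backup_zabbix_db` prints the path of the dump on success and returns non-zero otherwise, which makes it easy to use from cron or an Ansible command task.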
/run/zabbix did not exist in prod after reboot (28/3) although the one in staging did, weirdly. Greg created it by hand for now. Might want to Ansible it, just to be sure.
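Since /run is normally a tmpfs that starts empty at boot, a directory created there by hand won't survive the next reboot. One way to make it stick (whether deployed by Ansible or not) is a tmpfiles.d entry. The sketch below is an assumption-heavy illustration: the owner, group and mode are guesses to be checked against staging, and the config file name and function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: recreate /run/zabbix at every boot via systemd-tmpfiles.
# Owner, group and mode are assumptions; compare with what staging has.

zabbix_tmpfiles_line() {
    # tmpfiles.d format: type path mode user group age
    printf 'd %s %s %s %s -\n' \
        "${1:-/run/zabbix}" "${2:-0755}" "${3:-zabbix}" "${4:-zabbix}"
}

install_zabbix_tmpfiles() {
    local conf=/etc/tmpfiles.d/zabbix-rundir.conf   # hypothetical file name
    zabbix_tmpfiles_line | sudo tee "$conf" >/dev/null || return 1
    # Create the directory now rather than waiting for the next boot.
    sudo systemd-tmpfiles --create "$conf"
}
```

It would also be worth checking whether the Zabbix packages already ship a tmpfiles.d entry that is present in staging but missing in prod, since that could explain the difference.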
Mark had to create Greg's user in prod by hand, needs Ansible-ing