gwmngilfen
# Fedora Zabbix planning doc

## Situation

There is a desire to migrate from Nagios to Zabbix in the Fedora Infrastructure. [Ticket 11393](https://pagure.io/fedora-infrastructure/issue/11393) has some details and history.

## What happened last time

The last effort here got as far as building out Zabbix in staging, and having it working *in principle*. However (talking with David, James, Kevin and Steve), it seems that we had a few issues that got in the way of completing the migration:

- Sheer volume of alerts caused people to ignore Zabbix
- Adoption of Zabbix from applications was low/none
- New things were still added to Nagios instead
- Sheer number of items that need to be migrated once the prototype is built

## Risks

There are a few things we'll want to be careful of:

- Repeating the mistakes of last time
- Missing important checks during migration
- Splitting attention between Nagios and Zabbix

## Challenges

Size-wise, we have ~500 hosts and ~2000 services in Nagios. That's not *too* bad, especially as all this will belong to hostgroups. It is likely that each change we make to Zabbix/Ansible will migrate significant chunks of that, at least initially.

We've also got *way* too many problems listed in the Staging Zabbix, so we need to get on top of that - but perhaps the way to do so is to nuke it from orbit?

On top of the above, I think there's some aspirational stuff to think about:

#### Koji Builders (added 19/3)

Apparently we need a *much* lighter touch on the Koji builders, because they are often maxed out. Because Zabbix is hierarchical, that probably means a really basic auto-register template, and then a "proper" base role that is applied by Ansible to *most* hosts, but not Koji.

#### Harmonizing with CentOS

Fabian has done a lot of work to automate Zabbix in the CentOS infrastructure.
I've spent the last couple of weeks working with that, and I think it would make a lot of sense to lift as much of that structure as we can into this project, if only to save us time. This would mostly be about how to structure the Ansible code rather than the specific scripts/checks to execute, as they will be Fedora specific.

Example:

- [Role main.yml calling out for monitoring](https://github.com/CentOS/ansible-role-vdo-host/blob/master/tasks/main.yml)
- [Same role registering its monitoring tasks](https://github.com/CentOS/ansible-role-vdo-host/blob/master/tasks/monitoring.yml)

#### Going further with Ansible

We can go even further than CentOS, perhaps... According to [the Zabbix Ansible Collection](https://docs.ansible.com/ansible/latest/collections/community/zabbix/index.html) there are modules for Template, Discovery, and Prototype objects. Currently CentOS handles that by hand in the UI, but as a stretch goal we could look at this too.

EDIT 19/3: We do have some code for adding the templates to git, so perhaps we can expand on this.

## Plan / Phases

We'll want to break this down. I'm assuming a "normal" migration style here, where we move checks over piecemeal. See below for notes on a possible alternative...

This is pretty high-level; I'm not going into the detail of specific applications or templates here. In general my principles are (a) reuse as much as possible from CentOS since they already have a fully functioning Zabbix setup, and (b) reuse Staging to Prod. Both can be done via Import/Export in Zabbix, with any needed edits along the way.

### Phase 0 - Setup Staging

Here we build out Staging with the Zabbix Server and any necessary Proxies (I'm not yet familiar enough with the network layout to see what we need here, please input!). Much of this already exists, and Mark has been working to update Zabbix to 7.0 in Staging.

- [x] Build/update Zabbix Server playbook and run it
- [ ] Build/update Zabbix Proxy playbook and run it
- [ ] Build/update Zabbix Agent playbook and run it
- [ ] ~~Copy basic host templates from CentOS because they work~~
  - [x] Actually reworked the upstream "Linux by Zabbix Agent active" template into two templates
- [x] Setup notification channels
  - [ ] We may want multiple Matrix channels for info/alert/critical so that we can get the most eyes on critical things.
  - [ ] Or perhaps multiple bots in one channel, so one can mute the noisier ones...
- [x] Test & review basic host monitoring

EDIT 19/3 - Looking through staging, we have a LOT of unused templates that make it hard to see what's what. We might want to clean that up.... Ideally I'd like to see templates we're *actually* using, rather than 7 pages of empty defaults :)

We also need to get the notification spam down here; that's probably part of reviewing the base templates.

EDIT 4/4 - We have a basic level of templates created that reflect the reality of Koji. "Linux Autoregistration" is *most* of the upstream base template, but the CPU/memory/disk IO/etc checks have been moved to "Linux Hosts", which inherits from "Linux Autoregistration". This allows all hosts to auto-register, but only non-Koji hosts go to "Linux Hosts".

### Phase 1 - Application monitoring in Staging

Identify services we want to set up monitoring on (help wanted here, my familiarity with our stack is low), and port their checks from Nagios to Zabbix (or write new ones). Per application:

- [ ] Zabbix template created, either by hand or by Ansible
- [ ] Item / Trigger prototypes for the service to monitor
- [ ] Ansible code to add a server to that template
- [ ] Ansible code to deploy LLD config for the prototypes
- [ ] Ansible code to deploy checks & report metrics
- [ ] Test notifications get to the right channels on Matrix (and elsewhere?)
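
To make the "LLD config for the prototypes" item a bit more concrete, here is a minimal Python sketch of the kind of payload a discovery check might emit. It assumes a recent Zabbix that accepts a bare JSON array of objects keyed by LLD macros; the macro name `{#SERVICE}`, the function name, and the service names are hypothetical and not taken from any existing Fedora or CentOS check.

```python
import json


def lld_payload(services):
    """Return a low-level discovery payload for the given service names.

    Each discovered entity becomes one JSON object whose keys are LLD
    macros. {#SERVICE} is a hypothetical macro name for this sketch; the
    item/trigger prototypes in the template would need to use the same one.
    """
    # De-duplicate and sort so repeated runs produce a stable payload.
    rows = [{"{#SERVICE}": name} for name in sorted(set(services))]
    return json.dumps(rows)


if __name__ == "__main__":
    # Hypothetical service names, for illustration only.
    print(lld_payload(["postgresql", "httpd", "httpd"]))
```

Ansible would then only need to deploy a script like this plus the list of services per host, leaving the prototypes themselves in the template.
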

At this point we can repeat this for enough applications that we have confidence in the approach and the Zabbix-Ansible codebase.

Note that we can't remove Nagios in Staging, because during this project we'll still want to test changes to Nagios before rolling them out to Prod.

### Phase 2 - Initial rollout in Prod

Basically repeat Phase 0 but in Prod. This should be straightforward changes to the Ansible codebase to manage both environments.

- [ ] Update Ansible to enable Prod
- [ ] Rollout Zabbix Server & Proxies
- [ ] Dump templates from Staging and import in Prod
- [ ] Notifications - Create new Matrix channels / bots
- [ ] Rollout at least one Agent and test
- [ ] Add test agent server to the necessary hostgroups
- [ ] Rollout agents everywhere to enable basic monitoring

### Phase 3 - Application rollout in Prod

Notably, to avoid the issues we encountered last time, when reaching this phase, we should *freeze Nagios config here*. Once we start migrating applications to Zabbix, no new things should be added to Nagios. As applications are migrated in Prod, then Ansible/Hosts/Services can be entirely removed from Nagios in both envs.

EDIT: After some discussion, it seems the consensus is not to remove things from Nagios, but rather to add things to Zabbix until it is *largely* complete, and then shut down Nagios entirely.

Again, largely a repeat of Phase 1. This should always be tested in Staging first, per application, so it's mostly about enabling Ansible for Prod and copying config from the Staging Zabbix.

- [ ] Dump/import Template from Staging Zabbix
- [ ] Add host to hostgroup
- [ ] Update Ansible to deploy LLD to Prod
- [ ] Update Ansible to deploy checks to Prod
- [ ] Check deployment looks the same as Staging
- [ ] Test notifications are working
- [ ] Remove application / host from Nagios

## Definition of Done

As Phase 3 rolls out, Nagios will get smaller and quieter.
Thus the natural Definition of Done is when Nagios is empty :)

## Open Questions

#### Datacenter move

Kevin has [suggested](https://hackmd.io/54xmtW6IQoKNKnbRXySxSg#0-planning--footprint-reduction--changes) that we could consider *not* deploying Nagios in the new datacenter. That would change Phases 2 & 3, and would require us to have a confident execution of Phases 0 & 1 in Staging *before* the move begins, but would mean we have significantly less work to do in getting rid of Nagios inertia later.

This feels like a tight goal given the current dates of the migration - but they may slip. It's a bit of a race, but perhaps worth aiming for?

#### Ansible Structure

CentOS uses a separate Git repo per role, which we can't do, but I think it should still be possible to use a similar & consistent pattern of putting monitoring tasks in `monitoring.yml` for each role. This makes it easy to identify and maintain in the future. It will also make it clear when we've done a role, since we don't use this pattern currently (AFAICT we have monitoring tasks in `main.yml`).

I haven't explored the use of Ansible for Zabbix templates/LLD/etc yet, and given the datacenter move timeline in the previous point, I'd suggest we revisit this once the core build is done. It *should* be easy to export the hand-made templates and verify Ansible generates the same thing. We could at least commit the exported templates to a git repo, to keep a record of them...

#### OpenShift? (added 19/3)

How do we monitor OpenShift? As more work moves to clustered approaches, we want to make sure we have templates for this. Needs a research spike.

#### Distribution of work / SIG

We already have some people interested, and we should work out how to allocate things.
These are:

- Definitely interested in working on this:
  - Greg - new to the infra, but has full-time availability to work on this (minus time needed for CentOS, but assume min 3 days/week on avg)
  - Mark - knows most of the infra but has a day job to do, time limited
- Maybe working on it a bit?
  - James & David - full-time, were involved in the last attempt, but now have other workloads to deal with too. Have lots of background, useful as advisors at least?
- Advisory:
  - Kevin - Knows everything :)

Steve suggested we might consider forming a SIG for this, as it's a big effort, and would formalise some of the structure. Thoughts on this are welcome, I'm not sure how I feel on this yet.

## Next Steps

Mark has created [PR#2504](https://pagure.io/fedora-infra/ansible/pull-request/2504) for upgrading the Zabbix setup in Staging from 6.0LTS to 7.0LTS.

Reviewing the base checks for all hosts and matching them to Nagios is probably next. We also probably want to sort the notification channels/bots out and bring the spam levels down a bit.

In terms of comms, we probably want to get this posted to Discourse once we're happy with it; I can do that.

From a tracking perspective, once we agree on a plan, we'll probably need to track a bunch of this in Pagure tickets - doing it all in the one ticket might be a bit much.

---

Meeting logs

### 2025-03-14

Action items

- [x] Greg to post summary of the plan to Discourse
- [x] Mark / David to rollout 7.0
  - [x] Merged PR from Mark: https://pagure.io/fedora-infra/ansible/pull-request/2504
  - [x] Deployed Zabbix 7.0 to the staging instance.
  - [x] Merged PR to enable in production: https://pagure.io/fedora-infra/ansible/pull-request/2515
  - [x] Deployed Zabbix 7.0 to the production instance.
- [ ] Mark to look at FAS integration etc https://www.zabbix.com/documentation/current/en/manual/web_interface/frontend_sections/users/authentication/ldap
- [ ] ~~Greg to start porting gnarly custom Nagios checks and see if they work~~
  - [ ] Replaced (per discussion with Smooge) with simplifying templates and working on template inheritance (e.g. Pagure would not implement DB checks, the DB template would do that)

## General Tasks and Notes

Tasks:

- [ ] Need to give zabbix user NOPASSWD sudo over /usr/bin/nmap
- [x] Update community.zabbix to 3.0.0 to fix playbook issues
  - Fixed by Ansible [PR #2523](https://pagure.io/fedora-infra/ansible/pull-request/2523)
- [x] Cleanup templates (dump first?) as most are unused
- [x] Work on simple koji-acceptable auto-register template

Notes:

- Any config backups to be in /srv/backups/
- To back up the DB, run `sudo -u zabbix pg_dump -U zabbix -d zabbix -f /srv/backups/zabbix-$(date +%s).sql`
- /run/zabbix did not exist in prod after reboot (28/3) although the one in staging did, weirdly. Greg created it by hand for now. Might want to Ansible it, just to be sure.
- Mark had to create Greg's user in prod by hand, needs Ansible-ing
- 4/4 First rollout of new Ansible code to staging. Some minor issues, notably:
  - [ ] User creation/management is disabled until we can upgrade to community.zabbix 3.3.0, after freeze
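
If the DB backup note above ends up in a cron job or an Ansible-deployed script, a minimal Python sketch could look like the following. It only assembles the same `pg_dump` invocation quoted in the notes, with an epoch-seconds filename under /srv/backups; the function name and the `now` parameter are hypothetical, and nothing here actually executes the dump.

```python
import time

# Backup location taken from the notes above.
BACKUP_DIR = "/srv/backups"


def backup_command(now=None):
    """Build the pg_dump command from the notes as an argument list.

    `now` is an optional epoch timestamp (a hypothetical parameter added so
    the result is reproducible); by default the current time is used, which
    mirrors `$(date +%s)` in the shell one-liner.
    """
    stamp = int(time.time() if now is None else now)
    target = f"{BACKUP_DIR}/zabbix-{stamp}.sql"
    return ["sudo", "-u", "zabbix", "pg_dump",
            "-U", "zabbix", "-d", "zabbix", "-f", target]


if __name__ == "__main__":
    # Print rather than run, so the sketch is safe to try anywhere.
    print(" ".join(backup_command()))
```

Returning an argument list rather than a shell string means it could be handed to `subprocess.run` without any quoting concerns, should we want to run it from Python.
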
