# Monitoring Fedora Releng systems with Zabbix
This SOP documents, step by step, what was required [14] to add Zabbix monitoring to the Fedora Releng systems. It is intended as a guide for community members who want to help roll out Zabbix to the wider Fedora Infra systems, or help maintain it going forward. Once complete, this SOP will live in [3], [4].
### Resources
- 1 [Ansible Zabbix module for managing templates](https://docs.ansible.com/ansible/latest/collections/community/zabbix/zabbix_template_module.html)
- 2 [Zabbix Sender pushing metrics](https://www.zabbix.com/documentation/6.0/en/manpages/zabbix_sender)
- 3 [Fedora Infra Docs](https://docs.fedoraproject.org/en-US/infra/sysadmin_guide/#_standard_operating_procedures)
- 4 [Fedora Infra Docs Git Repo](https://pagure.io/infra-docs-fpo)
- 5 [Zabbix Staging Server](https://zabbix.stg.fedoraproject.org)
- 6 [Targeting groups for specific Ansible tasks](https://stackoverflow.com/questions/21008083/run-task-only-if-host-does-not-belong-to-a-group)
- 7 [Zabbix Plugins](https://www.zabbix.com/documentation/guidelines/en/plugins)
- 8 [Zabbix Scripts](https://www.zabbix.com/documentation/6.0/en/manual/web_interface/frontend_sections/administration/scripts)
- 9 [Monitor running time of specific process with Zabbix](https://www.zabbix.com/forum/zabbix-help/13350-monitor-running-time-of-a-specific-process)
- 10 [Zabbix proc num](https://www.zabbix.com/documentation/current/en/manual/appendix/items/proc_mem_num_notes)
- 11 [Zabbix added to Releng Hosts PR](https://pagure.io/fedora-infra/ansible/pull-request/1653#)
- 12 [Zabbix kvm Virtual Host template](https://www.zabbix.com/integrations/kvm)
- 13 [Releng Ansible cronjob installation](https://pagure.io/fedora-infra/ansible/blob/main/f/roles/releng/tasks/main.yml)
- 14 [Fedora Infra Ticket](https://pagure.io/fedora-infrastructure/issue/11577)
- 15 [Failed Compose Monitoring](https://pagure.io/releng/failed-composes/issues)
- 16 [Fedora Infra Releng Compose Monitoring Software](https://pagure.io/releng/compose-tracker)
- 17 [Zabbix Production Server](https://zabbix.fedoraproject.org)
- 18 [Zabbix ansible host group](https://docs.ansible.com/ansible/latest/collections/community/zabbix/zabbix_group_module.html)
### Releng Machine List
The following machines are those which are relevant to Releng.
```
machines:
[releng_compose]
compose-x86-01.iad2.fedoraproject.org
compose-branched01.iad2.fedoraproject.org
compose-rawhide01.iad2.fedoraproject.org
compose-iot01.iad2.fedoraproject.org
[releng_compose_stg]
compose-x86-01.stg.iad2.fedoraproject.org
```
First, install the Zabbix agent on the `releng_compose:releng_compose_stg` hosts via the `zabbix/zabbix_agent` Ansible role [11]. We added the role to the `groups/releng-compose.yml` playbook, as this is the playbook that targets these hosts.
```
diff --git a/playbooks/groups/releng-compose.yml b/playbooks/groups/releng-compose.yml
index 04b68aba4f..69c0acdad3 100644
--- a/playbooks/groups/releng-compose.yml
+++ b/playbooks/groups/releng-compose.yml
@@ -28,6 +28,8 @@
- ipa/client
- rkhunter
- nagios_client
+ - zabbix/zabbix_agent
- collectd/base
- sudo
- role: keytab/service
```
Run the playbook with `sudo rbac-playbook groups/releng-compose.yml` on the `batcave01` host. Then check the hosts section of the Zabbix console to confirm that Zabbix has picked up the new hosts [5][17]. To get access to the Zabbix server, your FAS user must be a member of the `sysadmin-noc` group; once it is, run the playbook `sudo rbac-playbook groups/zabbix.yml`. After that run you can authenticate via FAS on the Zabbix web console.
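The console check above can also be done programmatically. The following is a minimal sketch, assuming the Zabbix JSON-RPC API is reachable (normally at `api_jsonrpc.php` under the web UI URL). It only builds the `host.get` request body and compares a result list against the expected hosts; sending the request and passing the API token are deliberately left out, since how the token is supplied depends on the Zabbix version. The function names `build_host_get_request` and `missing_hosts` are hypothetical helpers, not part of any Fedora Infra tooling.

```python
import json

# Production hosts taken from the Releng machine list in this SOP.
RELENG_HOSTS = [
    "compose-x86-01.iad2.fedoraproject.org",
    "compose-branched01.iad2.fedoraproject.org",
    "compose-rawhide01.iad2.fedoraproject.org",
    "compose-iot01.iad2.fedoraproject.org",
]


def build_host_get_request(hostnames, request_id=1):
    """Build a JSON-RPC 2.0 body asking Zabbix which of these hosts it knows.

    Authentication is intentionally omitted (assumption: the caller adds it).
    """
    return {
        "jsonrpc": "2.0",
        "method": "host.get",
        "params": {
            "output": ["hostid", "host"],
            "filter": {"host": list(hostnames)},
        },
        "id": request_id,
    }


def missing_hosts(expected, api_result):
    """Return the expected hosts that are absent from a host.get result list."""
    known = {entry["host"] for entry in api_result}
    return [name for name in expected if name not in known]


if __name__ == "__main__":
    # Print the request body so it can be inspected or sent with any HTTP client.
    print(json.dumps(build_host_get_request(RELENG_HOSTS), indent=2))
```

Any host returned by `missing_hosts` has not registered with the server yet, which usually means the agent role has not run on it or the agent cannot reach the server.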
### Requirements
No composes are run in the staging environment, so this unfortunately has to be implemented in the production environment only.
Existing monitoring tracks whether a compose fails or finishes successfully; however, there is currently no monitoring to detect a compose that hangs.
Cronjobs are installed on the releng hosts via the following Ansible task [13]. There are 8 cronjobs in total.
- 1: ftbfs weekly cron job `"ftbfs.cron" /etc/cron.weekly/ on compose-x86-01`
- 2: branched compose cron `"branched" /etc/cron.d/branched on compose-branched01.iad2`
- 3: rawhide compose cron `"rawhide" /etc/cron.d/rawhide on compose-rawhide01.iad2`
- 4: cloud-updates compose cron `"cloud-updates" /etc/cron.d/cloud-updates on compose-x86-01.iad2`
- 5: container-updates compose cron `"container-updates" /etc/cron.d/container-updates on compose-x86-01.iad2`
- 6: clean-amis cron `"clean-amis.j2" /etc/cron.d/clean-amis on compose-x86-01.iad2`
- 7: rawhide-iot compose cron `"rawhide-iot" /etc/cron.d/rawhide-iot on compose-iot01.iad2`
- 8: sig_policy cron `"sig_policy.j2" /etc/cron.d/sig_policy on compose-x86-01.iad2`
We need at least one Zabbix check per cronjob. Each check relies on a marker file that the cronjob itself manages:
- When a cronjob starts:
  - create a file in `/tmp/name-of-cron-job`
- When a cronjob ends:
  - delete the file in `/tmp/name-of-cron-job`
- If the file exists, assume the cronjob is running. If the file has existed for longer than a set period, assume the cronjob has stalled.
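The decision rule in the list above can be sketched in a few lines of Python. This is only an illustration of the logic, not how Zabbix evaluates it (Zabbix uses an item and a trigger, described later). The function name `cron_state` and the one-hour default threshold are assumptions made for the example, and the file's modification time stands in for "how long the file has existed".

```python
import os
import tempfile
import time

STALL_AFTER_SECONDS = 3600  # assumed threshold: the SOP starts with 1 hour


def cron_state(marker_path, now=None, stall_after=STALL_AFTER_SECONDS):
    """Classify a cronjob as 'idle', 'running' or 'stalled' from its marker file."""
    try:
        created = os.path.getmtime(marker_path)
    except FileNotFoundError:
        # No marker file: the cronjob is not running.
        return "idle"
    now = time.time() if now is None else now
    # Marker present: running, unless it has been there for too long.
    return "stalled" if now - created > stall_after else "running"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "fedora-compose-rawhide")
        print(cron_state(marker))
        open(marker, "w").close()
        print(cron_state(marker))
        print(cron_state(marker, now=time.time() + 2 * STALL_AFTER_SECONDS))
```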
### Implementation
- Create a custom template called `fedora releng compose cronjobs`.
- Create a host group called `fedora releng compose`.
- Add the production hosts from the Ansible group `releng_compose` to this host group. Staging hosts are left out because we currently don't run composes in staging.
- In this template create an item, one for each cronjob.
- In this template create a trigger, one for each cronjob. Initially set the trigger to alert when the item returns true for more than 1 hour. This can be changed later, once we understand how long these cronjobs normally run.
- Implement this template in JSON; see [12] for inspiration and format examples. The template can then be placed in `roles/zabbix/zabbix_server/files/zabbix_templates/releng_compose_cronjobs.json`.
- Create a task in `roles/zabbix/zabbix_server/tasks` that uses the Zabbix API key to create this template on the server; see [1].
- Use the `community.zabbix` Ansible collection to link this template to the releng hosts.
- Update each cronjob in Ansible so that it creates a file such as `/tmp/name-of-cron-job` when it starts and deletes it when it completes.
### Create a custom template
Using the Ansible module `community.zabbix.zabbix_template`, create a template:
```
# Create template
- name: Set API token
  tags: always
  set_fact:
    ansible_zabbix_auth_key: XXXXAPIKEYXXXX
    ansible_network_os: community.zabbix.zabbix
    ansible_connection: httpapi
    ansible_httpapi_port: 443
    ansible_httpapi_use_ssl: true
    ansible_httpapi_validate_certs: false
    ansible_host: ZABBIX_SERVER_HOSTNAME
    # Empty because the Zabbix web UI is served from /; set it if the UI
    # lives under a sub-path, e.g. http://<FQDN>/zabbix
    ansible_zabbix_url_path: ""

# If the template was built in the Zabbix GUI, the completed template can be
# exported with the following two tasks
- name: Get Zabbix template as JSON
  community.zabbix.zabbix_template_info:
    template_name: fedora releng compose cronjobs
    format: json
    omit_date: yes
  register: zabbix_template_json

- name: Write Zabbix template to JSON file
  local_action:
    module: copy
    content: "{{ zabbix_template_json['template_json'] }}"
    dest: "roles/zabbix/zabbix_server/files/zabbix_templates/releng_compose_cronjobs.json"

# Import the template JSON
- name: Import Zabbix templates from JSON
  community.zabbix.zabbix_template:
    template_json: "{{ lookup('file', 'files/zabbix_templates/releng_compose_cronjobs.json') }}"
    state: present
```
### Create a host group
```
# Create host-group
- name: Set API token
  tags: always
  set_fact:
    ansible_zabbix_auth_key: XXXXAPIKEYXXXX
    ansible_network_os: community.zabbix.zabbix
    ansible_connection: httpapi
    ansible_httpapi_port: 443
    ansible_httpapi_use_ssl: true
    ansible_httpapi_validate_certs: false
    ansible_host: ZABBIX_SERVER_HOSTNAME
    ansible_zabbix_url_path: ""  # If Zabbix WebUI runs on non-default zabbix path (/),e.g. http://<FQDN>/zabbix

- name: Create host groups
  # set task level variables as we change ansible_connection plugin here
  community.zabbix.zabbix_group:
    state: present
    host_groups:
      - fedora releng compose
```
### Add production releng_compose hosts to the Zabbix host group
```
- name: Set API token
  tags: always
  set_fact:
    ansible_zabbix_auth_key: XXXXAPIKEYXXXX
    ansible_network_os: community.zabbix.zabbix
    ansible_connection: httpapi
    ansible_httpapi_port: 443
    ansible_httpapi_use_ssl: true
    ansible_httpapi_validate_certs: false
    ansible_host: ZABBIX_SERVER_HOSTNAME
    ansible_zabbix_url_path: ""  # If Zabbix WebUI runs on non-default zabbix path (/),e.g. http://<FQDN>/zabbix

# Add each production compose host to the host group and link the template
- name: Add hosts to the host group and link the template
  community.zabbix.zabbix_host:
    host_name: "{{ item }}"
    host_groups:
      - fedora releng compose
    link_templates:
      - fedora releng compose cronjobs
    force: false
  with_items:
    - compose-branched01.iad2.fedoraproject.org
    - compose-iot01.iad2.fedoraproject.org
    - compose-rawhide01.iad2.fedoraproject.org
    - compose-x86-01.iad2.fedoraproject.org
```
### In this template create an item, one for each cronjob
- Set the type to `Zabbix agent (active)`.
- Set the history storage period to `7d` (one week).
- Set the update interval to `1m` so the check runs every minute.
- Set the key to something like the following, replacing the `*` with the name of the cronjob, e.g. `rawhide`:
```
vfs.file.exists[/tmp/fedora-compose-*]
```
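Since one item is needed per cronjob, the keys can be generated rather than typed by hand. The sketch below assumes short cronjob names derived from the cron file names listed under Requirements, and the `/tmp/fedora-compose-` prefix from the key above. The names `CRONJOBS`, `marker_path` and `item_key` are hypothetical.

```python
# Short names assumed from the cron files listed in the Requirements section.
CRONJOBS = [
    "ftbfs",
    "branched",
    "rawhide",
    "cloud-updates",
    "container-updates",
    "clean-amis",
    "rawhide-iot",
    "sig_policy",
]

MARKER_PREFIX = "/tmp/fedora-compose-"


def marker_path(cronjob):
    """Path of the marker file a cronjob creates while it is running."""
    return MARKER_PREFIX + cronjob


def item_key(cronjob):
    """Zabbix item key that reports whether the cronjob's marker file exists."""
    return f"vfs.file.exists[{marker_path(cronjob)}]"


if __name__ == "__main__":
    for name in CRONJOBS:
        print(item_key(name))
```

Whatever naming is chosen, the path in the item key has to match exactly the file that the cronjob creates, otherwise the item always returns 0 and the trigger never fires.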
### In this template create a trigger, one for each cronjob
- Set the trigger window to 1 hour initially.
- Set the severity to `High`.
- In this example `fedora releng compose cronjobs` is the template that is attached to the hosts in the `fedora releng compose` host group.
- Set the expression to something like the following, replacing the `*` with the file name used in the key of the matching item:
```
last(/fedora releng compose cronjobs/vfs.file.exists[/tmp/fedora-compose-*])=1 and min(/fedora releng compose cronjobs/vfs.file.exists[/tmp/fedora-compose-*],1h)>0
```
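To make the expression's behaviour concrete, here is a small sketch that renders the expression for a given cronjob and simulates how it evaluates over a window of samples. It assumes one sample per minute, each being the 0 or 1 returned by `vfs.file.exists`. The helper names `trigger_expression` and `trigger_fires` are hypothetical, and the simulation is a simplification of what the Zabbix server does.

```python
TEMPLATE = "fedora releng compose cronjobs"


def trigger_expression(cronjob, window="1h"):
    """Render the trigger expression for one cronjob's marker-file item."""
    key = f"vfs.file.exists[/tmp/fedora-compose-{cronjob}]"
    return f"last(/{TEMPLATE}/{key})=1 and min(/{TEMPLATE}/{key},{window})>0"


def trigger_fires(samples):
    """Evaluate `last(...)=1 and min(...,window)>0` over the window's samples.

    samples: item values (0 or 1) from the evaluation window, oldest first.
    """
    if not samples:
        return False
    return samples[-1] == 1 and min(samples) > 0


if __name__ == "__main__":
    print(trigger_expression("rawhide"))
    print(trigger_fires([1] * 60))             # marker present for the whole hour
    print(trigger_fires([0] * 10 + [1] * 50))  # job started 50 minutes ago
```

The `min(...)>0` condition means that a single sample of 0 anywhere in the window keeps the trigger quiet, so a cronjob only alerts once its marker file has been present for the entire window.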
### Modify each cronjob in Ansible
- When a cronjob starts:
  - create a file in `/tmp/name-of-cron-job`
- When a cronjob ends:
  - delete the file in `/tmp/name-of-cron-job`
- If the file exists, assume the cronjob is running. If the file has existed for longer than a set period, assume the cronjob has stalled.
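One way to get this create-then-delete behaviour without editing every compose script is a small wrapper that the cron entry calls instead of the script itself. The following is a minimal sketch under that assumption; the wrapper, its function names and the `fedora-compose-` file prefix (taken from the item key used earlier) are illustrative, and the real cronjobs may be better served by adding the two steps directly to their templates.

```python
import os
import subprocess

MARKER_DIR = "/tmp"
MARKER_PREFIX = "fedora-compose-"


def run_with_marker(name, command, marker_dir=MARKER_DIR):
    """Run `command`, keeping a marker file in place for as long as it runs.

    Returns the command's exit status.
    """
    marker = os.path.join(marker_dir, MARKER_PREFIX + name)
    open(marker, "w").close()  # cronjob starts: create the marker file
    try:
        return subprocess.call(command)
    finally:
        # Cronjob ends, whether it succeeded or failed: delete the marker file.
        try:
            os.remove(marker)
        except FileNotFoundError:
            pass


def main(argv):
    """Entry point: argv is [program, NAME, COMMAND, ARGS...]."""
    if len(argv) < 3:
        print("usage: NAME COMMAND [ARGS...]")
        return 2
    return run_with_marker(argv[1], argv[2:])
```

Calling `sys.exit(main(sys.argv))` at the bottom of the file would turn this into a script that a cron entry can invoke with the cronjob name followed by the original command. If the wrapped command hangs, the marker file stays in place, which is exactly the condition the trigger is meant to catch.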
### Future Work
Replace the following custom releng monitoring tools [15], [16] with Zabbix:
- https://pagure.io/releng/failed-composes/issues
- https://pagure.io/releng/compose-tracker