Simple OpenNMS/Minion Environment using Kafka in Azure

This lab starts an OpenNMS instance and a 3-node Zookeeper (ZK)/Kafka cluster in Azure, plus two Minions on your machine through Multipass, using Kafka for communication between them. It is intended for learning purposes.

The lab doesn't cover security by default (user authentication and encryption), which is crucial if we ever want to expose the Kafka cluster to the Internet. A separate section covers the required changes for this.

Keep in mind that nothing prevents us from skipping the cloud provider and doing everything with Multipass (or VirtualBox, Hyper-V, or VMware). The reason for using a cloud provider is to prove that OpenNMS can monitor unreachable devices via Minion. Similarly, we could use any other cloud provider instead of Azure. However, I won't explain how to port the solution here.

Time synchronization across all the instances involved in this solution is mandatory. Failing to guarantee it could lead to undesired side effects. Cloud providers essentially guarantee it, which is why I do not include explicit instructions for it, but please be aware of it.


The scripts used throughout this tutorial rely on envsubst; make sure to have it installed.
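As a minimal sketch of how to verify the prerequisites before starting, the following helper prints every required tool that is not on the PATH. The function name missing_tools is hypothetical, and the list of tools (envsubst, az, multipass) is taken from the commands used later in this tutorial.

```shell
#!/bin/sh
# Print the name of every tool passed as an argument that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tools this lab relies on: envsubst (part of gettext), az (Azure CLI), multipass.
missing=$(missing_tools envsubst az multipass)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
fi
```

If nothing is printed, all three tools were found.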

Make sure to log into Azure using az login prior to creating the VMs.

If you have a restricted account in Azure, make sure you have the Network Contributor role and the Virtual Machine Contributor role associated with your Azure AD account for the resource group where you want to create the VMs. Of course, either Owner or Contributor at the resource group level also works.

All of the following assumes you have a macOS or Linux machine or VM from which you can issue all the commands.

Create common Environment Variables

export PREFIX="$USER" # String to prepend to the name of all Azure resources
export RG_NAME="OpenNMS" # Change it to use a shared one
export LOCATION="eastus" # Azure Region
export DOMAIN="$" # Public Azure DNS Domain
export TIMEZONE="America/New_York"
export VNET_CIDR=""
export VNET_SUBNET=""
export VNET_NAME="$PREFIX-vnet"
export VNET_SUBNET_NAME="subnet1"
export KAFKA_VM_SIZE="Standard_D2s_v3" # 2 VCPU, 8 GB of RAM
export ZK_HEAP_SIZE="1G" # Must fit KAFKA_VM_SIZE
export KAFKA_URL=""
export KAFKA_JAVA_VERSION="11" # 8 for < 2.1.0; 11 for > 2.1.0
export KAFKA_HEAP_SIZE="2G" # Must fit KAFKA_VM_SIZE
export KAFKA_PARTITIONS="9" # > Number of Minions per location
export KAFKA_CLUSTER_SIZE="3" # Total instances of Kafka+ZK
export KAFKA_RF="2" # < KAFKA_CLUSTER_SIZE
export ONMS_VM_NAME="$PREFIX-onms01"
export ONMS_VM_SIZE="Standard_D2s_v3" # 2 VCPU, 8 GB of RAM
export ONMS_HEAP_SIZE="4096" # Expressed in MB and must fit ONMS_VM_SIZE
export MINION_LOCATION="Durham"
export MINION_HEAP_SIZE="1G" # Must fit VM RAM

We haven't tested Kafka 3.0.0, so please use 2.8.x or older for now.

Feel free to change the content and keep in mind that $PREFIX is what we will use throughout this tutorial to identify all the resources we will create in Azure uniquely.

Do not confuse the Azure Location or Region with the Minion Location; they are unrelated.

We're going to leverage the Azure DNS services to avoid having to remember and use public IP addresses. This also helps if we're interested in having HTTPS with valid certificates, as explained here, not only for OpenNMS but also to enable SSL/TLS in Kafka.

In Azure, the default public DNS names follow the same pattern:


To make the VMs FQDN unique, we're going to add the username to the VM name. For instance, the OpenNMS FQDN would be:

The above is what we can use to access the VM via SSH and to configure Minions.

Create the Azure Resource Group

This is a necessary step, as every resource in Azure must belong to a resource group and a location.

However, you can omit the following command and use an existing one if you prefer. In that case, make sure to adjust the environment variable RG_NAME so the subsequent commands will target the correct group.

az group create -n $RG_NAME -l $LOCATION --tags Owner=$USER

Create the Virtual Network

I prefer to create the VNET myself instead of letting Azure do it for me, especially when we want to guarantee that all the VMs will exist in the same one.

az network vnet create -g $RG_NAME \
  --name $VNET_NAME \
  --address-prefix $VNET_CIDR \
  --subnet-name $VNET_SUBNET_NAME \
  --subnet-prefix $VNET_SUBNET \
  --tags Owner=$USER \
  --output table

Create cloud-init configuration template for Kafka

The following cloud-init template assumes a 3-node cluster, where each VM has Zookeeper and Kafka configured and running on Ubuntu LTS.

For simplicity, Zookeeper and Kafka will run together on each machine. In production, each cluster should have its own instances, because Zookeeper should not grow the same way Kafka does. A ZK cluster should always have an odd number of members (which is not a requirement for Kafka), and traffic across ZK members grows exponentially with the number of instances. A ZK cluster of 5 members can manage multiple dozens of Kafka brokers; with 7 it can manage hundreds, and with 9 it can manage thousands.

For the 3-node cluster, the VMs will be named as follows:
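To illustrate why an odd number of Zookeeper members is preferred, here is a small sketch of the majority (quorum) arithmetic. The function names are hypothetical; the formulas are the standard majority rule, not something specific to this lab.

```shell
#!/bin/sh
# Majority (quorum) size for an ensemble of N Zookeeper servers.
zk_quorum() { echo $(( $1 / 2 + 1 )); }

# Number of servers that can fail while a majority is still available.
zk_tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 3 4 5; do
  echo "$n servers: quorum=$(zk_quorum "$n"), tolerated failures=$(zk_tolerated_failures "$n")"
done
```

A 4-member ensemble tolerates a single failure, exactly like a 3-member one, so the extra member adds traffic without adding resilience.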

  • agalue-kafka-1
  • agalue-kafka-2
  • agalue-kafka-3

Note the hostnames include the chosen username to make them unique, which is mandatory for shared resource groups and the default Azure DNS public domain on the chosen region.

Remember that each VM in Azure is reachable within the same VNet from any other VM through its hostname.

Of all the environment variables you'll encounter in the upcoming template, two are crucial: PUBLIC_FQDN and INSTANCE_ID.


We must replace the environment variable PUBLIC_FQDN in advertised.listeners with the public FQDN or IP of the VM when configuring the application, before running it for the first time. With that in mind, there will be two listeners: one to be used within the VNet (which is what OpenNMS would use, on port 9092), and another associated with the public FQDN (on port 9094), to be used by external Minions (outside Azure).

Similarly, we must replace INSTANCE_ID with a unique numeric value per instance, used as the broker ID for Kafka and in the myid file for Zookeeper. Both are mandatory to identify each instance within its respective cluster.

The number of topic partitions must be greater than the number of Minions on a given location and greater than the number of brokers in the cluster.
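The sizing rules above can be sketched as a quick sanity check. This is only an illustration: the function name check_kafka_sizing is hypothetical, and the replication factor rule follows the comment next to KAFKA_RF in the variable list (lower than the cluster size).

```shell
#!/bin/sh
# Return 0 when the sizing rules hold; print each violated rule otherwise.
# Arguments: partitions, Minions per location, brokers, replication factor.
check_kafka_sizing() {
  partitions=$1 minions=$2 brokers=$3 rf=$4
  status=0
  if [ "$partitions" -le "$minions" ]; then
    echo "partitions ($partitions) must be greater than Minions per location ($minions)"
    status=1
  fi
  if [ "$partitions" -le "$brokers" ]; then
    echo "partitions ($partitions) must be greater than brokers ($brokers)"
    status=1
  fi
  if [ "$rf" -ge "$brokers" ]; then
    echo "replication factor ($rf) must be lower than brokers ($brokers)"
    status=1
  fi
  return $status
}

# Values used in this lab: 9 partitions, 2 Minions, 3 brokers, RF of 2.
check_kafka_sizing 9 2 3 2 && echo "sizing OK"
```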

Create a YAML file called /tmp/kafka-template.yaml with the following content:

#cloud-config
package_upgrade: true
timezone: $TIMEZONE

users:
- default
- name: kafka

write_files:
- owner: root:root
  path: /etc/security/limits.d/kafka.conf
  content: |
    * soft nofile 100000
    * hard nofile 100000

- owner: root:root
  path: /etc/sysctl.d/99-kafka.conf
  content: |
    net.ipv4.tcp_keepalive_time=60
    net.ipv4.tcp_keepalive_probes=3
    net.ipv4.tcp_keepalive_intvl=10
    net.core.rmem_max=16777216
    net.core.wmem_max=16777216
    net.core.rmem_default=16777216
    net.core.wmem_default=16777216
    net.core.optmem_max=40960
    net.ipv4.tcp_rmem=4096 87380 16777216
    net.ipv4.tcp_wmem=4096 65536 16777216
    net.ipv4.tcp_window_scaling=1
    net.core.netdev_max_backlog=2500
    net.core.somaxconn=65000
    vm.swappiness=1
    vm.zone_reclaim_mode=0
    vm.max_map_count=1048575

- owner: root:root
  permissions: '0400'
  path: /etc/snmp/snmpd.conf
  content: |
    rocommunity public default
    syslocation Azure - $LOCATION
    syscontact $USER
    dontLogTCPWrappersConnects yes
    disk /

- owner: root:root
  path: /etc/systemd/system/zookeeper.service
  content: |
    [Unit]
    Description=Apache Zookeeper server
    Documentation=

    [Service]
    Type=simple
    User=kafka
    Group=kafka
    Environment="KAFKA_HEAP_OPTS=-Xmx$ZK_HEAP_SIZE -Xms$ZK_HEAP_SIZE"
    ExecStart=/opt/kafka/bin/ /opt/kafka/config/
    ExecStop=/opt/kafka/bin/

    [Install]

- owner: root:root
  path: /etc/systemd/system/kafka.service
  content: |
    [Unit]
    Description=Apache Kafka Server
    Documentation=
    Wants=zookeeper.service
    After=zookeeper.service

    [Service]
    Type=simple
    User=kafka
    Group=kafka
    LimitNOFILE=100000
    Environment="KAFKA_HEAP_OPTS=-Xmx$KAFKA_HEAP_SIZE -Xms$KAFKA_HEAP_SIZE"
    Environment=" -Djava.rmi.server.hostname=%H"
    Environment="JMX_PORT=9999"
    ExecStart=/opt/kafka/bin/ /opt/kafka/config/
    ExecStop=/opt/kafka/bin/

    [Install]

- owner: root:root
  path: /tmp/ # Designed for a 3-node ZK cluster
  content: |
    dataDir=/data/zookeeper
    tickTime=2000
    clientPort=2181
    initLimit=10
    syncLimit=5
    # Cluster Members
    server.1=$PREFIX-kafka-1:2888:3888;2181
    server.2=$PREFIX-kafka-2:2888:3888;2181
    server.3=$PREFIX-kafka-3:2888:3888;2181

- owner: root:root
  path: /tmp/ # Designed for a 3-node ZK cluster
  content: |
    $INSTANCE_ID
    log.dirs=/data/kafka
    zookeeper.connect=$PREFIX-kafka-1:2181,$PREFIX-kafka-2:2181,$PREFIX-kafka-3:2181
    # Connection
    advertised.listeners=INSIDE://:9092,OUTSIDE://$PUBLIC_FQDN:9094
    listeners=INSIDE://:9092,OUTSIDE://:9094,OUTSIDE:PLAINTEXT
    # Replication
    offsets.topic.replication.factor=$KAFKA_RF
    default.replication.factor=$KAFKA_RF
    min.insync.replicas=1
    # Must be greater than number of Minions per Location
    num.partitions=$KAFKA_PARTITIONS
    # Recommended for the OpenNMS Kafka Producer
    message.max.bytes=5000000
    replica.fetch.max.bytes=5000000
    compression.type=producer
    # Cleanup (remove segments older than a week)
    log.retention.hours=168
    log.retention.bytes=-1
    # Required for OpenNMS and Minions
    auto.create.topics.enable=true
    # Recommended to avoid disrupting messages workflow
    delete.topic.enable=false

packages:
- snmp
- snmpd
- jq
- openjdk-$KAFKA_JAVA_VERSION-jre-headless

runcmd:
- sysctl --system
- wget -O /tmp/kafka.tar.gz $KAFKA_URL
- cd /opt
- mkdir kafka
- tar -xvzf /tmp/kafka.tar.gz -C kafka --strip-components 1
- mv -f /tmp/*.properties /opt/kafka/config/
- mkdir -p /data/zookeeper /data/kafka
- chown -R kafka:kafka /data /opt/kafka*
- echo $INSTANCE_ID > /data/zookeeper/myid
- systemctl daemon-reload
- systemctl --now enable zookeeper
- systemctl --now enable kafka
- systemctl --now enable snmpd

The reason for increasing the message size (message.max.bytes, replica.fetch.max.bytes) is to avoid problems when forwarding collected metrics to Kafka via the Kafka Producer feature of OpenNMS, which I'm planning to enable.

If, for instance, you want to use an older version of Kafka, you can change the JDK package and the Kafka URL so the template applies the correct ones:

export KAFKA_URL=""
export KAFKA_JAVA_VERSION="8"

Also, edit the template and remove ;2181 from the server entries in the Zookeeper configuration, as expressing the client port that way requires Zookeeper 3.5 or newer.

Start Broker Instances

# seq works in both bash and zsh; {1..$N} with a variable only expands in zsh
for i in $(seq 1 $KAFKA_CLUSTER_SIZE); do
  VM_NAME="$PREFIX-kafka-$i"
  echo "Creating VM $VM_NAME..."
  export INSTANCE_ID="$i"
  export PUBLIC_FQDN="$VM_NAME.$DOMAIN"
  envsubst < /tmp/kafka-template.yaml > $VM_NAME.yaml
  az vm create --resource-group $RG_NAME --name $VM_NAME \
    --size $KAFKA_VM_SIZE \
    --image canonical:0001-com-ubuntu-server-focal:20_04-lts:latest \
    --admin-username $USER \
    --ssh-key-values ~/.ssh/ \
    --vnet-name $VNET_NAME \
    --subnet $VNET_SUBNET_NAME \
    --public-ip-sku Standard \
    --public-ip-address-dns-name $VM_NAME \
    --custom-data $VM_NAME.yaml \
    --tags Owner=$USER \
    --no-wait
done

Note that I'm assuming the usage of SSH Keys for password-less access. Make sure to have a public key located at ~/.ssh/, or update the az vm create command.

The above will start all the VMs simultaneously using public IP addresses and FQDNs, to avoid access problems with external Minions and reconfiguration issues with the Kafka advertised listeners. However, like the public IPs, the private IPs will be dynamic. Fortunately, this is not going to be a problem as we're going to use DNS to access Kafka.

Keep in mind that the cloud-init process starts once the VM is running, meaning we should wait a few minutes after the VMs are created before using them.

Then, allow access for remote Minions:

for i in $(seq 1 $KAFKA_CLUSTER_SIZE); do
  VM_NAME="$PREFIX-kafka-$i"
  az vm open-port -g $RG_NAME -n $VM_NAME \
    --port 9094 --priority 100 --output table
done

You can inspect the generated YAML files to see the final content used on each VM (after applying the env-var substitutions).
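Because envsubst replaces unset variables with an empty string, a check before generating the files can save a failed deployment. The following is a minimal sketch; the function name unset_template_vars is hypothetical, and it only recognizes the $NAME and ${NAME} forms.

```shell
#!/bin/sh
# List variables referenced in a template ($NAME or ${NAME}) that are
# missing from the environment or empty; envsubst would silently replace
# them with an empty string.
unset_template_vars() {
  grep -oE '\$\{?[A-Za-z_][A-Za-z0-9_]*' "$1" | tr -d '${' | sort -u |
  while read -r name; do
    [ -n "$(printenv "$name")" ] || echo "$name"
  done
}
```

For example, running unset_template_vars /tmp/kafka-template.yaml right after exporting INSTANCE_ID and PUBLIC_FQDN should print nothing when every variable used by the template has a value.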

In case there is a problem, SSH into the VM using the public IP and the provided credentials and check /var/log/cloud-init-output.log to verify the progress and the status of the cloud-init execution.

Validate Zookeeper and Kafka status

To make sure the Zookeeper cluster started, we can use the "4 letter words" commands via the embedded web server, available when using version 3.5 or newer. For instance:

curl http://$(hostname):8080/commands/monitor

The above gives us general information, including the server_state, which can be leader or follower.

To get statistics:

curl http://$(hostname):8080/commands/stats

For Zookeeper version 3.4 or older (for instance, when using older versions of Kafka), you can still use the deprecated way to verify:

echo stat | nc $(hostname) 2181; echo

From Kafka's perspective, we can verify how each broker has registered via Zookeeper or follow this guide to create a topic and use the console producer and consumer to validate its functionality.

List Broker IDs:

/opt/kafka/bin/ $(hostname) ls /brokers/ids

We should get:

[1, 2, 3]

If that's not the case, SSH into the broker that is not listed and make sure Kafka is running. Kafka may have failed to register with Zookeeper and start because of how the VMs are initialized: the whole Zookeeper cluster should start first, then Kafka, but as we're not guaranteeing that order, some instances might fail to start on their own. The procedure was designed to avoid this situation as much as possible.
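To spot which broker is missing, the list returned by Zookeeper can be compared against the expected IDs. This is a small sketch assuming the IDs run from 1 to the cluster size, as in this lab; the function name missing_brokers is hypothetical.

```shell
#!/bin/sh
# Print the expected broker IDs (1..N) that are absent from the list
# returned by "ls /brokers/ids", for example "[1, 3]".
# Arguments: the list as printed by Zookeeper, and the cluster size.
missing_brokers() {
  ids=$(echo "$1" | sed 's/[^0-9,]//g' | tr ',' ' ')
  i=1
  while [ "$i" -le "$2" ]; do
    case " $ids " in
      *" $i "*) ;;
      *) echo "$i" ;;
    esac
    i=$((i + 1))
  done
}

# With brokers 1 and 3 registered in a 3-node cluster, this prints 2.
missing_brokers "[1, 3]" 3
```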

Get the broker basic configuration:

/opt/kafka/bin/ $(hostname) get /brokers/ids/1 | egrep '^\{' | jq

If we run it from the first instance, we should get:

{
  "features": {},
  "listener_security_protocol_map": {
    "INSIDE": "PLAINTEXT",
    "OUTSIDE": "PLAINTEXT"
  },
  "endpoints": [
    "INSIDE://agalue-kafka-1:9092",
    "OUTSIDE://"
  ],
  "jmx_port": 9999,
  "port": 9092,
  "host": "agalue-kafka-1",
  "version": 5,
  "timestamp": "1616265688431"
}

Note the two listeners. Clients within Azure, like OpenNMS, would use the INSIDE one on port 9092, pointing to the local FQDN or hostname of the VM (and remember they are resolvable via DNS within the same VNet). In contrast, clients outside Azure, like Minions, would use the OUTSIDE one on port 9094 pointing to the Public FQDN of each Kafka instance (accessible thanks to the NSG associated with each VM).
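As an illustration of how the two listeners translate into client settings, the sketch below builds a bootstrap.servers value for either side. The function name bootstrap_servers and the example.test domain are hypothetical; the host naming and ports follow the conventions used in this lab.

```shell
#!/bin/sh
# Build a bootstrap.servers value for the first N brokers.
# With a domain, use the public FQDNs and the OUTSIDE port (9094);
# without one, use the hostnames and the INSIDE port (9092).
# Arguments: prefix, number of brokers, optional public DNS domain.
bootstrap_servers() {
  prefix=$1 count=$2 domain=${3:-}
  list="" i=1
  while [ "$i" -le "$count" ]; do
    if [ -n "$domain" ]; then
      entry="$prefix-kafka-$i.$domain:9094"
    else
      entry="$prefix-kafka-$i:9092"
    fi
    list="${list:+$list,}$entry"
    i=$((i + 1))
  done
  echo "$list"
}

# For clients inside the VNet (like OpenNMS):
bootstrap_servers agalue 2
# For clients outside Azure (like Minions), with a placeholder domain:
bootstrap_servers agalue 2 example.test
```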

Kafka defaults to the hostname or FQDN of the primary interface when we don't explicitly specify it on the listener.

As Azure DNS works by default, hostnames are resolvable by all VMs within the same VNET. For this reason, Kafka will use the correct one.

However, if you're using another cloud provider or bare metal, make sure DNS works across all the VMs. Otherwise, change the INSIDE listener to explicitly point to the private IP address of the VM and the OUTSIDE listener to point to the public IP address of the VM, and make sure to use static IPs if you're going to rely on them.

Verification for newer versions of Kafka

Another way to verify the behavior is using the console producer and console consumer to verify that we can send and receive messages through a given topic.

To do that, for recent versions of Kafka, let's create a Test topic:

/opt/kafka/bin/ \
  --bootstrap-server $(hostname):9092 \
  --create --topic Test --replication-factor 2 --partitions 3

Then, start a console producer from one of the brokers:

/opt/kafka/bin/ \
  --bootstrap-server $(hostname):9092 --topic Test

From another broker (separate SSH session), start a console consumer:

/opt/kafka/bin/ \
  --bootstrap-server $(hostname):9092 --topic Test

Go back to the terminal on which the console producer is running, type a message, and hit enter. Then, switch to the console consumer terminal, and we should see the message sent. Use Ctrl+C to stop the producer and consumer.

A more comprehensive test would be to download Kafka locally on your machine and run either the producer or the consumer there (use port 9094 and the public FQDN or IP of one of the brokers). That serves to test connectivity from the Internet.

Verification for older versions of Kafka

To create the Test topic:

/opt/kafka/bin/ \
  --zookeeper $(hostname):2181 \
  --create --topic Test --replication-factor 2 --partitions 3

As you can see, the difference is talking to Zookeeper directly (using --zookeeper), instead of reaching Kafka (using --bootstrap-server).

For the producer, use --broker-list instead of --bootstrap-server, for instance:

/opt/kafka/bin/ \
  --broker-list $(hostname):9092 --topic Test

For the consumer, it is the same as in newer versions:

/opt/kafka/bin/ \
  --bootstrap-server $(hostname):9092 --topic Test

Topic settings

The retention settings are the defaults (for instance, log.retention.hours and log.retention.bytes at the broker level, or their equivalents, such as retention.bytes, at the topic level). It is recommended to reduce them for the RPC topics because, due to the TTL, it is not worth keeping those messages for long. That's why 1 hour is more than enough.

Having said that, data pruning happens on closed segments only, meaning Kafka won't delete old records from the active segment (the one currently being updated with new records). That means you should also change the segment settings (such as segment.bytes) at the topic level to allow deletion. These can be equal to or less than the expected retention. Of course, it is crucial to have the single-topic feature enabled for RPC in both Minion and OpenNMS.

However, we can only change these settings after the topics are created by either OpenNMS or the Minions, using the Kafka CLI tools or specialized applications like topicctl or CMAK.

For instance, on newer versions of Kafka:

/opt/kafka/bin/ --alter \
  --bootstrap-server $(hostname):9092 \
  --entity-type topics \
  --entity-name OpenNMS.rpc-response \
  --add-config \
  --add-config \

For older versions:

/opt/kafka/bin/ --alter \
  --zookeeper $(hostname):2181 \
  --entity-type topics \
  --entity-name OpenNMS.rpc-response \
  --add-config \
  --add-config

Note that topic level settings and broker level settings are slightly different. The topic level settings override the broker level settings when they exist.
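As a sketch of the arithmetic behind a one-hour retention, the helper below converts hours to milliseconds and builds a value suitable for --add-config. It assumes the time-based topic settings retention.ms and segment.ms (standard Kafka topic-level settings), and using the same value for both is only one possible choice. The function name rpc_topic_config is hypothetical.

```shell
#!/bin/sh
# Build a value for --add-config from a retention expressed in hours.
# Assumption: retention.ms and segment.ms are set to the same value, so a
# segment is closed (and becomes eligible for deletion) within the retention.
rpc_topic_config() {
  ms=$(( $1 * 60 * 60 * 1000 ))
  echo "retention.ms=$ms,segment.ms=$ms"
}

# One hour of retention, as suggested for the RPC topics.
rpc_topic_config 1
```

The command prints retention.ms=3600000,segment.ms=3600000. The Kafka configuration tool accepts a comma-separated list like this one in a single --add-config argument.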

Be careful when setting the number of partitions per topic if you're planning to have a massive number of Minion locations or share the cluster across multiple OpenNMS instances with a high number of locations. This is why having single-topic enabled in OpenNMS and Minion is the best approach (the default in Horizon 28).

Each leader partition (and each replica the broker maintains) will have a directory in the data directory, and Kafka will maintain a file descriptor per segment. Each segment contains two files, the index and the data itself. For more information, check this blog post.
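To get a feel for how partitions translate into open files, here is a rough estimate based on the description above (two files per segment). It assumes partitions are spread evenly across brokers and that every topic has the same number of partitions and segments, which is a simplification. The function name estimate_open_files is hypothetical.

```shell
#!/bin/sh
# Rough estimate of the files one broker keeps open for topic data:
# partition replicas hosted per broker x segments per partition x 2 files.
# Arguments: topics, partitions per topic, replication factor, brokers,
# segments per partition.
estimate_open_files() {
  topics=$1 partitions=$2 rf=$3 brokers=$4 segments=$5
  replicas_per_broker=$(( topics * partitions * rf / brokers ))
  echo $(( replicas_per_broker * segments * 2 ))
}

# 10 topics, 9 partitions, RF of 2, 3 brokers, 4 segments per partition.
estimate_open_files 10 9 2 3 4
```

The example prints 480, which is comfortably below the nofile limit of 100000 configured in the template, but the number grows quickly with the number of topics and locations.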

It is recommended to have a dedicated file system for the data directory formatted using XFS with noatime and nodiratime in production.

Create an Azure VM for OpenNMS

Create a cloud-init script with the following content to deploy PostgreSQL, the latest OpenNMS Horizon, and CMAK in Ubuntu LTS and store it at /tmp/opennms-template.yaml:

#cloud-config
package_upgrade: true
timezone: $TIMEZONE

write_files:
- owner: root:root
  path: /etc/opennms-overlay/featuresBoot.d/features.boot
  content: |
    opennms-kafka-producer

# OpenNMS RRD Settings
- owner: root:root
  path: /etc/opennms-overlay/
  content: |
    org.opennms.rrd.storeByGroup=true
    org.opennms.rrd.storeByForeignSource=true
    org.opennms.rrd.strategyClass=org.opennms.netmgt.rrd.rrdtool.MultithreadedJniRrdStrategy
    org.opennms.rrd.interfaceJar=/usr/share/java/jrrd2.jar
    opennms.library.jrrd2=/usr/lib/jni/

# OpenNMS Sink and RPC API
- owner: root:root
  path: /etc/opennms-overlay/
  content: |
    # Disable internal ActiveMQ
    # Sink
    org.opennms.core.ipc.sink.strategy=kafka
    org.opennms.core.ipc.sink.kafka.bootstrap.servers=$PREFIX-kafka-1:9092,$PREFIX-kafka-2:9092
    org.opennms.core.ipc.sink.kafka.acks=1
    # RPC
    org.opennms.core.ipc.rpc.strategy=kafka
    org.opennms.core.ipc.rpc.kafka.bootstrap.servers=$PREFIX-kafka-1:9092,$PREFIX-kafka-2:9092
    org.opennms.core.ipc.rpc.kafka.ttl=30000
    org.opennms.core.ipc.rpc.kafka.single-topic=true

# OpenNMS Kafka Producer Client
- owner: root:root
  path: /etc/opennms-overlay/org.opennms.features.kafka.producer.client.cfg
  content: |
    bootstrap.servers=$PREFIX-kafka-1:9092,$PREFIX-kafka-2:9092
    compression.type=zstd
    max.request.size=5000000

# OpenNMS Kafka Producer Settings
- owner: root:root
  path: /etc/opennms-overlay/org.opennms.features.kafka.producer.cfg
  content: |
    topologyProtocols=bridge,cdp,isis,lldp,ospf
    suppressIncrementalAlarms=true
    forward.metrics=true
    nodeRefreshTimeoutMs=300000
    alarmSyncIntervalMs=300000
    kafkaSendQueueCapacity=1000
    nodeTopic=OpenNMS_nodes
    alarmTopic=OpenNMS_alarms
    eventTopic=OpenNMS_events
    metricTopic=OpenNMS_metrics
    alarmFeedbackTopic=OpenNMS_alarms_feedback
    topologyVertexTopic=OpenNMS_topology_vertices
    topologyEdgeTopic=OpenNMS_edges

- owner: root:root
  permissions: '0400'
  path: /etc/snmp/snmpd.conf
  content: |
    rocommunity public default
    syslocation Azure - $LOCATION
    syscontact $USER
    dontLogTCPWrappersConnects yes
    disk /

apt:
  preserve_sources_list: true
  sources:
    opennms:
      source: deb stable main
    docker:
      source: deb bionic stable

packages:
- snmp
- snmpd
- jq
- jrrd2
- opennms
- opennms-webapp-hawtio
- opennms-helm
- docker-ce
- docker-ce-cli
-

bootcmd:
- curl -s | apt-key add -
- curl -fsSL | apt-key add -

runcmd:
# Configure PostgreSQL
- systemctl --now enable postgresql
- sudo -u postgres createuser opennms
- sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'postgres';"
- sudo -u postgres psql -c "ALTER USER opennms WITH PASSWORD 'opennms';"
- sed -r -i 's/password=""/password="postgres"/' /etc/opennms/opennms-datasources.xml
# Configure OpenNMS
- sed -r -i '/enabled="false"/{$!{N;s/ enabled="false"[>]\n(.*OpenNMS:Name=Syslogd.*)/>\n\1/}}' /etc/opennms/service-configuration.xml
- echo "JAVA_HEAP_SIZE=$ONMS_HEAP_SIZE" > /etc/opennms/opennms.conf
- rsync -avr /etc/opennms-overlay/ /etc/opennms/
- /usr/share/opennms/bin/runjava -s
- /usr/share/opennms/bin/fix-permissions
- /usr/share/opennms/bin/install -dis
- systemctl --now enable opennms
# Start CMAK using Docker
- usermod -aG docker ubuntu
- docker run --name cmak -d -e ZK_HOSTS="$PREFIX-kafka-1:2181" -e APPLICATION_SECRET="opennms" -p 9000:9000 hlebalbau/kafka-manager:stable
# Upgrade Grafana
- sudo apt-get install -y adduser libfontconfig1
- wget
- sudo dpkg -i grafana_7.5.11_amd64.deb

We don't need to specify the whole list of Kafka brokers as part of the bootstrap.servers entry. The whole topology will be discovered through the first one that responds, and the client will use what's configured as the advertised listener to talk to each broker. I added two in case the first one is unavailable (as a backup).

If you're using an older version of Kafka, make sure to set the appropriate version when adding your cluster to CMAK.

The above installs the latest OpenJDK 11, the latest PostgreSQL, and the latest OpenNMS Horizon on the VM. It also installs Kafka Manager (CMAK) via Docker. I added the most basic configuration for PostgreSQL to work with authentication. Kafka will be enabled for Sink/RPC as well as for the Kafka Producer. As mentioned, Azure VMs can reach each other through hostnames.

Create an Ubuntu VM for OpenNMS:

envsubst < /tmp/opennms-template.yaml > /tmp/opennms.yaml
az vm create --resource-group $RG_NAME --name $ONMS_VM_NAME \
  --size $ONMS_VM_SIZE \
  --image canonical:0001-com-ubuntu-server-focal:20_04-lts:latest \
  --admin-username $USER \
  --ssh-key-values ~/.ssh/ \
  --vnet-name $VNET_NAME \
  --subnet $VNET_SUBNET_NAME \
  --public-ip-address-dns-name $ONMS_VM_NAME \
  --public-ip-sku Standard \
  --custom-data /tmp/opennms.yaml \
  --tags Owner=$USER \
  --output table
# Each rule within the same NSG needs a unique priority
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
  --port 8980 --priority 200 --output table
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
  --port 3000 --priority 250 --output table
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
  --port 9000 --priority 300 --output table

Note that I'm assuming the usage of SSH Keys for password-less access. Make sure to have a public key located at ~/.ssh/, or update the az vm create command.

Keep in mind that the cloud-init process starts once the VM is running, meaning we should wait about 5 minutes after the az vm create is finished to see OpenNMS up and running.
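Instead of waiting a fixed amount of time, a small polling helper can tell when the web UI starts answering. This is a sketch only: the function name retry is hypothetical, and the login.jsp path in the commented example is an assumption about the OpenNMS web application.

```shell
#!/bin/sh
# Run a command until it succeeds, up to a number of attempts,
# sleeping a fixed number of seconds between attempts.
# Arguments: attempts, delay in seconds, then the command to run.
retry() {
  attempts=$1 delay=$2
  shift 2
  n=1
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example (not executed here): poll every 10 seconds for up to 10 minutes.
# retry 60 10 curl -sf -o /dev/null "http://$ONMS_VM_NAME.$DOMAIN:8980/opennms/login.jsp"
```

The helper returns 0 as soon as the command succeeds and 1 when all attempts are exhausted, so it can gate the next steps of a script.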

In case there is a problem, SSH into the VM using the public IP and the provided credentials and check /var/log/cloud-init-output.log to verify the progress and the status of the cloud-init execution.

Monitor the infrastructure

Wait until OpenNMS is up and running and then execute the following, to start monitoring all the ZK/Kafka servers, and the OpenNMS server via SNMP and JMX.

ONMS_FQDN="$ONMS_VM_NAME.$DOMAIN"
cat <<EOF >/tmp/OpenNMS.xml
<?xml version="1.0"?>
<model-import date-stamp="$(date +"%Y-%m-%dT%T.000Z")" foreign-source="OpenNMS">
EOF
for vm in $(az vm list -g $RG_NAME --query "[?contains(name,'$PREFIX-')].name" -o tsv); do
  ipaddr=$(az vm show -g $RG_NAME -n $vm -d --query privateIps -o tsv)
  cat <<EOF >>/tmp/OpenNMS.xml
  <node foreign-id="$vm" node-label="$vm">
EOF
  if [[ "$vm" == *"kafka"* ]]; then
    cat <<EOF >>/tmp/OpenNMS.xml
    <interface ip-addr="$ipaddr" status="1" snmp-primary="P">
      <monitored-service service-name="JMX-Kafka"/>
    </interface>
  </node>
EOF
  fi
  if [[ "$vm" == *"onms"* ]]; then
    cat <<EOF >>/tmp/OpenNMS.xml
    <interface ip-addr="$ipaddr" status="1" snmp-primary="P"/>
    <interface ip-addr="" status="1" snmp-primary="N">
      <monitored-service service-name="OpenNMS-JVM"/>
    </interface>
  </node>
EOF
  fi
done
cat <<EOF >>/tmp/OpenNMS.xml
</model-import>
EOF
curl -v -u admin:admin \
  -H 'Content-Type: application/xml' -d @/tmp/OpenNMS.xml \
  http://$ONMS_FQDN:8980/opennms/rest/requisitions
curl -v -u admin:admin -X PUT \
  http://$ONMS_FQDN:8980/opennms/rest/requisitions/OpenNMS/import

Create Minion VMs using multipass

After verifying that OpenNMS is up and running, we can proceed to create the Minions.

Create a cloud-init script to deploy Minion in Ubuntu and save it at /tmp/minion-template.yaml:

#cloud-config
package_upgrade: true
timezone: $TIMEZONE

write_files:
- owner: root:root
  path: /etc/minion-overlay/org.opennms.minion.controller.cfg
  content: |
    location=$MINION_LOCATION
    id=$MINION_ID
    http-url=http://$ONMS_VM_NAME.$DOMAIN:8980/opennms

- owner: root:root
  path: /etc/minion-overlay/featuresBoot.d/kafka.boot
  content: |
    !minion-jms
    !opennms-core-ipc-sink-camel
    !opennms-core-ipc-rpc-jms
    opennms-core-ipc-sink-kafka
    opennms-core-ipc-rpc-kafka

- owner: root:root
  path: /etc/minion-overlay/org.opennms.core.ipc.sink.kafka.cfg
  content: |
    bootstrap.servers=$PREFIX-kafka-1.$DOMAIN:9094,$PREFIX-kafka-2.$DOMAIN:9094

- owner: root:root
  path: /etc/minion-overlay/org.opennms.core.ipc.rpc.kafka.cfg
  content: |
    bootstrap.servers=$PREFIX-kafka-1.$DOMAIN:9094,$PREFIX-kafka-2.$DOMAIN:9094
    single-topic=true

apt:
  preserve_sources_list: true
  sources:
    opennms:
      source: deb stable main

packages:
- opennms-minion

bootcmd:
- curl -s | apt-key add -

runcmd:
- rsync -avr /etc/minion-overlay/ /etc/minion/
- sed -i -r 's/# export JAVA_MIN_MEM=.*/export JAVA_MIN_MEM="$MINION_HEAP_SIZE"/' /etc/default/minion
- sed -i -r 's/# export JAVA_MAX_MEM=.*/export JAVA_MAX_MEM="$MINION_HEAP_SIZE"/' /etc/default/minion
- /usr/share/minion/bin/scvcli set opennms.http admin admin
- /usr/share/minion/bin/scvcli set admin admin
- systemctl --now enable minion

Note that I'm using the same content for bootstrap.servers as OpenNMS, making sure to use the Public FQDNs, as Minions won't be running in Azure.

Then, start the new Minion via multipass:

export MINION_ID=minion01
envsubst < /tmp/minion-template.yaml > /tmp/$MINION_ID.yaml
multipass launch -c 1 -m 2G -n $MINION_ID --cloud-init /tmp/$MINION_ID.yaml

Optionally, create a second Minion in the same location:

export MINION_ID=minion02
envsubst < /tmp/minion-template.yaml > /tmp/$MINION_ID.yaml
multipass launch -c 1 -m 2G -n $MINION_ID --cloud-init /tmp/$MINION_ID.yaml

In case there is a problem, access the VM (e.g., multipass shell minion01) and check /var/log/cloud-init-output.log to verify the progress and the status of the cloud-init execution.

Feel free to change the CPU and memory settings for your Minion, but make sure it is consistent with MINION_HEAP_SIZE. Make sure to validate communication using the health-check command from the Karaf Shell.
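To check that a heap size is consistent with the memory given to the VM, the following sketch converts both values to megabytes and compares them. The function names are hypothetical, and the 75% ceiling is an arbitrary assumption meant to leave room for the OS and non-heap memory, not an OpenNMS recommendation.

```shell
#!/bin/sh
# Convert a size such as 512M, 1G, or a plain number of megabytes to megabytes.
to_mb() {
  case "$1" in
    *G|*g) echo $(( ${1%[Gg]} * 1024 )) ;;
    *M|*m) echo "${1%[Mm]}" ;;
    *)     echo "$1" ;;
  esac
}

# Succeed when the heap (first argument) is at most 75% of the VM memory
# (second argument). The 75% ceiling is an assumption of this sketch.
heap_fits() {
  [ "$(to_mb "$1")" -le $(( $(to_mb "$2") * 75 / 100 )) ]
}

# Values used in this lab: MINION_HEAP_SIZE=1G and multipass launch -m 2G.
heap_fits 1G 2G && echo "heap fits"
```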

When there are multiple Minions per location, they will become part of a consumer group from Kafka's perspective for the RPC requests topic. The group ID will be the name of the location.


As you can see, the location name is Durham (a.k.a. $MINION_LOCATION), and you should see the Minions on that location registered in OpenNMS.

SSH into the OpenNMS server and create a requisition with a node in the same network as the Minion VMs, and make sure to associate it with the appropriate location. For instance,

/usr/share/opennms/bin/ requisition add Test
/usr/share/opennms/bin/ node add Test srv01 srv01
/usr/share/opennms/bin/ node set Test srv01 location Durham
/usr/share/opennms/bin/ interface add Test srv01
/usr/share/opennms/bin/ interface set Test srv01 snmp-primary P
/usr/share/opennms/bin/ requisition import Test

Make sure to use the IP of a working server in your network for the interface (reachable from the Minion VM, and preferably unreachable or nonexistent in Azure), and do not forget to use the same location as defined in $MINION_LOCATION.

Please keep in mind that the Minions are VMs on your machine. The IP I used belongs to one of my machines, which is why the Minions can reach it (and vice versa). To access an external machine on your network, make sure to define static routes on that machine so it can reach the Minions through your machine (assuming you're running Linux or macOS).

OpenNMS, which runs in Azure and has no direct access to that server, should be able to collect data and monitor that node through any of the Minions. In fact, you can stop one of them, and OpenNMS will continue monitoring it.

To test asynchronous messages, you can send SNMP traps or Syslog messages to one of the Minions. Alternatively, you could use udpgen for this purpose. Usually, you could put a Load Balancer in front of the Minions and use its IP when sending messages from the monitored devices.

The machine that will be running udpgen must be part of the OpenNMS inventory. Find the IP of the Minion using multipass list, then execute the following from the machine added as a node above (the examples assume the IP of the Minion is passed via -h):

To send SNMP Traps:

udpgen -h -x snmp -r 1 -p 1162

To send Syslog Messages:

udpgen -h -x syslog -r 1 -p 1514

The C++ version of udpgen only works on Linux. If you're on macOS, you can use the Go version of it. Unfortunately, Windows is not an option due to a lack of support for Syslog in Go.

Note that an event definition is required when using udpgen to send traps. Here is what you'd need for Eventd:

<events xmlns="">
  <event>
    <mask>
      <maskelement>
        <mename>id</mename>
        <mevalue>.</mevalue>
      </maskelement>
      <maskelement>
        <mename>generic</mename>
        <mevalue>6</mevalue>
      </maskelement>
      <maskelement>
        <mename>specific</mename>
        <mevalue>1</mevalue>
      </maskelement>
    </mask>
    <uei></uei>
    <event-label>udpgen test trap</event-label>
    <descr>Sample Event %parm[all]%</descr>
    <logmsg dest="logndisplay">Sample Event %parm[all]%</logmsg>
    <severity>Warning</severity>
  </event>
</events>

If you want to make the tests more interesting, add the following to the above definition:

<alarm-data reduction-key="%uei%:%dpname%:%nodeid%" alarm-type="3" auto-clean="false"/>

The Hawtio UI in OpenNMS can help visualize the relevant JMX metrics and understand what’s circulating between OpenNMS and the Minions.

For OpenNMS, Hawtio is available through :8980/hawtio if the package opennms-webapp-hawtio was installed (which is the case with the cloud-init template used).

For Minions, Hawtio is available through :8181/hawtio.


As mentioned, if time is not synchronized across all the instances, OpenNMS won't properly process the Heartbeat messages sent by the Minions via the Sink API. As a result, the Minion might not get registered, or the Minion-Heartbeat service might report outages.
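A quick way to rule this out is to compare the clocks directly. The sketch below only does the arithmetic on two epoch timestamps; `skew_seconds` and `skew_ok` are hypothetical helpers, and the tolerance you pass is your own choice, not a value documented by OpenNMS.

```shell
# Hypothetical helper: absolute difference between two epoch timestamps.
skew_seconds() {
  d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then
    d=$(( 0 - d ))
  fi
  echo "$d"
}

# Succeeds when the skew is within the tolerance.
# $1 = first timestamp, $2 = second timestamp, $3 = tolerance in seconds
skew_ok() {
  [ "$(skew_seconds "$1" "$2")" -le "$3" ]
}
```

For instance, `skew_ok "$(date +%s)" "$(multipass exec $MINION_ID1 -- date +%s)" 2 || echo "clock skew detected"` compares the local machine against one Minion. The same comparison can be made against the OpenNMS server over SSH.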

We can inspect the traffic on the topics to see if the Minion is sending (or receiving) traffic to Kafka. However, as the payload is encoded within a Protobuf message, using the console consumer might not be as useful as we'd expect. Still, it works for troubleshooting purposes. For instance, from one of the Kafka brokers, we can do:

/opt/kafka/bin/ \
  --bootstrap-server $(hostname):9092 \
  --topic OpenNMS.Sink.Heartbeat

And we'll see:

$bce7b13e-d575-40b9-989a-3b5c6e7432c2 ~<minion>

As we can see, the actual payload within the Protobuf message is indented XML.
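When all we want is a readable view of that XML, a crude filter can drop the bytes that precede it on each line. This is only a sketch and not a Protobuf decoder; `strip_envelope` is a hypothetical name, and it assumes the XML starts at the first `<` of a line.

```shell
# Hypothetical filter: drop everything that precedes the first '<' on each line.
# Leading indentation is removed as a side effect; lines without '<' pass through.
# LC_ALL=C keeps sed from complaining about non-UTF-8 bytes in the envelope.
strip_envelope() {
  LC_ALL=C sed 's/^[^<]*</</'
}
```

Piping the output of the console consumer command above through `strip_envelope` would turn the first line of the sample into `<minion>`.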

The following application can be used to properly inspect the content without worrying about the non-readable content due to the Protobuf format:

For RPC in particular, we can access the Karaf Shell from the OpenNMS instance and use the opennms:stress-rpc command to verify communication against the Minions on a given location or against a specific Minion, and as the command name implies, to perform stress tests.

Useful Kafka Commands

For recent versions of Kafka, the following can help to get details about topics, lags, consumer groups and so on.

To verify the topic partitions and replica settings:

topics=$(/opt/kafka/bin/ --list --bootstrap-server $(hostname):9092)
for topic in $topics; do
  /opt/kafka/bin/ \
    --bootstrap-server $(hostname):9092 \
    --describe --topic $topic
done

To verify the current topic-level settings:

topics=$(/opt/kafka/bin/ --list --bootstrap-server $(hostname):9092)
for topic in $topics; do
  /opt/kafka/bin/ \
    --bootstrap-server $(hostname):9092 \
    --describe --entity-type topics --entity-name $topic --all
done

To verify offsets, topics lag and consumer groups:

/opt/kafka/bin/ \
  --bootstrap-server $(hostname):9092 \
  --describe --all-groups --all-topics
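The output of the last command can get long, so a per-group total is sometimes easier to read. The following sketch sums the LAG column for each consumer group. It assumes the tabular layout printed by recent Kafka versions, with a header line that contains the GROUP and LAG column names; `lag_per_group` is a hypothetical helper.

```shell
# Hypothetical helper: total lag per consumer group, read from stdin.
# Column positions are looked up in each header line rather than hard-coded.
lag_per_group() {
  awk '
    /GROUP/ && /LAG/ {                 # header line: remember column positions
      for (i = 1; i <= NF; i++) {
        if ($i == "GROUP") g = i
        if ($i == "LAG")   l = i
      }
      next
    }
    # data line: only count numeric lag values (a "-" means unknown)
    g && l && NF >= l && $l ~ /^[0-9]+$/ { sum[$g] += $l }
    END { for (k in sum) print k, sum[k] }
  ' | sort
}
```

Piping the describe command above into `lag_per_group` prints one line per group, with the group name followed by its total lag. Groups whose lag is reported as `-` on every partition are not listed.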

When enabling security (either SASL or TLS), you need to pass those settings to the commands.

For instance, if you have SASL enabled, you should pass:

--command-config /opt/kafka/config/

Where the content of would be:

security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
required username="opennms" password="0p3nNM5";

For older versions of Kafka, the equivalent commands are the following:

To verify the topic partitions and replica settings:

topics=$(/opt/kafka/bin/ --list --zookeeper $(hostname):2181)
for topic in $topics; do
  /opt/kafka/bin/ \
    --zookeeper $(hostname):2181 \
    --describe --topic $topic
done

To verify the current topic-level settings:

topics=$(/opt/kafka/bin/ --list --zookeeper $(hostname):2181)
for topic in $topics; do
  /opt/kafka/bin/ \
    --zookeeper $(hostname):2181 \
    --describe --entity-type topics --entity-name $topic
done

To verify offsets, topics lag and consumer groups:

groups=$(/opt/kafka/bin/ --bootstrap-server $(hostname):9092 --list)
for group in $groups; do
  /opt/kafka/bin/ \
    --bootstrap-server $(hostname):9092 \
    --describe --all-topics --group $group
done

When passing the ZK host to --zookeeper, that has to be consistent with how zookeeper.connect was defined on each Kafka broker. If you used something like this zk1:2181,zk2:2181/kafka, you should then pass --zookeeper $(hostname):2181/kafka instead.
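To avoid getting the chroot wrong, the argument can be derived from the zookeeper.connect value itself. This is a minimal sketch: `zk_target` is a hypothetical helper, and it assumes the client port is 2181, as everywhere else in this guide.

```shell
# Hypothetical helper: build the --zookeeper argument for one node
# from the value of zookeeper.connect.
# $1 = zookeeper.connect value, $2 = host to query (for example "$(hostname)")
zk_target() {
  case "$1" in
    */*) chroot="/${1#*/}" ;;   # keep everything after the first slash
    *)   chroot="" ;;           # no chroot configured
  esac
  printf '%s:2181%s\n' "$2" "$chroot"
}
```

With the example above, `zk_target "zk1:2181,zk2:2181/kafka" "$(hostname)"` prints the local host name followed by `:2181/kafka`, which is the value to pass to --zookeeper.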

Sharing Kafka across multiple OpenNMS-Minion sets

In big environments, it is common to have multiple OpenNMS instances, each of them with its own fleet of Minions to monitor one of the multiple data centers or a section of it. In those scenarios, it is common to have a centralized Kafka cluster that can be shared across all of them (for more information, follow this link).

The above solution has to be modified so that each set of OpenNMS and Minions uses its own set of topics in Kafka, which avoids collisions.

The topics' prefix (which defaults to OpenNMS) can be controlled via a system-wide property called Instance ID (a.k.a. We must configure this property in both places. For OpenNMS, add it to a property file inside $OPENNMS_HOME/etc/; and for a Minion, add it to a property file inside $MINION_HOME/etc/
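Once several instances share the cluster, it helps to look at the topics of one instance at a time. The sketch below filters a list of topic names by prefix; `topics_for_instance` is a hypothetical helper, and `DC1` in the usage note is a made-up Instance ID.

```shell
# Hypothetical helper: keep only the topics that start with "<instance-id>."
# Reads one topic name per line from stdin.
# $1 = Instance ID (defaults to OpenNMS, the default prefix)
topics_for_instance() {
  awk -v p="${1:-OpenNMS}." 'index($0, p) == 1'
}
```

Piping the output of the `--list` command from the previous section into `topics_for_instance DC1` would show only topics such as `DC1.Sink.Heartbeat`. The trailing dot is part of the match on purpose, so an Instance ID like `DC1` does not also select the topics of `DC10`.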

Add a Load Balancer in front of the Minions (Optional)

In production, when having multiple Minions per location, it is a good practice to put a Load Balancer in front of them so that the devices can use a single destination for SNMP Traps, Syslog, and Flows.

The following creates a cloud-init template for Ubuntu to start a basic LB using nginx through multipass for SNMP Traps (with a listener on port 162) and Syslog Messages (with a listener on port 514). Save the template at /tmp/nginx-template.yaml:

#cloud-config
package_upgrade: true
packages:
  - nginx
write_files:
  - owner: root:root
    path: /etc/nginx/nginx.conf
    content: |
      user www-data;
      worker_processes auto;
      pid /run/;
      include /etc/nginx/modules-enabled/*.conf;
      events {
        worker_connections 768;
      }
      stream {
        upstream syslog_udp {
          server $MINION_IP1:1514;
          server $MINION_IP2:1514;
        }
        upstream trap_udp {
          server $MINION_IP1:1162;
          server $MINION_IP2:1162;
        }
        server {
          listen 514 udp;
          proxy_pass syslog_udp;
          proxy_responses 0;
        }
        server {
          listen 162 udp;
          proxy_pass trap_udp;
          proxy_responses 0;
        }
      }
runcmd:
  - systemctl restart nginx

Note the usage of environment variables within the YAML template. We will substitute them before creating the VM.

Then, update the template and create the LB:

export MINION_IP1=$(multipass info $MINION_ID1 | grep IPv4 | awk '{print $2}')
export MINION_IP2=$(multipass info $MINION_ID2 | grep IPv4 | awk '{print $2}')
envsubst < /tmp/nginx-template.yaml > /tmp/nginx.yaml
multipass launch -n nginx --cloud-init /tmp/nginx.yaml
echo "Load Balancer $(multipass info nginx | grep IPv4)"
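If one of the Minion variables is empty or was not exported, envsubst silently writes an upstream without an address, and nginx will refuse to start inside the VM. A small sanity check on the rendered file can catch that before launching. This is a sketch only, and `check_rendered` is a hypothetical helper.

```shell
# Hypothetical check for a rendered nginx cloud-init file.
# Fails when a placeholder was left untouched or an upstream lost its address.
# $1 = path to the rendered file
check_rendered() {
  if grep -Eq '\$MINION_IP[0-9]+' "$1"; then
    echo "unresolved placeholder in $1" >&2
    return 1
  fi
  if grep -Eq 'server[[:space:]]+:[0-9]+;' "$1"; then
    echo "empty upstream address in $1" >&2
    return 1
  fi
  return 0
}
```

It could be used as a guard, for example `check_rendered /tmp/nginx.yaml && multipass launch -n nginx --cloud-init /tmp/nginx.yaml`.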

Flows are outside the scope of this test as that requires more configuration on Minions and OpenNMS besides having an Elasticsearch cluster up and running with the required plugin in place.

Securing Zookeeper and Kafka

The above procedure uses Kafka and Zookeeper in plain text without authentication or encryption. That works for testing purposes or perhaps for private clusters, where access to the servers is restricted and audited.

This example, in particular, exposes Kafka to the Internet, which requires having at least authentication in place. The following explains how to enable authentication and then the steps to enable encryption.

For a more comprehensive guide, follow this tutorial from Confluent.


This section explains how to enable authentication using SASL with SCRAM-SHA-512 for Kafka and DIGEST for Zookeeper (as Zookeeper doesn't support SCRAM). Because this guide's intention is learning, I decided to add security as a separate or optional module. That's due to the extra complexity associated with this advanced topic.

Here are the high-level changes:

  • Create the SCRAM credentials for Kafka through one of the brokers. The credentials are stored in Zookeeper.
  • Update and the systemd service definition on each Kafka broker to enable and use SASL.
  • Update and the systemd service definition on each ZK instance to enable and use SASL.
  • Stop Kafka Cluster, restart Zookeeper cluster, start Kafka Cluster.
  • Update OpenNMS to use SASL for the Sink API, the RPC API, and the Kafka Producer and restart.
  • Update Minion to use SASL for the Sink API and the RPC API and restart.

Access one of the brokers and execute the following command:

ONMS_USER="opennms" # To be used by Kafka, OpenNMS and Minions
ONMS_PASSWD="0p3nNM5;" # To be used by Kafka, OpenNMS and Minions
/opt/kafka/bin/ --bootstrap-server $(hostname):9092 \
  --alter \
  --add-config "SCRAM-SHA-256=[password=$ONMS_PASSWD],SCRAM-SHA-512=[password=$ONMS_PASSWD]" \
  --entity-type users \
  --entity-name $ONMS_USER

On each Zookeeper instance, update to enable SASL:

cat <<EOF | sudo tee -a /opt/kafka/config/
authProvider.sasl=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl
EOF

On each Kafka broker instance, update to enable SASL/SCRAM:

sudo sed -i -r '/' /opt/kafka/config/
cat <<EOF | sudo tee -a /opt/kafka/config/
# Enable Security
,OUTSIDE:SASL_PLAINTEXT
sasl.enabled.mechanisms=SCRAM-SHA-256,SCRAM-SHA-512
EOF

Note that already exists in that file, which is why I removed it before adding the required changes.

In theory, there is no need to enable both SCRAM-SHA-256 and SCRAM-SHA-512. I did that for compatibility purposes, but I'll use SCRAM-SHA-512 for all subsequent configurations.

On each Zookeeper instance, create the JAAS configuration file with the credentials:

ZK_USER="zkonms"
ZK_PASSWD="zk0p3nNM5;"
cat <<EOF | sudo tee /opt/kafka/config/zookeeper_jaas.conf
Server {
  org.apache.zookeeper.server.auth.DigestLoginModule required
  user_$ZK_USER="$ZK_PASSWD";
};
EOF
sudo chown kafka:kafka /opt/kafka/config/zookeeper_jaas.conf
sudo chmod 0600 /opt/kafka/config/zookeeper_jaas.conf

On each Kafka broker, create the JAAS configuration file with the credentials:

ZK_USER="zkonms" # Must match zookeeper_jaas.conf
ZK_PASSWD="zk0p3nNM5;" # Must match zookeeper_jaas.conf
ONMS_USER="opennms" # Must match scram user
ONMS_PASSWD="0p3nNM5;" # Must match scram user
cat <<EOF | sudo tee /opt/kafka/config/kafka_jaas.conf
KafkaServer {
  required username="$ONMS_USER" password="$ONMS_PASSWD";
};
Client {
  org.apache.zookeeper.server.auth.DigestLoginModule required username="$ZK_USER" password="$ZK_PASSWD";
};
EOF
sudo chown kafka:kafka /opt/kafka/config/kafka_jaas.conf
sudo chmod 0600 /opt/kafka/config/kafka_jaas.conf

On each Zookeeper instance, update the systemd service definition to load the JAAS settings via KAFKA_OPTS:

OPTS='Environment=""'
sudo sed -i -r -e "/^ExecStart=.*/i $OPTS" /etc/systemd/system/zookeeper.service
sudo systemctl daemon-reload

On each Kafka broker, update the systemd service definition to load the JAAS settings via KAFKA_OPTS:

OPTS='Environment=""'
sudo sed -i -r -e "/^ExecStart=.*/i $OPTS" /etc/systemd/system/kafka.service
sudo systemctl daemon-reload
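Keep in mind that running the sed commands above a second time inserts the same line again. If you expect to repeat the procedure, an idempotent variant could look like the following sketch. `add_before_execstart` is a hypothetical helper, it has to run as a user that can write the unit file, and the line is passed through awk, so it should not contain backslashes.

```shell
# Hypothetical helper: insert a line before the first ExecStart= entry,
# unless that exact line is already present in the file.
# $1 = unit file, $2 = full line to insert (for example the value of $OPTS)
add_before_execstart() {
  if grep -Fxq -- "$2" "$1"; then
    return 0                      # already present, nothing to do
  fi
  tmp=$(mktemp) || return 1
  awk -v line="$2" '/^ExecStart=/ && !seen { print line; seen = 1 } { print }' \
    "$1" > "$tmp" && cat "$tmp" > "$1"   # cat keeps the owner and mode of the unit file
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

From a root shell, `add_before_execstart /etc/systemd/system/kafka.service "$OPTS"` followed by `systemctl daemon-reload` would have the same effect as the commands above on the first run and do nothing on later runs.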

Restart the cluster in the following order:

  • Stop Kafka on each server.
  • Restart Zookeeper on each server.
  • Start Kafka on each server.
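With three servers, it is easy to lose track of the order. The sketch below only prints the commands in the right sequence so they can be reviewed first. `restart_plan` is a hypothetical helper, and it assumes the servers are reachable over SSH under names of the form `<prefix>-kafka-<n>`, which you may need to adapt to the public FQDN of each VM.

```shell
# Hypothetical helper: print the restart sequence for a cluster of N servers.
# Each step is applied to every server before the next step begins.
# $1 = name prefix, $2 = number of servers
restart_plan() {
  for step in "stop kafka" "restart zookeeper" "start kafka"; do
    i=1
    while [ "$i" -le "$2" ]; do
      # -n keeps ssh from reading stdin when the plan is piped into a shell
      echo "ssh -n $1-kafka-$i sudo systemctl $step"
      i=$(( i + 1 ))
    done
  done
}
```

Running `restart_plan "$PREFIX" "$KAFKA_CLUSTER_SIZE"` prints nine lines for the three-node cluster. Once the output looks right, it can be piped to `sh` to execute it.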

At this point, you should pass the SASL credentials to all Kafka CLI Tools. For instance,

ONMS_USER="opennms" # Must match scram user
ONMS_PASSWD="0p3nNM5;" # Must match scram user
cat <<EOF | sudo tee -a /opt/kafka/config/
# Security
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF
/opt/kafka/bin/ --list \
  --bootstrap-server $(hostname):9092 \
  --command-config /opt/kafka/config/

Note how we pass the consumer settings. The above command should list all the topics in the cluster. If you can see the list, then SASL is working. Keep in mind that without --command-config the command should time out, as the tool cannot communicate with Kafka without the credentials.

On the OpenNMS instance, update /opt/opennms/etc/ and /opt/opennms/etc/org.opennms.features.kafka.producer.cfg to use SASL, and restart OpenNMS. For instance:

ONMS_USER="opennms" # Must match scram user
ONMS_PASSWD="0p3nNM5;" # Must match scram user
for module in sink rpc; do
  cat <<EOF | sudo tee -a /etc/opennms/
# Security for $module
org.opennms.core.ipc.$
org.opennms.core.ipc.$module.kafka.sasl.mechanism=SCRAM-SHA-512
org.opennms.core.ipc.$ required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF
done
cat <<EOF | sudo tee -a /etc/opennms/org.opennms.features.kafka.producer.client.cfg
# Security
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF
sudo systemctl restart opennms

On each Minion, update /etc/minion/org.opennms.core.ipc.sink.kafka.cfg and /etc/minion/org.opennms.core.ipc.rpc.kafka.cfg to use SASL, and restart Minion. For instance:

ONMS_USER="opennms" # Must match scram user
ONMS_PASSWD="0p3nNM5;" # Must match scram user
for module in sink rpc; do
  cat <<EOF | sudo tee -a /etc/minion/org.opennms.core.ipc.$module.kafka.cfg
# Security
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF
done
sudo systemctl restart minion

The solution works from the OpenNMS and Minion perspective, even though the following message appears repeatedly in /opt/kafka/logs/server.log on all brokers:

[2021-04-11 12:35:56,486] INFO [SocketServer brokerId=2] Failed authentication with / (Unexpected Kafka request of type METADATA during SASL handshake.) (

Where is the IP of the OpenNMS server.

At this point, we have SASL authentication enabled using SCRAM-SHA-512 for Kafka and DIGEST for Zookeeper, meaning credentials should be hard (but perhaps not impossible) to crack when intercepting traffic. However, to make it more secure, encryption is recommended.

If you already configured CMAK, make sure to enable the SASL/SCRAM mechanism for your cluster.


Please keep in mind that enabling SSL/TLS will increase CPU demand on each broker and the clients, which is why using OpenJDK 11 over JDK 8 is encouraged.

To enable TLS, and because each Kafka Broker must be exposed and reachable through a public DNS entry, I'm going to use LetsEncrypt to generate the certificates. That will save a few steps because the certificates will be publicly valid, so we won't need to set up a Trust Store.

When using private CAs or self-signed certificates, a Trust Store is mandatory, and every entity that touches Kafka directly or indirectly must be configured to use it.

The Certbot utility used to create and validate the certificate will start a temporary web server on the instance (for the validation process). For this reason, we should temporarily allow access through port TCP 80:

# seq is used because bash does not expand variables inside {1..N}
for i in $(seq 1 $KAFKA_CLUSTER_SIZE); do
  VM_NAME="$PREFIX-kafka-$i"
  az vm open-port -g $RG_NAME -n $VM_NAME \
    --port 80 --priority 101 --output table
done

Then, on each Kafka Broker (one by one), we must do the following to enable TLS:

FQDN="$(hostname)"
EMAIL=""
PASSWD="0p3nNM5"
sudo snap install --classic certbot
sudo ln -s /snap/bin/certbot /usr/bin/certbot
sudo certbot certonly --standalone -d $FQDN -m $EMAIL \
  --non-interactive --agree-tos
TEMP_P12="/tmp/ssl.p12.$(date +%s)"
TEMP_KEYSTORE="/tmp/ssl.keystore.$(date +%s)"
TARGET_KEYSTORE="/opt/kafka/config/letsencrypt.jks"
sudo openssl pkcs12 -export \
  -in /etc/letsencrypt/live/$FQDN/fullchain.pem \
  -inkey /etc/letsencrypt/live/$FQDN/privkey.pem \
  -out $TEMP_P12 -name kafka -password "pass:$PASSWD"
sudo keytool -importkeystore -alias kafka \
  -deststorepass "$PASSWD" -destkeypass "$PASSWD" -destkeystore $TEMP_KEYSTORE \
  -srckeystore $TEMP_P12 -srcstoretype PKCS12 -srcstorepass "$PASSWD"
sudo cp $TEMP_KEYSTORE $TARGET_KEYSTORE
sudo chmod 440 $TARGET_KEYSTORE
sudo chown kafka:kafka $TARGET_KEYSTORE
sudo rm -f $TEMP_P12 $TEMP_KEYSTORE
CONFIG="/opt/kafka/config/"
sudo sed -i -r '/' $CONFIG
cat <<EOF | sudo tee -a $CONFIG
,OUTSIDE:SASL_SSL
ssl.keystore.location=$TARGET_KEYSTORE
ssl.keystore.password=$PASSWD
ssl.key.password=$PASSWD
EOF
sudo systemctl restart kafka

Please use your own email, and keep in mind that the Azure location is hardcoded in the command; if you're using a different one, update the FQDN.
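One way to avoid hardcoding the region is to build the name from its parts. The sketch below assumes that the DNS label of the VM equals its host name and that the public name follows Azure's `<label>.<region>.cloudapp.azure.com` pattern. `azure_fqdn` is a hypothetical helper, not something the lab scripts define.

```shell
# Hypothetical helper: public DNS name of an Azure VM that has a DNS label.
# $1 = DNS label (assumed to be the VM host name), $2 = Azure region
azure_fqdn() {
  printf '%s.%s.cloudapp.azure.com\n' "$1" "$2"
}
```

For a broker in the region used throughout this guide, `FQDN="$(azure_fqdn "$(hostname)" eastus)"` would produce the name to pass to certbot.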

Note that SSL was only enabled for the OUTSIDE listener, meaning we should only modify the Minions (and was changed because of that), as OpenNMS won't use it because it lives in the same protected network as the Kafka cluster.

To verify, you can retrieve the broker configuration via Zookeeper:

/opt/kafka/bin/ $(hostname) get /brokers/ids/1 | egrep '^\{' | jq

If everything went well, you should get something like this:

{
  "features": {},
  "listener_security_protocol_map": {
    "INSIDE": "SASL_PLAINTEXT",
    "OUTSIDE": "SASL_SSL"
  },
  "endpoints": [
    "INSIDE://agalue-kafka-1:9092",
    "OUTSIDE://"
  ],
  "jmx_port": 9999,
  "port": -1,
  "host": null,
  "version": 5,
  "timestamp": "1622658498210"
}

Note that SASL_SSL applies to OUTSIDE. Now it is time to update the Minions.

On each Minion, do the following:

for module in sink rpc; do
  cfg="/etc/minion/org.opennms.core.ipc.$module.kafka.cfg"
  sudo sed -i -r '/security.protocol/s/SASL_PLAINTEXT/SASL_SSL/' $cfg
done
sudo systemctl restart minion

While you're there, you can check if TLS is actually enabled by running:

openssl s_client -connect

There is no need to modify anything else as we're using valid certificates signed by a well-known public entity. When using private certificates or private CAs, you would have to create a Trust Store via keytool for the clients and the brokers.

As a challenge to the reader, update /tmp/kafka-template.yaml, /tmp/opennms-template.yaml, and /tmp/minion-template.yaml to include all the SASL and SSL/TLS configuration, and start the whole environment from scratch with authentication and encryption enabled.

Securing OpenNMS

The following is inspired by this guide to enable TLS with Nginx for the OpenNMS WebUI and Grafana. However, as we're using Ubuntu here, I'll describe the required changes.

Allow access via TCP 80 and 443:

az vm open-port -g $RG_NAME -n $ONMS_VM_NAME --port 443 --priority 110 -o table
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME --port 80 --priority 120 -o table

SSH into the OpenNMS server and then:

export EMAIL=""
export LOCATION=$(curl -H Metadata:true --noproxy "*" "" 2>/dev/null | jq -r '.compute.location')
export FQDN=$(hostname).$
sudo apt install -y nginx
sudo mkdir -p /var/www/$FQDN/.well-known
# On Ubuntu, nginx runs as www-data (there is no nginx user)
sudo chown www-data:www-data /var/www/$FQDN
cfg="/etc/nginx/sites-available/default"
cat <<EOF | sudo tee $cfg
server {
  listen 80;
  server_name $FQDN;
  # maintain the .well-known directory alias for lets encrypt renewals
  location /.well-known {
    alias /var/www/$FQDN/.well-known;
  }
  location /hawtio/ {
    proxy_pass http://localhost:8980/hawtio/;
  }
  location /grafana/ {
    proxy_pass http://localhost:3000/;
  }
  location /opennms/ {
    proxy_set_header Host \$host;
    proxy_set_header X-Real-IP \$remote_addr;
    proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto \$scheme;
    proxy_set_header Upgrade \$http_upgrade;
    proxy_set_header Connection "Upgrade";
    proxy_pass http://localhost:8980/opennms/;
    proxy_redirect default;
    proxy_read_timeout 90;
  }
}
EOF
sudo systemctl restart nginx
sudo systemctl enable nginx
sudo snap install core
sudo snap refresh core
sudo snap install --classic certbot
sudo ln -s /snap/bin/certbot /usr/bin/certbot
sudo certbot --nginx -d $FQDN --non-interactive --agree-tos -m $EMAIL
cat <<EOF | sudo tee /etc/opennms/
=
opennms.web.base-url = https://%x%c/
EOF
sudo systemctl restart opennms
sudo sed -i -r "s|^;domain =.*|domain = $FQDN|" /etc/grafana/grafana.ini
sudo sed -i -r "s|^;root_url =.*|root_url = %(protocol)s://%(domain)s:%(http_port)s/grafana/|" /etc/grafana/grafana.ini
sudo systemctl restart grafana-server

Make sure to set $EMAIL to a valid address, as that's required by LetsEncrypt (as we did for Kafka).

Note that cmak (or Kafka Manager) is not present due to the complexity of having it working behind a proxy.

You can remove the NSG rules for ports 8980 and 3000.

az network nsg rule delete -g $RG_NAME \
  --nsg-name ${ONMS_VM_NAME}NSG -n open-port-8980
az network nsg rule delete -g $RG_NAME \
  --nsg-name ${ONMS_VM_NAME}NSG -n open-port-3000


Work in progress

Some circumstances could introduce unexpected behavior to the solution. Besides the traditional monitoring to ensure that all the components are behaving as expected in CPU, Memory, Java Heap Memory, Java GC, and IO (covered as part of this tutorial), you sometimes need to dig deeper to understand what's happening.

OpenNMS added OpenTracing support via Jaeger to understand how much time messages sent via the broker are taking to be produced and consumed.

The official documentation has a guide about how to configure it.

As we have Docker running in the OpenNMS server, we can start an All-In-One Jaeger Instance through it very easily. To do that, SSH into the OpenNMS server and run the following:

docker run -d --name jaeger \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 16686:16686 \
  jaegertracing/all-in-one:1.24

OpenNMS has direct access to Jaeger through localhost, as both run on the same machine, and should be configured as instructed in the official docs.

For the Minions, you would need to open the UDP ports 6831 and 6832 in the NSG associated with the OpenNMS server, as well as TCP 16686 to access the Jaeger WebUI:

az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
  --port 6831-6832,16686 --priority 400 --output table

Then, configure the minion as instructed in the official docs, using the OpenNMS FQDN and the port mentioned above.

Clean Up

When we're done, make sure to delete the cloud resources.

If you created the resource group for this exercise, you could remove all the resources with the following command:

az group delete -g $RG_NAME

If you're using an existing resource group that you cannot remove, make sure to remove only the resources created in this tutorial. All of them should be easily identified as they will contain the username and the VM name as part of the resource name. The easiest way is to use the Azure Portal for this operation. Alternatively, run the following:

IDS=($(az resource list \
  --resource-group $RG_NAME \
  --query "[?contains(name,'$PREFIX-') && type!='Microsoft.Compute/disks']".id \
  --output tsv | tr '\n' ' '))
for id in "${IDS[@]}"; do
  echo "Removing $id"
  az resource delete --ids "$id" --verbose
done
DISKS=($(az resource list \
  --resource-group $RG_NAME \
  --query "[?contains(name,'$PREFIX-') && type=='Microsoft.Compute/disks']".id \
  --output tsv | tr '\n' ' '))
for id in "${DISKS[@]}"; do
  echo "Removing $id"
  az resource delete --ids "$id" --verbose
done

The reason for deleting in two passes is that the resource list includes the disks, which cannot be removed before the VMs that use them. For this reason, the first pass excludes the disks, and the second pass removes them.
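The same ordering can be obtained from a single listing by moving the disks to the end. The following is a sketch of that idea only: `disks_last` is a hypothetical helper that expects one resource per line as `type<TAB>id`, and it encodes nothing beyond the disks-last rule described above.

```shell
# Hypothetical helper: reorder "type<TAB>id" lines so that disks come last.
# Prints only the ids, in an order in which they can be deleted.
disks_last() {
  awk -F '\t' '
    $1 == "Microsoft.Compute/disks" { disks[++n] = $2; next }   # hold disks back
    { print $2 }                                                # everything else first
    END { for (i = 1; i <= n; i++) print disks[i] }
  '
}
```

The input could come from `az resource list -g $RG_NAME --query "[?contains(name,'$PREFIX-')].[type,id]" -o tsv`, and each printed id can then be passed to `az resource delete --ids`.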

Note that because all the resource names are prefixed with the chosen username, we can use it to identify them and remove them uniquely.

Then clean the local resources:

multipass delete $MINION_ID1 $MINION_ID2
multipass purge

Remember to remove the nginx instance if you decided to use it.