This lab starts an OpenNMS instance and a 3-node ZK/Kafka cluster in the cloud (on Azure), plus two Minions on your machine (via Multipass), using Kafka for communication, for learning purposes.
The lab doesn't cover security by default (user authentication and encryption), which is crucial if we ever want to expose the Kafka cluster to the Internet. A separate section covers the required changes for this.
Keep in mind that nothing prevents us from skipping the cloud provider and doing everything with Multipass (or VirtualBox, or Hyper-V, or VMware). The reason for using a cloud provider is to prove that OpenNMS can monitor unreachable devices via Minion. Similarly, we could use any other cloud provider instead of Azure; however, I won't explain how to port the solution here.
Time synchronization across all the instances involved in this solution is mandatory. Failing to ensure this could lead to undesired side effects. This is essentially guaranteed when using a cloud provider, which is why I do not include explicit instructions for it, but please be aware of it.
The scripts used throughout this tutorial rely on envsubst; make sure to have it installed.
Make sure to log into Azure using az login prior to creating the VMs.
If you have a restricted account in Azure, make sure you have the Network Contributor
role and the Virtual Machine Contributor
role associated with your Azure AD account for the resource group where you want to create the VM. Of course, either Owner
or Contributor
at the resource group level are welcome.
All the following assume you have a macOS or Linux machine or VM from which you can issue all the commands.
We haven't tested 3.0.0
, so please use 2.8.x
or older for now.
Feel free to change the content and keep in mind that $PREFIX
is what we will use throughout this tutorial to identify all the resources we will create in Azure uniquely.
Do not confuse the Azure Location (or Region) with the Minion Location; they are unrelated concepts.
We're going to leverage the Azure DNS services to avoid the need to remember and use Public IP addresses, which helps if we're interested in having HTTPS with valid certificates (as explained here), not only for OpenNMS but also to enable SSL/TLS in Kafka.
In Azure, the default public DNS entries follow the same pattern:
To make the VMs' FQDNs unique, we're going to add the username to the VM name. For instance, the OpenNMS FQDN would be:
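As a rough illustration (the exact DNS label depends on the chosen $PREFIX, your username, and the Azure region, so treat the names below as hypothetical):

```
<vm-name>-<username>.<azure-region>.cloudapp.azure.com
# e.g., an OpenNMS VM created with PREFIX=onms by user agalue in eastus could end up as:
onms-agalue.eastus.cloudapp.azure.com
```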
The above is what we can use to access the VM via SSH and to configure Minions.
This is a necessary step, as every resource in Azure must belong to a resource group and a location.
However, you can omit the following command and use an existing one if you prefer. In that case, make sure to adjust the environment variable RG_NAME
so the subsequent commands will target the correct group.
I prefer to create the VNET myself instead of letting Azure do it for me, especially when we want to guarantee that all the VMs will exist in the same one.
The following cloud-init template assumes a 3 node cluster, where each VM would have Zookeeper and Kafka configured and running in Ubuntu LTS.
For simplicity, Zookeeper and Kafka will be running on each machine. In production, each cluster should have its own instances, as a Zookeeper cluster should not grow the same way a Kafka cluster would, for multiple reasons: a ZK ensemble should always have an odd number of members (which is not the case for Kafka), and traffic across ZK members grows exponentially with the number of instances (a ZK cluster of 5 members can manage multiple dozens of Kafka brokers, with 7 it can manage hundreds, and with 9 it can manage thousands).
For the 3-node cluster, each VM will be named as follows:
Note the hostnames include the chosen username to make them unique, which is mandatory for shared resource groups and the default Azure DNS public domain on the chosen region.
Remember that each VM in Azure is reachable within the same VNet from any other VM through its hostname.
From all the environment variables you'll encounter in the upcoming template, there are two crucial ones:
For server.properties
, we must replace the environment variable PUBLIC_FQDN
in the advertised.listeners
with the public FQDN or IP of the VM when configuring the application before running it for the first time. With that in mind, there will be two listeners, one to be used within the VNet (which is what OpenNMS would use, on port 9092), and another associated with the Public FQDN (on port 9094), to be used by external Minions (outside Azure).
Similarly, we must replace INSTANCE_ID
with a unique numeric value per instance for the broker.id
in server.properties
for Kafka and the myid
file for Zookeeper, which are the mandatory requirements to identify each instance in their respective cluster.
The number of topic partitions must be greater than the number of Minions in a given location and greater than the number of brokers in the cluster.
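To make this concrete, here is a minimal sketch of the relevant entries, assuming the INSIDE/OUTSIDE listener names used throughout this lab; the values are placeholders and the full cloud-init template below remains authoritative:

```
# server.properties (sketch): INSTANCE_ID and PUBLIC_FQDN are substituted per VM
broker.id=INSTANCE_ID
listeners=INSIDE://:9092,OUTSIDE://:9094
advertised.listeners=INSIDE://:9092,OUTSIDE://PUBLIC_FQDN:9094
listener.security.protocol.map=INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
inter.broker.listener.name=INSIDE
num.partitions=9
# Zookeeper uses the same INSTANCE_ID, written to the myid file inside its data directory.
```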
Create a YAML file called /tmp/kafka-template.yaml
with the following content:
The reason for increasing the message size (message.max.bytes
, replica.fetch.max.bytes
) is to avoid problems when forwarding collected metrics to Kafka via the Kafka Producer feature of OpenNMS, which I'm planning to enable.
If, for instance, you want to use an older version of Kafka, you can tune the JDK package and the Kafka URL so the template applies the correct one, for instance:
Also, edit the template and remove ;2181
from the server
entries in zookeeper.properties
as expressing the client port that way requires Zookeeper 3.5 or newer.
Note that I'm assuming the usage of SSH Keys for password-less access. Make sure to have a public key located at ~/.ssh/id_rsa.pub
, or update the az vm create
command.
The above will start all the VMs simultaneously using public IP addresses and FQDNs, to avoid access problems with external Minions and reconfiguration issues with the Kafka advertised listeners. However, like the public IPs, the private IPs will be dynamic. Fortunately, this is not going to be a problem as we're going to use DNS to access Kafka.
Keep in mind that the cloud-init
process starts once the VM is running, meaning we should wait a few minutes after the VMs are created before they are ready to use.
Then, allow access for remote Minions:
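As a hedged sketch (the NSG names below are assumptions based on the VM names; adjust them to whatever az vm create generated for you), opening TCP 9094 on each broker could look like this:

```
for i in 1 2 3; do
  az network nsg rule create -g "$RG_NAME" \
    --nsg-name "$PREFIX-kafka${i}NSG" \
    -n "Allow-Kafka-Outside" --priority 1010 \
    --protocol Tcp --destination-port-ranges 9094 --access Allow
done
```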
You can inspect the generated YAML files to see the final content used on each VM (after applying the env-var substitutions).
In case there is a problem, SSH into the VM using the public IP and the provided credentials and check /var/log/cloud-init-output.log
to verify the progress and the status of the cloud-init execution.
To make sure the Zookeeper cluster started, we can use the "4 letter words" commands via the embedded web server, available when using version 3.5 or newer. For instance:
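For example, from one of the ZK instances (assuming the AdminServer listens on its default port, 8080):

```
curl -s http://localhost:8080/commands/srvr
```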
The above gives us general information, including the server_state
, which can be leader
or follower
.
To get statistics:
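For instance, either of these endpoints works on ZK 3.5+:

```
curl -s http://localhost:8080/commands/stat
curl -s http://localhost:8080/commands/mntr
```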
For Zookeeper version 3.4 or older (for instance, when using older versions of Kafka), you can still use the deprecated way to verify:
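For example, using the classic four-letter words directly over the client port:

```
echo srvr | nc localhost 2181
echo stat | nc localhost 2181
```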
From Kafka's perspective, we can verify how each broker has registered via Zookeeper or follow this guide to create a topic and use the console producer and consumer to validate its functionality.
List Broker IDs:
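A sketch, assuming Kafka is installed under /opt/kafka (as the rest of this guide suggests):

```
/opt/kafka/bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids
```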
We should get:
If that's not the case, SSH into the broker that is not listed and make sure Kafka is running. It is possible that Kafka failed to register with Zookeeper and did not start, due to how the VMs are initialized: Zookeeper (the whole cluster) should start first, then Kafka, but as we're not guaranteeing that, some instances might fail to start on their own. The procedure was designed to avoid this situation as much as possible.
Get the broker basic configuration:
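For instance, for the broker with ID 1:

```
/opt/kafka/bin/zookeeper-shell.sh localhost:2181 get /brokers/ids/1
```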
If we run it from the first instance, we should get:
Note the two listeners. Clients within Azure, like OpenNMS, would use the INSIDE
one on port 9092, pointing to the local FQDN or hostname of the VM (and remember they are resolvable via DNS within the same VNet). In contrast, clients outside Azure, like Minions, would use the OUTSIDE
one on port 9094 pointing to the Public FQDN of each Kafka instance (accessible thanks to the NSG associated with each VM).
Kafka defaults to the hostname
or FQDN
of the primary interface when we don't explicitly specify it on the listener.
As Azure DNS works by default, hostnames are resolvable by all VMs within the same VNET. For this reason, Kafka will use the correct one.
However, if you're using another cloud provider or bare metal, make sure DNS works across all the VMs. Otherwise, change the INSIDE
listener to explicitly point to the private IP address of the VM and the OUTSIDE
listener to point to the public IP address of the VM; and make sure to use static IPs if you're going to rely on them.
Another way to validate the behavior is to use the console producer and console consumer to confirm that we can send and receive messages through a given topic.
To do that, for recent versions of Kafka, let's create a Test
topic:
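A sketch of the command (partition and replication-factor values are illustrative):

```
/opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic Test --partitions 3 --replication-factor 2
```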
Then, start a console producer from one of the brokers:
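For example (recent Kafka releases accept --bootstrap-server here):

```
/opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic Test
```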
From another broker (separate SSH session), start a console consumer:
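For example:

```
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic Test --from-beginning
```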
Go back to the terminal on which the console producer is running, type a message, and hit enter. Then, switch to the console consumer terminal, and we should see the message sent. Use Ctrl+C
to stop the producer and consumer.
A more comprehensive test would be to download Kafka locally on your machine and run either the producer or the consumer there (use port 9094 and the public FQDN or IP of one of the brokers). That serves to test connectivity from the Internet.
To create the Test
topic:
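A sketch for the older CLI:

```
/opt/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 \
  --create --topic Test --partitions 3 --replication-factor 2
```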
As you can see, the difference is talking against Zookeeper directly (using --zookeeper
), instead of reaching Kafka (using --bootstrap-server
).
For the producer use --broker-list
instead of --bootstrap-server
, for instance:
For the consumer, it is the same as with newer versions:
The retention settings are the default (for instance, log.retention.hours
and log.retention.bytes
at the broker level; or retention.ms
and retention.bytes
at the topic level), but it is recommended to reduce them for the RPC topics; due to the TTL, it isn't worth keeping those messages for long, so 1 hour is more than enough.
Having said that, data pruning happens on closed segments only, meaning Kafka won't delete old records from the active segment (the one currently being updated with new records). That means you should also change the segment.bytes
or segment.ms
at the topic level to allow deletion. These can be equal to or less than the expected retention. Of course, it is crucial to have the single-topic
feature enabled for RPC in both Minion and OpenNMS.
However, we must fix that after the topics are created by either OpenNMS or the Minions, using the Kafka CLI tools or specialized applications like topicctl or CMAK.
For instance, on newer versions of Kafka:
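A sketch, assuming the single-topic RPC request topic for the Durham location ends up named OpenNMS.Durham.rpc-request (the actual name depends on the Instance ID and the location; list the topics first to confirm):

```
/opt/kafka/bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name OpenNMS.Durham.rpc-request \
  --add-config retention.ms=3600000,segment.ms=3600000
```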
For older versions:
Note that topic level settings and broker level settings are slightly different. The topic level settings override the broker level settings when they exist.
Be careful when setting the number of partitions per topic if you're planning to have a massive number of Minion locations or share the cluster across multiple OpenNMS instances with a high number of locations. This is why having the single-topic
enabled in OpenNMS and Minion is the best approach (the default in H28).
Each lead partition (and each replica the broker maintains) will have a directory in the data directory, and Kafka will maintain a file descriptor per segment. Each segment contains two files, the index and the data itself. For more information, check this blog post.
It is recommended to have a dedicated file system for the data directory formatted using XFS with noatime
and nodiratime
in production.
Create a cloud-init script with the following content to deploy PostgreSQL, the latest OpenNMS Horizon, and CMAK in Ubuntu LTS and store it at /tmp/opennms-template.yaml
:
We don't need to specify the whole list of Kafka brokers as part of the bootstrap.servers
entry. The whole topology will be discovered through the first one that responds, and the client will use what's configured as the advertised listener to talk to each broker. I added two in case the first one is unavailable (as a backup).
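As a hedged example of what those entries could look like (the property names are the standard Sink/RPC Kafka settings; the broker FQDNs are placeholders):

```
# $OPENNMS_HOME/etc/opennms.properties.d/kafka.properties (sketch)
org.opennms.core.ipc.sink.kafka.bootstrap.servers=<kafka1-internal-fqdn>:9092,<kafka2-internal-fqdn>:9092
org.opennms.core.ipc.rpc.kafka.bootstrap.servers=<kafka1-internal-fqdn>:9092,<kafka2-internal-fqdn>:9092
```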
If you're using an older version of Kafka, make sure to set the appropriate version when adding your cluster to CMAK.
The above installs the latest OpenJDK 11, the latest PostgreSQL, and the latest OpenNMS Horizon on the VM. It also installs Kafka Manager (CMAK) via Docker. I added the most basic configuration for PostgreSQL to work with authentication. Kafka will be enabled for Sink/RPC as well as the Kafka Producer. As mentioned, Azure VMs can reach each other through hostnames.
Create an Ubuntu VM for OpenNMS:
Note that I'm assuming the usage of SSH Keys for password-less access. Make sure to have a public key located at ~/.ssh/id_rsa.pub
, or update the az vm create
command.
Keep in mind that the cloud-init
process starts once the VM is running, meaning we should wait about 5 minutes after the az vm create
is finished to see OpenNMS up and running.
In case there is a problem, SSH into the VM using the public IP and the provided credentials and check /var/log/cloud-init-output.log
to verify the progress and the status of the cloud-init execution.
Wait until OpenNMS is up and running and then execute the following to start monitoring all the ZK/Kafka servers and the OpenNMS server via SNMP and JMX.
After verifying that OpenNMS is up and running, we can proceed to create the Minions.
Create a cloud-init script to deploy Minion in Ubuntu and save it at /tmp/minion-template.yaml
:
Note that I'm using the same content for bootstrap.servers
as OpenNMS, making sure to use the Public FQDNs, as Minions won't be running in Azure.
Then, start the new Minion via multipass
:
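A sketch of the launch sequence (the VM name and sizing are assumptions; the variables to export depend on what your template references):

```
export MINION_LOCATION="Durham"
envsubst < /tmp/minion-template.yaml > /tmp/minion01.yaml
multipass launch --name minion01 --cpus 1 --mem 2G --cloud-init /tmp/minion01.yaml
```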
Optionally, create a second Minion in the same location:
In case there is a problem, access the VM (e.g., multipass shell minion01
) and check /var/log/cloud-init-output.log
to verify the progress and the status of the cloud-init execution.
Feel free to change the CPU and memory settings for your Minion, but make sure it is consistent with MINION_HEAP_SIZE
. Make sure to validate communication using the health-check
command from the Karaf Shell.
When having multiple Minions per location, they will become part of a consumer group from Kafka's perspective for the RPC requests topic. The group ID will be the name of the location.
As you can see, the location name is Durham
(a.k.a. $MINION_LOCATION
), and you should see the Minions in that location registered in OpenNMS.
SSH into the OpenNMS server and create a requisition with a node in the same network as the Minion VMs, and make sure to associate it with the appropriate location. For instance,
Ensure to replace 192.168.0.40
with the IP of a working server in your network (reachable from the Minion VM, and preferably unreachable or nonexistent in Azure), and do not forget to use the same location as defined in $MINION_LOCATION
.
Please keep in mind that Minions are VMs on your machine. 192.168.0.40
is the IP of one of my machines, which is why the Minions can reach it (and vice versa). To access an external machine on your network, make sure to define static routes on that machine so it can reach the Minions through your machine (assuming you're running Linux or macOS).
OpenNMS, which runs in Azure and has no direct access to 192.168.0.40, should be able to collect data and monitor that node through any of the Minions. In fact, you can stop one of them, and OpenNMS would continue monitoring it.
To test asynchronous messages, you can send SNMP traps or Syslog messages to one of the Minions. Alternatively, you could use udpgen for this purpose. Usually, you could put a Load Balancer in front of the Minions and use its IP when sending messages from the monitored devices.
The machine that will be running udpgen
must be part of the OpenNMS inventory. Then, find the IP of the Minion using multipass list
, and execute the following from the machine added as a node above (the examples assume the IP of the Minion is 192.168.75.16
):
To send SNMP Traps:
To send Syslog Messages:
The C++ version of udpgen
only works on Linux. If you're on macOS, you can use the Go version of it. Unfortunately, Windows is not an option due to a lack of support for Syslog in Go.
Note that an event definition is required when using udpgen
to send traps. Here is what you'd need for Eventd
:
If you want to make the tests more interesting, add the following to the above definition:
The Hawtio UI in OpenNMS can help visualize the relevant JMX metrics and understand what’s circulating between OpenNMS and the Minions.
For OpenNMS, Hawtio is available through :8980/hawtio
if the package opennms-webapp-hawtio
was installed (which is the case with the cloud-init
template used).
For Minions, Hawtio is available through :8181/hawtio
.
As mentioned, if time is not synchronized across all the instances, the Heartbeat messages sent by Minions via the Sink API won't be processed properly by OpenNMS, leading to the Minion not being registered or to outages on the Minion-Heartbeat
service.
We can inspect the traffic on the topics to see if the Minion is sending (or receiving) traffic to Kafka. However, as the payload is encoded within a Protobuf message, using the console consumer might not be as useful as we'd expect. Still, it works for troubleshooting purposes. For instance, from one of the Kafka brokers, we can do:
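For example, consuming from the Heartbeat Sink topic (the topic name below assumes the default Instance ID; list the topics first if you're unsure about the exact names):

```
/opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic OpenNMS.Sink.Heartbeat
```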
And we'll see:
As we can see, the actual payload within the Protobuf message is an indented XML.
The following application can be used to properly inspect the content without worrying about the non-readable content due to the Protobuf format:
https://github.com/agalue/onms-kafka-ipc-receiver
For RPC in particular, we can access the Karaf Shell from the OpenNMS instance and use the opennms:stress-rpc
command to verify communication against the Minions on a given location or against a specific Minion, and as the command name implies, to perform stress tests.
For recent versions of Kafka, the following can help to get details about topics, lags, consumer groups and so on.
To verify the topic partitions and replica settings:
To verify the current topic-level settings:
To verify offsets, topics lag and consumer groups:
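A sketch of those three checks (the topic name is illustrative, and --all-groups is only available in recent Kafka releases):

```
# Partitions and replicas
/opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe
# Topic-level settings
/opt/kafka/bin/kafka-configs.sh --bootstrap-server localhost:9092 --describe \
  --entity-type topics --entity-name OpenNMS.Durham.rpc-request
# Consumer groups, offsets, and lag
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups
```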
When enabling security (either SASL or TLS), you need to pass those settings to the commands.
For instance, if you have SASL enabled, you should pass:
Where the content of consumer.properties
would be:
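A sketch of that file, matching the SASL/SCRAM setup described later in this guide (the username and password are placeholders):

```
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="opennms" password="changeme";
```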
For older versions of Kafka, the equivalent commands are the following:
To verify the topic partitions and replica settings:
To verify the current topic-level settings:
To verify offsets, topics lag and consumer groups:
When passing the ZK host to --zookeeper
, that has to be consistent with how zookeeper.connect
was defined on each Kafka broker. If you used something like this zk1:2181,zk2:2181/kafka
, you should then pass --zookeeper $(hostname):2181/kafka
instead.
In big environments, it is common to have multiple OpenNMS instances, each of them with its own fleet of Minions to monitor one of the multiple data centers or a section of it. In those scenarios, it is common to have a centralized Kafka cluster that can be shared across all of them (for more information, follow this link).
The above solution has to be modified to ensure each set of OpenNMS and Minions will use their own set of topics in Kafka to avoid collisions.
The topics' prefix (which defaults to OpenNMS
) can be controlled via a system-wide property called Instance ID (a.k.a. org.opennms.instance.id
). We must configure this property in both places. For OpenNMS, add it to a property file inside $OPENNMS_HOME/etc/opennms.properties.d
; and for a Minion, add it to $MINION_HOME/etc/custom.system.properties
.
In production, when having multiple Minions per location, it is a good practice to put a Load Balancer in front of them so that the devices can use a single destination for SNMP Traps, Syslog, and Flows.
The following creates a cloud-init template for Ubuntu to start a basic LB using nginx
through multipass
for SNMP Traps (with a listener on port 162) and Syslog Messages (with a listener on port 514). Save the template at /tmp/nginx-template.yaml
:
Note the usage of environment variables within the YAML template. We will substitute them before creating the VM.
Then, update the template and create the LB:
Flows are outside the scope of this test as that requires more configuration on Minions and OpenNMS besides having an Elasticsearch cluster up and running with the required plugin in place.
The above procedure uses Kafka and Zookeeper in plain text without authentication or encryption. That works for testing purposes or perhaps for private clusters, where access to the servers is restricted and audited.
This example, in particular, exposes Kafka to the Internet, which requires having at least authentication in place. The following explains how to enable authentication and then the steps to enable encryption.
For a more comprehensive guide, follow this tutorial from Confluent.
This section explains how to enable authentication using SASL with SCRAM-SHA-512 for Kafka and DIGEST
for Zookeeper (as Zookeeper doesn't support SCRAM
). Because this guide's intention is learning, I decided to add security as a separate or optional module. That's due to the extra complexity associated with this advanced topic.
Here are the high-level changes:
- Update server.properties and the systemd service definition on each Kafka broker to enable and use SASL.
- Update zookeeper.properties and the systemd service definition on each ZK instance to enable and use SASL.

Access one of the brokers and execute the following command:
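The command below is a sketch based on the standard Kafka procedure for creating SCRAM credentials via Zookeeper; the username and password are placeholders, and you should repeat it for every user you need (e.g., minion, plus the broker's inter-broker user if brokers authenticate to each other):

```
/opt/kafka/bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --add-config 'SCRAM-SHA-256=[password=changeme],SCRAM-SHA-512=[password=changeme]' \
  --entity-type users --entity-name opennms
```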
On each Zookeeper instance, update zookeeper.properties
to enable SASL:
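A sketch of the additions (these are the standard Zookeeper SASL settings):

```
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl
```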
On each Kafka broker instance, update server.properties
to enable SASL/SCRAM:
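A sketch of the resulting SASL-related entries, keeping the INSIDE/OUTSIDE split described earlier:

```
listener.security.protocol.map=INSIDE:SASL_PLAINTEXT,OUTSIDE:SASL_PLAINTEXT
sasl.enabled.mechanisms=SCRAM-SHA-256,SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
```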
Note that listener.security.protocol.map
already exists in that file, which is why I removed it prior to adding the required changes.
In theory, there is no need to enable both SCRAM-SHA-256
and SCRAM-SHA-512
. I did that for compatibility purposes, but I'll use SCRAM-SHA-512
for all subsequent configurations.
On each Zookeeper instance, create the JAAS
configuration file with the credentials:
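A sketch of that file (e.g., /opt/kafka/config/zookeeper_jaas.conf; the kafka user and its password are placeholders and must match what the brokers will use):

```
Server {
    org.apache.zookeeper.server.auth.DigestLoginModule required
    user_kafka="changeme";
};
```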
On each Kafka broker, create the JAAS
configuration file with the credentials:
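A sketch of that file (e.g., /opt/kafka/config/kafka_jaas.conf); the Client section must match the DIGEST credentials defined on the Zookeeper side, and the KafkaServer credentials are used for inter-broker SASL (create them with kafka-configs like any other user):

```
KafkaServer {
    org.apache.kafka.common.security.scram.ScramLoginModule required
    username="kafka"
    password="changeme";
};

Client {
    org.apache.zookeeper.server.auth.DigestLoginModule required
    username="kafka"
    password="changeme";
};
```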
On each Zookeeper instance, update the systemd
service definition to load the JAAS settings via KAFKA_OPTS
:
On each Kafka broker, update the systemd
service definition to load the JAAS settings via KAFKA_OPTS
:
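For both units, the change boils down to exporting KAFKA_OPTS with the path to the corresponding JAAS file. A sketch, assuming the file names used above (adjust the unit and paths to whatever the cloud-init template created):

```
# In the [Service] section of the kafka unit (use zookeeper_jaas.conf for the zookeeper unit)
Environment="KAFKA_OPTS=-Djava.security.auth.login.config=/opt/kafka/config/kafka_jaas.conf"
```

After editing the units, run sudo systemctl daemon-reload before restarting the services.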
Restart the cluster in the following order:
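For instance (the service names are assumptions; use whatever names the template defined):

```
# 1. Restart Zookeeper on every ZK instance
sudo systemctl restart zookeeper
# 2. Then restart Kafka on every broker
sudo systemctl restart kafka
```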
At this point, you should pass the SASL credentials to all Kafka CLI Tools. For instance,
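For instance, listing the topics with the SASL client settings stored in a properties file (a sketch; the file contains the security.protocol, sasl.mechanism, and sasl.jaas.config entries shown earlier):

```
/opt/kafka/bin/kafka-topics.sh --bootstrap-server $(hostname):9092 --list \
  --command-config /tmp/consumer.properties
```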
Note how we pass the consumer settings. The above command should list all the topics in the cluster. If you can see the list, then SASL is working. Keep in mind that without passing --command-config, the command will time out, as the tool cannot communicate with Kafka without the credentials.
On the OpenNMS instance, update /opt/opennms/etc/opennms.properties.d/kafka.properties
and /opt/opennms/etc/org.opennms.features.kafka.producer.cfg
to use SASL, and restart OpenNMS. For instance:
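As a hedged sketch of the kind of entries involved (OpenNMS forwards client properties to Kafka using the module prefixes; verify the exact property names against the documentation for your Horizon version, and note the password is a placeholder):

```
# /opt/opennms/etc/opennms.properties.d/kafka.properties (sketch)
org.opennms.core.ipc.sink.kafka.security.protocol=SASL_PLAINTEXT
org.opennms.core.ipc.sink.kafka.sasl.mechanism=SCRAM-SHA-512
org.opennms.core.ipc.sink.kafka.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="opennms" password="changeme";
org.opennms.core.ipc.rpc.kafka.security.protocol=SASL_PLAINTEXT
org.opennms.core.ipc.rpc.kafka.sasl.mechanism=SCRAM-SHA-512
org.opennms.core.ipc.rpc.kafka.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="opennms" password="changeme";
# The Kafka Producer feature takes the equivalent client settings in its own .cfg file.
```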
On each Minion, update /etc/minion/org.opennms.core.ipc.sink.kafka.cfg
and /etc/minion/org.opennms.core.rpc.sink.kafka.cfg
to use SASL, and restart Minion. For instance:
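A sketch for the Sink file (the RPC file takes the same client settings; the FQDNs, username, and password are placeholders):

```
# /etc/minion/org.opennms.core.ipc.sink.kafka.cfg (sketch)
bootstrap.servers=<kafka1-public-fqdn>:9094,<kafka2-public-fqdn>:9094
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="minion" password="changeme";
```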
The solution works from the OpenNMS and Minion perspective, despite the following message appearing repeatedly in /opt/kafka/logs/server.log
on all brokers:
Where 13.0.1.7
is the IP of the OpenNMS server.
At this point, we have SASL
authentication enabled using SCRAM-SHA-512
for Kafka and DIGEST
for Zookeeper, meaning credentials might be hard to crack when intercepting traffic (but perhaps not impossible). However, to make it more secure, encryption is recommended.
If you already configured CMAK
, make sure to enable the SASL/SCRAM mechanism for your cluster.
Please keep in mind that enabling SSL/TLS will increase CPU demand on each broker and the clients, which is why using OpenJDK 11 over JDK 8 is encouraged.
To enable TLS, and because each Kafka Broker must be exposed and reachable through a public DNS entry, I'm going to use LetsEncrypt to generate the certificates. That will save a few steps because the certificates will be publicly valid, so we won't need to set up a Trust Store.
A Trust Store is mandatory when using private CAs or self-signed certificates to configure every entity that touches Kafka directly or indirectly.
The Certbot utility used to create and validate the certificate will start a temporary web server on the instance (for the validation process). For this reason, we should temporarily allow access through TCP port 80:
Then, on each Kafka Broker (one by one), we must do the following to enable TLS:
Please use your own email, and keep in mind that the Azure location is hardcoded in the command; if you're using a different one, update the FQDN.
Note that SSL was only enabled for the OUTSIDE
listener, meaning we should only modify the Minions (and listener.security.protocol.map
was changed because of that), as OpenNMS won't use it because it lives in the same protected network as the Kafka cluster.
To verify, you can retrieve the broker configuration via Zookeeper:
If everything went well, you should get something like this:
Note that SASL_SSL
applies to OUTSIDE
. Now it is time to update the Minions.
On each Minion, do the following:
While you're there, you can check if TLS is actually enabled by running:
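For instance, a quick handshake test against one of the brokers' public listeners can be done with openssl (the FQDN is a placeholder):

```
openssl s_client -connect <kafka1-public-fqdn>:9094 -servername <kafka1-public-fqdn> </dev/null
```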
There is no need to modify anything else as we're using valid certificates signed by a well-known public entity. When using private certificates or private CAs, you would have to create a Trust Store via keytool
for the clients and the brokers.
As a challenge to the reader, update the /tmp/kafka-template.yaml
, /tmp/opennms-template.yaml
, and /tmp/minion-template.yaml
to include all the SASL and SSL/TLS configuration and start the whole environment from scratch with authentication and encryption enabled.
The following is inspired by this guide to enable TLS with Nginx for the OpenNMS WebUI and Grafana. However, as we're using Ubuntu here, I'll describe the required changes.
Allow access via TCP 80 and 443:
SSH into the OpenNMS server and then:
Make sure to use a valid value for $EMAIL
, as that's required by LetsEncrypt (as we did for Kafka).
Note that cmak
(or Kafka Manager) is not present due to the complexity of having it working behind a proxy.
You can remove the NSG rules for ports 8980 and 3000.
Work in progress…
Some circumstances could introduce unexpected behavior to the solution. Besides the traditional monitoring to ensure that all the components are behaving as expected in CPU, Memory, Java Heap Memory, Java GC, and IO (covered as part of this tutorial), you sometimes need to dig deeper to understand what's happening.
OpenNMS added OpenTracing support via Jaeger to understand how much time messages sent via the broker are taking to be produced and consumed.
The official documentation has a guide about how to configure it.
As we have Docker running in the OpenNMS server, we can start an All-In-One Jaeger Instance through it very easily. To do that, SSH into the OpenNMS server and run the following:
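A minimal sketch using the official all-in-one image (the ports match those mentioned below):

```
sudo docker run -d --name jaeger \
  -p 6831:6831/udp -p 6832:6832/udp -p 16686:16686 \
  jaegertracing/all-in-one:latest
```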
OpenNMS has direct access to Jaeger, as it runs on the same machine (reachable via localhost), and should be configured as instructed in the official docs.
For the Minions, you would need to open the UDP ports 6831 and 6832 in the NSG associated with the OpenNMS server, as well as TCP 16686 to access the Jaeger WebUI:
Then, configure the minion as instructed in the official docs, using the OpenNMS FQDN and the port mentioned above.
When we're done, make sure to delete the cloud resources.
If you created the resource group for this exercise, you could remove all the resources with the following command:
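For instance:

```
az group delete --name "$RG_NAME" --yes --no-wait
```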
If you're using an existing resource group that you cannot remove, make sure to remove only the resources created in this tutorial. All of them should be easily identified, as they will contain the username and the VM name as part of the resource name. The easiest way is to use the Azure Portal for this operation. Alternatively,
The reason for having two deletion passes is that, by default, the list initially contains the disks, which cannot be removed before the VMs. For this reason, we exclude the disks in the first pass and then remove them in the second.
Note that because all the resource names are prefixed with the chosen username, we can use it to identify them and remove them uniquely.
Then clean the local resources:
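For instance, assuming the Minions were named minion01 and minion02:

```
multipass delete minion01 minion02
multipass purge
```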
Remember to remove the nginx
instance if you decided to use it.