This lab starts an OpenNMS instance and a 3-node ZK/Kafka cluster in the cloud (Azure) and two Minions on your machine (via Multipass), using Kafka for communication, for learning purposes.
The lab doesn't cover security by default (user authentication and encryption), which is crucial if we ever want to expose the Kafka cluster to the Internet. A separate section covers the required changes for this.
Keep in mind that nothing prevents us from skipping the cloud provider and doing everything with Multipass (or VirtualBox, or Hyper-V, or VMware). The reason for using a cloud provider is to prove that OpenNMS can monitor unreachable devices via Minion. Similarly, we could use any other cloud provider instead of Azure; however, I won't explain how to port the solution here.
Time synchronization across all the instances involved in this solution is mandatory. Failing to keep the clocks in sync could lead to undesired side effects. This is essentially guaranteed when using a cloud provider, which is why I do not include explicit instructions for it, but please be aware of it.
The scripts used throughout this tutorial rely on envsubst; make sure to have it installed.
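If you're not sure whether it's present, the following is a quick check and install; the package names assume Debian/Ubuntu (where envsubst ships with gettext-base) or macOS with Homebrew:
command -v envsubst || echo "envsubst is missing"
sudo apt-get install -y gettext-base # Debian/Ubuntu
brew install gettext # macOS (Homebrew); you may need to add it to your PATH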
Make sure to log into Azure using az login prior to creating the VMs.
If you have a restricted account in Azure, make sure you have the Network Contributor role and the Virtual Machine Contributor role associated with your Azure AD account for the resource group where you want to create the VMs. Of course, either Owner or Contributor at the resource group level also works.
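If in doubt, the following is one way to list the roles assigned to your account on the resource group (a sketch, assuming the Azure CLI can query your signed-in user; on older CLI versions the id field is called objectId). Run it after setting the environment variables below:
az role assignment list \
--assignee $(az ad signed-in-user show --query id -o tsv) \
--resource-group $RG_NAME --output table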
All of the following assumes you have a macOS or Linux machine or VM from which you can issue all the commands.
export PREFIX="$USER" # String to prepend to the name of all Azure resources
export RG_NAME="OpenNMS" # Change it to use a shared one
export LOCATION="eastus" # Azure Region
export DOMAIN="$LOCATION.cloudapp.azure.com" # Public Azure DNS Domain
export TIMEZONE="America/New_York"
export VNET_CIDR="13.0.0.0/16"
export VNET_SUBNET="13.0.1.0/24"
export VNET_NAME="$PREFIX-vnet"
export VNET_SUBNET_NAME="subnet1"
export KAFKA_VM_SIZE="Standard_D2s_v3" # 2 VCPU, 8 GB of RAM
export ZK_HEAP_SIZE="1G" # Must fit KAFKA_VM_SIZE
export KAFKA_URL="https://downloads.apache.org/kafka/2.8.1/kafka_2.13-2.8.1.tgz"
export KAFKA_JAVA_VERSION="11" # 8 for < 2.1.0; 11 for >= 2.1.0
export KAFKA_HEAP_SIZE="2G" # Must fit KAFKA_VM_SIZE
export KAFKA_PARTITIONS="9" # > Number of Minions per location
export KAFKA_CLUSTER_SIZE="3" # Total instances of Kafka+ZK
export KAFKA_RF="2" # < KAFKA_CLUSTER_SIZE
export ONMS_VM_NAME="$PREFIX-onms01"
export ONMS_VM_SIZE="Standard_D2s_v3" # 2 VCPU, 8 GB of RAM
export ONMS_HEAP_SIZE="4096" # Expressed in MB and must fit ONMS_VM_SIZE
export MINION_LOCATION="Durham"
export MINION_HEAP_SIZE="1G" # Must fit VM RAM
We haven't tested 3.0.0
, so please use 2.8.x
or older for now.
Feel free to change the content, and keep in mind that $PREFIX is what we will use throughout this tutorial to uniquely identify all the resources we will create in Azure.
Do not confuse the Azure Location or Region with the Minion Location; they are unrelated concepts.
We're going to leverage the Azure DNS services to avoid the need to remember and use public IP addresses, which helps if we're interested in having HTTPS with valid certificates as explained here, not only for OpenNMS but also to enable SSL/TLS in Kafka.
In Azure, the default public DNS names follow this pattern:
<vm-name>.<location>.cloudapp.azure.com
To make each VM's FQDN unique, we're going to add the username to the VM name. For instance, the OpenNMS FQDN would be:
agalue-onms01.eastus.cloudapp.azure.com
The above is what we can use to access the VM via SSH and to configure Minions.
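For example, assuming key-based authentication with the admin username chosen at creation time, you could reach the OpenNMS VM like this once it exists:
ssh $USER@$ONMS_VM_NAME.$DOMAIN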
This is a necessary step, as every resource in Azure must belong to a resource group and a location.
However, you can omit the following command and use an existing one if you prefer. In that case, make sure to adjust the environment variable RG_NAME
so the subsequent commands will target the correct group.
az group create -n $RG_NAME -l $LOCATION --tags Owner=$USER
I prefer to create the VNET myself instead of letting Azure do it for me, especially when we want to guarantee that all the VMs will exist in the same one.
az network vnet create -g $RG_NAME \
--name $VNET_NAME \
--address-prefix $VNET_CIDR \
--subnet-name $VNET_SUBNET_NAME \
--subnet-prefix $VNET_SUBNET \
--tags Owner=$USER \
--output table
The following cloud-init template assumes a 3-node cluster, where each VM has Zookeeper and Kafka configured and running on Ubuntu LTS.
For simplicity, Zookeeper and Kafka will run on each machine. In production, each cluster should have its own instances, as Zookeeper should not grow the same way Kafka grows, for multiple reasons: a ZK cluster should always have an odd number of members (which is not the case for Kafka), and traffic across ZK members grows exponentially with the number of instances (a ZK cluster of 5 members can manage multiple dozens of Kafka brokers; with 7 it can manage hundreds, and with 9, thousands).
For the 3-node cluster, the VMs will be named as follows: $PREFIX-kafka-1, $PREFIX-kafka-2, and $PREFIX-kafka-3.
Note the hostnames include the chosen username to make them unique, which is mandatory for shared resource groups and the default Azure public DNS domain in the chosen region.
Remember that each VM in Azure is reachable within the same VNet from any other VM through its hostname.
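To confirm the name resolution, you can resolve another VM's hostname from any VM within the VNet; for instance, from the OpenNMS VM (using my username as the example prefix):
getent hosts agalue-kafka-1
ping -c 1 agalue-kafka-2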
Of all the environment variables you'll encounter in the upcoming template, two are crucial:
For server.properties
, we must replace the environment variable PUBLIC_FQDN
in the advertised.listeners
with the public FQDN or IP of the VM when configuring the application before running it for the first time. With that in mind, there will be two listeners, one to be used within the VNet (which is what OpenNMS would use, on port 9092), and another associated with the Public FQDN (on port 9094), to be used by external Minions (outside Azure).
Similarly, we must replace INSTANCE_ID
with a unique numeric value per instance for the broker.id
in server.properties
for Kafka and the myid
file for Zookeeper, which are the mandatory requirements to identify each instance in their respective cluster.
The number of topic partitions must be greater than the number of Minions on a given location and greater than the number of brokers in the cluster.
Create a YAML file called /tmp/kafka-template.yaml
with the following content:
#cloud-config
package_upgrade: true
timezone: $TIMEZONE
users:
- default
- name: kafka
write_files:
- owner: root:root
path: /etc/security/limits.d/kafka.conf
content: |
* soft nofile 100000
* hard nofile 100000
- owner: root:root
path: /etc/sysctl.d/99-kafka.conf
content: |
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_intvl=10
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=40960
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.ipv4.tcp_window_scaling=1
net.core.netdev_max_backlog=2500
net.core.somaxconn=65000
vm.swappiness=1
vm.zone_reclaim_mode=0
vm.max_map_count=1048575
- owner: root:root
permissions: '0400'
path: /etc/snmp/snmpd.conf
content: |
rocommunity public default
syslocation Azure - $LOCATION
syscontact $USER
dontLogTCPWrappersConnects yes
disk /
- owner: root:root
path: /etc/systemd/system/zookeeper.service
content: |
[Unit]
Description=Apache Zookeeper server
Documentation=http://zookeeper.apache.org
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=kafka
Group=kafka
Environment="KAFKA_HEAP_OPTS=-Xmx$ZK_HEAP_SIZE -Xms$ZK_HEAP_SIZE"
ExecStart=/opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties
ExecStop=/opt/kafka/bin/zookeeper-server-stop.sh
[Install]
WantedBy=multi-user.target
- owner: root:root
path: /etc/systemd/system/kafka.service
content: |
[Unit]
Description=Apache Kafka Server
Documentation=http://kafka.apache.org
Wants=zookeeper.service
After=zookeeper.service network-online.target
[Service]
Type=simple
User=kafka
Group=kafka
LimitNOFILE=100000
Environment="KAFKA_HEAP_OPTS=-Xmx$KAFKA_HEAP_SIZE -Xms$KAFKA_HEAP_SIZE"
Environment="KAFKA_JMX_OPTS=-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.rmi.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=%H -Djava.net.preferIPv4Stack=true"
Environment="JMX_PORT=9999"
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
[Install]
WantedBy=multi-user.target
- owner: root:root
path: /tmp/zookeeper.properties # Designed for a 3-node ZK cluster
content: |
dataDir=/data/zookeeper
tickTime=2000
clientPort=2181
initLimit=10
syncLimit=5
# Cluster Members
server.1=$PREFIX-kafka-1:2888:3888;2181
server.2=$PREFIX-kafka-2:2888:3888;2181
server.3=$PREFIX-kafka-3:2888:3888;2181
- owner: root:root
path: /tmp/server.properties # Designed for a 3-node ZK cluster
content: |
broker.id=$INSTANCE_ID
log.dirs=/data/kafka
zookeeper.connect=$PREFIX-kafka-1:2181,$PREFIX-kafka-2:2181,$PREFIX-kafka-3:2181
zookeeper.connection.timeout.ms=30000
# Connection
advertised.listeners=INSIDE://:9092,OUTSIDE://$PUBLIC_FQDN:9094
listeners=INSIDE://:9092,OUTSIDE://:9094
listener.security.protocol.map=INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
inter.broker.listener.name=INSIDE
# Replication
offsets.topic.replication.factor=$KAFKA_RF
default.replication.factor=$KAFKA_RF
min.insync.replicas=1
# Must be greater than number of Minions per Location
num.partitions=$KAFKA_PARTITIONS
# Recommended for the OpenNMS Kafka Producer
message.max.bytes=5000000
replica.fetch.max.bytes=5000000
compression.type=producer
# Cleanup (remove segments older than a week)
log.retention.hours=168
log.retention.bytes=-1
# Required for OpenNMS and Minions
auto.create.topics.enable=true
# Recommended to avoid disrupting messages workflow
delete.topic.enable=false
packages:
- snmp
- snmpd
- jq
- openjdk-$KAFKA_JAVA_VERSION-jre-headless
runcmd:
- sysctl --system
- wget -O /tmp/kafka.tar.gz $KAFKA_URL
- cd /opt
- mkdir kafka
- tar -xvzf /tmp/kafka.tar.gz -C kafka --strip-components 1
- mv -f /tmp/*.properties /opt/kafka/config/
- mkdir -p /data/zookeeper /data/kafka
- chown -R kafka:kafka /data /opt/kafka*
- echo $INSTANCE_ID > /data/zookeeper/myid
- systemctl daemon-reload
- systemctl --now enable zookeeper
- systemctl --now enable kafka
- systemctl --now enable snmpd
The reason for increasing the message size (message.max.bytes
, replica.fetch.max.bytes
) is to avoid problems when forwarding collected metrics to Kafka via the Kafka Producer feature of OpenNMS, which I'm planning to enable.
If, for instance, you want to use an older version of Kafka, you can tune the JDK package and the Kafka URL so the template can apply the correct one, for instance:
export KAFKA_URL="https://archive.apache.org/dist/kafka/1.1.0/kafka_2.11-1.1.0.tgz"
export KAFKA_JAVA_VERSION="8"
Also, edit the template and remove ;2181 from the server entries in zookeeper.properties, as expressing the client port that way requires Zookeeper 3.5 or newer.
for i in $(seq 1 $KAFKA_CLUSTER_SIZE); do
VM_NAME="$PREFIX-kafka-$i"
echo "Creating VM $VM_NAME..."
export INSTANCE_ID="$i"
export PUBLIC_FQDN="$VM_NAME.$DOMAIN"
envsubst < /tmp/kafka-template.yaml > $VM_NAME.yaml
az vm create --resource-group $RG_NAME --name $VM_NAME \
--size $KAFKA_VM_SIZE \
--image canonical:0001-com-ubuntu-server-focal:20_04-lts:latest \
--admin-username $USER \
--ssh-key-values ~/.ssh/id_rsa.pub \
--vnet-name $VNET_NAME \
--subnet $VNET_SUBNET_NAME \
--public-ip-sku Standard \
--public-ip-address-dns-name $VM_NAME \
--custom-data $VM_NAME.yaml \
--tags Owner=$USER \
--no-wait
done
Note that I'm assuming the usage of SSH Keys for password-less access. Make sure to have a public key located at ~/.ssh/id_rsa.pub
, or update the az vm create
command.
The above will start all the VMs simultaneously using public IP addresses and FQDNs, to avoid access problems with external Minions and reconfiguration issues with the Kafka advertised listeners. However, like the public IPs, the private IPs will be dynamic. Fortunately, this is not going to be a problem as we're going to use DNS to access Kafka.
Keep in mind that the cloud-init process starts once the VM is running, meaning we should wait a few minutes after the VMs are created before they are ready to use.
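If you want to know precisely when a given instance finished its initialization, you can SSH into it and block until cloud-init is done; for instance, for the first broker:
ssh $USER@$PREFIX-kafka-1.$DOMAIN cloud-init status --wait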
Then, allow access for remote Minions:
for i in $(seq 1 $KAFKA_CLUSTER_SIZE); do
VM_NAME="$PREFIX-kafka-$i"
az vm open-port -g $RG_NAME -n $VM_NAME \
--port 9094 --priority 100 --output table
done
You can inspect the generated YAML files to see the final content used on each VM (after applying the env-var substitutions).
In case there is a problem, SSH into the VM using the public IP and the provided credentials and check /var/log/cloud-init-output.log
to verify the progress and the status of the cloud-init execution.
To make sure the Zookeeper cluster started, we can use the "4 letter words" commands via the embedded web server (available when using version 3.5 or newer), for instance:
curl http://$(hostname):8080/commands/monitor
The above gives us general information, including the server_state
, which can be leader
or follower
.
To get statistics:
curl http://$(hostname):8080/commands/stats
For Zookeeper version 3.4 or older (for instance, when using older versions of Kafka), you can still use the deprecated way to verify:
echo stat | nc $(hostname) 2181; echo
From Kafka's perspective, we can verify how each broker has registered via Zookeeper or follow this guide to create a topic and use the console producer and consumer to validate its functionality.
List Broker IDs:
/opt/kafka/bin/zookeeper-shell.sh $(hostname) ls /brokers/ids
We should get:
[1, 2, 3]
If that's not the case, SSH into the broker that is not listed and make sure Kafka is running. It is possible that Kafka failed to register in Zookeeper and did not start, due to how the VMs are initialized: Zookeeper (the whole cluster) should start first, and then Kafka, but as we're not guaranteeing that, some instances might fail to start on their own. The procedure was designed to avoid this situation as much as possible.
Get the broker basic configuration:
/opt/kafka/bin/zookeeper-shell.sh $(hostname) get /brokers/ids/1 | egrep '^\{' | jq
If we run it from the first instance, we should get:
{
"features": {},
"listener_security_protocol_map": {
"INSIDE": "PLAINTEXT",
"OUTSIDE": "PLAINTEXT"
},
"endpoints": [
"INSIDE://agalue-kafka-1:9092",
"OUTSIDE://agalue-kafka-1.eastus.cloudapp.azure.com:9094"
],
"jmx_port": 9999,
"port": 9092,
"host": "agalue-kafka-1",
"version": 5,
"timestamp": "1616265688431"
}
Note the two listeners. Clients within Azure, like OpenNMS, would use the INSIDE
one on port 9092, pointing to the local FQDN or hostname of the VM (and remember they are resolvable via DNS within the same VNet). In contrast, clients outside Azure, like Minions, would use the OUTSIDE
one on port 9094 pointing to the Public FQDN of each Kafka instance (accessible thanks to the NSG associated with each VM).
Kafka defaults to the hostname
or FQDN
of the primary interface when we don't explicitly specify it on the listener.
As Azure DNS works by default, hostnames are resolvable by all VMs within the same VNET. For this reason, Kafka will use the correct one.
However, if you're using another cloud provider or bare metal, make sure to have DNS working across all the VMs. Otherwise, change the INSIDE listener to explicitly point to the private IP address of the VM and the OUTSIDE listener to point to the public IP address of the VM, and make sure to use static IPs if you're going to rely on them.
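As a reference, this is roughly how the connection section of server.properties could look when relying on static addresses instead of DNS (both IPs below are placeholders):
# Hypothetical private IP 10.0.1.11 and public IP 52.10.20.30
advertised.listeners=INSIDE://10.0.1.11:9092,OUTSIDE://52.10.20.30:9094
listeners=INSIDE://:9092,OUTSIDE://:9094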
Another way to verify the behavior is to use the console producer and console consumer to confirm that we can send and receive messages through a given topic.
To do that, for recent versions of Kafka, let's create a Test
topic:
/opt/kafka/bin/kafka-topics.sh \
--bootstrap-server $(hostname):9092 \
--create --topic Test --replication-factor 2 --partitions 3
Then, start a console producer from one of the brokers:
/opt/kafka/bin/kafka-console-producer.sh \
--bootstrap-server $(hostname):9092 --topic Test
From another broker (separate SSH session), start a console consumer:
/opt/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server $(hostname):9092 --topic Test
Go back to the terminal on which the console producer is running, type a message, and hit enter. Then, switch to the console consumer terminal, and we should see the message sent. Use Ctrl+C
to stop the producer and consumer.
A more comprehensive test would be to download Kafka locally on your machine and run either the producer or the consumer there (use port 9094 and the public FQDN or IP of one of the brokers). That serves to test connectivity from the Internet.
To create the Test
topic:
/opt/kafka/bin/kafka-topics.sh \
--zookeeper $(hostname):2181 \
--create --topic Test --replication-factor 2 --partitions 3
As you can see, the difference is talking against Zookeeper directly (using --zookeeper
), instead of reaching Kafka (using --bootstrap-server
).
For the producer use --broker-list
instead of --bootstrap-server
, for instance:
/opt/kafka/bin/kafka-console-producer.sh \
--broker-list $(hostname):9092 --topic Test
For the consumer, it is the same as with newer versions:
/opt/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server $(hostname):9092 --topic Test
The retention settings are the defaults (for instance, log.retention.hours and log.retention.bytes at the broker level; or retention.ms and retention.bytes at the topic level), but it is recommended to reduce them for the RPC topics: due to the TTL, it isn't worth keeping those records for long, so one hour is more than enough.
Having said that, data pruning happens on closed segments only, meaning Kafka won't delete old records from the active segment (the one currently being updated with new records). That means you should also change the segment.bytes
or segment.ms
at the topic level to allow deletion. These can be equal to or less than the expected retention. Of course, it is crucial to have the single-topic
feature enabled for RPC in both Minion and OpenNMS.
However, we must fix that after the topics are created by either OpenNMS or the Minions, using the Kafka CLI tools or specialized applications like topicctl or CMAK.
For instance, on newer versions of Kafka:
/opt/kafka/bin/kafka-configs.sh --alter \
--bootstrap-server $(hostname):9092 \
--entity-type topics \
--entity-name OpenNMS.rpc-response \
--add-config segment.ms=3600000,retention.ms=3600000
For older versions:
/opt/kafka/bin/kafka-configs.sh --alter \
--zookeeper $(hostname):2181 \
--entity-type topics \
--entity-name OpenNMS.rpc-response \
--add-config segment.ms=3600000,retention.ms=3600000
Note that topic level settings and broker level settings are slightly different. The topic level settings override the broker level settings when they exist.
Be careful when setting the number of partitions per topic if you're planning to have a massive number of Minion locations or share the cluster across multiple OpenNMS instances with a high number of locations. This is why having the single-topic feature enabled in OpenNMS and Minion is the best approach (the default in H28).
Each partition the broker leads (and each replica it maintains) will have a directory in the data directory, and Kafka will maintain a file descriptor per segment. Each segment contains two files: the index and the data itself. For more information, check this blog post.
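If you want to see how many file descriptors the broker process currently holds (and validate the LimitNOFILE setting from the systemd unit), a quick check from any broker could be:
KAFKA_PID=$(systemctl show -p MainPID --value kafka)
sudo ls /proc/$KAFKA_PID/fd | wc -l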
It is recommended to have a dedicated file system for the data directory formatted using XFS with noatime
and nodiratime
in production.
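A minimal sketch of what that could look like, assuming the dedicated disk shows up as /dev/sdc (device names vary in Azure, so verify with lsblk before formatting):
sudo mkfs.xfs /dev/sdc
sudo mkdir -p /data
echo "/dev/sdc /data xfs noatime,nodiratime 0 0" | sudo tee -a /etc/fstab
sudo mount /data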
Create a cloud-init script with the following content to deploy PostgreSQL, the latest OpenNMS Horizon, and CMAK in Ubuntu LTS and store it at /tmp/opennms-template.yaml
:
#cloud-config
package_upgrade: true
timezone: $TIMEZONE
write_files:
- owner: root:root
path: /etc/opennms-overlay/featuresBoot.d/features.boot
content: |
opennms-kafka-producer
# OpenNMS RRD Settings
- owner: root:root
path: /etc/opennms-overlay/opennms.properties.d/rrd.properties
content: |
org.opennms.rrd.storeByGroup=true
org.opennms.rrd.storeByForeignSource=true
org.opennms.rrd.strategyClass=org.opennms.netmgt.rrd.rrdtool.MultithreadedJniRrdStrategy
org.opennms.rrd.interfaceJar=/usr/share/java/jrrd2.jar
opennms.library.jrrd2=/usr/lib/jni/libjrrd2.so
# OpenNMS Sink and RPC API
- owner: root:root
path: /etc/opennms-overlay/opennms.properties.d/kafka.properties
content: |
# Disable internal ActiveMQ
org.opennms.activemq.broker.disable=true
# Sink
org.opennms.core.ipc.sink.strategy=kafka
org.opennms.core.ipc.sink.kafka.bootstrap.servers=$PREFIX-kafka-1:9092,$PREFIX-kafka-2:9092
org.opennms.core.ipc.sink.kafka.acks=1
# RPC
org.opennms.core.ipc.rpc.strategy=kafka
org.opennms.core.ipc.rpc.kafka.bootstrap.servers=$PREFIX-kafka-1:9092,$PREFIX-kafka-2:9092
org.opennms.core.ipc.rpc.kafka.ttl=30000
org.opennms.core.ipc.rpc.kafka.single-topic=true
org.opennms.core.ipc.rpc.kafka.auto.offset.reset=latest
# OpenNMS Kafka Producer Client
- owner: root:root
path: /etc/opennms-overlay/org.opennms.features.kafka.producer.client.cfg
content: |
bootstrap.servers=$PREFIX-kafka-1:9092,$PREFIX-kafka-2:9092
compression.type=zstd
timeout.ms=30000
max.request.size=5000000
# OpenNMS Kafka Producer Settings
- owner: root:root
path: /etc/opennms-overlay/org.opennms.features.kafka.producer.cfg
content: |
topologyProtocols=bridge,cdp,isis,lldp,ospf
suppressIncrementalAlarms=true
forward.metrics=true
nodeRefreshTimeoutMs=300000
alarmSyncIntervalMs=300000
kafkaSendQueueCapacity=1000
nodeTopic=OpenNMS_nodes
alarmTopic=OpenNMS_alarms
eventTopic=OpenNMS_events
metricTopic=OpenNMS_metrics
alarmFeedbackTopic=OpenNMS_alarms_feedback
topologyVertexTopic=OpenNMS_topology_vertices
topologyEdgeTopic=OpenNMS_edges
- owner: root:root
permissions: '0400'
path: /etc/snmp/snmpd.conf
content: |
rocommunity public default
syslocation Azure - $LOCATION
syscontact $USER
dontLogTCPWrappersConnects yes
disk /
apt:
preserve_sources_list: true
sources:
opennms:
source: deb https://debian.opennms.org stable main
docker:
source: deb https://download.docker.com/linux/ubuntu bionic stable
packages:
- snmp
- snmpd
- jq
- jrrd2
- opennms
- opennms-webapp-hawtio
- opennms-helm
- docker-ce
- docker-ce-cli
- containerd.io
bootcmd:
- curl -s https://debian.opennms.org/OPENNMS-GPG-KEY | apt-key add -
- curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
runcmd:
# Configure PostgreSQL
- systemctl --now enable postgresql
- sudo -u postgres createuser opennms
- sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'postgres';"
- sudo -u postgres psql -c "ALTER USER opennms WITH PASSWORD 'opennms';"
- sed -r -i 's/password=""/password="postgres"/' /etc/opennms/opennms-datasources.xml
# Configure OpenNMS
- sed -r -i '/enabled="false"/{$!{N;s/ enabled="false"[>]\n(.*OpenNMS:Name=Syslogd.*)/>\n\1/}}' /etc/opennms/service-configuration.xml
- echo "JAVA_HEAP_SIZE=$ONMS_HEAP_SIZE" > /etc/opennms/opennms.conf
- rsync -avr /etc/opennms-overlay/ /etc/opennms/
- /usr/share/opennms/bin/runjava -s
- /usr/share/opennms/bin/fix-permissions
- /usr/share/opennms/bin/install -dis
- systemctl --now enable opennms
# Start CMAK using Docker
- usermod -aG docker ubuntu
- docker run --name cmak -d -e ZK_HOSTS="$PREFIX-kafka-1:2181" -e APPLICATION_SECRET="opennms" -p 9000:9000 hlebalbau/kafka-manager:stable
# Upgrade Grafana
- sudo apt-get install -y adduser libfontconfig1
- wget https://dl.grafana.com/oss/release/grafana_7.5.11_amd64.deb
- sudo dpkg -i grafana_7.5.11_amd64.deb
We don't need to specify the whole list of Kafka brokers as part of the bootstrap.servers
entry. The whole topology will be discovered through the first one that responds, and the client will use what's configured as the advertised listener to talk to each broker. I added two in case the first one is unavailable (as a backup).
If you're using an older version of Kafka, make sure to set the appropriate version when adding your cluster to CMAK.
The above installs the latest OpenJDK 11, the latest PostgreSQL, and the latest OpenNMS Horizon on the VM. It also installs CMAK (formerly Kafka Manager) via Docker. I added the most basic configuration for PostgreSQL to work with authentication. Kafka will be enabled for Sink/RPC as well as for the Kafka Producer. As mentioned, Azure VMs can reach each other through hostnames.
Create an Ubuntu VM for OpenNMS:
envsubst < /tmp/opennms-template.yaml > /tmp/opennms.yaml
az vm create --resource-group $RG_NAME --name $ONMS_VM_NAME \
--size $ONMS_VM_SIZE \
--image canonical:0001-com-ubuntu-server-focal:20_04-lts:latest \
--admin-username $USER \
--ssh-key-values ~/.ssh/id_rsa.pub \
--vnet-name $VNET_NAME \
--subnet $VNET_SUBNET_NAME \
--public-ip-address-dns-name $ONMS_VM_NAME \
--public-ip-sku Standard \
--custom-data /tmp/opennms.yaml \
--tags Owner=$USER \
--output table
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
--port 8980 --priority 200 --output table
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
--port 3000 --priority 210 --output table
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
--port 9000 --priority 300 --output table
Note that I'm assuming the usage of SSH Keys for password-less access. Make sure to have a public key located at ~/.ssh/id_rsa.pub
, or update the az vm create
command.
Keep in mind that the cloud-init
process starts once the VM is running, meaning we should wait about 5 minutes after the az vm create
is finished to see OpenNMS up and running.
In case there is a problem, SSH into the VM using the public IP and the provided credentials and check /var/log/cloud-init-output.log
to verify the progress and the status of the cloud-init execution.
Wait until OpenNMS is up and running and then execute the following to start monitoring all the ZK/Kafka servers and the OpenNMS server via SNMP and JMX.
ONMS_FQDN="$ONMS_VM_NAME.$DOMAIN"
cat <<EOF >/tmp/OpenNMS.xml
<?xml version="1.0"?>
<model-import date-stamp="$(date +"%Y-%m-%dT%T.000Z")" foreign-source="OpenNMS">
EOF
for vm in $(az vm list -g $RG_NAME --query "[?contains(name,'$PREFIX-')].name" -o tsv); do
ipaddr=$(az vm show -g $RG_NAME -n $vm -d --query privateIps -o tsv)
cat <<EOF >>/tmp/OpenNMS.xml
<node foreign-id="$vm" node-label="$vm">
EOF
if [[ "$vm" == *"kafka"* ]]; then
cat <<EOF >>/tmp/OpenNMS.xml
<interface ip-addr="$ipaddr" status="1" snmp-primary="P">
<monitored-service service-name="JMX-Kafka"/>
</interface>
</node>
EOF
fi
if [[ "$vm" == *"onms"* ]]; then
cat <<EOF >>/tmp/OpenNMS.xml
<interface ip-addr="$ipaddr" status="1" snmp-primary="P"/>
<interface ip-addr="127.0.0.1" status="1" snmp-primary="N">
<monitored-service service-name="OpenNMS-JVM"/>
</interface>
</node>
EOF
fi
done
cat <<EOF >>/tmp/OpenNMS.xml
</model-import>
EOF
curl -v -u admin:admin \
-H 'Content-Type: application/xml' -d @/tmp/OpenNMS.xml \
http://$ONMS_FQDN:8980/opennms/rest/requisitions
curl -v -u admin:admin -X PUT \
http://$ONMS_FQDN:8980/opennms/rest/requisitions/OpenNMS/import
After verifying that OpenNMS is up and running, we can proceed to create the Minions.
Create a cloud-init script to deploy Minion in Ubuntu and save it at /tmp/minion-template.yaml
:
#cloud-config
package_upgrade: true
timezone: $TIMEZONE
write_files:
- owner: root:root
path: /etc/minion-overlay/org.opennms.minion.controller.cfg
content: |
location=$MINION_LOCATION
id=$MINION_ID
http-url=http://$ONMS_VM_NAME.$DOMAIN:8980/opennms
- owner: root:root
path: /etc/minion-overlay/featuresBoot.d/kafka.boot
content: |
!minion-jms
!opennms-core-ipc-sink-camel
!opennms-core-ipc-rpc-jms
opennms-core-ipc-sink-kafka
opennms-core-ipc-rpc-kafka
- owner: root:root
path: /etc/minion-overlay/org.opennms.core.ipc.sink.kafka.cfg
content: |
bootstrap.servers=$PREFIX-kafka-1.$DOMAIN:9094,$PREFIX-kafka-2.$DOMAIN:9094
- owner: root:root
path: /etc/minion-overlay/org.opennms.core.ipc.rpc.kafka.cfg
content: |
bootstrap.servers=$PREFIX-kafka-1.$DOMAIN:9094,$PREFIX-kafka-2.$DOMAIN:9094
single-topic=true
apt:
preserve_sources_list: true
sources:
opennms:
source: deb https://debian.opennms.org stable main
packages:
- opennms-minion
bootcmd:
- curl -s https://debian.opennms.org/OPENNMS-GPG-KEY | apt-key add -
runcmd:
- rsync -avr /etc/minion-overlay/ /etc/minion/
- sed -i -r 's/# export JAVA_MIN_MEM=.*/export JAVA_MIN_MEM="$MINION_HEAP_SIZE"/' /etc/default/minion
- sed -i -r 's/# export JAVA_MAX_MEM=.*/export JAVA_MAX_MEM="$MINION_HEAP_SIZE"/' /etc/default/minion
- /usr/share/minion/bin/scvcli set opennms.http admin admin
- /usr/share/minion/bin/scvcli set opennms.broker admin admin
- systemctl --now enable minion
Note that I'm using similar content for bootstrap.servers as for OpenNMS, but making sure to use the public FQDNs and port 9094, as the Minions won't be running in Azure.
Then, start the new Minion via multipass
:
export MINION_ID=minion01
envsubst < /tmp/minion-template.yaml > /tmp/$MINION_ID.yaml
multipass launch -c 1 -m 2G -n $MINION_ID --cloud-init /tmp/$MINION_ID.yaml
Optionally, create a second Minion in the same location:
export MINION_ID=minion02
envsubst < /tmp/minion-template.yaml > /tmp/$MINION_ID.yaml
multipass launch -c 1 -m 2G -n $MINION_ID --cloud-init /tmp/$MINION_ID.yaml
In case there is a problem, access the VM (e.g., multipass shell minion01
) and check /var/log/cloud-init-output.log
to verify the progress and the status of the cloud-init execution.
Feel free to change the CPU and memory settings for your Minion, but make sure it is consistent with MINION_HEAP_SIZE
. Make sure to validate communication using the health-check
command from the Karaf Shell.
When having multiple Minions per location, they will become part of a consumer group from Kafka's perspective for the RPC requests topic. The group ID will be the name of the location.
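You can confirm that from any broker by describing the consumer group named after the location (Durham in this example); both Minions should show up as members:
/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server $(hostname):9092 \
--describe --group Durham --members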
As you can see, the location name is Durham (a.k.a. $MINION_LOCATION), and you should see the Minions in that location registered in OpenNMS.
SSH into the OpenNMS server and create a requisition with a node in the same network as the Minion VMs, and make sure to associate it with the appropriate location. For instance,
/usr/share/opennms/bin/provision.pl requisition add Test
/usr/share/opennms/bin/provision.pl node add Test srv01 srv01
/usr/share/opennms/bin/provision.pl node set Test srv01 location Durham
/usr/share/opennms/bin/provision.pl interface add Test srv01 192.168.0.40
/usr/share/opennms/bin/provision.pl interface set Test srv01 192.168.0.40 snmp-primary P
/usr/share/opennms/bin/provision.pl requisition import Test
Make sure to replace 192.168.0.40 with the IP of a working server in your network (reachable from the Minion VMs, and preferably unreachable or nonexistent in Azure), and do not forget to use the same location as defined in $MINION_LOCATION.
Please keep in mind that the Minions are VMs on your machine. 192.168.0.40 is the IP of one of my machines, which is why the Minions can reach it (and vice versa). To use an external machine on your network instead, make sure to define static routes on that machine so it can reach the Minions through your machine (assuming your machine runs Linux or macOS).
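For instance, on that external machine (if it runs Linux or macOS), the route could look like the following. Both the Multipass subnet and the LAN IP of your machine are hypothetical, so check multipass list for the actual subnet, and remember your machine must forward packets between the two networks:
sudo ip route add 192.168.75.0/24 via 192.168.0.37 # Linux
sudo route -n add -net 192.168.75.0/24 192.168.0.37 # macOS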
OpenNMS, which runs in Azure and has no direct access to 192.168.0.40, should be able to collect data and monitor that node through any of the Minions. In fact, you can stop one of them, and OpenNMS will continue monitoring the node.
To test asynchronous messages, you can send SNMP traps or Syslog messages to one of the Minions. Alternatively, you could use udpgen for this purpose. Usually, you could put a Load Balancer in front of the Minions and use its IP when sending messages from the monitored devices.
The machine that will be running udpgen must be part of the OpenNMS inventory. Find the IP of the Minion using multipass list, then execute the following from the machine added as a node above (the examples assume the IP of the Minion is 192.168.75.16):
To send SNMP Traps:
udpgen -h 192.168.75.16 -x snmp -r 1 -p 1162
To send Syslog Messages:
udpgen -h 192.168.75.16 -x syslog -r 1 -p 1514
The C++ version of udpgen
only works on Linux. If you're on macOS, you can use the Go version of it. Unfortunately, Windows is not an option due to a lack of support for Syslog in Go.
Note that an event definition is required when using udpgen
to send traps. Here is what you'd need for Eventd
:
<events xmlns="http://xmlns.opennms.org/xsd/eventconf">
<event>
<mask>
<maskelement>
<mename>id</mename>
<mevalue>.1.3.6.1.1.6.3.1.1.5</mevalue>
</maskelement>
<maskelement>
<mename>generic</mename>
<mevalue>6</mevalue>
</maskelement>
<maskelement>
<mename>specific</mename>
<mevalue>1</mevalue>
</maskelement>
</mask>
<uei>uei.opennms.org/udpgen/testTrap</uei>
<event-label>udpgen test trap</event-label>
<descr>Sample Event %parm[all]%</descr>
<logmsg dest="logndisplay">Sample Event %parm[all]%</logmsg>
<severity>Warning</severity>
</event>
</events>
If you want to make the tests more interesting, add the following to the above definition:
<alarm-data reduction-key="%uei%:%dpname%:%nodeid%"
alarm-type="3" auto-clean="false"/>
The Hawtio UI in OpenNMS can help visualize the relevant JMX metrics and understand what’s circulating between OpenNMS and the Minions.
For OpenNMS, Hawtio is available through :8980/hawtio
if the package opennms-webapp-hawtio
was installed (which is the case with the cloud-init
template used).
For Minions, Hawtio is available through :8181/hawtio
.
As mentioned, if time is not synchronized across all the instances, the Heartbeat messages sent by the Minions via the Sink API won't be processed properly by OpenNMS, leading to the Minion not being registered or to outages in the Minion-Heartbeat service.
We can inspect the traffic on the topics to see if the Minion is sending (or receiving) traffic to Kafka. However, as the payload is encoded within a Protobuf message, using the console consumer might not be as useful as we'd expect. Still, it works for troubleshooting purposes. For instance, from one of the Kafka brokers, we can do:
/opt/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server $(hostname):9092 \
--topic OpenNMS.Sink.Heartbeat
And we'll see:
$bce7b13e-d575-40b9-989a-3b5c6e7432c2 ~<minion>
<id>minion01</id>
<location>Durham</location>
<timestamp>2021-03-26T12:19:55.752-07:00</timestamp>
</minion>
As we can see, the actual payload within the Protobuf message is an indented XML.
The following application can be used to properly inspect the content without worrying about the non-readable content due to the Protobuf format:
https://github.com/agalue/onms-kafka-ipc-receiver
For RPC in particular, we can access the Karaf Shell from the OpenNMS instance and use the opennms:stress-rpc
command to verify communication against the Minions on a given location or against a specific Minion, and as the command name implies, to perform stress tests.
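To get there, open the Karaf shell on the OpenNMS server (default credentials admin/admin) and check the command's built-in help, as the exact options may vary between Horizon versions:
ssh -p 8101 admin@localhost
# Then, within the Karaf shell:
opennms:stress-rpc --help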
For recent versions of Kafka, the following can help to get details about topics, lags, consumer groups and so on.
To verify the topic partitions and replica settings:
topics=$(/opt/kafka/bin/kafka-topics.sh --list --bootstrap-server $(hostname):9092)
for topic in $topics; do
/opt/kafka/bin/kafka-topics.sh \
--bootstrap-server $(hostname):9092 \
--describe --topic $topic
done
To verify the current topic-level settings:
topics=$(/opt/kafka/bin/kafka-topics.sh --list --bootstrap-server $(hostname):9092)
for topic in $topics; do
/opt/kafka/bin/kafka-configs.sh \
--bootstrap-server $(hostname):9092 \
--describe --entity-type topics --entity-name $topic --all
done
To verify offsets, topics lag and consumer groups:
/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server $(hostname):9092 \
--describe --all-groups --all-topics
When enabling security (either SASL or TLS), you need to pass those settings to the commands. For instance, if you have SASL enabled, you should pass:
--command-config /opt/kafka/config/consumer.properties
Where the content of consumer.properties
would be:
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="opennms" password="0p3nNM5";
For older versions of Kafka, the equivalent commands are the following:
To verify the topic partitions and replica settings:
topics=$(/opt/kafka/bin/kafka-topics.sh --list --zookeeper $(hostname):2181)
for topic in $topics; do
/opt/kafka/bin/kafka-topics.sh \
--zookeeper $(hostname):2181 \
--describe --topic $topic
done
To verify the current topic-level settings:
topics=$(/opt/kafka/bin/kafka-topics.sh --list --zookeeper $(hostname):2181)
for topic in $topics; do
/opt/kafka/bin/kafka-configs.sh \
--zookeeper $(hostname):2181 \
--describe --entity-type topics --entity-name $topic
done
To verify offsets, topics lag and consumer groups:
groups=$(/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $(hostname):9092 --list)
for group in $groups; do
/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server $(hostname):9092 \
--describe --all-topics --group $group
done
When passing the ZK host to --zookeeper
, that has to be consistent with how zookeeper.connect
was defined on each Kafka broker. If you used something like this zk1:2181,zk2:2181/kafka
, you should then pass --zookeeper $(hostname):2181/kafka
instead.
In big environments, it is common to have multiple OpenNMS instances, each of them with its own fleet of Minions to monitor one of the multiple data centers or a section of it. In those scenarios, it is common to have a centralized Kafka cluster that can be shared across all of them (for more information, follow this link).
The above solution has to be modified to ensure each set of OpenNMS and Minions will use their own set of topics in Kafka to avoid collisions.
The topics' prefix (which defaults to OpenNMS
) can be controlled via a system-wide property called Instance ID (a.k.a. org.opennms.instance.id
). We must configure this property in both places: for OpenNMS, add it to a property file inside $OPENNMS_HOME/etc/opennms.properties.d, and for a Minion, add it to $MINION_HOME/etc/custom.system.properties.
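For instance, a hypothetical instance named Apex could be configured like this (the file name under opennms.properties.d is arbitrary, and the paths assume the Debian packages used in this tutorial):
# On the OpenNMS server
echo "org.opennms.instance.id=Apex" | sudo tee /etc/opennms/opennms.properties.d/instance-id.properties
# On each Minion of that OpenNMS instance
echo "org.opennms.instance.id=Apex" | sudo tee -a /etc/minion/custom.system.properties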
In production, when having multiple Minions per location, it is a good practice to put a Load Balancer in front of them so that the devices can use a single destination for SNMP Traps, Syslog, and Flows.
The following creates a cloud-init template for Ubuntu to start a basic LB using nginx
through multipass
for SNMP Traps (with a listener on port 162) and Syslog Messages (with a listener on port 514). Save the template at /tmp/nginx-template.yaml
:
#cloud-config
package_upgrade: true
packages:
- nginx
write_files:
- owner: root:root
path: /etc/nginx/nginx.conf
content: |
user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
events {
worker_connections 768;
}
stream {
upstream syslog_udp {
server $MINION_IP1:1514;
server $MINION_IP2:1514;
}
upstream trap_udp {
server $MINION_IP1:1162;
server $MINION_IP2:1162;
}
server {
listen 514 udp;
proxy_pass syslog_udp;
proxy_responses 0;
}
server {
listen 162 udp;
proxy_pass trap_udp;
proxy_responses 0;
}
}
runcmd:
- systemctl restart nginx
Note the usage of environment variables within the YAML template. We will substitute them before creating the VM.
Then, update the template and create the LB:
export MINION_ID1="minion01" # Must match the Minions created earlier
export MINION_ID2="minion02"
export MINION_IP1=$(multipass info $MINION_ID1 | grep IPv4 | awk '{print $2}')
export MINION_IP2=$(multipass info $MINION_ID2 | grep IPv4 | awk '{print $2}')
envsubst < /tmp/nginx-template.yaml > /tmp/nginx.yaml
multipass launch -n nginx --cloud-init /tmp/nginx.yaml
echo "Load Balancer $(multipass info nginx | grep IPv4)"
Flows are outside the scope of this test as that requires more configuration on Minions and OpenNMS besides having an Elasticsearch cluster up and running with the required plugin in place.
The above procedure uses Kafka and Zookeeper in plain text without authentication or encryption. That works for testing purposes or perhaps for private clusters, where access to the servers is restricted and audited.
This example, in particular, exposes Kafka to the Internet, which requires having at least authentication in place. The following explains how to enable authentication and then the steps to enable encryption.
For a more comprehensive guide, follow this tutorial from Confluent.
This section explains how to enable authentication using SASL with SCRAM-SHA-512 for Kafka and DIGEST
for Zookeeper (as Zookeeper doesn't support SCRAM
). Because this guide's intention is learning, I decided to add security as a separate or optional module. That's due to the extra complexity associated with this advanced topic.
Here are the high-level changes:
- server.properties and the systemd service definition on each Kafka broker, to enable and use SASL.
- zookeeper.properties and the systemd service definition on each ZK instance, to enable and use SASL.
Access one of the brokers and execute the following command:
ONMS_USER="opennms" # To be used by Kafka, OpenNMS and Minions
ONMS_PASSWD="0p3nNM5;" # To be used by Kafka, OpenNMS and Minions
/opt/kafka/bin/kafka-configs.sh --bootstrap-server $(hostname):9092 \
--alter \
--add-config "SCRAM-SHA-256=[password=$ONMS_PASSWD],SCRAM-SHA-512=[password=$ONMS_PASSWD]" \
--entity-type users \
--entity-name $ONMS_USER
On each Zookeeper instance, update zookeeper.properties
to enable SASL:
cat <<EOF | sudo tee -a /opt/kafka/config/zookeeper.properties
authProvider.sasl=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl
EOF
On each Kafka broker instance, update server.properties
to enable SASL/SCRAM:
sudo sed -i -r '/listener.security.protocol.map/d' /opt/kafka/config/server.properties
cat <<EOF | sudo tee -a /opt/kafka/config/server.properties
# Enable Security
listener.security.protocol.map=INSIDE:SASL_PLAINTEXT,OUTSIDE:SASL_PLAINTEXT
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
sasl.enabled.mechanisms=SCRAM-SHA-256,SCRAM-SHA-512
EOF
Note that listener.security.protocol.map already exists in that file, which is why I removed it prior to adding the required changes.
In theory, there is no need to enable both SCRAM-SHA-256
and SCRAM-SHA-512
. I did that for compatibility purposes, but I'll use SCRAM-SHA-512
for all subsequent configurations.
On each Zookeeper instance, create the JAAS
configuration file with the credentials:
ZK_USER="zkonms"
ZK_PASSWD="zk0p3nNM5;"
cat <<EOF | sudo tee /opt/kafka/config/zookeeper_jaas.conf
Server {
org.apache.zookeeper.server.auth.DigestLoginModule required
user_$ZK_USER="$ZK_PASSWD";
};
EOF
sudo chown kafka:kafka /opt/kafka/config/zookeeper_jaas.conf
sudo chmod 0600 /opt/kafka/config/zookeeper_jaas.conf
On each Kafka broker, create the JAAS
configuration file with the credentials:
ZK_USER="zkonms" # Must match zookeeper_jaas.conf
ZK_PASSWD="zk0p3nNM5;" # Must match zookeeper_jaas.conf
ONMS_USER="opennms" # Must match scram user
ONMS_PASSWD="0p3nNM5;" # Must match scram user
cat <<EOF | sudo tee /opt/kafka/config/kafka_jaas.conf
KafkaServer {
org.apache.kafka.common.security.scram.ScramLoginModule required
username="$ONMS_USER"
password="$ONMS_PASSWD";
};
Client {
org.apache.zookeeper.server.auth.DigestLoginModule required
username="$ZK_USER"
password="$ZK_PASSWD";
};
EOF
sudo chown kafka:kafka /opt/kafka/config/kafka_jaas.conf
sudo chmod 0600 /opt/kafka/config/kafka_jaas.conf
On each Zookeeper instance, update the systemd
service definition to load the JAAS settings via KAFKA_OPTS
:
OPTS='Environment="KAFKA_OPTS=-Djava.security.auth.login.config=/opt/kafka/config/zookeeper_jaas.conf"'
sudo sed -i -r -e "/^ExecStart=.*/i $OPTS" /etc/systemd/system/zookeeper.service
sudo systemctl daemon-reload
On each Kafka broker, update the systemd
service definition to load the JAAS settings via KAFKA_OPTS
:
OPTS='Environment="KAFKA_OPTS=-Djava.security.auth.login.config=/opt/kafka/config/kafka_jaas.conf"'
sudo sed -i -r -e "/^ExecStart=.*/i $OPTS" /etc/systemd/system/kafka.service
sudo systemctl daemon-reload
Restart the cluster in the following order:
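A minimal sketch of that sequence, assuming the intended order is Zookeeper first (on every instance, one at a time) and then Kafka (on every broker):
# On each instance, one at a time, starting with Zookeeper
sudo systemctl restart zookeeper
# Once all the ZK members are back, on each broker
sudo systemctl restart kafka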
At this point, you should pass the SASL credentials to all Kafka CLI Tools. For instance,
ONMS_USER="opennms" # Must match scram user
ONMS_PASSWD="0p3nNM5;" # Must match scram user
cat <<EOF | sudo tee -a /opt/kafka/config/consumer.properties
# Security
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF
/opt/kafka/bin/kafka-topics.sh --list \
--bootstrap-server $(hostname):9092 \
--command-config /opt/kafka/config/consumer.properties
Note how we pass the consumer settings. The above command should list all the topics in the cluster. If you can see the list, then SASL is working. Keep in mind that without --command-config, the command should time out, as the tool cannot communicate with Kafka without the credentials.
On the OpenNMS instance, update /etc/opennms/opennms.properties.d/kafka.properties and /etc/opennms/org.opennms.features.kafka.producer.client.cfg to use SASL, and restart OpenNMS. For instance:
ONMS_USER="opennms" # Must match scram user
ONMS_PASSWD="0p3nNM5;" # Must match scram user
for module in sink rpc; do
cat <<EOF | sudo tee -a /etc/opennms/opennms.properties.d/kafka.properties
# Security for $module
org.opennms.core.ipc.$module.kafka.security.protocol=SASL_PLAINTEXT
org.opennms.core.ipc.$module.kafka.sasl.mechanism=SCRAM-SHA-512
org.opennms.core.ipc.$module.kafka.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF
done
cat <<EOF | sudo tee -a /etc/opennms/org.opennms.features.kafka.producer.client.cfg
# Security
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF
sudo systemctl restart opennms
On each Minion, update /etc/minion/org.opennms.core.ipc.sink.kafka.cfg and /etc/minion/org.opennms.core.ipc.rpc.kafka.cfg to use SASL, and restart the Minion. For instance:
ONMS_USER="opennms" # Must match scram user
ONMS_PASSWD="0p3nNM5;" # Must match scram user
for module in sink rpc; do
cat <<EOF | sudo tee -a /etc/minion/org.opennms.core.ipc.$module.kafka.cfg
# Security
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF
done
sudo systemctl restart minion
The solution works from the OpenNMS and Minion perspective, despite the following message appearing repeatedly in /opt/kafka/logs/server.log on all brokers:
[2021-04-11 12:35:56,486] INFO [SocketServer brokerId=2] Failed authentication with /13.0.1.7 (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector)
Where 13.0.1.7
is the IP of the OpenNMS server.
At this point, we have SASL
authentication enabled using SCRAM-512
for Kafka and DIGEST
for Zookeeper, meaning credentials might be hard to crack when intercepting traffic (but perhaps not impossible). However, to make it more secure, encryption is recommended.
If you already configured CMAK
, make sure to enable the SASL/SCRAM mechanism for your cluster.
Please keep in mind that enabling SSL/TLS will increase CPU demand on each broker and the clients, which is why using OpenJDK 11 over JDK 8 is encouraged.
To enable TLS, and because each Kafka Broker must be exposed and reachable through a public DNS entry, I'm going to use LetsEncrypt to generate the certificates. That will save a few steps because the certificates will be publicly valid, so we won't need to set up a Trust Store.
A Trust Store is mandatory when using private CAs or self-signed certificates to configure every entity that touches Kafka directly or indirectly.
The Certbot utility used to create and validate the certificate will start a temporary web server on the instance (for the validation process). For this reason, we should temporarily allow access through port TCP 80:
for i in $(seq 1 $KAFKA_CLUSTER_SIZE); do
VM_NAME="$PREFIX-kafka-$i"
az vm open-port -g $RG_NAME -n $VM_NAME \
--port 80 --priority 101 --output table
done
Then, on each Kafka Broker (one by one), we must do the following to enable TLS:
FQDN="$(hostname).eastus.cloudapp.azure.com"
EMAIL="owner@example.com"
PASSWD="0p3nNM5"
sudo snap install --classic certbot
sudo ln -s /snap/bin/certbot /usr/bin/certbot
sudo certbot certonly --standalone -d $FQDN -m $EMAIL \
--non-interactive --agree-tos
TEMP_P12="/tmp/ssl.p12.$(date +%s)"
TEMP_KEYSTORE="/tmp/ssl.keystore.$(date +%s)"
TARGET_KEYSTORE="/opt/kafka/config/letsencrypt.jks"
sudo openssl pkcs12 -export \
-in /etc/letsencrypt/live/$FQDN/fullchain.pem \
-inkey /etc/letsencrypt/live/$FQDN/privkey.pem \
-out $TEMP_P12 -name kafka -password "pass:$PASSWD"
sudo keytool -importkeystore -alias kafka \
-deststorepass "$PASSWD" -destkeypass "$PASSWD" -destkeystore $TEMP_KEYSTORE \
-srckeystore $TEMP_P12 -srcstoretype PKCS12 -srcstorepass "$PASSWD"
sudo cp $TEMP_KEYSTORE $TARGET_KEYSTORE
sudo chmod 440 $TARGET_KEYSTORE
sudo chown kafka:kafka $TARGET_KEYSTORE
sudo rm -f $TEMP_P12 $TEMP_KEYSTORE
CONFIG="/opt/kafka/config/server.properties"
sudo sed -i -r '/listener.security.protocol.map/d' $CONFIG
cat <<EOF | sudo tee -a $CONFIG
listener.security.protocol.map=INSIDE:SASL_PLAINTEXT,OUTSIDE:SASL_SSL
ssl.keystore.location=$TARGET_KEYSTORE
ssl.keystore.password=$PASSWD
ssl.key.password=$PASSWD
EOF
sudo systemctl restart kafka
Please use your own email, and keep in mind that the Azure location is hardcoded in the command; if you're using a different one, update the FQDN.
Note that SSL was only enabled for the OUTSIDE
listener, meaning we should only modify the Minions (and listener.security.protocol.map
was changed because of that), as OpenNMS won't use it because it lives in the same protected network as the Kafka cluster.
To verify, you can retrieve the broker configuration via Zookeeper:
/opt/kafka/bin/zookeeper-shell.sh $(hostname) get /brokers/ids/1 | egrep '^\{' | jq
If everything went well, you should get something like this:
{
"features": {},
"listener_security_protocol_map": {
"INSIDE": "SASL_PLAINTEXT",
"OUTSIDE": "SASL_SSL"
},
"endpoints": [
"INSIDE://agalue-kafka-1:9092",
"OUTSIDE://agalue-kafka-1.eastus.cloudapp.azure.com:9094"
],
"jmx_port": 9999,
"port": -1,
"host": null,
"version": 5,
"timestamp": "1622658498210"
}
Note that SASL_SSL
applies to OUTSIDE
. Now it is time to update the Minions.
On each Minion, do the following:
for module in sink rpc; do
cfg="/etc/minion/org.opennms.core.ipc.$module.kafka.cfg"
sudo sed -i -r '/security.protocol/s/SASL_PLAINTEXT/SASL_SSL/' $cfg
done
sudo systemctl restart minion
While you're there, you can check if TLS is actually enabled by running:
openssl s_client -connect agalue-kafka-1.eastus.cloudapp.azure.com:9094
There is no need to modify anything else as we're using valid certificates signed by a well-known public entity. When using private certificates or private CAs, you would have to create a Trust Store via keytool for the clients and the brokers.
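A minimal sketch of that, assuming the CA certificate is available locally as ca-cert.pem (file names and the password are placeholders); the resulting file would then be referenced via ssl.truststore.location and ssl.truststore.password on the clients:
keytool -importcert -trustcacerts -alias ca-root \
-file ca-cert.pem -keystore kafka.truststore.jks \
-storepass changeit -noprompt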
As a challenge to the reader, update the /tmp/kafka-template.yaml
, /tmp/opennms-template.yaml
, and /tmp/minion-template.yaml
to include all the SASL and SSL/TLS configuration and start the whole environment from scratch with authentication and encryption enabled.
The following is inspired by this guide to enable TLS with Nginx for the OpenNMS WebUI and Grafana. However, as we're using Ubuntu here, I'll describe the required changes.
Allow access via TCP 80 and 443:
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME --port 443 --priority 110 -o table
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME --port 80 --priority 120 -o table
SSH into the OpenNMS server and then:
export EMAIL="user@example.com"
export LOCATION=$(curl -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/instance?api-version=2021-02-01" 2>/dev/null | jq -r '.compute.location')
export FQDN=$(hostname).$LOCATION.cloudapp.azure.com
sudo apt install -y nginx
sudo mkdir -p /var/www/$FQDN/.well-known
sudo chown www-data:www-data /var/www/$FQDN
cfg="/etc/nginx/sites-available/default"
cat <<EOF | sudo tee $cfg
server {
listen 80;
server_name $FQDN;
# maintain the .well-known directory alias for lets encrypt renewals
location /.well-known {
alias /var/www/$FQDN/.well-known;
}
location /hawtio/ {
proxy_pass http://localhost:8980/hawtio/;
}
location /grafana/ {
proxy_pass http://localhost:3000/;
}
location /opennms/ {
proxy_set_header Host \$host;
proxy_set_header X-Real-IP \$remote_addr;
proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto \$scheme;
proxy_set_header Upgrade \$http_upgrade;
proxy_set_header Connection "Upgrade";
proxy_pass http://localhost:8980/opennms/;
proxy_redirect default;
proxy_read_timeout 90;
}
}
EOF
sudo systemctl restart nginx
sudo systemctl enable nginx
sudo snap install core
sudo snap refresh core
sudo snap install --classic certbot
sudo ln -s /snap/bin/certbot /usr/bin/certbot
sudo certbot --nginx -d $FQDN --non-interactive --agree-tos -m $EMAIL
cat <<EOF | sudo tee /etc/opennms/opennms.properties.d/webui.properties
org.opennms.netmgt.jetty.host = 127.0.0.1
opennms.web.base-url = https://%x%c/
EOF
sudo systemctl restart opennms
sudo sed -i -r "s|^;domain =.*|domain = $FQDN|" /etc/grafana/grafana.ini
sudo sed -i -r "s|^;root_url =.*|root_url = %(protocol)s://%(domain)s:%(http_port)s/grafana/|" /etc/grafana/grafana.ini
sudo systemctl restart grafana-server
Make sure to use a valid address for $EMAIL, as that's required by LetsEncrypt (as we did for Kafka).
Note that cmak
(or Kafka Manager) is not present due to the complexity of having it working behind a proxy.
You can remove the NSG rules for ports 8980 and 3000.
az network nsg rule delete -g $RG_NAME \
--nsg-name ${ONMS_VM_NAME}NSG -n open-port-8980
az network nsg rule delete -g $RG_NAME \
--nsg-name ${ONMS_VM_NAME}NSG -n open-port-3000
Work in progress…
Some circumstances could introduce unexpected behavior to the solution. Besides the traditional monitoring to ensure that all the components are behaving as expected in CPU, Memory, Java Heap Memory, Java GC, and IO (covered as part of this tutorial), you sometimes need to dig deeper to understand what's happening.
OpenNMS added OpenTracing support via Jaeger to understand how much time messages sent via the broker are taking to be produced and consumed.
The official documentation has a guide about how to configure it.
As we have Docker running in the OpenNMS server, we can start an All-In-One Jaeger Instance through it very easily. To do that, SSH into the OpenNMS server and run the following:
docker run -d --name jaeger \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 16686:16686 \
jaegertracing/all-in-one:1.24
OpenNMS would have direct access as it runs on the same machine accessible via localhost and should be configured as instructed in the official docs.
For the Minions, you would need to open the UDP ports 6831 and 6832 in the NSG associated with the OpenNMS server, as well as TCP 16686 to access the Jaeger WebUI:
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
--port 6831-6832,16686 --priority 400 --output table
Then, configure the minion as instructed in the official docs, using the OpenNMS FQDN and the port mentioned above.
When we're done, make sure to delete the cloud resources.
If you created the resource group for this exercise, you could remove all the resources with the following command:
az group delete -g $RG_NAME
If you're using an existing resource group that you cannot remove, make sure only to remove all the resources created in this tutorial. All of them should be easily identified as they will contain the username and the VM name as part of the resource name. The easiest way is to use the Azure Portal for this operation. Alternatively,
IDS=($(az resource list \
--resource-group $RG_NAME \
--query "[?contains(name,'$PREFIX-') && type!='Microsoft.Compute/disks']".id \
--output tsv | tr '\n' ' '))
for id in "${IDS[@]}"; do
echo "Removing $id"
az resource delete --ids "$id" --verbose
done
DISKS=($(az resource list \
--resource-group $RG_NAME \
--query "[?contains(name,'$PREFIX-') && type=='Microsoft.Compute/disks']".id \
--output tsv | tr '\n' ' '))
for id in "${DISKS[@]}"; do
echo "Removing $id"
az resource delete --ids "$id" --verbose
done
The reason for having two deletion passes is that the resource list includes the disks, which cannot be removed before the VMs. For this reason, we exclude the disks in the first pass and then remove them in the second.
Note that because all the resource names are prefixed with the chosen username, we can use it to identify them and remove them uniquely.
Then clean the local resources:
multipass delete $MINION_ID1 $MINION_ID2
multipass purge
Remember to remove the nginx
instance if you decided to use it.