This lab starts an OpenNMS instance and a 3-node ZK/Kafka cluster in the cloud (on Azure), plus two Minions on your machine (via Multipass), using Kafka for communication, for learning purposes.
The lab doesn't cover security by default (user authentication and encryption), which is crucial if we ever want to expose the Kafka cluster to the Internet. A separate section covers the required changes for this.
Keep in mind that nothing prevents us from skipping the cloud provider and doing everything with Multipass (or VirtualBox, or Hyper-V, or VMware). The reason for using a cloud provider is to prove that OpenNMS can monitor unreachable devices via Minion. Similarly, we could use any other cloud provider instead of Azure; however, I won't explain how to port the solution here.
Time synchronization across all the instances involved in this solution is mandatory. Failing to ensure this could lead to undesired side effects. This is essentially guaranteed when using a cloud provider, which is why I do not include explicit instructions for it, but please be aware of it.
The scripts used throughout this tutorial rely on envsubst; make sure to have it installed.
Make sure to log into Azure using az login prior to creating the VMs.
If you have a restricted account in Azure, make sure you have the Network Contributor
role and the Virtual Machine Contributor
role associated with your Azure AD account for the resource group where you want to create the VM. Of course, either Owner
or Contributor
at the resource group level are welcome.
All the following assume you have a macOS or Linux machine or VM from which you can issue all the commands.
We haven't tested 3.0.0
, so please use 2.8.x
or older for now.
Feel free to change the content and keep in mind that $PREFIX
is what we will use throughout this tutorial to identify all the resources we will create in Azure uniquely.
Do not confuse the Azure Location (or Region) with the Minion Location; they are unrelated concepts.
We're going to leverage the Azure DNS services to avoid the need to remember and use Public IP addresses, which helps if we're interested in having HTTPS with valid certificates (as explained here), not only for OpenNMS but also to enable SSL/TLS in Kafka.
In Azure, the default public DNS entries follow the same pattern:
To make the VMs' FQDNs unique, we're going to add the username to the VM name. For instance, the OpenNMS FQDN would be:
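As a rough illustration (the exact DNS label depends on the chosen $PREFIX, your username, and the Azure region, so treat the names below as hypothetical):

```
<vm-name>-<username>.<azure-region>.cloudapp.azure.com
# e.g., an OpenNMS VM created with PREFIX=onms by user agalue in eastus could end up as:
onms-agalue.eastus.cloudapp.azure.com
```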
The above is what we can use to access the VM via SSH and to configure Minions.
This is a necessary step, as every resource in Azure must belong to a resource group and a location.
However, you can omit the following command and use an existing one if you prefer. In that case, make sure to adjust the environment variable RG_NAME
so the subsequent commands will target the correct group.
I prefer to create the VNET myself instead of letting Azure do it for me, especially when we want to guarantee that all the VMs will exist in the same one.
The following cloud-init template assumes a 3 node cluster, where each VM would have Zookeeper and Kafka configured and running in Ubuntu LTS.
For simplicity, Zookeeper and Kafka will be running on each machine. In production, each cluster should have its own instances, as a Zookeeper cluster should not grow the same way a Kafka cluster would, for multiple reasons: a ZK ensemble should always have an odd number of members (which is not the case for Kafka), and traffic across ZK members grows exponentially with the number of instances (a ZK cluster of 5 members can manage multiple dozens of Kafka brokers, with 7 it can manage hundreds, and with 9 it can manage thousands).
For the 3-node cluster, each VM will be named as follows:
Note the hostnames include the chosen username to make them unique, which is mandatory for shared resource groups and the default Azure DNS public domain on the chosen region.
Remember that each VM in Azure is reachable within the same VNet from any other VM through its hostname.
From all the environment variables you'll encounter in the upcoming template, there are two crucial ones:
For server.properties
, we must replace the environment variable PUBLIC_FQDN
in the advertised.listeners
with the public FQDN or IP of the VM when configuring the application before running it for the first time. With that in mind, there will be two listeners, one to be used within the VNet (which is what OpenNMS would use, on port 9092), and another associated with the Public FQDN (on port 9094), to be used by external Minions (outside Azure).
Similarly, we must replace INSTANCE_ID
with a unique numeric value per instance for the broker.id
in server.properties
for Kafka and the myid
file for Zookeeper, which are the mandatory requirements to identify each instance in their respective cluster.
The number of topic partitions must be greater than the number of Minions in a given location and greater than the number of brokers in the cluster.
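To make this concrete, here is a minimal sketch of the relevant entries, assuming the INSIDE/OUTSIDE listener names used throughout this lab; the values are placeholders and the full cloud-init template below remains authoritative:

```
# server.properties (sketch): INSTANCE_ID and PUBLIC_FQDN are substituted per VM
broker.id=INSTANCE_ID
listeners=INSIDE://:9092,OUTSIDE://:9094
advertised.listeners=INSIDE://:9092,OUTSIDE://PUBLIC_FQDN:9094
listener.security.protocol.map=INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
inter.broker.listener.name=INSIDE
num.partitions=9
# Zookeeper uses the same INSTANCE_ID, written to the myid file inside its data directory.
```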
Create a YAML file called /tmp/kafka-template.yaml
with the following content:
The reason for increasing the message size (message.max.bytes
, replica.fetch.max.bytes
) is to avoid problems when forwarding collected metrics to Kafka via the Kafka Producer feature of OpenNMS, which I'm planning to enable.
If, for instance, you want to use an older version of Kafka, you can tune the JDK package and the Kafka URL so the template applies the correct one, for instance:
Also, edit the template and remove ;2181
from the server
entries in zookeeper.properties
as expressing the client port that way requires Zookeeper 3.5 or newer.
Note that I'm assuming the usage of SSH Keys for password-less access. Make sure to have a public key located at ~/.ssh/id_rsa.pub
, or update the az vm create
command.
The above will start all the VMs simultaneously using public IP addresses and FQDNs, to avoid access problems with external Minions and reconfiguration issues with the Kafka advertised listeners. However, like the public IPs, the private IPs will be dynamic. Fortunately, this is not going to be a problem as we're going to use DNS to access Kafka.
Keep in mind that the cloud-init
process starts once the VM is running, meaning we should wait a few minutes after the VMs are created before they are ready to use.
Then, allow access for remote Minions:
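As a hedged sketch (the NSG names below are assumptions based on the VM names; adjust them to whatever az vm create generated for you), opening TCP 9094 on each broker could look like this:

```
for i in 1 2 3; do
  az network nsg rule create -g "$RG_NAME" \
    --nsg-name "$PREFIX-kafka${i}NSG" \
    -n "Allow-Kafka-Outside" --priority 1010 \
    --protocol Tcp --destination-port-ranges 9094 --access Allow
done
```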
You can inspect the generated YAML files to see the final content used on each VM (after applying the env-var substitutions).
In case there is a problem, SSH into the VM using the public IP and the provided credentials and check /var/log/cloud-init-output.log
to verify the progress and the status of the cloud-init execution.
To make sure the Zookeeper cluster started, we can use the "4 letter words" commands via the embedded web server, available when using version 3.5 or newer. For instance:
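For example, from one of the ZK instances (assuming the AdminServer listens on its default port, 8080):

```
curl -s http://localhost:8080/commands/srvr
```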
The above gives us general information, including the server_state
, which can be leader
or follower
.
To get statistics:
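For instance, either of these endpoints works on ZK 3.5+:

```
curl -s http://localhost:8080/commands/stat
curl -s http://localhost:8080/commands/mntr
```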
For Zookeeper version 3.4 or older (for instance, when using older versions of Kafka), you can still use the deprecated way to verify:
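For example, using the classic four-letter words directly over the client port:

```
echo srvr | nc localhost 2181
echo stat | nc localhost 2181
```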
From Kafka's perspective, we can verify how each broker has registered via Zookeeper or follow this guide to create a topic and use the console producer and consumer to validate its functionality.
List Broker IDs:
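A sketch, assuming Kafka is installed under /opt/kafka (as the rest of this guide suggests):

```
/opt/kafka/bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids
```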
We should get:
If that's not the case, SSH into the broker that is not listed and make sure Kafka is running. It is possible that Kafka failed to register with Zookeeper and did not start, due to how the VMs are initialized: Zookeeper (the whole cluster) should start first, then Kafka, but as we're not guaranteeing that, some instances might fail to start on their own. The procedure was designed to avoid this situation as much as possible.
Get the broker basic configuration:
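For instance, for the broker with ID 1:

```
/opt/kafka/bin/zookeeper-shell.sh localhost:2181 get /brokers/ids/1
```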
If we run it from the first instance, we should get:
Note the two listeners. Clients within Azure, like OpenNMS, would use the INSIDE
one on port 9092, pointing to the local FQDN or hostname of the VM (and remember they are resolvable via DNS within the same VNet). In contrast, clients outside Azure, like Minions, would use the OUTSIDE
one on port 9094 pointing to the Public FQDN of each Kafka instance (accessible thanks to the NSG associated with each VM).
Kafka defaults to the hostname
or FQDN
of the primary interface when we don't explicitly specify it on the listener.
As Azure DNS works by default, hostnames are resolvable by all VMs within the same VNET. For this reason, Kafka will use the correct one.
However, if you're using another cloud provider or bare metal, make sure DNS works across all the VMs. Otherwise, change the INSIDE
listener to explicitly point to the private IP address of the VM and the OUTSIDE
listener to point to the public IP address of the VM; and make sure to use static IPs if you're going to rely on them.
Another way to validate the behavior is to use the console producer and console consumer to confirm that we can send and receive messages through a given topic.
To do that, for recent versions of Kafka, let's create a Test
topic:
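A sketch of the command (partition and replication-factor values are illustrative):

```
/opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic Test --partitions 3 --replication-factor 2
```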
Then, start a console producer from one of the brokers:
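For example (recent Kafka releases accept --bootstrap-server here):

```
/opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic Test
```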
From another broker (separate SSH session), start a console consumer:
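For example:

```
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic Test --from-beginning
```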
Go back to the terminal on which the console producer is running, type a message, and hit enter. Then, switch to the console consumer terminal, and we should see the message sent. Use Ctrl+C
to stop the producer and consumer.
A more comprehensive test would be to download Kafka locally on your machine and run either the producer or the consumer there (use port 9094 and the public FQDN or IP of one of the brokers). That serves to test connectivity from the Internet.
To create the Test
topic:
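A sketch for the older CLI:

```
/opt/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 \
  --create --topic Test --partitions 3 --replication-factor 2
```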
As you can see, the difference is talking against Zookeeper directly (using --zookeeper
), instead of reaching Kafka (using --bootstrap-server
).
For the producer use --broker-list
instead of --bootstrap-server
, for instance:
For the consumer, it is the same as with newer versions:
The retention settings are the default (for instance, log.retention.hours
and log.retention.bytes
at the broker level; or retention.ms
and retention.bytes
at the topic level), but it is recommended to reduce them for the RPC topics; due to the TTL, it isn't worth keeping those messages for long, so 1 hour is more than enough.
Having said that, data pruning happens on closed segments only, meaning Kafka won't delete old records from the active segment (the one currently being updated with new records). That means you should also change the segment.bytes
or segment.ms
at the topic level to allow deletion. These can be equal to or less than the expected retention. Of course, it is crucial to have the single-topic
feature enabled for RPC in both Minion and OpenNMS.
However, we must fix that after the topics are created by either OpenNMS or the Minions, using the Kafka CLI tools or specialized applications like topicctl or CMAK.
For instance, on newer versions of Kafka:
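A sketch, assuming the single-topic RPC request topic for the Durham location ends up named OpenNMS.Durham.rpc-request (the actual name depends on the Instance ID and the location; list the topics first to confirm):

```
/opt/kafka/bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name OpenNMS.Durham.rpc-request \
  --add-config retention.ms=3600000,segment.ms=3600000
```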
For older versions:
Note that topic level settings and broker level settings are slightly different. The topic level settings override the broker level settings when they exist.
Be careful when setting the number of partitions per topic if you're planning to have a massive number of Minion locations or share the cluster across multiple OpenNMS instances with a high number of locations. This is why having the single-topic
enabled in OpenNMS and Minion is the best approach (the default in H28).
Each lead partition (and each replica the broker maintains) will have a directory in the data directory, and Kafka will maintain a file descriptor per segment. Each segment contains two files, the index and the data itself. For more information, check this blog post.
It is recommended to have a dedicated file system for the data directory formatted using XFS with noatime
and nodiratime
in production.
Create a cloud-init script with the following content to deploy PostgreSQL, the latest OpenNMS Horizon, and CMAK in Ubuntu LTS and store it at /tmp/opennms-template.yaml
:
We don't need to specify the whole list of Kafka brokers as part of the bootstrap.servers
entry. The whole topology will be discovered through the first one that responds, and the client will use what's configured as the advertised listener to talk to each broker. I added two in case the first one is unavailable (as a backup).
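As a hedged example of what those entries could look like (the property names are the standard Sink/RPC Kafka settings; the broker FQDNs are placeholders):

```
# $OPENNMS_HOME/etc/opennms.properties.d/kafka.properties (sketch)
org.opennms.core.ipc.sink.kafka.bootstrap.servers=<kafka1-internal-fqdn>:9092,<kafka2-internal-fqdn>:9092
org.opennms.core.ipc.rpc.kafka.bootstrap.servers=<kafka1-internal-fqdn>:9092,<kafka2-internal-fqdn>:9092
```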
If you're using an older version of Kafka, make sure to set the appropriate version when adding your cluster to CMAK.
The above installs the latest OpenJDK 11, the latest PostgreSQL, and the latest OpenNMS Horizon on the VM. It also installs Kafka Manager (CMAK) via Docker. I added the most basic configuration for PostgreSQL to work with authentication. Kafka will be enabled for Sink/RPC as well as the Kafka Producer. As mentioned, Azure VMs can reach each other through hostnames.
Create an Ubuntu VM for OpenNMS:
Note that I'm assuming the usage of SSH Keys for password-less access. Make sure to have a public key located at ~/.ssh/id_rsa.pub
, or update the az vm create
command.
Keep in mind that the cloud-init
process starts once the VM is running, meaning we should wait about 5 minutes after the az vm create
is finished to see OpenNMS up and running.
In case there is a problem, SSH into the VM using the public IP and the provided credentials and check /var/log/cloud-init-output.log
to verify the progress and the status of the cloud-init execution.
Wait until OpenNMS is up and running and then execute the following to start monitoring all the ZK/Kafka servers and the OpenNMS server via SNMP and JMX.
After verifying that OpenNMS is up and running, we can proceed to create the Minions.
Create a cloud-init script to deploy Minion in Ubuntu and save it at /tmp/minion-template.yaml
:
Note that I'm using the same content for bootstrap.servers
as OpenNMS, making sure to use the Public FQDNs, as Minions won't be running in Azure.
Then, start the new Minion via multipass
:
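A sketch of the launch sequence (the VM name and sizing are assumptions; the variables to export depend on what your template references):

```
export MINION_LOCATION="Durham"
envsubst < /tmp/minion-template.yaml > /tmp/minion01.yaml
multipass launch --name minion01 --cpus 1 --mem 2G --cloud-init /tmp/minion01.yaml
```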
Optionally, create a second Minion in the same location:
In case there is a problem, access the VM (e.g., multipass shell minion01
) and check /var/log/cloud-init-output.log
to verify the progress and the status of the cloud-init execution.
Feel free to change the CPU and memory settings for your Minion, but make sure it is consistent with MINION_HEAP_SIZE
. Make sure to validate communication using the health-check
command from the Karaf Shell.
When having multiple Minions per location, they will become part of a consumer group from Kafka's perspective for the RPC requests topic. The group ID will be the name of the location.
As you can see, the location name is Durham
(a.k.a. $MINION_LOCATION
), and you should see the Minions in that location registered in OpenNMS.
SSH into the OpenNMS server and create a requisition with a node in the same network as the Minion VMs, and make sure to associate it with the appropriate location. For instance,
Ensure to replace 192.168.0.40
with the IP of a working server in your network (reachable from the Minion VM, and preferably unreachable or nonexistent in Azure), and do not forget to use the same location as defined in $MINION_LOCATION
.
Please keep in mind that Minions are VMs on your machine. 192.168.0.40
is the IP of one of my machines, which is why the Minions can reach it (and vice versa). To access an external machine on your network, make sure to define static routes on that machine so it can reach the Minions through your machine (assuming you're running Linux or macOS).
OpenNMS, which runs in Azure and has no direct access to 192.168.0.40, should be able to collect data and monitor that node through any of the Minions. In fact, you can stop one of them, and OpenNMS would continue monitoring it.
To test asynchronous messages, you can send SNMP traps or Syslog messages to one of the Minions. Alternatively, you could use udpgen for this purpose. Usually, you could put a Load Balancer in front of the Minions and use its IP when sending messages from the monitored devices.
The machine that will be running udpgen
must be part of the OpenNMS inventory. Then, find the IP of the Minion using multipass list
, and execute the following from the machine added as a node above (the examples assume the IP of the Minion is 192.168.75.16
):
To send SNMP Traps:
To send Syslog Messages:
The C++ version of udpgen
only works on Linux. If you're on macOS, you can use the Go version of it. Unfortunately, Windows is not an option due to a lack of support for Syslog in Go.
Note that an event definition is required when using udpgen
to send traps. Here is what you'd need for Eventd
:
If you want to make the tests more interesting, add the following to the above definition:
The Hawtio UI in OpenNMS can help visualize the relevant JMX metrics and understand what’s circulating between OpenNMS and the Minions.
For OpenNMS, Hawtio is available through :8980/hawtio
if the package opennms-webapp-hawtio
was installed (which is the case with the cloud-init
template used).
For Minions, Hawtio is available through :8181/hawtio
.
As mentioned, if time is not synchronized across all the instances, the Heartbeat messages sent by Minions via the Sink API won't be processed properly by OpenNMS, leading to the Minion not being registered or to outages on the Minion-Heartbeat
service.
We can inspect the traffic on the topics to see if the Minion is sending (or receiving) traffic to Kafka. However, as the payload is encoded within a Protobuf message, using the console consumer might not be as useful as we'd expect. Still, it works for troubleshooting purposes. For instance, from one of the Kafka brokers, we can do:
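For example, consuming from the Heartbeat Sink topic (the topic name below assumes the default Instance ID; list the topics first if you're unsure about the exact names):

```
/opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic OpenNMS.Sink.Heartbeat
```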
And we'll see:
As we can see, the actual payload within the Protobuf message is an indented XML.
The following application can be used to properly inspect the content without worrying about the non-readable content due to the Protobuf format:
https://github.com/agalue/onms-kafka-ipc-receiver
For RPC in particular, we can access the Karaf Shell from the OpenNMS instance and use the opennms:stress-rpc
command to verify communication against the Minions on a given location or against a specific Minion, and as the command name implies, to perform stress tests.
For recent versions of Kafka, the following can help to get details about topics, lags, consumer groups and so on.
To verify the topic partitions and replica settings:
To verify the current topic-level settings:
To verify offsets, topics lag and consumer groups:
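A sketch of those three checks (the topic name is illustrative, and --all-groups is only available in recent Kafka releases):

```
# Partitions and replicas
/opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe
# Topic-level settings
/opt/kafka/bin/kafka-configs.sh --bootstrap-server localhost:9092 --describe \
  --entity-type topics --entity-name OpenNMS.Durham.rpc-request
# Consumer groups, offsets, and lag
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups
```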
When enabling security (either SASL or TLS), you need to pass those settings to the commands.
For instance, if you have SASL enabled, you should pass:
Where the content of consumer.properties
would be:
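A sketch of that file, matching the SASL/SCRAM setup described later in this guide (the username and password are placeholders):

```
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="opennms" password="changeme";
```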
For older versions of Kafka, the equivalent commands are the following:
To verify the topic partitions and replica settings:
To verify the current topic-level settings:
To verify offsets, topics lag and consumer groups:
When passing the ZK host to --zookeeper
, that has to be consistent with how zookeeper.connect
was defined on each Kafka broker. If you used something like this zk1:2181,zk2:2181/kafka
, you should then pass --zookeeper $(hostname):2181/kafka
instead.
In big environments, it is common to have multiple OpenNMS instances, each of them with its own fleet of Minions to monitor one of the multiple data centers or a section of it. In those scenarios, it is common to have a centralized Kafka cluster that can be shared across all of them (for more information, follow this link).
The above solution has to be modified to ensure each set of OpenNMS and Minions will use their own set of topics in Kafka to avoid collisions.
The topics' prefix (which defaults to OpenNMS
) can be controlled via a system-wide property called Instance ID (a.k.a. org.opennms.instance.id
). We must configure this property in both places. For OpenNMS, add it to a property file inside $OPENNMS_HOME/etc/opennms.properties.d
; and for a Minion, add it to $MINION_HOME/etc/custom.system.properties
.
In production, when having multiple Minions per location, it is a good practice to put a Load Balancer in front of them so that the devices can use a single destination for SNMP Traps, Syslog, and Flows.
The following creates a cloud-init template for Ubuntu to start a basic LB using nginx
through multipass
for SNMP Traps (with a listener on port 162) and Syslog Messages (with a listener on port 514). Save the template at /tmp/nginx-template.yaml
:
Note the usage of environment variables within the YAML template. We will substitute them before creating the VM.
Then, update the template and create the LB:
Flows are outside the scope of this test as that requires more configuration on Minions and OpenNMS besides having an Elasticsearch cluster up and running with the required plugin in place.
The above procedure uses Kafka and Zookeeper in plain text without authentication or encryption. That works for testing purposes or perhaps for private clusters, where access to the servers is restricted and audited.
This example, in particular, exposes Kafka to the Internet, which requires having at least authentication in place. The following explains how to enable authentication and then the steps to enable encryption.
For a more comprehensive guide, follow this tutorial from Confluent.
This section explains how to enable authentication using SASL with SCRAM-SHA-512 for Kafka and DIGEST
for Zookeeper (as Zookeeper doesn't support SCRAM
). Because this guide's intention is learning, I decided to add security as a separate or optional module. That's due to the extra complexity associated with this advanced topic.
Here are the high-level changes:
- Update server.properties and the systemd service definition on each Kafka broker to enable and use SASL.
- Update zookeeper.properties and the systemd service definition on each ZK instance to enable and use SASL.

Access one of the brokers and execute the following command:
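The command below is a sketch based on the standard Kafka procedure for creating SCRAM credentials via Zookeeper; the username and password are placeholders, and you should repeat it for every user you need (e.g., minion, plus the broker's inter-broker user if brokers authenticate to each other):

```
/opt/kafka/bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --add-config 'SCRAM-SHA-256=[password=changeme],SCRAM-SHA-512=[password=changeme]' \
  --entity-type users --entity-name opennms
```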
On each Zookeeper instance, update zookeeper.properties
to enable SASL:
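A sketch of the additions (these are the standard Zookeeper SASL settings):

```
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl
```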
On each Kafka broker instance, update server.properties
to enable SASL/SCRAM:
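A sketch of the resulting SASL-related entries, keeping the INSIDE/OUTSIDE split described earlier:

```
listener.security.protocol.map=INSIDE:SASL_PLAINTEXT,OUTSIDE:SASL_PLAINTEXT
sasl.enabled.mechanisms=SCRAM-SHA-256,SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
```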
Note that listener.security.protocol.map
already exists in that file, which is why I removed it prior to adding the required changes.
In theory, there is no need to enable both SCRAM-SHA-256
and SCRAM-SHA-512
. I did that for compatibility purposes, but I'll use SCRAM-SHA-512
for all subsequent configurations.
On each Zookeeper instance, create the JAAS
configuration file with the credentials:
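A sketch of that file (e.g., /opt/kafka/config/zookeeper_jaas.conf; the kafka user and its password are placeholders and must match what the brokers will use):

```
Server {
    org.apache.zookeeper.server.auth.DigestLoginModule required
    user_kafka="changeme";
};
```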
On each Kafka broker, create the JAAS
configuration file with the credentials:
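A sketch of that file (e.g., /opt/kafka/config/kafka_jaas.conf); the Client section must match the DIGEST credentials defined on the Zookeeper side, and the KafkaServer credentials are used for inter-broker SASL (create them with kafka-configs like any other user):

```
KafkaServer {
    org.apache.kafka.common.security.scram.ScramLoginModule required
    username="kafka"
    password="changeme";
};

Client {
    org.apache.zookeeper.server.auth.DigestLoginModule required
    username="kafka"
    password="changeme";
};
```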
On each Zookeeper instance, update the systemd
service definition to load the JAAS settings via KAFKA_OPTS
:
On each Kafka broker, update the systemd
service definition to load the JAAS settings via KAFKA_OPTS
:
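For both units, the change boils down to exporting KAFKA_OPTS with the path to the corresponding JAAS file. A sketch, assuming the file names used above (adjust the unit and paths to whatever the cloud-init template created):

```
# In the [Service] section of the kafka unit (use zookeeper_jaas.conf for the zookeeper unit)
Environment="KAFKA_OPTS=-Djava.security.auth.login.config=/opt/kafka/config/kafka_jaas.conf"
```

After editing the units, run sudo systemctl daemon-reload before restarting the services.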
Restart the cluster in the following order:
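For instance (the service names are assumptions; use whatever names the template defined):

```
# 1. Restart Zookeeper on every ZK instance
sudo systemctl restart zookeeper
# 2. Then restart Kafka on every broker
sudo systemctl restart kafka
```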
At this point, you should pass the SASL credentials to all Kafka CLI Tools. For instance,
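For instance, listing the topics with the SASL client settings stored in a properties file (a sketch; the file contains the security.protocol, sasl.mechanism, and sasl.jaas.config entries shown earlier):

```
/opt/kafka/bin/kafka-topics.sh --bootstrap-server $(hostname):9092 --list \
  --command-config /tmp/consumer.properties
```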
Note how we pass the consumer settings. The above command should list all the topics in the cluster. If you can see the list, then SASL is working. Keep in mind that without passing --command-config, the command will time out, as the tool cannot communicate with Kafka without the credentials.
On the OpenNMS instance, update /opt/opennms/etc/opennms.properties.d/kafka.properties
and /opt/opennms/etc/org.opennms.features.kafka.producer.cfg
to use SASL, and restart OpenNMS. For instance:
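As a hedged sketch of the kind of entries involved (OpenNMS forwards client properties to Kafka using the module prefixes; verify the exact property names against the documentation for your Horizon version, and note the password is a placeholder):

```
# /opt/opennms/etc/opennms.properties.d/kafka.properties (sketch)
org.opennms.core.ipc.sink.kafka.security.protocol=SASL_PLAINTEXT
org.opennms.core.ipc.sink.kafka.sasl.mechanism=SCRAM-SHA-512
org.opennms.core.ipc.sink.kafka.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="opennms" password="changeme";
org.opennms.core.ipc.rpc.kafka.security.protocol=SASL_PLAINTEXT
org.opennms.core.ipc.rpc.kafka.sasl.mechanism=SCRAM-SHA-512
org.opennms.core.ipc.rpc.kafka.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="opennms" password="changeme";
# The Kafka Producer feature takes the equivalent client settings in its own .cfg file.
```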
On each Minion, update /etc/minion/org.opennms.core.ipc.sink.kafka.cfg
and /etc/minion/org.opennms.core.rpc.sink.kafka.cfg
to use SASL, and restart Minion. For instance:
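A sketch for the Sink file (the RPC file takes the same client settings; the FQDNs, username, and password are placeholders):

```
# /etc/minion/org.opennms.core.ipc.sink.kafka.cfg (sketch)
bootstrap.servers=<kafka1-public-fqdn>:9094,<kafka2-public-fqdn>:9094
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="minion" password="changeme";
```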
The solution works from the OpenNMS and Minion perspective, despite the following message appearing repeatedly in /opt/kafka/logs/server.log
on all brokers:
Where 13.0.1.7
is the IP of the OpenNMS server.
At this point, we have SASL
authentication enabled using SCRAM-SHA-512
for Kafka and DIGEST
for Zookeeper, meaning credentials might be hard to crack when intercepting traffic (but perhaps not impossible). However, to make it more secure, encryption is recommended.
If you already configured CMAK
, make sure to enable the SASL/SCRAM mechanism for your cluster.
Please keep in mind that enabling SSL/TLS will increase CPU demand on each broker and the clients, which is why using OpenJDK 11 over JDK 8 is encouraged.
To enable TLS, and because each Kafka Broker must be exposed and reachable through a public DNS entry, I'm going to use LetsEncrypt to generate the certificates. That will save a few steps because the certificates will be publicly valid, so we won't need to set up a Trust Store.
A Trust Store is mandatory when using private CAs or self-signed certificates to configure every entity that touches Kafka directly or indirectly.
The Certbot utility used to create and validate the certificate will start a temporary web server on the instance (for the validation process). For this reason, we should temporarily allow access through TCP port 80:
Then, on each Kafka Broker (one by one), we must do the following to enable TLS:
Please use your own email, and keep in mind that the Azure location is hardcoded in the command; if you're using a different one, update the FQDN.
Note that SSL was only enabled for the OUTSIDE
listener, meaning we should only modify the Minions (and listener.security.protocol.map
was changed because of that), as OpenNMS won't use it because it lives in the same protected network as the Kafka cluster.
To verify, you can retrieve the broker configuration via Zookeeper:
If everything went well, you should get something like this:
Note that SASL_SSL
applies to OUTSIDE
. Now it is time to update the Minions.
On each Minion, do the following:
While you're there, you can check if TLS is actually enabled by running:
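For instance, a quick handshake test against one of the brokers' public listeners can be done with openssl (the FQDN is a placeholder):

```
openssl s_client -connect <kafka1-public-fqdn>:9094 -servername <kafka1-public-fqdn> </dev/null
```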
There is no need to modify anything else as we're using valid certificates signed by a well-known public entity. When using private certificates or private CAs, you would have to create a Trust Store via keytool
for the clients and the brokers.
As a challenge to the reader, update the /tmp/kafka-template.yaml
, /tmp/opennms-template.yaml
, and /tmp/minion-template.yaml
to include all the SASL and SSL/TLS configuration and start the whole environment from scratch with authentication and encryption enabled.
The following is inspired by this guide to enable TLS with Nginx for the OpenNMS WebUI and Grafana. However, as we're using Ubuntu here, I'll describe the required changes.
Allow access via TCP 80 and 443:
SSH into the OpenNMS server and then:
Make sure to use a valid value for $EMAIL
, as that's required by LetsEncrypt (as we did for Kafka).
Note that cmak
(or Kafka Manager) is not present due to the complexity of having it working behind a proxy.
You can remove the NSG rules for ports 8980 and 3000.
Work in progress…
Some circumstances could introduce unexpected behavior to the solution. Besides the traditional monitoring to ensure that all the components are behaving as expected in CPU, Memory, Java Heap Memory, Java GC, and IO (covered as part of this tutorial), you sometimes need to dig deeper to understand what's happening.
OpenNMS added OpenTracing support via Jaeger to understand how much time messages sent via the broker are taking to be produced and consumed.
The official documentation has a guide about how to configure it.
As we have Docker running in the OpenNMS server, we can start an All-In-One Jaeger Instance through it very easily. To do that, SSH into the OpenNMS server and run the following:
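A minimal sketch using the official all-in-one image (the ports match those mentioned below):

```
sudo docker run -d --name jaeger \
  -p 6831:6831/udp -p 6832:6832/udp -p 16686:16686 \
  jaegertracing/all-in-one:latest
```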
OpenNMS has direct access to Jaeger, as it runs on the same machine (reachable via localhost), and should be configured as instructed in the official docs.
For the Minions, you would need to open the UDP ports 6831 and 6832 in the NSG associated with the OpenNMS server, as well as TCP 16686 to access the Jaeger WebUI:
Then, configure the minion as instructed in the official docs, using the OpenNMS FQDN and the port mentioned above.
When we're done, make sure to delete the cloud resources.
If you created the resource group for this exercise, you could remove all the resources with the following command:
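For instance:

```
az group delete --name "$RG_NAME" --yes --no-wait
```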
If you're using an existing resource group that you cannot remove, make sure to remove only the resources created in this tutorial. All of them should be easily identified, as they will contain the username and the VM name as part of the resource name. The easiest way is to use the Azure Portal for this operation. Alternatively,
The reason for having two deletion passes is that, by default, the list initially contains the disks, which cannot be removed before the VMs. For this reason, we exclude the disks in the first pass and then remove them in the second.
Note that because all the resource names are prefixed with the chosen username, we can use it to identify them and remove them uniquely.
Then clean the local resources:
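For instance, assuming the Minions were named minion01 and minion02:

```
multipass delete minion01 minion02
multipass purge
```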
Remember to remove the nginx
instance if you decided to use it.