# Simple OpenNMS/Minion Environment using Kafka in Azure

This lab starts an OpenNMS instance and a 3-node ZK/Kafka cluster in the cloud, plus two Minions on your machine, using Kafka for communication, through Multipass and Azure, for learning purposes.

:::warning
The lab doesn't cover security by default (user authentication and encryption), which is crucial if we ever want to expose the Kafka cluster to the Internet. A separate section covers the required changes for this.
:::

![](https://i.imgur.com/XHww8ga.jpg)

:::success
Keep in mind that nothing prevents us from skipping the cloud provider and doing everything with `Multipass` (or `VirtualBox`, or `Hyper-V`, or `VMWare`). The reason for using a cloud provider is to prove that OpenNMS can monitor unreachable devices via Minion. Similarly, we could use any other cloud provider instead of Azure; however, I won't explain how to port the solution here.
:::

:::warning
Time synchronization across all the instances involved in this solution is mandatory. Failing on this could lead to undesired side effects. This is essentially guaranteed when using a cloud provider, which is why I do not include explicit instructions for it, but please be aware of it.
:::

## Requirements

* Have an [Azure Subscription](https://azure.microsoft.com/en-us/free/) ready.
* Install the [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli).
* Install [Multipass](https://multipass.run/).

The scripts used throughout this tutorial rely on [envsubst](https://www.gnu.org/software/gettext/manual/html_node/envsubst-Invocation.html); make sure it is installed.

:::info
Make sure to log into Azure using `az login` prior to creating the VMs.
:::

:::danger
If you have a restricted account in Azure, make sure you have the `Network Contributor` role and the `Virtual Machine Contributor` role associated with your Azure AD account for the resource group where you want to create the VMs. Of course, either `Owner` or `Contributor` at the resource group level works too.
:::

All of the following assumes you have a macOS or Linux machine or VM from which you can issue all the commands.

## Create common Environment Variables

```bash=
export PREFIX="$USER"      # String to prepend to the name of all Azure resources
export RG_NAME="OpenNMS"   # Change it to use a shared one
export LOCATION="eastus"   # Azure Region
export DOMAIN="$LOCATION.cloudapp.azure.com" # Public Azure DNS Domain
export TIMEZONE="America/New_York"
export VNET_CIDR="13.0.0.0/16"
export VNET_SUBNET="13.0.1.0/24"
export VNET_NAME="$PREFIX-vnet"
export VNET_SUBNET_NAME="subnet1"
export KAFKA_VM_SIZE="Standard_D2s_v3" # 2 VCPU, 8 GB of RAM
export ZK_HEAP_SIZE="1G"               # Must fit KAFKA_VM_SIZE
export KAFKA_URL="https://downloads.apache.org/kafka/2.8.1/kafka_2.13-2.8.1.tgz"
export KAFKA_JAVA_VERSION="11"         # 8 for < 2.1.0; 11 for > 2.1.0
export KAFKA_HEAP_SIZE="2G"            # Must fit KAFKA_VM_SIZE
export KAFKA_PARTITIONS="9"            # > Number of Minions per location
export KAFKA_CLUSTER_SIZE="3"          # Total instances of Kafka+ZK
export KAFKA_RF="2"                    # < KAFKA_CLUSTER_SIZE
export ONMS_VM_NAME="$PREFIX-onms01"
export ONMS_VM_SIZE="Standard_D2s_v3"  # 2 VCPU, 8 GB of RAM
export ONMS_HEAP_SIZE="4096"           # Expressed in MB and must fit ONMS_VM_SIZE
export MINION_LOCATION="Durham"
export MINION_HEAP_SIZE="1G"           # Must fit VM RAM
```

:::danger
We haven't tested Kafka `3.0.0`, so please use `2.8.x` or older for now.
:::
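Since everything below depends on these variables and tools, a quick sanity check (a minimal sketch; adjust the lists as needed) can save debugging time later:

```bash=
# Verify the required tools are on the PATH
for cmd in az multipass envsubst; do
  command -v $cmd >/dev/null || echo "Missing command: $cmd"
done

# Print the key variables; an empty value indicates a problem
for v in PREFIX RG_NAME LOCATION DOMAIN VNET_NAME ONMS_VM_NAME MINION_LOCATION; do
  echo "$v=${!v}"
done
```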
:::success
Feel free to change the content, and keep in mind that `$PREFIX` is what we will use throughout this tutorial to uniquely identify all the resources we will create in Azure.
:::

:::warning
Do not confuse the Azure Location or Region with the Minion Location; they are unrelated.
:::

We're going to leverage the Azure DNS services to avoid having to remember and use public IP addresses, which helps if we're interested in having HTTPS with valid certificates as explained [here](https://hackmd.io/@agalue/HyGyD0diN), not only for OpenNMS but also to enable SSL/TLS in Kafka. In Azure, the default public DNS entries follow the same pattern:

```
<vm-name>.<location>.cloudapp.azure.com
```

To make each VM's FQDN unique, we're going to add the username to the VM name. For instance, the OpenNMS FQDN would be:

```
agalue-onms01.eastus.cloudapp.azure.com
```

The above is what we can use to access the VM via SSH and to configure Minions.

## Create the Azure Resource Group

This is a necessary step, as every resource in Azure must belong to a resource group and a location. However, you can omit the following command and use an existing group if you prefer. In that case, make sure to adjust the environment variable `RG_NAME` so the subsequent commands will target the correct group.

```bash=
az group create -n $RG_NAME -l $LOCATION --tags Owner=$USER
```

## Create the Virtual Network

I prefer to create the VNet myself instead of letting Azure do it for me, especially when we want to guarantee that all the VMs will exist in the same one.

```bash=
az network vnet create -g $RG_NAME \
  --name $VNET_NAME \
  --address-prefix $VNET_CIDR \
  --subnet-name $VNET_SUBNET_NAME \
  --subnet-prefix $VNET_SUBNET \
  --tags Owner=$USER \
  --output table
```

## Create cloud-init configuration template for Kafka

The following [cloud-init](https://cloudinit.readthedocs.io/en/latest/) template assumes a 3-node cluster, where each VM has Zookeeper and Kafka configured and running on Ubuntu LTS.

:::warning
For simplicity, Zookeeper and Kafka will be running on each machine. In production, each cluster should have its own instances, as Zookeeper should not grow the same way Kafka does, for multiple reasons: a ZK cluster should always have an odd number of members (which is not the case for Kafka), and the load on ZK grows with the size of the Kafka cluster (a ZK ensemble of 5 members can manage dozens of Kafka brokers, with 7 it can manage hundreds, and with 9 it can manage thousands).
:::

For the 3-node cluster, each VM will be named as follows:

- agalue-kafka-1
- agalue-kafka-2
- agalue-kafka-3

Note the hostnames include the chosen username to make them unique, which is mandatory for shared resource groups and the default Azure DNS public domain on the chosen region.

:::success
Remember that each VM in Azure is reachable within the same VNet from any other VM through its hostname.
:::

Of all the environment variables you'll encounter in the upcoming template, there are two crucial ones:

* PUBLIC_FQDN
* INSTANCE_ID

For `server.properties`, we must replace the environment variable `PUBLIC_FQDN` in `advertised.listeners` with the public FQDN or IP of the VM when configuring the application, before running it for the first time. With that in mind, there will be two listeners: one to be used within the VNet (which is what OpenNMS will use, on port 9092), and another associated with the Public FQDN (on port 9094), to be used by external Minions (outside Azure). Similarly, we must replace `INSTANCE_ID` with a unique numeric value per instance for the `broker.id` in `server.properties` for Kafka and the `myid` file for Zookeeper, which are the mandatory requirements to identify each instance in their respective cluster.
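For instance, after substitution on the first broker (illustrative values, assuming `PREFIX=agalue` and the `eastus` region), the relevant lines in `server.properties` would look like this, and `/data/zookeeper/myid` would contain `1`:

```
broker.id=1
advertised.listeners=INSIDE://:9092,OUTSIDE://agalue-kafka-1.eastus.cloudapp.azure.com:9094
```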
:::warning
The number of topic partitions must be greater than the number of Minions on a given location and greater than the number of brokers in the cluster.
:::

Create a YAML file called `/tmp/kafka-template.yaml` with the following content:

```yaml=
#cloud-config
package_upgrade: true
timezone: $TIMEZONE

users:
  - default
  - name: kafka

write_files:
  - owner: root:root
    path: /etc/security/limits.d/kafka.conf
    content: |
      * soft nofile 100000
      * hard nofile 100000
  - owner: root:root
    path: /etc/sysctl.d/99-kafka.conf
    content: |
      net.ipv4.tcp_keepalive_time=60
      net.ipv4.tcp_keepalive_probes=3
      net.ipv4.tcp_keepalive_intvl=10
      net.core.rmem_max=16777216
      net.core.wmem_max=16777216
      net.core.rmem_default=16777216
      net.core.wmem_default=16777216
      net.core.optmem_max=40960
      net.ipv4.tcp_rmem=4096 87380 16777216
      net.ipv4.tcp_wmem=4096 65536 16777216
      net.ipv4.tcp_window_scaling=1
      net.core.netdev_max_backlog=2500
      net.core.somaxconn=65000
      vm.swappiness=1
      vm.zone_reclaim_mode=0
      vm.max_map_count=1048575
  - owner: root:root
    permissions: '0400'
    path: /etc/snmp/snmpd.conf
    content: |
      rocommunity public default
      syslocation Azure - $LOCATION
      syscontact $USER
      dontLogTCPWrappersConnects yes
      disk /
  - owner: root:root
    path: /etc/systemd/system/zookeeper.service
    content: |
      [Unit]
      Description=Apache Zookeeper server
      Documentation=http://zookeeper.apache.org
      Wants=network-online.target
      After=network-online.target

      [Service]
      Type=simple
      User=kafka
      Group=kafka
      Environment="KAFKA_HEAP_OPTS=-Xmx$ZK_HEAP_SIZE -Xms$ZK_HEAP_SIZE"
      ExecStart=/opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties
      ExecStop=/opt/kafka/bin/zookeeper-server-stop.sh

      [Install]
      WantedBy=multi-user.target
  - owner: root:root
    path: /etc/systemd/system/kafka.service
    content: |
      [Unit]
      Description=Apache Kafka Server
      Documentation=http://kafka.apache.org
      Wants=zookeeper.service
      After=zookeeper.service network-online.target

      [Service]
      Type=simple
      User=kafka
      Group=kafka
      LimitNOFILE=100000
      Environment="KAFKA_HEAP_OPTS=-Xmx$KAFKA_HEAP_SIZE -Xms$KAFKA_HEAP_SIZE"
      Environment="KAFKA_JMX_OPTS=-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.rmi.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=%H -Djava.net.preferIPv4Stack=true"
      Environment="JMX_PORT=9999"
      ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
      ExecStop=/opt/kafka/bin/kafka-server-stop.sh

      [Install]
      WantedBy=multi-user.target
  - owner: root:root
    path: /tmp/zookeeper.properties # Designed for a 3-node ZK cluster
    content: |
      dataDir=/data/zookeeper
      tickTime=2000
      clientPort=2181
      initLimit=10
      syncLimit=5
      # Cluster Members
      server.1=$PREFIX-kafka-1:2888:3888;2181
      server.2=$PREFIX-kafka-2:2888:3888;2181
      server.3=$PREFIX-kafka-3:2888:3888;2181
  - owner: root:root
    path: /tmp/server.properties # Designed for a 3-node Kafka cluster
    content: |
      broker.id=$INSTANCE_ID
      log.dirs=/data/kafka
      zookeeper.connect=$PREFIX-kafka-1:2181,$PREFIX-kafka-2:2181,$PREFIX-kafka-3:2181
      zookeeper.connection.timeout.ms=30000
      # Connection
      advertised.listeners=INSIDE://:9092,OUTSIDE://$PUBLIC_FQDN:9094
      listeners=INSIDE://:9092,OUTSIDE://:9094
      listener.security.protocol.map=INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      inter.broker.listener.name=INSIDE
      # Replication
      offsets.topic.replication.factor=$KAFKA_RF
      default.replication.factor=$KAFKA_RF
      min.insync.replicas=1
      # Must be greater than number of Minions per Location
      num.partitions=$KAFKA_PARTITIONS
      # Recommended for the OpenNMS Kafka Producer
      message.max.bytes=5000000
      replica.fetch.max.bytes=5000000
      compression.type=producer
      # Cleanup (remove segments older than a week)
      log.retention.hours=168
      log.retention.bytes=-1
      # Required for OpenNMS and Minions
      auto.create.topics.enable=true
      # Recommended to avoid disrupting messages workflow
      delete.topic.enable=false

packages:
  - snmp
  - snmpd
  - jq
  - openjdk-$KAFKA_JAVA_VERSION-jre-headless

runcmd:
  - sysctl --system
  - wget -O /tmp/kafka.tar.gz $KAFKA_URL
  - cd /opt
  - mkdir kafka
  - tar -xvzf /tmp/kafka.tar.gz -C kafka --strip-components 1
  - mv -f /tmp/*.properties /opt/kafka/config/
  - mkdir -p /data/zookeeper /data/kafka
  - chown -R kafka:kafka /data /opt/kafka*
  - echo $INSTANCE_ID > /data/zookeeper/myid
  - systemctl daemon-reload
  - systemctl --now enable zookeeper
  - systemctl --now enable kafka
  - systemctl --now enable snmpd
```

The reason for increasing the message size (`message.max.bytes`, `replica.fetch.max.bytes`) is to avoid problems when forwarding collected metrics to Kafka via the Kafka Producer feature of OpenNMS, which we'll enable later.

If, for instance, you want to use an older version of Kafka, adjust the JDK package and the Kafka URL so the template applies the correct ones:

```bash=
export KAFKA_URL="https://archive.apache.org/dist/kafka/1.1.0/kafka_2.11-1.1.0.tgz"
export KAFKA_JAVA_VERSION="8"
```

Also, edit the template and remove `;2181` from the `server` entries in `zookeeper.properties`, as expressing the client port that way requires Zookeeper 3.5 or newer.

## Start Broker Instances

```bash=
for i in $(seq 1 $KAFKA_CLUSTER_SIZE); do
  VM_NAME="$PREFIX-kafka-$i"
  echo "Creating VM $VM_NAME..."
  export INSTANCE_ID="$i"
  export PUBLIC_FQDN="$VM_NAME.$DOMAIN"
  envsubst < /tmp/kafka-template.yaml > $VM_NAME.yaml
  az vm create --resource-group $RG_NAME --name $VM_NAME \
    --size $KAFKA_VM_SIZE \
    --image canonical:0001-com-ubuntu-server-focal:20_04-lts:latest \
    --admin-username $USER \
    --ssh-key-values ~/.ssh/id_rsa.pub \
    --vnet-name $VNET_NAME \
    --subnet $VNET_SUBNET_NAME \
    --public-ip-sku Standard \
    --public-ip-address-dns-name $VM_NAME \
    --custom-data $VM_NAME.yaml \
    --tags Owner=$USER \
    --no-wait
done
```

:::info
Note that I'm assuming the usage of SSH keys for password-less access. Make sure to have a public key located at `~/.ssh/id_rsa.pub`, or update the `az vm create` command.
:::

The above starts all the VMs simultaneously, using public IP addresses and FQDNs to avoid access problems with external Minions and reconfiguration issues with the Kafka advertised listeners. However, like the public IPs, the private IPs will be dynamic. Fortunately, this is not a problem, as we're going to use DNS to access Kafka.

Keep in mind that the `cloud-init` process starts once a VM is running, meaning we should wait a few minutes after the VMs are created before they are ready to use.

Then, allow access for remote Minions:

```bash=
for i in $(seq 1 $KAFKA_CLUSTER_SIZE); do
  VM_NAME="$PREFIX-kafka-$i"
  az vm open-port -g $RG_NAME -n $VM_NAME \
    --port 9094 --priority 100 --output table
done
```

You can inspect the generated YAML files to see the final content used on each VM (after applying the env-var substitutions).
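To check on the provisioning progress without leaving your terminal, you can list the broker VMs with their state and public IPs (a read-only query; safe to run at any time):

```bash=
az vm list -g $RG_NAME -d \
  --query "[?contains(name,'$PREFIX-kafka')].{Name:name, State:powerState, PublicIP:publicIps}" \
  --output table
```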
:::warning
In case there is a problem, SSH into the VM using the public IP and the provided credentials, and check `/var/log/cloud-init-output.log` to verify the progress and status of the cloud-init execution.
:::

## Validate Zookeeper and Kafka status

To make sure the Zookeeper cluster started, we can use the "4 letter words" commands via the embedded web server, available in version 3.5 or newer. For instance:

```bash=
curl http://$(hostname):8080/commands/monitor
```

The above gives us general information, including the `server_state`, which can be `leader` or `follower`. To get statistics:

```bash=
curl http://$(hostname):8080/commands/stats
```

For Zookeeper version 3.4 or older (for instance, when using older versions of Kafka), you can still use the deprecated way to verify:

```bash=
echo stat | nc $(hostname) 2181; echo
```

From Kafka's perspective, we can verify how each broker has registered via Zookeeper, or follow [this](https://kafka.apache.org/quickstart) guide to create a topic and use the console producer and consumer to validate its functionality.

List the Broker IDs:

```bash=
/opt/kafka/bin/zookeeper-shell.sh $(hostname) ls /brokers/ids
```

We should get:

```bash=
[1, 2, 3]
```

:::warning
If that's not the case, SSH into the broker that is not listed and make sure Kafka is running. It is possible that Kafka failed to register with Zookeeper and refused to start, due to how the VMs are initialized: Zookeeper should start first (the whole cluster), then Kafka, but as we're not guaranteeing that, some instances might fail to start on their own. The procedure was designed to avoid this situation as much as possible.
:::

Get the basic configuration of a broker:

```bash=
/opt/kafka/bin/zookeeper-shell.sh $(hostname) get /brokers/ids/1 | egrep '^\{' | jq
```

If we run it from the first instance, we should get:

```json=
{
  "features": {},
  "listener_security_protocol_map": {
    "INSIDE": "PLAINTEXT",
    "OUTSIDE": "PLAINTEXT"
  },
  "endpoints": [
    "INSIDE://agalue-kafka-1:9092",
    "OUTSIDE://agalue-kafka-1.eastus.cloudapp.azure.com:9094"
  ],
  "jmx_port": 9999,
  "port": 9092,
  "host": "agalue-kafka-1",
  "version": 5,
  "timestamp": "1616265688431"
}
```

Note the two listeners. Clients within Azure, like OpenNMS, would use the `INSIDE` one on port 9092, pointing to the local FQDN or hostname of the VM (and remember they are resolvable via DNS within the same VNet). In contrast, clients outside Azure, like Minions, would use the `OUTSIDE` one on port 9094, pointing to the Public FQDN of each Kafka instance (accessible thanks to the NSG associated with each VM).

:::warning
Kafka defaults to the `hostname` or `FQDN` of the primary interface when we don't explicitly specify one on the listener. As Azure DNS works by default, hostnames are resolvable by all VMs within the same VNet, so Kafka will use the correct one. However, if you're using another cloud provider or bare metal, make sure DNS works across all the VMs. Otherwise, change the `INSIDE` listener to explicitly point to the private IP address of the VM and the `OUTSIDE` listener to point to the public IP address of the VM, and make sure to use static IPs if you're going to rely on them.
:::
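Related to the warning above, you can quickly confirm from any of the brokers that the peer hostnames resolve via the Azure-provided DNS (a minimal check; the names assume `PREFIX=agalue`):

```bash=
getent hosts agalue-kafka-1 agalue-kafka-2 agalue-kafka-3
```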
### Verification for newer versions of Kafka

Another way to verify the behavior is using the console producer and console consumer to confirm that we can send and receive messages through a given topic. To do that, for recent versions of Kafka, let's create a `Test` topic:

```bash=
/opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server $(hostname):9092 \
  --create --topic Test --replication-factor 2 --partitions 3
```

Then, start a console producer from one of the brokers:

```bash=
/opt/kafka/bin/kafka-console-producer.sh \
  --bootstrap-server $(hostname):9092 --topic Test
```

From another broker (separate SSH session), start a console consumer:

```bash=
/opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server $(hostname):9092 --topic Test
```

Go back to the terminal on which the console producer is running, type a message, and hit enter. Then, switch to the console consumer terminal, and we should see the message sent. Use `Ctrl+C` to stop the producer and consumer.

A more comprehensive test would be to download Kafka locally on your machine and run either the producer or the consumer there (use port 9094 and the public FQDN or IP of one of the brokers). That serves to test connectivity from the Internet.

### Verification for older versions of Kafka

To create the `Test` topic:

```bash=
/opt/kafka/bin/kafka-topics.sh \
  --zookeeper $(hostname):2181 \
  --create --topic Test --replication-factor 2 --partitions 3
```

As you can see, the difference is talking to Zookeeper directly (using `--zookeeper`) instead of reaching Kafka (using `--bootstrap-server`). For the producer, use `--broker-list` instead of `--bootstrap-server`. For instance:

```bash=
/opt/kafka/bin/kafka-console-producer.sh \
  --broker-list $(hostname):9092 --topic Test
```

For the consumer, it is the same as in newer versions:

```bash=
/opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server $(hostname):9092 --topic Test
```

## Topic settings

The retention settings are the defaults (for instance, `log.retention.hours` and `log.retention.bytes` at the broker level; or `retention.ms` and `retention.bytes` at the topic level), but it is recommended to reduce them for the RPC topics: due to the TTL, it is not worth keeping their records for long, so one hour is more than enough.

Having said that, data pruning happens on closed segments only, meaning Kafka won't delete old records from the active segment (the one currently being updated with new records). That means you should also change `segment.bytes` or `segment.ms` at the topic level to allow deletion. These can be equal to or less than the expected retention.

Of course, it is crucial to have the `single-topic` feature enabled for RPC in both Minion and OpenNMS. However, we must apply these settings after the topics are created by either OpenNMS or the Minions, using the Kafka CLI tools or specialized applications like [topicctl](https://github.com/segmentio/topicctl) or [CMAK](https://github.com/yahoo/CMAK). For instance, on newer versions of Kafka:

```bash=
/opt/kafka/bin/kafka-configs.sh --alter \
  --bootstrap-server $(hostname):9092 \
  --entity-type topics \
  --entity-name OpenNMS.rpc-response \
  --add-config segment.ms=3600000,retention.ms=3600000
```

For older versions:

```bash=
/opt/kafka/bin/kafka-configs.sh --alter \
  --zookeeper $(hostname):2181 \
  --entity-type topics \
  --entity-name OpenNMS.rpc-response \
  --add-config segment.ms=3600000,retention.ms=3600000
```

Note that topic-level settings and broker-level settings are named slightly differently. The topic-level settings override the broker-level settings when they exist.
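To confirm the overrides took effect, you can describe the topic afterwards; without extra flags, only the dynamically set values are listed:

```bash=
/opt/kafka/bin/kafka-configs.sh --describe \
  --bootstrap-server $(hostname):9092 \
  --entity-type topics --entity-name OpenNMS.rpc-response
```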
Be careful when setting the number of partitions per topic if you're planning to have a massive number of Minion locations, or to share the cluster across multiple OpenNMS instances with a high number of locations. This is why having `single-topic` enabled in OpenNMS and Minion is the best approach (the default in Horizon 28). Each partition leader (and each replica the broker maintains) has a directory in the data directory, and Kafka maintains file descriptors per segment. Each segment contains two files: the index and the data itself. For more information, check [this](https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster/) blog post.

In production, it is recommended to have a dedicated file system for the data directory, formatted using XFS and mounted with `noatime` and `nodiratime`.

## Create an Azure VM for OpenNMS

Create a [cloud-init](https://cloudinit.readthedocs.io/en/latest/) script with the following content to deploy PostgreSQL, the latest OpenNMS Horizon, and CMAK on Ubuntu LTS, and store it at `/tmp/opennms-template.yaml`:

```yaml=
#cloud-config
package_upgrade: true
timezone: $TIMEZONE

write_files:
  - owner: root:root
    path: /etc/opennms-overlay/featuresBoot.d/features.boot
    content: |
      opennms-kafka-producer

  # OpenNMS RRD Settings
  - owner: root:root
    path: /etc/opennms-overlay/opennms.properties.d/rrd.properties
    content: |
      org.opennms.rrd.storeByGroup=true
      org.opennms.rrd.storeByForeignSource=true
      org.opennms.rrd.strategyClass=org.opennms.netmgt.rrd.rrdtool.MultithreadedJniRrdStrategy
      org.opennms.rrd.interfaceJar=/usr/share/java/jrrd2.jar
      opennms.library.jrrd2=/usr/lib/jni/libjrrd2.so

  # OpenNMS Sink and RPC API
  - owner: root:root
    path: /etc/opennms-overlay/opennms.properties.d/kafka.properties
    content: |
      # Disable internal ActiveMQ
      org.opennms.activemq.broker.disable=true
      # Sink
      org.opennms.core.ipc.sink.strategy=kafka
      org.opennms.core.ipc.sink.kafka.bootstrap.servers=$PREFIX-kafka-1:9092,$PREFIX-kafka-2:9092
      org.opennms.core.ipc.sink.kafka.acks=1
      # RPC
      org.opennms.core.ipc.rpc.strategy=kafka
      org.opennms.core.ipc.rpc.kafka.bootstrap.servers=$PREFIX-kafka-1:9092,$PREFIX-kafka-2:9092
      org.opennms.core.ipc.rpc.kafka.ttl=30000
      org.opennms.core.ipc.rpc.kafka.single-topic=true
      org.opennms.core.ipc.rpc.kafka.auto.offset.reset=latest

  # OpenNMS Kafka Producer Client
  - owner: root:root
    path: /etc/opennms-overlay/org.opennms.features.kafka.producer.client.cfg
    content: |
      bootstrap.servers=$PREFIX-kafka-1:9092,$PREFIX-kafka-2:9092
      compression.type=zstd
      timeout.ms=30000
      max.request.size=5000000

  # OpenNMS Kafka Producer Settings
  - owner: root:root
    path: /etc/opennms-overlay/org.opennms.features.kafka.producer.cfg
    content: |
      topologyProtocols=bridge,cdp,isis,lldp,ospf
      suppressIncrementalAlarms=true
      forward.metrics=true
      nodeRefreshTimeoutMs=300000
      alarmSyncIntervalMs=300000
      kafkaSendQueueCapacity=1000
      nodeTopic=OpenNMS_nodes
      alarmTopic=OpenNMS_alarms
      eventTopic=OpenNMS_events
      metricTopic=OpenNMS_metrics
      alarmFeedbackTopic=OpenNMS_alarms_feedback
      topologyVertexTopic=OpenNMS_topology_vertices
      topologyEdgeTopic=OpenNMS_edges

  - owner: root:root
    permissions: '0400'
    path: /etc/snmp/snmpd.conf
    content: |
      rocommunity public default
      syslocation Azure - $LOCATION
      syscontact $USER
      dontLogTCPWrappersConnects yes
      disk /

apt:
  preserve_sources_list: true
  sources:
    opennms:
      source: deb https://debian.opennms.org stable main
    docker:
      source: deb https://download.docker.com/linux/ubuntu bionic stable

packages:
  - snmp
  - snmpd
  - jq
  - jrrd2
  - opennms
  - opennms-webapp-hawtio
  - opennms-helm
  - docker-ce
  - docker-ce-cli
  - containerd.io

bootcmd:
  - curl -s https://debian.opennms.org/OPENNMS-GPG-KEY | apt-key add -
  - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -

runcmd:
  # Configure PostgreSQL
  - systemctl --now enable postgresql
  - sudo -u postgres createuser opennms
  - sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'postgres';"
  - sudo -u postgres psql -c "ALTER USER opennms WITH PASSWORD 'opennms';"
  - sed -r -i 's/password=""/password="postgres"/' /etc/opennms/opennms-datasources.xml
  # Configure OpenNMS
  - sed -r -i '/enabled="false"/{$!{N;s/ enabled="false"[>]\n(.*OpenNMS:Name=Syslogd.*)/>\n\1/}}' /etc/opennms/service-configuration.xml
  - echo "JAVA_HEAP_SIZE=$ONMS_HEAP_SIZE" > /etc/opennms/opennms.conf
  - rsync -avr /etc/opennms-overlay/ /etc/opennms/
  - /usr/share/opennms/bin/runjava -s
  - /usr/share/opennms/bin/fix-permissions
  - /usr/share/opennms/bin/install -dis
  - systemctl --now enable opennms
  # Start CMAK using Docker
  - usermod -aG docker ubuntu
  - docker run --name cmak -d -e ZK_HOSTS="$PREFIX-kafka-1:2181" -e APPLICATION_SECRET="opennms" -p 9000:9000 hlebalbau/kafka-manager:stable
  # Upgrade Grafana
  - sudo apt-get install -y adduser libfontconfig1
  - wget https://dl.grafana.com/oss/release/grafana_7.5.11_amd64.deb
  - sudo dpkg -i grafana_7.5.11_amd64.deb
```

:::info
We don't need to specify the whole list of Kafka brokers in the `bootstrap.servers` entry. The whole topology will be discovered through the first one that responds, and the client will use what's configured as the advertised listener to talk to each broker. I added two in case the first one is unavailable (as a backup).
:::

:::warning
If you're using an older version of Kafka, make sure to set the appropriate version when adding your cluster to CMAK.
:::

The above installs the latest OpenJDK 11, the latest PostgreSQL, and the latest OpenNMS Horizon on the VM. It also installs CMAK (formerly Kafka Manager) via Docker. I added the most basic configuration for PostgreSQL to work with authentication. Kafka will be enabled for Sink/RPC as well as for the Kafka Producer. As mentioned, Azure VMs can reach each other through their hostnames.

Create an Ubuntu VM for OpenNMS:

```bash=
envsubst < /tmp/opennms-template.yaml > /tmp/opennms.yaml

az vm create --resource-group $RG_NAME --name $ONMS_VM_NAME \
  --size $ONMS_VM_SIZE \
  --image canonical:0001-com-ubuntu-server-focal:20_04-lts:latest \
  --admin-username $USER \
  --ssh-key-values ~/.ssh/id_rsa.pub \
  --vnet-name $VNET_NAME \
  --subnet $VNET_SUBNET_NAME \
  --public-ip-address-dns-name $ONMS_VM_NAME \
  --public-ip-sku Standard \
  --custom-data /tmp/opennms.yaml \
  --tags Owner=$USER \
  --output table

az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
  --port 8980 --priority 200 --output table

az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
  --port 3000 --priority 210 --output table

az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
  --port 9000 --priority 300 --output table
```

:::info
Note that I'm assuming the usage of SSH keys for password-less access. Make sure to have a public key located at `~/.ssh/id_rsa.pub`, or update the `az vm create` command.
:::

Keep in mind that the `cloud-init` process starts once the VM is running, meaning we should wait about 5 minutes after `az vm create` is finished to see OpenNMS up and running.

:::warning
In case there is a problem, SSH into the VM using the public IP and the provided credentials, and check `/var/log/cloud-init-output.log` to verify the progress and status of the cloud-init execution.
:::
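To know when OpenNMS is ready without SSHing into the VM, you can poll the ReST API from your machine until it responds (a simple sketch using the default `admin:admin` credentials):

```bash=
until curl -sf -u admin:admin "http://$ONMS_VM_NAME.$DOMAIN:8980/opennms/rest/info" >/dev/null; do
  echo "Waiting for OpenNMS..."; sleep 10
done
echo "OpenNMS is up"
```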
## Monitor the infrastructure

Wait until OpenNMS is up and running, and then execute the following to start monitoring all the ZK/Kafka servers and the OpenNMS server via SNMP and JMX:

```bash=
ONMS_FQDN="$ONMS_VM_NAME.$DOMAIN"

cat <<EOF >/tmp/OpenNMS.xml
<?xml version="1.0"?>
<model-import date-stamp="$(date +"%Y-%m-%dT%T.000Z")" foreign-source="OpenNMS">
EOF

for vm in $(az vm list -g $RG_NAME --query "[?contains(name,'$PREFIX-')].name" -o tsv); do
  ipaddr=$(az vm show -g $RG_NAME -n $vm -d --query privateIps -o tsv)
  cat <<EOF >>/tmp/OpenNMS.xml
  <node foreign-id="$vm" node-label="$vm">
EOF
  if [[ "$vm" == *"kafka"* ]]; then
    cat <<EOF >>/tmp/OpenNMS.xml
    <interface ip-addr="$ipaddr" status="1" snmp-primary="P">
      <monitored-service service-name="JMX-Kafka"/>
    </interface>
  </node>
EOF
  fi
  if [[ "$vm" == *"onms"* ]]; then
    cat <<EOF >>/tmp/OpenNMS.xml
    <interface ip-addr="$ipaddr" status="1" snmp-primary="P"/>
    <interface ip-addr="127.0.0.1" status="1" snmp-primary="N">
      <monitored-service service-name="OpenNMS-JVM"/>
    </interface>
  </node>
EOF
  fi
done

cat <<EOF >>/tmp/OpenNMS.xml
</model-import>
EOF

curl -v -u admin:admin \
  -H 'Content-Type: application/xml' -d @/tmp/OpenNMS.xml \
  http://$ONMS_FQDN:8980/opennms/rest/requisitions

curl -v -u admin:admin -X PUT \
  http://$ONMS_FQDN:8980/opennms/rest/requisitions/OpenNMS/import
```

## Create Minion VMs using `multipass`

After verifying that OpenNMS is up and running, we can proceed to create the Minions.

Create a [cloud-init](https://cloudinit.readthedocs.io/en/latest/) script to deploy Minion on Ubuntu, and save it at `/tmp/minion-template.yaml`:

```yaml=
#cloud-config
package_upgrade: true
timezone: $TIMEZONE

write_files:
  - owner: root:root
    path: /etc/minion-overlay/org.opennms.minion.controller.cfg
    content: |
      location=$MINION_LOCATION
      id=$MINION_ID
      http-url=http://$ONMS_VM_NAME.$DOMAIN:8980/opennms
  - owner: root:root
    path: /etc/minion-overlay/featuresBoot.d/kafka.boot
    content: |
      !minion-jms
      !opennms-core-ipc-sink-camel
      !opennms-core-ipc-rpc-jms
      opennms-core-ipc-sink-kafka
      opennms-core-ipc-rpc-kafka
  - owner: root:root
    path: /etc/minion-overlay/org.opennms.core.ipc.sink.kafka.cfg
    content: |
      bootstrap.servers=$PREFIX-kafka-1.$DOMAIN:9094,$PREFIX-kafka-2.$DOMAIN:9094
  - owner: root:root
    path: /etc/minion-overlay/org.opennms.core.ipc.rpc.kafka.cfg
    content: |
      bootstrap.servers=$PREFIX-kafka-1.$DOMAIN:9094,$PREFIX-kafka-2.$DOMAIN:9094
      single-topic=true

apt:
  preserve_sources_list: true
  sources:
    opennms:
      source: deb https://debian.opennms.org stable main

packages:
  - opennms-minion

bootcmd:
  - curl -s https://debian.opennms.org/OPENNMS-GPG-KEY | apt-key add -

runcmd:
  - rsync -avr /etc/minion-overlay/ /etc/minion/
  - sed -i -r 's/# export JAVA_MIN_MEM=.*/export JAVA_MIN_MEM="$MINION_HEAP_SIZE"/' /etc/default/minion
  - sed -i -r 's/# export JAVA_MAX_MEM=.*/export JAVA_MAX_MEM="$MINION_HEAP_SIZE"/' /etc/default/minion
  - /usr/share/minion/bin/scvcli set opennms.http admin admin
  - /usr/share/minion/bin/scvcli set opennms.broker admin admin
  - systemctl --now enable minion
```

Note that, as with OpenNMS, I list two brokers for `bootstrap.servers`, but here we must use the public FQDNs on port 9094, as the Minions won't be running in Azure.
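Before launching the Minions, you can optionally confirm from your machine that the public listeners are reachable (assuming `PREFIX` and `DOMAIN` are still set in your shell):

```bash=
for i in 1 2 3; do
  nc -vz $PREFIX-kafka-$i.$DOMAIN 9094
done
```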
Then, start the first Minion via `multipass`:

```bash=
export MINION_ID=minion01
envsubst < /tmp/minion-template.yaml > /tmp/$MINION_ID.yaml
multipass launch -c 1 -m 2G -n $MINION_ID --cloud-init /tmp/$MINION_ID.yaml
```

Optionally, create a second Minion in the same location:

```bash=
export MINION_ID=minion02
envsubst < /tmp/minion-template.yaml > /tmp/$MINION_ID.yaml
multipass launch -c 1 -m 2G -n $MINION_ID --cloud-init /tmp/$MINION_ID.yaml
```

:::warning
In case there is a problem, access the VM (e.g., `multipass shell minion01`) and check `/var/log/cloud-init-output.log` to verify the progress and status of the cloud-init execution.
:::

:::success
Feel free to change the CPU and memory settings for your Minion, but make sure they are consistent with `MINION_HEAP_SIZE`. Make sure to validate communication using the `health-check` command from the Karaf Shell.
:::

When there are multiple Minions per location, they become part of the same consumer group from Kafka's perspective for the RPC requests topic. The group ID is the name of the location.

## Test

As you can see, the location name is `Durham` (a.k.a. `$MINION_LOCATION`), and you should see the Minions of that location registered in OpenNMS.

SSH into the OpenNMS server and create a requisition with a node in the same network as the Minion VMs, making sure to associate it with the appropriate location. For instance:

```bash=
/usr/share/opennms/bin/provision.pl requisition add Test
/usr/share/opennms/bin/provision.pl node add Test srv01 srv01
/usr/share/opennms/bin/provision.pl node set Test srv01 location Durham
/usr/share/opennms/bin/provision.pl interface add Test srv01 192.168.0.40
/usr/share/opennms/bin/provision.pl interface set Test srv01 192.168.0.40 snmp-primary P
/usr/share/opennms/bin/provision.pl requisition import Test
```

:::warning
Make sure to replace `192.168.0.40` with the IP of a working server in your network (reachable from the Minion VMs, and preferably unreachable from or nonexistent in Azure), and do not forget to use the same location as defined in `$MINION_LOCATION`.
:::

:::danger
Please keep in mind that the Minions are VMs on your machine. `192.168.0.40` is the IP of one of my machines, which is why the Minions can reach it (and vice versa). To use an external machine on your network, make sure to define static routes on that machine so it can reach the Minions through your machine (assuming you're running Linux or macOS).
:::

OpenNMS, which runs in Azure and has no direct access to `192.168.0.40`, should be able to collect data and monitor that node through any of the Minions. In fact, you can stop one of them, and OpenNMS will continue monitoring the node.

To test asynchronous messages, you can send SNMP traps or Syslog messages to one of the Minions. Alternatively, you could use [udpgen](https://github.com/OpenNMS/udpgen) for this purpose. Usually, you would put a Load Balancer in front of the Minions and use its IP when sending messages from the monitored devices.

The machine that will be running `udpgen` must be part of the OpenNMS inventory. Find the IP of the Minion using `multipass list`, then execute the following from the machine added as a node above (the examples assume the IP of the Minion is `192.168.75.16`).

To send SNMP Traps:

```bash=
udpgen -h 192.168.75.16 -x snmp -r 1 -p 1162
```

To send Syslog Messages:

```bash=
udpgen -h 192.168.75.16 -x syslog -r 1 -p 1514
```

:::success
The C++ version of `udpgen` only works on Linux. If you're on macOS, you can use the [Go](https://github.com/agalue/udpgen) version of it. Unfortunately, Windows is not an option due to the lack of Syslog support in Go.
:::
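After sending a few packets, you can confirm the messages made it all the way to OpenNMS by querying the events through the ReST API (illustrative; assumes the default credentials and that `ONMS_VM_NAME` and `DOMAIN` are still set in your shell):

```bash=
curl -s -u admin:admin \
  "http://$ONMS_VM_NAME.$DOMAIN:8980/opennms/rest/events?limit=5" | head
```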
Note that an event definition is required when using `udpgen` to send traps. Here is what you'd need for `Eventd`:

```xml=
<events xmlns="http://xmlns.opennms.org/xsd/eventconf">
  <event>
    <mask>
      <maskelement>
        <mename>id</mename>
        <mevalue>.1.3.6.1.1.6.3.1.1.5</mevalue>
      </maskelement>
      <maskelement>
        <mename>generic</mename>
        <mevalue>6</mevalue>
      </maskelement>
      <maskelement>
        <mename>specific</mename>
        <mevalue>1</mevalue>
      </maskelement>
    </mask>
    <uei>uei.opennms.org/udpgen/testTrap</uei>
    <event-label>udpgen test trap</event-label>
    <descr>Sample Event %parm[all]%</descr>
    <logmsg dest="logndisplay">Sample Event %parm[all]%</logmsg>
    <severity>Warning</severity>
  </event>
</events>
```

If you want to make the tests more interesting, add the following to the above definition:

```xml=
<alarm-data reduction-key="%uei%:%dpname%:%nodeid%" alarm-type="3" auto-clean="false"/>
```

The [Hawtio](https://hawt.io/) UI in OpenNMS can help visualize the relevant JMX metrics and understand what's circulating between OpenNMS and the Minions. For OpenNMS, Hawtio is available at `:8980/hawtio` if the package `opennms-webapp-hawtio` was installed (which is the case with the `cloud-init` template used). For Minions, Hawtio is available at `:8181/hawtio`.

## Troubleshooting

As mentioned, if time is not synchronized across all the instances, the Heartbeat messages sent by Minions via the Sink API won't be processed properly by OpenNMS, leading to the Minions not being registered or to outages in the `Minion-Heartbeat` service.

We can inspect the traffic on the topics to see if a Minion is sending (or receiving) traffic to Kafka. However, as the payload is encoded within a Protobuf message, the console consumer might not be as useful as we'd expect. Still, it works for troubleshooting purposes. For instance, from one of the Kafka brokers, we can do:

```bash=
/opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server $(hostname):9092 \
  --topic OpenNMS.Sink.Heartbeat
```

And we'll see:

```
$bce7b13e-d575-40b9-989a-3b5c6e7432c2 ~<minion>
   <id>minion01</id>
   <location>Durham</location>
   <timestamp>2021-03-26T12:19:55.752-07:00</timestamp>
</minion>
```

As we can see, the actual payload within the Protobuf message is an indented XML. The following application can be used to properly inspect the content without worrying about the non-readable parts of the Protobuf format:

https://github.com/agalue/onms-kafka-ipc-receiver

For RPC in particular, we can access the Karaf Shell on the OpenNMS instance and use the `opennms:stress-rpc` command to verify communication against the Minions of a given location, or against a specific Minion, and, as the command name implies, to perform stress tests.
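For reference, the Karaf Shell on OpenNMS is reachable from the server itself via SSH on port 8101 (a quick sketch, assuming the default `admin` credentials):

```bash=
ssh -p 8101 admin@localhost
# Then, within the Karaf Shell, for instance:
# opennms:stress-rpc --help
```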
### Useful Kafka Commands

For recent versions of Kafka, the following can help to get details about topics, lag, consumer groups, and so on.

To verify the topic partitions and replica settings:

```bash=
topics=$(/opt/kafka/bin/kafka-topics.sh --list --bootstrap-server $(hostname):9092)
for topic in $topics; do
  /opt/kafka/bin/kafka-topics.sh \
    --bootstrap-server $(hostname):9092 \
    --describe --topic $topic
done
```

To verify the current topic-level settings:

```bash=
topics=$(/opt/kafka/bin/kafka-topics.sh --list --bootstrap-server $(hostname):9092)
for topic in $topics; do
  /opt/kafka/bin/kafka-configs.sh \
    --bootstrap-server $(hostname):9092 \
    --describe --entity-type topics --entity-name $topic --all
done
```

To verify offsets, topic lag, and consumer groups:

```bash=
/opt/kafka/bin/kafka-consumer-groups.sh \
  --bootstrap-server $(hostname):9092 \
  --describe --all-groups --all-topics
```

:::warning
When enabling security (either SASL or TLS), you need to pass those settings to the commands.
:::

For instance, if you have SASL enabled, you should pass:

```=
--command-config /opt/kafka/config/consumer.properties
```

Where the content of `consumer.properties` would be:

```=
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="opennms" password="0p3nNM5";
```

For older versions of Kafka, the equivalent commands are the following.

To verify the topic partitions and replica settings:

```bash=
topics=$(/opt/kafka/bin/kafka-topics.sh --list --zookeeper $(hostname):2181)
for topic in $topics; do
  /opt/kafka/bin/kafka-topics.sh \
    --zookeeper $(hostname):2181 \
    --describe --topic $topic
done
```

To verify the current topic-level settings:

```bash=
topics=$(/opt/kafka/bin/kafka-topics.sh --list --zookeeper $(hostname):2181)
for topic in $topics; do
  /opt/kafka/bin/kafka-configs.sh \
    --zookeeper $(hostname):2181 \
    --describe --entity-type topics --entity-name $topic
done
```

To verify offsets, topic lag, and consumer groups:

```bash=
groups=$(/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $(hostname):9092 --list)
for group in $groups; do
  /opt/kafka/bin/kafka-consumer-groups.sh \
    --bootstrap-server $(hostname):9092 \
    --describe --all-topics --group $group
done
```

:::danger
When passing the ZK host to `--zookeeper`, it has to be consistent with how `zookeeper.connect` was defined on each Kafka broker. If you used something like `zk1:2181,zk2:2181/kafka`, you should then pass `--zookeeper $(hostname):2181/kafka` instead.
:::

## Sharing Kafka across multiple OpenNMS-Minion sets

In big environments, it is common to have multiple OpenNMS instances, each with its own fleet of Minions to monitor one of multiple data centers or a section of it. In those scenarios, it is common to have a centralized Kafka cluster shared across all of them (for more information, follow [this](https://hackmd.io/@agalue/rJu_amaWE) link).

The above solution has to be modified to ensure each set of OpenNMS and Minions uses its own set of topics in Kafka to avoid collisions. The topics' prefix (which defaults to `OpenNMS`) can be controlled via a system-wide property called Instance ID (a.k.a. `org.opennms.instance.id`). We must configure this property in both places: for OpenNMS, add it to a property file inside `$OPENNMS_HOME/etc/opennms.properties.d`; for a Minion, add it to `$MINION_HOME/etc/custom.system.properties`.
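For this deployment, a minimal sketch would look like the following (the instance ID `NMS1` and the file name `instance-id.properties` are illustrative; pick one ID per OpenNMS-Minion set, and restart both afterwards):

```bash=
# On the OpenNMS server:
echo "org.opennms.instance.id=NMS1" | \
  sudo tee /etc/opennms/opennms.properties.d/instance-id.properties

# On each Minion:
echo "org.opennms.instance.id=NMS1" | \
  sudo tee -a /etc/minion/custom.system.properties
```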
## Add a Load Balancer in front of the Minions (Optional)

In production, when having multiple Minions per location, it is a good practice to put a Load Balancer in front of them so that the devices can use a single destination for SNMP Traps, Syslog, and Flows.

The following creates a [cloud-init](https://cloudinit.readthedocs.io/en/latest/) template for Ubuntu to start a basic LB using `nginx` through `multipass`, for SNMP Traps (with a listener on port 162) and Syslog Messages (with a listener on port 514). Save the template at `/tmp/nginx-template.yaml`:

```yaml=
#cloud-config
package_upgrade: true

packages:
  - nginx

write_files:
  - owner: root:root
    path: /etc/nginx/nginx.conf
    content: |
      user www-data;
      worker_processes auto;
      pid /run/nginx.pid;
      include /etc/nginx/modules-enabled/*.conf;
      events {
        worker_connections 768;
      }
      stream {
        upstream syslog_udp {
          server $MINION_IP1:1514;
          server $MINION_IP2:1514;
        }
        upstream trap_udp {
          server $MINION_IP1:1162;
          server $MINION_IP2:1162;
        }
        server {
          listen 514 udp;
          proxy_pass syslog_udp;
          proxy_responses 0;
        }
        server {
          listen 162 udp;
          proxy_pass trap_udp;
          proxy_responses 0;
        }
      }

runcmd:
  - systemctl restart nginx
```

:::info
Note the usage of environment variables within the YAML template. We will substitute them before creating the VM.
:::

Then, update the template and create the LB:

```bash=
export MINION_ID1="minion01"
export MINION_ID2="minion02"
export MINION_IP1=$(multipass info $MINION_ID1 | grep IPv4 | awk '{print $2}')
export MINION_IP2=$(multipass info $MINION_ID2 | grep IPv4 | awk '{print $2}')
envsubst < /tmp/nginx-template.yaml > /tmp/nginx.yaml
multipass launch -n nginx --cloud-init /tmp/nginx.yaml
echo "Load Balancer $(multipass info nginx | grep IPv4)"
```

:::warning
Flows are outside the scope of this test, as they require more configuration on Minions and OpenNMS, besides having an Elasticsearch cluster up and running with the required plugin in place.
:::

## Securing Zookeeper and Kafka

The above procedure uses Kafka and Zookeeper in plain text, without authentication or encryption. That works for testing purposes, or perhaps for private clusters where access to the servers is restricted and audited. This example, in particular, exposes Kafka to the Internet, which requires having at least authentication in place. The following explains how to enable authentication, and then the steps to enable encryption. For a more comprehensive guide, follow [this](https://docs.confluent.io/platform/current/security/security_tutorial.html) tutorial from Confluent.

### Authentication

This section explains how to enable authentication using [SASL](https://en.wikipedia.org/wiki/Simple_Authentication_and_Security_Layer) with [SCRAM-SHA-512](https://docs.confluent.io/platform/current/kafka/authentication_sasl/authentication_sasl_scram.html) for Kafka and `DIGEST` for Zookeeper (as Zookeeper doesn't support `SCRAM`). Because this guide's intention is learning, I decided to add security as a separate, optional module, due to the extra complexity associated with this advanced topic.

Here are the high-level changes:

* Create the SCRAM credentials for Kafka through one of the brokers. The credentials are stored in Zookeeper.
* Update `server.properties` and the `systemd` service definition on each Kafka broker to enable and use SASL.
* Update `zookeeper.properties` and the `systemd` service definition on each ZK instance to enable and use SASL.
* Stop the Kafka cluster, restart the Zookeeper cluster, then start the Kafka cluster.
* Update OpenNMS to use SASL for the Sink API, the RPC API, and the Kafka Producer, and restart.
* Update each Minion to use SASL for the Sink API and the RPC API, and restart.
Access one of the brokers and execute the following command:

```bash=
ONMS_USER="opennms"    # To be used by Kafka, OpenNMS and Minions
ONMS_PASSWD="0p3nNM5;" # To be used by Kafka, OpenNMS and Minions

/opt/kafka/bin/kafka-configs.sh --bootstrap-server $(hostname):9092 \
  --alter \
  --add-config "SCRAM-SHA-256=[password=$ONMS_PASSWD],SCRAM-SHA-512=[password=$ONMS_PASSWD]" \
  --entity-type users \
  --entity-name $ONMS_USER
```

On each Zookeeper instance, update `zookeeper.properties` to enable SASL:

```bash=
cat <<EOF | sudo tee -a /opt/kafka/config/zookeeper.properties
authProvider.sasl=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl
EOF
```

On each Kafka broker instance, update `server.properties` to enable SASL/SCRAM:

```bash=
sudo sed -i -r '/listener.security.protocol.map/d' /opt/kafka/config/server.properties

cat <<EOF | sudo tee -a /opt/kafka/config/server.properties
# Enable Security
listener.security.protocol.map=INSIDE:SASL_PLAINTEXT,OUTSIDE:SASL_PLAINTEXT
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
sasl.enabled.mechanisms=SCRAM-SHA-256,SCRAM-SHA-512
EOF
```

Note that `listener.security.protocol.map` already exists in that file, which is why I removed it prior to adding the required changes.

:::warning
In theory, there is no need to enable both `SCRAM-SHA-256` and `SCRAM-SHA-512`. I did that for compatibility purposes, but I'll use `SCRAM-SHA-512` for all subsequent configurations.
:::

On each Zookeeper instance, create the `JAAS` configuration file with the credentials:

```bash=
ZK_USER="zkonms"
ZK_PASSWD="zk0p3nNM5;"

cat <<EOF | sudo tee /opt/kafka/config/zookeeper_jaas.conf
Server {
  org.apache.zookeeper.server.auth.DigestLoginModule required
  user_$ZK_USER="$ZK_PASSWD";
};
EOF

sudo chown kafka:kafka /opt/kafka/config/zookeeper_jaas.conf
sudo chmod 0600 /opt/kafka/config/zookeeper_jaas.conf
```

On each Kafka broker, create the `JAAS` configuration file with the credentials:

```bash=
ZK_USER="zkonms"       # Must match zookeeper_jaas.conf
ZK_PASSWD="zk0p3nNM5;" # Must match zookeeper_jaas.conf
ONMS_USER="opennms"    # Must match the SCRAM user
ONMS_PASSWD="0p3nNM5;" # Must match the SCRAM password

cat <<EOF | sudo tee /opt/kafka/config/kafka_jaas.conf
KafkaServer {
  org.apache.kafka.common.security.scram.ScramLoginModule required
  username="$ONMS_USER"
  password="$ONMS_PASSWD";
};
Client {
  org.apache.zookeeper.server.auth.DigestLoginModule required
  username="$ZK_USER"
  password="$ZK_PASSWD";
};
EOF

sudo chown kafka:kafka /opt/kafka/config/kafka_jaas.conf
sudo chmod 0600 /opt/kafka/config/kafka_jaas.conf
```

On each Zookeeper instance, update the `systemd` service definition to load the JAAS settings via `KAFKA_OPTS`:

```bash=
OPTS='Environment="KAFKA_OPTS=-Djava.security.auth.login.config=/opt/kafka/config/zookeeper_jaas.conf"'
sudo sed -i -r -e "/^ExecStart=.*/i $OPTS" /etc/systemd/system/zookeeper.service
sudo systemctl daemon-reload
```

On each Kafka broker, update the `systemd` service definition to load the JAAS settings via `KAFKA_OPTS`:

```bash=
OPTS='Environment="KAFKA_OPTS=-Djava.security.auth.login.config=/opt/kafka/config/kafka_jaas.conf"'
sudo sed -i -r -e "/^ExecStart=.*/i $OPTS" /etc/systemd/system/kafka.service
sudo systemctl daemon-reload
```

Restart the cluster in the following order:

* Stop Kafka on each server.
* Restart Zookeeper on each server.
* Start Kafka on each server.

At this point, you should pass the SASL credentials to all the Kafka CLI tools.
For instance:

```bash=
ONMS_USER="opennms"    # Must match the SCRAM user
ONMS_PASSWD="0p3nNM5;" # Must match the SCRAM password

cat <<EOF | sudo tee -a /opt/kafka/config/consumer.properties
# Security
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF

/opt/kafka/bin/kafka-topics.sh --list \
  --bootstrap-server $(hostname):9092 \
  --command-config /opt/kafka/config/consumer.properties
```

Note how we pass the consumer settings. The above command should list all the topics in the cluster. If you can see the list, then SASL is working. Keep in mind that without `--command-config`, the command will time out, as the tool cannot communicate with Kafka without the credentials.

On the OpenNMS instance, update `/etc/opennms/opennms.properties.d/kafka.properties` and `/etc/opennms/org.opennms.features.kafka.producer.client.cfg` to use SASL, and restart OpenNMS. For instance:

```bash=
ONMS_USER="opennms"    # Must match the SCRAM user
ONMS_PASSWD="0p3nNM5;" # Must match the SCRAM password

for module in sink rpc; do
  cat <<EOF | sudo tee -a /etc/opennms/opennms.properties.d/kafka.properties
# Security for $module
org.opennms.core.ipc.$module.kafka.security.protocol=SASL_PLAINTEXT
org.opennms.core.ipc.$module.kafka.sasl.mechanism=SCRAM-SHA-512
org.opennms.core.ipc.$module.kafka.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF
done

cat <<EOF | sudo tee -a /etc/opennms/org.opennms.features.kafka.producer.client.cfg
# Security
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF

sudo systemctl restart opennms
```

On each Minion, update `/etc/minion/org.opennms.core.ipc.sink.kafka.cfg` and `/etc/minion/org.opennms.core.ipc.rpc.kafka.cfg` to use SASL, and restart the Minion. For instance:

```bash=
ONMS_USER="opennms"    # Must match the SCRAM user
ONMS_PASSWD="0p3nNM5;" # Must match the SCRAM password

for module in sink rpc; do
  cat <<EOF | sudo tee -a /etc/minion/org.opennms.core.ipc.$module.kafka.cfg
# Security
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="$ONMS_USER" password="$ONMS_PASSWD";
EOF
done

sudo systemctl restart minion
```

The solution works from the OpenNMS and Minion perspective, despite the following message appearing repeatedly in `/opt/kafka/logs/server.log` on all brokers:

```
[2021-04-11 12:35:56,486] INFO [SocketServer brokerId=2] Failed authentication with /13.0.1.7 (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector)
```

Where `13.0.1.7` is the IP of the OpenNMS server.

:::success
At this point, we have `SASL` authentication enabled, using `SCRAM-SHA-512` for Kafka and `DIGEST` for Zookeeper, meaning credentials might be hard (although perhaps not impossible) to crack when intercepting traffic. However, to make it more secure, encryption is recommended.
:::

:::warning
If you already configured `CMAK`, make sure to enable the SASL/SCRAM mechanism for your cluster.
:::

### Encryption

:::warning
Please keep in mind that enabling SSL/TLS will increase CPU demand on each broker and the clients, which is why using OpenJDK 11 over JDK 8 is encouraged.
:::
To enable TLS, and because each Kafka broker must be exposed and reachable through a public DNS entry, I'm going to use [LetsEncrypt](https://letsencrypt.org/) to generate the certificates. That will save a few steps because the certificates will be publicly valid, so we won't need to set up a Trust Store, which is mandatory when using private CAs or self-signed certificates for every entity that touches Kafka directly or indirectly.

The [Certbot](https://certbot.eff.org/lets-encrypt/ubuntubionic-other) utility used to create and validate the certificate will start a temporary web server on the instance (for the validation process). For this reason, we should temporarily allow access through port TCP 80:

```bash=
for i in $(seq 1 $KAFKA_CLUSTER_SIZE); do
  VM_NAME="$PREFIX-kafka-$i"
  az vm open-port -g $RG_NAME -n $VM_NAME \
    --port 80 --priority 101 --output table
done
```

Then, on each Kafka broker (one by one), we must do the following to enable TLS:

```bash=
FQDN="$(hostname).eastus.cloudapp.azure.com"
EMAIL="owner@example.com"
PASSWD="0p3nNM5"

sudo snap install --classic certbot
sudo ln -s /snap/bin/certbot /usr/bin/certbot
sudo certbot certonly --standalone -d $FQDN -m $EMAIL \
  --non-interactive --agree-tos

TEMP_P12="/tmp/ssl.p12.$(date +%s)"
TEMP_KEYSTORE="/tmp/ssl.keystore.$(date +%s)"
TARGET_KEYSTORE="/opt/kafka/config/letsencrypt.jks"

sudo openssl pkcs12 -export \
  -in /etc/letsencrypt/live/$FQDN/fullchain.pem \
  -inkey /etc/letsencrypt/live/$FQDN/privkey.pem \
  -out $TEMP_P12 -name kafka -password "pass:$PASSWD"

sudo keytool -importkeystore -alias kafka \
  -deststorepass "$PASSWD" -destkeypass "$PASSWD" -destkeystore $TEMP_KEYSTORE \
  -srckeystore $TEMP_P12 -srcstoretype PKCS12 -srcstorepass "$PASSWD"

sudo cp $TEMP_KEYSTORE $TARGET_KEYSTORE
sudo chmod 440 $TARGET_KEYSTORE
sudo chown kafka:kafka $TARGET_KEYSTORE
sudo rm -f $TEMP_P12 $TEMP_KEYSTORE

CONFIG="/opt/kafka/config/server.properties"
sudo sed -i -r '/listener.security.protocol.map/d' $CONFIG

cat <<EOF | sudo tee -a $CONFIG
listener.security.protocol.map=INSIDE:SASL_PLAINTEXT,OUTSIDE:SASL_SSL
ssl.keystore.location=$TARGET_KEYSTORE
ssl.keystore.password=$PASSWD
ssl.key.password=$PASSWD
EOF

sudo systemctl restart kafka
```

:::warning
Please use your own email, and keep in mind that the Azure location is hardcoded in the command; if you're using a different one, update the FQDN.
:::

Note that SSL was only enabled for the `OUTSIDE` listener, meaning we should only modify the Minions (and `listener.security.protocol.map` was changed because of that). OpenNMS won't use it, as it lives in the same protected network as the Kafka cluster.

To verify, you can retrieve the broker configuration via Zookeeper:

```bash=
/opt/kafka/bin/zookeeper-shell.sh $(hostname) get /brokers/ids/1 | egrep '^\{' | jq
```

If everything went well, you should get something like this:

```json=
{
  "features": {},
  "listener_security_protocol_map": {
    "INSIDE": "SASL_PLAINTEXT",
    "OUTSIDE": "SASL_SSL"
  },
  "endpoints": [
    "INSIDE://agalue-kafka-1:9092",
    "OUTSIDE://agalue-kafka-1.eastus.cloudapp.azure.com:9094"
  ],
  "jmx_port": 9999,
  "port": -1,
  "host": null,
  "version": 5,
  "timestamp": "1622658498210"
}
```

Note that `SASL_SSL` now applies to the `OUTSIDE` listener.
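Keep in mind that Let's Encrypt certificates expire after roughly 90 days. `certbot renew` refreshes the PEM files (TCP port 80 must remain reachable for the standalone challenge), but Kafka reads the JKS keystore, so the conversion steps must be repeated afterwards; a minimal sketch:

```bash=
sudo certbot renew
# Repeat the openssl pkcs12 export and keytool import shown above, then:
sudo systemctl restart kafka
```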
Now it is time to update the Minions. On each Minion, do the following:

```bash=
for module in sink rpc; do
  cfg="/etc/minion/org.opennms.core.ipc.$module.kafka.cfg"
  sudo sed -i -r '/security.protocol/s/SASL_PLAINTEXT/SASL_SSL/' $cfg
done
sudo systemctl restart minion
```

While you're there, you can check whether TLS is actually enabled by running:

```bash=
openssl s_client -connect agalue-kafka-1.eastus.cloudapp.azure.com:9094
```

There is no need to modify anything else, as we're using valid certificates signed by a well-known public entity. When using private certificates or private CAs, you would have to create a Trust Store via `keytool` for the clients and the brokers.

:::info
As a challenge to the reader, update `/tmp/kafka-template.yaml`, `/tmp/opennms-template.yaml`, and `/tmp/minion-template.yaml` to include all the SASL and SSL/TLS configuration, and start the whole environment from scratch with authentication and encryption enabled.
:::

## Securing OpenNMS

The following is inspired by [this](https://hackmd.io/@agalue/HyGyD0diN) guide to enable TLS with [Nginx](https://www.nginx.com/) for the OpenNMS WebUI and Grafana. However, as we're using Ubuntu here, I'll describe the required changes.

Allow access via TCP 80 and 443:

```bash=
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME --port 443 --priority 110 -o table
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME --port 80 --priority 120 -o table
```

SSH into the OpenNMS server and then:

```bash=
export EMAIL="user@example.com"
export LOCATION=$(curl -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/instance?api-version=2021-02-01" 2>/dev/null | jq -r '.compute.location')
export FQDN=$(hostname).$LOCATION.cloudapp.azure.com

sudo apt install -y nginx
sudo mkdir -p /var/www/$FQDN/.well-known
sudo chown www-data:www-data /var/www/$FQDN

cfg="/etc/nginx/sites-available/default"
cat <<EOF | sudo tee $cfg
server {
  listen 80;
  server_name $FQDN;

  # maintain the .well-known directory alias for lets encrypt renewals
  location /.well-known {
    alias /var/www/$FQDN/.well-known;
  }

  location /hawtio/ {
    proxy_pass http://localhost:8980/hawtio/;
  }

  location /grafana/ {
    proxy_pass http://localhost:3000/;
  }

  location /opennms/ {
    proxy_set_header Host \$host;
    proxy_set_header X-Real-IP \$remote_addr;
    proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto \$scheme;
    proxy_set_header Upgrade \$http_upgrade;
    proxy_set_header Connection "Upgrade";
    proxy_pass http://localhost:8980/opennms/;
    proxy_redirect default;
    proxy_read_timeout 90;
  }
}
EOF

sudo systemctl restart nginx
sudo systemctl enable nginx

sudo snap install core
sudo snap refresh core
sudo snap install --classic certbot
sudo ln -s /snap/bin/certbot /usr/bin/certbot
sudo certbot --nginx -d $FQDN --non-interactive --agree-tos -m $EMAIL

cat <<EOF | sudo tee /etc/opennms/opennms.properties.d/webui.properties
org.opennms.netmgt.jetty.host = 127.0.0.1
opennms.web.base-url = https://%x%c/
EOF

sudo systemctl restart opennms

sudo sed -i -r "s|^;domain =.*|domain = $FQDN|" /etc/grafana/grafana.ini
sudo sed -i -r "s|^;root_url =.*|root_url = %(protocol)s://%(domain)s:%(http_port)s/grafana/|" /etc/grafana/grafana.ini
sudo systemctl restart grafana-server
```

:::warning
Make sure to use valid content for `$EMAIL`, as that's required by LetsEncrypt (as we did for Kafka).
:::

Note that `cmak` (or Kafka Manager) is not present, due to the complexity of getting it to work behind a proxy.
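To confirm the proxy and the certificate work, a quick check from any machine (replace the FQDN with yours):

```bash=
curl -sI https://agalue-onms01.eastus.cloudapp.azure.com/opennms/ | head -n 1
```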
You can remove the NSG rules for ports 8980 and 3000:

```bash=
az network nsg rule delete -g $RG_NAME \
  --nsg-name ${ONMS_VM_NAME}NSG -n open-port-8980

az network nsg rule delete -g $RG_NAME \
  --nsg-name ${ONMS_VM_NAME}NSG -n open-port-3000
```

## Tracing

:::warning
Work in progress...
:::

Some circumstances could introduce unexpected behavior into the solution. Besides the traditional monitoring that ensures all the components are behaving as expected in terms of CPU, memory, Java heap, Java GC, and IO (covered as part of this tutorial), you sometimes need to dig deeper to understand what's happening.

OpenNMS added [OpenTracing](https://opentracing.io/) support via [Jaeger](https://www.jaegertracing.io/) to understand how much time messages sent via the broker take to be produced and consumed. The [official documentation](https://docs.opennms.com/horizon/28.0.1/deployment/opentracing/jaeger-tracing.html) has a guide on how to configure it.

As we have Docker running on the OpenNMS server, we can easily start an All-In-One Jaeger instance through it. To do that, SSH into the OpenNMS server and run the following:

```bash=
docker run -d --name jaeger \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 16686:16686 \
  jaegertracing/all-in-one:1.24
```

OpenNMS has direct access, as Jaeger runs on the same machine and is reachable via localhost, and should be configured as instructed in the official docs. For the Minions, you would need to open the UDP ports 6831 and 6832 in the NSG associated with the OpenNMS server, as well as TCP 16686 to access the Jaeger WebUI:

```bash=
az vm open-port -g $RG_NAME -n $ONMS_VM_NAME \
  --port 6831-6832,16686 --priority 400 --output table
```

Then, configure the Minions as instructed in the official docs, using the OpenNMS FQDN and the ports mentioned above.

## Clean Up

When we're done, make sure to delete the cloud resources. If you created the resource group for this exercise, you can remove all the resources with the following command:

```bash=
az group delete -g $RG_NAME
```

If you're using an existing resource group that you cannot remove, make sure to remove only the resources created in this tutorial. All of them should be easily identified, as they contain the username and the VM name as part of the resource name. The easiest way is to use the Azure Portal for this operation. Alternatively:

```bash=
IDS=($(az resource list \
  --resource-group $RG_NAME \
  --query "[?contains(name,'$PREFIX-') && type!='Microsoft.Compute/disks']".id \
  --output tsv | tr '\n' ' '))
for id in "${IDS[@]}"; do
  echo "Removing $id"
  az resource delete --ids "$id" --verbose
done

DISKS=($(az resource list \
  --resource-group $RG_NAME \
  --query "[?contains(name,'$PREFIX-') && type=='Microsoft.Compute/disks']".id \
  --output tsv | tr '\n' ' '))
for id in "${DISKS[@]}"; do
  echo "Removing $id"
  az resource delete --ids "$id" --verbose
done
```

The reason for two deletion passes is that the disks cannot be removed before their VMs; we exclude the disks in the first pass and remove them in the second. Note that because all the resource names are prefixed with the chosen username, we can use it to identify and remove them uniquely.

Then clean the local resources:

```bash=
multipass delete $MINION_ID1 $MINION_ID2
multipass purge
```

:::warning
Remember to remove the `nginx` instance if you decided to use it.
:::
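For instance:

```bash=
multipass delete nginx
multipass purge
```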