Sizing Cassandra
====

The use case for having Cassandra (or ScyllaDB) as the backend for the performance metrics in OpenNMS is storing a huge amount of non-aggregated data, which is not possible with RRDtool. RRDtool is very good for installations with a finite and predictable number of metrics, where the size and the I/O requirements are feasible for modern SSD disks. This is important because RRDtool only scales vertically, meaning that when the current disks' limits are reached (mostly due to speed, not space), a faster disk is required. This is when Cassandra or ScyllaDB can help.

Both applications have a steep learning curve, and using either of them requires a commitment to have qualified personnel managing the database.

ScyllaDB is binary compatible with Cassandra (even at the SSTable level), but they are implemented very differently. Cassandra is implemented in Java, meaning that JVM tuning is required on top of Cassandra's own internal tuning. ScyllaDB, on the other hand, is implemented in modern C++ and takes full advantage of the CPU where it runs. That means ScyllaDB can manage huge machines as single nodes, whereas Cassandra would require multiple instances on such machines. In terms of performance, it is feasible to get faster results with ScyllaDB than with its Java sibling.

Configuring and managing the two applications is different, even though they provide the same operational result with OpenNMS, so the decision should be carefully analyzed, especially by the team that will support the database. Consider the names Cassandra and ScyllaDB interchangeable; to simplify the upcoming discussion, the term Cassandra will be used.

# Sizing Terms

When sizing Cassandra, we need to know the following:

* Number of Nodes
* Replication Factor
* Single DC or Multi-DC environment
* Number of disks per node
* Total disk space per node
* Total retention (TTL)
* Average Sample Size
* Injection Rate

Note that the last two (the Injection Rate and the Average Sample Size) are the only operational requirements that are not easy to estimate: the user would need to know exactly how many metrics will be collected at the chosen collection interval, as well as the size in bytes that a given sample takes in the samples table at the SSTable level. The following sections explain how to size the cluster, but we will first discuss the injection rate.

## Evaluation Layer

Since it is very common not to know the number of metrics to be collected, the evaluation layer has been implemented in OpenNMS. It performs data collection as usual against the expected inventory, but only to "count" the number of elements involved. To use the evaluation layer, the following change is required:

```bash=
cat <<EOF > /opt/opennms/etc/opennms.properties.d/timeseries.properties
org.opennms.rrd.storeByGroup=true
org.opennms.timeseries.strategy=evaluate
EOF
```

To emulate Newts' behavior, we need to enable `storeByGroup`, as the evaluation layer can work with and without it. Then, restart OpenNMS.

It is crucial to know that no data will be stored on disk while this feature is enabled, so it is worth considering a test or development server (with characteristics and access similar to the production one). Doing this on a production server would introduce gaps in the graphs due to this feature's nature.

This feature is going to count the following:

* Number of Nodes involved in data collection
* Number of IP Interfaces involved in persisting response time data from the poller
* Number of unique MibObj Groups (based on the active `datacollection-config.xml`)
* Number of unique OpenNMS resources
* Number of unique numeric metrics
* Number of unique string-based metrics
* Injection rate (for numeric metrics)
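Once OpenNMS is restarted with the `evaluate` strategy, these counters are reported periodically through the metrics logger. A quick way to pull the latest readings is to search the OpenNMS logs; the exact log file that receives these entries is an assumption here (it varies by version and logging configuration), hence the wildcard:

```bash=
# Show the most recent evaluation counters from the logs (log file name may vary)
grep -h "EvaluationMetrics" /opt/opennms/logs/*.log | tail -n 10
```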
Here is an example:

```bash=
2016-05-23 06:03:12,374 INFO [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.groups, value=1341107
2016-05-23 06:03:12,374 INFO [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.interfaces, value=0
2016-05-23 06:03:12,374 INFO [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.nodes, value=6883
2016-05-23 06:03:12,374 INFO [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.numeric-attributes, value=4499569
2016-05-23 06:03:12,374 INFO [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.resources, value=507456
2016-05-23 06:03:12,374 INFO [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.string-attributes, value=1904879
2016-05-23 06:03:12,374 INFO [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=METER, name=evaluate.samples, count=163832495, mean_rate=9415.655559643415, m1=7256.328061613966, m5=9467.944242318974, m15=9550.126418154872, rate_unit=events/second
```

From the above example, we can easily conclude that in the environment where the evaluation layer was running, the injection rate is about 9500 samples per second. The other values are also beneficial, as the two main settings required for Newts (the resource cache size and the ring buffer; more on this below) are derived from them.

### Estimations based on RRD/JRB files

When it is impossible to execute the evaluation layer for some reason, it is still possible to obtain estimates based on the RRD/JRB files that OpenNMS is currently updating on the production server. I wrote a tool called [newts-sizing](https://github.com/agalue/newts-sizing) that can help in this regard. Note that if `storeByGroup` is not enabled, the tool will only report the total number of metrics; it cannot report the number of Newts resources (or groups) required to size the resource cache.
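For a quick first approximation without any tooling, the on-disk files can also be counted directly. The sketch below assumes the default storage directory (`/opt/opennms/share/rrd`) and JRobin (`.jrb`) files; with `storeByGroup=true`, each file under `snmp/` corresponds to a group rather than a single metric:

```bash=
# Rough counts from the on-disk repository (default paths and JRobin files assumed;
# use '*.rrd' instead of '*.jrb' when RRDtool is the current strategy)
echo "response-time files: $(find /opt/opennms/share/rrd/response -name '*.jrb' | wc -l)"
echo "snmp files:          $(find /opt/opennms/share/rrd/snmp -name '*.jrb' | wc -l)"
```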
## Newts Caches

As shown before, the evaluation layer gives us the injection rate and important information about the expected resources and metrics.

Two caches must be configured when using Newts, and the memory they use is pre-reserved from the Java heap when OpenNMS starts. That means the effective heap available for the rest of the OpenNMS daemons is the total heap minus the size of these caches:

* Ring Buffer
* Resource Cache

The ring buffer size has to be a power of 2 due to how this cache works.

When Newts is chosen as the persistence layer and Collectd is gathering metrics from the target devices, at the end of each collection attempt the data is passed from the collector implementation (for example, the SNMP Collector) to the persistence layer. In Newts' case, the data is added to the ring buffer, and the persistence operation finishes so that Collectd can schedule the next collection attempt. Then, the configured "write threads" extract data from the ring buffer and push it to Cassandra using the Newts library and the DataStax driver (which is compatible with both Cassandra and ScyllaDB).

The resource cache is built during the persistence phase. Entries are added or removed according to the incoming `CollectionSets`. This cache should be sized to cover all the entries, as it is used to accelerate other Newts-related features, like enumerating resources, metrics, and string attributes for graphing purposes.

Unfortunately, filling up this cache is a costly and intensive operation when OpenNMS starts. It is intensive because OpenNMS performs what's called "Newts Indexing" at a very high rate against Cassandra, either writing new entries or reading existing ones. While this is happening, the "Newts Writing" speed is affected, meaning that lots of entries will live in the ring buffer until the indexing is completed. For this reason, the ring buffer must be sized to accommodate all the metrics temporarily, even if it was designed for a different purpose. Once the indexing is done, the ring buffer will be barely used unless Cassandra has a problem (or becomes slow for some reason).

The capacity of the Cassandra cluster dictates how fast it can write and read data. Depending on how fast it is, it can complete the indexing phase in a short time, or this period can take several minutes. The bigger the cluster, the faster it will be, but the idea is to size the cluster based on the data to be stored rather than to accommodate the indexing phase. On big installations, a big cluster can be easily justified, so the cluster's injection-rate capacity might be fast enough anyway. Unfortunately, how fast a cluster can receive data for writing cannot be easily estimated, as it depends on multiple factors. For this reason, after choosing the technology (ScyllaDB or Cassandra) and the hardware (even if it is an estimate), field tests must be executed to understand whether the cluster is fast enough. Fortunately, OpenNMS provides a tool for this purpose.

Back to the resource cache, here is how to estimate its size. The data stored in the resource cache is the number of resources plus the number of unique groups. On average, each entry takes about 1KB, meaning the configured number of entries can be read as the number of kilobytes taken from the heap. A similar rule applies to the ring buffer.

For example, let's say you have one router with 2 physical interfaces using the default snmp-collection. You are going to have the following entries in the cache:

```=
response:10.0.0.1:icmp
response:11.0.0.1:icmp
response:12.0.0.1:icmp
snmp:fs:Office:router:mib2-tcp
snmp:fs:Office:router:juniper-fwdd-process
snmp:fs:Office:router:ge_0_0
snmp:fs:Office:router:ge_0_0:mib2-X-interfaces
snmp:fs:Office:router:ge_0_0:mib2-X-interfaces-pkts
snmp:fs:Office:router:ge_0_0:mib2-interface-errors
snmp:fs:Office:router:ge_0_1
snmp:fs:Office:router:ge_0_1:mib2-X-interfaces
snmp:fs:Office:router:ge_0_1:mib2-X-interfaces-pkts
snmp:fs:Office:router:ge_0_1:mib2-interface-errors
```

From the list, the resource cache is going to have 13 entries for this device: the first 3 come from the poller (response time for ICMP on each IP of the device), then the groups associated with the node-level resource, and then each interface set (one entry for the interface itself and one entry for each MibObj group).
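To generalize that example, a back-of-the-envelope estimate can be computed per device and then multiplied by the number of similar devices. The sketch below is an assumption-driven illustration (the counts mirror the router example above; adjust them to your own datacollection configuration):

```bash=
DEVICES=1000            # number of similar routers (assumed)
IPS_PER_DEVICE=3        # response-time entries, one per monitored IP
NODE_GROUPS=2           # MibObj groups at the node level
INTERFACES=2            # interfaces per device
GROUPS_PER_IFACE=3      # MibObj groups collected per interface

PER_DEVICE=$((IPS_PER_DEVICE + NODE_GROUPS + INTERFACES * (1 + GROUPS_PER_IFACE)))
echo "cache entries per device: $PER_DEVICE"              # 13, as in the example
echo "cache entries for fleet : $((DEVICES * PER_DEVICE))"
```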
Back to the results from the evaluation layer:

```=
type=GAUGE, name=evaluate.groups, value=1341107
type=GAUGE, name=evaluate.interfaces, value=0
type=GAUGE, name=evaluate.nodes, value=6883
type=GAUGE, name=evaluate.numeric-attributes, value=4499569
type=GAUGE, name=evaluate.resources, value=507456
type=GAUGE, name=evaluate.string-attributes, value=1904879
type=METER, name=evaluate.samples, count=163832495, mean_rate=9415.655559643415, m1=7256.328061613966, m5=9467.944242318974, m15=9550.126418154872, rate_unit=events/second
```

On the installation where the evaluation layer was enabled, we can infer that the size of the resource cache should be:

```=
groups + resources = 1341107 + 507456 = 1848563
```

Finally, we can round it up and configure the following in OpenNMS:

```bash=
echo "org.opennms.newts.config.cache.max_entries=2000000" >> \
  /opt/opennms/etc/opennms.properties.d/newts.properties
```

As mentioned, there is no strict rule for the ring buffer, as it depends on how fast the chosen cluster is. As a rule of thumb, a good starting point is the nearest power of 2 greater than 2 times the resource cache's size. In this particular case, that would be:

```bash=
echo "org.opennms.newts.config.ring_buffer_size=4194304" >> \
  /opt/opennms/etc/opennms.properties.d/newts.properties
```

:::warning
It is recommended to perform field tests to evaluate the impact of the Newts indexing phase after starting OpenNMS, to validate the chosen value for the ring buffer.
:::

As mentioned, each entry's size is approximately 1KB, meaning that the configured values, expressed in millions of entries, can be read as gigabytes. In other words, to complete the configuration for this deployment, and considering that over 6GB of the heap will be dedicated to these 2 caches, the total heap size should be greater than 8GB. That is the bare minimum, as OpenNMS requires at least 2GB for basic operations. For production, starting with 16GB (meaning 10GB for OpenNMS and 6GB for the caches) would be better in this particular scenario.

There are 2 implementations of the resource cache:

* org.opennms.netmgt.newts.support.GuavaSearchableResourceMetadataCache
* org.opennms.netmgt.newts.support.RedisResourceMetadataCache

Based on field tests, the Redis-based implementation is not recommended, as it proved extremely slow for production loads, judging by the time required to drain the ring buffer after the indexing is done. That is unfortunate, because a shared cache would be a desirable option, especially when having external WebUI servers for OpenNMS that otherwise won't have a cache. Without the cache, rendering the Choose Resources and Graphs pages can take a considerable amount of time, depending on how many active resource types have been configured on the system.
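Putting it together, the `newts.properties` file for this example deployment might end up looking like the following sketch (equivalent to the individual `echo` commands above, plus the standard connection, keyspace, and TTL settings). The contact points, the one-year TTL, and the 16GB heap are assumptions for this scenario:

```bash=
cat <<EOF > /opt/opennms/etc/opennms.properties.d/newts.properties
org.opennms.timeseries.strategy=newts
# Hypothetical Cassandra/ScyllaDB contact points
org.opennms.newts.config.hostname=cassandra1,cassandra2,cassandra3
org.opennms.newts.config.keyspace=newts
# One year of retention (the TTL is discussed in detail below)
org.opennms.newts.config.ttl=31540000
org.opennms.newts.config.cache.max_entries=2000000
org.opennms.newts.config.ring_buffer_size=4194304
EOF

# Heap size in MB for this scenario (16GB, as discussed above)
echo "JAVA_HEAP_SIZE=16384" >> /opt/opennms/etc/opennms.conf
```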
## Cluster Size

To estimate the cluster size, in other words:

* Number of Nodes
* Number of disks per node
* Total disk space per node

We need to know the following:

* Injection Rate or Total Number of Metrics
* Retention / TTL
* Replication Factor
* Average Sample Size

As a general rule, the more data you want to keep, the more disk space per node you will need. The replication factor facilitates high availability, but keep in mind that it is not a replacement for backing up the cluster. The minimum size for a cluster is 3 nodes with a replication factor of 2. Having a replication factor of 2 means one node can be down without losing data. The formula for how many nodes can be down at a given time is:

```=
NumberOfNodesDownSimultaneously = ReplicationFactor - 1
```

For bigger clusters, it makes sense to have a reasonably higher replication factor, but that imposes restrictions on disk space for obvious reasons: increasing the replication factor means having an additional copy of the data somewhere else.

The discussion about Multi-DC won't be covered here. Still, keep in mind that a Multi-DC deployment can serve as a disaster-recovery solution and complement a backup strategy (although it does not replace actual backups).

The retention, or TTL as we call it within OpenNMS, is how long the data will be kept on the cluster. When this time expires, the data is removed automatically. This is done through a Cassandra feature called TTL, a property associated with every single metric inserted into the cluster. The greater the retention, the more disk space will be needed.

Another factor when choosing disk space is the compaction strategy. The default compaction strategy used by Cassandra and Newts is STCS (Size Tiered Compaction Strategy). This strategy is well known for wasting disk space: to use it, each node's physical disk should be 2 times the expected data to be stored. In other words, a 50% overhead is required to perform compactions. Considering that it is mandatory to use local and ultra-fast disks (i.e., Tier 1 server-grade SSDs), a 50% overhead can be a costly feature. For this reason, it is essential to reduce the overhead, and the only way to do it is by using a different compaction strategy.

### Time Series Data

When the data to be stored in Cassandra consists of time-series metrics, meaning immutable timestamped entries, the best approach is to use TWCS (Time Window Compaction Strategy). To learn more about why TWCS makes sense, I recommend reading the following two blog posts:

* [TWCS Part 1](https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html)
* [TWCS Part 2](https://thelastpickle.com/blog/2017/01/10/twcs-part2.html)

The overhead when using this strategy can be as big as one time-windowed chunk. The chunk size depends on how TWCS is configured, but in practice it can be around 5% of the disk space. Compared with the 50% overhead required by STCS, it is clear which one is the winner. However, it is important to keep the time-series constraint in mind: the data has to be immutable. In other words, once stored, it won't be altered or manually modified, and data should only be evicted by TTL. If this is not the case, TWCS won't help as much as it should, meaning the overhead on disk space will be greater. Fortunately, the data stored by OpenNMS through Newts qualifies as time-series data, so we can use this strategy.

:::warning
The keyspace for Newts has to be created manually when a different compaction strategy or a different replication strategy will be used. That means using `cqlsh` is mandatory; the `$OPENNMS_HOME/bin/newts` facility won't work in this case and should never be used.
:::

There will be a section dedicated to the TTL, but for now let's assume a year of retention.
Here is a way to configure this strategy:

```cql=
CREATE TABLE newts.samples (
  context text,
  partition int,
  resource text,
  collected_at timestamp,
  metric_name text,
  value blob,
  attributes map<text, text>,
  PRIMARY KEY((context, partition, resource), collected_at, metric_name)
) WITH compaction = {
  'compaction_window_size': '7',
  'compaction_window_unit': 'DAYS',
  'expired_sstable_check_frequency_seconds': '86400',
  'class': 'TimeWindowCompactionStrategy'
} AND gc_grace_seconds = 604800
  AND read_repair_chance = 0;
```

The "window size" is configured through 2 settings:

* compaction_window_size
* compaction_window_unit

The reason for choosing 7 days is the following: for a 1-year retention, the number of compacted chunks will be 52 (as there are 52 weeks in a year). This is a little higher than the recommended number of chunks, but in practice it is reasonable, especially because it simplifies the calculations. For different retentions, try to target around 40 chunks or fewer.

:::warning
The compaction strategy is declared on the `samples` table, so it applies to every sample stored in the keyspace.
:::

To have the whole picture in mind, this is how the entire keyspace for Newts would look:

```cql=
CREATE KEYSPACE IF NOT EXISTS newts WITH replication = {'class' : 'SimpleStrategy', 'replication_factor' : 2 };

CREATE TABLE IF NOT EXISTS newts.samples (
  context text,
  partition int,
  resource text,
  collected_at timestamp,
  metric_name text,
  value blob,
  attributes map<text, text>,
  PRIMARY KEY((context, partition, resource), collected_at, metric_name)
) WITH compaction = {
  'compaction_window_size': '7',
  'compaction_window_unit': 'DAYS',
  'expired_sstable_check_frequency_seconds': '86400',
  'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy'
} AND gc_grace_seconds = 604800;

CREATE TABLE IF NOT EXISTS newts.terms (
  context text,
  field text,
  value text,
  resource text,
  PRIMARY KEY((context, field, value), resource)
);

CREATE TABLE IF NOT EXISTS newts.resource_attributes (
  context text,
  resource text,
  attribute text,
  value text,
  PRIMARY KEY((context, resource), attribute)
);

CREATE TABLE IF NOT EXISTS newts.resource_metrics (
  context text,
  resource text,
  metric_name text,
  PRIMARY KEY((context, resource), metric_name)
);
```

:::warning
The above assumes that no rack-awareness or multi-DC is required. If that is not the case, you would have to replace `SimpleStrategy` with `NetworkTopologyStrategy`, set the `endpoint_snitch` to `GossipingPropertyFileSnitch`, and configure the expected replication per DC.
:::

Use `cqlsh` to create the keyspace. Note that the CQL content has constraints to guarantee that the keyspace will only be created when it doesn't exist (same for the tables). One more time, **do not** use `/opt/opennms/bin/newts init`, as that will create the tables using STCS for everything (which is the default when the compaction strategy is not specified as shown above).
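For example, assuming the CQL above was saved to a file named `newts_keyspace.cql` (a hypothetical name) and that `cassandra1` is one of the contact points, the keyspace can be created and then verified like this:

```bash=
# Create the keyspace and tables from the CQL file
cqlsh cassandra1 -f newts_keyspace.cql

# Verify the resulting schema, including the compaction strategy
cqlsh cassandra1 -e "DESCRIBE KEYSPACE newts"
```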
### TTL

:::warning
This is a global setting in OpenNMS.
:::

When configuring OpenNMS, the administrator should choose one value to be used as the retention for every single metric collected on the system. Unlike with RRDtool, it is impossible to have different retention values for different metrics. If a given customer wants different retentions, they would have to configure a separate OpenNMS server with a dedicated Newts keyspace for each TTL. With such a schema in place, Grafana is the only way to get a unified view of all the metrics.

As mentioned, the retention is the amount of time a given metric will exist on the Cassandra cluster. When this time expires, the data will be evicted from the keyspace during compaction.

### Average Sample Size

Every environment is different. However, there are ways to estimate the effective size (on average) that a given row of the `newts.samples` table takes by performing some analysis. I used the stress tool (described later) to have SSTables populated on disk on a test cluster to figure out that value, and the following script to analyze the SSTables using [sstablemetadata](https://cassandra.apache.org/doc/latest/tools/sstable/sstablemetadata.html) (a tool available after installing the `cassandra-tools` package):

```bash=
#!/bin/bash
DIR=$1

array=($(find $DIR -name '*-Data.db'))
for file in "${array[@]}"; do
  echo "processing $file ..."
  data=$(sstablemetadata $file 2>/dev/null | egrep "(totalRows|Compression ratio)")
  compressionRatio=$(echo $data | awk '{print $3}')
  totalRows=$(echo $data | awk '{print $5}')
  sizeInBytes=$(stat --printf="%s" $file)
  avgRowSize=$(bc <<< "$sizeInBytes / $totalRows")
  avgRowSizeRaw=$(bc <<< "$avgRowSize * (1 + $compressionRatio)")
  echo "totalRows: $totalRows"
  echo "sizeInBytes: $sizeInBytes"
  echo "compressionRatio: $compressionRatio"
  echo "avgRowSize: $avgRowSize (compressed)"
  echo "avgRowSize: $(printf %.2f $avgRowSizeRaw) (uncompressed)"
  echo
done
```

:::warning
It is recommended to stop Cassandra before running the script (as per the recommendation in the `sstablemetadata` documentation).
:::

Here is the sample output:

```bash=
[root@cassandra1 ~]# ./process.sh /var/lib/cassandra/data/newts/samples-63c18da080e411ebbfb7f95660c4108a/
processing /var/lib/cassandra/data/newts/samples-63c18da080e411ebbfb7f95660c4108a/md-5-big-Data.db ...
totalRows: 4947072
sizeInBytes: 62167172
compressionRatio: 0.2959849504824436
avgRowSize: 12 (compressed)
avgRowSize: 15.55 (uncompressed)

processing /var/lib/cassandra/data/newts/samples-63c18da080e411ebbfb7f95660c4108a/md-11-big-Data.db ...
totalRows: 1233165
sizeInBytes: 15856941
compressionRatio: 0.29917443379137526
avgRowSize: 12 (compressed)
avgRowSize: 15.59 (uncompressed)

processing /var/lib/cassandra/data/newts/samples-63c18da080e411ebbfb7f95660c4108a/md-10-big-Data.db ...
totalRows: 4943299
sizeInBytes: 61926864
compressionRatio: 0.29556160446600444
avgRowSize: 12 (compressed)
avgRowSize: 15.55 (uncompressed)
```

As you can see, the average row size with the default compression settings is 12 bytes, and the output also shows the compression ratio of the data. However, that doesn't consider the other files associated with each `SSTable` (for instance, the Index file). It also ignores the rest of the tables in the `newts` keyspace, as well as other keyspaces and tables Cassandra maintains. That is why it is better to assume a bigger value to compensate for what the calculations are ignoring, and perhaps increase the disk overhead to avoid running out of disk space.
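As a complementary sanity check, `nodetool` can report per-table figures such as the space used and the compression ratio on a live node (no need to stop Cassandra for this one). The exact output fields vary with the Cassandra version, so treat this as a rough cross-check rather than a precise measurement:

```bash=
# Per-table statistics for the samples table
# (use 'nodetool cfstats newts.samples' on older Cassandra versions)
nodetool tablestats newts.samples
```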
### Sizing Formula

Knowing the number of metrics to be persisted (a.k.a. `metricsCapacity`), it is possible to assume the following:

```
metricsCapacity = sampleCapacityInBytes / totalSamplesPerMetric
totalSamplesPerMetric = (ttl * 86400) / (collectionStep * 60)
sampleCapacityInBytes = clusterUsableDiskSpace / averageSampleSize
clusterUsableDiskSpace = (availBytesPerNode * numberOfNodes) / replicationFactor
availBytesPerNode = totalDiskSpacePerNodeInBytes * (1 - percentageOverhead/100)
```

`totalSamplesPerMetric` would be the total number of rows in the `newts.samples` table per metric.

Each installation is different, but on average we can consider that the size of a single sample (i.e., `averageSampleSize`) would be about 14 bytes (note this is slightly higher than the measured value). Unfortunately, the real size is hard to estimate due to how dynamic the non-deterministic elements of the samples table can be, and the fact that SSTables might contain data compressed with different settings (not to mention the metadata and other files that exist alongside the main SSTable files holding the data). That means the actual value can be lower or even higher, but it is better to over-provision disk than to risk running out of disk space.

Choose the number of nodes to calculate the required available bytes per node; or, vice versa, choose the disk size to calculate the expected number of nodes. In general, for Cassandra it is recommended never to use a disk greater than 4TB. ScyllaDB is different: they recommend a "30:1" relationship between the disk space in gigabytes and the available RAM on the system.

For example, let's assume 3TB per node. The request is to collect data from 35 million metrics every 5 minutes for 3 months, assuming TWCS with a 5% overhead and a replication factor of 2. The number of nodes can be calculated like this:

```
numberOfNodes = ((ttl * 86400) / (collectionStep * 60) * metricsCapacity * averageSampleSize * replicationFactor) / (totalDiskSpacePerNodeInBytes * (1 - percentageOverhead/100))
```

In other words,

```
numberOfNodes = (((90 * 86400) / (5 * 60)) * 35000000 * 14 * 2) / (3 * 2^40 * (1 - 0.05)) ~ 8
```

With the above assumptions, we need approximately an 8-node cluster with 3TB of disk space per instance, using TWCS with a replication factor of 2, to persist 35 million metrics every 5 minutes.

Let's say the injection rate is well known (either because it was part of the requirements or because the evaluation layer has been used). In this case, the total number of metrics to be collected is the injection rate multiplied by the collection interval, which is another way to obtain the number of metrics.

As shown, the calculation assumes the number of metrics to be collected is known. When it is not, an assumption on the number of nodes has to be made to estimate the total number of metrics the cluster will handle.

Cassandra scales linearly. That means, to have 4 times that capacity (i.e., 140 million metrics), we need to multiply the cluster size by 4. It is not recommended to increase the disk size per Cassandra node, as that is an anti-pattern, so the obvious variable to increase is the number of nodes in the cluster. If we grow the cluster from 8 to 36 nodes, keeping the same assumptions, we cover the new requirement. With that number of nodes, we should also increase the replication factor to have more room for potential outages; if that parameter changes, more nodes will be required, which can easily be inferred from the formulas.
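The worked example above can be turned into a small reusable sketch. All values are the assumptions from this scenario and should be replaced with your own:

```bash=
# Estimate the number of Cassandra nodes from the sizing formulas above.
TTL_DAYS=90; STEP_MIN=5; METRICS=35000000; SAMPLE_BYTES=14
REPLICATION=2; DISK_TB_PER_NODE=3; OVERHEAD_PCT=5

awk -v ttl=$TTL_DAYS -v step=$STEP_MIN -v m=$METRICS -v s=$SAMPLE_BYTES \
    -v rf=$REPLICATION -v disk=$DISK_TB_PER_NODE -v ov=$OVERHEAD_PCT 'BEGIN {
  samplesPerMetric = (ttl * 86400) / (step * 60)
  requiredBytes    = samplesPerMetric * m * s * rf
  usablePerNode    = disk * 2^40 * (1 - ov / 100)
  printf "nodes required: %.1f (round up)\n", requiredBytes / usablePerNode
}'
```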
The [newts-sizing](https://github.com/agalue/newts-sizing) tool mentioned earlier can also help with these calculations.

# OpenNMS Configuration

At this point, we know that we should configure at least the following parameters in OpenNMS:

* Retention or TTL
* Resource Cache Size
* Ring Buffer Size
* Writer Threads
* Heap Size

We have already provided a way to calculate the cache sizes and the size of the cluster itself. One thing to keep in mind is that it is possible to end up with the same injection rate from different combinations of resources and groups, which impacts the ring buffer. To explain that, let's introduce the stress tool.

## Stress Tool

OpenNMS offers a tool that generates random traffic, similar to what Collectd produces, to understand whether the OpenNMS settings, the Newts settings, and the chosen Cassandra cluster can fulfill the needs.

:::warning
Keep in mind that the actual work done by Collectd can be more expensive than what this tool does. For this reason, the chosen OpenNMS server should never exceed 20% CPU usage while executing the stress tests, so that there is enough computational power left for Collectd and for all the other OpenNMS daemons that would be running and doing work in a production environment.
:::

This tool is a Karaf command, which requires access to the Karaf Shell:

```bash=
ssh -o ServerAliveInterval=10 -p 8101 admin@localhost
```

The `ServerAliveInterval` option is mandatory to keep the session alive; otherwise, if the session gets closed, you might have to restart OpenNMS. Executing `metrics:stress --help` provides an overview of the parameters you can tune during the test.

:::warning
It is recommended to run this against a clean installation of OpenNMS and the cluster that will be evaluated.
:::

Some tests have been executed using ScyllaDB and Cassandra, and the results were published [here](https://github.com/agalue/scylladb-aws) and [here](https://github.com/agalue/cassandra-azure). An interesting fact from those tests is that the bigger the cluster, the faster it is. That means adding more nodes to the cluster will make it faster, as the work is evenly distributed across all the nodes.

Keep the resource cache in mind, and here is why:

```=
metrics:stress -r 60 -n 15000 -f 100 -g 1 -a 20 -s 1 -t 200 -i 300
```

That command would inject 5000 string metrics per second and 100000 numeric metrics per second, creating 3000000 entries in the resource cache. On the other hand, the following command:

```=
metrics:stress -r 60 -n 15000 -f 20 -g 5 -a 20 -s 1 -t 200 -i 300
```

also injects 5000 string metrics per second and 100000 numeric metrics per second, but it creates 1800000 entries in the resource cache. That means, even if the injection rate is the same, the requirements for the resource cache and the ring buffer will be different.
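The numbers quoted for the first command can be derived as follows. The flag meanings are assumed to be: nodes (`-n`), interfaces per node (`-f`), groups per interface (`-g`), numeric (`-a`) and string (`-s`) attributes per group, and the interval in seconds (`-i`); check `metrics:stress --help` to confirm them on your version:

```bash=
NODES=15000; IFACES=100; GROUPS=1; NUMERIC=20; STRINGS=1; INTERVAL=300

echo "numeric samples/s: $((NODES * IFACES * GROUPS * NUMERIC / INTERVAL))"  # 100000
echo "string samples/s : $((NODES * IFACES * GROUPS * STRINGS / INTERVAL))"  # 5000
echo "cache entries    : $((NODES * IFACES + NODES * IFACES * GROUPS))"      # 3000000
```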
Based on field tests against a 16-node cluster, a single OpenNMS server can inject 100000 samples per second with 2000000 entries in the resource cache and 4194304 entries in the ring buffer, using the second command from above. With the same setup, handling the first command requires doubling the cache values (4000000 entries for the resource cache and 8388608 entries for the ring buffer). Field tests also showed that even larger buffer sizes can be counter-productive, meaning that beyond this point the load has to be divided across multiple OpenNMS servers.

While figuring out the cache sizes for the expected load, other parameters can be tuned. Still, how the data collection process is configured influences the resource cache directly, as that is where the groups of objects to collect are defined (like the MibObj groups inside a datacollection-group).

## Data Collection Configuration

It is crucial to review all the metrics that are going to be collected to avoid inflating the caches and the injection rate with unnecessary data. If this step is not performed, the environment could end up with a big cluster and a big OpenNMS machine (or multiple ones) for no reason other than collecting "all the data that's available" instead of storing "only the data that's needed".

For example, consider the following content extracted from `$OPENNMS_HOME/etc/datacollection/mib2.xml`:

```xml=
<group name="mib2-X-interfaces" ifType="all">
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.1" instance="ifIndex" alias="ifName" type="string"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.15" instance="ifIndex" alias="ifHighSpeed" type="string"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.6" instance="ifIndex" alias="ifHCInOctets" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.10" instance="ifIndex" alias="ifHCOutOctets" type="Counter64"/>
</group>

<group name="mib2-X-interfaces-pkts" ifType="all">
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.7" instance="ifIndex" alias="ifHCInUcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.8" instance="ifIndex" alias="ifHCInMulticastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.9" instance="ifIndex" alias="ifHCInBroadcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.11" instance="ifIndex" alias="ifHCOutUcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.12" instance="ifIndex" alias="ifHCOutMulticastPkt" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.13" instance="ifIndex" alias="ifHCOutBroadcastPkt" type="Counter64"/>
</group>

<group name="mib2-interface-errors" ifType="all">
  <mibObj oid=".1.3.6.1.2.1.2.2.1.13" instance="ifIndex" alias="ifInDiscards" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.14" instance="ifIndex" alias="ifInErrors" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.19" instance="ifIndex" alias="ifOutDiscards" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.20" instance="ifIndex" alias="ifOutErrors" type="counter"/>
</group>
```

The above section is perfectly valid. Now imagine a scenario where there is a need to monitor 1000 Cisco Nexus switches, each with 1500 interfaces (between physical and virtual interfaces). Because we have 3 groups associated with interface statistics, the Newts persistence strategy is going to create:

```
1000 * 1500 + 1000 * 1500 * 3 = 6000000
```

In other words, one entry per resource (in this case, per interface), plus one entry per group on each interface. A cache of that size, which does not even consider the node-level resources, their groups, and other resources in general, would require a ring buffer of 16777216, or a tremendously big cluster to be able to handle the indexing with a smaller ring buffer.

Now, if we combine all these 3 groups into one, which is entirely possible because they all share the same resource type (i.e., the value of the instance is the same), we dramatically reduce the resource cache to:

```
1000 * 1500 + 1000 * 1500 = 3000000
```

Meaning, we would be able to handle the load with a ring buffer of 8388608. This matters, considering that a heap size greater than 31GB can be dangerous for a Java application.
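Applying the earlier rule of thumb (the nearest power of 2 greater than twice the resource cache size), the ring buffer for the combined-group scenario can be derived with a quick sketch:

```bash=
CACHE_ENTRIES=3000000
TARGET=$((CACHE_ENTRIES * 2))
RING=1
while [ "$RING" -lt "$TARGET" ]; do RING=$((RING * 2)); done
echo "ring buffer size: $RING"   # 8388608
```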
The proposed solution to reduce the entries in the resource cache is to combine the groups into one:

```xml=
<group name="mib2-X-interfaces-full" ifType="all">
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.1" instance="ifIndex" alias="ifName" type="string"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.6" instance="ifIndex" alias="ifHCInOctets" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.7" instance="ifIndex" alias="ifHCInUcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.8" instance="ifIndex" alias="ifHCInMulticastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.9" instance="ifIndex" alias="ifHCInBroadcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.10" instance="ifIndex" alias="ifHCOutOctets" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.11" instance="ifIndex" alias="ifHCOutUcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.12" instance="ifIndex" alias="ifHCOutMulticastPkt" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.13" instance="ifIndex" alias="ifHCOutBroadcastPkt" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.15" instance="ifIndex" alias="ifHighSpeed" type="string"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.13" instance="ifIndex" alias="ifInDiscards" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.14" instance="ifIndex" alias="ifInErrors" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.19" instance="ifIndex" alias="ifOutDiscards" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.20" instance="ifIndex" alias="ifOutErrors" type="counter"/>
</group>
```

Of course, that assumes the monitoring platform operators will use all these metrics; otherwise, it is advised to remove what is not going to be used.

## Write Threads

A thread pool is dedicated to extracting metrics from the ring buffer and pushing them to Cassandra through Newts. The number of threads should be tuned to match the number of cores on your OpenNMS server. During field tests, we found that increasing the number of threads beyond that is not necessarily useful: the impact on the injection rate is not dramatic, while having more threads working can increase the overall CPU usage of the OpenNMS server.

## Datastax Driver Settings

There are 2 parameters that can be tuned on the driver:

* org.opennms.newts.config.max-connections-per-host
* org.opennms.newts.config.max-requests-per-connection

To learn more about them, please refer to the driver's [documentation](https://docs.datastax.com/en/developer/java-driver/3.5/manual/pooling/). These are very important, and we recommend tuning them when using ScyllaDB.
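A minimal sketch combining the writer threads and the driver pooling settings follows. The thread count assumes a 16-core OpenNMS server, and the pooling values are placeholders for illustration only; derive real values from the driver documentation and your own field tests:

```bash=
cat <<EOF >> /opt/opennms/etc/opennms.properties.d/newts.properties
# One writer thread per core (16-core server assumed)
org.opennms.newts.config.writer_threads=16
# Placeholder pooling values; tune per the DataStax driver documentation
org.opennms.newts.config.max-connections-per-host=24
org.opennms.newts.config.max-requests-per-connection=8192
EOF
```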