I'm not a Cassandra expert. This guide should serve as a reference based on my experience with that complex piece of software, gathered while helping customers deploy it in their environments. However, I won't be covering authentication and encryption.
Apache Cassandra is an open source, distributed, NoSQL database. It presents a partitioned wide column storage model with eventually consistent semantics.
We use Cassandra in OpenNMS to store performance metrics through Newts.
Sizing your cluster is crucial, and it is a task to be done before even thinking about configuring a Cassandra cluster. Follow the sizing guide for more information, as that analysis falls outside the scope of this guide.
If you're planning to share the Cassandra cluster across multiple independent OpenNMS servers, besides having a dedicated keyspace for Newts in each case, you would need to sum the injection rates from all of them and use that total for the cluster sizing exercise. If you're going to use the total number of metrics (instead of the injection rate), do the same. All of that assumes the retention and the collection interval will be the same; otherwise, sizing becomes a complex task.
It is recommended to always use dedicated SSDs for the filesystems on which the SSTables will be stored, configured as RAID0 (never RAID1 or RAID5). Make sure to use volumes if the size might need to be increased later, and format the file system using XFS. Remember, the worst thing that could happen to a Cassandra node is running out of disk space.
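One way to get both the striping (RAID0) and the ability to grow the volume later is LVM. The following is a minimal sketch assuming two dedicated NVMe devices; the device, volume group, and logical volume names are just examples, so adjust them to your hardware:
sudo pvcreate /dev/nvme0n1 /dev/nvme1n1
sudo vgcreate cassandra_vg /dev/nvme0n1 /dev/nvme1n1
sudo lvcreate -n data -l 100%FREE -i 2 cassandra_vg   # -i 2 stripes across both devices (RAID0-like)
sudo mkfs.xfs /dev/cassandra_vg/data
sudo mkdir -p /var/lib/cassandra
echo '/dev/cassandra_vg/data /var/lib/cassandra xfs defaults,noatime 0 0' | sudo tee -a /etc/fstab
sudo mount /var/lib/cassandra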
I will describe how to install and configure either Cassandra 3.11.x or 4.x on RHEL/CentOS 7/8 using the official RPMs. However, for better performance, consider using ScyllaDB, especially if you're planning to use powerful servers, as unlike Cassandra, Scylla can scale vertically (besides horizontally like Cassandra).
If you're new to Cassandra, I encourage you to learn more about it before using it. Datastax has great resources here, as well as Scylla here.
I suggest following this guide from The Last Pickle that explains the things to consider when setting up a new cluster (some of them referenced here).
A given Cassandra node identifies itself within the cluster using its IP Address. For this reason, each Cassandra server needs a static IP.
If the IP address changes over time, when Cassandra restarts, the node could be registered as a new one instead of the one it used to be when it first joined, leading to unexpected results.
This is why managing Cassandra is challenging in container-based environments, or in any environment where you cannot have fixed or static IPs.
The following is based on recommendations taken from several blog posts and from Datastax documentation:
cat <<EOF | sudo tee /etc/sysctl.d/98-cassandra.conf
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_intvl=10
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=40960
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
vm.swappiness=1
vm.max_map_count=1048575
EOF
sudo sysctl -p /etc/sysctl.d/98-cassandra.conf
It is recommended to disable swap permanently, so besides setting vm.swappiness=1 as mentioned above, make sure you don't waste file system resources and remove it from /etc/fstab if it applies.
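For example, to turn swap off immediately and prevent it from coming back after a reboot (double-check /etc/fstab afterwards):
sudo swapoff -a
sudo sed -i.bak '/swap/ s/^/#/' /etc/fstab   # comments out any line mentioning swap; review the file after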
Additionally, based on multiple blogs I read about the topic, the following settings are also recommended:
cat <<EOF | sudo tee /etc/sysctl.d/99-cassandra-others.conf
net.ipv4.tcp_window_scaling=1
net.core.netdev_max_backlog=2500
net.core.somaxconn=65000
vm.zone_reclaim_mode=0
EOF
sudo sysctl -p /etc/sysctl.d/99-cassandra-others.conf
All the Cassandra nodes must have their time synchronized at all times. Failing to do so would lead to data inconsistencies, as every operation within a Cassandra cluster is time-sensitive.
For this reason, it is mandatory to have NTP installed and always running to synchronize time.
In terms of OpenNMS, it is recommended to include the OpenNMS server itself and the PostgreSQL servers in the list of synchronized servers.
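For example, with chrony (the default time synchronization service on recent RHEL/CentOS releases):
sudo yum install -y chrony
sudo systemctl enable --now chronyd
chronyc tracking   # verify the node is actually synchronized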
First of all, Cassandra is a Java application. Therefore a JRE environment is required.
For Cassandra 3.x, Java 8 is required, so make sure to install that version and not a newer one. To install OpenJDK 8:
sudo yum install -y java-1.8.0-openjdk
For Cassandra 4.x, Java 11 is supported, although still experimental as mentioned here. To install OpenJDK 11:
sudo yum install -y java-11-openjdk
There are some articles with excellent results when using Cassandra 4 with OpenJDK 16 and ZGC and Shenandoah instead of G1GC. For that, I recommend reading this blog post from The Last Pickle. It would be interesting to test Cassandra 4 with the new LTS, OpenJDK 17.
Besides that, even though OpenJDK 8 is recommended for production, the official Docker image for Cassandra 4 uses OpenJDK 11 and SaaS solutions like Astra rely on running Cassandra in Kubernetes.
Note that OpenJDK 8 would be installed with Cassandra if there is no JDK installed.
Then, you should configure the YUM repository for Cassandra:
REPO_VERSION="40x" # Use either 311x or 40x
cat <<EOF | sudo tee /etc/yum.repos.d/cassandra.repo
[cassandra]
name=Apache Cassandra
baseurl=https://downloads.apache.org/cassandra/redhat/$REPO_VERSION/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://downloads.apache.org/cassandra/KEYS
EOF
Then, install the RPM packages:
sudo yum install -y cassandra cassandra-tools
For cqlsh, if you choose Cassandra 3.11, you need Python 2, whereas with Cassandra 4, you need Python 3.6. Keep in mind that 3.11.12 is expected to work with Python 3, but please avoid 3.11.11 due to CASSANDRA-16822.
On RHEL/CentOS 7, Python 2 is available by default, and RHEL/CentOS 8 comes with Python 3 by default, although it is not activated as the default python. You can use the alternatives command to change the default Python version.
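For instance, on RHEL/CentOS 8 you could point the unversioned python command at Python 3 (assuming the python3 package is already installed):
sudo alternatives --set python /usr/bin/python3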
At this point, everything should be ready, but to have a real cluster, we need to apply a few settings.
Keep in mind that the data will be stored at /var/lib/cassandra. Make sure that filesystem has enough disk space according to the sizing process.
If the server where Cassandra is running has firewalld enabled, make sure to open the ports by doing the following:
cat <<EOF | sudo tee /etc/firewalld/services/cassandra.xml
<?xml version="1.0" encoding="utf-8"?>
<service>
<short>cassandra</short>
<description>Apache Cassandra</description>
<port protocol="tcp" port="7199"/>
<port protocol="tcp" port="7000"/>
<port protocol="tcp" port="7001"/>
<port protocol="tcp" port="9160"/>
<port protocol="tcp" port="9042"/>
</service>
EOF
sudo firewall-cmd --reload
sudo firewall-cmd --permanent --add-service=cassandra
sudo firewall-cmd --add-service=cassandra
From the OpenNMS perspective, you should also run the last two commands for the snmp service, to be able to monitor the Cassandra nodes that way (available after installing the net-snmp package).
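In other words, assuming the stock snmp service definition that ships with firewalld:
sudo firewall-cmd --permanent --add-service=snmp
sudo firewall-cmd --add-service=snmp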
Specify the name of the cluster. All cluster members must use the same name.
Specify the IP Address or FQDN of the Cassandra Seed Node.
Use listen_interface to specify the name of the physical interface on which Cassandra will listen for connections from the other nodes (gossip and storage traffic). This is preferred to listen_address, to be able to share and use the same cassandra.yaml across all the cluster members without alteration (assuming the same hardware and OS). Make sure to comment out listen_address.
Use rpc_interface to specify the name of the physical interface of the machine on which Cassandra will start the native transport listeners. It is recommended to have a dedicated network for this, but in case that is not possible, use the same as listen_interface. This is preferred to rpc_address, to be able to share and use the same cassandra.yaml across all the cluster members without alteration (assuming the same hardware and OS). Make sure to comment out rpc_address.
You must change this when you need a rack-aware or multi-dc cluster, but it is highly recommended to change it anyway, especially for production clusters (or when you need to have multiple instances of Cassandra on the same physical server). In this case, use GossipingPropertyFileSnitch instead of SimpleSnitch. This requires defining the data center and rack inside cassandra-rackdc.properties for each Cassandra instance.
SimpleSnitch should only be used for testing and development environments, as changing it later is an extremely difficult task.
You must update all the existing keyspaces configured with SimpleStrategy to use NetworkTopologyStrategy (but never touch those with LocalStrategy).
Additionally, it is recommended to disable the Dynamic Snitch by adding the following setting (not present by default) to cassandra.yaml:
dynamic_snitch: false
This is optional but recommended if each Cassandra node happens to have lots of CPU cores. That helps speed up the joining, decommission, and cleanup processes (but not repairs) while still performing multiple regular compactions simultaneously (although be careful with disk I/O). I recommend 8 if the machine has more cores than that.
Additionally, as the documentation suggests, check compaction_throughput_mb_per_sec. It is expressed in MB/s, shared across all the compaction threads, and should be less than the real throughput of the NIC in the server.
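Putting together the cassandra.yaml settings discussed so far, the relevant entries could look roughly like the following. The cluster name, seed IP, and interface name are placeholders, and I'm assuming the compaction-thread setting mentioned above is concurrent_compactors:
cluster_name: 'OpenNMS Cluster'
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.0.21"
listen_interface: eth0
rpc_interface: eth0
endpoint_snitch: GossipingPropertyFileSnitch
dynamic_snitch: false
concurrent_compactors: 8
compaction_throughput_mb_per_sec: 64   # default is 16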
Running anti-entropy repairs is a costly operation that is not intended for immutable tables (i.e., those configured with TWCS). That doesn't mean you should avoid repairs in general, but you can certainly skip them for the newts.samples table. However, if a node has been down for over 3 hours (the default value for max_hint_window_in_ms), you must schedule a repair (or rebuild the down node as new). You could avoid that by increasing this value, depending on how long it would take you to become aware of a hardware failure and fix it. Of course, keeping hints for other nodes requires disk space, so plan accordingly.
You should also consider altering hinted_handoff_throttle_in_kb, as data might pile up quickly, leading to a long wait when transferring hints; the default value of 1024 is too conservative (not suitable for busy clusters).
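For example, to keep hints for a full day and loosen the throttle (both values are illustrative, not a recommendation for every cluster), you could set the following in cassandra.yaml:
max_hint_window_in_ms: 86400000        # 24 hours; the default is 10800000 (3 hours)
hinted_handoff_throttle_in_kb: 4096    # default is 1024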
As described here and more in-depth here (and also here), the default value of 256 is not going to do any good and could have bad repercussions with tasks like repairs. From those articles, a value of 4 seems reasonable.
However, as you can see here, the default in Cassandra 4 is 16, meaning that can be a great starting point if you decide to use that version. You could start a 3.11 cluster with that default; however, keep in mind that reducing the number of vnodes/tokens could lead to unbalanced clusters (i.e., have some nodes owning more data than others).
The above is already part of cassandra.yaml on 4.x.
Changing this later is an extremely difficult task.
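Assuming the setting in question is num_tokens in cassandra.yaml, starting a new cluster with the Cassandra 4 default would mean:
num_tokens: 16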
It is recommended to disable this feature (marked as experimental in the documentation). This should not be a problem for OpenNMS/Newts, as that Cassandra feature is not required.
This is already disabled by default in Cassandra 4, but in 3.11, you must disable it.
It is recommended to disable this feature (marked as experimental in the documentation). This should not be a problem for OpenNMS/Newts, as that Cassandra feature is not required.
This is already disabled by default in Cassandra 4, but in 3.11, you must disable it.
This file can be copied to new nodes, but only when using listen_interface and rpc_interface.
This is required only if you're planning to use a rack-aware or multi-dc cluster, and it requires GossipingPropertyFileSnitch. However, it is highly recommended to use it.
Check the official docs for more information.
This file would be slightly different on each server.
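For reference, its content is as simple as the following; the data center and rack names are just examples and must match your actual topology:
cat <<EOF | sudo tee /etc/cassandra/conf/cassandra-rackdc.properties
dc=DC1
rack=Rack1
EOF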
Changes here are required to monitor Cassandra via JMX from OpenNMS.
This is the file where you configure part of the core JVM settings. I recommend doing the following:
Find java.rmi.server.hostname, uncomment it, and use the Cassandra node's IP or hostname.
Find LOCAL_JMX=yes, and replace the value with no.
Make sure the line containing com.sun.management.jmxremote.password is uncommented.
Make sure the line containing com.sun.management.jmxremote.access is uncommented.
This file would be slightly different on each server, as you need to specify the IP or FQDN of the node for the RMI server.
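After those edits, the relevant lines in cassandra-env.sh would look roughly like this, where 192.168.0.21 is a placeholder for the node's own IP:
LOCAL_JMX=no
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=192.168.0.21"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.access.file=/etc/cassandra/jmxremote.access"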
Create the file like this:
cat <<EOF | sudo tee /etc/cassandra/jmxremote.access
monitorRole readonly
cassandra readwrite
controlRole readwrite \
create javax.management.monitor.*,javax.management.timer.* \
unregister
EOF
sudo chmod 0400 /etc/cassandra/jmxremote.access
sudo chown cassandra:cassandra /etc/cassandra/jmxremote.access
This file can be copied to new nodes.
Create the file like this:
cat <<EOF | sudo tee /etc/cassandra/jmxremote.password
monitorRole QED
controlRole R&D
cassandra cassandra
EOF
sudo chmod 0400 /etc/cassandra/jmxremote.password
sudo chown cassandra:cassandra /etc/cassandra/jmxremote.password
Feel free to change the password for the cassandra user (which is also the word cassandra), but keep in mind you have to provide it every time you run the nodetool command from now on, and update the OpenNMS configuration for Pollerd, Collectd, and the detector if it applies.
This file can be copied to new nodes.
The configuration file depends on the version of Cassandra you're planning to use:
For 3.11.x, it is /etc/cassandra/conf/jvm.options.
For 4.x, it is /etc/cassandra/conf/jvm11-server.options for OpenJDK 11 or newer, or the same file as 3.11.x for OpenJDK 8.
Under HEAP SETTINGS, we recommend setting the Java Heap Memory size up to half of the available RAM, but never over 30 GB, for instance:
-Xms16G
-Xmx16G
Under GC SETTING, comment all the CMS settings, and uncomment the following G1GC settings:
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70
We found that G1GC performs better than CMS in production; with CMS, the CPU usage seems very high even when Cassandra is idle. As you can learn from this blog post (among others), there are great results when using 4.x with OpenJDK 16 and ZGC.
The JVM options file can be copied to new nodes.
The following applies after having the whole Cassandra cluster up and running with the expected topology.
As mentioned above, when using NetworkTopologyStrategy, some of the system keyspaces should be updated to take advantage of it.
To list all the keyspaces:
cqlsh cassandra1 -e "SELECT * FROM system_schema.keyspaces;"
The output would be:
keyspace_name | durable_writes | replication
--------------------+----------------+-------------------------------------------------------------------------------------
system_auth | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
system_schema | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
system_distributed | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
system | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
system_traces | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}
(6 rows)
Note that three of them use SimpleStrategy (i.e., system_auth, system_distributed and system_traces). Those are the ones to update.
Do not change those with LocalStrategy (i.e., system_schema and system).
To obtain the name of the current datacenter:
cqlsh cassandra1 -e "SELECT data_center FROM system.local;"
I'd keep the default replication factor and data center name, but you can change them if you like. To update them, you can do something like the following:
CASSANDRA_DC="DC1"
cat <<EOF > /tmp/fix-keyspaces.cql
ALTER KEYSPACE system_auth WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy', '$CASSANDRA_DC' : 1
};
ALTER KEYSPACE system_distributed WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy', '$CASSANDRA_DC' : 3
};
ALTER KEYSPACE system_traces WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy', '$CASSANDRA_DC' : 2
};
EOF
cqlsh -f /tmp/fix-keyspaces.cql cassandra1
Make sure to adjust the DC name, or add more than one if you'll be using a multi-dc environment.
This must be done before you start using the Cassandra cluster. Otherwise, you'd have to run a full repair.
The following applies after having the whole Cassandra cluster up and running with the expected topology.
When the data to be stored in Cassandra is based on time series metrics, meaning storing immutable entries based on timestamps, the best approach is to use TWCS (Time Window Compaction Strategy).
The overhead when using this strategy can be as big as each time-windowed chunk. The chunk size depends on how TWCS is configured, but in practice, it can be around 5% of the disk space. Compared with STCS, it is clear which one is the winner, as the required disk overhead for STCS is 50%.
However, it is important to keep in mind the time series constraint: the data has to be immutable. In other words, once stored, it won't be altered or manually modified. Data should only be evicted by TTL. If this is not the case, TWCS won't help as much as it should, meaning the overhead on disk space will be greater.
Fortunately, we can consider the data stored by OpenNMS through Newts as time-series data to use this strategy.
The idea is to create a CQL file with the desired topology, replication factor, and TWCS for the samples table only.
Create the following environment variables (assuming a cluster with at least 3 nodes and a TTL of 1 year):
KEYSPACE="newts"
CASSANDRA_DC="DC1"
CASSANDRA_REPLICATION_FACTOR="2"
TWCS_COMPACTION_WINDOW_SIZE="7"
TWCS_COMPACTION_WINDOW_UNIT="DAYS"
TWCS_EXPIRED_FREQ="86400" # How frequent to check for expired SSTables (in seconds)
GC_GRACE_PERIOD="604800" # Recommended to match window size (in seconds)
Then create the CQL file with the Newts keyspace:
cat <<EOF | sudo tee /etc/cassandra/newts.cql
CREATE KEYSPACE IF NOT EXISTS ${KEYSPACE} WITH replication = {
'class' : 'NetworkTopologyStrategy',
'${CASSANDRA_DC}' : ${CASSANDRA_REPLICATION_FACTOR}
};
CREATE TABLE IF NOT EXISTS ${KEYSPACE}.samples (
context text,
partition int,
resource text,
collected_at timestamp,
metric_name text,
value blob,
attributes map<text, text>,
PRIMARY KEY((context, partition, resource), collected_at, metric_name)
) WITH compaction = {
'compaction_window_size': '${TWCS_COMPACTION_WINDOW_SIZE}',
'compaction_window_unit': '${TWCS_COMPACTION_WINDOW_UNIT}',
'expired_sstable_check_frequency_seconds': '${TWCS_EXPIRED_FREQ}',
'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy'
} AND gc_grace_seconds = ${GC_GRACE_PERIOD}
AND read_repair_chance = 0;
CREATE TABLE IF NOT EXISTS ${KEYSPACE}.terms (
context text,
field text,
value text,
resource text,
PRIMARY KEY((context, field, value), resource)
);
CREATE TABLE IF NOT EXISTS ${KEYSPACE}.resource_attributes (
context text,
resource text,
attribute text,
value text,
PRIMARY KEY((context, resource), attribute)
);
CREATE TABLE IF NOT EXISTS ${KEYSPACE}.resource_metrics (
context text,
resource text,
metric_name text,
PRIMARY KEY((context, resource), metric_name)
);
EOF
Please note that I'm assuming the usage of NetworkTopologyStrategy, as rack-awareness or multi-dc is the best approach for a production environment. You should avoid SimpleStrategy (only suitable for testing, development, or simple scenarios). Also, keep in mind that the read_repair_chance table option was removed in Cassandra 4.0, so you may need to drop that clause from the samples table if you're using 4.x.
The above assumes one DC. But, if you're interested in having a multi-dc deployment, you could do the following. For instance, for two DCs called DC1 and DC2 with a replication factor of 2 on both:
CREATE KEYSPACE IF NOT EXISTS ${KEYSPACE} WITH replication = {
'class' : 'NetworkTopologyStrategy',
'DC1' : 2,
'DC2' : 2
};
Then, create the keyspace:
cqlsh -f /etc/cassandra/newts.cql $(hostname)
This should be done once.
For this reason, some precautions have been added to the CQL statements to guarantee that if the keyspace and tables were already created, they won't be recreated again.
When using rack-awareness or multi-dc with GossipingPropertyFileSnitch, do not forget to configure the data center and rack in cassandra-rackdc.properties for each node.
If you're planning to share the same Cassandra cluster across multiple independent OpenNMS servers, it is recommended to have a different Newts keyspace for each of them, to avoid undesired side effects, but keep in mind this introduces a complication when sizing the cluster.
The “window size” is configured through 2 settings: compaction_window_size and compaction_window_unit (set above through the TWCS_COMPACTION_WINDOW_SIZE and TWCS_COMPACTION_WINDOW_UNIT variables).
The reason for choosing 7 days is the following: for a 1-year retention, the number of compacted chunks will be 52 (as there are 52 weeks in a year). This is a little bit higher than the recommended number of chunks, but in practice it is reasonable, especially to simplify the calculations. For different retentions, try to target around 40 chunks or less.
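As a quick sanity check for other retention periods, the chunk count is simply the retention divided by the window size (the numbers below are just arithmetic, not recommendations):
echo $(( 365 / 7 ))   # 1-year retention, 7-day window  -> 52 chunks
echo $(( 180 / 5 ))   # 6-month retention, 5-day window -> 36 chunks
echo $((  90 / 2 ))   # 90-day retention, 2-day window  -> 45 chunks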
The following commands can show you useful information about the cluster status, and they can be executed from any active cluster member.
General Information:
nodetool -u cassandra -pw cassandra info
Cluster Status:
nodetool -u cassandra -pw cassandra status
If you changed the password when creating the jmxremote.password file, make sure to adjust the nodetool command accordingly.
When there is a need to add a new node to the cluster, make sure to copy over all the configuration files mentioned above from one of the existing nodes, and adjust cassandra-env.sh with the server's IP or hostname for the RMI entry.
Ensure the SEED node is up and running and working properly before starting the Cassandra process. Otherwise, the JOIN process will fail.
Once you have everything ready, start the Cassandra process, but keep in mind that the JOIN process could take hours or even days, depending on how much data has been added to the cluster.
You can track the progress with nodetool status, and you'll see the letters UJ for the nodes that are joining, which will become UN once the process is done.
I recommend periodically checking /var/log/cassandra for errors with the joining process, and referring to Cassandra's documentation to obtain information about what to do if this happens.
Once the node has successfully joined the cluster, you can proceed to clean up all the other nodes, one at a time, by running the following command:
nodetool -u cassandra -pw cassandra cleanup &
To speed up the process, assuming you have enough compaction threads configured as mentioned earlier, you could use something like this:
nodetool -u cassandra -pw cassandra cleanup -j 4 &
The above assumes you have at least 8 compaction threads.
Because that process can also take hours or days, I sent the command to the background. You can use the following command to track progress:
nodetool -u cassandra -pw cassandra compactionstats
To be clear, you have to execute cleanup on one node at a time and wait until it is over before proceeding to the next node, and it is not necessary to run it on the node that just joined the cluster.
Thanks to TWCS, there is no need to execute regular repairs over the samples table unless a given node has been down for over 3 hours (see Hinted Handoff). Even in this case, redeploying the broken or corrupt node as new might be faster than running a repair, also because it is not recommended to run repairs against TWCS tables (and the following tool disallows that by default).
However, repairs should be executed over the rest of the tables, and because this is a very complex, delicate, and time-consuming operation, I recommend investing time learning the following tool:
Another essential thing to consider is backups. Please always keep in mind the following fact:
“replication is not backup.”
Make sure to update the JMX password for Cassandra in poller-configuration.xml
and collectd-configuration.xml
.
If you changed the keyspace name for Newts, you must alter all JMX metrics for them to work.
Then, add all the Cassandra nodes to the OpenNMS inventory to start collecting statistics. Besides JMX, I recommend setting up the Net-SNMP agent on the Cassandra nodes as well.
In terms of using Cassandra for Newts, there are several considerations to keep in mind.
The default configuration promotes speed over consistency, meaning it is possible to lose data if a node goes down. This is configured via the following setting:
org.opennms.newts.config.write_consistency=ANY
The default is ANY (the fastest option). I think ONE is safer, but if losing data is unacceptable for you, what you should use is LOCAL_QUORUM. Of course, that consistency level will be slower, as a majority of the replicas, including the node that holds the primary one (i.e., a quorum), must acknowledge the write request for it to succeed; but if your cluster is fast enough, it will make the data more consistent at writing time.
ANY can be dangerous because if the node that holds the primary replica is down, the write operation goes to the hinted handoff. If the node stays down for over max_hint_window_in_ms, that data will be lost forever (not recoverable).
Read repairs are discouraged for TWCS tables (or immutable tables), meaning increasing the READ consistency might not be useful.
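For reference, this is roughly how that could be set on the OpenNMS side. The file name is arbitrary, the opennms.properties.d directory is the usual place for such overrides (assuming OpenNMS is installed under /opt/opennms), and you should verify the property names against your OpenNMS version:
cat <<EOF | sudo tee /opt/opennms/etc/opennms.properties.d/newts-consistency.properties
# Hypothetical override file: trade some write speed for durability (see above)
org.opennms.newts.config.write_consistency=LOCAL_QUORUM
org.opennms.newts.config.read_consistency=ONE
EOF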
As mentioned in the sizing guide, it is mandatory to update the following entries according to your needs:
There are no suggestions here as the above relies entirely on your particular environment.