In today’s world, where data is growing at an enormous pace, companies are looking for applications and databases that can handle all types of data under one roof, including text, pictures, videos, and even sound. Managing these data types in separate applications from different vendors is not easy. In addition, keeping the data local instead of remote shortens the turnaround time for their applications. One way to address this is Apache Cassandra, an open-source, distributed storage system (database) used for managing vast amounts of structured, semi-structured, and unstructured data across the globe. It uses a Dynamo-style replication model with no single point of failure, combined with a more robust ‘column family’ data model. Originally developed at Facebook, Cassandra differs significantly from relational database management systems and has helped power critical web infrastructure at companies such as Netflix, Twitter, and Instagram. In this blog, we’ll talk more about Cassandra, including the configuration process.
What is Cassandra?
Apache Cassandra is an open-source, distributed NoSQL database built to manage vast amounts of data spread across the globe. As an open-source project, it is available to everyone at no cost and is backed by an active community. Instead of the traditional master-slave architecture, it follows a peer-to-peer architecture in which the nodes exchange state with each other through a gossip protocol. As a result, there is no single point of failure. Because it is elastic, a cluster can be scaled up or down on the fly. In addition, Cassandra replicates data across nodes, which makes it highly available and fault tolerant: if a node fails, the data can still be served from another replica. Cassandra is also known for its flexible data model, in which rows do not all need to have the same columns.
Cassandra is available in numerous versions; at the time of writing, the latest is v4.1. We ran our tests on v4.0.5, so the supporting software listed below applies to Apache Cassandra v4.0.5:
- After deploying the RHEL system and configuring the basic networking, verify that the latest version of Java 8 or 11 is installed by using the following command:
If needed, install the latest Java version.
- To use cqlsh, Python v3.6 or later is required (Python v2.7 still works, but support for it is deprecated). Verify the Python version by using the following command:
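The two checks above usually come down to `java -version` and `python3 --version`. As a rough sketch, the script below wraps them in two small helpers that interpret the reported version. The helper names `java_major` and `python_ok` are ours and purely illustrative; they are not part of Cassandra or of any standard tool.

```shell
# Reduce a Java version string to its major number:
# "1.8.0_362" -> 8 (legacy numbering), "11.0.18" -> 11.
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;
        *)   echo "$1" | cut -d. -f1 ;;
    esac
}

# Succeed when a Python "major.minor[.patch]" string is 3.6 or newer.
python_ok() {
    major=$(echo "$1" | cut -d. -f1)
    minor=$(echo "$1" | cut -d. -f2)
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 6 ]; }
}

if command -v java >/dev/null 2>&1; then
    # java -version prints to stderr; the version is the quoted token on line 1.
    jver=$(java -version 2>&1 | head -n 1 | cut -d'"' -f2)
    echo "Java $jver (major $(java_major "$jver"))"
else
    echo "java not found: install Java 8 or 11"
fi

if command -v python3 >/dev/null 2>&1; then
    pver=$(python3 --version 2>&1 | awk '{print $2}')
    if python_ok "$pver"; then
        echo "Python $pver is recent enough for cqlsh"
    else
        echo "Python $pver is older than 3.6"
    fi
else
    echo "python3 not found: install Python 3.6 or later"
fi
```

The script only reports what it finds; it does not install anything.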
For prerequisites and the installation procedure for Apache Cassandra, click here.
There are three methods for installing Cassandra:
- Using the docker image.
- Using the binary file.
- Package installation using rpm or yum.
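For the package method, yum first needs a repository definition. The sketch below writes one that follows the public Apache Cassandra repository layout for the 4.0 series (`40x`). It is not the repo file from our test environment, and the helper name `write_cassandra_repo` and the temporary output path are only illustrative.

```shell
# Write a yum repo definition for the Cassandra 4.0 series to the file in $1.
# The target path is an argument so the sketch can be tried without root.
write_cassandra_repo() {
    cat > "$1" <<'EOF'
[cassandra]
name=Apache Cassandra
baseurl=https://redhat.cassandra.apache.org/40x/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://downloads.apache.org/cassandra/KEYS
EOF
}

# Dry run into a temporary location.
# On a real node, as root, the target would be /etc/yum.repos.d/cassandra.repo
write_cassandra_repo "${TMPDIR:-/tmp}/cassandra.repo.example"
cat "${TMPDIR:-/tmp}/cassandra.repo.example"
```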
1. To install using RPM packages, you must set up a Cassandra repo on the Red Hat node. The following is an example of Cassandra repo in our test environment:
2. Install Cassandra using the following command and make sure the services have not started yet:
yum install cassandra-4.0.5-1.noarch
3. Configure the cassandra.yaml file located at
You must configure the following critical parameters correctly:
- cluster_name ➔ must be the same on all nodes.
- num_tokens ➔ must be the same on all nodes.
- allocate_tokens_for_local_replication_factor ➔ must be the same on all nodes.
- auto_snapshot ➔ can be disabled in a test environment to avoid clearing snapshots every time you truncate the database.
- endpoint_snitch ➔ the type of Snitch used for the setup.
- concurrent_reads ➔ must be updated depending on the number of drives (16 x no. of drives).
- concurrent_writes ➔ must be updated depending on the number of cores (8 x no. of cores).
- concurrent_counter_writes ➔ must be updated depending on the number of drives (16 x no. of drives).
- seeds ➔ must be the same on all nodes. (This is the most crucial parameter.) We recommend using one seed node for clusters of up to three nodes and more than one seed node for larger clusters.
- listen_address ➔ use the node’s own IP address.
- rpc_address ➔ use the node’s own IP address.
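The sizing rules in the list above are simple multiplications, so they are easy to script. The following is a minimal sketch; `suggest_concurrency` is a hypothetical helper, and the drive and core counts passed to it are example values rather than our test configuration.

```shell
# Print suggested values based on the sizing rules in the list above.
# $1 = number of data drives, $2 = number of CPU cores.
suggest_concurrency() {
    drives=$1
    cores=$2
    echo "concurrent_reads: $((16 * drives))"
    echo "concurrent_writes: $((8 * cores))"
    echo "concurrent_counter_writes: $((16 * drives))"
}

# Example values only: 4 data drives and 16 cores.
suggest_concurrency 4 16
```

The output uses the same `key: value` form as cassandra.yaml, so the lines can be compared directly against the file.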
Apart from the file, if the environment spans different geographic locations, you can edit the cassandra-rackdc.properties file as well.
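The cassandra-rackdc.properties file holds a data center name and a rack name for the local node, and it is read by snitches such as GossipingPropertyFileSnitch. The sketch below writes such a file to a temporary location. `write_rackdc` is a hypothetical helper, and `DC1` and `RACK1` are placeholder names.

```shell
# Write a minimal cassandra-rackdc.properties.
# $1 = output file, $2 = data center name, $3 = rack name.
write_rackdc() {
    printf 'dc=%s\nrack=%s\n' "$2" "$3" > "$1"
}

# Dry run with placeholder names; on a real node the file sits next to cassandra.yaml.
write_rackdc "${TMPDIR:-/tmp}/cassandra-rackdc.properties.example" DC1 RACK1
cat "${TMPDIR:-/tmp}/cassandra-rackdc.properties.example"
```

Each node carries its own copy of this file, and nodes in the same location should use the same data center name.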
1. Start the Cassandra services one at a time on all the nodes starting with the seed node and ensure all the nodes in the cluster have an Up and Normal (UN) status.
2. To access the Cassandra Query Language (CQL), ensure the Python path is exported; otherwise, accessing CQL fails.
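To check the cluster state from step 1 without reading the table by eye, you can count the nodes that are not in the UN state. The following is a minimal sketch: `count_not_un` is a hypothetical helper that reads `nodetool status` output on standard input, and the commented `service` line assumes a package installation.

```shell
# Count nodes whose state is not UN in `nodetool status` output read from stdin.
# Node lines start with a two-letter code: U or D (up/down) followed by
# N, L, J, or M (normal/leaving/joining/moving).
count_not_un() {
    awk '/^[UD][NLJM] / && $1 != "UN" { n++ } END { print n + 0 }'
}

# On each node, seed node first (as root):
#   service cassandra start
# Then, once the node has joined:
if command -v nodetool >/dev/null 2>&1; then
    echo "Nodes not in UN state: $(nodetool status | count_not_un)"
else
    echo "nodetool not found on this host"
fi
```

A result of 0 means every node listed is Up and Normal.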
Configuring Yahoo! Cloud Serving Benchmark (YCSB)
To run the workload on the Cassandra database, you must install the Yahoo! Cloud Serving Benchmark (YCSB) benchmarking tool on the Cassandra node (preferably seed node) as follows:
1. Untar the package.
2. Verify the directory.
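The two steps above can be sketched as follows. `unpack_ycsb` is a hypothetical helper; the YCSB version (0.17.0) and the target directory in the comments are assumptions, so substitute the release you actually downloaded.

```shell
# Unpack a YCSB release tarball ($1) into a directory ($2) and list the result.
unpack_ycsb() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2" && ls "$2"
}

# Example with an assumed version and target directory:
#   curl -O --location https://github.com/brianfrankcooper/YCSB/releases/download/0.17.0/ycsb-0.17.0.tar.gz
#   unpack_ycsb ycsb-0.17.0.tar.gz /opt
#   ls /opt/ycsb-0.17.0    # should include bin/ and workloads/, among others
```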
Creating a database and testing a workload
To create a database and test a workload, complete the following steps:
1. Using cqlsh, log in to the Cassandra cluster and create a keyspace and a user table.
2. Using YCSB, run a test workload.
3. Verify the nodetool status.
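As a rough sketch of these three steps, the script below generates the CQL for a keyspace and the `usertable` layout that the YCSB Cassandra CQL binding expects (a `y_id` key plus `field0` to `field9`). The helper name `ycsb_schema`, the keyspace name, the replication factor, and the `<host>` placeholder are all example choices. SimpleStrategy is used only to keep the test setup short.

```shell
# Emit the CQL for a keyspace and the YCSB "usertable".
# $1 = keyspace name, $2 = replication factor (both are example values).
ycsb_schema() {
    echo "CREATE KEYSPACE IF NOT EXISTS $1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': $2};"
    printf 'CREATE TABLE IF NOT EXISTS %s.usertable (y_id varchar PRIMARY KEY' "$1"
    i=0
    while [ "$i" -le 9 ]; do
        printf ', field%s varchar' "$i"
        i=$((i + 1))
    done
    echo ');'
}

# Print the statements for review.
ycsb_schema ycsb 3

# On a cluster node, replace <host> with a node address:
#   ycsb_schema ycsb 3 > schema.cql && cqlsh <host> -f schema.cql
#   bin/ycsb load cassandra-cql -P workloads/workloada -p hosts=<host>
#   bin/ycsb run  cassandra-cql -P workloads/workloada -p hosts=<host>
#   nodetool status
```

The `load` phase inserts the initial records and the `run` phase executes the workload mix against them.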
Moving the Cassandra application from local disk to Hitachi VSS Block volumes
There could be a scenario where the Cassandra node uses a local disk for the root directory, and the application must be moved to SAN volumes for various reasons. In this situation, you must ensure that the volumes from Hitachi VSS Block are visible and mounted on the Cassandra node. When they are ready to use, complete the following steps:
1. Stop Cassandra services on all nodes, starting with the non-seed nodes.
2. Change the directory locations in the cassandra.yaml file located at
3. Move the directories to the new location.
4. Perform step 3 on all the cluster nodes.
5. Start the Cassandra service with the seed node. In this scenario, we used Cassandra01.
6. Verify the cluster status.
7. Using YCSB, run a test workload.
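The directory move in step 3 can be sketched as below. `move_cassandra_dirs` is a hypothetical helper; the subdirectory names are the common defaults behind the `data_file_directories`, `commitlog_directory`, `saved_caches_directory`, and `hints_directory` settings; and the paths in the comments are assumptions. Use whatever locations your cassandra.yaml actually names.

```shell
# Move Cassandra's on-disk directories from $1 to $2.
# Run only while the Cassandra service on this node is stopped.
move_cassandra_dirs() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    for d in data commitlog saved_caches hints; do
        # Skip directories that do not exist under the source path.
        if [ -d "$src/$d" ]; then
            mv "$src/$d" "$dst/$d" || return 1
        fi
    done
}

# Example with assumed paths (as root); /mnt/cassandra stands for the mounted SAN volume:
#   service cassandra stop
#   move_cassandra_dirs /var/lib/cassandra /mnt/cassandra
#   chown -R cassandra:cassandra /mnt/cassandra
#   service cassandra start
```

The new paths must match the ones written into cassandra.yaml in step 2, otherwise the node starts with empty directories.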
When to use Cassandra and when to stay away
Because Cassandra follows a peer-to-peer architecture, there is no single point of failure, so the cluster can stay online when individual nodes fail. Thanks to the replication feature, data is stored on multiple nodes and, if configured, in multiple data centers. So even if some of the nodes are down, you can still access the data, provided enough replicas remain to satisfy the replication factor and consistency level in use.
Cassandra, by nature, is very good with heavy write workloads and reasonably good with heavy read workloads. These factors help when planning data distribution across various data centers or regions, or even the cloud, for that matter.
There are a couple of drawbacks with Cassandra. First, it does not provide full ACID transactions and prioritizes availability over consistency. As a result, replicas can temporarily hold conflicting versions of the same data. Second, to achieve very fast writes, Cassandra follows an append-oriented approach. It does not update the original record in place; instead, it writes a new entry with a newer timestamp. So, if data is frequently updated, the database holds several versions of the same entry until they are merged, and reads can be slower than writes because they may have to reconcile those versions. Deletions follow a similar pattern: Cassandra does not remove the original data immediately, but writes a marker called a tombstone. This is a hindrance when scanning the database because a read may have to skip over many tombstones. For these reasons, Cassandra excels at storing time-series data, where old data does not need to be updated.
Use cases for Cassandra
- Beneficial for Inventory Management.
- Performs well for e-commerce websites and messaging platforms.
- Beneficial for storing sensor data.
- Suitable for tracking and monitoring user activity, where the collected data can later be used for analytics.