---
disqus: hackmd
---
Introduction to NoSQL Databases <br> WEEK_3 - Introducing Apache Cassandra
====
###### tags: `IBM Data Engineering Professional Certificate`,`Reading Note`,`Coursera`,`Introduction to NoSQL Databases`
### Overview
Kitty ipsum dolor sit amet, shed everywhere shed everywhere stretching attack your ankles chase the red dot, hairball run catnip eat the grass sniff.
## Cassandra Basic
### 1. Overview of Cassandra
* What is Apache Cassandra
* Open Source
* Distributed
* Decentralized
* Elastically scalable
* Highly available
* Fault-tolerant
* Tunable
* Consistent database
* Apache Cassandra in the NoSQL Space
| MongoDB | Apache Cassandra |
| :---: | :---: |
| 多用於搜尋用例及電商網站 | 各類型網站都適用 |
| Read with indexs | 快速寫入資料,搜尋全部資料 |
| Consistency | Availability & scalability |
| Primary-secondary architecture | Peer-to-peer architecture |
* Key Features of Apache Cassandra
* Distributed and decentralized
* Always available with tunable consistency
* Fault tolerent
* High write throughput
* fast and linear scalability
* Multiple Data center support
* SQL-Like Query language
* What is Apache Cassandra
* A reliable, performant, scalable database for data storage.
* Not a drop-in replacement for a relational database.
* Does not support joins
* Limited aggregations support
* limited support of transacrions
* For Joins and Aggregations:
* Cassandra + Spark
<br>
* Usage Scenarios for Cassandra
* When writes exceed read requests
* For example, storing all the clicks on your website or all the acess attempts on your service.
* When using append-like type of data
* Not many updates and deletes.
* When you can predefine your queries and your data access is by a known primary key
* Data can be partitioned via a key that allows the database to be spread evenly across multiple nodes.
* When there is no need for joins or aggregations
* Common Use Cases for Cassandra
| eCommerce websites | Online services | Timeseries |
| :---: | :---: | :---: |
| Storing transactions | User' authentication for access to services | Monitoring servers' access logs |
| Website interactions(Clicks) for prediction of customer behavior | Tracking users' activity in the application | Weather updates from sensors |
| Status of orders/users' transactions | | Tracking packages |
| Users' profiles and shopping history |
### 2. Architecture of Cassandra
* The Apache Cassandra architecture is designed to provide scalability, availability, and reliability to store massive amounts of data.
* Apache Cassandra Topology
* Cassandra is based on a distributed system architecture. In its simplest form, Cassandra can be installed on a single machine or container. A single Cassandra instance is called a node. Cassandra supports horizontal scalability achieved by adding more than one node as a part of a Cassandra cluster.
<br>
* As well as being a distributed system, Cassandra is designed to be a peer-to-peer architecture, with each node connected to all other nodes. Each Cassandra node can perform all database operations and can serve client requests without the need for a primary node.
<br>
* Gossip is the protocol used by Cassandra nodes for peer-to-peer communication. The gossip protocol informs a node about the state of all other nodes. A node performs gossip communications with up to three other nodes every second. The gossip messages follow a specific format and use version numbers to make efficient communication, thus shortly each node can build the entire metadata of the cluster (which nodes are up/down, what are the tokens allocated to each node, etc..).
* Multi Data Centers Deployment
* A Cassandra cluster can be a single data center deployment (like in the above pics), but most of the time Cassandra clusters are deployed in multiple data centers. A multi data-center deployment looks like below – where you can see depicted a 12 nodes Cassandra cluster, topology wise installed in 2 datacenters. Since replication is being set at keyspace level, demo keyspace specifies a replication factor 5: 2 in data center 1 and 3 in data center 2.
<br>
Note: since a Cassandra node can be as well a coordinator of operations, in our example since the operation came in data center 2 the node receiving the operation becomes the coordinator of the operation, while a node in data center 1 will become the remote coordinator – taking care of the operation in only data center 1.
* Components of a Cassandra Node
* There are several components in Cassandra nodes that are involved in the write and read operations. Some of them are listed below:
* Memtable
* Memtables are in-memory structures where Cassandra buffers writes. In general, there is one active Memtable per table. Eventually, Memtables are flushed onto disk and become immutable SSTables.
* This can be triggered in several ways:
* The memory usage of the Memtables exceeds a configured threshold.
* The CommitLog approaches its maximum size, and forces Memtable flushes in order to allow Commitlog segments to be freed.
* When we set a time to flush per table.
* CommitLog
* Commitlogs are an append-only log of all mutations local to a Cassandra node. Any data written to Cassandra will first be written to a commit log before being written to a Memtable. This provides durability in the case of unexpected shutdown. On startup, any mutations in the commit log will be applied to Memtables.
* SSTables
* SSTables are the immutable data files that Cassandra uses for persisting data on disk. As SSTables are flushed to disk from Memtables or are streamed from other nodes, Cassandra triggers compactions which combine multiple SSTables into one. Once the new SSTable has been written, the old SSTables can be removed.
* Each SSTable is comprised of multiple components stored in separate files, some of which are listed below:
* Data.db: The actual data.
* Index.db: An index from partition keys to positions in the Data.db file.
* Summary.db: A sampling of (by default) every 128th entry in the Index.db file.
* Filter.db: A Bloom Filter of the partition keys in the SSTable.
* CompressionInfo.db: Metadata about the offsets and lengths of compression chunks in the Data.db file.
* Write Process at Node Level
* Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending with a write of data to disk:
* Logging data in the commit log
* Writing data to the Memtable
* Flushing data from the Memtable
* Storing data on disk in SSTables

* Read at node level
* While writes in Cassandra are very simple and fast operations, done in memory, the read is a bit more complicated, since it needs to consolidate data from both memory (Memtable) and disk (SSTables). Since data on disk can be fragmented in several SSTables, the read process needs to identify which SSTables most likely contain info about the partitions we are querying - this selection is done by the Bloom Filter information. The steps are described below:
* Checks the Memtable
* Checks Bloom filter
* Checks partition key cache, if enabled
* If the partition is not in the cache, the partition summary is checked
* Then the partition index is accessed
* Locates the data on disk
* Fetches the data from the SSTable on disk
* Data is consolidated from Memtable and SSTables before being sent to coordinator
<br>
### 3. Key Features of Cassandra
* Distributed & Decentralized
* Cluster runs on multiple distributed machines.
* user address the cluster in the same way:
* seamless to the number of nodes in the cluster
* All nodes perform the same functions(Server symmetry)
* Peer-to-peer architecture
<br>
* Data Distribution starts with a Query
<br>
* Data Replication and Multiple DC Support
* Replicas
* How many nodes contain a certain piece of your data (partition)
* Data Replication takes Cluster topology into consideration
* Racks and data centers distribution of nodes
* Availability versus Consistency
* Always available
* Tunable consistency
* Per operation set consistency(read/write)
* CAP theorem:Cassandra favors availability over consistency
* Tunable: Strong or eventual consistency
* Consistency conflicts solved during read
<br>
* High Availability and Fault Tolerance
* Peer-to-Peer architecture
* Nodes' temporary/permenent failures are immediately recognized by the other nodes in the cluster.
* Nodes reconfigure the data distribution once nodes are taken out of the cluster.
* Failed requests can be retransmitted to other nodes.
<br>
* Fast and Linear Scalability
* Scales horizontally by adding new nodes in the cluster
* Performance increses linearly with the number of added nodes
* New nodes are automatically assigned tokens from existing nodes
* Adding and removing of nodes is done seamlessly
<br>
* High Write Throughput
* At cluster level
* Writes can be distributed in parallel to all nodes holding replicas.
<br>
* No reading beafore writing(by default)
* At node level
* Writes are done in node memory and later flushed on disk
* All disk write are sequantial ones - append-like operations
<br>
* Cassandra Query Language
* Data Definition and Manipulation:CQL, an SQL-like syntax
```SQL=
CREATE TABLE test(
groupid uuid,
name text,
occupation text,
age int,
PRIMARY KEY((groupid), name));
INSERT INTO test(groupid, name, occupation, age)
VALUES(1001, 'Thomas', 'engineer', 24), (1001, 'James', 'designer', 30, (1002, 'Lily', 'writer', 35));
SELECT * FROM test WHERE groupid = 1001;
```
### 4. Cassandra Data Model - Part 1
* Logical Entities: Tables and Keyspaces
* Table
* Logical entity that organizes data storage at cluster and node level(according to a declared schema)
* Keyspace
* Logical entity that contains one or more tables
* Replication and data centers' distribution is defined at keyspace level
* Recommended 1 keyspace/application
```SQL=
CREATE KEYSPACE intro_cassandra WITH REPLICATION = {'class':'NetworkTopologyStrategy', 'datacenter1': 2, 'datacenter2': 3};
USE intro_cassandra;
CREATE TABLE groups(
groupid int,
group_name text STATIC,
username text,
age int,
PRIMARY KEY((groupid), username)
);
```
* Logical Entites: Tables
* Data is organized in tables containing rows of columns.
* Tables can be created, dropped, and altered at runtime without blocking updates and queries.
* To create a table, you must define a primary key and other data columns(regular columns)
```SQL=
CREATE TABLE intro_cassandra.groups(
groupid int,
group_name text STATIC,
username text,
age int,
PRIMARY KEY ((groupid), username)
);
```
* groupid: Partition Key
* username: Clustering Key
* Primary Key in Cassandra Tables
* Subset of the declared columns
* Mandatory (you cannnot change it once declared)
* Two main roles:
* Optimize read performance for table queries - Query driven table design
* Provide uniqueness to the entries
* Has two components:
* Partition Key - mandatory
* Clustering Key(s) - optional
* Partition Keys
* When data is writeen to a table, it is grouped into partitions and distributed on cluster nodes - based on partition Key
* Partition Key => Hash (token) => Node
* Partition key determines data (partition) locality in cluster
<br>
* Table Types
* Two types of tables: static and dynamic
* Static tables
* PRIMARY KEY(username)
* Dynamic tables
* PRIMARY KEY((groupid), username)
* Static Tables
<br>
### 5. Cassandra Data Model - Part 2
* Clustering Key
* Stores data in ascending or descending order within the partition for the fast retrieval of similar values
* Can have single or multiple columns
* Completes the primary key in dynamic tables
* Gives uniqueness to primary key Improves read query performance
```SQL=
CREATE TABLE intro_cassandra.groups(
groupid int,
group_name text STATIC,
username text,
age int,
PRIMARY KEY ((groupid),username)
);
```
* Group - dynamic table
* Partition key = groupid
* Clustering key = username
* 1 partition = multiple entries
* Dynamic Tables
```SQL=
INSERT INTO intro_cassandra.groups(groupid, group_name, username, age)
VALUES(45, 'Grilling', 'JayZ@yahoo.com', 46));
```
<br>
* Basic Rukes of Data Modeling
* Data Modeling - build a primary key that optimizes query execution time
* Choose a partition Key - starts answering your query and spreads the data uniformly in the cluster
* Minimize the bunber of partitions read in order to answer the query
<br>
### 6. Introduction to Cassandra Query Language Shell (cqlsh)
* Cassandra Query language
* CQL is the primary language for communication with Cassandra clusters
* Simple yet intuitive syntax(SQL-like)
* CQL lacks frammar for relational features such as JOIN statements
* Different behavior of CQL commands vs. SQL
```SQL=
CREATE KEYSPACE intro_cassandra WITH ..
```
```SQL=
CREATE TABLE test() ..
```
```SQL=
INSERT INTO test() VALUES() ..
```
```SQL=
SELECT * FROM test WHERE ..
```
```SQL=
UPDATE test SET age = 25 WHERE userid = 30 ..
```
```SQL=
DELETE FROM test WHERE userid = 30 ..
```
```SQL=
DROP TABLE test;
```
```SQL=
TRUNCATE TABLE test;
```
* CQL keywords are case-insensitive
```SQL=
SELECT * FROM users;
```
* Identifiers in CQL are case-insensitive unless enclosed in double quotation marks
* Names for identifiers created using uppercase are stored in lowercase
```SQL=
CREATE TABLE USERS(..);
```
* Commented text (//) is ignored by CQL
```SQL=
CREATE TABLE USERS(..); // Stored name of the table: users
```
* Running CQL Queries
* Run using Cassandra client drivers
* JAVA, Python, Ruby, Node.js, PHP, Scala, Clojure
* Default = open source Datastax Java Driver
* Run using cqlsh client
* Python-based command line shell for interacting with Cassandra through CQL
* Shipped with every Cassandra package
* Connects to a single node (default node or one specified on the command line)
* Other CQL client editors are available
* CQL Shell (cqlsh)
* Using cqlsh, you can:
* Create, alter, drop keyspaces
* Create, alter, drop tables
* Insert, update, delete data
* Execute read queries(SELECT)
<br>
* cqlsh - example
```SQL=
USE intro_Cassandra;
SELECT * FROM groups WHERE groupid = 12;
INSERT INTO
GROUPS(groupid, username, group_name, age)
VALUES(12, 'Aland@gmail.com', 'baking', 32);
SELECT * FROM groups WHERE groupid=12;
```
<br> 
* cqlsh - special commands
<br>
* cqlsh Consistency
<br>
* Consistency example - QUORUM
<br>
* cqlsh COPY (import / export data)
<br>
## Working with Cassandra
### 1. CQL Data Types
* Built-in Data Types
|Data Type|Data Type|
| :---: | :---: |
| ASCII | Int |
| Boolean | Text |
| Blob | Timestamp |
| Bigint | Timeuuid |
| Decimal | Tinyint |
| Double | Uuid |
| Float | Varchar |
* Collection Data Types
* Collections
* A way to group and store data together
* Cassandra 沒有 Joins 的功能
* 把資料存放在同一個 Collection
* Collection 的儲存是要有限制,所以不適合即時更新的資料。
* Collection Data Types - List
* When order of the elements needs to be maintained.
* Ex. entries in a log.
```SQL=
USE intro_cassandra;
ALTER TABLE users ADD jobs list<text>;
UPDATE user SET jobs = ['Walmart'] + jobs WHERE username
= 'Alaind@gmail.com'; // add the last job change to the list
UPDATE users SET jobs = job + ['Netflix'] WHERE username = 'Alaind@gmail.com'; // add the last job change to the list
UPDATE users SET jobs[0] = 'Reiss' WHERE username = 'Alaind@gmail.com'; // replaces Walmart with Reiss (lists start from 0)
```

* User-Defined Data Types (UDTs)
* Collection data types for one-to-many / UDTs for one-to-one
* Can attach multiple data fields, each named and typed, to single column
* The fields used to create a UDT may be any valid data type, including collections and other existing UDTs
* Once created, the user can alter, verify, and drop a field or the whole data type
* Once created, UDTs may be used to define a column in a table
```SQL=
CREATE TYPE address(
Street text,
Number int,
Flat text);
CREATE TABLE users_w_address(
Userid int,
Location address,
Primary key (userid));
INSERT INTO users_w_address(userid, location) VALUES (1, {street : 'Third', number : 34, flat : '34c'}); // insert data
DROP TYPE address; // we can drop a type
### 2. Keyspace Operations
```
## Summary & Highlights
* 課程完整整理的內容,所以把它記錄下來
* Cassandra Basisc
>* Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault tolerant, and tunable and consistent database.
>* Apache Cassandra is best used by "always available" type of applications that require a database that is always available.
>* Data distribution and replication takes place in one or more data center clusters.
>* Its distributed and decentralized architecture helps Cassandra be available, scalable, and fault tolerant.
>* Cassandra stores data in tables.
>* Tables are grouped in keyspaces.
>* A clustering key specifies the order that the data is arranged inside the partition (ascending or descending).
>* Dynamic tables partitions grow dynamically with the number of entries.
>* CQL is the primary language for communicating with Apache Cassandra clusters.
>* CQL queries can be run programmatically using a licensed Cassandra client driver, or they can be run on the Python-based CQL shell client provided with Cassandra.
* Working with Cassandra