Jiayong
Introduction to NoSQL Databases <br> WEEK_3 - Introducing Apache Cassandra
====

###### tags: `IBM Data Engineering Professional Certificate`,`Reading Note`,`Coursera`,`Introduction to NoSQL Databases`

## Cassandra Basics

### 1. Overview of Cassandra
* What is Apache Cassandra
    * Open Source
    * Distributed
    * Decentralized
    * Elastically scalable
    * Highly available
    * Fault-tolerant
    * Tunable
    * Consistent database
* Apache Cassandra in the NoSQL Space

| MongoDB | Apache Cassandra |
| :---: | :---: |
| Mostly used for search use cases and e-commerce websites | Suitable for all types of websites |
| Reads with indexes | Fast writes; queries can scan all the data |
| Consistency | Availability & scalability |
| Primary-secondary architecture | Peer-to-peer architecture |

* Key Features of Apache Cassandra
    * Distributed and decentralized
    * Always available with tunable consistency
    * Fault tolerant
    * High write throughput
    * Fast and linear scalability
    * Multiple data center support
    * SQL-like query language
* What is Apache Cassandra
    * A reliable, performant, scalable database for data storage.
    * Not a drop-in replacement for a relational database.
        * Does not support joins
        * Limited support for aggregations
        * Limited support for transactions
    * For joins and aggregations:
        * Cassandra + Spark <br>![](https://i.imgur.com/yQwTQLF.png =200x)
* Usage Scenarios for Cassandra
    * When writes exceed read requests
        * For example, storing all the clicks on your website or all the access attempts on your service.
    * When using append-like type of data
        * Not many updates and deletes.
    * When you can predefine your queries and your data access is by a known primary key
        * Data can be partitioned via a key that allows the database to be spread evenly across multiple nodes.
    * When there is no need for joins or aggregations
* Common Use Cases for Cassandra

| eCommerce websites | Online services | Timeseries |
| :---: | :---: | :---: |
| Storing transactions | Users' authentication for access to services | Monitoring servers' access logs |
| Website interactions (clicks) for prediction of customer behavior | Tracking users' activity in the application | Weather updates from sensors |
| Status of orders/users' transactions | | Tracking packages |
| Users' profiles and shopping history | | |

### 2. Architecture of Cassandra
* The Apache Cassandra architecture is designed to provide scalability, availability, and reliability to store massive amounts of data.
* Apache Cassandra Topology
    * Cassandra is based on a distributed system architecture. In its simplest form, Cassandra can be installed on a single machine or container. A single Cassandra instance is called a node. Cassandra supports horizontal scalability, achieved by adding more than one node as part of a Cassandra cluster. <br>![](https://i.imgur.com/v0dUhcN.png =500x)
    * As well as being a distributed system, Cassandra is designed as a peer-to-peer architecture, with each node connected to all other nodes. Each Cassandra node can perform all database operations and can serve client requests without the need for a primary node. <br>![](https://i.imgur.com/rsdIFmQ.png =400x)
    * Gossip is the protocol used by Cassandra nodes for peer-to-peer communication. The gossip protocol informs a node about the state of all other nodes. A node performs gossip communications with up to three other nodes every second. The gossip messages follow a specific format and use version numbers to keep communication efficient, so each node can quickly build the metadata of the entire cluster (which nodes are up or down, which tokens are allocated to each node, etc.).
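The gossip behaviour described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not Cassandra's actual protocol: the names `merge`, `gossip_round`, and `simulate` are hypothetical, each node's state is reduced to a `(version, status)` pair, and one loop iteration stands in for one second of gossip with up to three peers.

```python
import random


def merge(a, b):
    """Combine two cluster views, keeping the entry with the higher version per node."""
    out = dict(a)
    for name, (version, status) in b.items():
        if name not in out or out[name][0] < version:
            out[name] = (version, status)
    return out


def gossip_round(views, fanout=3, rng=random):
    """One round: every node exchanges its view with up to `fanout` random peers."""
    nodes = list(views)
    for node in nodes:
        peers = [n for n in nodes if n != node]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            merged = merge(views[node], views[peer])
            # Both sides end the exchange with the merged view.
            views[node] = dict(merged)
            views[peer] = dict(merged)


def simulate(num_nodes=12, seed=0, max_rounds=100):
    """Run rounds until every node knows about every other node."""
    rng = random.Random(seed)
    # At start, each node only knows its own state: (version, status).
    views = {f"node{i}": {f"node{i}": (1, "UP")} for i in range(num_nodes)}
    for rounds in range(1, max_rounds + 1):
        gossip_round(views, rng=rng)
        if all(len(view) == num_nodes for view in views.values()):
            return rounds, views
    raise RuntimeError("views did not converge")


if __name__ == "__main__":
    rounds, views = simulate()
    print(f"all 12 views complete after {rounds} round(s)")
```

The version number is what lets a node decide which of two conflicting entries is newer without any central coordinator, which is the point the note makes about efficient communication.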
* Multi Data Centers Deployment
    * A Cassandra cluster can be a single data center deployment (as in the pictures above), but most of the time Cassandra clusters are deployed in multiple data centers. The multi data center deployment below shows a 12-node Cassandra cluster installed across 2 data centers. Because replication is set at keyspace level, the demo keyspace specifies a replication factor of 5: 2 replicas in data center 1 and 3 in data center 2. <br>![](https://i.imgur.com/JGziK5t.png =500x)
      Note: a Cassandra node can also act as the coordinator of an operation. In this example the operation arrived in data center 2, so the node that received it becomes the coordinator, while a node in data center 1 becomes the remote coordinator, taking care of the operation only within data center 1.
* Components of a Cassandra Node
    * There are several components in Cassandra nodes that are involved in the write and read operations. Some of them are listed below:
    * Memtable
        * Memtables are in-memory structures where Cassandra buffers writes. In general, there is one active Memtable per table. Eventually, Memtables are flushed onto disk and become immutable SSTables.
        * A flush can be triggered in several ways:
            * The memory usage of the Memtables exceeds a configured threshold.
            * The CommitLog approaches its maximum size and forces Memtable flushes so that CommitLog segments can be freed.
            * A flush period configured for the table elapses.
    * CommitLog
        * CommitLogs are an append-only log of all mutations local to a Cassandra node. Any data written to Cassandra is first written to a commit log before being written to a Memtable. This provides durability in the case of unexpected shutdown. On startup, any mutations in the commit log are applied to Memtables.
    * SSTables
        * SSTables are the immutable data files that Cassandra uses for persisting data on disk.
          As SSTables are flushed to disk from Memtables or are streamed from other nodes, Cassandra triggers compactions, which combine multiple SSTables into one. Once the new SSTable has been written, the old SSTables can be removed.
        * Each SSTable is composed of multiple components stored in separate files, some of which are listed below:
            * Data.db: The actual data.
            * Index.db: An index from partition keys to positions in the Data.db file.
            * Summary.db: A sampling of (by default) every 128th entry in the Index.db file.
            * Filter.db: A Bloom Filter of the partition keys in the SSTable.
            * CompressionInfo.db: Metadata about the offsets and lengths of compression chunks in the Data.db file.
* Write Process at Node Level
    * Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending with a write of data to disk:
        * Logging data in the commit log
        * Writing data to the Memtable
        * Flushing data from the Memtable
        * Storing data on disk in SSTables

      ![](https://i.imgur.com/f6MLBhk.png =500x)
* Read at Node Level
    * While writes in Cassandra are simple and fast operations done in memory, reads are more complicated, since they need to consolidate data from both memory (Memtable) and disk (SSTables). Because data on disk can be fragmented across several SSTables, the read process needs to identify which SSTables most likely contain the partitions being queried. This selection is done using the Bloom Filter. The steps are described below:
        * Checks the Memtable
        * Checks the Bloom filter
        * Checks the partition key cache, if enabled
        * If the partition is not in the cache, the partition summary is checked
        * Then the partition index is accessed
        * Locates the data on disk
        * Fetches the data from the SSTable on disk
        * Data is consolidated from Memtable and SSTables before being sent to the coordinator

      <br>![](https://i.imgur.com/rgvg02J.png =500x)

### 3. Key Features of Cassandra
* Distributed & Decentralized
    * The cluster runs on multiple distributed machines.
    * Users address the cluster in the same way:
        * regardless of the number of nodes in the cluster
    * All nodes perform the same functions (server symmetry)
    * Peer-to-peer architecture <br>![](https://i.imgur.com/u98TZ5V.png =400x)
* Data Distribution starts with a Query <br>![](https://i.imgur.com/dnyx64G.png =500x)
* Data Replication and Multiple DC Support
    * Replicas
        * How many nodes contain a certain piece of your data (partition)
    * Data replication takes cluster topology into consideration
        * Distribution of nodes across racks and data centers
* Availability versus Consistency
    * Always available
    * Tunable consistency
        * Consistency is set per operation (read/write)
    * CAP theorem: Cassandra favors availability over consistency
    * Tunable: strong or eventual consistency
    * Consistency conflicts are solved during reads <br>![](https://i.imgur.com/VzxN86V.png =300x)
* High Availability and Fault Tolerance
    * Peer-to-peer architecture
    * Nodes' temporary/permanent failures are immediately recognized by the other nodes in the cluster.
    * Nodes reconfigure the data distribution once nodes are taken out of the cluster.
    * Failed requests can be retransmitted to other nodes. <br>![](https://i.imgur.com/0UOBf1o.png =300x)
* Fast and Linear Scalability
    * Scales horizontally by adding new nodes to the cluster
    * Performance increases linearly with the number of added nodes
    * New nodes are automatically assigned tokens from existing nodes
    * Adding and removing nodes is done seamlessly <br>![](https://i.imgur.com/zzxsGMx.png =400x)
* High Write Throughput
    * At cluster level
        * Writes can be distributed in parallel to all nodes holding replicas.
          <br>![](https://i.imgur.com/BpCIj15.png =300x)
        * No reading before writing (by default)
    * At node level
        * Writes are done in node memory and later flushed to disk
        * All disk writes are sequential, append-like operations <br>![](https://i.imgur.com/iBSBnN5.png =300x)
* Cassandra Query Language
    * Data Definition and Manipulation: CQL, an SQL-like syntax

```SQL=
CREATE TABLE test(
    groupid int,
    name text,
    occupation text,
    age int,
    PRIMARY KEY ((groupid), name)
);

// CQL inserts one row per INSERT statement
INSERT INTO test (groupid, name, occupation, age) VALUES (1001, 'Thomas', 'engineer', 24);
INSERT INTO test (groupid, name, occupation, age) VALUES (1001, 'James', 'designer', 30);
INSERT INTO test (groupid, name, occupation, age) VALUES (1002, 'Lily', 'writer', 35);

SELECT * FROM test WHERE groupid = 1001;
```

### 4. Cassandra Data Model - Part 1
* Logical Entities: Tables and Keyspaces
    * Table
        * Logical entity that organizes data storage at cluster and node level (according to a declared schema)
    * Keyspace
        * Logical entity that contains one or more tables
        * Replication and data center distribution are defined at keyspace level
        * Recommended: 1 keyspace per application

```SQL=
CREATE KEYSPACE intro_cassandra
    WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'datacenter1': 2, 'datacenter2': 3};

USE intro_cassandra;

CREATE TABLE groups(
    groupid int,
    group_name text STATIC,
    username text,
    age int,
    PRIMARY KEY ((groupid), username)
);
```
* Logical Entities: Tables
    * Data is organized in tables containing rows of columns.
    * Tables can be created, dropped, and altered at runtime without blocking updates and queries.
    * To create a table, you must define a primary key and other data columns (regular columns)

```SQL=
CREATE TABLE intro_cassandra.groups(
    groupid int,
    group_name text STATIC,
    username text,
    age int,
    PRIMARY KEY ((groupid), username)
);
```
    * groupid: Partition Key
    * username: Clustering Key
* Primary Key in Cassandra Tables
    * Subset of the declared columns
    * Mandatory (you cannot change it once declared)
    * Two main roles:
        * Optimize read performance for table queries (query-driven table design)
        * Provide uniqueness to the entries
    * Has two components:
        * Partition Key - mandatory
        * Clustering Key(s) - optional
* Partition Keys
    * When data is written to a table, it is grouped into partitions and distributed across cluster nodes based on the partition key
    * Partition Key => Hash (token) => Node
    * The partition key determines data (partition) locality in the cluster <br>![](https://i.imgur.com/rx060wA.png)
* Table Types
    * Two types of tables: static and dynamic
    * Static tables
        * PRIMARY KEY(username)
    * Dynamic tables
        * PRIMARY KEY((groupid), username)
* Static Tables <br>![](https://i.imgur.com/8tK2vtI.png =500x)

### 5. Cassandra Data Model - Part 2
* Clustering Key
    * Stores data in ascending or descending order within the partition for fast retrieval of similar values
    * Can have single or multiple columns
    * Completes the primary key in dynamic tables
    * Gives uniqueness to the primary key
    * Improves read query performance

```SQL=
CREATE TABLE intro_cassandra.groups(
    groupid int,
    group_name text STATIC,
    username text,
    age int,
    PRIMARY KEY ((groupid), username)
);
```
    * Group - dynamic table
        * Partition key = groupid
        * Clustering key = username
        * 1 partition = multiple entries
* Dynamic Tables

```SQL=
INSERT INTO intro_cassandra.groups(groupid, group_name, username, age)
VALUES (45, 'Grilling', 'JayZ@yahoo.com', 46);
```
<br>![](https://i.imgur.com/IvB0MMb.png)
* Basic Rules of Data Modeling
    * Data modeling: build a primary key that optimizes query execution time
    * Choose a partition key that starts answering your query and spreads the data uniformly across the cluster
    * Minimize the number of partitions read in order to answer the query <br>![](https://i.imgur.com/pcP7gQZ.png =400x)

### 6. Introduction to Cassandra Query Language Shell (cqlsh)
* Cassandra Query Language
    * CQL is the primary language for communication with Cassandra clusters
    * Simple yet intuitive syntax (SQL-like)
    * CQL lacks grammar for relational features such as JOIN statements
    * Different behavior of CQL commands vs. SQL

```SQL=
CREATE KEYSPACE intro_cassandra WITH ..
```
```SQL=
CREATE TABLE test() ..
```
```SQL=
INSERT INTO test() VALUES() ..
```
```SQL=
SELECT * FROM test WHERE ..
```
```SQL=
UPDATE test SET age = 25 WHERE userid = 30 ..
```
```SQL=
DELETE FROM test WHERE userid = 30 ..
```
```SQL=
DROP TABLE test;
```
```SQL=
TRUNCATE TABLE test;
```
* CQL keywords are case-insensitive

```SQL=
SELECT * FROM users;
```
* Identifiers in CQL are case-insensitive unless enclosed in double quotation marks
    * Names for identifiers created using uppercase are stored in lowercase

```SQL=
CREATE TABLE USERS(..);
```
* Commented text (//) is ignored by CQL

```SQL=
CREATE TABLE USERS(..); // Stored name of the table: users
```
* Running CQL Queries
    * Run using Cassandra client drivers
        * Java, Python, Ruby, Node.js, PHP, Scala, Clojure
        * Default = open source DataStax Java Driver
    * Run using the cqlsh client
        * Python-based command line shell for interacting with Cassandra through CQL
        * Shipped with every Cassandra package
        * Connects to a single node (default node or one specified on the command line)
    * Other CQL client editors are available
* CQL Shell (cqlsh)
    * Using cqlsh, you can:
        * Create, alter, drop keyspaces
        * Create, alter, drop tables
        * Insert, update, delete data
        * Execute read queries (SELECT) <br>![](https://i.imgur.com/54UNxAF.png =300x)
* cqlsh - example

```SQL=
USE intro_cassandra;
SELECT * FROM groups WHERE groupid = 12;
INSERT INTO groups(groupid, username, group_name, age)
VALUES (12, 'Aland@gmail.com', 'baking', 32);
SELECT * FROM groups WHERE groupid = 12;
```
<br> ![](https://i.imgur.com/fIyaaOO.png)
* cqlsh - special commands <br>![](https://i.imgur.com/0MfJ0WY.png)
* cqlsh Consistency <br>![](https://i.imgur.com/cnnq3Us.png)
* Consistency example - QUORUM <br>![](https://i.imgur.com/2WVA1nM.png)
* cqlsh COPY (import / export data) <br>![](https://i.imgur.com/QwjQ2Mi.png)

## Working with Cassandra

### 1. CQL Data Types
* Built-in Data Types

|Data Type|Data Type|
| :---: | :---: |
| ASCII | Int |
| Boolean | Text |
| Blob | Timestamp |
| Bigint | Timeuuid |
| Decimal | Tinyint |
| Double | Uuid |
| Float | Varchar |

* Collection Data Types
    * Collections
        * A way to group and store data together
        * Cassandra has no support for joins
        * Related data is stored together in the same collection
        * The amount of data stored in a collection should be bounded, so collections are not suitable for data that is updated continuously.
* Collection Data Types - List
    * When the order of the elements needs to be maintained.
    * Ex. entries in a log.

```SQL=
USE intro_cassandra;
ALTER TABLE users ADD jobs list<text>;
UPDATE users SET jobs = ['Walmart'] + jobs WHERE username = 'Alaind@gmail.com'; // prepend a job to the list
UPDATE users SET jobs = jobs + ['Netflix'] WHERE username = 'Alaind@gmail.com'; // append the latest job change to the list
UPDATE users SET jobs[0] = 'Reiss' WHERE username = 'Alaind@gmail.com'; // replaces Walmart with Reiss (lists start from 0)
```
![](https://i.imgur.com/39YD9ts.png =400x)
* User-Defined Data Types (UDTs)
    * Collection data types for one-to-many / UDTs for one-to-one
    * Can attach multiple data fields, each named and typed, to a single column
    * The fields used to create a UDT may be any valid data type, including collections and other existing UDTs
    * Once created, the user can alter, verify, and drop a field or the whole data type
    * Once created, UDTs may be used to define a column in a table

```SQL=
CREATE TYPE address(
    street text,
    number int,
    flat text
);
CREATE TABLE users_w_address(
    userid int,
    location address,
    PRIMARY KEY (userid)
);
INSERT INTO users_w_address(userid, location)
VALUES (1, {street: 'Third', number: 34, flat: '34c'}); // insert data
DROP TYPE address; // we can drop a type (only once no table still uses it)
```

### 2. Keyspace Operations

## Summary & Highlights
* The course provides a complete summary of the content, so it is recorded here
* Cassandra Basics
>* Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault tolerant, and tunable and consistent database.
>* Apache Cassandra is best used by "always available" type of applications that require a database that is always available.
>* Data distribution and replication takes place in one or more data center clusters.
>* Its distributed and decentralized architecture helps Cassandra be available, scalable, and fault tolerant.
>* Cassandra stores data in tables.
>* Tables are grouped in keyspaces.
>* A clustering key specifies the order that the data is arranged inside the partition (ascending or descending).
>* Dynamic tables partitions grow dynamically with the number of entries.
>* CQL is the primary language for communicating with Apache Cassandra clusters.
>* CQL queries can be run programmatically using a licensed Cassandra client driver, or they can be run on the Python-based CQL shell client provided with Cassandra.

* Working with Cassandra
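
The summary points about partitions and clustering keys can be made concrete with a small Python model of a dynamic table declared with `PRIMARY KEY ((groupid), username)`. This is a rough sketch under stated assumptions rather than how Cassandra is implemented: `ToyTable` and `node_for` are hypothetical names, MD5 stands in for Cassandra's default Murmur3 partitioner, and a simple modulo replaces token ranges on a ring.

```python
import hashlib


class ToyTable:
    """Toy model of a table with PRIMARY KEY ((groupid), username)."""

    def __init__(self, nodes):
        self.nodes = nodes
        # partition key -> {clustering key -> remaining columns}
        self.partitions = {}

    def node_for(self, groupid):
        # Partition key => hash (token) => node.
        # MD5 and modulo are stand-ins chosen for illustration only.
        token = int(hashlib.md5(str(groupid).encode()).hexdigest(), 16)
        return self.nodes[token % len(self.nodes)]

    def insert(self, groupid, username, **columns):
        # The full primary key is unique, so a second insert with the
        # same (groupid, username) overwrites the earlier row.
        self.partitions.setdefault(groupid, {})[username] = columns

    def select(self, groupid):
        # Rows inside a partition come back ordered by the clustering key.
        rows = self.partitions.get(groupid, {})
        return [(username, rows[username]) for username in sorted(rows)]


if __name__ == "__main__":
    table = ToyTable(["node1", "node2", "node3"])
    table.insert(12, "carol@example.com", age=41)
    table.insert(12, "alan@example.com", age=32)
    table.insert(45, "jay@example.com", age=46)
    for username, columns in table.select(12):
        print(table.node_for(12), username, columns)
```

A query that names the partition key touches exactly one partition, which is why the data modeling rules in the note aim to minimize the number of partitions a query has to read.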
