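---
disqus: hackmd
---

Introduction to NoSQL Databases <br> WEEK_1 - Introducing NoSQL
====

###### tags: `IBM Data Engineering Professional Certificate`,`Reading Note`,`Coursera`,`Introduction to NoSQL Databases`

### Overview
>* Explains the differences between RDBMS and NoSQL and how to choose between them in different scenarios.
>* Explores the features and characteristics of NoSQL in depth.
>* Explains the differences between the ACID and BASE models and their performance advantages.

<br>

## Basics of NoSQL
### 1. Overview of NoSQL
* What is NoSQL?
    * NoSQL stands for "Not only SQL".
    * Its storage methods and technologies differ from those of relational databases.
        * Non-relational
        * No formal rows and columns
    * New ways to store and retrieve data.
    * Well suited to handling big data.
    * Applications are easier to develop than with relational databases.
* History of NoSQL <br>![](https://i.imgur.com/HM8n3fO.png =700x)
    * RDBMS databases were developed to meet data storage needs.
    * As data volumes soared, internet companies had to store huge amounts of data and went on to publish white papers on easily scalable NoSQL technologies.
    * Open source NoSQL technologies were developed one after another.
    * Cloud companies launched managed NoSQL services.
### 2. Characteristics of NoSQL Databases
* NoSQL Database Categories
    * Key-Value
    * Document
    * Column
    * Graph
* NoSQL Database Characteristics
    * Have their own open source communities.
        * Most NoSQL databases are available as open source.
    * Use open source as the foundation of the business.
        * Most NoSQL vendors offer both a commercial and an open source edition.
    * Each vendor has its own distinctive technology, but some techniques are shared, such as:
        * Horizontal scaling.
        * Easier data sharing than with an RDBMS.
        * Using a unique key for data sharding.
    * Fit more development use cases than an RDBMS.
    * Easier to develop with.
    * Fit agile development needs.
* Benefits of NoSQL Databases
    * Scalability <br>![](https://i.imgur.com/GcuVYs8.png =500x)
        * Scale horizontally from a single server to a server cluster, server racks, and ultimately data centers.
    * Performance <br>![](https://i.imgur.com/hefF9Hz.png =400x)
        * Fast response times
        * High concurrency
    * Availability
        * A database cluster with multiple data replicas is more resilient than a single database.
    * Cloud Architecture
        * Deploying a database cluster in the cloud can cost less and perform better than a traditional deployment.
    * Flexible Schema
        * A flexible, intuitive data structure makes development lighter for developers.
        * With NoSQL's flexible schema, deploying a new application requires no downtime or database locking.
    * Varied Data Structures
        * Key-value stores for fast data lookups.
        * Document storage.
        * Graph databases for related data.
    * Specialized Capabilities
        * Specialized indexing and querying features
        * Robust data replication
        * Modern HTTP APIs
### 3.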
NoSQL Database Categories - Key-Value
* Key-Value NoSQL Database Architecture <br>![](https://i.imgur.com/R69jfo8.png =300x)
    * Least complex
    * Represented as a hashmap
    * Ideal for basic CRUD operations
    * Scale well
    * Shard easily
    * Not intended for complex queries
    * Atomic for single-key operations only
    * Value blobs are opaque to the database
        * Less flexible data indexing and querying
* Key-Value NoSQL Database Use Cases
    * Suitable Use Cases
        * Fast CRUD operations on non-interconnected data, such as:
            * User profiles or personal preference settings on a website.
            * Shopping cart data.
    * Unsuitable Use Cases
        * Data with many-to-many relationships, such as:
            * Social networks
            * Recommendation engines
        * Data that must stay consistent during concurrent operations.
            * Use a database that provides ACID transactions.
        * Apps that run queries based on the value and not only the key
            * These queries call for document data.
* Key-Value NoSQL Database Examples
    * AWS DynamoDB
    * Oracle NoSQL Database
    * Redis
    * Riak
    * Memcached
### 4. NoSQL Database Categories - Document
* Document-Based NoSQL Database Architecture
    * Values are visible and can be queried.
    * Each piece of data is stored as a document.
        * The format is typically JSON or XML.
    * Each document has a flexible schema.
    * Data can be searched and accessed by key and by value.
    * Data can be accessed with MapReduce.
    * Horizontal scaling.
        * Sharded data can be stored across the nodes.
* Document-Based NoSQL Database Use Cases
    * Suitable Use Cases
        * Event logging for apps and processes
            * Each event instance is represented by a new document.
        * Online blogs
            * Each user, post, comment, like, or action is represented by a document.
        * Operational datasets and metadata for web and mobile apps
            * Designed with the Internet in mind (JSON, RESTful APIs, unstructured data)
    * Unsuitable Use Cases
        * Scenarios that require ACID transactions
            * Document databases cannot handle transactions that span multiple documents.
            * Relational databases are better suited to ACID transaction workloads.
        * Data that is forced into an aggregate-oriented design
            * The data needs to be normalized.
            * Relational databases are the better fit in this case as well.
* Document-Based NoSQL Database Examples
    * IBM Cloudant
    * MongoDB
    * CouchDB
    * Terrastore
    * Couchbase
### 5.
NoSQL Database Categories - Column
* Column-Based NoSQL Database Architecture
    * Derived from Google Bigtable.
        * Also known as Bigtable clones.
    * Store data in columns.
    * Column 'families' are several rows, with unique keys, belonging to one or more columns
        * Grouped in families as often accessed together
    * Rows in a column family are not required to share the same columns
        * Can share all, a subset, or none
        * Columns can be added to any number of rows, or not
* Suitable Use Cases
    * Large amounts of sparse data.
    * Column databases are well suited to being distributed across nodes.
    * Event logging and blogs.
    * Counters are a unique use case for column databases.
    * Columns can have a TTL parameter, making them useful for data with an expiration value.
* Unsuitable Use Cases
    * ACID transactions
        * Reads and writes are suited to the row level only.
    * During early development, columns may need to be added or removed, which can increase cost and lengthen development time.
* Column-Based NoSQL Database Examples
    * Cassandra
    * Apache HBase
### 6. NoSQL Database Categories - Graph
* Graph NoSQL Database Architecture
    * Graph databases store data in nodes and store the relationships between nodes in edges.
    * Spreading the data across multiple servers is difficult and can hurt database performance.
    * Graph databases are suited to ACID transactions.
* Suitable Use Cases
    * Highly connected data.
        * Social networks.
    * Routing, spatial, and map apps
    * Recommendation engines
* Unsuitable Use Cases
    * Horizontal scaling
    * Updating all of the data, or the data in a subset of nodes.
* Graph NoSQL Database Examples
    * Neo4j
    * AWS Neptune

<br>

## Working with Distributed Data
### 1. ACID vs BASE
* ACID vs BASE consistency models <br>![](https://i.imgur.com/gy6dzwr.png =500x)
* ACID definition <br>![](https://i.imgur.com/sqOaEKM.png =500x)
    * Atomic: a transaction is indivisible (all or nothing)
    * Consistent: data stays consistent
    * Isolated: transactions are isolated from one another
    * Durable: committed changes persist
* ACID consistency model
    * Used in relational databases.
    * Ensures the consistency of data transactions.
    * Commonly used in:
        * Financial systems
        * Data warehousing
* BASE definition <br>![](https://i.imgur.com/tfC5leO.png =500x)
    * Basically Available: the service stays basically available
    * Soft state: state may be out of sync for a period of time
    * Eventually consistent: despite a period out of sync, the data ends up consistent
* BASE consistency model
    * Places lower demands on data consistency, real-time updates, and accuracy.
    * Flexible and scalable.
    * Commonly used by:
        * E-commerce companies
        * Social networking platforms
### 2.
Distributed Databases
* Concepts of distributed systems
    * Distributed database
        * Data is divided by its characteristics and stored on different database servers, which are then connected over a network.
        * The database is spread across regions.
    * Fragmentation and replication
    * BASE model.
* Fragmentation <br>![](https://i.imgur.com/tFAjNhN.png =500x) <br>![](https://i.imgur.com/idFPVrM.png =500x) <br>![](https://i.imgur.com/iumYgpJ.png =500x)
    * Data is split into fragments that are stored across the individual databases.
* Replication <br>![](https://i.imgur.com/ZAFgi5o.png =500x) <br>![](https://i.imgur.com/NqERHLU.png =500x) <br>![](https://i.imgur.com/vuiH1Pd.png =500x) <br>![](https://i.imgur.com/X48fMzW.png =500x)
* Advantages of distributed systems
    * High reliability and resilience.
    * Improved performance.
        * Shorter data access times.
    * The database is easy to upgrade and expand.
### 3. The CAP Theorem
* CAP Theorem <br>![](https://i.imgur.com/iy1qNw9.png =500x)
* Partition Tolerance
    * Partition
        * A lost or temporarily delayed connection between nodes.
    * Partition tolerance
        * The cluster must work despite network issues.
    * Distributed systems cannot avoid partitions and must be partition tolerant.
    * Partition tolerance
        * A basic feature of NoSQL
* NoSQL: CP or AP <br>![](https://i.imgur.com/BhYSsty.png =400x)
### 4. Challenges in Migrating from RDBMS to NoSQL Databases
* RDBMS or NoSQL <br>![](https://i.imgur.com/JKJ7oVr.png =500x)
* RDBMS to NoSQL: a mindset change
    * Data-driven model to query-driven data model
        * RDBMS:
            * Starts from data integrity and the relations between entities.
        * NoSQL:
            * Starts from your queries, not from your data. Models are based on the way the application interacts with the data.
    * Normalized to denormalized data
        * NoSQL:
            * Think about how data can be structured based on your queries.
        * RDBMS:
            * Start from your normalized data and then build the queries.
    * From ACID to BASE model
        * Availability vs Consistency
            * CAP Theorem
                * Choose between availability and consistency
            * Availability, performance, geographical presence, high data volumes
    * NoSQL systems, by design, do not support transactions and joins (except in limited cases)

## Summary and Highlights
* The course provides a complete summary of the content, so it is recorded here.
* Basics of NoSQL
>* NoSQL means Not only SQL.
>* NoSQL databases have their roots in the open source community.
>* NoSQL database implementations are technically different from each other.
>* There are several benefits of adopting NoSQL databases including storing and retrieving session information, and event logging for apps.
>* The four main categories of NoSQL database are Key-Value, Document, Wide Column, and Graph.
>* Key-Value NoSQL databases are the least complex architecturally.
>* Document-based NoSQL databases use documents to make values visible for queries.
>* In document-based NoSQL databases, each piece of data is considered a document, which is typically stored in either JSON or XML format.
>* Column-based databases spawned from the architecture of Google’s Bigtable storage system.
>* The primary use cases for column-based NoSQL databases are event logging and blogs, counters, and data with expiration values.
>* Graph databases store information in entities (or nodes) and relationships (or edges).
* Working with Distributed Data
>* ACID stands for Atomicity, Consistency, Isolation, Durability.
>* BASE stands for Basic Availability, Soft-state, Eventual Consistency.
>* ACID and BASE are the consistency models used in relational and NoSQL databases.
>* Distributed databases are physically distributed across data sites by fragmenting and replicating the data.
>* Fragmentation enables an organization to store a large piece of data across all the servers of a distributed system by breaking the data into smaller pieces.
>* You can use the CAP Theorem to classify NoSQL databases.
>* Partition Tolerance is a basic feature of NoSQL databases.
>* NoSQL systems are not a de facto replacement of RDBMS.
>* RDBMS and NoSQL cater to different use cases, which means that your solution could use both RDBMS and NoSQL.
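
The notes describe a key-value store as a hashmap that shards easily, and a distributed database as one that fragments and replicates its data. As a rough illustration only, the Python sketch below models those two ideas in memory. The class name `ShardedKeyValueStore`, its methods, the node count, and the replica count are hypothetical choices made for this note and are not the API of any real NoSQL product.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self, num_nodes=3, replicas=2):
        if not 1 <= replicas <= num_nodes:
            raise ValueError("replicas must be between 1 and num_nodes")
        # Each node is modeled as a plain dict, i.e. a hashmap.
        self.nodes = [{} for _ in range(num_nodes)]
        self.replicas = replicas

    def owners(self, key):
        """Return the indexes of the nodes that hold copies of this key."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.nodes)
        # Fragmentation: the hash of the key picks the first node.
        # Replication: the following nodes keep extra copies.
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        """Write the value to every node that owns the key."""
        for index in self.owners(key):
            self.nodes[index][key] = value

    def get(self, key, down=()):
        """Read from the first owner that is not listed in `down`."""
        for index in self.owners(key):
            if index not in down and key in self.nodes[index]:
                return self.nodes[index][key]
        raise KeyError(key)


store = ShardedKeyValueStore(num_nodes=3, replicas=2)
store.put("cart:42", {"items": ["book", "pen"]})
primary = store.owners("cart:42")[0]
# The value is still readable when the first owner is treated as unavailable.
print(store.get("cart:42", down={primary}))
```

Writes in this sketch reach every replica immediately, so it shows fragmentation and replication but does not model the temporary inconsistency that the BASE model allows.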