Amazon Redshift Tutorials

# Amazon Redshift # OLTP vs OLAP Databases ![image](https://hackmd.io/_uploads/SkEZVKeH1g.png) | OLTP | OLAP | | -------- | -------- | | MySQL, PostgreSQL, SQL Server, etc. | Amazon Redshift, BigQuery, Hive, Snowflake, etc. | |Web, transactional processing, operational systems | Data warehouse, data analysis, reporting | | Designed for transaction-oriented tasks, such as data entry, updates, and retrieval (Transactional data) | Designed for complex queries and analysis of large volumes of data (Historical data) | | Highly normalized data to reduce redundancy and ensure data integrity - 2NF, 3NF, etc. | Denormalized data structure, often organized into multidimensional schemas (like ***star*** or snowflake). | | Wide queries - Entire table/colums, few rows. Supports a large number of short online transactions (e.g., insert, update, delete). | Deep queries - Few/selected columns, large number of rows (e.g. historical data). Supports fewer but more complex queries that aggregate large datasets. | | Fast response time for quick transactions (milliseconds). | Longer response time, as queries are more complex and involve large datasets (seconds to minutes). | | Row-based storage engine | Column-based/Columnar storage engine | # Data Warehouses on-prem vs on-cloud ![image](https://hackmd.io/_uploads/BJtDgp0VJg.png) On-prem: IBM IIAS, Oracle Teradata,... On-cloud: Amazone Redshift, Google BigQuery,... | **Aspect** | **On-Premises** | **Cloud** | |--------------------------|----------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------| | **Infrastructure** | Requires dedicated physical hardware and infrastructure. | Utilizes cloud infrastructure provided by the cloud service provider. Offers scalability and flexibility. | | **Scalability** | Limited scalability, requires additional hardware procurement for expansion. | Offers dynamic scalability. Can easily scale up or down based on demand. | | **Costs** | High upfront costs for hardware, software, and maintenance. Ongoing operational costs. | Pay-as-you-go model. Lower upfront costs. Scalable pricing based on usage. | | **Maintenance** | Requires in-house IT staff for maintenance, updates, and troubleshooting. | Cloud provider handles infrastructure maintenance, updates, and patches. Allows focus on data and analytics. | | **Time To Market** | Longer deployment timelines due to hardware acquisition and setup, solution/vendor selection. | Rapid deployment with the ability to provision resources quickly. There are many cloud native services to choose from to build solutions. | | **Security** | In-house control over security measures. | Cloud provider implements robust security protocols. Shared responsibility model for security. | | **Data Integration** | May require additional tools and efforts for data integration. | Seamless integration with other cloud services. Easier data sharing and collaboration. | | **Backup and Recovery** | In-house responsibility for backup and recovery processes. | Cloud provider offers automated backup and recovery for both inter and intra regions solutions. | | **Data Governance** | Requires in-house implementation of data governance policies. | Cloud provider offers built-in data governance features. Compliance standards may vary. | | **Cost Monitoring** | Manual monitoring of costs. | Built-in cost monitoring and management tools. | # Introduction to Redshift ![redshift](https://hackmd.io/_uploads/SJe_VpANkl.png) Amazon Redshift is a fully managed, fast, and powerful petascale cloud-based data warehouse service provided by AWS. It is ideal for data warehousing, business intelligence, analytics, and reporting applications. It is highly efficient, scalable, and cost-effective, enabling you to process and analyze large volumes of data efficiently. AWS Redshift can handle large-scale data analytics and high-performance structured and semi-structured data analysis. It uses a columnar storage architecture for compression, fast query execution, and parallel processing. As a data warehouse, you can use AWS Redshift to store data from different sources in one place, making it easy to query. Redshift is easily scalable as it supports petabytes of data which you can control by adding more nodes as your data grow. It also supports distributed workloads by splitting and executing queries across multiple nodes to optimize performance. ![image](https://hackmd.io/_uploads/HkbkyFxBJx.png) # Architecture ![architecture](https://hackmd.io/_uploads/Hkh-HpAV1x.png) **Client applications** Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and business intelligence (BI) reporting, data mining, and analytics tools. Amazon Redshift is based on open standard PostgreSQL, so most existing SQL client applications will work with only minimal changes. **Clusters** The core infrastructure component of an Amazon Redshift data warehouse is a cluster. A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. Your client application interacts directly only with the leader node. The compute nodes are transparent to external applications. **Leader node** The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node. The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes. **Compute nodes** The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes run the compiled code and send intermediate results back to the leader node for final aggregation. Each compute node has its own dedicated CPU and memory, which are determined by the node type. As your workload grows, you can increase the compute capacity of a cluster by increasing the number of nodes, upgrading the node type, or both. **Redshift Managed Storage** Data warehouse data is stored in a separate storage tier Redshift Managed Storage (RMS). RMS provides the ability to scale your storage to petabytes using Amazon S3 storage. RMS lets you scale and pay for computing and storage independently, so that you can size your cluster based only on your computing needs. It automatically uses high-performance SSD-based local storage as tier-1 cache. It also takes advantage of optimizations, such as data block temperature, data block age, and workload patterns to deliver high performance while scaling storage automatically to Amazon S3 when needed without requiring any action. **Node slices** A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation. The number of slices per node is determined by the node size of the cluster. For more information about the number of slices for each node size, go to About clusters and nodes in the Amazon Redshift Management Guide. When you create a table, you can optionally specify one column as the distribution key. When the table is loaded with data, the rows are distributed to the node slices according to the distribution key that is defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and run queries efficiently. **Internal network** Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication protocols to provide private, very high-speed network communication between the leader node and compute nodes. The compute nodes run on a separate, isolated network that client applications never access directly. **Databases** A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which in turn coordinates query run with the compute nodes. Amazon Redshift is a relational database management system (RDBMS), so it is compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets. Amazon Redshift is based on PostgreSQL. Amazon Redshift and PostgreSQL have a number of very important differences that you need to take into account as you design and develop your data warehouse applications. For information about how Amazon Redshift SQL differs from PostgreSQL, see Amazon Redshift and PostgreSQL. https://docs.aws.amazon.com/redshift/latest/dg/c_redshift_system_overview.html Price performance: ![image](https://hackmd.io/_uploads/ryDbrulSye.png) Redshift instance type: ![image](https://hackmd.io/_uploads/Bk6mEOxBkg.png) # Table design & Best practices ## Data storage ![image](https://hackmd.io/_uploads/rkKSIulHkx.png) ## Data modeling ![image](https://hackmd.io/_uploads/ByLAHuxrJe.png) ## Columnar storage ![image](https://hackmd.io/_uploads/SJcb3deSkx.png) ![image](https://hackmd.io/_uploads/rk-73OeHyx.png) ## Compression ![image](https://hackmd.io/_uploads/S1yRU_grkx.png) ## Distribution style ![image](https://hackmd.io/_uploads/BJVgPdxByg.png) ![image](https://hackmd.io/_uploads/rJO2POgHke.png) ## Data sorting ![image](https://hackmd.io/_uploads/HJt-dugHJx.png) ![image](https://hackmd.io/_uploads/SkaMuulrkx.png) ## Query processing https://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html https://docs.aws.amazon.com/redshift/latest/dg/c-query-processing.html ## Work load management (WLM) https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html # Ingestion patterns ![image](https://hackmd.io/_uploads/rk6HcKlrkx.png) ![image](https://hackmd.io/_uploads/SJQvqKeryx.png) # Amazon S3 ![product-page-diagram_Amazon-S3_HIW.cf4c2bd7aa02f1fe77be8aa120393993e08ac86d](https://hackmd.io/_uploads/S1CdUCPS1g.png) https://aws.amazon.com/s3/ * Distributed Object Storage (Hadoop, etc.) * Cloud-based object storage (AWS S3, Azure Blob Storage, Google, etc.) * Object level * Access by keys, via Rest APIs (HTTP) * Scalability, Availability (99.99%), Durability (99.999999999%) * Data lakes, Backup and restore data, Archive data, etc. # Practices ![poc-step2](https://hackmd.io/_uploads/HkItCa0EJx.png) https://docs.aws.amazon.com/redshift/latest/gsg/new-user-serverless.html