# General Overview ([META](https://dev.miaplaza.com/miaplaza/website/-/issues/16058))

We are planning to implement a robust business intelligence tool to empower internal users to generate their own reports and dashboards without requiring developer assistance. Currently, @dhansen and the analytics team use KNIME to gather data and create visuals, which has several limitations:

- **Learning Curve and Technicality:** Steep; SQL knowledge is required.
- **Data Freshness:** Data can be anywhere from a day to three weeks out of date, as it is sourced from the staging database.
- **Complexity:** Direct extraction from our databases is complicated by their structure.
- **Usability:** The current approach is not practical for other departments to adopt for their own data analysis.

The new tool will be designed to ensure data privacy and ease of use, integrating data warehousing solutions to manage data efficiently and minimize load on production databases. We aim to integrate data from external tools such as Braintree, Airtable, QuickBooks, and Google Analytics, enhancing our analytical capabilities.

**Requirements from the analytics team include:**

- Accessible, visually appealing data presentation.
- Capability for ad-hoc analysis while maintaining data security.
- Provision of fresh, real-time data.
- Availability of aggregated data tables.
- User-friendly interface for non-technical staff.
- Flexibility to incorporate various data sources.
- Comprehensive permission controls to enforce data security.

# Metabase - [#16062 (closed)](https://dev.miaplaza.com/miaplaza/website/-/issues/16062)

## Overview

We are going to explore the features of **[Metabase](https://www.metabase.com/)**, proposed by @rkloob, and determine whether it is the tool we are looking for. It is currently set up and working on a local developer machine. The goal is to investigate what it can do and present it to the analytics department. Based on the feedback, we will decide whether to move forward with it or find something else.

## Conclusion

On April 3rd, @rkloob presented a comprehensive demonstration of our locally hosted Metabase instance to @dhansen. The presentation covered a thorough walkthrough of the administrative settings, the intuitive UI/UX design, and the extensive range of features offered by the tool. Notably, Raad highlighted the graphical (GUI) query builder and the potential integrations within Metabase, such as Slack integration.

Additionally, we collaborated on creating several queries, including one specific request David had made prior to our meeting. This request involved:

```!
Generating a stacked column or area graph depicting the weekly new student account registrations across four sites, explicitly excluding accounts that never log in due to incomplete payment processes during registration.
```
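For reference, a query answering this request might look roughly like the sketch below. The table and column names (`students`, `site`, `created_at`, `last_login_at`) are assumptions for illustration, not our actual schema:

```sql
-- Sketch only (assumed schema): weekly new student registrations per site,
-- excluding accounts that never logged in.
SELECT
    date_trunc('week', created_at) AS registration_week,
    site,                          -- one of the four sites
    count(*)                       AS new_students
FROM students
WHERE last_login_at IS NOT NULL    -- drop accounts that never logged in
GROUP BY 1, 2
ORDER BY 1, 2;
```

Metabase can then render the result as a stacked bar or area chart with `site` as the series.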
David expressed his approval of Metabase and its capabilities, granting us the **go-ahead** to continue its implementation. This approval was contingent upon addressing certain issues he had previously raised, which primarily concern data management and integration challenges:

1. The need for real-time data analytics, or at least ensuring the most up-to-date data is accessible.
2. The ability to query or join data from different databases seamlessly.
3. Integrating data from the various tools we utilize, including Braintree and payment data.

It's important to note that the resolution to these challenges lies not within Metabase itself but in the effective use of a **data warehousing solution**. This approach will allow us to overcome the obstacles above and enhance our data-handling capabilities. Consequently, these issues will be addressed as part of our ongoing data warehousing initiative.

**In conclusion**, our discussions and demonstrations affirm Metabase's suitability for our analytics needs; we did not identify any current limitations or restrictions that might hinder its future use. This tool promises significant benefits for our business team and anyone in need of comprehensive analytics, reinforcing our decision to proceed with its deployment.

# Data Warehouse - [#16079](https://dev.miaplaza.com/miaplaza/website/-/issues/16079)

## Overview

Our objective is to explore and assess data warehousing solutions that can effectively tackle the integration and management obstacles identified in the Metabase demonstration, referenced in the previous issue.

Presently, our operational infrastructure comprises multiple databases, each optimized for OLTP (Online Transactional Processing). While this setup is efficient for transactional operations, it falls short in facilitating comprehensive analytics and real-time data insights. The solution we seek must bridge this gap, enabling seamless data consolidation from varied sources and supporting advanced analytics capabilities. This involves ensuring compatibility with real-time data analytics, facilitating cross-database queries, and integrating data from external tools such as Braintree, Airtable, QuickBooks, Google Analytics, advertising platforms, and a few other smaller sources.

The ultimate goal is to enhance our decision-making processes by providing a unified, up-to-date view of our data landscape. This will maximize the utility of Metabase for our analytics needs and ensure a robust analytics environment.

## Discussions ( @rkloob, @leo42 )

### Data Sources

- Production DBs (Main + WriteMostly)
  - Miacademy
    - Main: schemas `miacademy`, `meta`
    - WriteMostly: only 1 schema
  - AI/CD
  - MiaPrep

| Data source           | Priority  | Notes                                |
|-----------------------|-----------|--------------------------------------|
| Production DBs        | Very High | `3 x 2 = 6 DBs (Main + WriteMostly)` |
| Braintree             | Medium    | Integration details pending          |
| Google Analytics      | Low       | To be evaluated                      |
| Airtable              | Low       | To be evaluated                      |
| QuickBooks            | Very Low  | Limited integration planned          |
| Advertising Platforms | Very Low  | Limited integration planned          |

### Data Warehousing Cloud Solutions

As we consider integrating a cloud-based data warehousing solution, it's important to evaluate the key options available on the market. The selection will depend on several factors, including cost, scalability, performance, and ease of integration with existing tools and data sources. Below, we discuss three leading cloud data warehouse solutions:

#### 1. **Amazon Redshift**

- **Overview:** Amazon Redshift is a fully managed, petabyte-scale data warehouse service by AWS. It is designed for large-scale data collection and analytics and integrates well with various data sources and business intelligence tools.
- **Key Features:**
  - Automated data management and scaling.
  - Supports SQL-based querying and advanced analytics capabilities.
  - Secure data encryption and compliance features.
- **Pricing:** Redshift charges on a per-hour basis depending on the types and number of nodes in your cluster, making it cost-effective for both small and large-scale operations.

#### 2. **Google BigQuery**

- **Overview:** BigQuery is a serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility. It is part of the Google Cloud Platform.
- **Key Features:**
  - Serverless data warehouse that provides flexible and scalable data storage.
  - Strong integration with AI and machine learning capabilities.
  - Real-time analytics with the capability to handle streaming data.
- **Pricing:** BigQuery offers a pay-as-you-go pricing model based on the amount of data processed by queries and the data stored in the warehouse.

#### 3. **Snowflake**

- **Overview:** Snowflake is a cloud data platform that provides a data warehouse-as-a-service and supports various cloud providers, including AWS, Azure, and Google Cloud.
- **Key Features:**
  - Unique architecture that separates compute from storage, allowing each to scale independently.
  - Secure data sharing across Snowflake users without moving data.
  - Supports multi-cloud environments, which can be beneficial for redundancy and geo-redundancy.
- **Pricing:** Snowflake offers various pricing tiers based on usage, with separate storage and compute charges that can be tailored to the size and demand of the business.

#### Considerations for Selection:

- **Integration Capability:** The chosen solution should integrate seamlessly with existing databases and external tools like Braintree, Airtable, and Google Analytics.
- **Scalability & Performance:** The ability to scale resources up or down based on demand is crucial, as is the performance for processing large volumes of data.
- **Security & Compliance:** Ensuring data is secure and compliant with industry regulations is vital, especially when dealing with sensitive financial or personal data.
- **Cost Efficiency:** Considering the total cost of ownership, including data storage, processing, and transfer fees, is important for budget planning.

By evaluating these factors, we can select the data warehousing solution that best aligns with our technical and business requirements.

#### Pricing Comparison of Data Warehousing Solutions

Below is a simplified pricing structure for each of the three major data warehousing solutions. These values are based on common scenarios and may vary depending on specific use cases and configurations.
| Feature                          | Amazon Redshift                 | Google BigQuery                       | Snowflake                             |
| -------------------------------- | ------------------------------- | ------------------------------------- | ------------------------------------- |
| **Base Price**                   |                                 |                                       |                                       |
| Storage (per TB per month)       | $25 (on-demand)                 | $20 (active), $10 (long-term storage) | $23 (on-demand)                       |
| Compute                          | $0.25 per node-hour (dc2.large) | $5 per TB processed by queries        | $2 per hour (small virtual warehouse) |
| **Typical Small Business Setup** |                                 |                                       |                                       |
| Monthly Storage (1 TB)           | $25                             | $20                                   | $23                                   |
| Monthly Compute (50 hours)       | $12.50                          | Based on query cost                   | $100                                  |
| **Typical Enterprise Setup**     |                                 |                                       |                                       |
| Monthly Storage (50 TB)          | $1,250                          | $1,000                                | $1,150                                |
| Monthly Compute (1000 hours)     | $250                            | Based on query cost                   | $2,000                                |
| **Data Transfer**                |                                 |                                       |                                       |
| Outbound (per GB)                | $0.02                           | $0.12                                 | $0.02                                 |

#### Key Considerations:

- **Amazon Redshift:** Ideal for those who prefer a straightforward, predictable pricing model with the capability to handle large, complex queries efficiently.
- **Google BigQuery:** Costs depend heavily on query volume and complexity, which can be economical for businesses with variable query loads but potentially costly for very heavy use.
- **Snowflake:** Offers more flexibility in compute and storage scaling, potentially reducing costs for companies with fluctuating workloads. Charges separately for compute and storage, providing clear visibility into resource consumption.

---

**Questions that need answering:**

- Which data sources are highest priority for integration into the data warehouse?
- Should we opt for a cloud-based data warehousing solution or build our own, considering cost, complexity, scalability, and timeline?
- What resources are required to deliver this project, and is additional support from the infrastructure team necessary?
- When can we expect this project to be delivered?

## Technical Plan ( @rkloob, @leo42 )

### ETL (Extract, Transform, Load)

General considerations:

* How often do we want to apply changes? Look into CDC and PeerDB.
* Scheduling: there are ETL pipeline/scheduling tools (what would we get here?); we could also use k8s CronJobs or GitLab CI.

**Extract:**

* We do not want to query the production database directly.
* Set up a read replica?

**Transform:**

* We should evaluate existing ETL frameworks for this vs. custom scripts
  * e.g. dbt Core. What is the cloud situation? AWS Glue?
* We probably want code (not a no-code solution) that is portable and not tied to a specific cloud.
* Remove PII / anonymize data.
* Simple transform for the PoC: merge the production DBs, adding a website/schema/db column (see the sketch after the Transformation section below).
* We should push to include only specific tables (instead of all tables) to keep the data volume low. For that we probably need to know the common analytic queries.
* Where does this run?
* What is the result of this? A Postgres table?

**Load:**

* The choice of warehouse does not seem that important, as Metabase can connect to a lot of solutions.

---

### 1. Extraction

- **Database Connections**:
  - **Concurrency Management**: Set up read replicas or extract during off-peak hours.
  - **Data Integrity**: Use transactions or consistent snapshots.

### 2. Transformation

- **Data Cleaning**: To ensure the data is free from errors or inconsistencies:
  - Handling null values
  - Standardizing date formats
  - Etc.
- **Data Transformation**: Depending on our data warehouse schema, we might need to:
  - Normalize or denormalize data.
  - Merge fields.
  - Convert data types.
  - Etc.
- **Performance Optimization**: Use tools that can handle large datasets efficiently (*still needs investigation*):
  - Apache Spark
  - Talend
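As a minimal sketch of the PoC transform mentioned above (merging the production DBs into one table, tagging each row with its source site, and dropping PII), assuming hypothetical per-site schemas and column names:

```sql
-- Sketch only (assumed schemas/tables): union the per-site user tables into
-- one analytics table, tag each row with its source site, and keep only
-- non-PII columns (no email/name fields).
CREATE TABLE dw.users AS
SELECT id, created_at, account_type, 'miacademy' AS site
FROM miacademy.users
UNION ALL
SELECT id, created_at, account_type, 'aicd' AS site
FROM aicd.users
UNION ALL
SELECT id, created_at, account_type, 'miaprep' AS site
FROM miaprep.users;
```

In dbt Core this would live in a model file using `{{ source() }}` references instead of hard-coded schema names, keeping the SQL portable across warehouses.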
### 3. Load

- **Batch loading vs. streaming**: load data in batches or stream it directly:
  - Batch loading: simpler and less resource-intensive, but less up-to-date.
  - Streaming: more complex, but provides real-time data.
- ?Error Handling: recovery processes to manage failures during data loads.
- ?Data Validation: post-load, ensure that the data in the warehouse is accurate and complete.

### 4. Tools Investigation

- **Apache Airflow**:
  - Airflow is a platform to programmatically author, schedule, and monitor workflows. It is designed under the principle of **"configuration as code"**.
  - Airflow is an **orchestration tool**, making it easy to schedule and monitor data pipelines.
  - [How to Orchestrate an ETL Data Pipeline with Apache Airflow](https://www.freecodecamp.org/news/orchestrate-an-etl-data-pipeline-with-apache-airflow/)
- ~~Apache NiFi~~:
  - NiFi is really a tool for moving data around; you can enrich individual records, but it is typically described as **ETL** with a small **T**.
  - NiFi is a **no-code** approach.
- ~~Talend~~:
  - Provides a graphical workspace that allows users to build data integration processes in a **drag-and-drop** manner.
  - Talend is a **low-code** tool.

---

### Change Data Capture (CDC)

A software architecture that allows for **detecting changes/mutations** to a database, **capturing** them, and **sending** them to a **downstream** system or processor.

**Types:**

1. **Log-Based CDC** :star:
   - Reads database transaction logs to detect changes (e.g. the **Postgres WAL**); see the sketch at the end of this section.
   - Efficient, with minimal performance impact.
2. **Trigger-Based CDC**
   - Uses database triggers to capture changes in tables.
   - Straightforward, but may impact performance.
3. **Timestamp-Based CDC (Metadata-Based)**
   - Relies on timestamp columns to identify modified rows.
   - Simple, but requires accurate timestamp fields.
4. **Diff-Based CDC (Snapshot-Based)**
   - Compares current and previous data snapshots to detect changes.
   - Effective, but resource-intensive for large datasets.

**Usage:**

- Sync data across different databases.
- Stream processing based on data changes (an alternative to event-driven approaches).
- Invalidate or update caches.

```!
When we want to extract data from a source system to a downstream system: do we want something near real-time?

If yes, then we need to capture data using CDC or an event-driven solution:
* With an event-driven solution, there will be backend changes in order to send data from the backend to a message queue in parallel to the database.
* With CDC, on the other hand, we can avoid changes to the backend: we use the Postgres WAL and push the changes to an append-only event streaming system (e.g. Apache Kafka).

If no, we can go for poll-based batch processing, but the batch job puts extra load on the database and ends up with stale data anyway. If you try to run the job more frequently to compensate, you increase the burden. It's a catch-22 that can't be solved by batch ETL.
```

**CDC Open-source Tools:**

- PeerDB
- Debezium
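To make the log-based option concrete, here is a rough sketch of what enabling logical decoding on PostgreSQL for a CDC consumer involves; the publication, slot, and table names are placeholders:

```sql
-- Sketch only: enable logical decoding and expose changes for a CDC consumer.
ALTER SYSTEM SET wal_level = 'logical';          -- requires a server restart

-- Publish the tables the warehouse needs (placeholder table names).
CREATE PUBLICATION analytics_pub FOR TABLE users, registrations;

-- A CDC tool (e.g. Debezium, PeerDB) creates/uses a replication slot like this:
SELECT pg_create_logical_replication_slot('analytics_slot', 'pgoutput');
```

Debezium's PostgreSQL connector and PeerDB both consume changes through this mechanism, so the extra load on the source database is essentially reading the WAL stream.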
---

### Draft Plan #1

**Draft plan:**

1. Set up DB read replicas (meeting with Pgl, Omar & James).
2. Assign a Data Integration cluster or a server in the infra.
3. Initialize an orchestration tool that will orchestrate the whole ETL pipeline.
4. Initialize a CDC extraction tool and connect it to our production read-replica DBs (enable WAL logging & check disk usage).
5. Integrate the CDC tool with our orchestration tool.
6. Write transformation scripts using dbt and Python.
   - Combine production DBs.
   - Remove PII.
   - Data formatting.
7. Load the data into the DW using the orchestration tool.

**Tool Plan:**

- Orchestration tool:
  - Prefect
  - Apache Airflow
- Transformation tool:
  - dbt Core
- Extraction tool (CDC):
  - Debezium
  - PeerDB
  - Manually written CDC
- Data warehouse:
  - Postgres DB
  - Snowflake

---

### Draft Plan #1 but more solid

**Why would we need a DB read replica?**

To ensure that our operational transactional database (OLTP) remains unaffected by analytical queries, it's beneficial to distribute the workload by query type (transactional vs. analytical). A database read replica is a practical way to access production data without compromising the production database's performance.

**How does the DB read replica avoid affecting the production DB? (read-replica implementation)**

- **Streaming Replication (Log Shipping):** This is the primary replication method in PostgreSQL. In streaming replication, WAL data is sent continuously to the replica servers as soon as it is generated, without waiting for the segment files to be completed. This near real-time replication minimizes the lag between the primary and replica databases. (A brief setup sketch follows this discussion.)
- **Asynchronous vs. Synchronous Replication:**
  - **Asynchronous**: By default, PostgreSQL uses asynchronous replication. In this setup, the primary does not wait for an acknowledgment from the replica that the WAL data has been received and applied. This minimizes the impact on the primary database's performance, as it can continue processing new transactions without delay.
  - **Synchronous**: Optionally, synchronous replication can be configured for scenarios where data loss cannot be tolerated. Here, the primary waits for at least one replica to confirm receipt and application of the WAL data before committing a transaction. This can affect transaction latency on the primary.
- **Impact on the Primary Database:** Even with asynchronous replication, there is a minimal impact on the primary database:
  - **Network and Disk I/O:** Shipping WAL files consumes network bandwidth and disk I/O. Given modern network infrastructure and disk systems, this is generally **minimal** and manageable. Network usage depends on the rate of change in the database (i.e., the size of the WAL data generated).
  - **CPU Overhead:** The CPU overhead associated with WAL generation is part of normal database operations and is not significantly increased by replication itself. The additional CPU load for managing replication connections and coordinating the shipping process is **generally low**.
- **Conclusion:** The primary database's performance impact from a PostgreSQL read replica using streaming replication is generally low, particularly with asynchronous replication. This setup ensures that the primary database can continue high-performance operations with minimal delay, making it an efficient strategy for scaling read operations and managing heavy analytical workloads.

> If we implement synchronous replication and overload the capacity of the replica, production will need to wait for the secondary; the documentation notes this consideration. Even without synchronous replication, resource consumption can affect performance in some cases. Say we replicate to a secondary on the same SAN: WAL replication plus query I/O may saturate the shared storage bandwidth, and the primary isn't going to perform well if its storage is maxed out. We can prevent this problem with capacity planning and by not sharing storage or compute systems.
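A brief sketch of the primary-side setup and lag monitoring for such a streaming replica is below; the role name, slot name, and setting values are assumptions. In addition, the replica would be seeded with `pg_basebackup`, pointed at the primary via `primary_conninfo`, and allowed to connect through a `replication` entry in `pg_hba.conf`:

```sql
-- Sketch only: primary-side preparation for a streaming (physical) read replica.
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'change-me';
ALTER SYSTEM SET wal_level = 'replica';           -- default on modern Postgres
ALTER SYSTEM SET max_wal_senders = 5;             -- example value
SELECT pg_create_physical_replication_slot('analytics_replica');

-- Replication is asynchronous by default; lag can be watched on the primary:
SELECT client_addr, state, sent_lsn, replay_lsn, replay_lag
FROM pg_stat_replication;
```

Monitoring `replay_lag` in `pg_stat_replication` tells us how far the analytics replica is behind the primary.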
**Sources:**

- [Postgres Replication Documentation](https://www.postgresql.org/docs/current/different-replication-solutions.html)
- [Postgres Replication Benchmark](https://www.timescale.com/blog/scalable-postgresql-high-availability-read-scalability-streaming-replication-fb95023e2af/)

---

### PostgreSQL Backup Strategies

> There are always two parts to a backup strategy: backup AND recovery

**SQL dump**

fill info here

**File system level backup**

fill info here

**Point-in-Time Recovery (PITR):**

PostgreSQL Point-in-Time Recovery (PITR) is a technique that enables database administrators (DBAs) to restore a PostgreSQL database to a specific time or state by using archived transaction logs, known as Write-Ahead Logs (WAL files). These logs capture all database changes following a full backup, allowing DBAs to recover data to any moment since the last backup, even if the backup itself is outdated. This capability is crucial for minimizing data loss or corruption. However, PITR can be time-consuming and may lead to significant downtime, especially with large datasets. To mitigate this, setting up a **delayed standby via streaming replication** is recommended, which can streamline the recovery process in extensive data environments.

![image](https://hackmd.io/_uploads/HJA90uomR.png)

1. **pgBackRest**
   A sophisticated PostgreSQL backup and restore solution that supports high-performance features like parallel backup and restore, delta restore, and Point-in-Time Recovery. It offers robust capabilities for managing large and complex databases.
2. **WAL-E**
   A utility for managing PostgreSQL WAL files. It is designed to perform continuous archiving of PostgreSQL WAL files and base backups. WAL-E supports storing backups in various remote storage services like AWS S3, making it an excellent tool for disaster recovery.
3. **pg_basebackup**
   A built-in PostgreSQL tool that creates a base backup of a PostgreSQL database cluster. While it doesn't directly support PITR, it can be used in conjunction with manual WAL file management or other tools to set up PITR.
4. **pg_dump and pg_restore**
   While pg_dump and pg_restore are primarily used for logical backups and restores, they are essential components in a broader backup strategy. They can create backups that are not tied to the physical format of the data storage, which is useful for migrating data across systems with different storage configurations.
5. **PostgreSQL Continuous Archiving**
   PostgreSQL's built-in continuous archiving functionality allows WAL files to be archived to a specified location as they are completed. This feature is fundamental to implementing PITR, as it ensures a continuous history of all changes that can be replayed to recover the database to any point in time. (A configuration sketch follows below.)
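As a rough configuration sketch of continuous archiving for PITR (the archive destination is a placeholder; recovery additionally requires a base backup, a `restore_command`, and a `recovery.signal` file):

```sql
-- Sketch only: enable continuous WAL archiving, the basis for PITR.
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET archive_mode = 'on';                        -- restart required
ALTER SYSTEM SET archive_command = 'cp %p /backups/wal/%f';  -- placeholder destination

-- For recovery: restore a base backup, set restore_command and
-- recovery_target_time (e.g. '2024-05-01 09:00:00+00') in postgresql.conf,
-- create a recovery.signal file, and start the server.
```

Tools like pgBackRest and WAL-E plug into this same archiving mechanism, adding compression, remote storage, and parallelism.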