W3
====
## Data Platforms, Data Stores, and Security
### 1. Architecting the Data Platform
* Layers of a Data Platform Architecture:
#### Data Ingestion or Data Collection Layer
* Connect to data sources.
* Transfer data from data sources to the data platform in streaming and batch modes.
* Maintain information about the data collected in the metadata repository.
* Tools for Data Ingestion (see the sketch after this list):
* Google Cloud Dataflow
* AWS Kinesis
* Kafka
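For reference, the sketch below shows what streaming ingestion into Kafka can look like from Python. It assumes the `kafka-python` package, a broker at `localhost:9092`, and a made-up topic name; none of these are prescribed by the notes above.

```python
# A minimal streaming-ingestion sketch using the kafka-python client.
# Assumptions: a Kafka broker reachable at localhost:9092 and a hypothetical
# topic named "vehicle-telemetry"; adjust both for your own setup.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# Simulate a sensor emitting one reading per second.
for reading_id in range(5):
    event = {"id": reading_id, "temperature_c": 21.5, "ts": time.time()}
    producer.send("vehicle-telemetry", value=event)  # asynchronous send
    time.sleep(1)

producer.flush()  # block until all buffered records are delivered
```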
#### Data Storage and Integration Layer
* Store data for processing and long-term use.
* Transform and merge extracted data, either logically or physically.
* Make data available for processing in both streaming and batch modes.
* The storage layer needs to be:
* Reliable
* Scalable
* High-Performing
* Cost-Efficient
#### Data Processing Layer
* Read data in batch or streaming modes from storage and apply transformations.
* Support popular querying tools and programming languages.
* Scale to meet the processing demands of a growing dataset.
* Provide a way for analysts and data scientists to work with data in the data platform.
* Transformation Tasks:
* Structuring
* Actions that change the form and schema of the data.
* Normalization
* Cleaning the database of unused data and reducing redundancy and inconsistency.
* Denormalization
* Combining data from multiple tables into a single table so that it can be queried more efficiently (see the sketch at the end of this layer's description).
* Storage and Processing may not always be performed in separate layers.
* Storage and Processing can occur in the same layer.
* Data can first be stored in the Hadoop Distributed File System, or HDFS, and then processed in a data processing engine like Spark.
* The data processing layer can also precede the data storage layer, with transformations applied before the data is loaded, or stored, in the database.
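A minimal sketch of the structuring and denormalization tasks listed above, assuming pandas and two invented tables; real schemas will differ.

```python
# Denormalization sketch: combine two hypothetical normalized tables into one
# wide table for faster querying. Table and column names are made up.
import pandas as pd

customers = pd.DataFrame(
    {"customer_id": [1, 2], "city": ["Austin", "Denver"]}
)
orders = pd.DataFrame(
    {"order_id": [10, 11, 12], "customer_id": [1, 1, 2], "amount": [250.0, 90.5, 42.0]}
)

# Denormalize: join the two tables so reports can run against a single table.
orders_wide = orders.merge(customers, on="customer_id", how="left")
print(orders_wide)
```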
#### Analysis and User Interface Layer
* Querying tools and programming languages.
* APIs that can be used to run reports on data for both online and offline processing.
* APIs that can consume data from the storage in real-time for use in other applications and services.
* Dashboarding and Business Intelligence applications.
#### Data Pipeline Layer
* Overlaying the Data Ingestion, Data Storage and Integration, and Data Processing layers is the Data Pipeline Layer, with its Extract, Transform, and Load (ETL) tools.
* This layer is responsible for implementing and maintaining a continuously flowing data pipeline.
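As a rough picture of the extract-transform-load flow this layer maintains, here is a toy skeleton. File names, columns, and the cleaning rule are assumptions for illustration, not a reference to any particular ETL tool.

```python
# A toy extract-transform-load skeleton illustrating the pipeline layer.
# Real pipelines add scheduling, monitoring, and error handling around these steps.
import csv


def extract(path):
    """Read raw rows from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(rows):
    """Apply a simple cleaning rule: drop rows with a missing price."""
    return [row for row in rows if row.get("price")]


def load(rows, path):
    """Write the transformed rows to a target CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Create a tiny raw file so the sketch runs end to end on its own.
    with open("raw_sales.csv", "w", newline="") as f:
        f.write("dealer,price\nAcme Motors,1500\nBest Autos,\n")

    load(transform(extract("raw_sales.csv")), "clean_sales.csv")
```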
### 2. Factors for Selecting and Designing Data Stores
* A data store, or data repository, is a general term used to refer to data that has been collected, organized, and isolated so that it can be:
* Used for business operations.
* Mined for reporting and data analysis.
* Considerations for designing a Data Store
* A repository can be:
* Database
* Data Warehouse
* Data Mart
* Big Data Store
* Data lake
* Primary considerations for designing a data store:
* Type of data
* There are multiple types of databases and selecting the right one is a crucial part of designing.
* A database is essentially a collection of data designed for:
* Input
* Storage
* Search and Retrieval
* Modification
* Volume of data
* Data lake
* Store large volumes of raw data in its native format, straight from its source.
* Store both relational and non-relational data at scale without defining the data's structure and schema
* Big Data Store
* Store data that is high-volume, high-velocity, of diverse types, and that needs distributed processing for fast analytics.
* Big Data Stores split large files across multiple computers allowing parallel access to them.
* Computations run in parallel on each node where data is stored.
* Intended use of data
* Key points to consider when selecting or designing a database:
* Number of Transactions
* Frequency of Updates
* Type of Operations
* Response Time
* Backup and Recovery
* Transactional Systems, used for capturing high-volume transactions, need to be designed for high-speed read, write, and update operations.
* Analytical Systems need complex queries to be applied to large amounts of historical data aggregated from transactional systems. They need fast response times to complex queries.
* Schema design, indexing, and partitioning strategies play a big role in the performance of systems, depending on how the data is used.
* Scalability
* Normalization
* Optimal use of storage space
* Makes database maintenance easier
* Provides faster access to data (a normalization sketch appears at the end of this section).
* Storage considerations
* Performance:
* Throughput
* Rate at which information can be read from and written to the storage.
* Latency
* Time it takes to access a specific location in storage.
* Availability
* Storage solution must enable you to access your data when you need it, without exception. There should be no downtime.
* Integrity
* Data must be safe from corruption, loss, and outside attack.
* Recoverability
* Storage solution should ensure you can recover your data in the event of failures and natural disasters.
* Privacy, Security, and Governance needs
* A secure data strategy is a layered approach.
* Access Control
* Multizone Encryption
* Data Management
* Monitoring Systems
* Data protection regulations:
* General Data Protection Regulation (GDPR)
* California Consumer Privacy Act (CCPA)
* Health Insurance Portability and Accountability Act (HIPAA)
* Data needs to be made available through controlled data flow and data management by using multiple data protection techniques.
* Strategies for data privacy, security, and governance regulations need to be part of a data store's design from the start.
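The normalization sketch referenced above: a hypothetical flat sales table is split so that repeated dealer details are stored once and referenced by key. It uses Python's built-in `sqlite3`; the schema is invented.

```python
# Normalization sketch: factor repeated dealer details out of a flat table
# into their own table, referenced by key. Schema and data are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Flat (denormalized) input: dealer details repeat on every sale.
    CREATE TABLE flat_sales (dealer_name TEXT, dealer_city TEXT, price REAL);
    INSERT INTO flat_sales VALUES
      ('Acme Motors', 'Austin', 1500),
      ('Acme Motors', 'Austin', 2400),
      ('Best Autos',  'Denver', 1200);

    -- Normalized target: dealer attributes are stored once.
    CREATE TABLE dealers (dealer_id INTEGER PRIMARY KEY, name TEXT UNIQUE, city TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY,
                        dealer_id INTEGER REFERENCES dealers(dealer_id),
                        price REAL);

    INSERT INTO dealers (name, city) SELECT DISTINCT dealer_name, dealer_city FROM flat_sales;
    INSERT INTO sales (dealer_id, price)
      SELECT d.dealer_id, f.price FROM flat_sales f JOIN dealers d ON d.name = f.dealer_name;
    """
)

# The original view is still available via a join.
for row in conn.execute("SELECT name, city, price FROM sales JOIN dealers USING (dealer_id)"):
    print(row)
```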
### 3. Security
* Enterprise level data platforms and data repositories need to tackle security at multiple levels:
* Physical infrastructure security
* Network security
* Application security
* Data security
* The CIA Triad
* Key components to creating an effective strategy for information security include:
* Confidentiality through controlling unauthorized access
* Integrity through validating that your resources are trustworthy and have not been tampered with.
* Availability by ensuring authorized users have access to resources when they need them.
* Infrastructure security
* Measures to ensure physical infrastructure security:
* Access to the perimeter of the facility based on authentication.
* Round the clock surveillance for entry and exit points of the facility.
* Multiple power feeds from independent utility providers with dedicated generators and UPS battery backup.
* Heating and cooling mechanisms for managing the temperature and humidity levels in the facility.
* Factoring in environmental threats when considering the location of the facility.
* Network security
* Network security is vital to keep interconnected systems and data safe:
* Firewalls to prevent unauthorized access to private networks.
* Network access control to ensure endpoint security by allowing only authorized devices to connect to the network.
* Network segmentation to create silos or virtual local area networks within a network.
* Security protocols to ensure attackers cannot tap into data while it is in transit.
* Intrusion detection and intrusion prevention systems that inspect incoming traffic for intrusion attempts and vulnerabilities.
* Application security
* Application security is critical for keeping customer data private and ensuring applications are fast and responsive.
* Threat modeling to identify relative weaknesses and attack patterns related to the application.
* Secure design that mitigates risks
* Secure coding guides and practices that prevent vulnerabilities.
* Security testing to fix problems before the application is deployed and to validate that it is free of known security issues.
* Data Security
* Data at rest:
* Includes files and objects in storage.
* Stored physically in a database, data warehouse, tapes, offsite backups, and mobile devices.
* Can be protected by encryption (a short encryption sketch follows this list).
* Data in transit:
* Moving from one location to another over the internet.
* Can be protected using encryption methods such as HTTPS, SSL, and TLS.
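The encryption sketch referenced above illustrates protecting data at rest with symmetric encryption via the third-party `cryptography` package (its Fernet recipe). The package choice and file name are assumptions; data in transit is instead protected at the protocol level with TLS/HTTPS.

```python
# A minimal sketch of encrypting data at rest with symmetric encryption.
# Assumes the third-party "cryptography" package; the file name is hypothetical.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, store this in a key-management system
fernet = Fernet(key)

plaintext = b"customer_id,card_last4\n42,1234"
ciphertext = fernet.encrypt(plaintext)  # what actually lands on disk

with open("backup.enc", "wb") as f:
    f.write(ciphertext)

# Later, an authorized process with access to the key can recover the data.
with open("backup.enc", "rb") as f:
    restored = fernet.decrypt(f.read())

assert restored == plaintext
```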
## Data Collection and Data Wrangling
### 1. How to Gather and Import Data
* Gathering data from data sources such as databases, the web, sensor data, data exchanges, and several other sources leveraged for specific data needs.
* Importing data into different types of data repositories.
### 2. Data Wrangling
* Raw data has to undergo a series of transformations and cleansing activities in order to be analytics-ready.
* Data wrangling, or Data munging, is an iterative process that involves:
* Data Exploration
* Transformation
* Validation
* Making data available for credible and meaningful analysis.
* Data Wrangling
* Transformation
* Normalizing data includes:
* Cleaning unused data
* Reducing redundancy
* Reducing inconsistency
* Denormalizing data includes:
* Combining data from multiple tables into a single table for faster querying of data for reports and analysis.
* Cleaning Data:
* Fixing irregularities in data in order to produce a credible and accurate analysis.
* Inspection
* Detecting issues and errors.
* Validating against rules and constraints.
* Profiling data to inspect source data.
* Visualizing data using statistical methods.
* Cleaning
* The techniques you apply for cleaning your dataset will depend on your use case and the type of issues you encounter; a short cleaning sketch follows this list.
* Duplicate data are data points that are repeated in your dataset.
* Need to be removed
* Irrelevant data is data that is not contextual to your use case.
* Data type conversion is needed to ensure that values in a field are stored as the data type of that field.
* Standardizing data is needed to ensure date-time formats and units of measurement are standard across the dataset.
* Syntax errors, such as white spaces, extra spaces, typos, and formats need to be fixed.
* Outliers need to be examined for accuracy and inclusion in the dataset.
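The cleaning sketch referenced above, using pandas. Column names, values, and the outlier rule are invented; the point is only to show where duplicate removal, type conversion, standardization, whitespace fixes, and outlier checks fit.

```python
# A minimal data-cleaning sketch with pandas; data is made up for illustration.
import pandas as pd

df = pd.DataFrame(
    {
        "dealer": [" Acme Motors", "Acme Motors", "acme motors", None],
        "price": ["12000", "12000", "9500", "1000000"],
        "sold_on": ["2021-01-05", "2021-01-05", "2021-02-05", "2021-03-01"],
    }
)

df["dealer"] = df["dealer"].str.strip().str.title()        # fix whitespace and casing
df["price"] = pd.to_numeric(df["price"], errors="coerce")  # data type conversion
df["sold_on"] = pd.to_datetime(df["sold_on"])              # standardize date-time format
df = df.drop_duplicates()                                  # remove duplicate records
df = df.dropna(subset=["dealer"])                          # drop rows missing a key field

# Examine outliers for accuracy rather than silently deleting them.
outliers = df[df["price"] > df["price"].quantile(0.99)]
print(df, outliers, sep="\n\n")
```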
## Summary and Highlights
* Depending on where the data must be sourced from, there are a number of methods and tools available for gathering data. These include query languages for extracting data from databases, APIs, Web Scraping, Data Streams, RSS Feeds, and Data Exchanges.
* Once the data you need has been gathered and imported, your next step is to make it analytics-ready. This is where the process of Data Wrangling, or Data Munging, comes in.
* Data Wrangling involves a whole range of transformations and cleansing activities performed on the data. Transformation of raw data includes the tasks you undertake to:
* Structurally manipulate and combine data using Joins and Unions.
* Normalize data, that is, clean the database of unused and redundant data.
* Denormalize data, that is, combine data from multiple tables into a single table so that it can be queried faster.
* Cleansing activities include:
* Profiling data to uncover anomalies and quality issues.
* Visualizing data using statistical methods in order to spot outliers.
* Fixing issues such as missing values, duplicate data, irrelevant data, inconsistent formats, syntax errors, and outliers.
* A variety of software and tools are available for the data wrangling process. Some of the popularly used ones include Excel Power Query, Spreadsheets, OpenRefine, Google DataPrep, Watson Studio Refinery, Trifacta Wrangler, Python, and R, each with their own set of features, strengths, limitations, and applications.
## Querying Data, Performance Tuning, and Troubleshooting
### 1. Querying and Analyzing Data
* Counting
* count()
* Counting the number of rows of data, or records, in the data set.

* distinct
* Displaying the unique car dealers in the data set.

* Counting the total number of unique, or distinct, car dealers.

* Aggregation
* Aggregation functions help to provide an overview of the data set from different perspectives.
* sum()
* Calculating the sum of a numeric column.

* avg()
* Calculating the average value of a numeric column.

* stddev()
* Calculating the standard deviation to see how spread out the cost of a used car is.

* Extreme Value Identification
* Identifying extreme values in a data column.
* max()
* Calculating the maximum value in a column.

* min()
* Calculating the minimum value in a column.

* Slicing Data
* Finding customers based on a specific condition or set of conditions.
* Slicing the data set to retrieve data for customers who:
* Live in a certain area.
* Have purchased their car from dealers in a specific area.
* Have spent between USD 1000-2000 for their car.
* Have spent between USD 1000-2000 for their car and live in a specific area.

* Sorting Data
* Sorting data helps to arrange data in a meaningful order, making it easier to understand and analyze.
* Sorting the data set on date of purchase to see if more cars are purchased on festival days.

* Filtering Patterns
* Filtering patterns help to perform partial matches of data values.
* Equal To Operator returns records in which a data value matches a certain value.

* Like Operator helps specify a pattern to return records that match a data value partially.

* Grouping Data
* Grouping data based on a commonality.
* Total amount spent by customers, pincode-wise.
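The querying patterns above map directly to SQL. The sketch below runs them against a tiny in-memory SQLite table; the schema and values are invented, and since SQLite has no built-in stddev() aggregate, the standard deviation is computed client-side (most warehouses expose STDDEV directly).

```python
# A compact sketch of the querying patterns above, run with Python's built-in
# sqlite3 module against a hypothetical used-car sales table.
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (dealer TEXT, city TEXT, price REAL, purchase_date TEXT, pincode TEXT);
    INSERT INTO sales VALUES
      ('Acme Motors', 'Austin', 1500, '2021-11-04', '73301'),
      ('Best Autos',  'Austin', 2400, '2021-11-04', '73301'),
      ('Acme Motors', 'Denver', 1200, '2021-12-25', '80014');
    """
)


def run(query, *params):
    """Execute a query and return all rows."""
    return conn.execute(query, params).fetchall()


print(run("SELECT COUNT(*) FROM sales"))                              # counting rows
print(run("SELECT DISTINCT dealer FROM sales"))                       # unique dealers
print(run("SELECT COUNT(DISTINCT dealer) FROM sales"))                # number of unique dealers
print(run("SELECT SUM(price), AVG(price) FROM sales"))                # aggregation
print(run("SELECT MIN(price), MAX(price) FROM sales"))                # extreme values
print(run("SELECT * FROM sales WHERE price BETWEEN 1000 AND 2000 AND city = ?", "Austin"))  # slicing
print(run("SELECT * FROM sales ORDER BY purchase_date"))              # sorting
print(run("SELECT * FROM sales WHERE dealer LIKE 'Acme%'"))           # filtering with a pattern
print(run("SELECT pincode, SUM(price) FROM sales GROUP BY pincode"))  # grouping

# SQLite has no stddev() aggregate; compute the spread of prices client-side.
prices = [row[0] for row in run("SELECT price FROM sales")]
print(statistics.stdev(prices))
```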

### 2. Performance Tuning and Troubleshooting
* One of a data engineer's responsibilities is to monitor and optimize the performance and availability of systems.
* A data pipeline typically runs with a combination of complex tools and can face several different types of performance threats:
* Scalability, in the face of increasing data sets and workloads.
* Application failures.
* Scheduled jobs not functioning accurately.
* Tool incompatibilities.
* Data pipelines
* Performance Metrics
* Resource utilization and utilization patterns.
* Traffic
* Number of user requests received in a given period.
* Troubleshooting
* Collect information about the incident to ascertain if the observed behavior is an issue.
* Check if you're working with all the right versions of software and source codes.
* Check the logs and metrics early on in your troubleshooting process to isolate whether an issue is related to infrastructure, data, software, or a combination of these.
* Reproduce the issue in a test environment. This can be an iterative and time-consuming process.
* Database Optimization for Performance
* Performance Metrics for Databases:
* System outages
* Capacity utilization
* Application slowdown
* Performance of queries
* Conflicting activities and queries being executed simultaneously.
* Batch activities causing resource constraints.
* Capacity Planning
* Determining the optimal hardware and software resources required for performance.
* Database Indexing
* Locate data without searching each row in the database, resulting in faster querying (see the indexing sketch at the end of this section).
* Database Partitioning
* Dividing large tables into smaller, individual tables, improving performance and data manageability.
* Database Normalization
* Reducing inconsistencies arising out of data redundancy and anomalies arising out of update, delete, and insert operations on databases.
* Monitoring Systems
* Monitoring and alerting systems help us collect quantitative data about our systems and applications in real time.
* These systems give us visibility into the performance of our data pipelines, platforms, databases, applications, tools, queries, scheduled jobs, and more.
* Database Monitoring Tools
* Take frequent snapshots of the performance indicators of a database.
* This helps to:
* Track when and how a problem started to occur.
* Isolate and get to the root of the issue.
* Application performance management tools
* Measure and monitor the performance of applications and amount of resources utilized by each process. This helps in proactive allocation of resources to improve application performance.
* Tools for Monitoring Query Performance
* Gather statistics about query throughput, execution, performance, resource utilization and utilization patterns for better planning and allocation of resources.
* Job-level runtime monitoring
* Break up a job into a series of logical steps which are monitored for completion and time to completion.
* Monitoring the amount of data being processed
* Helps to assess whether the size of the workload moving through the pipeline could be slowing down the system.
* Maintenance Schedules
* Preventive maintenance routines generate data that we can use to identify systems and procedures responsible for faults and low availability.
* These routines can be:
* Time-based
* Planned as scheduled activities at fixed time intervals.
* Condition-based
* Performed when there is a specific issue or when a decrease in performance has been noted or flagged.
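The indexing sketch referenced earlier: with Python's built-in `sqlite3`, the query planner output shows a lookup switching from a full table scan to an index search once an index is created. The table and data are invented.

```python
# A small sketch of database indexing with sqlite3. The schema is hypothetical;
# the query plan shows the lookup switching from a full scan to an index search.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, dealer TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales (dealer, price) VALUES (?, ?)",
    [(f"dealer_{i % 100}", float(i)) for i in range(10_000)],
)

query = "SELECT COUNT(*) FROM sales WHERE dealer = 'dealer_7'"

print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # full table scan

conn.execute("CREATE INDEX idx_sales_dealer ON sales (dealer)")

print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # search using the index
```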
## Governance and Compliance
### 1. Governance and Compliance
* Data that needs Governance
* Personal
* Personal Information (PI) and Sensitive Personal Information (SPI)
* Can be traced back to an individual.
* Can be used to identify an individual.
* Can be used to cause harm to an individual.
* Industry-specific regulations:
* Health Insurance Portability and Accountability Act (HIPAA) for Healthcare.
* Payment Card Industry Data Security Standard (PCI DSS) for Retail.
* Sarbanes-Oxley (SOX) for Finance.
* Compliance
* Compliance covers the processes and procedures through which an organization adheres to regulations and conducts its operations in a legal and ethical manner.
* Establish controls and checks in order to comply with regulations.
* Maintain a verifiable audit trail to establish adherence to regulations.
* Compliance is an ongoing process requiring a blend of:
* People
* Process
* Technology
* Data Lifecycle
* Governance regulations require enterprises to know their purpose and maintain transparency in their actions at each step of the data lifecycle.
* In the Data Acquisition Stage, you need to:
* Identify data that needs to be collected and the legal basis for procuring the data.
* Establish the intended use of the data and publish it as a privacy policy.
* Identify the amount of data you need to meet your defined purposes.
* In the Data Processing Stage, you need to:
* Flesh out details of how exactly you are going to process personal data.
* Establish your legal basis for the processing of personal data.
* In the Data Storage Stage, you need to:
* Define where you will store the data.
* Establish the specific measures you will take to prevent internal and external security breaches.
* In the Data Sharing Stage, you need to:
* Identify third-party vendors in your supply chain that will have access to the collected data.
* Establish how you will hold third-party vendors contractually accountable to regulations.
* In the Data Retention and Disposal Stages, you need to:
* Define the policies and processes you will follow for the retention and deletion of personal data after a designated time.
* Define how you will ensure deleted data is removed from all locations, including third-party systems.
* Technology as an Enabler
* Today's tools and technologies provide several controls for ensuring organizations comply with governance regulations.
* Authentication and Access Control
* Layered authentication processes
* Combination of passwords, tokens, and biometrics to prevent unauthorized access.
* Authentication systems verify that you are who you say you are.
* Access control systems ensure that authorized users have access to resources, both systems and data, based on their user group and role.
* Encryption and Data Masking
* Encryption converts data to an encoded format that is only legible once it is decrypted via a secure key.
* Encryption of data is available for:
* Data at rest
* Data in transit
* Data Masking provides anonymization of data for downstream processing and pseudonymization of data.
* With anonymization, the presentation layer is abstracted without changing the data in the database itself.
* Pseudonymization of data replaces personally identifiable information with artificial identifiers so that it cannot be traced back to an individual's identity (a short pseudonymization sketch follows at the end of this section).
* Hosting
* On-premise and cloud systems that comply with the requirements and restrictions for international data transfers.
* Monitoring and Alerting
* Security monitoring proactively monitors, tracks, and reacts to security violations across infrastructure, applications, and platforms.
* Monitoring systems provide detailed audit reports that track access and other operations on the data.
* Alerting functionalities flag security breaches as they occur so that immediate remedial actions can be triggered.
* Alerts are based on the severity and urgency level of the breach.
* Data Erasure
* A software-based method of permanently clearing data from a system by overwriting.
* Data erasure prevents deleted data from being retrieved.
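The pseudonymization sketch referenced above approximates the technique with a keyed hash (HMAC): personal identifiers are replaced by artificial identifiers that stay consistent across records but cannot be reversed without the secret key. Field names and key handling are illustrative only, not a compliance recipe.

```python
# A minimal pseudonymization sketch: replace direct identifiers with artificial
# ones derived from a keyed hash (HMAC). Field names and the key source are
# hypothetical; in practice the key would live in a key-management system.
import hashlib
import hmac
import os

SECRET_KEY = os.urandom(32)  # illustrative; a real system persists and protects this key


def pseudonymize(value: str) -> str:
    """Derive a stable artificial identifier from a personal identifier."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


records = [
    {"email": "jane@example.com", "purchase": 1500},
    {"email": "jane@example.com", "purchase": 900},
]

masked = [
    {"customer_pseudo_id": pseudonymize(r["email"]), "purchase": r["purchase"]}
    for r in records
]
print(masked)  # the same person maps to the same pseudo-ID; the email itself is gone
```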