Introduction to Data Engineering WEEK_2

W2
====

## The Data Ecosystem and Languages for Data Professionals

### 1. Overview of the Data Engineering Ecosystem

* The day-to-day work of a data engineer on a data team:
    * Extracting data from disparate sources.
    * Architecting and managing data pipelines for transformation, integration, and storage of data.
    * Architecting and managing data repositories.
    * Automating and optimizing workflows and the flow of data between systems.
    * Developing the applications needed throughout the data engineering workflow.
* Data
    * Types of data:
        * Structured
            * Data that follows a rigid format and can be organized into rows and columns.
        * Semi-structured
            * A mix of data that has consistent characteristics and data that doesn't conform to a rigid structure.
        * Unstructured
            * Complex, mostly qualitative data that cannot be reduced to rows and columns.
            * For example, photos, videos, text files, PDFs, and social media content.
    * Data comes from many different sources and formats:
        * Relational databases
        * Non-relational databases
        * APIs
        * Web services
        * Data streams
        * Social platforms
        * Sensor devices
* Data Repositories
    * Transactional
        * OLTP (Online Transaction Processing) systems
        * Designed to store high-volume day-to-day operational data.
        * Typically relational, but they can also be non-relational.
    * Analytical
        * OLAP (Online Analytical Processing) systems
        * Optimized for conducting complex data analytics.
        * Include relational and non-relational databases, data warehouses, data marts, data lakes, and big data stores.
* Data Integration
    * Combines data from disparate sources into a unified view that users can access to query and manipulate the data.
    * A set of tools and processes that cover the entire journey of data from source to destination systems.

### 2. Types of Data

* Structured Data
    * Has a well-defined structure.
    * Can be stored in well-defined schemas such as databases.
    * Can be represented in a tabular manner with rows and columns.
    * Consists of objective facts and numbers that can be collected, exported, stored, and organized in typical databases.
    * Sources of structured data:
        * SQL databases
        * OLTP (Online Transaction Processing) systems
        * Spreadsheets
        * Online forms
        * Sensors
            * GPS (Global Positioning Systems)
            * RFID (Radio Frequency Identification)
        * Network and web server logs
    * Structured data is data that is well organized in formats that can be stored in databases and lends itself to standard data analysis methods and tools.
* Semi-structured Data
    * Has some organizational properties but lacks a fixed or rigid schema.
    * Cannot be stored in the form of rows and columns as in databases.
    * Contains tags and elements, or metadata, which are used to group data and organize it in a hierarchy.
    * Sources of semi-structured data:
        * E-mails
        * XML and other markup languages
        * Binary executables
        * TCP/IP packets
        * Zipped files
    * XML and JSON allow users to define tags and attributes to store data in a hierarchical form and are widely used to store and exchange semi-structured data.
    * Semi-structured data is data that is somewhat organized and relies on meta tags for grouping and hierarchy.
* Unstructured Data
    * Does not have an easily identifiable structure.
    * Cannot be organized in a mainstream relational database in the form of rows and columns.
    * Does not follow any particular format, sequence, semantics, or rules.
    * Sources of unstructured data:
        * Web pages
        * Social media feeds
        * Images in varied file formats
        * Video and audio files
        * Documents and PDF files
        * PowerPoint presentations
        * Media logs
        * Surveys
    * Unstructured data can be stored in files and documents (such as a Word doc) for manual analysis, or in NoSQL databases that have their own analysis tools for examining this type of data.
    * Unstructured data is data that is not conventionally organized in the form of rows and columns in a particular format.

### 3. Understanding Different Types of File Formats

* As a data engineer, you need at least a working understanding of how the various data formats differ.
* Delimited text files
    * Files used to store data as text.
    * Each line, or row, has values separated by a delimiter.
    * A delimiter is a sequence of one or more characters that marks the boundary between independent entities or values.
    * The most common delimiters are the comma, tab, colon, vertical bar, and space.
    * Comma-separated values (CSV) and tab-separated values (TSV) are the most commonly used file types in this category.
* Microsoft Excel Open XML Spreadsheet, or .XLSX
    * A Microsoft Excel Open XML file format that falls under the spreadsheet file formats. It is an XML-based file format created by Microsoft.
    * An open file format, which means it is generally accessible to most other applications.
    * It can use and save all functions available in Excel.
    * Is considered one of the more secure file formats, as it cannot save malicious code.
* Extensible Markup Language, or XML
    * A markup language with set rules for encoding data.
    * Readable by humans and machines.
    * Self-descriptive language.
    * Platform independent.
    * Programming language independent.
    * Makes it simpler to share data between systems.
* Portable Document Format, or PDF
    * A file format developed by Adobe to present documents independent of application software, hardware, and operating systems.
    * Can be viewed the same way on any device.
    * Frequently used in legal and financial documents.
    * Can also be used to fill in data, such as forms.
* JavaScript Object Notation, or JSON
    * A text-based open standard designed for transmitting structured data over the web.
    * Language-independent data format.
    * Can be read in any programming language.
    * Easy to use.
    * Compatible with a wide range of browsers.
    * One of the best tools for sharing data.

### 4. Sources of Data

* Relational Databases
    * Store data in a structured way.
    * Store structured data that can be leveraged for analysis.
    * For example, data from a retail transactions system can be used to analyze sales in different regions, and data from a customer relationship management system can be used for making sales projections.
* Flat files
    * Store data in plain text format.
    * Each line, or row, is one record.
    * Each value is separated by a delimiter.
    * All of the data in a flat file maps to a single table.
* Spreadsheet files
    * A special type of flat file.
    * Organize data in tabular format.
    * Can contain multiple worksheets.
    * .XLS and .XLSX are common spreadsheet formats.
    * Other formats include Google Sheets, Apple Numbers, and LibreOffice Calc.
* XML files
    * Contain data values that are identified or marked up using tags.
    * Can support more complex data structures, such as hierarchical data.
    * Common uses of XML include data from online surveys, bank statements, and other unstructured data sets.
* APIs and Web Services
    * Listen for incoming requests, which can be web requests from users or network requests from applications, and return data in plain text, XML, HTML, JSON, or media files.
* Web scraping
    * Extracts relevant data from unstructured sources.
    * Also known as screen scraping, web harvesting, and web data extraction.
    * Downloads specific data from web pages based on defined parameters.
    * Can extract text, contact information, images, videos, product items, and much more from a website.
    * Popular uses:
        * Providing price comparisons by collecting product details from retailers, manufacturers, and eCommerce websites.
        * Generating sales leads through public data sources.
        * Extracting data from posts and authors on various forums and communities.
        * Collecting training and testing datasets for machine learning models.
    * Popular web scraping tools:
        * BeautifulSoup
        * Scrapy
        * Pandas
        * Selenium
* Data Streams and Feeds
    * Aggregate constant streams of data flowing from sources such as instruments, IoT devices, applications, GPS data from cars, computer programs, websites, and social media posts.
    * RSS (Really Simple Syndication) feeds capture updated data from online forums and news sites where data is refreshed on an ongoing basis.

### 5. Languages for Data Professionals

* SQL
    * Structured Query Language is a querying language designed for accessing and manipulating information from mostly, though not exclusively, relational databases.
    * Using SQL, you can:
        * Insert, update, and delete records in a database.
        * Create new databases, tables, and views.
        * Write stored procedures.
    * Advantages of using SQL:
        * SQL is portable and can be used independent of the platform.
        * Can be used for querying data in a wide variety of databases and data repositories.
        * Has a simple syntax that is similar to the English language.
        * Its syntax allows developers to write programs with fewer lines of code using basic keywords.
        * Can retrieve large amounts of data quickly and efficiently.
        * Runs on an interpreter system.
* Python
    * Python is a widely used open-source, general-purpose, high-level programming language.
    * Its syntax allows programmers to express their concepts in fewer lines of code.
    * An ideal tool for beginning programmers because of its focus on simplicity and readability.
    * Great for performing computationally heavy tasks on large volumes of data.
    * Has built-in functions for frequently used concepts.
    * Supports multiple programming paradigms: object-oriented, imperative, functional, and procedural.
* Unix/Linux Shell
    * A Unix/Linux shell script is a computer program written for the UNIX shell. It is a series of UNIX commands written in a plain text file to accomplish a specific task.
    * Typical operations performed by shell scripts include:
        * File manipulation.
        * Program execution.
        * System administration tasks such as disk backups and evaluating system logs.
        * Installation scripts for complex programs.
        * Executing routine backups.
        * Running batches.
* PowerShell
    * PowerShell is a cross-platform automation tool and configuration framework by Microsoft that is optimized for working with structured data formats, such as JSON, CSV, and XML, as well as REST APIs, websites, and office applications.
    * It consists of a command-line shell and a scripting language.
    * Is object-based, which makes it possible to filter, sort, measure, group, compare, and perform many more actions on objects as they pass through a data pipeline.
    * Can be used for data mining, building GUIs, and creating charts, dashboards, and interactive reports.

## Data Repositories, Data Pipelines, and Data Integration Platforms

### 1. Overview of Data Repositories

* Databases
    * A collection of data, or information, designed for the input, storage, search and retrieval, and modification of data.
    * DBMS (Database Management System)
        * A set of programs that creates and maintains the database.
    * Querying
        * The function through which you store, modify, and extract information from the database.
    * Even though a database and a DBMS mean different things, the terms are often used interchangeably.
    * Factors governing the choice of database include:
        * Data type
        * Data structure
        * Querying mechanisms
        * Latency requirements
        * Transaction speeds
        * Intended use of data
* Relational Databases
    * Data is organized into a tabular format with rows and columns.
    * Well-defined structure and schema.
    * Optimized for data operations and querying.
    * Use SQL as the standard querying language.
* Non-Relational Databases
    * Built for speed, flexibility, and scale.
    * Data can be stored in a schema-less form.
    * Widely used for processing big data.
* Data Warehouse
    * Consolidates data through the extract, transform, and load process, also known as the ETL process, into one comprehensive database for analytics and business intelligence.
        * Extract data from different data sources.
        * Transform the data into a clean and usable state.
        * Load the data into a data repository.
* Big Data Stores
    * Distributed computational and storage infrastructure to store, scale, and process very large data sets.

### 2. RDBMS

* What is a relational database?
    * A relational database is a collection of data organized into a table structure, where the tables can be linked, or related, based on data common to each.
    * Relational databases use Structured Query Language, or SQL, for querying data.
* Relational databases, by design, are ideal for the optimized storage, retrieval, and processing of large volumes of data:
    * Each table has a unique set of rows and columns.
    * Relationships can be defined between tables.
    * Fields can be restricted to specific data types and values.
    * Can retrieve millions of records in seconds using SQL for querying data.
    * The security architecture of relational databases provides greater access control and governance.
* Advantages of relational databases:
    * Create meaningful information by joining tables.
    * Flexibility to make changes while the database is in use.
    * Minimize data redundancy by allowing relationships to be defined between tables.
    * Offer export and import options that provide ease of backup and disaster recovery.
    * Are ACID compliant, ensuring accuracy and reliability in database transactions.
* Relational databases are well suited for:
    * OLTP (Online Transaction Processing) applications
        * Can support transaction-oriented tasks that run at high rates, accommodate a large number of users, manage small amounts of data, and support frequent queries and fast response times.
    * Data warehouses
        * Can be optimized for online analytical processing (OLAP).
    * IoT solutions
        * Provide the speed and ability to collect and process data from edge devices.
* Limitations of RDBMS:
    * Does not work well with semi-structured and unstructured data.
    * Migration between two RDBMSs is possible only when the source and destination tables have identical schemas and data types.
    * Entering a value greater than the defined length of a data field results in loss of information.

### 3. NoSQL

* NoSQL ("not only SQL", or sometimes "non-SQL") is a non-relational database design that provides flexible schemas for the storage and retrieval of data.
    * Built for specific data models.
    * Has flexible schemas that allow programmers to create and manage modern applications.
    * Does not use a traditional row / column / table database design with fixed schemas.
    * Does not, typically, use Structured Query Language (SQL) to query data.
* Four different types of NoSQL databases:
    * Key-value store
        * Both keys and values can be anything from simple integers or strings to complex JSON documents.
        * Great for storing user session data and user preferences, making real-time recommendations, targeted advertising, and in-memory data caching.
        * Not a great fit if you need to:
            * Query data on a specific data value.
            * Define relationships between data values.
            * Use multiple unique keys.
    * Document based
        * Document databases store each record and its associated data within a single document.
        * They enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents.
        * Preferable for eCommerce platforms, medical records storage, CRM platforms, and analytics platforms.
        * Not a great fit if you want to:
            * Run complex search queries.
            * Perform multi-operation transactions.
    * Column based
        * Data is stored in cells grouped as columns of data instead of rows.
        * A logical grouping of columns is referred to as a column family.
        * All cells corresponding to a column are saved as a continuous disk entry, making access and search easier and faster.
        * Great for systems that require heavy write requests and for storing time-series data, weather data, and IoT data.
        * Not a great fit if you want to:
            * Run complex queries.
            * Change querying patterns frequently.
    * Graph based
        * Graph-based databases use a graphical model to represent and store data.
        * Useful for visualizing, analyzing, and finding connections between different pieces of data.
        * An excellent choice for working with connected data.
        * A great fit for:
            * Social networks
            * Product recommendations
            * Network diagrams
            * Fraud detection
            * Access management
        * Not a great fit if you want to:
            * Process high volumes of transactions.
* Advantages of NoSQL
    * Its ability to handle large volumes of structured, semi-structured, and unstructured data.
    * Its ability to run as a distributed system scaled across multiple data centers.
    * An efficient and cost-effective scale-out architecture that provides additional capacity and performance with the addition of new nodes.
    * Simpler design, better control over availability, and improved scalability, which make it agile, flexible, and supportive of quick iterations.
* Key differences between relational and non-relational databases:

![](https://i.imgur.com/w3bQXGV.png)

### 4. Data Warehouses, Data Marts, and Data Lakes

* Data mining repositories store data for:
    * Reporting
    * Analysis
    * Deriving insights
* Data Warehouses
    * Store current and historical data that has been cleansed, conformed, and categorized.
    * When data is loaded into the data warehouse, it is already modeled and structured for a specific purpose, meaning it is analysis-ready.

![](https://i.imgur.com/ZlnzwaR.png)

* A data warehouse has a 3-tier architecture:
    * The bottom tier includes the database servers, which could be relational, non-relational, or both, that extract data from different sources.
    * The middle tier consists of the OLAP server, a category of software that allows users to process and analyze information coming from multiple database servers.
    * The topmost tier is the client front-end layer. It includes all the tools and applications used for querying, reporting, and analyzing data.

![](https://i.imgur.com/QPoSmT4.png)

* Benefits of cloud-based data warehouses:
    * Lower costs
    * Limitless storage and compute capabilities
    * Scale on a pay-as-you-go basis
    * Faster disaster recovery
* Data Marts
    * A data mart is a sub-section of the data warehouse, built specifically for a particular business function, purpose, or community of users.
    * 3 types of data marts:

#### Dependent

![](https://i.imgur.com/GtPV7ne.png)

#### Independent

![](https://i.imgur.com/qGkT9wq.png)

#### Hybrid

![](https://i.imgur.com/9MXbnKs.png)

* The purpose of a data mart is to:
    * Provide users with the data that is most relevant to them, when they need it.
    * Accelerate business processes.
    * Improve end-user response time.
    * Provide secure access and control.
* Data Lakes
    * Store large amounts of structured, semi-structured, and unstructured data in their native format.
    * Data can be loaded without defining the structure and schema of the data.
    * Exist as a repository of raw data straight from the source, to be transformed based on the use case.
    * Data is classified, protected, and governed.
    * A reference architecture that combines multiple technologies.
    * Can be deployed using:
        * Cloud object storage
            * AWS S3
        * Large-scale distributed systems
            * Hadoop
        * Relational database management systems
        * NoSQL data repositories
    * Benefits:
        * Ability to store all types of data.
        * Agility to scale based on storage capacity.
        * Saving time in defining structures, schemas, and transformations.
        * Data is imported in its original format.
        * Ability to repurpose data in several different ways and for wide-ranging use cases.

### 5. ETL, ELT, and Data Pipelines

* The Extract, Transform, and Load (ETL) process is an automated process that includes:
    * Gathering raw data.
    * Extracting the information needed for reporting and analysis.
    * Cleaning, standardizing, and transforming data into a usable format.
    * Loading data into a data repository.
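The ETL steps listed above can be sketched with nothing but the Python standard library. This is only an illustrative toy, not a production pipeline: the `RAW_CSV` sample, the `orders` table, and the three function names are hypothetical, and an in-memory SQLite database stands in for the data repository.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file, API, or database.
RAW_CSV = """order_id,order_date,amount_usd
1,2023/01/05,19.99
2,2023/01/06,5.00
2,2023/01/06,5.00
3,2023/01/07,
"""


def extract(text):
    """Extract: gather raw rows from the delimited source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: standardize dates, remove duplicates, filter incomplete rows."""
    seen, clean = set(), []
    for row in rows:
        if not row["amount_usd"]:      # filter out rows with a missing amount
            continue
        if row["order_id"] in seen:    # remove duplicate records
            continue
        seen.add(row["order_id"])
        clean.append((
            int(row["order_id"]),
            row["order_date"].replace("/", "-"),  # standardize the date format
            float(row["amount_usd"]),
        ))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, order_date TEXT, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real data repository
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, order_date, amount_usd FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping each stage in its own function mirrors the separation in the notes: the transform step can be changed or tested without touching how data is extracted or loaded.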
#### ETL process:

* Extraction can be through:
    * Batch processing
        * Large chunks of data are moved from source to destination at scheduled intervals.
    * Stream processing
        * Data is pulled in real time from the source, transformed in transit, and loaded into the data repository.
* Transforming data:
    * Standardizing date formats and units of measurement.
    * Removing duplicate data.
    * Filtering out data that is not required.
    * Enriching data.
    * Establishing key relationships across tables.
    * Applying business rules and data validations.
* Loading data:
    * Loading is the transportation of processed data into a data repository. It can be:
        * Initial loading:
            * Populating all of the data in the repository.
        * Incremental loading:
            * Applying updates and modifications periodically.
        * Full refresh:
            * Erasing a data table and reloading fresh data.
    * Load verification includes checks for:
        * Missing or null values
        * Server performance
        * Load failures

#### ELT process:

* Helps process large sets of unstructured and non-relational data.
* Is ideal for data lakes.
* Advantages:
    * Shortens the cycle between extraction and delivery.
    * Allows you to ingest volumes of raw data as soon as the data becomes available.
    * Affords greater flexibility to analysts and data scientists for exploratory data analytics.
    * Transforms only the data that is required for a particular analysis, so the raw data can be leveraged for multiple use cases.
    * Is more suited to working with big data.

#### Data Pipelines:

* Encompass the entire journey of moving data from one system to another, including the ETL process.
* Can be used for both batch and streaming data.
* Support both long-running batch queries and smaller interactive queries.
* Typically load data into a data lake, but can also load data into a variety of target destinations.

### 6. Data Integration Platforms

* Data integration is a discipline comprising the practices, architectural techniques, and tools that allow organizations to ingest, transform, combine, and provision data across various data types.
* Data integration usage scenarios:
    * Data consistency across applications.
    * Master data management.
    * Data sharing between enterprises.
    * Data migration and consolidation.
* Data integration includes:
    * Accessing, queueing, or extracting data from operational systems.
    * Transforming and merging extracted data either logically or physically.
    * Data quality and governance.
    * Delivering data through an integrated approach for analytics purposes.
* Capabilities of a modern data integration platform:
    * Pre-built connectors and adapters.
    * Open-source architecture.
    * Optimization for batch processing of large-scale data, continuous data streams, or both.
    * Integration with big data sources.
    * Additional functionalities for data quality and governance, compliance, and security.
    * Portability between on-premise and different types of cloud environments.

## Big Data Platforms

### 1. Foundations of Big Data

* Big Data refers to the dynamic, large, and disparate volumes of data being created by people, tools, and machines. It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to derive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value.
* The V's of Big Data:
    * Velocity
        * Data is being generated extremely fast, in a process that never stops.
    * Volume
        * The increase in the amount of data stored.
    * Variety
        * Data comes from different sources, machines, people, and processes, both internal and external to organizations.
    * Veracity
        * The quality and origin of data, and its conformity to facts and accuracy.
    * Value
        * Our ability and need to turn data into value.

### 2. Big Data Processing Tools: Hadoop, HDFS, Hive, and Spark

* Big data processing technologies provide ways to work with large sets of structured, semi-structured, and unstructured data so that value can be derived from big data.
* Open-source technologies and the role they play in big data analytics:

#### Hadoop:

* Distributed storage and processing of large datasets across clusters of computers.
* Hadoop provides a reliable, scalable, and cost-effective solution for storing data with no format requirements.
* Benefits include:
    * Better real-time data-driven decisions:
        * Incorporates emerging data formats not traditionally used in data warehouses.
    * Improved data access and analysis:
        * Provides real-time, self-service access to stakeholders.
    * Data offload and consolidation:
        * Optimizes and streamlines costs by consolidating data, including cold data, across the organization.

#### Hadoop Distributed File System:

* The Hadoop Distributed File System, or HDFS, is a storage system for big data that runs on multiple commodity hardware machines connected through a network.
* Provides scalable and reliable big data storage by partitioning files over multiple nodes.
* Splits large files across multiple computers, allowing parallel access to them.
* Replicates file blocks on different nodes to prevent data loss, making it fault-tolerant.
* Benefits that come from using HDFS include:
    * Fast recovery from hardware failures, because HDFS is built to detect faults and automatically recover.
    * Access to streaming data, because HDFS supports high data throughput rates.
    * Accommodation of large data sets, because HDFS can scale to hundreds of nodes, or computers, in a single cluster.
    * Portability, because HDFS is portable across multiple hardware platforms and compatible with a variety of underlying operating systems.
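The two ideas behind HDFS described above, splitting a file into blocks and replicating each block on different nodes, can be illustrated with a small pure-Python simulation. This is a toy model under loud assumptions: the function names, the node names, the tiny 256-byte block size, and the round-robin placement are all made up for illustration. It does not use any Hadoop API, and real HDFS uses far larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Partition a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes, replication: int = 3):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


data = b"x" * 1000                          # hypothetical 1000-byte file
blocks = split_into_blocks(data, 256)       # toy block size
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
placement = place_replicas(len(blocks), nodes, replication=3)

for block, replicas in placement.items():
    print(block, len(blocks[block]), replicas)
print(readable_after_failure(placement, "node-a"))
```

Because every block lives on several nodes, losing any single node still leaves at least one copy of each block, which is the fault-tolerance property the notes describe.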
#### Hive

* Hive is open-source data warehouse software for reading, writing, and managing large data set files that are stored directly in either HDFS or other data storage systems such as Apache HBase.
* Is better suited for data warehousing tasks such as ETL, reporting, and data analysis.
* Provides easy access to data via SQL.

#### Spark

* A general-purpose data processing engine designed to extract and process large volumes of data for a wide range of applications, including:
    * Interactive analytics
    * Stream processing
    * Machine learning
    * Data integration
    * ETL
* Has in-memory processing, which significantly increases the speed of computations.
* Provides interfaces for major programming languages such as Java, Scala, Python, R, and SQL.
* Can run using its standalone clustering technology.
* Can also run on top of other infrastructures, such as Hadoop.
* Can access data in a large variety of data sources, including HDFS and Hive.
* Processes streaming data fast.
* Performs complex analytics in real time.
