W1
====
### Overview
> The role that Data Engineering plays within a Data Ecosystem team, and the technologies used in that work.
## Modern Data Ecosystem and role of Data Engineering
### 1. Welcome to Introduction to Data Engineering
* The value of data comes mainly from two things:
    * The accuracy of the data
    * The efficiency of access to the data
* This course covers the following topics:
    * Data
    * Data Repositories
    * Data Pipelines
    * Data Integration Platforms
    * Big Data
    * The Architecture of Data Platform
    * The Design Considerations for Data Stores
    * ETL and ELT Process
    * Data Security
    * Data Privacy
    * Governance and Compliance
### 2. Modern Data Ecosystem
* A modern data ecosystem is a network formed mainly of interconnected, independent, and continually evolving data.
* An Enterprise Data Environment mainly includes the following:
    * Data integrated from disparate sources.
    * Different types of analysis and skills to generate insights.
    * Active stakeholders to collaborate and act on insights generated.
    * Tools, applications, and infrastructure to store, process, and disseminate data as required.
* The following parts explain the relationship between Data Sources, the Enterprise Data Environment, and Users, and look at each of them in more depth:
* Data Sources
    * Data can broadly be divided into structured and unstructured datasets.
    * Any of the following kinds of data can be a data source:
        * Images
        * Videos
        * Clickstreams
        * User conversations
        * Social media platforms
        * Internet of Things (IoT) devices
        * Real-time events
        * Data sources from data providers and agencies.
    * Today's data sources are diverse and dynamic.
    * Working with data from such diverse sources:
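* Enterprise Data Environment
    * Raw data needs to be organized, cleaned, and optimized for downstream users.
    * Raw data needs to conform to the standards set by the organization.
    * This stage involves data management and data repositories that provide high availability, flexibility, accessibility, and security.
* Users
    * Interfaces, APIs, or applications are provided so that end users can access the data they need.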
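
The flow described above (raw data from different sources is organized, cleaned, and brought to an organizational standard before users access it) can be sketched roughly as follows. This is only an illustrative sketch: the two sample sources, the field names, the target schema, and the `standardize_*` and `get_user` helpers are all hypothetical and not part of the course material.

```python
import csv
import io
import json

# Hypothetical raw inputs: a CSV export and a JSON event feed (both invented).
CSV_SOURCE = "user_id,signup_date\n1,2021-03-01\n2,2021-03-05\n"
JSON_SOURCE = '[{"uid": "3", "signedUp": "2021/03/07"}]'


def standardize_csv(text):
    """Map CSV rows onto the assumed standard schema:
    {'user_id': int, 'signup_date': 'YYYY-MM-DD'}."""
    return [
        {"user_id": int(row["user_id"]), "signup_date": row["signup_date"]}
        for row in csv.DictReader(io.StringIO(text))
    ]


def standardize_json(text):
    """Map JSON items onto the same schema, renaming fields and fixing the date separator."""
    return [
        {"user_id": int(item["uid"]), "signup_date": item["signedUp"].replace("/", "-")}
        for item in json.loads(text)
    ]


def build_repository():
    """Integrate both sources into one list that plays the role of a data repository."""
    return standardize_csv(CSV_SOURCE) + standardize_json(JSON_SOURCE)


def get_user(repository, user_id):
    """Stand-in for the interface or API through which end users read data."""
    for record in repository:
        if record["user_id"] == user_id:
            return record
    return None


repository = build_repository()
```

Calling `get_user(repository, 3)` returns the record that came from the JSON source, already in the same shape as the CSV records, which is the point of standardizing before exposing data to users.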
### 3. Key Players in the Data Ecosystem
* Getting value from data requires different roles, each using a variety of technologies to do its own part.
* The work of each role is discussed below:
#### Data Engineers
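* Main responsibilities of Data Engineers:
    * Extract, integrate, and organize data from different sources.
    * Clean, transform, and prepare data.
    * Design, store, and manage data in data repositories.
* Main skills of Data Engineers:
    * Good programming ability.
    * Sound knowledge of systems architecture.
    * In-depth understanding of relational and non-relational data stores.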
#### Data Analysts
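* Main responsibilities of Data Analysts:
    * Inspect and clean data.
    * Apply statistical methods to analyze and mine data.
    * Visualize data and present the results of the analysis.
* Main skills of Data Analysts:
    * Ability to build charts and dashboards using spreadsheets, database queries, and statistical tools.
    * Some programming ability.
    * Strong analytical and storytelling skills.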
#### Data Scientists
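* Main responsibilities of Data Scientists:
    * Build machine learning models that predict the future by analyzing historical data.
* Main skills of Data Scientists:
    * Strong mathematics and statistics skills.
    * Programming ability.
    * Ability to access databases.
    * Strong data modeling skills.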
### 4. What is Data Engineering?
* The work of a data engineer mainly involves:
    * Collecting data -> Processing data -> Storing data -> Making data available
#### Collecting source data
* Extracting, integrating, and organizing data from disparate sources.
* Data acquisition from multiple sources.
* Data architecture for storing source data.
#### Processing data
* Cleaning, transforming and preparing data to make it usable.
* Distributed systems for processing data.
* Pipelines for extracting, transforming and loading data.
* Solutions for safeguarding quality, privacy and security of data.
* Performance optimization.
* Adherence to compliance guidelines.
#### Storing data
* Storing data for reliability and easy availability of data.
* Data stores for storage of processed data.
* Scalable systems.
* Ensuring data privacy, security, compliance, monitoring, backup and recovery.
#### Making data available to users securely
* Making data available to users securely.
* APIs, services, and programs for retrieving data for end-users.
* User access through interfaces and dashboards.
* Checks and balances to ensure data security.
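
The four stages above (collect, process, store, make available) can be sketched end to end in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the sample records, the `payments` table, and the function names are hypothetical, and an in-memory SQLite database stands in for a real data store.

```python
import sqlite3

# Hypothetical raw records, standing in for data acquired from multiple sources.
RAW_RECORDS = [
    {"name": " Alice ", "amount": "10.5"},
    {"name": "BOB", "amount": "7"},
    {"name": "", "amount": "3"},          # missing name: dropped during cleaning
    {"name": "carol", "amount": "n/a"},   # unparseable amount: dropped during cleaning
]


def collect():
    """Collect: return a copy of the source data."""
    return list(RAW_RECORDS)


def process(records):
    """Process: clean and transform records so they are usable."""
    cleaned = []
    for record in records:
        name = record["name"].strip().lower()
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if name:
            cleaned.append((name, amount))
    return cleaned


def store(conn, rows):
    """Store: load the processed rows into a table (assumed schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments (name, amount) VALUES (?, ?)", rows)
    conn.commit()


def total_for(conn, name):
    """Serve: a small retrieval function; the query is parameterized
    so that user input is never spliced into the SQL string."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM payments WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
store(conn, process(collect()))
```

Each function maps to one stage, so a stage can be swapped out (for example, replacing SQLite with another data store) without touching the others.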
## Responsibilities and Skillsets of a Data Engineer
### 1. Responsibilities and Skillsets of a Data Engineer
* Broadly speaking, the work of a data engineer includes:
    * Extract, organize, and integrate data from disparate sources.
    * Prepare data for analysis and reporting by transforming and cleansing it.
    * Design and manage data pipelines that encompass the journey of data from source to destination systems.
    * Set up and manage the infrastructure required for the ingestion, processing, and storage of data:
        * Data Platforms
        * Data Stores
        * Distributed Systems
        * Data Repositories
* Technical Skills:
#### Operating Systems:
* UNIX
* LINUX
* System Utilities and Commands
#### Infrastructure Components:
* Virtual Machines
* Networking
* Application Services
* Cloud-based Services
#### Databases and Data Warehouses:
* RDBMS
    * MySQL
    * PostgreSQL
* NoSQL
    * Redis
    * MongoDB
    * Cassandra
* Data Warehouses
    * GCP BigQuery
    * Amazon Redshift
#### Data Pipelines:
* Apache Beam
* Airflow
* Dataflow
#### ETL Tools:
* AWS Glue
* GCP Data Fusion
#### Languages:
* Query languages:
    * SQL
    * Query languages for NoSQL databases
* Programming languages:
    * Python
    * R
    * Java
* Shell and Scripting languages:
    * Unix/Linux Shell
#### Big Data Processing Tools
* Hadoop
* Hive
* Spark
* Functional Skills:
    * Convert business requirements into technical specifications.
    * Work with the complete software development lifecycle:
        * Ideation -> Architecture -> Design -> Prototyping -> Testing -> Deployment
    * Understand data's potential application in business.
    * Understand risks of poor data management.
* Soft Skills:
    * Interpersonal Skills
    * Teamwork
    * Collaboration
    * Effective Communication
* Data engineering requires a broad set of skillsets.
* You need to pick one or more areas of specialization while keeping a good understanding of all areas, so you can make more informed decisions.
* Your skills will grow over time with experience, the areas you choose to focus on, and the time you invest in upskilling yourself.
<br>