
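# Data Modeling Midterm

## Data Modeling Introduction

### Data Modeling Definition

* Data modeling is the process of creating a conceptual view of the information a database contains, or should contain, when that database is being built
* A data model is a simplified abstraction of reality

### Context of DMs

![截圖 2023-11-03 下午2.12.40.png](https://hackmd.io/_uploads/SyhukzG7T.png)

### Data Modeling Process

1. Identify the entities: each entity should be cohesive and logically separate from the others
2. Identify the key attributes of each entity: for example, a customer may have a name, a phone number, and a title
3. Identify the relationships between entities: for example, a customer "lives at" an address; these are usually expressed in the unified modeling language (UML)
4. Map attributes to entities: this ensures the model reflects how the business uses its data
5. Assign keys as needed and decide the degree of normalization: normalization reduces redundancy, but at the cost of system performance
6. Confirm and validate the data model: data modeling is an iterative process that is repeated and refined as business requirements change

### DMs Challenges

1. Lack of organizational commitment and business support
2. Lack of understanding among business users
3. Complex models built without proper planning of the architecture
4. Business requirements that are not fully understood or defined

## ER Diagram 1

### Relation Model

* Used to represent the relationships between pieces of data

### Relation Model: Data Structure

* Relation:
    * A table
* Tuple:
    * The horizontal data in a table, that is, a row (this course calls it a column, the reverse of standard usage)
* Attribute:
    * The vertical data in a table, that is, a column (this course calls it a row); each attribute has a name

![截圖 2023-11-03 下午3.54.59.png](https://hackmd.io/_uploads/H1mDv7G7p.png)

* Primary Key:
    * The unique id of a tuple
* Domain:
    * The range of values an attribute may take
* Quantity:
    * The values of a tuple
* Table:
    * Used to describe a relation
* Candidate key
    * If every tuple has a unique value for an attribute, that attribute can serve as a candidate key. For example, if every Name in the table above is different, the Name column can also be a candidate key. A candidate key must be both unique and minimal
* All-key:
    * All attributes taken together form the primary key
* Prime attribute
    * An attribute chosen as part of a candidate key; any other attribute is a non-prime attribute

### Level of Database

* Database levels, from lowest to highest:
    * Physical / Internal
    * Conceptual
    * External

![image.png](https://hackmd.io/_uploads/BkD2Z4zmp.png)

### Data Independence

* A DBMS capability that keeps "application development and maintenance" independent of "changes to the underlying database". There are two types:
    1. Physical Data Independence: when the physical structure of the database changes, applications can still operate on the database without being affected. For example
        * Migrating from a hard disk drive (HDD) to a solid-state drive (SSD)
        * Moving the database from drive C to drive D
        * Switching to a different data structure
    2. Logical Data Independence: when the logical structure of the database changes, applications can still operate on the database through the API without being affected. For example
        * Adding, deleting, or modifying an attribute
        * Merging two records into one
        * Splitting one record into several
* Differences between the two:
    * Physical Data Independence is relatively easy to achieve and relatively easy to retrieve
    * Changes under Physical Data Independence do not affect the application program level; under Logical Data Independence, adding or deleting a field requires adjusting the application program level
    * Physical Data Independence concerns the internal schema; Logical Data Independence concerns the conceptual schema

### Integrity rule of Relational Model

[Database integrity rules](https://www.mysql.tw/2017/04/blog-post.html)

1. Entity Integrity Rule
   Within a single table, the primary key must be unique and must not be NULL.
2. Referential Integrity Rule
   Across two tables, every foreign key (FK) value in the secondary table must exist among the primary key (PK) values of the primary table.
3. Domain Integrity Rule
   Within a single table, all values in the same column must have the same data type.

## ER Diagram 2

### Type of Domain

Domain: The value range of attribute

* Primitive
    * The value cannot be derived from the values of other attributes
    * Example: a customer's "customer name"
* Derived
    * The value is derived from other attributes
    * Example: a student's "grade average"
* Calculated
    * The attribute exists to serve a business need or to simplify a process
    * Example: a customer's "customer number"

### Identifying Relationship

* An Identifying Relationship means the Child Entity can only exist in dependence on the Parent Entity; typically the Child Entity's Foreign Key is the Parent Entity's Primary Key
    * For example, the Primary Key of Entity PERSON is person_id, and the Foreign Key of Entity PHONE is the owner's person_id; once the owner no longer exists, there are no phone numbers belonging to that person
* If two entities have no such dependence, the relationship is a Non-identifying Relationship

### Degree

How many **entities** are involved in a relationship.

## Data Warehouse

### Why a Data Warehouse Is Needed

* Transaction processing and analysis processing need databases with different properties
    * Transaction processing needs to fetch data at high frequency and operate on it within a short time
    * A Decision Support System runs for a long time and consumes a large amount of system resources

### Problem of Traditional DB

* Traditional databases are designed for business operations, for example simple data processing such as queries, statistics, and reports; they are not suited to data analysis

![image.png](https://hackmd.io/_uploads/Bk0xcoDXT.png)

Data -> Information -> Knowledge

### OLTP is not suitable for DSS

OLTP (Online Transaction Processing): processing transaction data in real time over a network and a database

* Problem of data integration: data from different sources and in different formats must be processed and then integrated
* The data is dynamic
* There is no historical data
* OLTP is only suitable for simple decisions and queries

### Extraction Program

* Searches files or databases, pulls out the data that meets specific criteria, and transfers that data to a database
* Advantages:
    1. It does not interfere with the data analysis process
    2. Users get data that is already organized and is what they need

![image.png](https://hackmd.io/_uploads/BJc6pswQ6.png)

### Weakness of Traditional Database

* Information consumers: data arrives too slowly, so the business value has already been lost
* Information providers: data obtained with great effort gets rejected

### The needs of DW

* Data is extracted from the transaction environment
* Analytical processing needs to query large amounts of historical data to support complex operations
* The value of a DW
    * Information Processing: supports simple queries, statistical analysis, and charting
    * Analytical Processing: provides multidimensional data for analysis
    * Data Mining: uncovers the knowledge hidden in the data

### Four Characteristics of DW

1. Subject oriented: a DW is usually organized around subjects, so that it supports a specific subject or business need
2. Integrated:
    * A DW integrates data from different sources into a single, unified environment for storage and querying
    * Integration approach
        * Unification: removes inconsistencies and ensures the data format is consistent
        * Synthesis: combines or computes new data from the raw data for later analysis
3. Non-volatile: data stored in a DW cannot be updated, which helps preserve the integrity of historical data, so a DW is usually read-only

    ![image.png](https://hackmd.io/_uploads/r1ap1gtma.png)
4. Time-variant: data is added to and removed from the DW over time, and the data carries a time dimension

### Data in DW

* Four level:

![image.png](https://hackmd.io/_uploads/H1D8ugFQT.png)

Current Details is the recent business data; it is usually the part DW users are most interested in, and it has the largest data volume

### DW Structure

![image.png](https://hackmd.io/_uploads/rJCy5xFXa.png)

### Data Mart

* A Data Mart is a subset of a DW that stores and analyzes data for a specific department

![image.png](https://hackmd.io/_uploads/BJezheY7p.png)

* Dependent vs. Non-dependent

![image.png](https://hackmd.io/_uploads/Bkv03lYXa.png)

* A dependent Data Mart is the more sound and reliable option: when the DW is updated, the Data Mart's data is updated with it

### Reasons for Building a Data Mart

* Building a DW is usually an expensive, time-consuming, and relatively risky project
* Some businesses therefore choose to start with a Data Mart, which is small in scale, low in cost, and quick to develop

### Data Granularity

Granularity <-> degree of refinement

* The larger the granularity, the finer the data can be understood to be: there are more grains, which means more detail
* The smaller the granularity, the harder it is to see detail, because the data has already been processed or consolidated, for example rolled up into monthly or quarterly reports
* Treat the two points above with skepticism. Some references, such as Inmon's data warehouse texts, word this the other way around (more detail means a lower level of granularity), so check which convention the course uses

![image.png](https://hackmd.io/_uploads/rJNk6EFm6.png)

lower level, less detail

### Data Segmentation

* Splits the data into several units that operate independently, improving the efficiency of data processing
* Methods:
    * Split by time
    * Split by business type
    * Split by geographic location
* Benefit: once split into small blocks, the data can be accessed faster and more easily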
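
The three segmentation methods listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `SALES` records, their field layout, and the `segment` helper are hypothetical and do not come from the course material. A real DW would normally rely on the partitioning features of its database engine rather than in-memory dictionaries.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales records: (sale_date, region, business_type, amount).
SALES = [
    (date(2023, 1, 5), "north", "retail", 120.0),
    (date(2023, 1, 20), "south", "wholesale", 300.0),
    (date(2023, 2, 3), "north", "retail", 80.0),
    (date(2023, 2, 14), "south", "retail", 50.0),
]


def segment(records, key_fn):
    """Split records into independent segments keyed by key_fn(record)."""
    segments = defaultdict(list)
    for record in records:
        segments[key_fn(record)].append(record)
    return dict(segments)


# Split by time (year, month), by geographic location, and by business type.
by_month = segment(SALES, lambda r: (r[0].year, r[0].month))
by_region = segment(SALES, lambda r: r[1])
by_business_type = segment(SALES, lambda r: r[2])

# A query about February only has to scan the February segment.
february_total = sum(r[3] for r in by_month[(2023, 2)])
print(february_total)
```

The point of the sketch is the last step: once the data is split, a query scoped to one month, region, or business type reads only the matching block instead of the whole data set.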