Index Structure of Database

# Index Structure of Database Seconday file in file system ## Outline 1. Single level ordered indexes(單層索引) 2. multi-level indexes(多層索引) 3. multiple dimensional indexes(針對多個欄位建立索引) ## Introduction of index stucture 1. Index file is the **secondary file** (relational tables are primary file) 2. 建立索引的目的：加速查詢 * 如同書本的目錄是要幫助讀者找到特定的名詞概念，database建立索引的目的就是幫助user找到特定欄位的資料 * 因為索引檔通常比主檔小，搜尋時間可以比在主檔搜尋快 4. Defined according to indexing fields(根據特定欄位建立index) 5. 以disk block而非record為單位搜尋 ## Single Level ordered indexes ### Two criteria for us to divide ways of single level indexing: 1. Whether the index file is created according to indexing field (Primary file所根據排序的欄位是否是indexing field) <建索引的欄位是不是資料檔排序的欄位：是：左; 不是：右> -> **Non-dense** 2. 建索引的欄位是不是key -> **Dense** | Types of indexing | Index field used for Physical Ordering of the file | Index field not used for Physical Ordering of the file | | -------- | -------- | -------- | | Index field is key | Primary Index | Secondary Index (Key) | | Index field is not key | Clustering Index | Secondary Index (NonKey) | * Note: Primary Index和Clustering index只會有一個會成立，因為這兩個的索引欄位和資料主檔的排序欄位相同，差別在該欄位是不是primary key，而主檔只會根據一個欄位排序，所以兩種index方式只會有一個成立 * 一個檔案可以有多個Secondary index ## Four Ways of Indexing > 重要前提：**資料主檔要是sorted file才能建Primary index or Clustering index**，如果主檔是用Hashing or Heap file，那這兩種index method就不能用。 ### Primary Index: 1. Property: Primary file is sorted accoding to primary key, which is also the indexing column. <***索引欄位是主檔排序欄位，且此欄位是Primary Key***> 2. Since we don't have **repeated value** in the column and the **column is sorted**, we only need to store the first value of each block (**Anchor record**). ![](https://hackmd.io/_uploads/HJlFuz0Uh.png) 3. Operation: 主檔的新增與刪除可能都會影響到Anchor Record -> Lazy Insertion and Lazy deletion > 4. Index file的entry數：主檔的Block數 ### Clustering Index: 1. Property: Primary file is sorted accoding to indexing field, but the indexing field is not the key.<索引欄位是主檔排序欄位，但此欄位不是Primary Key> 2. Indexing structure: The index file stores the **non-repeated value** and the block pointer points to **the block the first value located**.<***索引檔存non-repeated value第一次出現的block位址，因此如果公司有5個部門，index file就會有5個entry***> ![](https://hackmd.io/_uploads/Syrq5z0Ln.png) 3. 為何不能像Primary index存Anchor Record就好? * 因為每個Block的第一筆Record可能是重複的 4. Porblem: 如果此時有大量資料要Insert到主檔(Let's say, 1 insert 到 block 1)，那2的第一個值就會跑到block 2，index file就要跟著更動。如果是有大量資料可以reserve一個block去存該value的資料 > 5. index file的entry數：索引欄位所有可能的值的個數` ### Secondary Index(Key): 1. Property: The **indexing field is Key**, but Primary file is **not sorted by the indexing field**. <***索引欄位和主檔排序欄位不同，但索引欄位是PK***> 2. Store all the key's value in indexing file and the corresponding block pointer points to the block the key located.<記錄所有可能的值> ![](https://hackmd.io/_uploads/BkmIRzRUh.png) 3. Index file的entry數：資料數 ### Secondary Index(NonKey) 1. Property: The indexing field is **not Key**, but Primary file is **not sorted by the indexing field**. 2. Since there are repeated values, Secondary Index(NonKey) uses additional blocks to record block pointers. The composition is as fellows: * **Index file**: record the non-repeated field value and the block poitners point to the block of poitners.<紀錄可能出現的值> * **Blocks of block pointers**: record the locations of the data in the data file.<紀錄特定值所在的Block> ![](https://hackmd.io/_uploads/S1SsAzRLn.png) > 3. Index file的entry數：所有可能資料數 * 該資料的出現次數 --- ## Multi-level indexing > 我們想在索引上建立索引 -> 多層索引 ### ISAM 1. Content: 針對Primary Index 建立 Primary Index 2. 因為有兩層，所有總共兩次Block I/O ![](https://hackmd.io/_uploads/ryGkfXRL2.png) ### B+ Tree 1. Review B+ Tree * B+ Tree is a M-way search tree, which is an extended and balanced binary search tree * Time of B+ tree: * B+ tree worst case層數(N個node，每個node有 $\dfrac{m}{2}$ 個 child): $\log_{\dfrac{M}{2}}{N}$ * Find = O($\log_{2}{\dfrac{M}{2}}$ * $\log_{\dfrac{M}{2}}{N}$) = O($\log_2N$) * <相當於去每一層，都會access一個node/block，針對該block內的資料<$\dfrac{M}{2}$>去做binary search<$\log_{\dfrac{M}{2}}{N}$>> ![](https://hackmd.io/_uploads/BkHMObkvh.png) 2. B+ tree and indexing * Composition: * Non-leaf node:標兵，幫助我們找到索引 * Leaf node：真正的索引，有兩種指標： 1. 指向record所在的block位址 2. 指向下一個索引node，好處是可以做range query e.g. 列出SSN在43~68的員工資料 3. How to choose m in B+ Tree: * m usually = disk size --- ## Why indexing can make query searching more efficient 1. 搜尋範圍縮小：After building index files, DB will search the values in index files and index files normally will be **smaller** than data file. So the searching efficiency will increase. 2. 先整理資料使資料搜尋較快速：The usage of **B+ Tree** and B+ Tree allows us to seach index more efficienly by building tree structure. * 就像整理衣服，如果事先把衣服分成褲子一堆、裙子一堆、襯衫一堆等，在找尋襯衫時就不用去亂亂的衣服堆sequential search --- ## Multiple dimensional indexing 現在我們希望透過2個以上column一起建立index, e.g. 我們想找出年齡介於50-60之間，且薪水介於100-200之間，希望同時針對年齡以及薪水建index，不只能支援兩個column一起的range query，也希望各別column的range query能夠有效率(不要sequential) ### Composite Key 1. Content: 先針對age建index，再針對salary建index，相當於把(age, salary)看成composite key 2. 缺點：如果想針對內層column e.g. salary進行range query，就只能sequential search ![](https://hackmd.io/_uploads/Skc6qz1v3.png) ### Partitioned Hash 1. Content: 如果要用n個bit去做hashing，然後要建index的欄位有2個，就把n個bit切成2份，比如說age用第一位，salary用後兩位。Hash bit的分配要根據column的值域來分。 2. 取前n位去做hash才能做range query ![](https://hackmd.io/_uploads/B18RXXyvn.png) ### Grid files 1. Content: 把search space切割成網格狀，每一格是一個block，每個點對應到一個record。如果overflow(超過blocking factor)就要再往下切 2. 切法會影響performance -> Find best split line ![](https://hackmd.io/_uploads/Hyw4rXJvh.png) ## kd tree 1. Intro: k-dimensional binary search tree * 因為是兩個attribute的binary search tree，attribute會交替出現，先根據salary分，再根據age分，再根據salary分...... * 只有leaf是record index * 如果overflow(> blocking factor)，要再繼續兵分兩路分下去 ![](https://hackmd.io/_uploads/H1q3vm1w3.png) ## Quad(4) tree 1. 相較於Grid file一次切一刀，Quad tree一次切成四格，每一個象限都是一個矩形 ![](https://hackmd.io/_uploads/rytftXJD3.png) ## R tree 1. Intro: 2-dimensional B+ tree，但不同於B+ Tree，切出來的範圍彼此之前會有交集 2. Query * 資料落在哪裡(e.g. age=50 & salary=10000): * Range query：search space和query region如果有交集，就要recursively去找下去 3. 缺點 * Node的涵蓋範圍會有overlap，e.g. 資料落在R10，而R10會被R3和R4包含到，那就要去R3和R4(Sub-tree)都去搜尋，這樣就會**降低搜尋效率** <Multiple path visited from the root> * 因此在新增資料時，如果overlap可以minimize會比較好 ![](https://hackmd.io/_uploads/rJkih4JDh.png) ## Bitmap Index 1. Intro: **針對索引欄位建立bit string**, e.g. gender attribute在0和3是male，因此male的position 0和3就是1，其他position就是0 2. 搜尋：把條件互相做AND 或 OR e.g. 想要找female且L1的員工，就把01101 & 10100 = 00100 3. 適用情況：當欄位的所有可能的值很少，不然會需要建許多bit string ![](https://hackmd.io/_uploads/SJlT041D3.png)