# Index Structure of Database
Seconday file in file system
## Outline
1. Single level ordered indexes(單層索引)
2. multi-level indexes(多層索引)
3. multiple dimensional indexes(針對多個欄位建立索引)
## Introduction of index stucture
1. Index file is the **secondary file** (relational tables are primary file)
2. 建立索引的目的:加速查詢
* 如同書本的目錄是要幫助讀者找到特定的名詞概念,database建立索引的目的就是幫助user找到特定欄位的資料
* 因為索引檔通常比主檔小,搜尋時間可以比在主檔搜尋 快
4. Defined according to indexing fields(根據特定欄位建立index)
5. 以disk block而非record為單位搜尋
## Single Level ordered indexes
### Two criteria for us to divide ways of single level indexing:
1. Whether the index file is created according to indexing field (Primary file所根據排序的欄位是否是indexing field) <建索引的欄位是不是資料檔排序的欄位:是:左; 不是:右> -> **Non-dense**
2. 建索引的欄位是不是key -> **Dense**
| Types of indexing | Index field used for Physical Ordering of the file | Index field not used for Physical Ordering of the file |
| -------- | -------- | -------- |
| Index field is key | Primary Index | Secondary Index (Key) |
| Index field is not key | Clustering Index | Secondary Index (NonKey) |
* Note: Primary Index和Clustering index只會有一個會成立,因為這兩個的索引欄位和資料主檔的排序欄位相同,差別在該欄位是不是primary key,而主檔只會根據一個欄位排序,所以兩種index方式只會有一個成立
* 一個檔案可以有多個Secondary index
## Four Ways of Indexing
> 重要前提:**資料主檔要是sorted file才能建Primary index or Clustering index**,如果主檔是用Hashing or Heap file,那這兩種index method就不能用。
### Primary Index:
1. Property: Primary file is sorted accoding to primary key, which is also the indexing column. <***索引欄位是主檔排序欄位,且此欄位是Primary Key***>
2. Since we don't have **repeated value** in the column and the **column is sorted**, we only need to store the first value of each block (**Anchor record**).

3. Operation: 主檔的新增與刪除可能都會影響到Anchor Record -> Lazy Insertion and Lazy deletion
> 4. Index file的entry數:主檔的Block數
### Clustering Index:
1. Property: Primary file is sorted accoding to indexing field, but the indexing field is not the key.<索引欄位是主檔排序欄位,但此欄位不是Primary Key>
2. Indexing structure: The index file stores the **non-repeated value** and the block pointer points to **the block the first value located**.<***索引檔存non-repeated value第一次出現的block位址,因此如果公司有5個部門,index file就會有5個entry***>

3. 為何不能像Primary index存Anchor Record就好?
* 因為每個Block的第一筆Record可能是重複的
4. Porblem: 如果此時有大量資料要Insert到主檔(Let's say, 1 insert 到 block 1),那2的第一個值就會跑到block 2,index file就要跟著更動。如果是有大量資料可以reserve一個block去存該value的資料
> 5. index file的entry數:索引欄位所有可能的值的個數`
### Secondary Index(Key):
1. Property: The **indexing field is Key**, but Primary file is **not sorted by the indexing field**. <***索引欄位和主檔排序欄位不同,但索引欄位是PK***>
2. Store all the key's value in indexing file and the corresponding block pointer points to the block the key located.<記錄所有可能的值>

3. Index file的entry數:資料數
### Secondary Index(NonKey)
1. Property: The indexing field is **not Key**, but Primary file is **not sorted by the indexing field**.
2. Since there are repeated values, Secondary Index(NonKey) uses additional blocks to record block pointers. The composition is as fellows:
* **Index file**: record the non-repeated field value and the block poitners point to the block of poitners.<紀錄可能出現的值>
* **Blocks of block pointers**: record the locations of the data in the data file.<紀錄特定值所在的Block>

> 3. Index file的entry數:所有可能資料數 * 該資料的出現次數
---
## Multi-level indexing
> 我們想在索引上建立索引 -> 多層索引
### ISAM
1. Content: 針對Primary Index 建立 Primary Index
2. 因為有兩層,所有總共兩次Block I/O

### B+ Tree
1. Review B+ Tree
* B+ Tree is a M-way search tree, which is an extended and balanced binary search tree
* Time of B+ tree:
* B+ tree worst case層數(N個node,每個node有 $\dfrac{m}{2}$ 個 child): $\log_{\dfrac{M}{2}}{N}$
* Find = O($\log_{2}{\dfrac{M}{2}}$ * $\log_{\dfrac{M}{2}}{N}$) = O($\log_2N$)
* <相當於去每一層,都會access一個node/block,針對該block內的資料<$\dfrac{M}{2}$>去做binary search<$\log_{\dfrac{M}{2}}{N}$>>

2. B+ tree and indexing
* Composition:
* Non-leaf node:標兵,幫助我們找到索引
* Leaf node:真正的索引,有兩種指標:
1. 指向record所在的block位址
2. 指向下一個索引node,好處是可以做range query e.g. 列出SSN在43~68的員工資料
3. How to choose m in B+ Tree:
* m usually = disk size
---
## Why indexing can make query searching more efficient
1. 搜尋範圍縮小:After building index files, DB will search the values in index files and index files normally will be **smaller** than data file. So the searching efficiency will increase.
2. 先整理資料使資料搜尋較快速:The usage of **B+ Tree** and B+ Tree allows us to seach index more efficienly by building tree structure.
* 就像整理衣服,如果事先把衣服分成褲子一堆、裙子一堆、襯衫一堆等,在找尋襯衫時就不用去亂亂的衣服堆sequential search
---
## Multiple dimensional indexing
現在我們希望透過2個以上column一起建立index, e.g. 我們想找出年齡介於50-60之間,且薪水介於100-200之間,希望同時針對年齡以及薪水建index,不只能支援兩個column一起的range query,也希望各別column的range query能夠有效率(不要sequential)
### Composite Key
1. Content: 先針對age建index,再針對salary建index,相當於把(age, salary)看成composite key
2. 缺點:如果想針對內層column e.g. salary進行range query,就只能sequential search

### Partitioned Hash
1. Content: 如果要用n個bit去做hashing,然後要建index的欄位有2個,就把n個bit切成2份,比如說age用第一位,salary用後兩位。Hash bit的分配要根據column的值域來分。
2. 取前n位去做hash才能做range query

### Grid files
1. Content: 把search space切割成網格狀,每一格是一個block,每個點對應到一個record。如果overflow(超過blocking factor)就要再往下切
2. 切法會影響performance -> Find best split line

## kd tree
1. Intro: k-dimensional binary search tree
* 因為是兩個attribute的binary search tree,attribute會交替出現,先根據salary分,再根據age分,再根據salary分......
* 只有leaf是record index
* 如果overflow(> blocking factor),要再繼續兵分兩路分下去

## Quad(4) tree
1. 相較於Grid file一次切一刀,Quad tree一次切成四格,每一個象限都是一個矩形

## R tree
1. Intro: 2-dimensional B+ tree,但不同於B+ Tree,切出來的範圍彼此之前會有交集
2. Query
* 資料落在哪裡(e.g. age=50 & salary=10000):
* Range query:search space和query region如果有交集,就要recursively去找下去
3. 缺點
* Node的涵蓋範圍會有overlap,e.g. 資料落在R10,而R10會被R3和R4包含到,那就要去R3和R4(Sub-tree)都去搜尋,這樣就會**降低搜尋效率** <Multiple path visited from the root>
* 因此在新增資料時,如果overlap可以minimize會比較好

## Bitmap Index
1. Intro: **針對索引欄位建立bit string**, e.g. gender attribute在0和3是male,因此male的position 0和3就是1,其他position就是0
2. 搜尋:把條件互相做AND 或 OR e.g. 想要找female且L1的員工,就把01101 & 10100 = 00100
3. 適用情況:當欄位的所有可能的值很少,不然會需要建許多bit string
