# Data Structures and Algorithms (DSA)
[TOC]
<br>
Supplementary articles:
- [用JAVA學資料結構與演算法](https://hackmd.io/@Aquamay/H1nxBOLcO/https%3A%2F%2Fhackmd.io%2F%40Aquamay%2Frk1C8ni5d)
- [演算法&資料結構學習筆記](https://medium.com/@amber.fragments/演算法-學習筆記-0-前言-8be08a00600)
---
## Part 1. Algorithm
### 1.1 Good Pseudo Code should be ...
- **modular**
- depends on **speaker/listener**
- usually **no formal definition**
Here is an example :
```c++=
// SELECTION-SORT(A)
for i=1 to A.length
m = GET-MIN-INDEX(A, i, A.length)
SWAP(A[i], A[m])
return A // which has been sorted in place
// GET-MIN-INDEX(A,i,r)
m = i // store current min.index
for i=i+1 to r
// update if i-th element smaller
if A[m] > A[i]
m = i
return m
```
In the example above, there is no need to spell out the details of SWAP.
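As a sanity check, the pseudocode above can be turned into runnable code; the function names (`getMinIndex`, `selectionSort`) and the 0-based indexing are my own choices, not part of the pseudocode:

```cpp
#include <vector>
#include <algorithm>

// index of the minimum element in A[l..r] (0-based, inclusive)
int getMinIndex(const std::vector<int>& A, int l, int r) {
    int m = l;                        // store current min index
    for (int i = l + 1; i <= r; ++i)
        if (A[i] < A[m]) m = i;       // update if i-th element smaller
    return m;
}

void selectionSort(std::vector<int>& A) {
    for (int i = 0; i < (int)A.size(); ++i) {
        int m = getMinIndex(A, i, (int)A.size() - 1);
        std::swap(A[i], A[m]);        // SWAP left to the library
    }
}
```

Note how SWAP stays a one-line library call, just as the pseudocode leaves it unspecified.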
---
### 1.2 Criteria of Algorithm
- **input**
- **definiteness**
clear instructions (humans can understand them)
- **effectiveness**
feasible instructions (the computer can carry them out)
- **finiteness**
completable instructions (they terminate)
- **output**
output (**correctness**) : needs proving with respect to requirements
---
### 1.3 How to Write a Proof?
- #### Claim
- **Correctness of GET-MIN-INDEX**
Upon exiting GET-MIN-INDEX(A, 1, n),
$A[m] = \underset{1 \leq j \leq n}{\min} A[j]$ with n = A.length
- #### Proof of Loop Invariant => correctness claim of algorithm
- **invariant within GET-MIN-INDEX**
Upon finishing the loop with i = k,
denote m by $m_k$
$A[m_k] \le A[j]$ for $j = 1, 2, \dots, k$
(loop) invariant: property that algorithm maintains
- **Mathematical Induction**
- Base
when i = 2, invariant true because...
- Assume
assume invariant true for i = t-1;
when i = t, ...
- Proof
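The outline above can be filled in as follows (a sketch; it follows the loop of GET-MIN-INDEX in 1.1, which starts from the second element):

```latex
\textbf{Base} ($k = 2$): after the first iteration, $m_2$ holds the index of
$\min(A[1], A[2])$, so $A[m_2] \le A[j]$ for $j = 1, 2$.

\textbf{Step}: assume $A[m_{t-1}] \le A[j]$ for $j = 1, \dots, t-1$.
In iteration $i = t$, the loop sets $m_t = t$ only if $A[t] < A[m_{t-1}]$,
and keeps $m_t = m_{t-1}$ otherwise; in both cases
$A[m_t] \le A[j]$ for $j = 1, \dots, t$.

\textbf{Conclusion}: with $k = n$, $A[m_n] = \min_{1 \le j \le n} A[j]$,
which is the correctness claim.
```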
---
## Part 2. Data Structure
Good Algorithm Needs Proper Data Structure
### 2.1 Data Structure Needs Accessing Algorithms
A good data structure needs methods to put data in and take it out.
- #### GET
- GET-BY-INDEX : for arrays
- GET-NEXT : for sequential access
- GET(key) : for search
generally assume to **read without deleting**
- #### INSERT
- INSERT-BY-INDEX : for arrays
- INSERT-AFTER : for sequential access
- INSERT(key, data)
generally assume to **add without overriding**
rule of thumb : often-<span style="color:red">GET</span> $\iff$ <span style="color:blue">place</span> "nearby" (frequently used items are placed nearby)
---
### 2.2 Data Structure Needs Maintenance Algorithms
- #### CONSTRUCT
- usually possible with multiple <span style="color:red">INSERT</span>
- sometimes possible to be <span style="color:green">faster</span> than so
- #### REMOVE
- often viewed as deleting <span style="color:blue">after GET</span>
- $\sim$ <span style="color:blueviolet">UN</span><span style="color:red">INSERT</span> : often harder than <span style="color:red">INSERT</span>
- #### UPDATE
- usually possible with <span style="color:blueviolet">REMOVE</span> + <span style="color:red">INSERT</span>
- can be viewed as <span style="color:red">INSERT</span> <span style="color:orange">with overriding</span>
hidden cost of data structure : maintenance effort (esp. <span style="color:blueviolet">REMOVE</span> & <span style="color:orange">UPDATE</span>)
---
### 2.3 INSERT of Ordered Array
#### SWAP Version
```c++=
// INSERT(A, data)
n = A.length
A[n+1] = data // put in the back
for i=n downto 1
if A[i+1] < A[i]
SWAP(A[i], A[i+1]) // cut in consecutive
else
return
```
#### Direct Cut-in Version
```c++=
// INSERT(A, data)
key = data
i = A.length
while i > 0 and A[i] > key
A[i+1] = A[i]
i = i - 1
A[i+1] = key
return
```
INSERT : <span style="color:red">cut in</span> from <span style="color:blue">back</span>
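The direct cut-in version above can be sketched in real code; using `std::vector` and 0-based indices is my own choice here:

```cpp
#include <vector>

// INSERT `key` into a sorted vector, shifting larger elements right
// (the "direct cut-in" version: cut in from the back)
void orderedInsert(std::vector<int>& A, int key) {
    A.push_back(key);                 // grow by one slot at the back
    int i = (int)A.size() - 2;        // last element before the new slot
    while (i >= 0 && A[i] > key) {
        A[i + 1] = A[i];              // shift right
        --i;
    }
    A[i + 1] = key;                   // cut in
}
```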
---
### 2.4 CONSTRUCT of Ordered Array
#### SELECTION-SORT
```c++=
// SELECTION-SORT(A)
for i=1 to A.length
m = GET-MIN-INDEX(A,i,A.length)
SWAP(A[i],A[m])
return A
// GET-MIN-INDEX(A,l,r)
m = l // store current min.index
for i=l+1 to r
// update if i-th element smaller
if A[m] > A[i]
m = i
return m
```
#### INSERTION-SORT
```c++=
// INSERTION-SORT(A)
for i=1 to A.length
INSERT(A,i)
return A
// INSERT(A,m)
key = A[m]
i = m-1
while i > 0 and A[i] > key
A[i+1] = A[i]
i = i - 1
A[i+1] = key
```
INSERTION-SORT : CONSTRUCT with multiple <span style="color:red">INSERT</span>
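A runnable version of INSERTION-SORT, matching the pseudocode above but with 0-based indices (my choice):

```cpp
#include <vector>

// CONSTRUCT an ordered array with repeated INSERT (insertion sort)
void insertionSort(std::vector<int>& A) {
    for (int m = 1; m < (int)A.size(); ++m) {
        int key = A[m];
        int i = m - 1;
        while (i >= 0 && A[i] > key) {
            A[i + 1] = A[i];   // shift right to make room
            --i;
        }
        A[i + 1] = key;        // cut in
    }
}
```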
---
### 2.5 REMOVE and UPDATE of Ordered Array
#### REMOVE
```c++=
// REMOVE(A,m)
i = m + 1
while i < A.length
A[i-1] = A[i] // fill in
i = i + 1
A.length = A.length - 1
```
#### UPDATE
```c++=
// UPDATE(A,m,data)
key = data
i = m
if A[i] > key // insert to front
i = i - 1
while i > 0 and A[i] > key
A[i+1] = A[i]
i = i - 1
A[i+1] = key
else // insert to back
i = i + 1
while i < A.length and A[i] < key
A[i-1] = A[i]
i = i + 1
A[i-1] = key
```
ordered array : more <span style="color:blueviolet">maintenance efforts</span> $\implies$ faster GET?
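REMOVE by fill-in can likewise be sketched in code (0-based index `m` is my convention):

```cpp
#include <vector>

// REMOVE the element at index m by filling in from the right
void orderedRemove(std::vector<int>& A, int m) {
    for (int i = m + 1; i < (int)A.size(); ++i)
        A[i - 1] = A[i];       // fill in
    A.pop_back();              // shrink length by one
}
```

Each removal still moves O(n) elements: the maintenance cost that the ordered array pays for its fast GET.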
---
### 2.6 Sequential Search Algorithm for Any Array
<span style="color:blue">SEQ-SEARCH</span> : structurally similar to <span style="color:red">GET-MIN-INDEX</span>
Review of GET-MIN-INDEX :
```c++=
// GET-MIN-INDEX(A,l,r)
m = l // store current min.index
for i=l+1 to r
// update if i-th element smaller
if A[m] > A[i]
m = i
return m
```
```c++=
// SEQ-SEARCH(A,key,l,r)
for i = l to r
// return when found
if A[i] equals key
return i
return NIL
```
---
### 2.7 Ordered Array : Sequential Search with Shortcut
ordered : possibly <span style="color:blueviolet">easier</span> to declare <span style="color:orange">NIL</span>
```c++=
// SEQ-SEARCH-SHORTCUT(A,key,l,r)
for i = l to r
// return when found
if A[i] equals key
return i
else if A[i] > key
return NIL
return NIL
```
---
### 2.8 Ordered Array : Binary Search Algorithm
```c++=
// BIN-SEARCH(A,key,l,r)
while l <= r
m = floor((l+r)/2)
if A[m] equals key
return m
else if A[m] > key
r = m - 1 // cut out end
else if A[m] < key
l = m + 1 // cut out begin
return NIL
```
<span style="color:red">BIN-SEARCH</span> : multiple shortcuts by <span style="color:red">quickly checking the middle</span>
#### Binary Search in Open Source
Using java.util.Arrays,
```java=
private static int binarySearch(int[] a, int key) {
    int low = 0;
    int high = a.length - 1;
    while (low <= high) {
        int mid = (low + high) >>> 1;
        int midVal = a[mid];
        if (midVal < key)
            low = mid + 1;
        else if (midVal > key)
            high = mid - 1;
        else
            return mid; // key found
    }
    return -(low + 1); // key not found
}
```
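A C++ counterpart of BIN-SEARCH, returning -1 in place of NIL (the function name and return convention are my own):

```cpp
#include <vector>

// binary search over sorted A; returns the index of key, or -1 if absent
int binSearch(const std::vector<int>& A, int key) {
    int l = 0, r = (int)A.size() - 1;
    while (l <= r) {
        int m = l + (r - l) / 2;   // avoids overflow of l + r
        if (A[m] == key) return m;
        if (A[m] > key) r = m - 1; // cut out end
        else l = m + 1;            // cut out begin
    }
    return -1;                     // NIL
}
```

The `l + (r - l) / 2` midpoint avoids the signed-overflow pitfall that the Java version sidesteps with `>>>`.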
---
### 2.9 Why Data Structures and Algorithms?
good <span style="color:blueviolet">program</span> : proper use of <span style="color:blueviolet">resources</span>
- #### Space Resources
- memory
- disk(s)
- transmission bandwidth
usually cared by <span style="color:red">data structure</span>
- #### Computation Resources
- CPU(s)
- GPU(s)
- computation power
usually cared by <span style="color:blue">algorithm</span>
- Other Resources
- manpower
- budget
usually cared by <span style="color:green">management</span>
<span style="color:red">data structures</span> and <span style="color:blue">algorithms</span> : for writing good <span style="color:blueviolet">program</span>
---
### 2.10 Proper Use : Trade-off Different Factors
- faster GET $\iff$ slower INSERT and/or maintenance
- more space $\iff$ faster computation
- harder to implement/debug $\iff$ faster computation
<span style="color:blueviolet">good program</span> needs understanding trade-off
---
### 2.11 Summary of Data Structure
- definition of data structure
organize data with <span style="color:orange">access/maintenance</span> algorithms
- ordered array as data structure
<span style="color:orange">insert by cut-in, remove by fill-in</span>
- GET (search) in ordered array
<span style="color:orange">binary search</span> using <span style="color:orange">order</span> for shortcuts
- why data structures and algorithms
study <span style="color:orange">trade-off</span> to move from coding to programming
---
### 2.12 Appendix
:question: Tips for generating test data
1. Think about which situations are most error-prone, e.g. extreme/boundary cases; try to cover them exhaustively
2. How to generate large-scale test data
<br>
---
## Part 3. Linked List
### 3.1 Issue of Ordered Array for Polynomial Computation
Essentially, <span style="color:blue">INSERT</span> in an ordered array causes much <span style="color:blueviolet">content moving</span> when cutting in (ordered array : less flexible for <span style="color:blue">insertion & removal</span>).
A linked list gives flexible insertion : no <span style="color:blueviolet">content moving</span>, just <span style="color:orange">changing pointers</span>.
<span style="color:red">overhead of next</span> $\implies$ flexible <span style="color:blue">INSERT-AFTER</span>
---
### 3.2 Algorithms (Operations) for Linked List
- GET-DATA(clue) $\implies$ GET-DATA(node)
  return data[clue] $\implies$ return node.data
- GET-NEXT(clue) $\implies$ GET-NEXT(node)
  return <span style="color:red">next[clue]</span> $\implies$ return <span style="color:red">node.next</span>
```c++=
// INSERT-AFTER(node, data)
newNode = NODE(data, node.next)
node.next = newNode
// REMOVE-AFTER(node)
oldNode = node.next
node.next = oldNode.next
return oldNode
```
<span style="color:red">linked</span> list : flexibility to allocate NODE <span style="color:red">anywhere</span>
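The two pseudocode operations map directly onto pointer manipulation; a minimal sketch (the `Node` layout is the obvious one, not prescribed by the source):

```cpp
struct Node {
    int data;
    Node* next;
    Node(int d, Node* n) : data(d), next(n) {}
};

// INSERT-AFTER : only pointer changes, no content moving
void insertAfter(Node* node, int data) {
    node->next = new Node(data, node->next);
}

// REMOVE-AFTER : unlink and return the removed node
Node* removeAfter(Node* node) {
    Node* old = node->next;
    node->next = old->next;
    return old;
}
```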
---
### 3.3 Linked List : Ordered v.s. Unordered
#### Ordered Linked List
- output polynomial <span style="color:red">orderly</span>
- INSERT : (sequential) <span style="color:red">search</span> before insertion
- UPDATE : <span style="color:red">remove then insert</span>
#### Unordered Linked List
- <span style="color:blue">general purposes</span>
- INSERT : usually just <span style="color:blue">head insertion</span>
- UPDATE : <span style="color:blue">simply change</span> the node
<span style="color:red">ordered</span> v.s. <span style="color:blue">unordered</span> : trade-off between <span style="color:red">(orderly) output</span> & <span style="color:blue">maintenance</span> time
---
### 3.4 REMOVE-HERE for Linked List
For example, REMOVE-HERE may require node@4.next = node@2.next.
The problem : we do not know where node@4 (the predecessor) is without a sequential search.
Result : <span style="color:blueviolet">REMOVE-HERE</span> is hard for <span style="color:orange">singly</span> linked list
<br>
The solution for REMOVE-HERE :
### 3.5 Doubly Linked List : More Flexible REMOVE-HERE
```c++=
// REMOVE-HERE(node)
node.prev.next = node.next
node.next.prev = node.prev
return node
```
<span style="color:blue">overhead of prev</span> in doubly linked list $\implies$ flexible REMOVE-HERE
---
### 3.6 INSERT for Doubly Linked List
```c++=
// INSERT-AFTER(node, data)
newNode = NODE(data, node.next, node)
node.next = newNode
newNode.next.prev = newNode
// INSERT-BEFORE(node, data)
newNode = NODE(data, node, node.prev)
node.prev = newNode
newNode.prev.next = newNode
```
not just space overhead, but also <span style="color:blue">time overhead</span> for <span style="color:blue">doubly linked list</span>
---
### 3.7 Sparse Vector : Dense Array versus Linked List
$[1,0,0,0,0,3,0,4] \in \mathbb{R}^8$
- #### Dense Array Representation
| index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| ----- | --- | --- | --- | --- | --- | --- | --- | --- |
| value | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 4 |
- #### Linked List Representation
store **non-zero** values with ordered linked list of (index, value)
$(1,1) \to (6,3) \to (8,4)$
storing only non-zeros : time/space efficient if many zeros
---
### 3.8 Merging Sparse Vectors
$[1,0,0,0,0,3,0,4] + [0,0,0,0,0,6,9,0]$
$(1,1) \to (6,3) \to (8,4) + (6,6) \to (7,9)$
```c++=
// SUM(A,B)
L = empty list, ca = A.head, cb=B.head
while ca != NIL and cb != NIL
if ca.index equals cb.index
if (ca.value+cb.value) != 0
L.INSERT-TAIL(NODE(ca.index, ca.value+cb.value, NIL))
ca = ca.next, cb = cb.next
else if ca.index > cb.index
L.INSERT-TAIL(NODE(cb.index, cb.value, NIL))
cb = cb.next
else
L.INSERT-TAIL(NODE(ca.index, ca.value, NIL))
ca = ca.next
// append the remaining nodes of the non-NIL list with L.INSERT-TAIL
return L
```
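The SUM merge can be sketched with a `std::vector` of (index, value) pairs standing in for the ordered linked list of the pseudocode (the `Sparse` alias and `sum` name are mine):

```cpp
#include <vector>
#include <utility>

// a sparse vector as ordered (index, value) pairs with non-zero values
using Sparse = std::vector<std::pair<int, int>>;

Sparse sum(const Sparse& A, const Sparse& B) {
    Sparse L;
    std::size_t ca = 0, cb = 0;       // two cursors, as in the pseudocode
    while (ca < A.size() && cb < B.size()) {
        if (A[ca].first == B[cb].first) {
            int v = A[ca].second + B[cb].second;
            if (v != 0) L.push_back({A[ca].first, v});  // drop new zeros
            ++ca; ++cb;
        } else if (A[ca].first > B[cb].first) {
            L.push_back(B[cb]); ++cb;
        } else {
            L.push_back(A[ca]); ++ca;
        }
    }
    while (ca < A.size()) L.push_back(A[ca++]);  // leftovers of A
    while (cb < B.size()) L.push_back(B[cb++]);  // leftovers of B
    return L;
}
```

This is the same two-cursor walk as LIBSVM's `Kernel::dot` below, just producing a list instead of a scalar.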
#### Real-World Use of Sparse Vector : LIBSVM
```c++=
// svm.cpp
double Kernel::dot(const svm_node *px, const svm_node *py) {
double sum = 0;
while (px->index != -1 && py->index != -1) {
if (px->index == py->index) {
sum += px->value * py->value;
++px;
++py;
}
else {
if (px->index > py->index)
++py;
else
++px;
}
}
return sum;
}
```
---
### 3.9 Summary of Linked List
- motivation of linked list
overhead of <span style="color:orange">next</span> for flexibility
- doubly linked list
overhead of <span style="color:orange">prev</span> for more flexibility
- linked list for sparse vector
storing non-zeros for time/space efficiency
- linked list in job interviews
often used to test basic computational thinking
<br>
---
## Part 4. Analysis Tools
### 4.1 Properties of Good Programs
#### Good Program
- meet requirements, correctness: basic
- clear usage document (external), readability (internal), etc.
#### Proper Resource Usage
- efficient use of computation resources (CPU,GPU,etc.) $\to$ <span style="color:blue">time complexity</span>
- efficient use of storage resources (memory,disk,etc.) $\to$ <span style="color:blue">space complexity</span>
---
### 4.2 Space Complexity of GET-MIN-INDEX
```c++=
// GET-MIN-INDEX(A,i,r)
m = i // store current min.index
for i=i+1 to r
// update if i-th element smaller
if A[m] > A[i]
m = i
return m
```
- array A : pointer size <span style="color:blue">$d_1$</span> (not counting the actual input elements)
- integer m : size <span style="color:blue">$d_2$</span>
- integer i : size <span style="color:blue">$d_2$</span>
total space <span style="color:blue">$d_1 + 2d_2$ (constant)</span>
within algorithm execution : **independent of n = A.length**
---
### 4.3 Space Complexity of GET-MIN-INDEX-WASTE
```c++=
// GET-MIN-INDEX-WASTE(A)
B = COPY(A,1,A.length)
INSERTION-SORT(B)
return B[1] // return the minimum value
```
- array A : pointer size <span style="color:blue">$d_1$</span> (not counting the actual input elements)
- array B : pointer size <span style="color:blue">$d_1$</span> and n = A.length integers <span style="color:blue">$n \cdot d_2$</span> (grows the most with the input size)
total space <span style="color:blue">$2d_1 + d_2 n$ (linear) : dependent on n</span>
---
### 4.4 Time Complexity of Insertion Sort
```c++=
// INSERTION-SORT(A)
for m = 2 to A.length
key = A[m]
// Insert A[m] into the sorted sequence A[1...m-1]
i = m - 1
while i > 0 and A[i] > key
A[i+1] = A[i]
i = i - 1
A[i+1] = key
```
total time $T(n)$ = $\color{blue}{d_1}n + \color{blue}{d_2}(n - 1) + \color{blue}{d_4}(n - 1) + \color{blue}{d_5}\sum_\limits{m=2}^{n}\color{red}{t_m} + \color{blue}{d_6}\sum_\limits{m=2}^{n}(\color{red}{t_m} - 1) + \color{blue}{d_7}\sum_\limits{m=2}^{n}(\color{red}{t_m} - 1) + \color{blue}{d_8}(n - 1)$
cost <span style="color:blue">$d_x$ depends on machine type</span> : total $T(n)$ depends on $n$ and <span style="color:red">$t_m$ number of **while** checks</span>
---
### 4.5 Time Complexity Analysis in Reality
- #### Common Focus
worst-case time complexity
- meaningful in practice (waiting time)
- similar to average-case when near-worst-case often enough
- #### Common Language
'rough' time needed w.r.t. n (input size)
- care about larger n
- leading (bigger) term more important
- insensitive to constants
---
### 4.6 'Rough' Notation
- care about larger $n$
- leading term more important
- insensitive to constants
For example :
$\underbrace{cn^2 + bn + a}_\text{f(n)} = \underbrace{\Theta(n^2)}_\text{g(n)}$
for <span style="color:red">positive</span> $f(n)$ and $g(n)$ [when n $\ge$ 1]
<br>
#### Representing 'Rough' by Asymptotic Behavior
$\underbrace{cn^2 + bn + a}_\text{f(n)} = \underbrace{\Theta(n^2)}_\text{g(n)}$
- growth of $bn + a$ slower than $g(n) = n^2$ : removed by dividing $g(n)$ for large $n$
- asymptotically, two functions only differ by $c > 0$
$\lim\limits_{n\to\infty}\frac{f(n)}{g(n)} = c$
<br>
In conclusion, for <span style="color:red">positive</span> $f(n)$ and $g(n)$,
$f(n) = \Theta(g(n))\ \text{ if } \ \lim_{n\to\infty}\frac{f(n)}{g(n)} = c > 0$
<br>
**big-$\Theta$ : roughly the same**
- Back to the earlier criteria, the definition meets them :
    - care about larger $n$ : yes, $n \to \infty$
    - leading term more important : yes, $n + \sqrt{n} + \log n = \Theta(n)$
- insensitive to constants: yes, for example : $1126n = \Theta(n)$
- meaning : $f(n)$ grows roughly the same as $g(n)$
- "$=\Theta(\cdot)$" actually "$\in$"
<br>
| | $\sqrt{n}$ | $0.1126n$ | $n$ | $112.6n$ | $n^{1.1}$ | $e^n$ | $n+\sqrt{n}$ |
| ------------ | ---------- | --------- | --- | -------- | --------- | ----- | ----------- |
| $\Theta(n)?$ | N | Y | Y | Y | N | N | Y |
Asymptotic notation is the most used 'language' for time/space complexity
---
### 4.7 Big-O Notation
Take binary search as an example : best case $O(1)$, worst case $\Theta(\log n)$. For 'any' binary search task, the time needed is roughly no more than $\log n$ $\to$ usually we care about the worst case, not the best case.
<br>
<img src='https://cstaleem.com/wp-content/uploads/2023/03/Theta-Notation-%CE%98-notation.png'><br><br>
**Big-O Notation**
$f(n)$ grows slower than or similar to $g(n)$ : ("$\le$")
One side of $\Theta(\cdot)$ definition :
$f(n) = O(g(n))$ iff exist positive($c_2$,$n_0$) such that $f(n) \le c_2 \cdot g(n)$ for all $n \ge n_0$
Order (from slower- to faster-growing) :
$\log_2 n < n < n\log_2 n < n^2 < n^3 < 2^n < n!$
---
### 4.8 Three Big Asymptotic Notation
- $f(n)$ grows slower than or similar to $g(n)$ : ("$\le$")
$f(n) = O(g(n))$ iff exist $c$,$n_0$ such that $f(n) \le c \cdot g(n)$ for all $n \ge n_0$
- $f(n)$ grows faster than or similar to $g(n)$ : ("$\ge$")
$f(n) = \Omega(g(n))$ iff exist $c$,$n_0$ such that $f(n) \ge c \cdot g(n)$ for all $n \ge n_0$
- $f(n)$ grows similar to $g(n)$ : ("$\approx$")
$f(n) = \Theta(g(n))$ iff $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$
usually only need Big-O
---
### 4.9 Transitivity of Big-O
#### Theorem
if $f(n) = O(g(n))$ $[f \le g]$ and $g(n) = O(h(n))$ $[g \le h]$, then $f(n) = O(h(n))$ $[f \le h]$
#### Proof
- When $n \ge n_1$, $f(n) \le c_1g(n)$
- When $n \ge n_2$, $g(n) \le c_2h(n)$
- So, when $n \ge max(n_1, n_2)$, $f(n) \le c_1c_2h(n)$
Similar for $\Omega$ and $\Theta$
---
### 4.10 Closure of Big-O
#### Theorem
if $f(n) = O(h(n))$, $g(n) = O(h(n))$ then $f(n) + g(n) = O(h(n))$
#### Proof
- When $n \ge n_1$, $f(n) \le c_1h(n)$
- When $n \ge n_2$, $g(n) \le c_2h(n)$
- So, when $n \ge max(n_1, n_2)$, $f(n) + g(n) \le (c_1 + c_2)h(n)$
Similar for $\Omega$ and $\Theta$
---
### 4.11 Summary of Analysis Tools
- motivation
quantify <span style="color:orange">time/space complexity</span> to measure efficiency
- cases of complexity analysis
often focus on <span style="color:orange">worst-case</span> with 'rough' notations
- asymptotic notation
<span style="color:orange">rough comparison of function for large n</span>
- usage of asymptotic notation
describe $f(n)$ (time, space) by simpler $g(n)$
- operations on asymptotic complexity
analyze pseudo code by pieces
- practical complexity
how much more resource needed?
<br>
---
## Part 5. Stack
#### Last-In-First-Out (LIFO)
Stack is a restricted data structure, but important for computer science.
All insertions and deletions of entries are made at one end, which is called the **top** of the stack.
### 5.1 Function Calls with a Stack
- When calling a function
<img src='https://miro.medium.com/v2/resize:fit:1100/1*nMAYYF1N04SjoRfdRmp4NA.gif'>
- When executing a function
<img src='https://miro.medium.com/v2/resize:fit:1100/1*Kk8KFEJ_mJmrv-hQN21gGg.gif'>
---
### 5.2 The Three Stack Operations
#### PUSH
```c++=
PUSH(S, data)
// INSERT usually named PUSH
// put data onto top of S
```
#### PEEP
```c++=
PEEP(S)
// GET usually named PEEP
// return top element of S
```
#### POP
```c++=
POP(S)
// REMOVE usually named POP
// remove and return top element of S
```
Sometimes other utility functions like SIZE() or ISEMPTY()
---
### 5.3 Stacks Implemented on Array
#### PUSH
```c++=
// PUSH(S, data)
S.top = S.top + 1
S.arr[S.top] = data
```
#### POP
```c++=
// POP(S)
S.top = S.top - 1
return S.arr[S.top + 1]
```
Usually consecutive array with S.top at 'tail' of array for O(1) operations.
---
### 5.4 Stack Solution to Parentheses Balancing
#### Condition
Any ')' should match last unmatched '(' (LIFO)
#### Method
'(' : PUSH, ')' : POP
#### Parentheses Balancing Algorithm
```c++=
S = empty stack
for each c in input
if c is '('
PUSH(S, c)
else // c is ')'
if not ISEMPTY(S)
POP(S)
else
return FALSE
return ISEMPTY(S)
```
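The balancing algorithm in real code, using `std::stack` (with only one bracket type a counter would do, but the stack mirrors the pseudocode and generalizes to multiple bracket types):

```cpp
#include <stack>
#include <string>

// balanced iff every ')' matches the last unmatched '(' (LIFO)
bool balanced(const std::string& input) {
    std::stack<char> S;
    for (char c : input) {
        if (c == '(') {
            S.push(c);             // PUSH on '('
        } else if (c == ')') {
            if (S.empty()) return false;  // ')' with nothing to match
            S.pop();               // POP on ')'
        }
    }
    return S.empty();              // leftover '(' means unbalanced
}
```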
---
### 5.5 Stack Solution to Infix-Postfix Translation
Because the computer does not understand operator precedence (multiplication and division before addition and subtraction), expressions are rewritten in postfix form.
Supplementary article : [Link](https://www.geeksforgeeks.org/convert-infix-expression-to-postfix-expression/?ref=lbp)
#### Postfix
postfix evaluation with operand stack S
```c++=
S = empty stack
for each token in input
if token is a number
PUSH(S, token)
else if token is an operator
// note the order of a and b : it matters for non-commutative operators
b = POP(S)
a = POP(S)
PUSH(S, token(a, b))
return POP(S)
```
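A runnable postfix evaluator over space-separated tokens (the tokenizing via `std::istringstream` and integer-only arithmetic are simplifying assumptions of this sketch):

```cpp
#include <stack>
#include <sstream>
#include <string>

// evaluate a space-separated postfix expression, e.g. "3 4 * 5 -"
int evalPostfix(const std::string& expr) {
    std::stack<int> S;                 // operand stack
    std::istringstream in(expr);
    std::string token;
    while (in >> token) {
        if (token == "+" || token == "-" || token == "*" || token == "/") {
            int b = S.top(); S.pop();  // note the order: b popped first
            int a = S.top(); S.pop();
            if (token == "+") S.push(a + b);
            else if (token == "-") S.push(a - b);
            else if (token == "*") S.push(a * b);
            else S.push(a / b);
        } else {
            S.push(std::stoi(token));  // operand
        }
    }
    return S.top();
}
```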
#### Infix to Postfix
Infix to Postfix with operator stack S2
```c++=
S2 = empty stack
for each token in input
if token is a number
output token
else if token is an operator
while not ISEMPTY(S2) and PEEP(S2) is higher/same precedence
output POP(S2)
PUSH(S2, token)
```
Combining the two algorithms (or stacks S1, S2) gives a simple calculator
---
### 5.6 Summary of Stack
- motivation
temporary storage with <span style='color:orange'>LIFO</span>
- implementation
<span style="color:orange">$O(1)$ push/pop from tail of array</span>
- application: postfix evaluation
stack as temporary storage of partial results
- application: expression parsing
stack as temporary storage of waiting operands
<br>
---
## Part 6. Queue
#### First-In-First-Out (FIFO)
Queue is a restricted data structure, but important for computer science.
### 6.1 Packet Queue in Networking
<br>
<img src='https://i.stack.imgur.com/LUpE6.gif'><br><br>
Multiple queues with different priority for weighted fair queuing.
In real world use, from simple/single queue to complicated/multiple queues.
---
### 6.2 Queue Implemented on Circular Array
<br>
<img src='https://marxtudor.com/wp-content/uploads/2015/04/circular-queue-explained-1.jpg'>
<br><br>
#### ENQUEUE
```c++=
// ENQUEUE(Q, data)
Q.arr[Q.tail] = data
if Q.tail == Q.length
Q.tail = 1
else
Q.tail = Q.tail + 1
```
#### DEQUEUE
```c++=
// DEQUEUE(Q)
x = Q.arr[Q.head]
if Q.head == Q.length
Q.head = 1
else
Q.head = Q.head + 1
return x
```
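A small circular-array queue in code; unlike the 1-based pseudocode, this sketch uses 0-based indices with modular wrap-around, and keeps one slot empty to distinguish full from empty (both are my implementation choices):

```cpp
#include <vector>

// fixed-capacity FIFO queue on a circular array
struct CircularQueue {
    std::vector<int> arr;
    int head = 0, tail = 0;   // tail = next free slot
    explicit CircularQueue(int capacity) : arr(capacity + 1) {}
    bool empty() const { return head == tail; }
    void enqueue(int x) {
        arr[tail] = x;
        tail = (tail + 1) % (int)arr.size();  // wrap around
    }
    int dequeue() {
        int x = arr[head];
        head = (head + 1) % (int)arr.size();  // wrap around
        return x;
    }
};
```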
---
### 6.3 General Maze Algorithm
#### Condition
- <span style='color:blue'>start from some location</span>
- <span style='color:red'>storage of future candidate</span>
- <span style='color:blueviolet'>record of visited</span>
#### Code
```c++=
// MAZE-OUT(M)
Candidate = {}
all Visited[i,j] = false
Candidate.INSERT(M.begin_i, M.begin_j)
// get out of M starting from (M.begin_i, M.begin_j)
while Candidate not empty
(i, j) = Candidate.REMOVE()
if Visited[i,j] is false
Visited[i,j] = true
for each non-Visited(k,l) neighbor of (i,j)
if (k,l) is exit
return SUCCEED
else
Candidate.INSERT(k,l)
return NO-WAY-OUT
```
Walk to the <span style='color:limegreen'>exit</span> by <span style='color:limegreen'>random</span> removal from <span style='color:red'>Candidate</span>.
---
### 6.4 Maze Algorithm by Stack
```c++=
// MAZE-OUT-STACK(M)
Candidate = {}
all Visited[i,j] = false
Candidate.PUSH(M.begin_i, M.begin_j)
// get out of M starting from (M.begin_i, M.begin_j)
while Candidate not empty
(i,j) = Candidate.POP()
if Visited[i,j] is false
Visited[i,j] = true
for each non-Visited(k,l) neighbor of (i,j)
if (k,l) is exit
return SUCCEED
else
Candidate.PUSH(k,l)
return NO-WAY-OUT
```
LIFO of <span style='color:limegreen'>Stack</span> $\implies$ <span style='color:limegreen'>later neighbor</span> first $\implies$ <span style='color:limegreen'>last path</span> out
<span style='color:limegreen'>Stack</span> : last path out by **D**epth-**F**irst **S**earch
---
### 6.5 Maze Algorithm by Queue
```c++=
// MAZE-OUT-QUEUE(M)
Candidate = {}
all Visited[i,j] = false
Candidate.ENQUEUE(M.begin_i, M.begin_j)
// get out of M starting from (M.begin_i, M.begin_j)
while Candidate not empty
(i,j) = Candidate.DEQUEUE()
if Visited[i,j] is false
Visited[i,j] = true
for each non-Visited(k,l) neighbor of (i,j)
if (k,l) is exit
return SUCCEED
else
Candidate.ENQUEUE(k,l)
return NO-WAY-OUT
```
FIFO of <span style='color:limegreen'>Queue</span> $\implies$ <span style='color:limegreen'>nearby neighbors</span> first $\implies$ <span style='color:limegreen'>shortest unit-step path</span> out
<span style='color:limegreen'>Queue</span> : shortest unit-step path out by **B**readth-**F**irst **S**earch
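The queue version amounts to BFS on a grid; a sketch over a character maze (`'#'` wall, `'.'` open), where the distance array doubles as the Visited record. The maze encoding and function name are my assumptions:

```cpp
#include <queue>
#include <vector>
#include <string>
#include <utility>

// fewest unit steps from (si,sj) to (ei,ej), or -1 if NO-WAY-OUT
int mazeShortest(const std::vector<std::string>& M,
                 int si, int sj, int ei, int ej) {
    int rows = (int)M.size(), cols = (int)M[0].size();
    std::vector<std::vector<int>> dist(rows, std::vector<int>(cols, -1));
    std::queue<std::pair<int, int>> candidate;  // FIFO of future candidates
    dist[si][sj] = 0;                           // dist != -1 marks Visited
    candidate.push({si, sj});
    int di[] = {1, -1, 0, 0}, dj[] = {0, 0, 1, -1};
    while (!candidate.empty()) {
        auto [i, j] = candidate.front(); candidate.pop();
        if (i == ei && j == ej) return dist[i][j];  // reached the exit
        for (int d = 0; d < 4; ++d) {
            int k = i + di[d], l = j + dj[d];
            if (k >= 0 && k < rows && l >= 0 && l < cols &&
                M[k][l] != '#' && dist[k][l] == -1) {  // non-Visited neighbor
                dist[k][l] = dist[i][j] + 1;
                candidate.push({k, l});
            }
        }
    }
    return -1;  // NO-WAY-OUT
}
```

Swapping the `std::queue` for a `std::stack` turns this into the DFS version of 6.4 (losing the shortest-path guarantee).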
---
### 6.6 Deques (double-ended queues)
**Deque = Stack + Queue + push_front**
- action :
[constant-time] push_back (like push and enqueue), pop_back (like pop), pop_front (like dequeue), push_front
- application : job scheduling
Can be implemented by circular array or doubly-linked list.
---
### 6.7 Summary of Queue
- application : maze solving
different data structure causes different algorithm behaviour
- <span style="color:orange">stack + queue = deque</span>
union of both
<br>
---
## Part 7. Tree
### 7.1 Nature of Data Structures
| data structure | nature |
| ----------------- | ---------------------------- |
| array | indexed access |
| linked list | sequential access |
| stack/queue/deque | restricted (boundary) access |
| tree | hierarchical access |
---
### 7.2 Formal Definition of (Rooted) Tree
$T = (\color{blue}{root};$ $\color{red}{T_1, T_2, \dots, T_n})$
- recursive definition
- <span style='color:blue'>T.root for starting tree access</span> (like L.head)
- <span style='color:red'>disjoint sub-trees $(T_1, \dots, T_n)$</span>
- recursion termination : <span style='color:limegreen'>$T_{LEAF}$ with no sub-trees</span>
<br>
- #### Height of Tree
height = node max depth + 1
usually want <span style='color:limegreen'>small height</span> to access efficiently from <span style='color:blue'>root</span>
<br>
<img src='https://media.geeksforgeeks.org/wp-content/uploads/20221107231243/heightexample2-660x399.JPG' style='width:400px'><br>
- #### Node of Tree
node degree = # sub-trees
- internal nodes : degrees > 0
- external nodes : degrees = 0, <span style='color:limegreen'>leaves</span>
<span style='color:limegreen'>leaves</span> sometimes called <span style='color:limegreen'>breadth</span> of tree
<br><img src='https://media.geeksforgeeks.org/wp-content/uploads/20221107231626/depthexample1-660x468.JPG' style='width:400px'><br>
- #### Family Relatives of Tree
<br><img src='https://hyosup0513.github.io/public/images/tree5.PNG' style="width:400px"><br>
---
### 7.3 Basic Algorithms (Operations) for Tree
$T = (\color{blue}{root};$ $\color{red}{T_1, T_2, \dots, T_n})$
- #### GET-DATA
- linked list
```c++=
// GET-DATA(L.node)
return L.node.data
```
- tree
```c++=
// GET-DATA(T.node)
return T.node.data
```
- #### GET-NEXT
- linked list
```c++=
// GET-NEXT(L.node)
return L.node.next
```
- tree
```c++=
// GET-SUBTREE(T.node, index)
return T.node.subtree.GET(index)
```
- #### <span style="color:red">INSERT-AFTER / INSERT-CHILD</span>
- linked list
```c++=
// INSERT-AFTER(L.node, data)
newNode = NODE(data, L.node.next)
L.node.next = newNode
```
- tree
```c++=
newNode = NODE(data,[]) // no child
T.node.subtree.INSERT(newNode)
```
In general, a tree is an extension of a linked list.
---
### 7.4 Representing Trees with Arrays and Linked List
#### Representing Trees with Arrays
<br><img src='https://i.stack.imgur.com/7MkVp.png'><br>
Can also have parent link (like doubly linked list)
<br>
#### Representing Trees with Linked List
Called <span style="color:red">left-child</span> <span style="color:limegreen">right-sibling</span> (LCRS)
<br><img src='https://examradar.com/wp-content/uploads/2016/10/Figure-5.2.7.-Linked-representation-for-the-binary-tree.png?x59255'><br><br>
<img src='https://opendsa-server.cs.vt.edu/ODSA/Books/Everything/html/_images/ChildLst.png'><br><br>
<img src='https://opendsa-server.cs.vt.edu/ODSA/Books/Everything/html/_images/GenLkFx.png'><br>
<hr>
### 7.5 C&RT Algorithm for Decision Trees
```
function DecisionTree(data) {
if cannot branch anymore
return best color
else
1. learn branching criteria to cut the plane
2. split data to 2 parts
3. build sub-tree from each part
4. return (branching criteria; sub-trees)
}
```
<span style="color:orange">C&RT : 'divide-and-conquer'</span> (based on <span style="color:blueviolet">selected components</span> of $CART^{TM}$ of California Statistical Software)
---
### 7.6 Summary of Tree
- intuition
<span style="color:orange">hierarchical access from root of tree</span>
- terminologies
family relatives useful for describing node relations
- implementations
<span style="color:orange">more complicated links than array/linked list</span>
- application : decision tree
<span style="color:orange">divide-and-conquer model in machine learning</span>
<br>
---
## Part 8. Binary Tree
- binary tree : rooted tree with node degree $\le$ 2 & <span style="color:green">left</span>/<span style="color:red">right</span> difference
<br><img src='https://media.geeksforgeeks.org/wp-content/uploads/20200219144238/General-Tree-vs-Binary-Tree.png'><br>
- <span style="color:red">extended</span> binary tree : added <span style="color:red">empty external nodes</span> so all <span style="color:blueviolet">degree = 2</span> & all original nodes <span style="color:blue">blue</span>
<br><img src='https://easytechnotes.com/wp-content/uploads/2023/06/30_20231.png'><br>
Supplementary link : [Link](https://hackmd.io/@vRN1CwEsTLyHOsG4mC0d4Q/SJJ6ofDn4)
---
### 8.1 Outputting Infix Expression from Expression Tree
Expression tree : a (binary) expression represented by a (binary) tree (expression semantics expressed as a tree).
#### want : 3 * 4 - ( 5 + 6 ) * 7 + 8 * 9
```c++=
// OUTPUT-INFIX(T)
if IS-LEAF(T)
print T.data
else
if (T.data higher precedence than T.left.data)
print '('
OUTPUT-INFIX(T.left)
if (T.data higher precedence than T.left.data)
print ')'
print T.data
if (T.data higher precedence than T.right.data)
print '('
OUTPUT-INFIX(T.right)
if (T.data higher precedence than T.right.data)
print ')'
```
OUTPUT-INFIX is a variant of INORDER-TRAVERSAL
<hr>
### 8.2 From OUTPUT-INFIX to INORDER-TRAVERSAL
Visit the left subtree first, then the node itself, then the right subtree.
#### OUTPUT-INFIX(T)
```c++=
if IS-LEAF(T)
print T.data
else
if ...
OUTPUT-INFIX(T.left)
if ...
print T.data
if ...
OUTPUT-INFIX(T.right)
if...
```
#### INORDER-TRAVERSAL(T)
```c++=
if T is not NIL
INORDER-TRAVERSAL(T.left)
action with T.data
INORDER-TRAVERSAL(T.right)
```
<hr>
### 8.3 POSTORDER-TRAVERSAL & PREORDER-TRAVERSAL
<img src='https://leetcode.com/articles/Figures/145_transverse.png'><br>
#### POSTORDER-TRAVERSAL(T)
```c++=
if T is not NIL
POSTORDER-TRAVERSAL(T.left)
POSTORDER-TRAVERSAL(T.right)
action with T.data
```
#### PREORDER-TRAVERSAL(T)
```c++=
if T is not NIL
action with T.data
PREORDER-TRAVERSAL(T.left)
PREORDER-TRAVERSAL(T.right)
```
#### INORDER-TRAVERSAL(T)
```c++=
if T is not NIL
INORDER-TRAVERSAL(T.left)
action with T.data
INORDER-TRAVERSAL(T.right)
```
#### Sorting
<img src='https://d18l82el6cdm1i.cloudfront.net/uploads/jxTW76g4MR-traversal.gif'>
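The sorting connection in code: an inorder traversal of a binary search tree emits keys in sorted order. The `TreeNode` layout and collecting into a vector are my choices for this sketch:

```cpp
#include <vector>

// binary tree node with left/right sub-trees
struct TreeNode {
    int data;
    TreeNode *left, *right;
    TreeNode(int d, TreeNode* l = nullptr, TreeNode* r = nullptr)
        : data(d), left(l), right(r) {}
};

// INORDER-TRAVERSAL : left, then node, then right; on a binary
// search tree the "action" below collects keys in sorted order
void inorder(const TreeNode* T, std::vector<int>& out) {
    if (T == nullptr) return;
    inorder(T->left, out);
    out.push_back(T->data);   // action with T.data
    inorder(T->right, out);
}
```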
<hr>
### 8.4 Full Tree and Complete Tree
#### Relation between Nodes and Height :
$h \le n \le 2^h - 1 \iff \lg(n+1) \le h \le n$ (note : h is the height, n the number of nodes)
<span style="color:red">**Note : the complexity of tree algorithms often depends on the height**</span>
#### Types of Trees :
[Full Tree vs. Complete Tree explained](http://alrightchiu.github.io/SecondRound/binary-tree-introjian-jie.html)<br>
<img src='https://i.stack.imgur.com/n2CFS.png'><br>
<hr>
### 8.5 Summary of Binary Tree
- terminologies
<span style="color:orange">binary tree = tree with left/right sub-trees</span>
- binary tree traversals
templates for (recursive) tree algorithm design
- full and complete binary trees
can be implemented in array <span style="color:orange">without wasting space</span>
<br>
<hr>
## Part 9. Priority Queue
Priority Queue : an 'extension' of the queue for more realistic use, with Max-Priority-Out.
### 9.1 Priority Queue with Binary Tree
- #### previously
| | data |
| ---- | ----- |
| <span style="color:green">left</span> | <span style="color:red">right</span> |
- #### next
| key | data |
| --- | ---- |
| <span style="color:green">left</span> |<span style="color:red">right</span> |
<br>
key : priority (the <span style="color:blueviolet">larger</span> the better)
data : item in todo list
(will show key only for simplicity)
Goal : get the node with <span style="color:blueviolet">largest</span> key fast (& remove it)
<hr>
### 9.2 Max Tree
faster access to the <span style="color:blueviolet">largest key</span> if put at T.root
max-root (binary) tree : <span style="color:blueviolet">largest key @ root</span>
The problem : hard to <span style="color:blueviolet">remove largest</span> fast if we cannot easily locate the <span style="color:red">second largest</span>.
Solution : keep the largest key at the top of every subtree, so when the root is removed, the larger child of the next level is promoted to fill the gap.
max (binary) tree : <span style="color:blueviolet">largest key @ root</span> <span style="color:red">for every subtree</span>
<hr>
### 9.3 Simple Removal for Max Tree
```c++=
// REMOVE-LARGEST(max tree T)
result = T.data
cursor = T.root // the node about to be vacated
while (cursor not NIL)
larger = COMPARE(cursor.left, cursor.right) // larger non-NIL child, or NIL
if (larger not NIL)
cursor.key = larger.key
cursor.data = larger.data
cursor = larger
else
remove cursor from T // reached a leaf : done
cursor = NIL
return result
```
REMOVE-LARGEST for height-h max-tree takes O(h) time.
<hr>
### 9.4 Max Heap : "Shortest" Max Tree
### 9.5 REMOVE-LARGEST for Max Heap
**max heap = max tree + complete binary tree**
Restoring the Max Tree property first and then the Complete Tree property is hard, so we maintain the Complete Tree property first (move the tail to the root), then sift down.
```c++=
// REMOVE-LARGEST(max heap T)
result = T.data
remove T.tail & copy it to T.root // T.tail is the last (highest-index) node
cursor = T.root
while (cursor != NIL)
largest = LARGEST(cursor, cursor.left, cursor.right)
if (largest == cursor)
break
else
SWAP(cursor, largest)
cursor = largest
return result
```
takes roughly $O(log n)$ time (one comparison path down a tree of height $log n$)
<hr>
### 9.6 INSERT for Max Heap
```c++=
// INSERT(max heap T, key, data)
insert(key, data) to T.tail
cursor = T.tail
while (cursor != NIL)
larger = LARGER(cursor, cursor.parent)
if (larger == cursor)
break
else
SWAP(cursor, larger)
cursor = larger
```
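The two operations above can be sketched as a runnable Python max heap stored in an array (the standard complete-tree layout: children of index `i` at `2i+1` and `2i+2`; the class name and layout are assumptions for this example):

```python
class MaxHeap:
    """Array-backed max heap: every parent's key >= its children's keys."""
    def __init__(self):
        self.a = []  # complete binary tree flattened into an array

    def insert(self, key):
        # append at the tail, then sift up while larger than the parent
        self.a.append(key)
        i = len(self.a) - 1
        while i > 0 and self.a[i] > self.a[(i - 1) // 2]:
            self.a[i], self.a[(i - 1) // 2] = self.a[(i - 1) // 2], self.a[i]
            i = (i - 1) // 2

    def remove_largest(self):
        # move the tail to the root to keep completeness, then sift down
        result = self.a[0]
        tail = self.a.pop()
        if self.a:
            self.a[0] = tail
            i = 0
            while True:
                largest = i
                for c in (2 * i + 1, 2 * i + 2):
                    if c < len(self.a) and self.a[c] > self.a[largest]:
                        largest = c
                if largest == i:
                    break
                self.a[i], self.a[largest] = self.a[largest], self.a[i]
                i = largest
        return result

h = MaxHeap()
for k in [3, 9, 1, 7, 5]:
    h.insert(k)
out = [h.remove_largest() for _ in range(5)]
print(out)  # [9, 7, 5, 3, 1]
```

Both operations walk one root-to-leaf path, so each is $O(log n)$ on a complete tree.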
<br>
<hr>
## Part 10. Heap
#### Heap $\iff$ Partially Ordered Array
unordered array $\to$ heap
selection sort $\to$ heap sort
### 10.1 Selection Sort v.s. Heap Sort
#### Selection Sort
```c++=
// SELECTION-SORT(A)
for i=1 to A.length
m = GET-MIN-INDEX(A, i, A.length)
SWAP(A[i], A[m])
return A // which has been sorted in place
// GET-MIN-INDEX(A,i,r)
m = i // store current min.index
for i=i+1 to r
// update if i-th element smaller
if A[m] > A[i]
m = i
return m
```
time : $O(n) \cdot O(n) = O(n^2)$
<span style="color:orange">without any preprocessing</span>
#### Heap Sort
1. Complete Tree : all nodes are packed to the left, and a level may only start after the level above is full
2. Max Heap : additionally, every parent's key is larger than its children's keys
```c++=
// HEAP-SORT(A)
for i = A.length downto 1
(key, data) = REMOVE-LARGEST(A, 1, A.length)
copy (key, data) to A[i] // original A[i] is already in A[1]
return A // which has been sorted in place
```
time : $O(n) \cdot O(log n) = O(n log n)$
<span style="color:orange">after building heap</span>
Heap Sort : a faster selection sort, thanks to the heap data structure; the same heap also serves as a Priority Queue. It relies on array index arithmetic, so it is implemented with contiguous storage only.
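The whole pipeline (build a max heap inside the array, then repeatedly move the largest into the vacated tail slot) can be sketched in runnable Python; this is a generic in-place heap sort, not the course's exact pseudocode:

```python
def heap_sort(a):
    """In-place heap sort: O(n log n) time, O(1) extra space."""
    n = len(a)

    def sift_down(i, size):
        # push a[i] down until both children are smaller
        while True:
            largest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < size and a[c] > a[largest]:
                    largest = c
            if largest == i:
                return
            a[i], a[largest] = a[largest], a[i]
            i = largest

    # build the max heap bottom-up
    for i in range(n // 2 - 1, -1, -1):
        sift_down(i, n)
    # repeatedly swap the largest (root) into the sorted tail
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(0, end)
    return a

print(heap_sort([6, 5, 4, 7, 8, 3, 1, 2]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```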
<hr>
### 10.2 Huffman Tree
<img src='https://www.researchgate.net/publication/323883375/figure/fig3/AS:606308860444673@1521566705849/Huffman-coding-the-final-tree-of-the-code.png'><br>
#### Rule
- each <span style="color:red">leaf node</span> stores a symbol (i.e. 'B', 'R')
- <span style="color:blue">path</span> to <span style="color:red">leaf node</span> $\equiv$ <span style="color:blue">code</span> of symbol
- <span style="color:blue">compress</span> symbol string under some assumptions
**Important behind lossless compression (zip, inside jpeg)**
<hr>
### 10.3 Huffman Tree Construction
- **compute frequency of each symbol**
record how often each symbol occurs (its count)
- **remove two smallest-frequency nodes; build sub-tree & insert**
- **repeat the step (till done)**<br>
<img src='https://www.researchgate.net/publication/3337841/figure/fig3/AS:394708393185337@1471117222874/Example-of-the-Huffman-tree-and-its-three-possible-encodings-a-Illustration-example.png'><br>
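The three steps map directly onto a min-heap of (frequency, tree) pairs; here is a runnable Python sketch using the standard `heapq` module (the frequency table is a made-up example, and only code lengths are returned; actual codes come from labeling edges 0/1):

```python
import heapq

def huffman_code_lengths(freq):
    """Build a Huffman tree; return {symbol: code length}."""
    # heap entries: (frequency, tiebreak, tree); tree = symbol or (left, right)
    heap = [(f, i, s) for i, (s, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # remove the two smallest-frequency nodes; build sub-tree & insert
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (t1, t2)))
        tiebreak += 1
    lengths = {}
    def walk(tree, depth):
        if isinstance(tree, tuple):      # internal node
            walk(tree[0], depth + 1)
            walk(tree[1], depth + 1)
        else:                            # leaf node stores a symbol
            lengths[tree] = depth
    walk(heap[0][2], 0)
    return lengths

lengths = huffman_code_lengths({'A': 5, 'B': 2, 'R': 1, 'C': 1})
print(sorted(lengths.items()))  # [('A', 1), ('B', 2), ('C', 3), ('R', 3)]
```

Note how the rarest symbols ('R', 'C') end up deepest, i.e. with the longest codes.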
<hr>
### 10.4 DSA Takeaway
- binary tree is useful $\to$ huffman tree
- heap is useful $\to$ 'remove two smallest-freq' when constructing huffman tree
<hr>
### 10.5 Summary of Heap
- motivation
<span style="color:orange">possibly efficient priority queue by max tree</span>
- max heap
<span style="color:orange">max + complete binary tree for $O(log n)$ removal / insertion</span>
- heap sort
<span style="color:orange">max heap in array to speed up selection sort</span>
- application : huffman tree building
<span style="color:orange">priority queue to build huffman tree for compression</span>
<br><hr>
## Part 11. Binary Search Tree
A tree is more flexible in memory than an array; this part compares the binary search tree with plain binary search.
#### Design wishes
- More flexible insertion than array : (binary) tree
- Similar algorithmic structure to BIN-SEARCH
#### Key idea
root $\sim$ middle
- Ordered array : [ left, middle ) < middle < (middle, right]
- <span style="color:orange">Binary search tree</span> : left sub-tree < root < right sub-tree <span style="color:orange">for every tree, sub-tree, sub-sub-tree, $\dots$</span> (at each node we only need to look left or right)
- When inserting, a node smaller than its parent goes to the left and a larger one goes to the right $\implies$ viewed over the whole tree, a node's side is not decided by comparing with the root alone
#### Requirements (of Binary Tree)
- The list must be <span style="color:orange">ordered</span>
- Rapid random access is required, so <span style="color:orange">cannot</span> use linked list
<span style="color:orange">binary search tree</span> : binary search with <span style="color:orange">tree</span> rather than ordered array
### 11.1 Binary Search Tree
Replace the middle of binary search with T.root (an approximate midpoint).
#### Binary Search
```c++=
// BIN-SEARCH(A,key)
left = 1, right = A.length
while left <= right
mid = floor((left+right)/2)
if A[mid] = key
return mid
else if A[mid] > key
right = mid - 1 // cut out end
else if A[mid] < key
left = mid + 1 // cut out begin
return FAIL
```
complexity : $O(log n)$
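BIN-SEARCH translates almost line for line into runnable Python (using 0-based indices instead of the pseudocode's 1-based ones):

```python
def bin_search(a, key):
    """Binary search on a sorted list; returns an index or None (FAIL)."""
    left, right = 0, len(a) - 1
    while left <= right:
        mid = (left + right) // 2
        if a[mid] == key:
            return mid
        elif a[mid] > key:
            right = mid - 1   # cut out end
        else:
            left = mid + 1    # cut out begin
    return None               # FAIL

a = [1, 3, 5, 7, 9, 11]
print(bin_search(a, 7))   # 3
print(bin_search(a, 4))   # None
```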
#### Binary Search Tree
```c++=
// BST-SEARCH(T, key)
while T != NIL
mid = T // root
if mid.key = key
return mid
else if mid.key < key
T = T.right
else if mid.key > key
T = T.left
return FAIL
```
complexity : $O(h)$
<hr>
### 11.2 In-Order Traversal on BST
```c++=
// INORDER-TRAVERSAL(T)
if T is NIL
return
else
INORDER-TRAVERSAL(T.left)
action with T.data
INORDER-TRAVERSAL(T.right)
```
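Putting 11.1 and 11.2 together: a runnable Python sketch with insertion, search, and in-order traversal (the `insert` helper is an assumption; the notes themselves only show search and traversal):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(t, key):
    # BST property: smaller keys go left, larger (or equal) go right
    if t is None:
        return Node(key)
    if key < t.key:
        t.left = insert(t.left, key)
    else:
        t.right = insert(t.right, key)
    return t

def search(t, key):
    while t is not None:      # walk down, as in BST-SEARCH
        if t.key == key:
            return t
        t = t.right if t.key < key else t.left
    return None               # FAIL

def inorder(t, out):
    if t is not None:
        inorder(t.left, out)
        out.append(t.key)     # in-order visit yields sorted keys
        inorder(t.right, out)

root = None
for k in [5, 2, 8, 1, 3]:
    root = insert(root, k)
out = []
inorder(root, out)
print(out)                          # [1, 2, 3, 5, 8]
print(search(root, 3) is not None)  # True
```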
<hr>
### 11.3 Linked List v.s. Contiguous Array implementation
#### Implementation with a Linked List
- easy to insert and delete
- easy to Merge Sort
- inefficient for binary-search-style access (e.g. Binary Search)
#### Implementation with a Contiguous Array
- hard to insert and delete
- efficient binary-search-style access (e.g. Quick Sort, Heap Sort)
<hr>
### 11.4 Summary of Binary Search Tree
- easy to insert and delete data
<span style="color:orange">just add tree nodes</span>
- easy to sort
<span style="color:orange">in-order traversal</span>
- easy to search
<span style="color:orange">search directly on the tree structure, $O(log n)$ when balanced</span>
<br><hr>
## Part 12. Sort
### 12.1 Selection Sort : Review and Refinements
#### Idea
Linearly select the minimum one from "unsorted" part; put the minimum one to the end of the "sorted" part
#### Implementations
- common implementation : swap minimum with a[i] for putting in i-th iteration
- rotate implementation : rotate minimum down to a[i] in i-th iteration
- linked-list implementation : insert the minimum at the i-th position
#### About
- space $O(1)$ <span style="color:red">in-place</span>
- time $O(n^2)$ and <span style="color:red">$\Theta(n^2)$</span>
- rotate/linked-list : <span style="color:red">stable</span> by selecting minimum with smallest index (stable means same-valued elements keep their index orders)
- common implementation : unstable
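The common swap implementation and the stable rotate implementation differ only in how the minimum is put in place; a runnable Python sketch of both (list-based, for illustration only):

```python
def selection_sort_swap(a):
    # common implementation: swap the minimum with a[i] -- not stable
    for i in range(len(a)):
        m = min(range(i, len(a)), key=lambda j: a[j])  # smallest index wins ties
        a[i], a[m] = a[m], a[i]
    return a

def selection_sort_rotate(a):
    # rotate implementation: slide the minimum down to position i,
    # shifting a[i:m] right by one -- equal keys keep their index order
    for i in range(len(a)):
        m = min(range(i, len(a)), key=lambda j: a[j])
        a.insert(i, a.pop(m))
    return a

print(selection_sort_rotate([3, 1, 2, 1]))  # [1, 1, 2, 3]
```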
<hr>
### 12.2 Heap Sort : Review and Refinements
#### Idea
Selection sort with a max-heap in original array rather than unordered pile
#### About
- space $O(1)$ <span style="color:red">in-place</span>
- time $O(n log n)$
- pick-the-largest-of-three (parent vs. its two children) : $2$ comparisons per level
- tree depth : $log_2 n$
- repeated $n$ times
- complexity = $2 * log_2 n * n \approx 2n log_2 n$
- <span style="color:red">not stable</span>
- usually preferred over selection (faster)
- smaller variance in running time
#### Sorting
<img src='https://d18l82el6cdm1i.cloudfront.net/uploads/uv9rgMfetq-heapsort-example.gif'>
<hr>
### 12.3 Bubble Sort : Review and Refinements
#### Idea
Swap disordered neighbors repeatedly
#### About
- space $O(1)$ <span style="color:red">in-place</span>
- time $O(n^2)$
- up to $(n-1)$ comparisons per pass
- at least 1 pass (in place); at most n passes $\implies$ average $(n+1)/2$
- complexity : $(n-1) * (n+1)/2 \approx O(n^2)$
- stable
- <span style="color:red">adaptive</span> : can early stop
- a deprecated choice except in very specific applications with a few disordered neighbors or if swapping neighbors is cheap (old tape days)
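The adaptive early stop is just a flag that notices a swap-free pass; a runnable Python sketch:

```python
def bubble_sort(a):
    """Bubble sort with an adaptive early stop."""
    n = len(a)
    for end in range(n - 1, 0, -1):
        swapped = False
        for i in range(end):
            if a[i] > a[i + 1]:           # disordered neighbors
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:                   # a full pass with no swap:
            break                         # the list is already sorted
    return a

print(bubble_sort([2, 1, 3, 4, 5]))  # [1, 2, 3, 4, 5] (stops after 2 passes)
```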
<hr>
### 12.4 Insertion Sort : Review and Refinements
#### Idea
Insert a card from the unsorted pile to its place in the sorted pile
#### Implementations
- naive implementation : sequential search sorted pile from the front <span style="color:red">$O(n)$ time per search, $O(n)$ per insert</span>
- backwise implementation : sequential search sorted pile from the back <span style="color:red">$O(n)$ time per search, $O(n)$ per insert</span>
- binary-search implementation : binary search the sorted pile <span style="color:red">$O(log n)$ time per search, $O(n)$ per insert</span>
- linked-list implementation : same as naive but on linked lists <span style="color:red">$O(n)$ time per search, $O(1)$ per insert</span>
#### About
- space $O(1)$
- time $O(n^2)$
- stable
- backwise implementation <span style="color:red">adaptive</span> (adaptive : does less work when the input is already nearly sorted)
- usually preferred over selection (adaptive)
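A runnable Python sketch of the backwise implementation; note how the inner loop exits immediately on already-sorted input, which is where the adaptivity comes from:

```python
def insertion_sort(a):
    """Backwise insertion sort: search the sorted pile from the back."""
    for i in range(1, len(a)):
        card = a[i]                    # next card from the unsorted pile
        j = i - 1
        while j >= 0 and a[j] > card:  # shift larger elements right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = card                # drop the card into its place
    return a

print(insertion_sort([6, 5, 4, 7, 8, 3, 1, 2]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```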
<hr>
### 12.5 Tree Sort : Review and Refinements
#### Idea
Replace heap with a BST; an in-order traversal outputs the sorted result
#### About
- space $O(n)$ for the tree nodes
- time $O(n \cdot h)$, worst $O(n^2)$ (unbalanced tree), average $O(n log n)$, $O(n log n)$ with a balanced BST
- unstable
- suitable for stream data and incremental sorting
<hr>
### 12.6 Quick Sort : Introduction
#### Idea
1. Pick a pivot element from the sequence
2. Swap the pivot with the first element of the sequence
3. Partition the remaining elements into "less than the pivot" and "greater than or equal to the pivot"
4. Move the pivot back to the end of the "less than" segment
5. Recurse on both segments, repeating steps 1~4, until the whole sequence is sorted
- #### Tree Sort Revisited
```c++=
// make a[0] the root of a BST
for i = 1, ..., n-1 do
if a[i] < a[0]
insert a[i] to the left-subtree of BST
else // a[i] >= a[0]
insert a[i] to the right-subtree of BST
end if
end for
// in-order traversal of left-subtree, then root, then right-subtree
```
- #### Quick Sort
```c++=
// name a[0] the pivot
for i = 1, ..., n-1 do
if a[i] < a[0]
put a[i] to the left pile of the pivot
else // a[i] >= a[0]
put a[i] to the right pile of the pivot
end if
end for
// output quick-sorted left; output a[0]; output quick-sorted right
```
#### Implementations
- naive implementation : pick first element in the pile as pivot
- random implementation : pick a random element in the pile as pivot
- median-of-3 implementation : pick median(front, middle, back) as pivot
Quick Sort can be implemented with either a contiguous array or a linked list.
<br>
<img src='https://wat-images.s3.ap-south-1.amazonaws.com/images/course/ci6ldqnqthum/Quick_Sort_0.png'><br>
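A runnable Python sketch of quick sort with the median-of-3 pivot choice (a plain, not-in-place version for clarity; the in-place partition described in the 5 steps above is a further refinement):

```python
def quick_sort(a):
    if len(a) <= 1:
        return a
    # median-of-3 implementation: pick median(front, middle, back) as pivot
    front, middle, back = a[0], a[len(a) // 2], a[-1]
    pivot = sorted([front, middle, back])[1]
    left  = [x for x in a if x < pivot]    # "less than pivot" pile
    equal = [x for x in a if x == pivot]
    right = [x for x in a if x > pivot]    # "greater than pivot" pile
    return quick_sort(left) + equal + quick_sort(right)

print(quick_sort([3, 6, 1, 8, 2, 9, 4]))  # [1, 2, 3, 4, 6, 8, 9]
```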
---
### 12.7 Quick Sort : Review and Refinement
#### Idea
Simulate tree sort without building the tree
#### About
- space : worst $O(n)$, average $O(log n)$ <span style="color:red">on stack calls</span>
- time : worst $O(n^2)$, average $O(n log n)$
- not stable (running time varies more than Merge Sort's)
- usually best choice for large data (if not requiring stability), can be mixed with other sorts for small data
<hr>
### 12.8 Merge Sort : Introduction
#### Idea
1. Split the sequence into two sub-sequences (equal length, or differing by one)
2. Sort the two sub-sequences separately (recursive calls)
3. Merge the two sorted sub-sequences
#### Implementations
- Bottom-up implementation:
| data | | | | | | | | sorted |
| ---- | --- | --- | --- | --- | --- | --- | --- | --------------- |
| 6 | 5 | 4 | 7 | 8 | 3 | 1 | 2 | (size-1 sorted) |
| 5 | 6 | 4 | 7 | 3 | 8 | 1 | 2 | (size-2 sorted) |
| 4 | 5 | 6 | 7 | 1 | 2 | 3 | 8 | (size-4 sorted) |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | (size-8 sorted) |
- $O(log n)$ loops, the i-th loop combines size-$2^i$ arrays $O(\frac{n}{2^i})$ times
- Merging two size-$l$ runs takes $O(l)$ time but needs <span style="color:red">$O(l)$ space!</span>
- Thus, bottom-up Merge Sort takes <span style="color:red">$O(n log n)$</span> time
- Top-down implementation:
MergeSort(arr, left, right) = combine(MergeSort(arr, left, mid), MergeSort(arr, mid+1, right));
- divide and conquer, $O(log n)$ level recursive calls
- Analysis Diagram
<img src='https://www.andrew.cmu.edu/course/15-121/lectures/Sorting%20Algorithms/pix/mergesort_complexity.bmp'><br>
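The top-down recursion can be sketched in runnable Python (a list-returning version rather than the index-based `MergeSort(arr, left, right)` signature, for brevity):

```python
def merge(left, right):
    """Merge two sorted lists in O(len(left) + len(right)) time."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:        # <= keeps the sort stable
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]  # append the leftover tail

def merge_sort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2                  # split into two halves
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))

print(merge_sort([6, 5, 4, 7, 8, 3, 1, 2]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Using `<=` in `merge` is what makes the sort stable: when keys tie, the element from the left half (the earlier one) is taken first.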
<hr>
### 12.9 Merge Sort : Review and Refinements
#### Idea
Combine sorted parts repeatedly to get everything sorted
#### About
- time $O(n log n)$ in both implementations
- usually stable (if carefully implemented), parallelizes well (the work can proceed in small independent piles)
- popular in <span style="color:red">external sort</span>
- needs extra space to hold the smaller sub-sequences while merging
---
### 12.10 Time Complexity and Space Complexity of Each Sort
<img src='https://edurev.gumlet.io/ApplicationImages/Temp/a5d8f710-434d-4b40-8833-75a5b661f2eb_lg.jpeg?w=576&dpr=1.0'>