---
title: CS2040C Notes
tags: CS2040C
breaks: false
---

# CS2040C Data Structures and Algorithms

Notes. Some algorithms are in pseudocode/Java, some are C++ code.

Useful Websites:
- visualgo.net
- open.kattis.com

## Introduction

:::spoiler Table of Contents (Link)
<!-- TOC start -->
- [Searching & Sorting](#Searching-amp-Sorting)
  * [Binary Search](#Binary-search)
  * [Merge Sort](#Merge-Sort)
  * [Quick Sort](#Quick-Sort)
- [Linked List](#Linked-List)
- [Binary Search Tree (BST)](#Binary-Search-Tree-BST)
- [Hash Table](#Hash-Table)
  * [Hash Collisions](#Hash-Collisions)
    + [Hashing with Chaining](#Hashing-With-Chaining)
    + [Open Addressing](#Open-Addressing)
  * [Hashing Techniques](#Hashing-Techniques)
    + [Division Method](#Division-Method)
    + [Multiplication Method](#Multiplication-Method)
  * [Hash Table Size](#Hash-Table-Size)
  * [Extra Questions](#Extra-Questions)
- [Priority Queue](#Priority-Queue)
  * [Priority Queue Operations](#Priority-Queue-Operations)
  * [Store Heap in Array](#Store-Heap-in-Array)
  * [Heap Sort](#Heap-Sort)
  * [Extra Questions](#Extra-Questions1)
- [Disjoint Set](#Disjoint-Set)
  * [Quick Find](#Quick-Find)
  * [Quick Union](#Quick-Union)
  * [Weighted Union](#Weighted-Union)
  * [Weighted Union with Path Compression](#Weighted-Union-with-Path-Compression)
  * [Union-Find Algorithm Summary](#Union-Find-Algorithm-Summary)
  * [Extra Questions](#Extra-Questions2)
- [Graph](#Graph)
  * [Graph Representations](#Graph-Representations)
    + [Adjacency List](#Adjacency-List)
    + [Adjacency Matrix](#Adjacency-Matrix)
    + [Comparison](#Comparison)
  * [Traversal Algorithms](#Traversal-Algorithms)
    + [Depth-First Search (DFS)](#Depth-First-Search-DFS)
    + [Breadth-First Search (BFS)](#Breadth-First-Search-BFS)
    + [Analysis](#Analysis)
  * [Shortest Path Algorithms](#Shortest-Path-Algorithms)
    + [Dijkstra's algorithm](#Dijkstras-algorithm)
    + [Bellman-Ford algorithm](#Bellman-Ford-algorithm)
    + [Analysis](#Analysis1)
  * [Minimum Spanning Tree Algorithms](#Minimum-Spanning-Tree-Algorithms)
    + [Prim's algorithm](#Prims-algorithm)
    + [Kruskal's algorithm](#Kruskals-algorithm)
    + [Boruvka's algorithm](#Boruvkas-algorithm)
    + [Analysis](#Analysis2)
  * [Topological Sort](#Topological-Sort)
- [Mix and Match](#Mix-and-Match)
  * [SkipList](#SkipList)
- [Binary Space Partitioning](#Binary-Space-Partitioning)
- [Convex Hull](#Convex-Hull)
  * [Convex Hull Algorithms](#Convex-Hull-Algorithms)
    + [Gift Wrapping algorithm](#Gift-Wrapping-algorithm)
    + [Graham's scan](#Grahams-scan)
    + [Divide and Conquer](#Divide-and-Conquer)
    + [Incremental method](#Incremental-method)
    + [Quickhull](#Quickhull)
    + [Analysis](#Analysis3)
<!-- TOC end -->
:::

:::spoiler Table of Contents (Tree)
<!-- https://tree.nathanfriend.io/ -->
```
CS2040C Data Structures and Algorithms
├── Searching/Sorting
│   └── Binary Search, Merge Sort, Quick Sort, Heap Sort*
├── Linked List
├── BST
├── Hash Table
│   ├── Hash Collision
│   │   ├── Chaining
│   │   └── Open Addressing
│   │       ├── Linear Probing
│   │       ├── Quadratic Probing
│   │       └── second hash function (cuckoo hashing)
│   ├── Hashing Techniques
│   │   ├── Division Method
│   │   └── Multiplication Method
│   └── Hash Table Size
├── Priority Queue
│   ├── Priority Queue Operations
│   ├── Store Heap in Array
│   └── Heap Sort
├── Disjoint Set
│   ├── Quick Find
│   ├── Quick Union
│   ├── Weighted Union
│   ├── Weighted Union with Path Compression
│   └── Union Find Algorithm Summary
├── Graph
│   ├── Graph Representations
│   ├── Traversal Algorithms
│   │   ├── DFS
│   │   ├── BFS
│   │   └── Analysis
│   ├── Shortest Path Algorithms
│   │   ├── Dijkstra's algorithm
│   │   ├── Bellman-Ford algorithm
│   │   └── Analysis
│   ├── Minimum Spanning Tree Algorithms
│   │   ├── Prim's algorithm
│   │   ├── Kruskal's algorithm
│   │   ├── Boruvka's algorithm
│   │   └── Analysis
│   └── Topological Sort
└── Binary Space Partitioning
```
:::

:::info
Data structures serve as the basis for abstract data types (ADT). The ADT lives at the logical level and the data structure at the implementation level.
|ADT|Data Structure|
|---|---|
|Stack, Queue|Linked List, Array|
|Tree|Linked List, Array (too slow for AVL)|
|Binary Search Tree, Balanced Tree|Tree|
|Priority Queue|Binary Heap (using array)|
|Disjoint Set||
|Graph|Graph*|
|-|Hash Table|

> Put simply, whether one thing is a data structure or not really depends on how we look at it. [stackoverflow](https://stackoverflow.com/questions/28605930/is-tree-a-data-structure-or-abstract-data-type)

* An ADT can be both an ADT and a data structure?
* TBH I am still not sure which one is which one
:::

## Searching & Sorting

#### Binary Search

```cpp
int arr[100] = {...};

int binarySearch(const int arr[], int size, int target) {
    int start = 0;
    int end = size; // exclusive: search range is [start, end)
    while (start < end) {
        int mid = start + (end - start) / 2; // prevent overflow
        if (arr[mid] == target) {
            return mid;
        }
        if (arr[mid] > target) {
            end = mid; // end is exclusive, so the new range [start, mid) excludes mid
        } else {
            start = mid + 1;
        }
    }
    return -1;
}
```

#### Merge Sort

```cpp
// mergeSort(arr, 0, 99) <- end index is inclusive here, NOT 100!
int arr[100] = {...};

void merge(int arr[], int start, int mid, int end) {
    int leftSize = mid - start + 1;
    int rightSize = end - mid;
    int currIndexLeft = 0;
    int currIndexRight = 0;
    int currIndexMerged = start;

    // temporary arrays
    int* leftArray = new int[leftSize];
    int* rightArray = new int[rightSize];
    for (int i = 0; i < leftSize; i += 1) {
        leftArray[i] = arr[start + i];
    }
    for (int i = 0; i < rightSize; i += 1) {
        rightArray[i] = arr[mid + 1 + i];
    }

    // merge
    while (currIndexLeft < leftSize && currIndexRight < rightSize) {
        if (leftArray[currIndexLeft] <= rightArray[currIndexRight]) {
            arr[currIndexMerged] = leftArray[currIndexLeft];
            currIndexLeft += 1;
        } else {
            arr[currIndexMerged] = rightArray[currIndexRight];
            currIndexRight += 1;
        }
        currIndexMerged += 1;
    }

    // remaining left
    while (currIndexLeft < leftSize) {
        arr[currIndexMerged] = leftArray[currIndexLeft];
        currIndexLeft += 1;
        currIndexMerged += 1;
    }

    // remaining right
    while (currIndexRight < rightSize) {
        arr[currIndexMerged] = rightArray[currIndexRight];
        currIndexRight += 1;
        currIndexMerged += 1;
    }

    delete[] leftArray;
    delete[] rightArray;
}

void mergeSort(int arr[], int start, int end) { // end is inclusive
    if (start < end) {
        int mid = start + (end - start) / 2; // prevent overflow for big numbers
        mergeSort(arr, start, mid);
        mergeSort(arr, mid + 1, end);
        merge(arr, start, mid, end);
    }
}
```

#### Quick Sort

```cpp
// quickSort(arr, 0, 100) <- not 99! end index is exclusive here
int arr[100] = {...};

int partition(int arr[], int start, int end) { // A = [1...n], A[end - 1] = n
    int pivotIndex = start;
    int pivot = arr[start];   // select first element as pivot
    //int pivot = arr[end-1]; // select last element as pivot
    start += 1; // since A[start] is the pivot
    while (start < end) {
        while ((start < end) && (arr[start] <= pivot)) { // check bounds first
            start++;
        }
        while ((start < end) && (arr[end - 1] > pivot)) {
            end--;
        }
        if (start < end) {
            // swap A[start], A[end-1];
            int temp = arr[start];
            arr[start] = arr[end - 1];
            arr[end - 1] = temp;
        }
    }
    // swap A[pivotIndex], A[start - 1];
    arr[pivotIndex] = arr[start - 1];
    arr[start - 1] = pivot;
    return (start - 1);
}

void quickSort(int arr[], int start, int end) { // A = [1...n], A[end - 1] = n
    if (start < end) {
        int p = partition(arr, start, end); // p is "sorted"
        quickSort(arr, start, p);
        quickSort(arr, p + 1, end);
    }
}
```

## Linked List

:::warning
TODO
:::

**Linked List Tricks**:

Middle of Linked List:

```java
public Node middleNode(Node head) {
    Node fast = head;
    Node slow = head;
    // Here is one tricky part, while giving the condition of the loop:
    // For a list with even length, the fast pointer is going to reach the end, i.e.
 null
    // For odd length, the fast pointer is going to be at the tail
    // so we have to terminate the loop
    // if fast becomes null or fast.next becomes null
    while (fast != null && fast.next != null) {
        fast = fast.next.next; // moving fast by 2 steps
        slow = slow.next;
    }
    return slow;
}
```

Nth node from the end of the Linked List:

```java
public Node getNodeFromEnd(Node head, int n) {
    Node fast = head;
    Node slow = head;
    // moving fast ahead by n, to make a gap
    for (int i = 0; i < n; i++) {
        fast = fast.next;
    }
    // when fast reaches the end, slow will be the nth node from the end
    while (fast.next != null) {
        fast = fast.next;
        slow = slow.next;
    }
    return slow;
}
// NOTE: in case n is more than the length of the list, handle that case
// according to the problem given
```

Detect Cycle

```cpp
// code from my leetcode solution
/**
 * Definition for singly-linked list.
 * struct ListNode {
 *     int val;
 *     ListNode *next;
 *     ListNode(int x) : val(x), next(NULL) {}
 * };
 */
class Solution {
public:
    bool hasCycle(ListNode *head) {
        // Not so good solution but worth trying: hash table / store nodes visited
        if (head) {
            ListNode* slow = head;
            ListNode* fast = head->next;
            while (fast) {
                if (fast == slow) {
                    return true;
                }
                fast = fast->next;
                if (fast) {
                    fast = fast->next;
                }
                slow = slow->next;
            }
        }
        return false;
    }
};
```

## Binary Search Tree (BST)

Tree: a [*graph*](## "a mathematical structure used to model pairwise relations between objects") that is [*circuit*](## "a circuit is a path that begins and ends at the same vertex") free and [*connected*](## "a graph is a connected graph iff for each pair of vertices, there exists a path which joins them")

Binary Tree: a [*rooted tree*](## "a tree in which there is one vertex that is distinguished from the others and is called the root") in which every parent has at most 2 children, referred to as the left child and the right child

Binary Search Tree (BST): a rooted binary tree with the key of each internal node being greater than all the keys in the respective node's left *subtree* and less
than the ones in its right *subtree*

Height-balanced BST: a BST in which the *height* of the left and the right *subtree* of any node differ by not more than 1*

- *different height-balanced BSTs have slightly different definitions
  - e.g. AVL tree and Red-black tree

:::warning
TODO
:::

## Hash Table

**Motivation:** When the ordering of the data is not important, a hash table is a better choice as it provides *constant time** search, insert and delete operations.

\* performance degrades to $O(n)$ as the load factor increases

### Hash Collisions

**Definition:** Two distinct keys $k_1$ and $k_2$ collide $\iff$ $h(k_1)=h(k_2)$

It is impossible to have a hash function with no collisions when there are more possible keys than buckets, by the *pigeonhole principle*.

Resolving collisions:
1. chaining (put both items in the same bucket)
2. open addressing (find another bucket for the new item)
   - linear probing
   - quadratic probing
   - second hash function (cuckoo hashing)

#### Hashing with Chaining

Idea: Each bucket contains a linked list of items

Space complexity: $O(m+n)$, where $m$ = table size and $n$ = number of items

Operations:
```
// Insertion
1. insert(key, value)
2. Calculate h(key)
3. Lookup h(key) and add (key,value) to the linked list
```
```
// Searching / Deletion
1. Calculate h(key)
2. Search for (key,value) in the linked list
```

Analysis:

Optimistic
- Consider the *Simple Uniform Hashing Assumption (SUHA)*, where every key is equally likely to map to every bucket. Then, as long as there are enough buckets, we won't get too many keys in any one bucket.
- *Simple Uniform Hashing Assumption*: Suppose $n$ items and $m$ buckets, then define $\text{load(hash table)}=\frac{n}{m}=\text{average # }\frac{items}{buckets}$
- Then by (SUHA), expected search time $= \underbrace{1}_{\text{hash function + array access}} + \overbrace{\frac{n}{m}}^{\text{linked list traversal}}$.
- If $m>n$, expected search time $= 1 + \frac{n}{m} = O(1)$
- Worst-case search time $=O(n)$
- Worst-case insertion time $= O(1)$

Reality
- BUT *Simple Uniform Hashing* doesn't exist. In reality, keys are not random and have lots of regularity, which can induce patterns in hash functions.

#### Open Addressing

Idea: When a collision happens, probe a sequence of buckets until you find an empty one.

Probing:
- Find the next position that's empty to insert a key.
- Originally try $h(key)$; if it collides, try $(h(key)+f(i))\text{ mod }m$ for $i$ collisions.
- $f(i)$ can be any function
  - Linear Probing: $f(i)=i$, so $h(key,i)=h(key)+i$
  - Quadratic Probing: $f(i)=i^2$, so $h(key,i)=h(key)+i^2$
  - Double Hashing: $h(key,i)=h(key)+i\times g(key)$
    - e.g.: Cuckoo Hashing
    - [Cuckoo Hashing for Undergraduates](https://www.itu.dk/people/pagh/papers/cuckoo-undergrad.pdf)
    - [Cuckoo Hashing, Theory and Practice](https://mybiasedcoin.blogspot.com/2007/06/cuckoo-hashing-theory-and-practice-part.html)
    - [Bounds for Basic Cuckoo Hashing](https://infoweekly.blogspot.com/2010/02/cuckoo-hashing.html)

Operation:
```
// searching
function search(key) {
    int i = 0;
    while (i < m) { // at most m probes: after m probes every bucket has been tried
        int bucket = h(key, i);
        if (T[bucket] == null)    // Empty bucket!
            return key-not-found;
        if (T[bucket].key == key) // Full bucket.
            return T[bucket].data;
        i++;
    }
    return key-not-found; // Exhausted entire table.
}
```
```
// deletion
1. search() to get the position
2. Set the slot to be "DELETED", DON'T set it to "NULL"
```
```
// insertion
1. algorithm similar to search()
2. a "DELETED" slot can be reused for the new item
```

Analysis: Performance
- When $m == n$, the table is full: we cannot insert any more items and cannot search efficiently (chaining can still search and insert efficiently).
- Define $load, \alpha = \frac{n}{m}$ and assume $\alpha < 1$; assuming *SUHA*, the expected cost of an operation is $E\le\frac{1}{1-\alpha}$
- A proof that $E\le\frac{1}{1-\alpha}$ is given below.
:::spoiler Proof of the expected cost of an operation, E
First probe: probability that the first bucket is full: $\frac{n}{m}$

Second probe: if the first bucket is full, then the probability that the second bucket is also full: $\frac{n-1}{m-1}$

Third probe: probability that it is full: $\frac{n-2}{m-2}$

Expected #probes
$$= \overbrace{1}^{\text{first probe}} + \underbrace{\frac{n}{m}}_{\text{probability of collision on first probe}}\left(\text{Expected cost of remaining probes} \right)$$
$$= 1+\frac{n}{m}\left(1+ \underbrace{\frac{n-1}{m-1}}_{\text{probability of collision on second probe}}\left(\text{Expected cost of remaining probes} \right)\right)$$
$$= 1+\frac{n}{m} \left(1+\frac{n-1}{m-1}\left(1+\frac{n-2}{m-2}\left(1+\frac{n-3}{m-3}\left(\cdots \right) \right) \right) \right) \tag{$*$}$$

Note that,
$$\frac{n-i}{m-i}\le\frac{n}{m}\le\alpha$$

So,
$$(*)\le 1+\alpha \left(1+\alpha\left(1+\alpha\left(1+\alpha \left(\cdots \right) \right) \right) \right)$$
$$= 1+\alpha+\alpha^2+\alpha^3+\alpha^4+\cdots$$
$$\le \frac{1}{1-\alpha}$$
:::

&nbsp;

Advantages
- Saves space
  - Rarely allocates memory, as there are no new list-node allocations
- Better cache performance
  - table is all in one place in memory
  - fewer accesses to bring the table into cache
  - linked lists can wander all over the memory

Disadvantages
- More sensitive to the choice of hash function
  - clustering is a common problem
- More sensitive to load
  - performance degrades badly as $\alpha \to 1$

### Hashing Techniques

Two common hashing techniques:
- Division Method
- Multiplication Method

#### Division Method

Hash Function: $h(k)=k\text{ mod }m$, where $m=\text{table size}$

Then, two keys $k_1$ and $k_2$ collide $\iff$ $k_1\text{ mod }m=k_2\text{ mod }m$

|Problem|Solutions|
|---|---|
|If $k$ and $m$ have a common divisor $d$, then since $k=(k\text{ mod }m)+i\cdot m$ for some integer $i$, $h(k)=k\text{ mod }m$ is also divisible by $d$. So for all $k$ that are divisible by $d$, only a $\frac{1}{d}$ fraction of the hash table will be utilized|Choose $m$ to be a prime number|
|Division is slow||

#### Multiplication Method

Hash Function: $h(k)=\left((Ak)\text{ mod }2^w\right) \gg (w-r)$, where

$m=2^r=\text{table size, for some constant }r$
$w = \text{size of a key in bits}$
$A = \text{(odd) constant}$

:::warning
TODO: need more diagrams to illustrate how it works
:::

Pros:
- The multiplication method is faster than the division method
- Works reasonably well when $A$ is an odd integer $>$ $2^{w-1}$
  - Odd: if it is even, then we lose at least one bit's worth
  - Big enough: use all the bits in $A$

### Hash Table Size

Let number of items $=n$, table size $=m$

If $m \lt 2n$, then too many collisions
If $m \gt 10n$, then too much wasted space
Also, we don't know $n$ in advance

Idea:
- Start with small (constant) table size
- Grow (and shrink) the table as necessary

Increasing table size:
- ~~Increment table size by 1~~
- Double the size of the table
  - Cannot wait until $n==m$
  - Resizing happened when $n$ was $\frac{n}{2},\frac{n}{4},\cdots$
  - Total time complexity = $O(1+\cdots+\frac{n}{8}+\frac{n}{4}+\frac{n}{2}+n)=O(n)$
  - On average, every addition of an item costs $O(1)$ (amortized)
- ~~Square the size of the table~~

Shrinking table:
- Similar to increasing the table size
- Need to allow some buffer before shrinking

Analysis: Cost
- Scanning old hash table: $O(\text{size of the old hash table}, m_1)$
- Creating new hash table: $O(\text{size of the new hash table}, m_2)$
- Inserting each element into the new hash table: $O(1)$
- Total cost: $O(m_1 + m_2 + n)$

### Extra Questions

:::spoiler Extra Questions
:::info
1. Let's say a company has about 600+ employees and we would like to store them in a hash table of size $m=4096$. To avoid confusion if there are two employees with the same name from different departments, we will hash the concatenation of an employee's name with his/her department.
We will then convert the string into "binary" by performing $\mod 2$ on the ASCII value of each character. For example,
$$\text{KIMACCOUNTING}\implies \text{75 73 77 65 67 67 79 85 78 84 73 78 71}\implies \text{1 1 1 1 1 1 1 1 0 0 1 0 1}.$$
Reading the bits as a binary number gives
$$1111111100101_2 = 8165$$
in decimal. And the final hash value is
$$h(8165)=8165\mod m=4069.$$
Is this hashing method good or bad? Give reasons to support your answer.
:::spoiler
- There will be a lot of collisions
- A lot of hash table entries are empty, because after $\mod 4096$ the hash basically just uses the 12 least significant bits, which are mostly coming from the department part of the string.
:::
:::

## Priority Queue

The aim is to develop an ADT that manages a collection of objects with associated priorities. This ADT requires several operations:

|Operations|Description|
|---|---|
|`insert(Key k, Priority p)`|insert k with priority p|
|`extractMin()` / `extractMax()`|remove the key with minimum / maximum priority|
|`decreaseKey(Key k, Priority p)` / `increaseKey(Key k, Priority p)`|reduce / increase the priority of key k to priority p|
|`contains(Key k)`|does the priority queue contain key k?|
|`isEmpty()`|is the priority queue empty?|

:::warning
Assume data items are unique.
:::

Why Binary Heap?
<table>
<tr>
<th>Data Structure</th>
<th colspan="2">Operations</th>
<th colspan="2">Priority Queue Sort</th>
</tr>
<tr>
<th></th>
<th>insert(x)</th>
<th>extractMax()</th>
<th>Time</th>
<th>In-place?</th>
</tr>
<tr>
<td>Sorted Array</td>
<td>O(n)</td>
<td>O(1)</td>
<td>O(n^2)</td>
<td>Yes (insertion sort)</td>
</tr>
<tr>
<td>Unsorted Array</td>
<td>O(1)</td>
<td>O(n)</td>
<td>O(n^2)</td>
<td>Yes (selection sort)</td>
</tr>
<tr>
<td>AVL Tree</td>
<td>O(log n)</td>
<td>O(log n)</td>
<td>O(n log n)</td>
<td>No (AVL sort)</td>
</tr>
<tr>
<th>Heap / Binary Heap</th>
<th>O(log n)</th>
<th>O(log n)</th>
<th>O(n log n)</th>
<th>Yes (Heap sort)</th>
</tr>
</table>

A binary heap can be visualized as a binary tree (NOT a BST), and it can be implemented using an array.

:::info
Q: Heap vs. AVL Tree
- same cost for operations
- slightly simpler
  - no rotations
- implemented using an array
- slightly better concurrency

Q: Can we implement an AVL tree using an array?
- too slow for AVL
  - how many operations/items are needed for a right/left rotation?
- an AVL tree is not a complete tree $\implies$ waste of memory if using an array
:::

:::success
Properties of a Heap:
1. Heap Ordering:
   - Max Heap: `priority[parent] >= priority[child]`
   - Min Heap: `priority[child] >= priority[parent]`
2. Complete binary tree:
   - every level is full, except possibly the last
   - all nodes are as far left as possible
3. Height: $O(\log n)$
:::

### Priority Queue Operations

:::warning
Refer to the slides or VisuAlgo for detailed explanations and motivations behind the operations.
:::

```cpp
// bubble up
function bubbleUp(Node v) {
    while (v != null) {
        if (priority(v) > priority(parent(v))) {
            swap(v, parent(v));
            v = parent(v);
        } else {
            return;
        }
    }
}
```

```cpp
// bubble down
function bubbleDown(Node v) {
    while (!leaf(v)) {
        leftP = priority(left(v));
        rightP = priority(right(v)); // if right(v) does not exist, treat rightP as -infinity
        biggerChild = leftP > rightP ?
left(v) : right(v); if (priority(biggerChild) > priority(v)) { swap(v, biggerChild); v = biggerChild; } else { return; } } } ``` ```cpp // increase key, O(log n) 1. update priority: increaseKey(5 -> 25) 2. bubble up ``` ```cpp // decrease key, O(log n) 1. update priority: decreaseKey(25 -> 5) 2. bubble down ``` ```cpp // insert, O(log n) 1. add a new leaf with priority p 2. bubble up function insert(Priority p, Key k) { Node v = m_completeTree.insert(p,k); bubbleUp(v); } ``` ```cpp // delete, O(log n) 1. swap node to be deleted with last node (right most leaf) 2. remove last node (the node to be deleted, since we swapped earlier) 3. bubble down IF last() > deleted ``` ```cpp // extract max priority, O(log n) 1. Node v = root; 2. delete(root); ``` ### Store Heap in Array - map each node in complete binary tree into a slot in an array - each level $i$ starts from the array index $2^{i}-1$, assuming root is level 0 from top - let $x$ be array index of node $v$ - $\text{index(left(v))} = 2x+1$ - $\text{index(right(v))} = 2x+2$ - $\text{index(parent(v))} = \lfloor\frac{x-1}{2}\rfloor$ Illustration: ```graphviz graph max_heap { rankdir=TD; // LR ordering=out; a[label=24] b[label=10] c[label=17] d[label=8] e[label=4] f[label=6] g[label=7] h[label=1] node[shape=plaintext] dummy[label="dummy", style=invis] {rank=same a} {rank=same b c} {rank=same d e f g} {rank=same h} a -- b; a -- c; b -- d; b -- e; c -- f; c -- g; d -- h; d -- dummy[style=invis]; } ``` ```graphviz digraph array { node [shape=plaintext]; "heapArray:" -> "indices:" [color=white]; node[shape=record, fontcolor=black, width=5, fixedsize=true]; indices[label="0 | 1 | 2 | 3 | 4 | 5 | 6 | 7", color=white]; node[shape=record, fontcolor=black, width=4.75, fixedsize=true]; values[label="24 | 10 | 17 | 8 | 4 | 6 | 7 | 1"]; nodesep = .0; { rank=same; "indices:"; indices } { rank=same; "heapArray:"; values } } ``` ### Heap Sort ``` // heap sort 1. 
convert unsorted array into heap array (heapify)
   1.1 idea: given two heaps and one new node, how to join all of them into a single heap?
   1.2 join them and bubble down the root
   1.3 base case: each leaf is a heap, so half of the tree needs no bubbling down
   1.4 recursion: left + right are heaps

   int[] A = array of unsorted integers
   for (int i = (n-1); i >= 0; i--) {
       bubbleDown(i, A); // O(height)
   }

   for (int i = 0; i < n; i++) {
       int value = A[i];
       A[i] = EMPTY;
       heapInsert(value, A, 0, i);
   }
   // NOTE: the insert-based version above is not correct as written and
   // still needs modification; use the bubbleDown version

2. convert heap array into sorted array

   int[] A = array stored as a heap
   for (int i = (n-1); i >= 0; i--) { // O(n log n)
       int value = extractMax(A); // O(log n)
       A[i] = value;
   }
```

Analysis:
1. unsorted array $\to$ heap array
   - cost of `bubbleDown()` depends on the height of the tree
   - observation: at least $\frac{n}{2}$ nodes are leaves
   - observation: most nodes have small height

   |Height|Number of Nodes|
   |---|---|
   |0|$\frac{n}{2}$|
   |1|$\frac{n}{4}$|
   |2|$\frac{n}{8}$|
   |3|$\frac{n}{16}$|
   |$\cdots$|$\cdots$|
   |$\log(n)$|1|

   - Cost of building a heap: $O(n)$

:::spoiler Cost of building a heap is O(n)
$$\boxed{ \sum_{h=0}^{\log n}{\overbrace{\frac{n}{2^{h}}}^{\text{ upper bound on number of nodes at height }h}\times\underbrace{O(h)}_{\text{ cost for bubbling down a node at height }h}} = cn\sum_{h=0}^{\log n}{\frac{h}{2^{h}}} \le cn \overbrace{\sum_{h=0}^{\infty}{\frac{h}{2^{h}}}}^{=2,\text{ can derive using geometric series}} = 2cn = O(n) }$$
:::

&nbsp;

2. heap array $\to$ sorted array
   - `extractMax()` is $O(\log n)$
   - so the loop takes $O(n\log n)$
3. Summary:
   - unsorted array $\to$ heap array: $O(n)$
   - heap array $\to$ sorted array: $O(n\log n)$
   - so overall is $O(n\log n)$
   - heap sort is in-place, but unstable
     - what if two different keys have the same priority?

:::info
Q: Why not just use merge sort or quick sort?
- heap sort is in-place
- in the scenario where **sorting the whole array is not necessary**, i.e. only the first $m$ elements need to be sorted, heap sort is efficient: heapify once, then extract $m$ times, for $O(n + m\log n)$ total
- a little slower than quick sort, but always completes in $O(n\log n)$

Q: Can we improve heap sort?
- ternary (3-way) heap sort is a little faster
- ternary heap sort is not difficult to implement, we just need to figure out the indices of the parent and children
- the asymptotic time complexity is still the same, we just improve from $\log_{2} \to \log_{3}$
:::

### Extra Questions

:::spoiler Extra Questions
:::info
1. Given an array containing a max heap, how do we convert it into a min heap?
:::spoiler Answer 1
- Approach 1: treat the "max heap" as an unsorted array and heapify (as a min heap), $O(n)$.
- Approach 2: If you are given a max heap implementation and its raw array, and the contents are numbers (integers or floating points), you can simply reinsert the negation of each number of that max heap. This essentially converts a max heap into a min heap (just ignore the presence of the -ve symbols, which are only used to reverse the meaning of min and max).
:::
2. Given $n$ items in $k$ sorted lists, so each sorted list has $\frac{n}{k}$ items, how fast can you merge them into one single sorted list? What if $k=O(n)$, e.g. $k=\frac{n}{10}$?
:::spoiler Answer 2
- Idea: min heap array + new sorted array; put the first element of each of the $k$ sorted lists into the heap array, extract the min from the heap array, then push the next element from that min's old sorted list into the heap array and bubble
- each of the $n$ extractions/insertions costs $O(\log k)$, so $O(n\log k)$ overall; for $k=O(n)$ this is $O(n\log n)$
:::
3. If a max heap is stored in an array, we can always convert the max heap to a min heap by reversing the array, and vice versa. For example, if an array A stores a max heap of [10,9,8,7], then reversing A as [7,8,9,10] will be a min heap.
:::spoiler Answer 3
False. Counterexample: A = [10,9,3,8,7,1].
:::
:::

## Disjoint Set

**Definition:** A disjoint-set ADT maintains a collection of disjoint (non-overlapping) sets. It supports two primary operations: find and union.

**Purpose:** Primarily used to track elements split into multiple sets and to easily merge these sets.

**Core Operations:**

|Operations|Description|
|---|---|
|`DisjointSet(int N)`|constructor: N objects|
|`boolean find(Key p, Key q)`|are `p` and `q` in the same set? (is there a path connecting the two objects?)|
|`void union(Key p, Key q)`|replace the sets containing `p` and `q` with their union|

### Quick Find

**Data Structure:**
- uses an array `componentID[]` of size `N`, where `N` is the number of elements
- initially, each element is in its own set, so `componentID[i] = i`, for $0\le i\le N-1$
- two objects are connected if they have the same component identifier

```graphviz
graph component {
rankdir=TD; ordering=out;
a[label=0] b[label=1] c[label=2] d[label=3] e[label=4] f[label=5] g[label=6] h[label=7]
{rank=same a b d} {rank=same e g h}
b -- e
b -- d
c -- g
}
```

```graphviz
digraph array {
node [shape=plaintext];
"componentID:" -> "object:" [color=white];
node[shape=record, fontcolor=black, width=5, fixedsize=true];
object[label="0 | 1 | 2 | 3 | 4 | 5 | 6 | 7", color=white];
node[shape=record, fontcolor=black, width=4.75, fixedsize=true];
values[label="0 | 1 | 2 | 1 | 1 | 5 | 2 | 7"];
nodesep = .0;
{ rank=same; "object:"; object }
{ rank=same; "componentID:"; values }
}
```

```cpp
// find
bool find(int p, int q) { // O(1)
    return componentID[p] == componentID[q];
}
```

```cpp
// union
void union(int p, int q) { // O(n)
    for (int i = 0; i < componentID.size(); i++) {
        if (componentID[i] == componentID[q]) {
            componentID[i] = componentID[p];
        }
    }
}
```

### Quick Union

**Data Structure:**
- uses an array `parent[]` of size `N`, where `N` is the number of elements
- initially, each element is in its own set, so `parent[i] = i`, for $0\le i\le N-1$
- two objects are connected if they are part of the same
tree

```graphviz
graph component {
rankdir=TD; ordering=out;
a[label=0] b[label=1] c[label=2] d[label=3] e[label=4] f[label=5] g[label=6] h[label=7] i[label=8]
node[shape=plaintext]
dummy[label="dummy", style=invis]
g -- a;g -- e;
h -- c;h -- i;
c -- b;c -- dummy[style=invis]
b -- f;b -- dummy[style=invis]
}
```

```graphviz
digraph array {
node [shape=plaintext];
"parent:" -> "object:" [color=white];
node[shape=record, fontcolor=black, width=5, fixedsize=true];
object[label="0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8", color=white];
node[shape=record, fontcolor=black, width=4.75, fixedsize=true];
values[label="6 | 2 | 7 | 3 | 6 | 1 | 6 | 7 | 7"];
nodesep = .0;
{ rank=same; "object:"; object }
{ rank=same; "parent:"; values }
}
```

```cpp
// find
bool find(int p, int q) { // O(n)
    while (parent[p] != p) { p = parent[p]; }
    while (parent[q] != q) { q = parent[q]; }
    return p == q;
}
```

```cpp
// union
void union(int p, int q) { // O(n)
    while (parent[p] != p) { p = parent[p]; }
    while (parent[q] != q) { q = parent[q]; }
    parent[p] = q;
}
```

### Weighted Union

**Data Structure:**
- enhanced version of quick union
- uses an array `parent[]` of size `N`, where `N` is the number of elements
- an additional `size[]` array that tracks the size of each tree
- when merging 2 trees, always attach the smaller tree to the root of the larger tree
- the maximum depth of the tree will be $O(\log n)$, *proof: refer to the slides*

```graphviz
graph component {
rankdir=TD; ordering=out;
a[label=0] b[label=1] c[label=2] d[label=3] e[label=4] f[label=5] g[label=6] h[label=7] i[label=8]
node[shape=plaintext]
dummy[label="dummy", style=invis]
g -- a;g -- e;
h -- c;h -- i;
c -- b;c -- dummy[style=invis]
b -- f;b -- dummy[style=invis]
}
```

```graphviz
digraph array {
node [shape=plaintext];
"parent:" -> "size:" -> "object:" [color=white];
node[shape=record, fontcolor=black, width=5, fixedsize=true];
object[label="0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8", color=white];
node[shape=record, fontcolor=black, width=4.75,
fixedsize=true];
size[label="1 | 2 | 3 | 1 | 1 | 1 | 3 | 5 | 1"];
node[shape=record, fontcolor=black, width=4.75, fixedsize=true];
values[label="6 | 2 | 7 | 3 | 6 | 1 | 6 | 7 | 7"];
nodesep = .0;
{ rank=same; "object:"; object }
{ rank=same; "size:"; size }
{ rank=same; "parent:"; values }
}
```

```cpp
// find (same as quick union)
bool find(int p, int q) { // O(log n)
    while (parent[p] != p) { p = parent[p]; }
    while (parent[q] != q) { q = parent[q]; }
    return p == q;
}
```

```cpp
// union
void union(int p, int q) { // O(log n)
    while (parent[p] != p) { p = parent[p]; }
    while (parent[q] != q) { q = parent[q]; }
    if (size[p] > size[q]) {
        parent[q] = p;
        size[p] += size[q];
    } else {
        parent[p] = q;
        size[q] += size[p];
    }
}
```

### Weighted Union with Path Compression

- extended from weighted union
- after finding the root, set the parent of each traversed node to the root

Example: `find(11, 32)`

```graphviz
graph component {
rankdir=TD; ordering=out;
label="Before Path Compression"
a[label=0]; b[label=1,color="red"]
c[label=2,color="red"]; d[label=3]
e[label=4]; f[label=5]
g[label=6,color="red"]; h[label=7,color="red"]
i[label=8]; j[label=10]
k[label=11,color="red"]; l[label=12]
m[label=13]
node[shape=plaintext]
dummy[label="", style=invis]
a -- j;a -- dummy[style=invis]
k -- l;k--m
g -- a;g -- k
b -- g;b -- e
c -- b;c -- d;c -- f
h -- c;h-- i
}
```

```graphviz
graph component {
rankdir=TD; ordering=out;
label="After Path Compression"
a[label=0]; b[label=1,color="red"]
c[label=2,color="red"]; d[label=3]
e[label=4]; f[label=5]
g[label=6,color="red"]; h[label=7,color="red"]
i[label=8]; j[label=10]
k[label=11,color="red"]; l[label=12]
m[label=13]
node[shape=plaintext]
dummy[label="", style=invis]
h -- g;h -- k;h -- b;h -- c;h -- i
g -- a
a -- j
k -- l;k -- m
b -- e
c -- d;c -- f
{rank=same; g k b c i}
}
```

```cpp
// findRoot v1
// also needs to update the size[] array, omitted here
int findRoot(int p) {
    int root = p;
    while (parent[root] != root) {
        root = parent[root];
    }
    while
 (parent[p] != p) {
        int temp = parent[p];
        parent[p] = root;
        p = temp;
    }
    return root;
}
```

```cpp
// findRoot v2
// also needs to update the size[] array, omitted here
int findRoot(int p) {
    int root = p;
    while (parent[root] != root) {
        parent[root] = parent[parent[root]]; // path halving: point to grandparent
        root = parent[root];
    }
    return root;
}
```

### Union-Find Algorithm Summary

|Algorithm|find|union|
|---|---|---|
|Quick Find|$O(1)$|$O(n)$|
|Quick Union|$O(n)$|$O(n)$|
|**Weighted Union**|$O(\log n)$|$O(\log n)$|
|**Path Compression**|$O(\log n)$|$O(\log n)$|
|**Weighted Union with Path Compression**|$\alpha(m, n)$|$\alpha(m, n)$|

* $\alpha$ is the *inverse Ackermann function*

### Extra Questions

:::spoiler Extra Questions
:::info
1. Given an array `equations` of strings that represent relationships between variables in one of two different forms: `"a==b"` or `"a!=b"`. Return true if and only if it is possible to assign integers to the variable names so as to satisfy all the given equations.
:::spoiler Answer 1
1. **Create a Union-Find Data Structure**:
   - For each variable in the equations, create a disjoint set. Initially, each variable is its own set.
2. **Process "==" Equations**:
   - Iterate through the equations and for each "a==b" equation, union the sets to which variables 'a' and 'b' belong.
3. **Check "!=" Equations**:
   - Iterate through the equations again and for each "a!=b" equation, check if 'a' and 'b' belong to the same set in the Union-Find data structure. If they do, the equations are contradictory, and you return False.
4. **If No Contradictions, Return True**:
   - If you have iterated through all the equations without finding any contradictions, return True, indicating that it's possible to assign integers to the variables to satisfy all the given equations.
:::
2. [Leetcode 200](https://leetcode.com/problems/number-of-islands/description/)
:::spoiler Answer 2
- Approach 1: DFS/BFS
- Approach 2: UF
  - The Union-Find (or Disjoint Set) data structure allows you to efficiently perform union and find operations.
The union operation merges two sets into one, and the find operation determines the set to which an element belongs.
    - For each '1' cell, check its adjacent cells.
    - If an adjacent cell is also '1', the two cells are part of the same island, so union them in the Union-Find data structure.
    - The number of disjoint sets remaining after processing all cells is the number of islands; each disjoint set represents a separate island.
:::
:::

## Graph

For more detailed graph basics, refer to the CS1231S notes.

**Notations:**

$G=(V,E)\text{, comprising}$
- a nonempty set of vertices/nodes, $V=\{v_1, v_2, \cdots, v_n\}$
- a set of edges, $E=\{e_1, e_2, \cdots, e_k\}$
- $e=\{v,w\}$ for an undirected edge $e$ incident on vertices $v$ and $w$
- $e=(v,w)$ for a directed edge $e$ from vertex $v$ to vertex $w$

**Definition:** (not in CS1231S)
- Diameter: the maximum distance between any two nodes, where distance is measured along the shortest path.

**Motivation:**
- Graphs can model a wide array of real-world problems, which we can then solve efficiently with graph algorithms.
### Graph Representations

#### Adjacency List
- an array of linked lists
- the index represents a node, and each element in that node's linked list represents a node connected to it
- the image below shows an undirected graph; the representation also works for a directed graph

|![image](https://hackmd.io/_uploads/Bk9yIzWxR.png)|![image](https://hackmd.io/_uploads/BkTWUzWlA.png)|
|---|---|

```cpp
// structure (one of many possible implementations;
// refer to assignment 5 and the lecture slides)
class Graph {
    std::vector<std::forward_list<int>> _adjList;  // _adjList[u] holds u's neighbours
};
```

#### Adjacency Matrix
- a 2D array, where `matrix[u][v]` indicates the presence/weight of an edge between `u` and `v`
- Neat Property: let $A$ be an adjacency matrix; then the number of walks of length $n$ from $v_i$ to $v_j$ is the $(i,j)$ entry of $A^n$
- the image below shows an undirected graph; the representation also works for a directed graph

|![image](https://hackmd.io/_uploads/Sk-K5HWlA.png)|![image](https://hackmd.io/_uploads/HyTK9BZxR.png)|
|---|---|

```cpp
// structure
class Graph {
    std::vector<std::vector<int>> _adjMatrix;
};
```

#### Comparison

||Adjacency List|Adjacency Matrix|
|---|---|---|
|Memory usage for a graph, $G=(V,E)$|$O(V + E)$|$O(V^2)$|
|Memory usage for a cycle graph|$O(V)$|$O(V^2)$|
|Memory usage for a clique|$O(V^2)$|$O(V^2)$|
|find any neighbour of `v`|fast|slow|
|are `v` and `w` neighbours?|slow|fast|
|enumerate all neighbours|fast|slow|

Rule of thumb: if the graph is dense, use an adjacency matrix; otherwise use an adjacency list.

### Travesal Algorithms

#### Depth-First Search (DFS)
- DFS uses a stack

Pseudocode:
```
1. Add start-node to stack.
2. Repeat until stack is empty:
   2.1 Pop node v from the top of the stack.
   2.2 Visit v.
   2.3 Explore all outgoing edges of v.
   2.4 Push all unvisited neighbors of v on the top of the stack.
```

C++ code:
```cpp
// code snippet from TIC2001 AY2020 S1 PE
// may not be 100% correct
// (renamed to DFS: it traverses with a stack, not a queue)
void Graph::DFS(int s, List<int>& output, bool resetVisited) {
    Stack<int> nodes;
    if (resetVisited) {
        _resetVisited();
    }
    nodes.push(s);
    _setVisited(s);
    while (!nodes.empty()) {
        int currNode = nodes.pop();  // already marked visited on push
        output.insertTail(currNode);
        _al[currNode].start();
        while (!_al[currNode].end()) {
            int n = _al[currNode].current();
            if (!_isVisited(n)) {
                nodes.push(n);
                _setVisited(n);
            }
            _al[currNode].next();
        }
    }
}
```

#### Breadth-First Search (BFS)
- BFS uses a queue

Pseudocode:
```
1. Add start-node to queue.
2. Repeat until queue is empty:
   2.1 Remove node v from the front of the queue.
   2.2 Visit v.
   2.3 Explore all outgoing edges of v.
   2.4 Add all unvisited neighbors of v to the queue.
```

C++ code:
```cpp
// code snippet from TIC2001 AY2020 S1 PE
// may not be 100% correct
void Graph::BFS(int s, List<int>& output, bool resetVisited) {
    Queue<int> nodes;
    if (resetVisited) {
        _resetVisited();
    }
    nodes.enqueue(s);
    _setVisited(s);
    while (!nodes.empty()) {
        int currNode = nodes.dequeue();  // already marked visited on enqueue
        output.insertTail(currNode);
        _al[currNode].start();
        while (!_al[currNode].end()) {
            int n = _al[currNode].current();
            if (!_isVisited(n)) {
                nodes.enqueue(n);
                _setVisited(n);
            }
            _al[currNode].next();
        }
    }
}
```

#### Analysis
- Both DFS and BFS run in $O(V+E)$ time with an adjacency list (and $O(V^2)$ with an adjacency matrix), since every vertex and edge is processed at most once.

:::warning
TODO AFTER PRACTICAL EXAM, FOR FINAL EXAM
:::

### Shortest Path Algorithms

#### Dijkstra's algorithm
```
For a weighted graph G with source s (all edge weights non-negative):
1. set dist[s] = 0 and dist[v] = infinity for every other vertex v
2. add all vertices to a priority queue keyed by dist
3. while the priority queue is not empty:
   3.1 extract the vertex u with the smallest dist
   3.2 for each neighbour v of u, relax the edge (u, v):
       if dist[u] + weight(u, v) < dist[v],
       set dist[v] = dist[u] + weight(u, v) and update v's key
```

#### Bellman-Ford algorithm
```
For a weighted graph G with n vertices and source s
(negative edge weights allowed):
1. set dist[s] = 0 and dist[v] = infinity for every other vertex v
2. repeat n - 1 times:
   2.1 for each edge (u, v) with weight w:
       if dist[u] + w < dist[v], set dist[v] = dist[u] + w
3. for each edge (u, v) with weight w:
   3.1 if dist[u] + w < dist[v], report a negative-weight cycle
```

#### Analysis
- Dijkstra's (with a binary heap): $O((V+E)\log V)$; requires non-negative edge weights
- Bellman-Ford: $O(VE)$; handles negative edge weights and detects negative-weight cycles

Question: why can't we use DFS or BFS to find the shortest path? (BFS finds shortest paths only when every edge has the same weight; DFS gives no shortest-path guarantee at all.)

### Minimum Spanning Tree Algorithms

#### Prim's algorithm
```
For a connected weighted graph G with n vertices:
1. pick any vertex v of G and let T be the graph with this vertex only
2. let V be the set of all vertices of G except v
3. for (i = 1 to n − 1)
   3.1 find the edge e in G with the least weight of all the edges connected to T.
   3.2 let w be the endpoint of e that is not yet in T
   3.3 add e and w to the edge and vertex sets of T
   3.4 delete w from V
```

#### Kruskal's algorithm
```
For a connected weighted graph G with n vertices:
1. initialise T to have all the vertices of G and no edges
2. let E be the set of all edges in G; let m = 0
3. while (m < n-1)
   3.1 find and remove the edge e in E of least weight
   3.2 if adding e to the edge set of T does not produce a circuit:
       i.  add e to the edge set of T
       ii. set m = m + 1
```

#### Boruvka’s algorithm
```
For a connected weighted graph G (assume all edge weights are distinct):
1. initialise T to have all the vertices of G and no edges
2. while T has more than one connected component:
   2.1 for each component C of T, find the minimum-weight edge
       with exactly one endpoint in C
   2.2 add all the edges found in 2.1 to T (skipping duplicates)
```

#### Analysis
- Prim's (with a binary heap): $O(E \log V)$
- Kruskal's (sorting + union-find): $O(E \log E) = O(E \log V)$
- Boruvka's: $O(E \log V)$

https://cs.stackexchange.com/questions/18797/minimum-spanning-tree-vs-shortest-path

### Topological Sort
- Post-order DFS

Pseudocode:
```
For a graph with n vertices and m directed edges:
1. Initialize a stack and a visited array of size n
2. For each unvisited vertex in the graph:
   2.1 Call the DFS function with the vertex as the parameter
   2.2 In the DFS function, mark the vertex as visited and recursively call the DFS function for all unvisited neighbors of the vertex
   2.3 Once all the neighbors have been visited, push the vertex onto the stack
3. After all vertices have been visited, pop elements from the stack and append them to the output list until the stack is empty
```

C++ code:
```cpp
#include <forward_list>
#include <iostream>
#include <stack>
#include <vector>

using std::forward_list;
using std::stack;
using std::vector;

class GraphEdge {
private:
    int _dest;
    int _weight;

public:
    GraphEdge() : _dest(-1), _weight(-1) {}  // for vector
    GraphEdge(int dest, int weight) : _dest(dest), _weight(weight) {}
    GraphEdge(const GraphEdge& e) : _dest(e._dest), _weight(e._weight) {}
    int dest() const { return _dest; }
    int weight() const { return _weight; }
};

class Graph {
private:
    vector<forward_list<GraphEdge>> _vertices;
    void _topologicalSortUtil(int, vector<bool>&, stack<int>&);

public:
    Graph(int n) : _vertices(n) {}
    void addEdge(int source, int dest, int weight);
    void topologicalSort();
};

void Graph::addEdge(int source, int dest, int weight) {
    // Assumes that the edge doesn't already exist!
    _vertices[source].emplace_front(dest, weight);
}

void Graph::_topologicalSortUtil(int v, vector<bool>& visited, stack<int>& Stack) {
    visited[v] = true;
    const forward_list<GraphEdge>& neighbours = _vertices[v];
    for (auto it = neighbours.begin(); it != neighbours.end(); ++it) {
        if (!visited[it->dest()]) {
            _topologicalSortUtil(it->dest(), visited, Stack);
        }
    }
    // post-order: push v only after all its descendants have been pushed
    Stack.push(v);
}

void Graph::topologicalSort() {
    stack<int> Stack;
    vector<bool> visited(_vertices.size(), false);
    for (size_t i = 0; i < _vertices.size(); i += 1) {
        if (!visited[i]) {
            _topologicalSortUtil(i, visited, Stack);
        }
    }
    // popping the stack yields one valid topological order
    while (!Stack.empty()) {
        std::cout << Stack.top() << " ";
        Stack.pop();
    }
    std::cout << std::endl;
}
```

## Mix and Match

**Motivation:** Combine multiple basic data structures to achieve better performance.

Example:
- adjacency list
    - an array of linked lists
- a linked list (e.g. a queue) can also be a BST, by augmenting the nodes with `*nextPtr`, `*leftPtr`, `*rightPtr`
    - `printInOrder()` will then be $O(n\log n)$
- TBC

**Conclusion:** Pointers are powerful and useful, but complicated to manage.

### SkipList

**Data Structure:**
- TODO

**Analysis:**
- TODO

## Binary Space Partitioning

**Motivation:**

## Convex Hull

**Definition:**

Convex: A point set $\huge P$ is **convex** if $\huge \forall x,y\in P, \overline{xy} \subseteq P$

Convex Combination: Let $S = \{ p_{i}\in \mathbb{R}^{d} \mid i = 1 \cdots n \}$. A **convex combination** of $S$ is $$\sum_{p_{i}\in S}\lambda_{i}p_{i}$$ such that $$\sum^{n}_{i=1} \lambda_{i}= 1, \quad \lambda_{i} \ge 0 \ \forall i$$

Convex Hull: The **convex hull** of $S$, $\text{conv}(S)$, is the collection of all convex combinations of $S$

:::warning
Font size of the math is increased because there is a rendering problem with `$\overline$`
:::

:::success
Alternative explanation:

Let $C \subseteq \mathbb{R}^{n}$.
$C$ is **convex** if, for all $x,y\in C$, the line segment connecting $x$ and $y$ is included in $C$.

A **convex combination** is a linear combination of points (which can be vectors, scalars, or more generally points in an affine space) where all coefficients are non-negative and sum to 1.

Source: Wikipedia
:::

**Motivation:**

Applications of the Convex Hull:
- In general, a “simpler” representation of a complicated object
- Graphics: Bezier curve drawing/intersection, ray tracing
- Data mining/computer vision: classification of data
- GIS, path finding

### Convex Hull Algorithms

#### Gift Wrapping algorithm
a.k.a *Jarvis march*
```
1. Take the leftmost vertex
2. Repeat:
   2.1 Search for the next vertex on the convex hull by choosing the one with the minimal turning angle
```

#### Graham's scan
```
1. Starting from the leftmost vertex, go around the polygon
2. If it is a convex corner, continue
3. If it is a concave vertex, “make it convex” by “filling” it
   - by connecting its two neighbors
```

#### Divide and Conquer
```
1. Sort vertices in a direction, e.g. the y direction
2. Repeat:
   2.1 Merge every pair of neighboring convex hulls
   2.2 For every merge, find the two “supporting lines”
       2.2.1 by taking a walk from an initial line (joining the two neighbors)
```

#### Incremental method
```
1. Sort the points in one direction, then add them one at a time,
   updating the hull after each insertion
```

#### Quickhull
```
- General idea: discard points that are not on the hull as quickly as possible
- First find the maximum and minimum in the x and y directions
1. Construct a quadrilateral from these four points
2. Discard any point inside it
3. Repeat:
   3.1 For each side, find the furthest point outside it
   3.2 Include it in the convex hull
   3.3 Eliminate any point within the new triangle
```

#### Analysis

|Algorithm|Expected Time|Worst Time|Moving up to 3D|
|---|---|---|---|
|Gift Wrapping|$O(nh)$|$O(n^2)$|$O(fn)$|
|Graham's scan|$O(n\log n)$|$O(n\log n)$|-|
|Divide and Conquer|$O(n\log n)$|$O(n\log n)$|$O(n\log n)$|
|Incremental method|$O(n \log n)$|$O(n \log n)$|$O(n \log n)$|
|Quickhull|$O(n\log n)$|$O(n^2)$|$O(n^2)$|

- $n$ is the number of points in the set
- $h$ is the number of points on the hull
- $f$ is the number of faces of the convex hull, which is $O(n)$

:::info
Can we do it in $O(n)$?
- No. Otherwise, we could solve the sorting problem in $O(n)$ time with comparisons alone, contradicting the $\Omega(n\log n)$ lower bound for comparison sorting.
- Idea: given $n$ numbers $x_i$, project them onto a parabola as the points $(x_i, x_i^2)$; every point lies on the convex hull, and walking along the hull yields the numbers in sorted order.
:::