CS2040C Data Structures and Algorithms

Notes.

Some algorithms are in psedocode/Java, some are C++ code.

Useful Website:

Introduction

Table of Content (Link)

Searching & Sorting
Linked List
Binary Search Tree (BST)
Hash Table
Priority Queue
Disjoint Set
Graph
Mix and Match
- SkipList
Binary Space Partitioning
Convex Hull
- Convex Hull Algorithms

Table of Content (Tree)

CS2040C Data Structures and Algorithms
├── Searching/Sorting
│   └── Binary Search, Merge Sort, Quick Sort, Heap Sort*
├── Linked List
├── BST
├── Hash Table
│   ├── Hash Collision
│   │   ├── Chaining
│   │   └── Open Addressing
│   │       ├── Linear Probing
│   │       ├── Quadratic Probing
│   │       └── second hash function (cuckoo hashing)
│   ├── Hashing Techniques
│   │   ├── Division Method
│   │   └── Multiplication Method
│   └── Hash Table Size
├── Priority Queue
│   ├── Priority Queue Operations
│   ├── Store Heap in Array
│   └── Heap Sort
├── Disjoint Set
│   ├── Quick Find
│   ├── Quick Union
│   ├── Weighted Union
│   ├── Weighted Union with Path Compression
│   └── Union Find Algorithm Summary
├── Graph
│   ├── Graph Representations
│   ├── Travesal Algorithms
│   │   ├── DFS
│   │   ├── BFS
│   │   └── Analysis
│   ├── Shortest Path Algorithms
│   │   ├── Dijkstra's algorithm
│   │   ├── Bellman-Ford algorithm
│   │   └── Analysis
│   ├── Minimum Spanning Tree Algorithms
│   │   ├── Prim's algorithm
│   │   ├── Kruskal's algorithm
│   │   ├── Boruvka’s algorithm
│   │   └── Analysis
│   └── Topological Sort
└── Binary Space Partitioning

Data structures serve as the basis for abstract data types (ADT). ADT is in the logical level and data structure is in the implementation level.

ADT	Data Structure
Stack, Queue	Linked List, Array
Tree	Linked List, Array (too slow for AVL)
Binary Search Tree, Balanced Tree	Tree
Priority Queue	Binary Heap (using array)
Disjoint Set
Graph	Graph*
-	Hash Table

Put simply, whether one thing is a data structure or not really depends on how we look at it. stackoverflow

An ADT can be both ADT and data structure?
TBH I am still not sure which one is which one

Searching & Sorting

Binary Search

int arr[100] = {...};

int binarySearch(const int arr[], int size, int target) {
    int start = 0; 
    int end = size;
    int mid = 0;
    while (start < end) {
        mid = (start + end) / 2;
        if (arr[mid] == target) {
            return mid;
        }
        if (arr[mid] > target) {
            end = mid - 1;
        } else {
            start = mid + 1;
        }
    }
    return -1;
}

Merge Sort

// mergeSort(arr, 0, 100) <- not 99 !
int arr[100] = {...};

void merge(int arr[], int start, int mid, int end) {
    int leftSize = mid - start + 1;
    int rightSize = end - mid;
    int currIndexLeft = 0;
    int currIndexRight = 0;
    int currIndexMerged = start;

    // temporary array
    int* leftArray = new int[leftSize];
    int* rightArray = new int[rightSize];

    for (int i = 0; i < leftSize; i += 1) {
        leftArray[i] = arr[start + i];
    }
    for (int i = 0; i < rightSize; i += 1) {
        rightArray[i] = arr[mid + 1 + i];
    }
    
    // merge
    while (currIndexLeft < leftSize && currIndexRight < rightSize) {
        if (leftArray[currIndexLeft] <= rightArray[currIndexRight]) {
            arr[currIndexMerged] = leftArray[currIndexLeft];
            currIndexLeft += 1;
        } else {
            arr[currIndexMerged] = rightArray[currIndexRight];
            currIndexRight += 1;
        }
        currIndexMerged += 1;
    }

    // remaining left
    while (currIndexLeft < leftSize) {
        arr[currIndexMerged] = leftArray[currIndexLeft];
        currIndexLeft += 1;
        currIndexMerged += 1;
    }

    // remaining right
    while (currIndexRight < rightSize) {
        arr[currIndexMerged] = rightArray[currIndexRight];
        currIndexRight += 1;
        currIndexMerged += 1;
    }

    delete[] leftArray;
    delete[] rightArray;
}

void mergeSort(int arr[], int start, int end) {
    if (start < end) {
        int mid = start + (end - start) / 2; // prevent overflow for big number
        mergeSort(arr, start, mid);
        mergeSort(arr, mid + 1, end);
        merge(arr, start, mid, end);
}

Quick Sort

// quickSort(arr, 0, 100) <- not 99 !
int arr[100] = {...};

int partition(int arr[], int start, int end) { // A = [1...n], A[end - 1] = n
    int pivotIndex = start;
    int pivot = arr[start]; // select first element as pivot
    //int pivot = arr[end-1]; // select last element as pivot
    
    start += 1; // since A[start] is pivot
    while (start < end) {
        while ((arr[start] <= pivot) && (start < end)) {
            start++;
        }
        while ((arr[end - 1] > pivot) && (start < end)) {
            end--;
        }
        if (start < end) {
            // swap A[start], A[end-1];
            int temp = arr[start];
            arr[start] = arr[end - 1];
            arr[end - 1] = temp;
        }
    }
    // swap A[pivotIndex], A[start - 1];
    arr[pivotIndex] = arr[start - 1];
    arr[start - 1] = pivot;
    return (start - 1);
}

void quickSort(int arr[], int start, int end) { // A = [1...n], A[end - 1] = n
    if (start < end) {
        int p = partition(arr, start, end); // p is "sorted"
        quickSort(arr, start, p);
        quickSort(arr, p + 1, end);
    }
}

Linked List

TODO

Linked List Tricks:

Middle of Linked List:

public Node middleNode(Node head) {
     Node fast = head;
     Node slow  = head;

// Here is one tricky part, while giving condition of loop
// For list with even length, fast pointer is going to reach end i.e. null
// For off length, fast pointer is going to be at the tail
// so we have to terminate the loop 
// if fast becomes null && fast.next becomes null

     while( fast!=null && fast.next!=null){
         fast = fast.next.next; // moving fast by 2 step
         slow = slow.next;
     }
     return slow;
}

Nth node from the end of the Linked List:

public getNodeFromEnd(Node head, int n){
    Node fast = head;
    Node slow = head;
    
//moving fast ahead by n, to make gap
    for(int i=0;i<n;i++){
         fast = fast.next; 
    }

// when the fast reaches end, slow will be the nth node from end
    while(fast.next!=null){
         fast = fast.next;
         slow = slow.next;
    }
    return slow; 
}
// NOTE: in case n is more than length of list, handle that case according to
// problem given

Detect Cycle

// code from my leetcode solution
/**
 * Definition for singly-linked list.
 * struct ListNode {
 *     int val;
 *     ListNode *next;
 *     ListNode(int x) : val(x), next(NULL) {}
 * };
 */
class Solution {
public:
    bool hasCycle(ListNode *head) {
        // Not so good solution but worth trying: hash table / store node visited
        if (head) {
            ListNode* slow = head;
            ListNode* fast = head->next;
            while (fast) {
                if (fast == slow) {
                    return true;
                }
                fast = fast->next;
                if (fast) {
                    fast = fast->next;
                }
                slow = slow->next;
            }
        }
        return false;
    }
};

Binary Search Tree (BST)

Tree: graph that is circuit free and connected

Binary Tree: a rooted tree in which every parent has at most 2 children, which is referred to as the left child and the right child

Binary Search Tree (BST): a rooted binary tree with the key of each internal node being greater than all the keys in the respective node's left subtree and less than the ones in its right subtree

Height-balanced BST: a BST in which the height of the left and the right subtree of any node differ by not more than 1*

*different height-balanced BST will have slightly different definition
e.g. AVL tree and Red-black tree

TODO

Hash Table

Motivation:

When the ordering of the data is not important, hash table is a better choice as it provides constant time* search, insert and delete operations.

* performance degrades to

O (n)

as load factor increase

Hash Collisions

Definition: Two distinct keys

k_{1}

and

k_{2}

collide

⟺

h (k_{1}) = h (k_{2})

Impossible to have a hash function with no collisions, by pigeonhole principle

Resolving collision:

chaining (put both items in the same bucket)
open addressing (find another bucket for the new item)
- linear probing
- quadratic probing
- second hash function (cuckoo hashing)

Hashing with Chaining

Idea: Each bucket contains a linked list of items

Space complexity:

O (m + n)

, where

m

= table size and

n

= linked list size

Operations:

// Insertion
1. insert(key, value)
2. Calculate h(key)
3. Lookup h(key) and add (key,value) to the linked list

// Searching / Deletion
1. Calculate h(key)
2. Search for (key,value) in the linked list

Analysis:

Optimistic

Consider the Simple Uniform Hashing Assumption (SUHA), where every key is equally likely to map to every bucket. Then, as long as there are enough buckets, we won’t get too many keys in any one bucket.
Simple Uniform Hashing Assumption: Suppose
$n$ items and
$m$ buckets, then define
$load(hash table) = \frac{n}{m} = average # \frac{i t e m s}{b u c k e t s}$
Then by (SUHA), expected search time
$= \underset{hash function + array access}{\underset{⏟}{1}} + \overset{linked list traversal}{\overset{⏞}{\frac{n}{m}}}$ .
If
$m > n$ , expected search time =
$O (1)$
Expected search time
$= 1 + \frac{n}{m} = O (1)$
Worst-case search time
$= O (n)$
Worst-case insertion time
$= O (1)$

Reality

BUT Simple Uniform Hashing doesn’t exist. In reality, keys are not random and have lots of regularity which can induce patterns in hash functions.

Open Addressing

Idea: When collision happened, probe a sequence of buckets until you find an empty one.

Probing:

Find the next position that’s empty to insert a key.
Originally try
$h (k e y)$ , if collides, try
$(h (k e y) + f (i)) mod m$ for
$i$ collisions.
$f (i)$ can be any function
Linear Probing:
$f (i) = i$ , so
$h (k e y, i) = h (k e y) + i$
Quadratic Probing:
$f (i) = i^{2}$ , so
$h (k e y, i) = h (k e y) + i^{2}$
Double Hashing:
$h (k e y, i) = h (k e y) + i \times g (k e y)$
- e.g.: Cuckoo Hashing

Operation:

//searching
function search(key) {
    int i = 0;
    while (i <= m) {
        int bucket = h(key, i);
        if (T[bucket] == null) // Empty bucket!
            return key-not-found;
        if (T[bucket].key == key) // Full bucket.
            return T[bucket].data;
        i++;
    }
    return key-not-found; // Exhausted entire table.
}

// deletion
1. search() to get the position
2. Set the slot to be “DELETED”, DON'T set to "NULL"

// insertion
1. algorithm similar to search()
2. replace an item on the slot of "DELETED"

Analysis:

Performance

When
$m == n$ , the table is full and we cannot insert any more items, cannot search efficiently (chaining can still search and insert efficiently).
Define
$l o a d, α = \frac{n}{m}$ and assume
$α < 1$ , assuming SUHA, the expected cost of operation,
$E \leq \frac{1}{1 - α}$
Prove for
$E \leq \frac{1}{1 - α}$ is as below.

Prove for cost of operation, E

First probe: probability that first bucket is full is:

\frac{n}{m}

Second probe: if first bucket is full, then the probability that the second bucket is also full:

\frac{n - 1}{m - 1}

Third probe: probability is full:

\frac{n - 2}{m - 2}

Expected #probes

= \overset{first probe}{\overset{⏞}{1}} + \underset{probability of collision on first probe}{\underset{⏟}{\frac{n}{m}}} (Expected cost of remaining probes)

= 1 + \frac{n}{m} (\underset{probability of collision on second probe}{\underset{⏟}{1 + \frac{n - 1}{m - 1}}} + Expected cost of remaining probes)

\begin{matrix} (*) & = 1 + \frac{n}{m} (1 + \frac{n - 1}{m - 1} (1 + \frac{n - 2}{m - 2} (1 + \frac{n - 3}{m - 3}) (\dots))) \end{matrix}

Note that,

\frac{n - i}{m - i} \leq \frac{n}{m} \leq α

So,

(*) \leq 1 + α (1 + α (1 + α (1 + α) (\dots)))

\leq 1 + α^{2} + α^{3} + α^{4} + \dots

\leq \frac{1}{1 - α}

Advantages

Save space
Rarely allocate memory, as no new list-node allocations
Better cache performance
- table all in one place in memory
- fewer accesses to bring table into cache.
- linked lists can wander all over the memory

Disadvantages

More sensitive to choice of hash functions
- clustering is a common problem
More sensitive to load
- performance degrades badly as
  $α \to 1$

Hashing Techniques

Two common hashing techniques:

Division Method
Multiplication Method

Division Method

Hash Function:

h (k) = k mod m

, where

m = table size

Then, two keys

k_{1}

and

k_{2}

collide

⟺

k_{1} mod m = k_{2} mod m

Problem	Solutions
If $k$ and $m$ has a common divisor $d$ , then $k = k mod m + i * m$ , so $h (k) = k mod m$ is divisible by $d$ . For all $k$ that are divisible by $d$ , only $\frac{1}{d}$ fraction of the hash table will be utilized	Choose $m$ to be a prime number
Division is slow

Multiplication Method

Hash Function:

h (k) = (A k) mod 2^{w} ≫ (w - r)

, where

m = 2^{r} = table size, for some constant r

w = size of a key in bits

A = (odd) constant

TODO: need more diagram to illustrate how it work

Pros:

Multiplication method is faster than Division Method
Works reasonably well when
$A$ is an odd integer
$>$
$2^{w - 1}$
- Odd: if it is even, then lose at least one bit’s worth
- Big enough: use all the bits in
  $A$

Hash Table Size

Let number of items

= n

, table size

= m

m < 2 n

, then too many collisions

m > 10 n

, then too much wasted space

Also, we don’t know

n

in advance

Idea:

Start with small (constant) table size
Grow (and shrink) table as necessary

Increasing table size:

~~Increment table size by 1~~
Double the size of the table
- Cannot wait until
  $n == m$
- Resizing when
  $n$ was
  $\frac{n}{2}, \frac{n}{4}, \dots$
- Total time complexity =
  $O (1 + \dots + \frac{n}{8} + \frac{n}{4} + \frac{n}{2} + n) = O (n)$
- In average, every addition of an item cost
  $O (1)$
~~Square the size of the table~~

Shrinking table:

Similar to increase table size
Need to allow some buffer before shrinking

Analysis:

Cost

Scanning old hash table:
$O (size of the old hash table, m_{1})$
Creating new hash table:
$O (size of the new hash table, m_{2})$
Inserting each element in new hash table:
$O (1)$
Total cost:
$O (m_{1} + m_{2} + n)$

Extra Questions

Let's say a company has about 600+ employee and we would like to store them into a hash table with size
$m = 4096$ . In order to avoid confusion if there are two employees with the same name from different department, we will hash the concatenation of an employee name with his/her department. We will then convert the string into "binary" performing
$\mod 2$ on the ASCII value of each character. For example,
$KIMACCOUNTING ⟹ 75 73 77 65 67 67 79 85 78 84 73 78 71 ⟹ 1 1 1 1 1 1 1 1 0 0 1 0 1 .$ After
$\mod 2$ with each number, the binary number will be
$1111111100101 = 8165$ in decimal. And the final hash value is
$h (8165) = 8165 \mod m = 4096.$ Is this hashing method good or bad? Give reasons to support your answer.
- There will be a lot of collision
- A lot of hash table entries are empty. Because after
  $m o d 4096$ , it basically just use the least 12 significant digits that are all coming from the department.

Priority Queue

The aim is to develop an ADT that manages a collection of objects with associated priorities. This ADT requires several operations:

Operations	Description
`insert(Key k, Priority p)`	insert k with priority p
`extractMin()` / `extractMax()`	remove key with minimum / maximum priority
`decreaseKey(Key k, Priority p)` / `increaseKey(Key k, Priority p)`	reduce / increase the priority of key k to priority p
`contains(Key k)`	does the priority queue contain key k?
`isEmpty()`	is the priority queue empty?

Assume data items are unique.

Why Binary Heap?

Data Structure	Operations		Priority Queue Sort
	insert(x)	extractMax()	Time	In-place?
Sorted Array	O(n)	O(1)	O(n^2)	Yes (insertion sort)
Unsorted Array	O(1)	O(n)	O(n^2)	Yes (selection sort)
AVL Tree	O(logn)	O(logn)	O(nlogn)	No (AVL sort)
Heap / Binary Heap	O(logn)	O(logn)	O(nlogn)	Yes (Heap sort)

A binary heap can be visualized as a binary tree (NOT BST), and it can be implemented using an array.

Q: Heap vs. AVL Tree

same cost for operations
slightly simpler
- no rotations
- implement using array
slightly better concurrency

Q: Can we implement AVL tree using array?

too slow for AVL
how many operations/items needed for right/left rotate?
AVL tree is not a complete tree
$⟹$ waste of memory if use array

Properties of Heap:

Heap Ordering:
- Max Heap: priority[parent] >= priority[child]
- Min Heap: priority[child] >= priority[parent]
Complete binary tree:
- every level is full, except possible the last
- all nodes are as far left as possible
Height:
$O (\log n)$

Priority Queue Operations

Refer to the slides or VisualAlgo for detailed explanations and motivations behind the operations.

// bubble up
function bubbleUp(Node v) {
    while (v != null) {
        if (priority(v) > priority(parent(v))) {
            swap(v, parent(v));
            v = parent(v);
        } else {
            return;
        }
    }
}

// bubble down
function bubbleDown(Node v) {
    while (!leaf(v)) {
        leftP = priority(left(v));
        rightP = priority(right(v));
        biggerChild = leftP > rightP ? left(v) : right(v);
        if (priority(biggerChild) > priority(v)) {
            swap(v, biggerChild);
            v = biggerChild;
        } else {
            return;
        }
    }
}

// increase key, O(log n)
1. update priority: increaseKey(5 -> 25)
2. bubble up

// decrease key, O(log n)
1. update priority: decreaseKey(25 -> 5)
2. bubble down

// insert, O(log n)
1. add a new leaf with priority p
2. bubble up
function insert(Priority p, Key k) {
    Node v = m_completeTree.insert(p,k);
    bubbleUp(v);
}

// delete, O(log n)
1. swap node to be deleted with last node (right most leaf)
2. remove last node (the node to be deleted, since we swapped earlier)
3. bubble down IF last() > deleted

// extract max priority, O(log n)
1. Node v = root;
2. delete(root);

Store Heap in Array

map each node in complete binary tree into a slot in an array
each level
$i$ starts from the array index
$2^{i} - 1$ , assuming root is level 0 from top
let
$x$ be array index of node
$v$
- $index(left(v)) = 2 x + 1$
- $index(right(v)) = 2 x + 2$
- $index(parent(v)) = ⌊ \frac{x - 1}{2} ⌋$

Illustration:

Heap Sort

// heap sort
1. convert unsorted array into heap array (heapify)
    1.1 idea: given two heaps and one new node, how to join all of them into a single heap?
    1.2 join them and bubble down the root
    1.3 base case: each leaf is a heap, so half of the tree no need to bubble down
    1.4 recursion: left + right are heaps

int[] A = array of unsorted integers
for (int i = (n-1); i >= 0; i--) {
    bubbleDown(i, A); // O(height)
}
for (int i = 0; i < n; i++) {
    int value = A[i];
    A[i] = EMPTY;
    heapInsert(value, A, 0, i);
}
// NOTE: the above algorithm is not correct, still need modification

2. convert heap array into sorted array

int[] A = array stored as a heap
for (int i = (n-1); i >= 0; i--) { // O(n log n)
    int value = extractMax(A); // O(log n)
    A[i] = value;
}

Analysis:

unsorted array
$\to$ heap array
- cost of bubbleDown() is dependent on height of tree
- obversation: at least
  $\frac{n}{2}$ nodes are
- observation: most nodes have small height
Height Number of Nodes

0
$\frac{n}{2}$

1
$\frac{n}{4}$

2
$\frac{n}{8}$

3
$\frac{n}{16}$

$\dots$
$\dots$

$\log (n)$ 1
- Cost of building a heap:
  $O (n)$

Height	Number of Nodes
0	$\frac{n}{2}$
1	$\frac{n}{4}$
2	$\frac{n}{8}$
3	$\frac{n}{16}$
$\dots$	$\dots$
$\log (n)$	1

Cost of building a heap is O(n)

\sum_{h = 0}^{\log n} \overset{upper bound on number of nodes at level h}{\overset{⏞}{\frac{n}{2^{h}}}} \times \underset{cost for bubbling a node at level h}{\underset{⏟}{O (h)}} = c n \sum_{h = 0}^{\log n} \frac{1}{2^{h}} O (h) = c n \cdot O (\overset{\leq 2, can derive using geometric series}{\overset{⏞}{\sum_{h = 0}^{\log n} \frac{h}{2^{h}}}}) = \dots = 2 O (n) = O (n)

heap array
$\to$ sorted array
- extractMax() is
  $O (\log n)$
- so the loop takes
  $O (n \log n)$
Summary:
- unsorted array
  $\to$ heap array:
  $O (n)$
- heap array
  $\to$ sorted array:
  $O (n \log n)$
- so overall is
  $O (n \log n)$
- heap sort is in-place, but unstable
  - what if two different keys have same priority?

Q: Why not just use merge sort or quick sort?

heap sort is in place
in the scenario where sorting the whole array is not necessary, only the first
$m$ element need to be sorted, heap sort is efficient, the time complexity for this will be
$O (n + number of items + \log n)$
a little slower than quick sort, but always completes in
$O (n \log n)$

Q: Can we improve heap sort?

ternary (3-way) heap sort is a little faster
ternary heap sort is not difficult to implement, we just need to figure out the index of parent and children
the time complexity is still the same, we just improve from
$\log_{2} \to \log_{3}$

Extra Questions

Given an array of max heap, how to convert it into a min heap.
Answer 1
- Approach 1: treat the “max heap” as an unsorted array and heapify (on min heap)
  $O (n)$ .
- Aprroach 2: If you are given max heap implementation and its raw array and the contents are numbers (integers or floating points), you can simply just reinsert the negations of each numbers of that max heap. This essentially converts a max heap into a min heap (just ignore the presence of -ve symbols that is only used to reverse the meaning of min and max).
Given
$n$ items, they are in
$k$ sorted list, so each of the sorted list has
$\frac{n}{k}$ items. How fast can you merge them into one single sorted list? What if
$k = O (n)$ , e.g.
$k = \frac{n}{10}$ ?
Answer 2
- Idea: min heap array + new sorted array, put every first element of the
  $k$ sorted lists into heap array, extract min from heap array, choose another element from old sorted list into heap array and then bubble
- $O (k)$
If a maxheap is stored in an array, we can always convert a max heap to a min heap by reversing the array, and vice versa. For example, if an array A store a max heap of [10,9,8,7] then reversing A as [7,8,9,10] will be a min heap.

Answer 3

False. Counterexample: A = [10,9,3,8,7,1].

Disjoint Set

Definition: A disjoint-set ADT maintains a collection of disjoint (non-overlapping) sets. It supports two primary operations: find and union.

Purpose: Primarily used to track elements split into multiple sets and to easily merge these sets.

Core Operations:

Operations	Description
`DisjointSet(int N)`	constructor: N objects
`boolean find(Key p, Key q)`	are `p` and `q` in the same set? (is there a path connecting the two objects?)
`void union(Key p, Key q)`	replace sets containing `p` and `q` with their union

Quick Find

Data Structure:

uses an array componentID[] of size N, where N is the number of elements
initialy, each element is in its own set, so componentID[i] = i, for
$0 \leq i \leq N - 1$
two objects are connected if they have the same component identifier

// find
bool find(int p, int q) { // O(1)
    return componentID[p] == componentID[q];
}

// union
void union(int p, int q) { // O(n)
    for (int i = 0; i < componentID.size(); i++) {
        if (componentID[i] == componentID[q]) {
            componentID[i] = componentID[p];
        }
    }
}

Quick Union

Data Structure:

uses an array parent[] of size N, where N is the number of elements
initialy, each element is in its own set, so parent[i] = i, for
$0 \leq i \leq N - 1$
two objects are connected if they are part of the same tree

// find
bool find(int p, int q) { // O(n)
    while (parent[p] != p) {
        p = parent[p];
    }
    while (parent[q] != q) {
        q = parent[q];
    }
    return p == q;
}

// union
void union(int p, int q) { // O(n)
    while (parent[p] != p) {
        p = parent[p];
    }
    while (parent[q] != q) {
        q = parent[q];
    }
    parent[p] = q;
}

Weighted Union

Data Structure:

enhanced version of quick union
uses an array parent[] of size N, where N is the number of elements
an additional size[] array that tracks the size of each tree
when merging 2 trees, always attach the smaller tree to the root of the larger tree
the maximum depth of tree will be
$O (\log n)$ , prove refers to slides

// find (same as quick union)
bool find(int p, int q) { // O(log n)
    while (parent[p] != p) {
        p = parent[p];
    }
    while (parent[q] != q) {
        q = parent[q];
    }
    return p == q;
}

// union
void union(int p, int q) { // O(log n)
    while (parent[p] != p) {
        p = parent[p];
    }
    while (parent[q] != q) {
        q = parent[q];
    }
    if (size[p] > size[q]) {
        parent[q] = p;
        size[p] += size[q];
    } else {
        parent[p] = q;
        size[q] += size[p];
    }
}

Weighted Union with Path Compression

extended from weighted union
after finding the root, set the parent of each traversed node to the root

Example: find(11, 32)

// findRoot v1
// also need to update size[] array, ommited here
int findRoot(int p) {
    int root = p;
    while (parent[root] != root) {
        root = parent[root;
    }
    while (parent[p] != p) {
        int temp = parent[p];
        parent[p] = root;
        p = temp;
    }
    return root;
}

// findRoot v2
// also need to update size[] array, ommited here
int findRoot(int p) {
    int root = p;
    while (parent[root] != root) {
        parent[root] = parent[parent[root]];
        root = parent[root];
    }
    return root;
}

Union-Find Algorithm Summary

Algorithm	find	union
Quick Find	$O (1)$	$O (n)$
Quick Union	$O (n)$	$O (n)$
Weighted Union	$O (\log n)$	$O (\log n)$
Path Compression	$O (\log n)$	$O (\log n)$
Weighted Union with Path Compression	$α (m, n)$	$α (m, n)$

$α$ is Inverse Ackermann function

Extra Questions

Extra Questiosn

Given an array equations of strings that represent relationships between variables in one of two different forms: "a==b" or "a!=b". Return true if and only if it is possible to assign integeres to variable names so as to satisfy all the given equations.
Answer 1
1. Create a Union-Find Data Structure:
  - For each variable in the equations, create a disjoint set. Initially, each variable is its own set.
2. Process "==" Equations:
  - Iterate through the equations and for each "a==b" equation, union the sets to which variables 'a' and 'b' belong.
3. Check "!=" Equations:
  - Iterate through the equations again and for each "a!=b" equation, check if 'a' and 'b' belong to the same set in the Union-Find data structure. If they do, the equations are contradictory, and you return False.
4. If No Contradictions, Return True:
  - If you have iterated through all the equations without finding any contradictions, return True, indicating that it's possible to assign integers to variables to satisfy all the given equations.
Leetcode 100
Answer 2
- Approach 1: DFS/BFS
- Approach 2: UF
  - Union-Find (or Disjoint Set) data structure allows you to efficiently perform union and find operations. Union operation merges two sets into one, and find operation determines the set to which an element belongs.
  - For each '1' cell, check its adjacent cells.
  - If adjacent cells are also '1', it means they are part of the same island. Therefore, union these cells in the Union-Find data structure.
  - The number of disjoint sets remaining in the Union-Find data structure after processing all cells represents the number of islands. Each disjoint set represents a separate island.

Graph

For more detailed graph basics, refer CS1231S notes.

Notations:

G = (V, E), comprising

nonempty set of vertices/nodes,
$V = {v_{1}, v_{2}, \dots, v_{n}}$
set of edges,
$E = {e_{1}, e_{2}, \dots, e_{k}}$
$e = {v, w}$ for undirected edge
$E$ incident of vertices
$v$ and
$w$
$e = (v, w)$ for directed edge
$E$ from vertex
$v$ to vertex
$w$

Definition: (not in CS1231S)

Diameter: Maximum distance between two nodes, following the shortest path.

Motivation:

Graphs can model a wide array of real-world problems, which we can then apply graph-based algorithms to solve these problems efficiently.

Graph Representations

Adjacency List

an array of linked list
the index represents a node, and each element in the linked list represents the nodes that are connected to it
the image below shows an undirected graph, can also be a directed graph

// structure
// there are many different implementation
// refer assignment 5 and lecture slides

Adjacency Matrix

a 2D array, where matrix[u][v] indicates the presence/weight of an edge between u and v
Neat Property: Let
$A$ be a adjacency matrix, then number of walks of length
$n$ from
$v_{i}$ to
$v_{j} = i, j$ entry of
$A^{n}$
the image below shows an undirected graph, can also be a directed graph

// structure
class Graph {
    std::vector<std::vector<int>> _adjMatrix;
}

Comparison

	Adjacency List	Adjacency Matrix
Memory Usage for graph, $G = (V, E)$	$O (V + E)$	$O (V^{2})$
Memory Usage for graph with cycle	$O (V)$	$O (V^{2})$
Memory Usage for graph with clique	$O (V^{2})$	$O (V^{2})$
find any neighbour of `v`	fast	slow
are `v` and `w` neighbours?	slow	fast
enumerate all neighbours	fast	slow

Base rule: if graph is dense then use an adjacency matrix; else use an adjacency list.

Travesal Algorithms

Depth-First Search (DFS)

DFS uses stack

Psedocode:

1. Add start-node to stack.
2. Repeat until stack is empty:
    2.1 Pop node v from the top of the stack.
    2.2 Visit v.
    2.3 Explore all outgoing edges of v.
    2.4 Push all unvisited neighbors of v on the top of the stack.

C++ code:

// code snippet from TIC2001 AY2020 S1 PE
// maybe not be 100% correct
void Graph::BFS(int s, List<int>& output, bool resetVisited) {
    Stack<int> nodes;
    if (resetVisited) {
        _resetVisited();
    }

    nodes.push(s);
    _setVisited(s);
    while (!nodes.empty()) {
        int currNode = nodes.pop();
        _setVisited(currNode);
        output.insertTail(currNode);
        _al[currNode].start();
        while (!_al[currNode].end()) {
            int n = _al[currNode].current();
            if (!_isVisited(n)) {
                nodes.push(n);
                _setVisited(n);
            }
            _al[currNode].next();
        }
    }
}

Breadth-First Search (BFS)

BFS uses queue

Psedocode:

1. Add start-node to queue.
2. Repeat until queue is empty:
    2.1 Remove node v from the front of the queue.
    2.2 Visit v.
    2.3 Explore all outgoing edges of v.
    2.4 Add all unvisited neighbors of v to the queue.

C++ code:

// code snippet from TIC2001 AY2020 S1 PE
// maybe not be 100% correct
void Graph::BFS(int s, List<int>& output, bool resetVisited) {
    Queue<int> nodes;
    if (resetVisited) {
        _resetVisited();
    }

    nodes.enqueue(s);
    _setVisited(s);
    while (!nodes.empty()) {
        int currNode = nodes.dequeue();
        _setVisited(currNode);
        output.insertTail(currNode);
        _al[currNode].start();
        while (!_al[currNode].end()) {
            int n = _al[currNode].current();
            if (!_isVisited(n)) {
                nodes.enqueue(n);
                _setVisited(n);
            }
            _al[currNode].next();
        }
    }
}

Analysis

TODO AFTER PRACTICAL EXAM, FOR FINAL EXAM

Shortest Path Algorithms

Dijkstra's algorithm

Bellman-Ford algorithm

Analysis

Questions: why can't we use DFS or BFS to find shortest path

Minimum Spanning Tree Algorithms

Prim's algorithm

For a connected weighted graph G with n vertices:
1. pick any vertex v of G and let T be the graph with this vertex only
2. let V be the set of all vertices of G except v
3. for (i = 0 to n − 1)
    3.1 find the edge e in G with the least weight of all the edges connected to T. 
    3.2 let w be the endpoint of e
    3.3 add e and w to the edge and vertex sets of T
    3.4 delete w from v

Kruskal's algorithm

For a connected weighted graph G with n vertices:
1. initialise T to have all the vertices of G and no edges
2. let E be the set of all edges in G; let m = 0
3. while (m < n-1)
    3.1 find and remove the edge e in E of least weight
    3.2 if adding e to the edge set of T does not produce a circuit:
         i. add e to the edge set of T
        ii. set m = m + 1

Boruvka’s algorithm

Analysis

https://cs.stackexchange.com/questions/18797/minimum-spanning-tree-vs-shortest-path

Topological Sort

Post Order DFS

Psedocode:

For a graph with n vertices and m directed edges:
1. Initialize a stack and a visited array of size n
2. For each unvisited vertex in the graph:
    2.1 Call the DFS function with the vertex as the parameter
    2.2 In the DFS function, mark the vertex as visited and recursively call the DFS function for all unvisited neighbors of the vertex
    2.3 Once all the neighbors have been visited, push the vertex onto the stack
3. After all vertices have been visited, pop elements from the stack and append them to the output list until the stack is empty

C++ code:

#include <forward_list>
#include <stack>
#include <vector>

using std::forward_list;
using std::vector;
using std::stack;

class GraphEdge {
private:
    int _dest;

public:
    GraphEdge() : _dest(-1), _weight(-1) {} // for vector
    GraphEdge(int index) : _dest(index) {}
    GraphEdge(const GraphEdge& e) : _dest(e._dest) {}
    int dest() const { return _dest; }
}

class Graph {
private:
    vector<forward_list<GraphEdge>> _vertices;
    void _topologicalSortUtil(int, vector<bool>&, stack<int>&);

public:
    Graph(int n) : _vertices(n) {}
    void addEdge(int source, int dest, int weight);
    void topologicalSort();
}

void Graph::addEdge(int source, int dest) {
    // Assumes that an edge doesn't already exist!
    _vertices[source].emplace_front(dest, weight);
}

void Graph::_topologicalSortUtil(int v, vector<bool>& visited, stack<int>& Stack) {
    visited[v] = true;
    forward_list<GraphEdge> curr_edge_neighbour;
    
    curr_edge_neighbour = _vertices[v];
    
    for (auto i = curr_edge_neighbour.begin(); i != curr_edge_neighbour.end(); i++) {
        if (!visited[i->dest()]) {
            topologicalSortUtil(i->dest(), visited, Stack);
        }
    }
    Stack.push(v);
}

void Graph::topologicalSort() {
    stack<int> Stack;
    vector<bool> visited(_vertices.size(), false);
    /*
    bool* visited = new bool[_vertices.size()];
    for (int i = 0; i < _vertices.size(); i++) {
        visited[i] = false;
    }
    */
        
    for (int i = 0; i < _vertices.size(); i += 1) {
        if (!visited[i]) {
            _topologicalSortUtil(i, visited, Stack);
        }
    }
    
    // print the content of the stack
    // output here
    while (!Stack.empty()) {
        std::cout << Stack.top() << " ";
        Stack.pop();
    }
    std::cout << std::endl;
}

Mix and Match

Motivation:

Combine multiple basic data structure to achieve better performance.

Example:

adjacency list - array of linked list
a linked list can also be a BST, by augmenting the nodes with *nextPtr, *leftPtr, *rightPtr
queue - printInOrder() will be
$O (n \log n)$
- TBC

Conclusion:

Pointers is powerful and useful, but complicated to manage.

SkipList

Data Structure:

TODO

Analysis:

TODO

Binary Space Partitioning

Motivation:

Convex Hull

Definition:

Convex: A point set

P

is convex if

\forall x, y \in P, \overset{―}{x y} \subseteq P

Convex Combination: Let

S = {p_{i} \in R^{d} ∣ i = 1 \dots n}

, an convex combination of

S = \sum_{p_{i} \in S} (λ_{i} p_{i})

such that

\sum_{i = 1}^{n} λ_{i} = 1, \forall λ_{i} \geq 0

Convex Hull: The convex hull of

S, conv (S)

is the collection of all convex combinations

Font size of math is increased because there is some rendering problem with $\overline$

Alternative explanation:

Let

C \subseteq R^{n}

C

is convex if, for all

x, y \in C

, the line segment connecting

x, y

is included

C

A convex combination is a linear combination of points (which can be vectors, scalars, or more generally points in an affine space) where all coefficients are non-negative and sum to 1.

Source: Wikipedia

Motivation:

Application of ConveX Hull:

In general, a “simpler” representation of a complicated object
- Graphics: Bezier curve drawing/intersection, ray tracing
- Data mining/computer vision: Classification of data
- GIS, path finding

Convex Hull Algorithms

Gift Wrapping algorithm

a.k.a Jarvis march

1. Take the leftmost vertex
2. Repeat
    2.1 Search for the next vertex on the convex hull by choosing the one with the minimal turning angle

Graham's scan

1. Starting from the left most vertex, go around the polygon
3. If it’s a convex corner, continue
2. If it is a concave vertex, “make it convex” by “filling” it
    - By connecting its two neighbors

Divide and Conquer

1. Sort vertices in a direction, e.g. y direction
2. Repeat:
    2.1 Merge every neighboring convex hull
    2.2 For every merge, find the two “supporting lines”
        2.2.1 By taking a walk from an initial line (by joining the two neighbors)

Incremental method

1. Adding a point according to a sorted direction incrementally

Quickhull

- General idea: discard points that are not on the hull as quickly as possible
- First find the maximum and minimum in x and y directions
1. Construct a quadrilateral by these four points
2. Discard any point inside
3. Repeat:
    3.1 For each side, find the furthermost point
    3.2 Include in the convex hull
    3.3 Eliminate any point within

Analysis

Algorithm	Expected Time	Worst Time	Moving up to 3D
Gift Wrapping	$O (n h)$	$O (n^{2})$	$O (f n)$
Graham's scan	$O (n \log n)$	$O (n \log n)$	-
Divide and Conquer	$O (n \log n)$	$O (n \log n)$	$O (n \log n)$
Incremental method	$O (n \log n)$	$O (n \log n)$	$O (n \log n)$
Quickhull	$O (n \log n)$	$O (n^{2})$	$O (n^{2})$

$n$ is the number of points in the set
$h$ is the number of points in the hull
$f$ is the number of faces on the convex hull which is
$O (n)$

Can we do it in

O (n)

No. Otherwise, we can solve sorting problem in
$O (n)$ time (comparison sort)
Idea: For
$n$ numbers, project them onto a parabola

CS2040C Data Structures and Algorithms

Introduction

Searching & Sorting

Binary Search

Merge Sort

Quick Sort

Linked List

Binary Search Tree (BST)

Hash Table

Hash Collisions

Hashing with Chaining

Open Addressing

Hashing Techniques

Division Method

Multiplication Method

Hash Table Size

Extra Questions

Priority Queue

Priority Queue Operations

Store Heap in Array

Heap Sort

Extra Questions

Disjoint Set

Quick Find

Quick Union

Weighted Union

Weighted Union with Path Compression

Union-Find Algorithm Summary

Extra Questions

Graph

Graph Representations

Adjacency List

Adjacency Matrix

Comparison

Travesal Algorithms

Depth-First Search (DFS)

Breadth-First Search (BFS)

Analysis

Shortest Path Algorithms

Dijkstra's algorithm

Bellman-Ford algorithm

Analysis

Minimum Spanning Tree Algorithms

Prim's algorithm

Kruskal's algorithm

Boruvka’s algorithm

Analysis

Topological Sort

Mix and Match

SkipList

Binary Space Partitioning

Convex Hull

Convex Hull Algorithms

Gift Wrapping algorithm

Graham's scan

Divide and Conquer

Incremental method

Quickhull

Analysis

Read more

LAJ1201 Notes

CS2107 Qustions

CS2107 Notes

CS2106 Questions