CS2040C Data Structures and Algorithms
Notes.
Some algorithms are in psedocode/Java, some are C++ code.
Useful Website:
Introduction
Table of Content (Link)
Table of Content (Tree)
Data structures serve as the basis for abstract data types (ADT). ADT is in the logical level and data structure is in the implementation level.
ADT |
Data Structure |
Stack, Queue |
Linked List, Array |
Tree |
Linked List, Array (too slow for AVL) |
Binary Search Tree, Balanced Tree |
Tree |
Priority Queue |
Binary Heap (using array) |
Disjoint Set |
|
Graph |
Graph* |
- |
Hash Table |
Put simply, whether one thing is a data structure or not really depends on how we look at it. stackoverflow
- An ADT can be both ADT and data structure?
- TBH I am still not sure which one is which one
Searching & Sorting
Binary Search
Merge Sort
Quick Sort
Linked List
Linked List Tricks:
Middle of Linked List:
Nth node from the end of the Linked List:
Detect Cycle
Binary Search Tree (BST)
Tree: graph that is circuit free and connected
Binary Tree: a rooted tree in which every parent has at most 2 children, which is referred to as the left child and the right child
Binary Search Tree (BST): a rooted binary tree with the key of each internal node being greater than all the keys in the respective node's left subtree and less than the ones in its right subtree
Height-balanced BST: a BST in which the height of the left and the right subtree of any node differ by not more than 1*
- *different height-balanced BST will have slightly different definition
- e.g. AVL tree and Red-black tree
Hash Table
Motivation:
When the ordering of the data is not important, hash table is a better choice as it provides constant time* search, insert and delete operations.
* performance degrades to as load factor increase
Hash Collisions
Definition: Two distinct keys and collide
Impossible to have a hash function with no collisions, by pigeonhole principle
Resolving collision:
- chaining (put both items in the same bucket)
- open addressing (find another bucket for the new item)
- linear probing
- quadratic probing
- second hash function (cuckoo hashing)
Hashing with Chaining
Idea: Each bucket contains a linked list of items
Space complexity: , where = table size and = linked list size
Operations:
Analysis:
Optimistic
- Consider the Simple Uniform Hashing Assumption (SUHA), where every key is equally likely to map to every bucket. Then, as long as there are enough buckets, we won’t get too many keys in any one bucket.
- Simple Uniform Hashing Assumption: Suppose items and buckets, then define
- Then by (SUHA), expected search time .
- If , expected search time =
- Expected search time
- Worst-case search time
- Worst-case insertion time
Reality
- BUT Simple Uniform Hashing doesn’t exist. In reality, keys are not random and have lots of regularity which can induce patterns in hash functions.
Open Addressing
Idea: When collision happened, probe a sequence of buckets until you find an empty one.
Probing:
- Find the next position that’s empty to insert a key.
- Originally try , if collides, try for collisions.
- can be any function
- Linear Probing: , so
- Quadratic Probing: , so
- Double Hashing:
Operation:
Analysis:
Performance
- When , the table is full and we cannot insert any more items, cannot search efficiently (chaining can still search and insert efficiently).
- Define and assume , assuming SUHA, the expected cost of operation,
- Prove for is as below.
Prove for cost of operation, E
First probe: probability that first bucket is full is:
Second probe: if first bucket is full, then the probability that the second bucket is also full:
Third probe: probability is full:
Expected #probes
Note that,
So,
Advantages
- Save space
- Rarely allocate memory, as no new list-node allocations
- Better cache performance
- table all in one place in memory
- fewer accesses to bring table into cache.
- linked lists can wander all over the memory
Disadvantages
- More sensitive to choice of hash functions
- clustering is a common problem
- More sensitive to load
- performance degrades badly as
Hashing Techniques
Two common hashing techniques:
- Division Method
- Multiplication Method
Division Method
Hash Function: , where
Then, two keys and collide
Problem |
Solutions |
If and has a common divisor , then , so is divisible by . For all that are divisible by , only fraction of the hash table will be utilized |
Choose to be a prime number |
Division is slow |
|
Multiplication Method
Hash Function: , where
TODO: need more diagram to illustrate how it work
Pros:
- Multiplication method is faster than Division Method
- Works reasonably well when is an odd integer
- Odd: if it is even, then lose at least one bit’s worth
- Big enough: use all the bits in
Hash Table Size
Let number of items , table size
If , then too many collisions
If , then too much wasted space
Also, we don’t know in advance
Idea:
- Start with small (constant) table size
- Grow (and shrink) table as necessary
Increasing table size:
Increment table size by 1
- Double the size of the table
- Cannot wait until
- Resizing when was
- Total time complexity =
- In average, every addition of an item cost
Square the size of the table
Shrinking table:
- Similar to increase table size
- Need to allow some buffer before shrinking
Analysis:
Cost
- Scanning old hash table:
- Creating new hash table:
- Inserting each element in new hash table:
- Total cost:
Extra Questions
-
Let's say a company has about 600+ employee and we would like to store them into a hash table with size . In order to avoid confusion if there are two employees with the same name from different department, we will hash the concatenation of an employee name with his/her department. We will then convert the string into "binary" performing on the ASCII value of each character. For example, After with each number, the binary number will be in decimal. And the final hash value is Is this hashing method good or bad? Give reasons to support your answer.
- There will be a lot of collision
- A lot of hash table entries are empty. Because after , it basically just use the least 12 significant digits that are all coming from the department.
Priority Queue
The aim is to develop an ADT that manages a collection of objects with associated priorities. This ADT requires several operations:
Operations |
Description |
insert(Key k, Priority p) |
insert k with priority p |
extractMin() / extractMax() |
remove key with minimum / maximum priority |
decreaseKey(Key k, Priority p) / increaseKey(Key k, Priority p) |
reduce / increase the priority of key k to priority p |
contains(Key k) |
does the priority queue contain key k? |
isEmpty() |
is the priority queue empty? |
Assume data items are unique.
Why Binary Heap?
Data Structure |
Operations |
Priority Queue Sort |
|
insert(x) |
extractMax() |
Time |
In-place? |
Sorted Array |
O(n) |
O(1) |
O(n^2) |
Yes (insertion sort) |
Unsorted Array |
O(1) |
O(n) |
O(n^2) |
Yes (selection sort) |
AVL Tree |
O(logn) |
O(logn) |
O(nlogn) |
No (AVL sort) |
Heap / Binary Heap |
O(logn) |
O(logn) |
O(nlogn) |
Yes (Heap sort) |
A binary heap can be visualized as a binary tree (NOT BST), and it can be implemented using an array.
Q: Heap vs. AVL Tree
- same cost for operations
- slightly simpler
- no rotations
- implement using array
- slightly better concurrency
Q: Can we implement AVL tree using array?
- too slow for AVL
- how many operations/items needed for right/left rotate?
- AVL tree is not a complete tree waste of memory if use array
Properties of Heap:
- Heap Ordering:
- Max Heap:
priority[parent] >= priority[child]
- Min Heap:
priority[child] >= priority[parent]
- Complete binary tree:
- every level is full, except possible the last
- all nodes are as far left as possible
- Height:
Priority Queue Operations
Refer to the slides or VisualAlgo for detailed explanations and motivations behind the operations.
Store Heap in Array
- map each node in complete binary tree into a slot in an array
- each level starts from the array index , assuming root is level 0 from top
- let be array index of node
Illustration:
Heap Sort
Analysis:
-
unsorted array heap array
- cost of
bubbleDown()
is dependent on height of tree
- obversation: at least nodes are
- observation: most nodes have small height
Height |
Number of Nodes |
0 |
|
1 |
|
2 |
|
3 |
|
|
|
|
1 |
Cost of building a heap is O(n)
-
heap array sorted array
extractMax()
is
- so the loop takes
-
Summary:
- unsorted array heap array:
- heap array sorted array:
- so overall is
- heap sort is in-place, but unstable
- what if two different keys have same priority?
Q: Why not just use merge sort or quick sort?
- heap sort is in place
- in the scenario where sorting the whole array is not necessary, only the first element need to be sorted, heap sort is efficient, the time complexity for this will be
- a little slower than quick sort, but always completes in
Q: Can we improve heap sort?
- ternary (3-way) heap sort is a little faster
- ternary heap sort is not difficult to implement, we just need to figure out the index of parent and children
- the time complexity is still the same, we just improve from
Extra Questions
-
Given an array of max heap, how to convert it into a min heap.
Answer 1
- Approach 1: treat the “max heap” as an unsorted array and heapify (on min heap) .
- Aprroach 2: If you are given max heap implementation and its raw array and the contents are numbers (integers or floating points), you can simply just reinsert the negations of each numbers of that max heap. This essentially converts a max heap into a min heap (just ignore the presence of -ve symbols that is only used to reverse the meaning of min and max).
-
Given items, they are in sorted list, so each of the sorted list has items. How fast can you merge them into one single sorted list? What if , e.g. ?
Answer 2
- Idea: min heap array + new sorted array, put every first element of the sorted lists into heap array, extract min from heap array, choose another element from old sorted list into heap array and then bubble
-
If a maxheap is stored in an array, we can always convert a max heap to a min heap by reversing the array, and vice versa. For example, if an array A store a max heap of [10,9,8,7] then reversing A as [7,8,9,10] will be a min heap.
Answer 3
False. Counterexample: A = [10,9,3,8,7,1].
Disjoint Set
Definition: A disjoint-set ADT maintains a collection of disjoint (non-overlapping) sets. It supports two primary operations: find and union.
Purpose: Primarily used to track elements split into multiple sets and to easily merge these sets.
Core Operations:
Operations |
Description |
DisjointSet(int N) |
constructor: N objects |
boolean find(Key p, Key q) |
are p and q in the same set? (is there a path connecting the two objects?) |
void union(Key p, Key q) |
replace sets containing p and q with their union |
Quick Find
Data Structure:
- uses an array
componentID[]
of size N
, where N
is the number of elements
- initialy, each element is in its own set, so
componentID[i] = i
, for
- two objects are connected if they have the same component identifier
Quick Union
Data Structure:
- uses an array
parent[]
of size N
, where N
is the number of elements
- initialy, each element is in its own set, so
parent[i] = i
, for
- two objects are connected if they are part of the same tree
Weighted Union
Data Structure:
- enhanced version of quick union
- uses an array
parent[]
of size N
, where N
is the number of elements
- an additional
size[]
array that tracks the size of each tree
- when merging 2 trees, always attach the smaller tree to the root of the larger tree
- the maximum depth of tree will be , prove refers to slides
Weighted Union with Path Compression
- extended from weighted union
- after finding the root, set the parent of each traversed node to the root
Example: find(11, 32)
Union-Find Algorithm Summary
Algorithm |
find |
union |
Quick Find |
|
|
Quick Union |
|
|
Weighted Union |
|
|
Path Compression |
|
|
Weighted Union with Path Compression |
|
|
- is Inverse Ackermann function
Extra Questiosn
-
Given an array equations of strings that represent relationships between variables in one of two different forms: "a==b"
or "a!=b"
. Return true if and only if it is possible to assign integeres to variable names so as to satisfy all the given equations.
Answer 1
- Create a Union-Find Data Structure:
- For each variable in the equations, create a disjoint set. Initially, each variable is its own set.
- Process "==" Equations:
- Iterate through the equations and for each "a==b" equation, union the sets to which variables 'a' and 'b' belong.
- Check "!=" Equations:
- Iterate through the equations again and for each "a!=b" equation, check if 'a' and 'b' belong to the same set in the Union-Find data structure. If they do, the equations are contradictory, and you return False.
- If No Contradictions, Return True:
- If you have iterated through all the equations without finding any contradictions, return True, indicating that it's possible to assign integers to variables to satisfy all the given equations.
-
Leetcode 100
Answer 2
- Approach 1: DFS/BFS
- Approach 2: UF
- Union-Find (or Disjoint Set) data structure allows you to efficiently perform union and find operations. Union operation merges two sets into one, and find operation determines the set to which an element belongs.
- For each '1' cell, check its adjacent cells.
- If adjacent cells are also '1', it means they are part of the same island. Therefore, union these cells in the Union-Find data structure.
- The number of disjoint sets remaining in the Union-Find data structure after processing all cells represents the number of islands. Each disjoint set represents a separate island.
Graph
For more detailed graph basics, refer CS1231S notes.
Notations:
- nonempty set of vertices/nodes,
- set of edges,
- for undirected edge incident of vertices and
- for directed edge from vertex to vertex
Definition: (not in CS1231S)
- Diameter: Maximum distance between two nodes, following the shortest path.
Motivation:
- Graphs can model a wide array of real-world problems, which we can then apply graph-based algorithms to solve these problems efficiently.
Graph Representations
Adjacency List
- an array of linked list
- the index represents a node, and each element in the linked list represents the nodes that are connected to it
- the image below shows an undirected graph, can also be a directed graph
Adjacency Matrix
- a 2D array, where
matrix[u][v]
indicates the presence/weight of an edge between u
and v
- Neat Property: Let be a adjacency matrix, then number of walks of length from to entry of
- the image below shows an undirected graph, can also be a directed graph
Comparison
|
Adjacency List |
Adjacency Matrix |
Memory Usage for graph, |
|
|
Memory Usage for graph with cycle |
|
|
Memory Usage for graph with clique |
|
|
find any neighbour of v |
fast |
slow |
are v and w neighbours? |
slow |
fast |
enumerate all neighbours |
fast |
slow |
Base rule: if graph is dense then use an adjacency matrix; else use an adjacency list.
Travesal Algorithms
Depth-First Search (DFS)
Psedocode:
C++ code:
Breadth-First Search (BFS)
Psedocode:
C++ code:
Analysis
TODO AFTER PRACTICAL EXAM, FOR FINAL EXAM
Shortest Path Algorithms
Dijkstra's algorithm
Bellman-Ford algorithm
Analysis
Questions: why can't we use DFS or BFS to find shortest path
Minimum Spanning Tree Algorithms
Prim's algorithm
Kruskal's algorithm
Boruvka’s algorithm
Analysis
https://cs.stackexchange.com/questions/18797/minimum-spanning-tree-vs-shortest-path
Topological Sort
Psedocode:
C++ code:
Mix and Match
Motivation:
Combine multiple basic data structure to achieve better performance.
Example:
- adjacency list - array of linked list
- a linked list can also be a BST, by augmenting the nodes with
*nextPtr
, *leftPtr
, *rightPtr
- queue -
printInOrder()
will be
Conclusion:
Pointers is powerful and useful, but complicated to manage.
SkipList
Data Structure:
Analysis:
Binary Space Partitioning
Motivation:
Convex Hull
Definition:
Convex: A point set is convex if
Convex Combination: Let , an convex combination of such that
Convex Hull: The convex hull of is the collection of all convex combinations
Font size of math is increased because there is some rendering problem with $\overline$
Alternative explanation:
Let . is convex if, for all , the line segment connecting is included .
A convex combination is a linear combination of points (which can be vectors, scalars, or more generally points in an affine space) where all coefficients are non-negative and sum to 1.
Source: Wikipedia
Motivation:
Application of ConveX Hull:
- In general, a “simpler” representation of a complicated object
- Graphics: Bezier curve drawing/intersection, ray tracing
- Data mining/computer vision: Classification of data
- GIS, path finding
Convex Hull Algorithms
Gift Wrapping algorithm
a.k.a Jarvis march
Graham's scan
Divide and Conquer
Incremental method
Quickhull
Analysis
Algorithm |
Expected Time |
Worst Time |
Moving up to 3D |
Gift Wrapping |
|
|
|
Graham's scan |
|
|
- |
Divide and Conquer |
|
|
|
Incremental method |
|
|
|
Quickhull |
|
|
|
- is the number of points in the set
- is the number of points in the hull
- is the number of faces on the convex hull which is
Can we do it in ?
- No. Otherwise, we can solve sorting problem in time (comparison sort)
- Idea: For numbers, project them onto a parabola