## Lecture 05. Data Structures - Part 2 - Sets - Dictionaries - Priority Queues (Heaps) --- ### Sets - A collection of **distinct** elements. - Basic operations of sets: insertion, search and removal. - Python `set`, `frozenset`; Java `HashSet`, `TreeSet`; C++ `set`, `unordered_set`. --- #### Set implementation - Balanced binary tree (ordered) $\rightarrow$ $O(\log(n))$ - Hash table (unordered) $\rightarrow$ $O(1)$ --- #### Python Sets - Python sets are unordered - Implemention based on a **hash table** (similar to a `dict` with dummy values) - Heterogeneous element - Element should be hashable (immutable) Note: - Cannot access element by index; `add` but not `append` - Indeed, CPython's sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values - Note: Nowadays, set and dict's implementations have diverged significantly, so the precise behaviors (e.g. arbitrary order vs. insertion order) and performance in various use cases differs; they're still implemented in terms of hashtables, so average case lookup and insertion remains O(1), but set is no longer just "dict, but with dummy/omitted keys". --- #### Python Set Operations - Create an empty set: `s = set()` (not `s = {}`) - Checking if an item is in: `x in s` - Adding elements: `set.add(x)` - Remove: `set.discard(x)` or `set.remove(x)` - Union: `s.union(t)` or `s | t` - Intersection: `s.intersection(t)` or `s & t` - Difference: `s.difference(t)` or `s - t` - Clear: `s.clear()` Note: `discard` no error or exception is raised if element is not present, `remove` will raise `KeyError` --- #### Python Set Iteration ```python= for v in s: print(v) ``` --- #### Python Set Operators - `x in s`, `x not in s` - `s == t`, `s != t` - `s < t`, `s <= t`: check if `s` is a subset of `t` - `s > t`, `s >= t`: check if `t` is a subset of `s` - `s | t`, `s & t`, `s - t`, `s ^ t` --- #### Python Set Operation Complexity - `x in s`, add, remove an item: $O(1)$; worst $O(n)$ - Union `s|t`: $O(len(s)+len(t))$ = $O(n + m)$ - Inersection `s&t`: $O(\min(n,m))$ - Different `s-t`: $O(n)$ - Symmetric Difference `s^t`: $O(n)$ Note: https://wiki.python.org/moin/TimeComplexity Union, Inersection, Symmetric Difference: worst $O(n \times m)$ --- #### Python Frozen Sets - Immutable sets: `frozenset` - Elements are fixed after creation - Hashable: can be used as a dictionary key Note: --- ### Dictionaries (maps) - Generalized array that consists of **key-value** pairs, in which **unique keys** are mapped to associated values. - Python `dict`, `OrderedDict`; Java `HashMap`, `TreeMap`; C++ `map`, `unordered_map`. --- #### Dictionary implementation - Self-balancing binary tree (ordered) $\rightarrow$ $O(\log(n))$ - Hash table (unordered) $\rightarrow$ $O(1)$ Note: There are generally two methods for implementing a self-balancing binary tree: - Red-Black Tree and. - AVL Trees (named after inventors Adelson-Velsky and Landis) --- #### Python dict - Python dicts are unordered - Implemention based on a **hash table** - Heterogeneous key - Keys should be hashable (immutable) - Since Python 3.6: more performance, order-preserving Note: https://docs.python.org/3/whatsnew/3.6.html#new-dict-implementation --- #### Python Dict Operations - Create an empty dict: `d = dict()` or `d = {}` - Checking if a key is in: `k in d` - Access a key: `d[k]` or `d.get(k)` - Adding/updating a key: `d[k] = value` - Removing a key: `del d[k]` or `d.pop(k)` - Clear: `d.clear()` Note: Both `del` and `pop` raise KeyError if key doesn't exist in dict --- #### Python Dict Iteration ```python= for k in d: print(k, d[k]) ``` ```python= for k in sorted(d.keys()): print(k, d[k]) ``` ```python= for k, v in d.items(): print(k, v) ``` --- #### Python Dict Operation Complexity - `k in d`: $O(1)$ - Get item: $O(1)$ - Set item: $O(1)$ - Delete item: $O(1)$ - Worst case (for all operations): $O(n)$ Note: https://wiki.python.org/moin/TimeComplexity --- #### Python OrderedDict and defaultdict From `collections` module: - `OrderedDict`: remembers the order entries were added - `defaultdict`: calls a factory function to supply missing values --- #### Python OrderedDict Additional methods: - `popitem(last=True)`: returns and removes a (key, value) pair in LIFO order if `last` is `True` or FIFO order if `False`. - `move_to_end(key, last=True)`: move an existing key to either end of an ordered dictionary: right end if `last` is `True` or to the beginning if `last` is `False`. --- #### Python defaultdict ```python= s = 'mississippi' d = defaultdict(int) for k in s: d[k] += 1 print(sorted(d.items())) # [('i', 4), ('m', 1), ('p', 2), ('s', 4)] ``` --- #### Python defaultdict `dict` alternative: ```python= s = 'mississippi' d = {} for k in s: d.setdefault(k, 0) d[k] += 1 print(sorted(d.items())) ``` --- #### Python defaultdict ```python= s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)] d = defaultdict(list) for k, v in s: d[k].append(v) print(sorted(d.items())) # [('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])] ``` --- #### Python defaultdict `dict` alternative: ```python= s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)] d = {} for k, v in s: d.setdefault(k, []).append(v) print(sorted(d.items())) ``` --- #### Python Dictionaries: Internal - Hash Table: a data structure that stores a list of **key-value** pairs. - A hash table uses a **hash function** to compute an **index**(**slot**) into an array of slots, from which the key can be found. Note: https://medium.datadriveninvestor.com/internal-implementation-of-dictionary-in-python-5b739d5535a4 - Hashing is the concept of converting data of arbitrary size into data of **fixed size** (index/slot). - A **hash table** is a form of list where elements are accessed by a keyword rather than an index number. Internally, it will use a hashing function in order to find the index position in which the element should be inserted. --- #### Hash Function and Hash Table <img src="https://cdn.ucode.vn/uploads/1/upload/OQTgmjMv.png"/> --- #### Hash Function and Hash Table Main components of a Hash Table: - Good hash function. - Good collision resolution mechanism. --- #### Python Dictionaries: Internal A **hashcode** is a relatively large value so we use and operation with a small **mask value** to compute an index(slot) to place the key-value into the array. ```python= arr = [-1] * 8 hashcode = hashfunction(key) mask = len(arr) index = hashcode % mask # modulo arr[index] = <hash, key, value> ``` --- #### Hash Collision - Each slot can only keep **1 entry** (hash, key, value). - When 2 keys have the same index (slot): **hash collision**. <img src="https://cdn.ucode.vn/uploads/1/upload/BcNQCGxz.png"/> --- #### Hash Collision - Chaining - Open Addressing --- #### Hash Collision: Chaining <img src="https://cdn.ucode.vn/uploads/1/upload/jWkVNMpA.png"/> --- #### Hash Collision: Open Addressing <img src="https://cdn.ucode.vn/uploads/1/upload/eVdAoNkT.png"/> --- #### Growing a hash table - A hash table must grow when it is getting full - The hash table's **load factor**: an indication of how large a portion of the available slots are being used. $load\_factor = \frac{n}{k}$ - Where: - $n$: number of used slots - $k$: number of total slots --- #### Growing a hash table <img src="https://cdn.ucode.vn/uploads/1/upload/NGenUAuc.png"/> --- #### Python Dictionaries: Internal - Python uses Open Addressing for handling hash collision - Python dict creation: initialized with an 8 element array. - Dictionary is resized when 2/3 of the slots are filled. - How is it resized: double the table size --- ### Priority Queue An extension of the queue with the following properties: - Each item in the queue has a certain **priority**. - An element with **high priority** is dequeued before an element with low priority $\rightarrow$ the elements popped out in **sorted** order. - If two elements have the same priority, they are served according to their order in the queue. - Python `heapq`, `PriorityQueue`; Java `PriorityQueue`; C++ `priority queue`. --- #### Priority Queue implementation | Data structure | Pop | Insert | | -------- | -------- | -------- | | Array / linked list | $O(1)$ | $O(n)$ | | **Binary Heap** | $O(1)$ | $O(\log(n))$ | | Binary Search Tree | $O(1)$ | $O(\log(n))$ | Note: https://www.programiz.com/dsa/priority-queue https://www.programiz.com/dsa/heap-sort#heap https://realpython.com/python-heapq-module/ --- #### Python priority queue - `heapq` module - `queue.PriorityQueue`: synchronized --- #### Python `heapq` - `heapify(x)`: transform list x into a heap, in-place - `heappush(heap, item)`: Push the value item onto the heap, maintaining the heap invariant. - `heappop(heap)`: Pop and return the smallest item from the heap, maintaining the heap invariant --- #### Python `heapq` ```python= from heapq import heappush, heappop def heapsort(a): h = [] for v in a: heappush(h, v) return [heappop(h) for i in range(len(h))] print(heapsort([1, 3, 5, 7, 9, 2, 4, 6, 8, 0])) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] ``` --- ## The End ---
{"metaMigratedAt":"2023-06-17T19:46:11.142Z","metaMigratedFrom":"YAML","title":"Thuc Nguyen's ADSA Course - Lecture 05. Data Structures - P2","breaks":true,"description":"View the slide with \"Slide Mode\".","slideOptions":"{\"theme\":\"white\"}","contributors":"[{\"id\":\"476f9158-9a39-4a2d-bb09-ce08dd7eb978\",\"add\":11470,\"del\":1707}]"}
    147 views