## Lecture 05. Data Structures - Part 2
- Sets
- Dictionaries
- Priority Queues (Heaps)
---
### Sets
- A collection of **distinct** elements.
- Basic operations of sets: insertion, search and removal.
- Python `set`, `frozenset`; Java `HashSet`, `TreeSet`; C++ `set`, `unordered_set`.
---
#### Set implementation
- Balanced binary tree (ordered) $\rightarrow$ $O(\log(n))$
- Hash table (unordered) $\rightarrow$ $O(1)$
---
#### Python Sets
- Python sets are unordered
- Implemention based on a **hash table** (similar to a `dict` with dummy values)
- Heterogeneous element
- Element should be hashable (immutable)
Note:
- Cannot access element by index; `add` but not `append`
- Indeed, CPython's sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values
- Note: Nowadays, set and dict's implementations have diverged significantly, so the precise behaviors (e.g. arbitrary order vs. insertion order) and performance in various use cases differs; they're still implemented in terms of hashtables, so average case lookup and insertion remains O(1), but set is no longer just "dict, but with dummy/omitted keys".
---
#### Python Set Operations
- Create an empty set: `s = set()` (not `s = {}`)
- Checking if an item is in: `x in s`
- Adding elements: `set.add(x)`
- Remove: `set.discard(x)` or `set.remove(x)`
- Union: `s.union(t)` or `s | t`
- Intersection: `s.intersection(t)` or `s & t`
- Difference: `s.difference(t)` or `s - t`
- Clear: `s.clear()`
Note:
`discard` no error or exception is raised if element is not present, `remove` will raise `KeyError`
---
#### Python Set Iteration
```python=
for v in s:
print(v)
```
---
#### Python Set Operators
- `x in s`, `x not in s`
- `s == t`, `s != t`
- `s < t`, `s <= t`: check if `s` is a subset of `t`
- `s > t`, `s >= t`: check if `t` is a subset of `s`
- `s | t`, `s & t`, `s - t`, `s ^ t`
---
#### Python Set Operation Complexity
- `x in s`, add, remove an item: $O(1)$; worst $O(n)$
- Union `s|t`: $O(len(s)+len(t))$ = $O(n + m)$
- Inersection `s&t`: $O(\min(n,m))$
- Different `s-t`: $O(n)$
- Symmetric Difference `s^t`: $O(n)$
Note:
https://wiki.python.org/moin/TimeComplexity
Union, Inersection, Symmetric Difference: worst $O(n \times m)$
---
#### Python Frozen Sets
- Immutable sets: `frozenset`
- Elements are fixed after creation
- Hashable: can be used as a dictionary key
Note:
---
### Dictionaries (maps)
- Generalized array that consists of **key-value** pairs, in which **unique keys** are mapped to associated values.
- Python `dict`, `OrderedDict`; Java `HashMap`, `TreeMap`; C++ `map`, `unordered_map`.
---
#### Dictionary implementation
- Self-balancing binary tree (ordered) $\rightarrow$ $O(\log(n))$
- Hash table (unordered) $\rightarrow$ $O(1)$
Note:
There are generally two methods for implementing a self-balancing binary tree:
- Red-Black Tree and.
- AVL Trees (named after inventors Adelson-Velsky and Landis)
---
#### Python dict
- Python dicts are unordered
- Implemention based on a **hash table**
- Heterogeneous key
- Keys should be hashable (immutable)
- Since Python 3.6: more performance, order-preserving
Note:
https://docs.python.org/3/whatsnew/3.6.html#new-dict-implementation
---
#### Python Dict Operations
- Create an empty dict: `d = dict()` or `d = {}`
- Checking if a key is in: `k in d`
- Access a key: `d[k]` or `d.get(k)`
- Adding/updating a key: `d[k] = value`
- Removing a key: `del d[k]` or `d.pop(k)`
- Clear: `d.clear()`
Note:
Both `del` and `pop` raise KeyError if key doesn't exist in dict
---
#### Python Dict Iteration
```python=
for k in d:
print(k, d[k])
```
```python=
for k in sorted(d.keys()):
print(k, d[k])
```
```python=
for k, v in d.items():
print(k, v)
```
---
#### Python Dict Operation Complexity
- `k in d`: $O(1)$
- Get item: $O(1)$
- Set item: $O(1)$
- Delete item: $O(1)$
- Worst case (for all operations): $O(n)$
Note:
https://wiki.python.org/moin/TimeComplexity
---
#### Python OrderedDict and defaultdict
From `collections` module:
- `OrderedDict`: remembers the order entries were added
- `defaultdict`: calls a factory function to supply missing values
---
#### Python OrderedDict
Additional methods:
- `popitem(last=True)`: returns and removes a (key, value) pair in LIFO order if `last` is `True` or FIFO order if `False`.
- `move_to_end(key, last=True)`: move an existing key to either end of an ordered dictionary: right end if `last` is `True` or to the beginning if `last` is `False`.
---
#### Python defaultdict
```python=
s = 'mississippi'
d = defaultdict(int)
for k in s:
d[k] += 1
print(sorted(d.items()))
# [('i', 4), ('m', 1), ('p', 2), ('s', 4)]
```
---
#### Python defaultdict
`dict` alternative:
```python=
s = 'mississippi'
d = {}
for k in s:
d.setdefault(k, 0)
d[k] += 1
print(sorted(d.items()))
```
---
#### Python defaultdict
```python=
s = [('yellow', 1), ('blue', 2), ('yellow', 3),
('blue', 4), ('red', 1)]
d = defaultdict(list)
for k, v in s:
d[k].append(v)
print(sorted(d.items()))
# [('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]
```
---
#### Python defaultdict
`dict` alternative:
```python=
s = [('yellow', 1), ('blue', 2), ('yellow', 3),
('blue', 4), ('red', 1)]
d = {}
for k, v in s:
d.setdefault(k, []).append(v)
print(sorted(d.items()))
```
---
#### Python Dictionaries: Internal
- Hash Table: a data structure that stores a list of **key-value** pairs.
- A hash table uses a **hash function** to compute an **index**(**slot**) into an array of slots, from which the key can be found.
Note:
https://medium.datadriveninvestor.com/internal-implementation-of-dictionary-in-python-5b739d5535a4
- Hashing is the concept of converting data of arbitrary size into data of **fixed size** (index/slot).
- A **hash table** is a form of list where elements are accessed by a keyword rather than an index number. Internally, it will use a hashing function in order to find the index position in which the element should be inserted.
---
#### Hash Function and Hash Table
<img src="https://cdn.ucode.vn/uploads/1/upload/OQTgmjMv.png"/>
---
#### Hash Function and Hash Table
Main components of a Hash Table:
- Good hash function.
- Good collision resolution mechanism.
---
#### Python Dictionaries: Internal
A **hashcode** is a relatively large value so we use and operation with a small **mask value** to compute an index(slot) to place the key-value into the array.
```python=
arr = [-1] * 8
hashcode = hashfunction(key)
mask = len(arr)
index = hashcode % mask # modulo
arr[index] = <hash, key, value>
```
---
#### Hash Collision
- Each slot can only keep **1 entry** (hash, key, value).
- When 2 keys have the same index (slot): **hash collision**.
<img src="https://cdn.ucode.vn/uploads/1/upload/BcNQCGxz.png"/>
---
#### Hash Collision
- Chaining
- Open Addressing
---
#### Hash Collision: Chaining
<img src="https://cdn.ucode.vn/uploads/1/upload/jWkVNMpA.png"/>
---
#### Hash Collision: Open Addressing
<img src="https://cdn.ucode.vn/uploads/1/upload/eVdAoNkT.png"/>
---
#### Growing a hash table
- A hash table must grow when it is getting full
- The hash table's **load factor**: an indication of how large a portion of the available slots are being used.
$load\_factor = \frac{n}{k}$
- Where:
- $n$: number of used slots
- $k$: number of total slots
---
#### Growing a hash table
<img src="https://cdn.ucode.vn/uploads/1/upload/NGenUAuc.png"/>
---
#### Python Dictionaries: Internal
- Python uses Open Addressing for handling hash collision
- Python dict creation: initialized with an 8 element array.
- Dictionary is resized when 2/3 of the slots are filled.
- How is it resized: double the table size
---
### Priority Queue
An extension of the queue with the following properties:
- Each item in the queue has a certain **priority**.
- An element with **high priority** is dequeued before an element with low priority $\rightarrow$ the elements popped out in **sorted** order.
- If two elements have the same priority, they are served according to their order in the queue.
- Python `heapq`, `PriorityQueue`; Java `PriorityQueue`; C++ `priority queue`.
---
#### Priority Queue implementation
| Data structure | Pop | Insert |
| -------- | -------- | -------- |
| Array / linked list | $O(1)$ | $O(n)$ |
| **Binary Heap** | $O(1)$ | $O(\log(n))$ |
| Binary Search Tree | $O(1)$ | $O(\log(n))$ |
Note:
https://www.programiz.com/dsa/priority-queue
https://www.programiz.com/dsa/heap-sort#heap
https://realpython.com/python-heapq-module/
---
#### Python priority queue
- `heapq` module
- `queue.PriorityQueue`: synchronized
---
#### Python `heapq`
- `heapify(x)`: transform list x into a heap, in-place
- `heappush(heap, item)`: Push the value item onto the heap, maintaining the heap invariant.
- `heappop(heap)`: Pop and return the smallest item from the heap, maintaining the heap invariant
---
#### Python `heapq`
```python=
from heapq import heappush, heappop
def heapsort(a):
h = []
for v in a:
heappush(h, v)
return [heappop(h) for i in range(len(h))]
print(heapsort([1, 3, 5, 7, 9, 2, 4, 6, 8, 0]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```
---
## The End
---
{"metaMigratedAt":"2023-06-17T19:46:11.142Z","metaMigratedFrom":"YAML","title":"Thuc Nguyen's ADSA Course - Lecture 05. Data Structures - P2","breaks":true,"description":"View the slide with \"Slide Mode\".","slideOptions":"{\"theme\":\"white\"}","contributors":"[{\"id\":\"476f9158-9a39-4a2d-bb09-ce08dd7eb978\",\"add\":11470,\"del\":1707}]"}