Homework 3: Nonprogramming part

--- toc: depth_from: 2 depth_to: 3 ordered: false --- [![hackmd-github-sync-badge](https://hackmd.io/JZUhv-byTRqV46cumxZ9Xw/badge)](https://hackmd.io/JZUhv-byTRqV46cumxZ9Xw) <center> # Homework 3: Nonprogramming part **邱紹庭** `r07945001@ntu.edu.tw` </center> **Table of contents** [toc] --- ## Problem 1 - Hash (60pt) ### 1. (10pt) The probability of no collision ($X'$)[^UniHash]: $$\begin{align} P(X') &= \frac{n^2}{n^2} \frac{n^2-1}{n^2} \cdots \frac{n^2-n+1}{n^2} \\ &= \prod_{i=1}^{n} \frac{n^2 - i + 1}{n^2} \end{align}$$ On the other hand, the probability of any collision ($X$): $$\begin{align} P(X) &= 1 - P(X')\\ &= 1 - \prod_{i=1}^{n} \frac{n^2 - i + 1}{n^2}_{\#} \end{align}$$ [^UniHash]: [If we store n keys in a hash table of size m=n^2 , then what is the probability of any collision ?](https://gateoverflow.in/41109/store-keys-hash-table-size-then-what-probability-collision) [^AC-P1-1]: Thanks to 張杰輝's inspiration. ### 2. (10pt) **💡 Idea** Suppose we add unique keys one by one. Inserting first unique key at start require extected $1$ time of insertion. For second unique key, since there is already one position occupied, the acquisition of second key requires $\frac{|P|}{|P|-1}$ expected times of insertion[^AC-P1-1]. **🔧 Formulation** For inserting $a+1^{th}$ unique keys when there is already $a$ slots occupied. Let $p$ be the probability of success insertion of unique key with single trial. The probability of success insertion after $n$ trials is $$Pr_{a+1}(p_{a+1},n) = (1-p_{a+1})^{n-1}p_{a+1}$$ where $p_{a+1} = \frac{|P|-a}{|P|}$ and $n\in \{1,...,\infty\}$. The probability $Pr_{a+1}(p_{a+1},n) \in$ **Geometric distribution**. Since $Pr_a$ belongs to geometric distribution, the expected value is[^geom] $$E[\text{Times to inserting $a+1^{th}$ key} | a \text{ slots occupied in } P] = \frac{1}{Pr_{a+1}} =\frac{|P|}{|P|-a}$$ For inserting $k$ keys, $$E[\text{# of inserting $k$ keys}] = \sum_{i=0}^{\frac{\lfloor P \rfloor}{4} -1}\frac{|P|}{|P|-i}_{\#}$$ [^geom]: https://en.wikipedia.org/wiki/Geometric_distribution ### 3. (20pt) - $h(k,i) = (h_1(k)+i)~mod~m$ - $h_{1}(k) = k~mod~m$ - $m=11$ **Open addressing with linear probing** |k|h1(k)|0|1|2|3|4|5|6|7|8|9|10| |---|---|---|---|---|---|---|---|---|---|---|---|---| |18|7||||||||18||| |34|1||34||||||18||| |9|9||34||||||18||9| |37|4||34|||37|||18||9| |40|7||34|||37|||18|40|9| |32|9||34|||37|||18|40|9|32| |89|1||34|89||37|||18|40|9|32| **Open addressing with double hashing** - $h(k,i) = (h_1(k) + ih_2(k))~mod~m$ - Primary Hash function: $h_{1}(k) = k~mod~m$ - Secondary hash function: $h_{2}(k) = 1+ (k~mod~(m-1))$ - $m=11$ |k|h1(k)|h2(k)|0|1|2|3|4|5|6|7|8|9|10| |---|---|---|---|---|---|---|---|---|---|---|---|---|---| |18|7|9||||||||18||| |34|1|5||34||||||18||| |9|9|10||34||||||18||9| |37|4|8||34|||37|||18||9| |40|7|1||34|||37|||18|40|9| |32|9|3||34|||37|||18|40|9|32| |89|1|10|89|34|||37|||18|40|9|32| - 40 - H1:40%11 = 7 - H2: 1+(40%10) = 1 - 32 - H1: 32%11 = 10 - 89 - H1: 89%11=1 - H2: 1+ 89%10 = 10 ### 4. (20pt) **Table T1**[^AC-P1-1] - $h_1(k)=k~mod~7$ |k|h1(k)|0|1|2|3|4|5|6| |---|---|---|---|---|---|---|---|---| |6|6|||||||6| |31|3||||31|||6| |2|2|||2|31|||6| |41|6|||2|31|||41| |30|2|||30|31|||6| |45|3|||30|45|||6| |44|2|||44|31|||6| **Table T2** - $h_2(k) = \lfloor \frac{k}{7} \rfloor~mod~7$ |k|h2(k)|0|1|2|3|4|5|6| |---|---|---|---|---|---|---|---|---| |6|0||||||||| |31|4|||||||| |2|0||||||||| |41|5|6||||||| |30|4|2|||||41|| |45|6|2||||31|41|| |44|6|2||||30|41|45| --- ## Problem 2 - String matching (60pt) ### 1. (10pt) **💡 Idea** 1. Use **Robin-Karp algorithm** to hash strings. 2. **Preprocessing**: Create an array of hashes. Let the `array[i]` stores the hash of `S[i..N]`. This requires |Index|1|2|3|4| |---|---|---|---|---| |Sting|S1|S2|S3|S4| |Hash|H1:1|H1:2|H1:3|H1:4| where `Hi:j` represents the hash of the region `H(S[i..j])` 4. **Query**: Use **Substraction** technique to get the hash value in the region of `S[i..j]` with $O(1)$ time complexity (**Fig. 2-1**). <center> <img height=200 src="https://i.imgur.com/DXyMche.jpg"> **Fig 2-1.** Generate `hash(S[2..4])` from `HL` and `HR`. </center> **🔧 Implementation** **Preprocessing** ```cpp= function Preprocessing(Str, d, q) Initialize hash_list[1,...,length(Str)] p = 0 for i = 1 to length(Str) p = (dp + Str[i]) mod q hash_list[i] = p end return hash_list end ``` - `Str`: Text - `d`: The radix - `q`: modulo **Hash Generation** ```cpp= function get_hash(hash_list, d, q, l, r) l_hash = hash_list[l] r_hash = hash_list[r] i_hash = (r_hash - l_hash) / d^(r-l) i_hash = i_hash mod q return i_hash end ``` **Matching** ```cpp= function matching(Str, hash_list, l1, l2, n) h1 = get_hash(hash_list, l1, l1+n-1) h2 = get_hash(hash_list, l2, l2+n-1) if h1==h2 if(Str[l1...l1+n-1] == Str[l2...l2+n-1]) return true return false end ``` **🔢 Analysis** - Preprocessing: - Time: $O(N)$ - Space: $O(N)$ - where $n$ is the length of the text - Hash generation - Time: $O(1\cdot Q)$ - Single query requires constant time. - Space: $O(1)$ - Matching - Similar to Rabin-Karp Matching - Time: $O(Q)$ (amortized). - In this case, two string poccess equal lengths. The original time complexity $O((m-n+1)n)$ decays to $O(n)$ where $n$ is the length of the sub-string. - Space: $O(1)$. Reuse the text stored in `preprocessing`. - **Total**: $O(N+Q)\approx O(N)$ for $Q\approx N$. ### 2. (10pt) - `S[1..N]` = `bcdabcde` |$i$|$x(i) = \max(p)$|array $X$| |---|---|---| |1|8|`bcdabcde`| |2|0|` `| |3|0|` `| |4|0|` `| |5|3|`bcd`| |6|0|` `| |7|0|` `| |8|0|` `| ### 3. (20pt) **💡 Idea** - Let $x(i) = y > 0$. For $\{j | j\in [i+1..i+y-1]\}$, $x(j)=x(j-i+1)$. (**Fig 2-1.**, underlined). Therefore, we can get $x(j)$ in constant time if it is covered in the identical region. - If $k\notin [i+1..i+j-1]$, use one-by-one comparison to iterate next site with $y>0$ and calculate the $P$s on the left side of the boundary (**Fig2-1**, yellow vertical line). - The main reason this algorithm can keep in $O(n)$ is because the boundary **Fig 2-1.** only goes to right, and the region between `i` and `i+p-1` can be retrieved in **constant time**. <img height=200 src="https://i.imgur.com/fKUbPrU.png"> **Fig 2-1.** Construction of Ps. The first row represents the index of site and the boundary described in *idea*. Second row is the text. The third row is the $x(i)$ with the repetitive region underlined in black[^94]. <img height=100 src="https://i.imgur.com/iKzuyPR.jpg"> <img height=100 src="https://i.imgur.com/OypDeRN.jpg"> **Fig 2-2.** **Case I**: Overlapped pattern; **Case II**: Non-overlapped pattern. [^94]: This problem is discussed with the user `54` on DSA Discord. **🔧 Implementation** ```cpp= function get_xs(S) create list xs[1..length(S)] xs[1] = length(s) boundary = 0 boundary_origin = 0 for i = 2 to length(s) // Case 0 if(S[1] != S[i]) xs[i] = 0 end // Case 1 & 2: Within Boundary else if(i < boundary_origin) // Case 1 if(i<xs[boundary]) p = xs[boundary] - (i-boundary_origin) end // Case 2 else p=xs[i-boundary+1] end while(S[i+p] == S[1+p]) p+=1 boundary_origin = i boundary = i+p-1 end xs[i] = p end // Case 3: On the right of the boundary else p = 0 while(xs[i+p]==xs[1+p]) p+=1 boundary_origin = i boundary = i+p-1 end xs[i] = p end end return [S[1..i] for i in xs[i]] //Xs end ``` - Time analysis - The boundary only goes to the end of string in one direction - Hence, the time complexity is $O(len(S))$ - Space: $O(len(S))$. - This function returns a list of **X**s, which is same as > Design an algorithm that calculate the content of array X ### 4. (20pt) **💡 Idea** - Merge pattern $p$ with the string $t$. - This creates a new string, $pt$, with length equal to $len(p)+len(t)$ - Use the algorithm proposed in **P3-3**, calculate the $x(i)$ - Find all of $i$ such that $\{i | i>len(p)~and~ x(i)\geq len(p)\}$ **🔧 Implementation** ```cpp= Find-occurrence(p, t) m = p + t xs = get_xs(m) occurrence = 0 for i = length(p)+1 to length(m) if(xs[i] >= length(p)) occurrence += 1 end end return occurrence end ``` **🔢 Analysis** - Time: - Preprocessing: $O(len(p)+len(t))$ - Counting: $O(len(t))$ - Overall: $O(len(p) + 2\cdot len(t))$ - Space: $O(len(p) + len(t))$ [^RKMatch]: CLRS. PP.993 --- ## Problem 3 - Having fun with Disjoint Sets (80pt) ### 1. (15pt) **💡 Idea** - Initiation [^findbipart][^stackbipart] - Set singular sets for all verteces - Add - If same set, return `false` - If different set, `union` **Initiation** ```cpp= INIT(N) Create MAP[0,..,N-1] with NIL-filled for i = 0 to N-1 MAKE-SET(i) end isBiparatite = True end ``` - `isBiparatite` can be seen and modified globally. - Vector `MAP`: contains a nighbor of each verteces. - If `MAP[1] == 2`, it means vertex `1` is connected to vertex `2` <img height=200 src="https://i.imgur.com/kxkp8uz.jpg"> **Fig 3.1** The data structure of `MAP`. In this case `Add(1)`. Th array `MAP` is a list of pointers which store address of disjoint sets. Calling `MAP[1]` returns the disjoint set of `2-1` as illustrated. **Adding edge** ```cpp= ADD-EDGE(x,y) // Check bipartite if FIND-SET(x) == FIND-SET(y) isBiparatite = False end if MAP[x] == NIL MAP[x] = y else UNION(MAP[x], y) end if MAP[y] == NIL MAP[y] = x else UNION(MAP[y], x) end end ``` - `isBiparatite` is generated by `INIT`. **Check bipartite** ```cpp= IS-BIPARTITE() return isBiparatite end ``` **🔢 Analysis** - **Initiation**: Operates in linear time complexity for `Make-Set` is $O(1)$ and iterates $N$ times. Therefore, the total time complexity is $O(N)$. - **Add Edge**: the union by size technique can confirm the smaller set enlarges at least twice of its size, which yields $O(\log N)$ time complexity for `Union` operation. Since the maximum calling of `ADD-EDGE` is $M$, the worst case of time complexity is: $O(M \log N)$ . - **Check bipartite**: The time complexity of `IS-BIPARTITE` is $O(1)$. - **Summary**: The total operation requires $O(N) + O(M\log N)$ in time complexity. [^findbipart]: [How to Find If a Graph Is Bipartite?](https://www.baeldung.com/cs/graphs-bipartite) [^stackbipart]: https://stackoverflow.com/questions/53246453/detect-if-a-graph-is-bipartite-using-union-find-aka-disjoint-sets ### 2. (15pt) **💡 Idea** 1. Create $3N$ nodes. $N$ nodes labelled with `scissor`, $N$ for `stone` and $N$ for `paper` 2. For `Win` operation, links all of 3 victory conditions. That is `scissor -> paper` / `paper->stone` / `stone->scissor`, and link these conditions into three disjoint sets. 3. For `tie` operation. For example: `Tie(i,j)`: Link $Node_{i}^{stone}$ to $Node_{j}^{stone}$ , and the other two pairs (**Fig 3-1.**). 4. Contradiction: if more than 1 $Node_{i}^{\#}$ belongs to the same set, it means there is one person revealing two results at the same time which is a contradiction. <img height=100 src="https://i.imgur.com/WQe2ymI.png"> **Fig 3-1. Sets of all possibility for `WIN(A,B); WIN(B,C)`.** **🔧Implementation** **Initiation** ```cpp= function INIT(N) Create list of vertexes: Scissor[1..N] Stone[1..N] Paper[1..N] for i = 1 to N MAKE-SET(Scissor[i]) MAKE-SET(Stone[i]) MAKE-SET(Paper[i]) end contracdict = true end ``` - The list contains **Vertex** object. - These function creates $3N$ verteces stored in three lists. - Noted that `Scissor[1]` means `person 1` with `scissor` output, and `Scissor[1]` and `Paper[1]` are two independent objects. **Win** ```cpp= function WIN(a,b) UNION(Scissor[a], Paper[b]) UNION(Stone[a], Scissor[b]) UNION(Paper[a], Stone[b]) contradict = IS-CONTRADICT-NODE(a) contradict = IS-CONTRADICT-NODE(b) end ``` **Tie** ```cpp= function TIE(a,b) UNION(Scissor[a], Scissor[b]) UNION(Stone[a], Stone[b]) UNION(Paper[a], Paper[b]) contradict = IS-CONTRADICT-NODE(a) contradict = IS-CONTRADICT-NODE(b) end ``` **Contradiction** ```cpp= function IS-CONTRADICT-NODE(i) if(FIND-SET(Scissor[i]) == FIND-SET(Stone[i])) return true end else if(FIND-SET(Scissor[i]) == FIND-SET(Paper[i])) return true end else if(FIND-SET(Paper[i]) == FIND-SET(Stone[i])) return true end else return false end end function IS-CONTRADICT() return contradict end ``` - The `IS-CONTRADICT` uses the global variable `contradict` as return. - The `contradict` valuable is updated within every operation `WIN` and `TIE` with $O(\log N)$ time comlexity. **🔢 Analysis** - Time: - `Init`: $O(N)$ - `WIN`: $O(\log N)$ - `TIE`: $O(\log N)$ - `IS-CONTRADICT`: $O(\log N)$ in each operation: The contradict testing is run every `WIN` and `TIE` operations. Each check requires $O(\log N)$ time complexity. This is contributed by `FIND-SET` operation. - Total: $O(N + M\log N)$. Time complexity results from `union by size` and `FIND-SET`, which are used for `WIN`/`TIE`/`IS-CONTRADICT` function. Noted that these three function are called $M$ times. - Space: - Individual outputs are set as independent nodes. The initiation triples the size of $N$ verteces. The total space complexity is $O(N)$. ### 3. (15pt) **💡 Idea** - **Amortized analysis**: Use amotized analysis to derive the time complexity of **path compression** without **union by size**. **🔢 Amortized analysis** **Initiation** This process is trivially $O(N)$ in time complexity. Because the stack operations are performed $N$ times both with `memo` and `stk` (**line `58-68`**). **Add Edge** - This implementation only apply **path compression**. As we can see in **line `85-92`**, the `djs_assign` reassign the subtree to the representative of the disjoint set. - On the other hand, the `djs_union`(**line `94-99`**), the union process is deterministically performed, regardless of the sizes of two disjoint sets. **Contradiction** - According to [^disjoint-amortized][^disjoint-ppt], with the linked list and path compression, the `FIND-SET` requires recursive in times of tree height. Because this operation doesn't provide **union by size** technique. In the worst cast, the tree height can be flattened to a linked list with length $O(N)$ (**Fig. 3-2**). - If the process is undid. The structure will return to **Fig 3-2 (left)** with $O(N)$ operations. - The **contradict condition** is - Call `undo` instantly before call `union`. - Like `Union -> Union undo Union`. This can inevitably reconstruct the tree with undoing (**Fig 3-2.** from `right->left->right`). <img height=200 src="https://i.imgur.com/DsSzsEu.png"> **Fig 3-2.** Path compression[^disjoint-ppt]. The structure can be converted to the right by `FIND-SET`. Conversely, `Undo`. **📒 Conclusion** - I **disprove** the stated time complexity. - The worst case can be $O(N+MN)$ with the contradict condition described above. [^cpuinfo]: CPU Info: `Model name: Intel(R) Xeon(R) CPU @ 2.20GHz; CPU MHz: 2199.998` [^disjoint-amortized]: Disjoint set. PPT. Page 18. Prof. Hsin-Mu Tsai. [^disjoint-ppt]: Page 30. https://www.cs.princeton.edu/courses/archive/spring13/cos423/lectures/UnionFind.pdf. `Starting from an empty data structure, path compression (with naive linking) performs any intermixed sequence of m ≥ n find and n – 1 union operations in O(m log n) time` ### 4. (15pt) As mentioned in [^disjoint-amortized], the linked list with union-by-size can achieve $O(N+M\log N)$ time complexity. **Initiation** The first $O(N)$ is caused by the initiation, that uses a `for-loop` to setup all sets. **Find set** With the aid of **union by size**, the data structure remains a tree (**Fig 3-3**). This makes `Find-set` operation works in $O(\log N)$[^findset_union]. **Add Edge** As mentioned in [^findset_union], the union by size technique assures the smaller set to increase at least twice of its original size, maintaining the height of the tree approximately $O(\log N)$. The union operation is accompanied with the `Find-set` which requires $O(\log N)$ time complexity. However, the link process (**Fig 3-3.**) only requires $O(1)$[^disjoint-ppt][^disjoint-amortized]. **Undo** The `undo` process doesn't destroy the tree structure. For each undo, the unlink process is taken place, and this operation maintain the tree height. Thus, after calling `undo`, `Find-set` and `Add-edge` remains $O(\log N)$ time complexity. The `undo` operation requires the time complexity no more than $O(\log N)$ **Conclusion** - **The statement is valid**. The time complexity of this modification is $O(N+M\log N)$. <img height=200 src="https://i.imgur.com/onvK94p.png"> **Fig 3-3.** Union by rank [^disjoint-ppt]. The dash line represents the link created by `Add-edge`. [^findset_union]: https://www.chegg.com/homework-help/questions-and-answers/data-structure-algorithms-need-understand-time-complexity-analysis-exam-disjoint-sets-link-q33572109. ![](https://i.imgur.com/BsghCel.png) ### 5. (20pt) **💡 Idea** 1. Use path compression technique and union by rank to achieve $O(\alpha(1))$ as mentioned in [^disjoint-amortized]. 2. **Isolation**: - With path compression technique, we can call `FIND-SET` operation ([**Fig 3-3.**](#3-15pt)) with $\log h$ time complexity to flattten the tree, where $h$ is the height of the tree. - Connect $k$'s descendants to its parent with $O(1)$ time complexity. <img height=200 src="https://i.imgur.com/DsSzsEu.png"> **Fig 3-3.** Path compression[^disjoint-ppt]. **🔧 Implementation** **Node structure** ```cpp= struct node parent::node leaf::node end ``` **Isolation** ```cpp= function ISOLATE(k::node) p = FIND-SET(k) k.parent = NULL k.leaf.parent = p end ``` - Noted that the `FIND-SET` operation is implemented with **path compression** technique, which can flatten the disjoint set, and make it easy to re-connect `k`'s offspring to its parent. **🔢 Analysis** - The operation is under the scale of $O(M\log N)$ in time complexity. Because the `isloate` operation is under the time complexity of $O(\log h)$ where $h$ is the height of the disjoint tree. ## References

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.