# DSA Final Project Report
## Strategies for each type of query
### 1. Expression-Match
We assumed that the length of token is fixed so that the string comparison takes $O(1)$ times for the layout to be more concise.
- **Data Structure & Algorithm**
We use **hashing** and **infix tree**(we call it expression tree) to solve this problem.
Since this problem requires us to check whether a token in expression is also in each mail, we think that traversing a mail one time to put all its tokens to hash table is a clever way, which makes it $O(1)$ for expression to determine whether the token exists in given mail.
And we build an infix tree for expression since it only need to be built one time for a query and the time complexity is $O(n)$, futhermore, it can skip some unnecessary calculations, the method is described in scheduling strategy.
- **Cost estimation**
Let $M$ be the length of expression and $n$ be the amount of mails.
We build the expression tree first, which takes $O(M)$ times.
Before queries, we have to create hash table. Which is approximate
$O($**sum of the mails len**$)$
For each mail:
We need:
$O(M)$ to check every token in the expression.
$O(M)$ time to calculate the expression tree.
So its $O(M)$ totally.
If we have n mails, than it becomes $O(nM)$.
- **Scheduling Strategy**
**1. Hashing**
With hashing, we can do queries with fast speed, so we use hasing to find whether the token is in the mail. Moreover, we hash all the mail before doing queries. With this method, we can avoid creating hash tables repeatedly. Which can decrease the time complexity a lot.
**2. Expression Tree**
We use an expression tree such that the operator is internal node and the token is either internal node or leaf, which can simplify the calculation by the property of boolean operator. For example, look at the image below, if A is true than we don't need to consider B, or if C is false than we don't need to consider D.
> precidence: | < & < ! < ()
e.g. (vms)|(puyat)|(glace)&(7)
-> (A)|(B)|(C\)&(D)
->
### 2. Find-Similar(Message-ID, Threshold)
- **Data Structure & Algorithm**
We use **hash** to solve the problem since it's easy to determine whether a token are both in mails[mid] and in given mail throughout hash table. Moreover, we can use the same hash table as in Expression-Match.
- **Cost estimation**
Let mails[mid] be $A$, mail for comparison be $B$.
Let *length of $A$ be $N$, *length of B* be $M$, and we have n mails.
- hash:
We reused the hash table, which was built in advance, so it don't take any time.
- total:
After hashing, we only need $O(N)$ times to find the intersection of A and B (check if the token exists in B by hash table, which averagely takes $O(1)$ per char).
So the total complexity is
$O(nN)$
- **Scheduling Strategy**
To use the resources efficiently, I reused the hash table in the Expression Match Part.
And according to The principle of displacement, $|A\cup B|=A+B-|A\cap B|$, so the similarity is $\frac{|A\cap B|}{|A\cup B|}=\frac{|A\cap B|}{A+B-|A\cap B|}$.Hence we only need to focus on finding the intersection of A and B, which makes us simplified our code.
### 3. Group-Analyse(Message-IDs)
- **Data Structure & Algorithm**
In this query, since the group in this problem can be related to the concept of set, we implement **disjoint set** by array using path compression and union by size and **hash** the name to some integer for the purpose of using disjoint set more easily.
- **Cost estimation**
Let *lengh of name* to be $N$ and *number of id* be $M$
- hash:
$O(N)$
- disjoint set:
time complexity
| one-by-one | Path Compression | Path Compression + Union by size |
| --- | --- | --- |
| $O(M^2)$ | $O(MlogM)$ | close to $O(M)$ |
- total:
$O(N)*O(M) = O(NM)$
- **Scheduling Strategy**
Since we have implemented this query so far, we have not yet find a strategy to solve the problem.