# Advanced Database Systems : Homework 2
# Collaboration work:
- Li-Yuan Lee (ll4627)
- Pin-Yi Chiu (pc3095)
# Question 1
We use AQuery to deal with this problem.
1. Run `python gen.py` to generate `ticks.csv`.
2. cp `ticks.csv` to /data under the AQuery root directory.
3. cp `q1.a` to /tests under the AQuery root directory.
4. Run `python3 prompt.py`
5. Run `f q1.a`
The result for each sub-problem will be `1_a.csv`, `1_b.csv`, `1_c.csv` and `1_d.csv`.
# Question 2
Use `gen2.py` and `gen3.py` to generate `ticks.csv` as our testing data. The table contains approximately 1000000 rows.
1. `gen2.py`: stocksymbol range from 1-1000000, meaning that each entry is unique.
2. `gen3.py`: stocksymbol range from 1-100, meaning that different rows may have same stocksymbol.
**Table**: ticks
| Column | Data Type |
| ------------- | ----------------- |
| stocksymbol | VARCHAR(16) |
| time | INT |
| quantity | INT |
| price | INT |
## Covering Indexes
We compare the query performance between using indexes and without.
### AQuery
In Aquery, we use `ASSUMING` keyword to indicate the index. For example,
```sql
SELECT stocksymbol, time, price
FROM ticks
ASSUMING asc stocksymbol
```
#### Case 1: Each row has distinct stocksymbol
##### Without indexes
10 Execution Time: 0.055, 0.094, 0.072, 0.029, 0.049, 0.084, 0.023, 0.028, 0.091, 0.009
avg = 0.0534 s
for running:
> select * from t1 where stocksymbol=s100
##### With indexes
10 Execution Time: 0.064, 0.037, 0.053, 0.042, 0.071, 0.002, 0.015, 0.034, 0.065, 0.054
avg = 0.0437 s
for running:
> select * from t2 where stocksymbol=s100
#### Case 2: Each row may have same stocksymbol
##### Without indexes
10 Execution Time: 0.136, 0.199, 0.229, 0.168, 0.276, 0.182, 0.224, 0.203, 0.099, 0.149
avg = 0.1865 s
for running:
> select * from t3 where stocksymbol=s100
##### With indexes
10 Execution Time: 0.149, 0.185, 0.250, 0.342, 0.073, 0.254, 0.175, 0.307, 0.246, 0.292
avg = 0.227 s
for running:
> select * from t4 where stocksymbol=s100
### MySQL
We choose MySQL as the other database systems to deal with this problem.
1. cp `ticks2-2.csv` and `ticks2-3.csv` to /data under /tmp (for the local machine) or MySQL root directory.
In our case, we put it in the /tmp directory; therefore, the corresponding fetching data code will be
```sql
LOAD DATA LOCAL INFILE "/tmp/data/ticks2-2.csv" INTO TABLE ticks2a
```
2. Run `q2_INDEXES.sql`.
In MySQL, we create indices using `CREATE INDEX` keyword.
```sql
CREATE INDEX idx ON ticks2a (stocksymbol);
```
#### Case 1: Each row has distinct stocksymbol
##### Without indexes
avg of 10 Execution Time: 521.77 ms
for running:
> SELECT * from ticks2a where stocksymbol = 's100'
##### With indexes
avg of 10 Converting time: 1805.49 ms
avg of 10 Execution Time: 0.30 ms
for running:
covering indices:
> CREATE INDEX idx ON ticks2a (stocksymbol);
run query Execution:
> select * from ticks2a where stocksymbol = 's100'
#### Case 2: Each row may have same stocksymbol
##### Without indexes
avg of 10 Execution Time: 522.42 ms
for running:
> SELECT * from ticks2b where stocksymbol = 's100'
##### With indexes
avg of 10 Converting time: 1765.88 ms
avg of 10 Execution Time: 28.85 ms
### Covering Indexes Conclusion
In AQuery, the improvements from without indexes to with indexes are minor or even none.
One reason for this result might be that in this system, the query is already fast enough without
acceleration of indexes, so we cannot measure any improvements with this suggestion.
On the other hand, the improvements of converting indices in the case of MySQL is significant. The execution time drops from 521 ms to 0.29 ms in case 1 (unique stocksymbol) and 522 ms to 28 ms in case 2 (duplicated stocksymbol). With the index on the column that we query, it becomes quicker to find the data record.
## Eliminate unnecessary DISTINCT
We compare the query performance between using distinct and without. In some cases, the distinct statement might be unnecessary due to the distribution of the data so we want to know if we will reduce the performance by putting on useless DISTINCT statement.
### Aquery
#### Case 1: Each row has distinct stocksymbol
##### WITH DISTINCT
Execution Time: 0.223
for running:
> SELECT DISTINCT stocksymbol, time, quantity, price from ticks2a
##### WITHOUT DISTINCT
Execution Time: 0.226
avg = 0.0678 s
for running:
> SELECT stocksymbol, time, quantity, price from ticks2a
#### Case 2: Each row may have same stocksymbol
##### WITH DISTINCT
Execution Time: 0.244
for running:
> CREATE TABLE t13 AS SELECT DISTINCT stocksymbol, time, quantity, price from ticks2b
##### WITHOUT DISTINCT
Execution Time: 0.229
for running:
> CREATE TABLE t14 AS SELECT stocksymbol, time, quantity, price from ticks2b
### MySQL
Run `q2_DISTINCT.sql`
#### Case 1: Each row has distinct stocksymbol
##### WITH DISTINCT
avg of 10 Execution Time: 11422.50 ms
for running:
> SELECT distinct * from ticks2a;
##### WITHOUT DISTINCT
avg of 10 Execution Time: 1587.64 ms
for running:
> SELECT * from ticks2a;
#### Case 2: Each row may have same stocksymbol
##### WITH DISTINCT
avg of 10 Execution Time: 7535.75 ms
for running:
> SELECT distinct * from ticks2b;
##### WITHOUT DISTINCT
avg of 10 Execution Time: 1537.17 ms
for running:
> SELECT * from ticks2b;
### Eliminate Unnecessary DISTINCT Conclusion
Aquery: In case 1 it does not have any improvements because the data itself is scattered. However, in case 2, the improvements are huge because the stocksymbol is now within a range and if some entries share the same stocksymbol the system will have to check if it is distinct even if it is guaranteed to be so. The extra checking would drag down the efficiency as the experiment shows.
MySQL:
In both cases without the "DISTINCT" statement significantly improved the execution performance.
### The Execution Time Matrix
1. Aquery
| Aquery (Unit: ms) | Case 1 | Case 2 |
| ----------------- | --------- | -------- |
| Without indexes | 53.4 | 186.5 |
| With indexes | 43.7 | 227 |
| With Distinct | 223 | 244 |
| Without Distinct | 226 | 229 |
2. MySQL
| MySQL (Unit: ms) | Case 1 | Case 2 |
| ---------------- | --------- | -------- |
| Without indexes | 521.77 | 522.42 |
| Converting time | 1805.49 | 1765.88 |
| With indexes | 0.30 | 28.85 |
| With Distinct | 11422.50 | 7535.75 |
| Without Distinct | 1587.64 | 1537.17 |
# Question 3
We use AQuery to deal with this problem.
1. cp `friends.csv` and `like.csv` to /data under the AQuery root directory.
2. cp `q3.a` to /tests under the AQuery root directory.
3. Run `python3 prompt.py`
4. Run `f q3.a`
The result will be `q3.csv`.