Advanced Database Systems : Homework 2

# Advanced Database Systems : Homework 2 # Collaboration work: - Li-Yuan Lee (ll4627) - Pin-Yi Chiu (pc3095) # Question 1 We use AQuery to deal with this problem. 1. Run `python gen.py` to generate `ticks.csv`. 2. cp `ticks.csv` to /data under the AQuery root directory. 3. cp `q1.a` to /tests under the AQuery root directory. 4. Run `python3 prompt.py` 5. Run `f q1.a` The result for each sub-problem will be `1_a.csv`, `1_b.csv`, `1_c.csv` and `1_d.csv`. # Question 2 Use `gen2.py` and `gen3.py` to generate `ticks.csv` as our testing data. The table contains approximately 1000000 rows. 1. `gen2.py`: stocksymbol range from 1-1000000, meaning that each entry is unique. 2. `gen3.py`: stocksymbol range from 1-100, meaning that different rows may have same stocksymbol. **Table**: ticks | Column | Data Type | | ------------- | ----------------- | | stocksymbol | VARCHAR(16) | | time | INT | | quantity | INT | | price | INT | ## Covering Indexes We compare the query performance between using indexes and without. ### AQuery In Aquery, we use `ASSUMING` keyword to indicate the index. For example, ```sql SELECT stocksymbol, time, price FROM ticks ASSUMING asc stocksymbol ``` #### Case 1: Each row has distinct stocksymbol ##### Without indexes 10 Execution Time: 0.055, 0.094, 0.072, 0.029, 0.049, 0.084, 0.023, 0.028, 0.091, 0.009 avg = 0.0534 s for running: > select * from t1 where stocksymbol=s100 ##### With indexes 10 Execution Time: 0.064, 0.037, 0.053, 0.042, 0.071, 0.002, 0.015, 0.034, 0.065, 0.054 avg = 0.0437 s for running: > select * from t2 where stocksymbol=s100 #### Case 2: Each row may have same stocksymbol ##### Without indexes 10 Execution Time: 0.136, 0.199, 0.229, 0.168, 0.276, 0.182, 0.224, 0.203, 0.099, 0.149 avg = 0.1865 s for running: > select * from t3 where stocksymbol=s100 ##### With indexes 10 Execution Time: 0.149, 0.185, 0.250, 0.342, 0.073, 0.254, 0.175, 0.307, 0.246, 0.292 avg = 0.227 s for running: > select * from t4 where stocksymbol=s100 ### MySQL We choose MySQL as the other database systems to deal with this problem. 1. cp `ticks2-2.csv` and `ticks2-3.csv` to /data under /tmp (for the local machine) or MySQL root directory. In our case, we put it in the /tmp directory; therefore, the corresponding fetching data code will be ```sql LOAD DATA LOCAL INFILE "/tmp/data/ticks2-2.csv" INTO TABLE ticks2a ``` 2. Run `q2_INDEXES.sql`. In MySQL, we create indices using `CREATE INDEX` keyword. ```sql CREATE INDEX idx ON ticks2a (stocksymbol); ``` #### Case 1: Each row has distinct stocksymbol ##### Without indexes avg of 10 Execution Time: 521.77 ms for running: > SELECT * from ticks2a where stocksymbol = 's100' ##### With indexes avg of 10 Converting time: 1805.49 ms avg of 10 Execution Time: 0.30 ms for running: covering indices: > CREATE INDEX idx ON ticks2a (stocksymbol); run query Execution: > select * from ticks2a where stocksymbol = 's100' #### Case 2: Each row may have same stocksymbol ##### Without indexes avg of 10 Execution Time: 522.42 ms for running: > SELECT * from ticks2b where stocksymbol = 's100' ##### With indexes avg of 10 Converting time: 1765.88 ms avg of 10 Execution Time: 28.85 ms ### Covering Indexes Conclusion In AQuery, the improvements from without indexes to with indexes are minor or even none. One reason for this result might be that in this system, the query is already fast enough without acceleration of indexes, so we cannot measure any improvements with this suggestion. On the other hand, the improvements of converting indices in the case of MySQL is significant. The execution time drops from 521 ms to 0.29 ms in case 1 (unique stocksymbol) and 522 ms to 28 ms in case 2 (duplicated stocksymbol). With the index on the column that we query, it becomes quicker to find the data record. ## Eliminate unnecessary DISTINCT We compare the query performance between using distinct and without. In some cases, the distinct statement might be unnecessary due to the distribution of the data so we want to know if we will reduce the performance by putting on useless DISTINCT statement. ### Aquery #### Case 1: Each row has distinct stocksymbol ##### WITH DISTINCT Execution Time: 0.223 for running: > SELECT DISTINCT stocksymbol, time, quantity, price from ticks2a ##### WITHOUT DISTINCT Execution Time: 0.226 avg = 0.0678 s for running: > SELECT stocksymbol, time, quantity, price from ticks2a #### Case 2: Each row may have same stocksymbol ##### WITH DISTINCT Execution Time: 0.244 for running: > CREATE TABLE t13 AS SELECT DISTINCT stocksymbol, time, quantity, price from ticks2b ##### WITHOUT DISTINCT Execution Time: 0.229 for running: > CREATE TABLE t14 AS SELECT stocksymbol, time, quantity, price from ticks2b ### MySQL Run `q2_DISTINCT.sql` #### Case 1: Each row has distinct stocksymbol ##### WITH DISTINCT avg of 10 Execution Time: 11422.50 ms for running: > SELECT distinct * from ticks2a; ##### WITHOUT DISTINCT avg of 10 Execution Time: 1587.64 ms for running: > SELECT * from ticks2a; #### Case 2: Each row may have same stocksymbol ##### WITH DISTINCT avg of 10 Execution Time: 7535.75 ms for running: > SELECT distinct * from ticks2b; ##### WITHOUT DISTINCT avg of 10 Execution Time: 1537.17 ms for running: > SELECT * from ticks2b; ### Eliminate Unnecessary DISTINCT Conclusion Aquery: In case 1 it does not have any improvements because the data itself is scattered. However, in case 2, the improvements are huge because the stocksymbol is now within a range and if some entries share the same stocksymbol the system will have to check if it is distinct even if it is guaranteed to be so. The extra checking would drag down the efficiency as the experiment shows. MySQL: In both cases without the "DISTINCT" statement significantly improved the execution performance. ### The Execution Time Matrix 1. Aquery | Aquery (Unit: ms) | Case 1 | Case 2 | | ----------------- | --------- | -------- | | Without indexes | 53.4 | 186.5 | | With indexes | 43.7 | 227 | | With Distinct | 223 | 244 | | Without Distinct | 226 | 229 | 2. MySQL | MySQL (Unit: ms) | Case 1 | Case 2 | | ---------------- | --------- | -------- | | Without indexes | 521.77 | 522.42 | | Converting time | 1805.49 | 1765.88 | | With indexes | 0.30 | 28.85 | | With Distinct | 11422.50 | 7535.75 | | Without Distinct | 1587.64 | 1537.17 | # Question 3 We use AQuery to deal with this problem. 1. cp `friends.csv` and `like.csv` to /data under the AQuery root directory. 2. cp `q3.a` to /tests under the AQuery root directory. 3. Run `python3 prompt.py` 4. Run `f q3.a` The result will be `q3.csv`.

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.