# DB 2021/09/01
## Environment
## Database Setting
- Data Migration: Hermes
- Number of Servers: 1
- Number of RTE: 1
- Buffer Size: 524288 (the smallest size that fits the whole database in the buffer)
- Java Heap Size:
- sequencer: 16GB
- server: 16GB
- client: 8GB
    - According to the GC log, only young-generation GC (YoungGC) occurs.
- Thread Pool: 50
- COMM_BATCH_SIZE: 1
- SCHEDULE_BATCH_SIZE: 1
## Workload Setting
### Common settings
- Workload Type: YCSB-Simple (records are chosen following a Zipfian distribution)
- Number of Record(INIT_RECORD_PER_PART): 10000
- ZIPFIAN_PARAMETER: 0.99
- DIST_TX_RATE: 0
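Under these settings, the record keys accessed by each operation are drawn from a bounded Zipfian distribution over the 10,000 records. A minimal NumPy sketch of such a draw (the variable names and seed are ours, not part of the benchmark code):

```python
import numpy as np

N = 10_000   # INIT_RECORD_PER_PART
s = 0.99     # ZIPFIAN_PARAMETER

# Bounded Zipfian: P(rank i) proportional to 1 / i^s, normalized over N records.
ranks = np.arange(1, N + 1)
weights = 1.0 / ranks**s
probs = weights / weights.sum()

rng = np.random.default_rng(0)
keys = rng.choice(N, size=5, p=probs)  # e.g. 5 record ids for one transaction
```

With s = 0.99 the distribution is heavily skewed toward low-ranked (hot) records, which is the standard YCSB access pattern.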
### Varied across workload settings
- RW_TX_RATE: 0, 0.5, 1
- TX_RECORD_COUNT: 2, 10, 100
For brevity, Rec-2_RW-0 denotes the setting RW_TX_RATE = 0, TX_RECORD_COUNT = 2.
## Models
Similar to MB2 by Pavlo et al., implemented with the scikit-learn package:
1. Random Forest Regression(RFR)
2. Kernel Ridge Regression(KRR)
3. Huber Regression(HR)
4. Support Vector Regression(SVR)
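The four candidates can be instantiated with scikit-learn roughly as follows. This is a sketch: the hyperparameters shown are illustrative defaults, not the values used in these experiments, and `best_model` is a helper name of our own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import HuberRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.svm import SVR

# Candidate regressors; hyperparameters here are illustrative assumptions.
MODELS = {
    "RFR": RandomForestRegressor(n_estimators=100, random_state=0),
    "KRR": KernelRidge(alpha=1.0),
    "HR":  HuberRegressor(),
    "SVR": SVR(kernel="rbf"),
}

def best_model(X_train, y_train, X_test, y_test):
    """Fit every candidate and return (name, MAPE) of the best one."""
    scores = {}
    for name, model in MODELS.items():
        model.fit(X_train, y_train)
        scores[name] = mean_absolute_percentage_error(
            y_test, model.predict(X_test))
    winner = min(scores, key=scores.get)
    return winner, scores[winner]
```

Selecting the model with the lowest test MAPE per OU/workload is what produces the per-cell winners reported in the Results tables.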
## Metric
In MB2, the relative error is described as
$$
Relative \ Error = \frac{|Actual - Predict|}{Actual}
$$
However, the paper does not give a precise definition of the relative error. Since the authors implement their models with the scikit-learn package, we use **Mean Absolute Percentage Error (MAPE)** as the evaluation metric, which can be seen as a concrete definition of the relative error.
$$
MAPE(y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples} - 1} \frac{|y_i - \hat{y}_i|}{\max(\epsilon, |y_i|)}
$$
where $y_i$ is the label of the $i$-th sample and $\hat{y}_i$ is the corresponding predicted value. $\epsilon$ is an arbitrarily small positive number substituted for the divisor to avoid division by zero when $|y_i|$ is 0.
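The formula above translates directly into NumPy; a minimal version, using machine epsilon for $\epsilon$ as scikit-learn does:

```python
import numpy as np

def mape(y_true, y_pred, eps=np.finfo(np.float64).eps):
    """Mean Absolute Percentage Error, matching the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / np.maximum(eps, np.abs(y_true))))
```

This agrees with `sklearn.metrics.mean_absolute_percentage_error`; e.g. predicting 110 and 180 for true values 100 and 200 gives a MAPE of 0.1.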
## Data Preprocessing
### Step 0. Features & Label
Follow the spec of [this](https://hackmd.io/@pywang/r1UFn0oCd#)
Label: the latency of each of the following OUs (we train one model per OU):
- Generate Execution Plan Latency
- Execute SP Arithmetic Logic Latency
- Write to Local Storage Latency
- Transaction Commit Latency
Features:
- the number of reads in the read set
- the number of writes in the write set
- the number of active threads
- thread pool size
- current CPU utilization
- number of cache reads
- number of cache inserts
- number of cache updates
- number of arithmetic operations
- number of write-back records
- number of bytes of write-back records
- number of read/write records
- number of log flush bytes
### Step 1. Drop Outliers
Since the variance is very large, we **drop** the data points **beyond mean +/- 1 standard deviation**.
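With the data in a pandas DataFrame (an assumption on our part; the column name `latency` below is hypothetical), the filter can be sketched as:

```python
import pandas as pd

def drop_outliers(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Keep only rows whose value in `col` lies within mean +/- 1 std."""
    mu, sigma = df[col].mean(), df[col].std()
    return df[df[col].between(mu - sigma, mu + sigma)]
```

For a normally distributed column this would keep roughly 68% of the rows; with the heavy-tailed latencies here it mostly trims the long tail.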
### Step 2. Sampling
We sample 10,000 transactions from each dataset and use 80% for training and 20% for testing, since training on the whole dataset takes about one night.
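A sketch of the sampling and split, with synthetic arrays standing in for the real dataset (the 13 columns correspond to the feature list in Step 0; the shapes, distributions, and seeds are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_all = rng.normal(size=(50_000, 13))   # placeholder for the 13 features of Step 0
y_all = rng.exponential(size=50_000)    # placeholder for an OU latency label

# Sample 10,000 transactions, then split 80% train / 20% test.
idx = rng.choice(len(X_all), size=10_000, replace=False)
X_train, X_test, y_train, y_test = train_test_split(
    X_all[idx], y_all[idx], test_size=0.2, random_state=0)
```

Fixing `random_state` keeps the split reproducible across the per-OU training runs.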
#### Latency Histogram of Rec-2_RW-0
1. Execute SP Arithmetic Logic OU
**Original**

**After Dropping Outliers**

**After Sampling**

2. Write to Local Storage OU
**Original**

**After Dropping Outliers**

**After Sampling**

3. Transaction Commits OU
**Original**

**After Dropping Outliers**

**After Sampling**

4. Generate Execution Plan OU
**Original**

**After Dropping Outliers**

**After Sampling**

#### Latency Histogram of Rec-10_RW-0.5
1. Execute SP Arithmetic Logic OU
**Original**

**After Dropping Outliers**

**After Sampling**

2. Write to Local Storage OU
**Original**


**After Dropping Outliers**


**After Sampling**


3. Transaction Commits OU
**Original**


**After Dropping Outliers**


**After Sampling**


4. Generate Execution Plan OU
**Original**

**After Dropping Outliers**

**After Sampling**

# Results
For each OU and workload, we only show the best model, i.e., the one with the lowest relative error (MAPE) on the testing dataset.
| OU \ Workload | Rec-2_RW-0 | Rec-2_RW-1 | Rec-2_RW-0.5 |
|:--------------------------- |:----------- |:----------- |:------------ |
| Execute SP Arithmetic Logic | RFR: 0.1109 | HR: 0.1399 | RFR: 0.1791 |
| Write to Local Storage | RFR: 0.1452 | RFR: 0.0604 | RFR: 0.1107 |
| Transaction Commits | RFR: 0.1068 | RFR: 0.1717 | RFR: 0.1340 |
| Generate Execution Plan | RFR: 0.3497 | KRR: 0.3555 | RFR: 0.2677 |
| OU \ Workload | Rec-10_RW-0 | Rec-10_RW-1 | Rec-10_RW-0.5 |
|:--------------------------- |:----------- |:----------- |:------------- |
| Execute SP Arithmetic Logic | SVR: 0.0551 | RFR: 0.0842 | RFR: 0.0908 |
| Write to Local Storage | RFR: 0.0772 | RFR: 0.0398 | RFR: 0.0873 |
| Transaction Commits | SVR: 0.0613 | RFR: 0.0876 | RFR: 0.1073 |
| Generate Execution Plan | RFR: 0.1668 | KRR: 0.1522 | RFR: 0.1234 |
| OU \ Workload | Rec-100_RW-0 | Rec-100_RW-1 | Rec-100_RW-0.5 |
|:--------------------------- |:------------ |:------------ |:-------------- |
| Execute SP Arithmetic Logic | RFR: 0.0786 | RFR: 0.0692 | RFR: 0.064 |
| Write to Local Storage | RFR: 0.0691 | RFR: 0.0276 | HR: 0.0413 |
| Transaction Commits | RFR: 0.07 | RFR: 0.0588 | RFR: 0.0605 |
| Generate Execution Plan | RFR: 0.1704 | RFR: 0.0518 | RFR: 0.0484 |
# Conclusion
1. The more records per transaction, the lower the error.
2. The higher the proportion of read-only transactions, the higher the error for the Generate Execution Plan OU.
3. All in all, except for the Generate Execution Plan OU, the OUs are predicted well.
# Appendix
## A1. MB2 results

## A2. Box Plot
### The definition of the fliers/outliers in `matplotlib.pyplot.boxplot`

### Latency Histogram of Rec-2_RW-0
#### 1. Execute SP Arithmetic Logic OU





#### 2. Write to Local Storage OU



#### 3. Transaction Commits OU



#### 4. Generate Execution Plan OU



### Latency Histogram of Rec-10_RW-0.5
#### 1. Execute SP Arithmetic Logic OU





#### 2. Write to Local Storage OU



#### 3. Transaction Commits OU



#### 4. Generate Execution Plan OU



# Note
SVR on Rec-2_RW-0.5 has extremely high error: 36.3408/31.7461
SVR on Rec-10_RW-0.5 Flush OU has extremely high error:
Add the outlier plots,
and show the accuracy for different dataset sizes.
1. Show the mean and std of each OU at different dataset sizes
2. Rename the OUs
3. Confirm whether GC occurs