# Summary
- Implemented the challenge solution in C++ and Python.
- **Time Complexity: O(n log n), Space Complexity: O(n)**
- Developed a data/answer pair generator specifically tailored for the challenge.
```
[Edge case] if multiple entries share the same value:
1426828011 111
1426828028 111
1426828037 111
1426828056 111
1426828058 111
1426828066 111
Top 3 will be 1426828056, 1426828058, 1426828066 (later timestamps win ties; see the C++ sketch below).
```
- In the results section, highlighted the use of multi-threading to improve processing speed.
- Multi-threading optimization resulted in a **1.5X faster** processing time for **datasets over 10 million lines.**
- For datasets with a size of **up to 10 million lines**, running the **sequential** version of the program is sufficient.
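
To make the tie-breaking rule concrete, here is a minimal C++ sketch of the top-K selection (illustrative only, not the exact challenge code; `Entry` and `topK` are hypothetical names): sort by value descending, break ties by the later timestamp, and keep the first K entries, which matches the stated O(n log n) time / O(n) space bounds.
```cpp!
// Illustrative sketch of the tie-breaking rule -- not the exact challenge code.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Entry { long long ts; long long value; };

// Sort by value descending; on ties, the later (larger) timestamp wins.
// O(n log n) time, O(n) space, matching the stated bounds.
std::vector<Entry> topK(std::vector<Entry> entries, std::size_t k) {
    std::sort(entries.begin(), entries.end(), [](const Entry& a, const Entry& b) {
        if (a.value != b.value) return a.value > b.value;
        return a.ts > b.ts;
    });
    if (entries.size() > k) entries.resize(k);
    return entries;
}

int main() {
    std::vector<Entry> data = {
        {1426828011, 111}, {1426828028, 111}, {1426828037, 111},
        {1426828056, 111}, {1426828058, 111}, {1426828066, 111},
    };
    for (const auto& e : topK(data, 3))
        std::cout << e.ts << ' ' << e.value << '\n';
    // Prints 1426828066, 1426828058, 1426828056 -- the edge case above.
}
```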
# File Structure
```txt
Woven
├── data
├── ans
├── src
│ ├── verify.sh
│ ├── benchmark.sh
├── outputs
├── exp
└── prepareTestDataAndAns.sh (shell script to prepare the data/answer)
```
## Data-related folders
- data: Test data. The datasets are too large to ship with the repo, so run `./prepareTestDataAndAns.sh` to generate them, or download them [here](https://drive.google.com/drive/folders/1CY7ud1rErNM0SlkSAuYxgxxBQ1D9n_-B?usp=sharing).
- ans: Expected answers for each test dataset
- outputs: Contains execution results
- exp: Contains experiment results for input files of different sizes, used to measure the performance improvement
## Code-related folder
- src: Contains the source code for the challenge, plus the generator for the test datasets and answers
- `verify.sh`: Verifies the program's output against the self-generated answers
- `prepareTestDataAndAns.sh`: Shell script to prepare the data/answer
- `benchmark.sh`: Shell script for performance evaluation
# How to run
## Environment
```txt!
macOS 13.4 (22F66)
```
## Requirements
```bash!
chmod +x prepareTestDataAndAns.sh;
cd src;
chmod +x verify.sh
chmod +x benchmark.sh
```
## Execution
- If you want to run the provided `*.sh` scripts, first download the data [here](https://drive.google.com/drive/folders/1CY7ud1rErNM0SlkSAuYxgxxBQ1D9n_-B?usp=sharing) into the `data` folder.
```bash!
# under src/
# Sequential version
g++ -std=c++11 -o program sequential.cpp
./program {$input_filepath} {$topK} {$output_file_path}
# Parallel version (the extra argument, 4 here, is the number of threads)
g++ -std=c++11 -o program parallel.cpp
./program {$input_filepath} {$topK} 4 {$output_file_path}
```
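For example, a sequential run might look like this (the file names below are illustrative, not actual repo paths):
```bash!
# illustrative paths only -- substitute your own dataset and output file
g++ -std=c++11 -o program sequential.cpp
./program ../data/dataset_10m.txt 10 ../outputs/top10_seq.txt
```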
## Run verify
```bash!
# under src/, with data prepared and .sh executable
./verify.sh
```
- A successful run confirms that the program output matches the pre-generated answers.

## Data preparation
- Python code generates the test datasets and the corresponding top 1-20 answers for each dataset.
- Generate (dataset, answer) by `./prepareTestDataAndAns.sh`
# Results (sequential vs parallel)
- The experiment results demonstrate that for datasets of **up to 10 million lines**, the **sequential** version of the program is sufficient.
- **Multi-threading optimization** resulted in roughly **1.5X** faster processing for datasets **over 10 million lines** (see the sketch after this list).
- View the experiment result [here](https://docs.google.com/spreadsheets/d/1BvUVn_zX7Ek4Jrm29BwVnBXWDkoytApA6x0pzSQGwCY/edit?usp=sharing)
- In the spreadsheet, 0 denotes the sequential run; other values denote the number of threads.
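
The exact contents of `parallel.cpp` aren't reproduced here; the following is a hedged C++ sketch of one common chunk-and-merge scheme consistent with the command line above (`parallelTopK` and the chunking details are assumptions, not the repo's actual code): each thread computes a local top-K over its slice, and the main thread merges the per-thread winners.
```cpp!
// Hypothetical chunk-and-merge parallel top-K -- a sketch, not the repo's code.
// Compile (Linux needs -pthread; macOS clang links it by default):
//   g++ -std=c++11 -pthread -o parallel_sketch parallel_sketch.cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

struct Entry { long long ts; long long value; };

// Higher value first; on equal values, the later timestamp wins.
static bool better(const Entry& a, const Entry& b) {
    if (a.value != b.value) return a.value > b.value;
    return a.ts > b.ts;
}

std::vector<Entry> parallelTopK(const std::vector<Entry>& data,
                                std::size_t k, unsigned nThreads) {
    std::vector<std::vector<Entry>> locals(nThreads);
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + nThreads - 1) / nThreads;
    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            if (lo >= hi) return;                      // empty slice
            std::vector<Entry> local(data.begin() + lo, data.begin() + hi);
            std::sort(local.begin(), local.end(), better);
            if (local.size() > k) local.resize(k);     // keep local top-K only
            locals[t] = std::move(local);
        });
    }
    for (auto& w : workers) w.join();
    std::vector<Entry> merged;                         // merge per-thread winners
    for (auto& l : locals) merged.insert(merged.end(), l.begin(), l.end());
    std::sort(merged.begin(), merged.end(), better);
    if (merged.size() > k) merged.resize(k);
    return merged;
}

int main() {
    std::vector<Entry> data;
    for (long long i = 0; i < 1000000; ++i) data.push_back({1426828000 + i, i % 997});
    for (const auto& e : parallelTopK(data, 3, 4))
        std::cout << e.ts << ' ' << e.value << '\n';
}
```
Because each thread keeps only K candidates, the final merge is cheap; a plausible reason the speedup only appears past roughly 10 million lines is that thread startup and the merge add fixed overhead that smaller inputs cannot amortize.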

