# Summary
- Implemented the challenge solution in C++ and Python.
- **Time Complexity: O(n log n), Space Complexity: O(n)**
- Developed a data/answer pair generator specifically tailored for the challenge.
```
[Edge case] if multiple entries share the same value:
1426828011 111
1426828028 111
1426828037 111
1426828056 111
1426828058 111
1426828066 111
Top 3 will be 1426828056, 1426828058, 1426828066 (later timestamps win ties; see the C++ sketch below).
```
- In the results section, highlighted the use of multi-threading to improve processing speed.
- Multi-threading optimization resulted in a **1.5X faster** processing time for **datasets over 10 million lines.**
- For datasets with a size of **up to 10 million lines**, running the **sequential** version of the program is sufficient.
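
To make the tie-breaking rule concrete, here is a minimal C++ sketch of the top-K selection (illustrative only, not the exact challenge code; `Entry` and `topK` are hypothetical names): sort by value descending, break ties by the later timestamp, and keep the first K entries, which matches the stated O(n log n) time / O(n) space bounds.
```cpp!
// Illustrative sketch of the tie-breaking rule -- not the exact challenge code.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Entry { long long ts; long long value; };

// Sort by value descending; on ties, the later (larger) timestamp wins.
// O(n log n) time, O(n) space, matching the stated bounds.
std::vector<Entry> topK(std::vector<Entry> entries, std::size_t k) {
    std::sort(entries.begin(), entries.end(), [](const Entry& a, const Entry& b) {
        if (a.value != b.value) return a.value > b.value;
        return a.ts > b.ts;
    });
    if (entries.size() > k) entries.resize(k);
    return entries;
}

int main() {
    std::vector<Entry> data = {
        {1426828011, 111}, {1426828028, 111}, {1426828037, 111},
        {1426828056, 111}, {1426828058, 111}, {1426828066, 111},
    };
    for (const auto& e : topK(data, 3))
        std::cout << e.ts << ' ' << e.value << '\n';
    // Prints 1426828066, 1426828058, 1426828056 -- the edge case above.
}
```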
# File Structure
```txt
Woven
├── data
├── ans
├── src
│ ├── verify.sh
│ ├── benchmark.sh
├── outputs
├── exp
└── prepareTestDataAndAns.sh (shell script to prepare the data/answer)
```
## Data-related folders
- data: Test data. The datasets are too large to ship with the repo, so run `./prepareTestDataAndAns.sh` to generate them, or download them [here](https://drive.google.com/drive/folders/1CY7ud1rErNM0SlkSAuYxgxxBQ1D9n_-B?usp=sharing).
- ans: Expected answers for each test dataset
- outputs: Contains execution results
- exp: Contains experiment results for input files of different sizes, used to measure the performance improvement
## Code-related folder
- src: Contains the source code for the challenge, plus the generator for the test datasets and answers
- `verify.sh`: Verifies the program's output against the self-generated answers
- `prepareTestDataAndAns.sh`: Shell script to prepare the data/answer
- `benchmark.sh`: Shell script for performance evaluation
# How to run
## Environment
```txt!
macOS 13.4 (22F66)
```
## Requirements
```bash!
chmod +x prepareTestDataAndAns.sh;
cd src;
chmod +x verify.sh
chmod +x benchmark.sh
```
## Execution
- If you want to run the provided `*.sh` scripts, first download the data [here](https://drive.google.com/drive/folders/1CY7ud1rErNM0SlkSAuYxgxxBQ1D9n_-B?usp=sharing) into the `data` folder.
```bash!
# under src/
# Sequential version
g++ -std=c++11 -o program sequential.cpp
./program {$input_filepath} {$topK} {$output_file_path}
# Parallel version (the extra argument, 4 here, is the number of threads)
g++ -std=c++11 -o program parallel.cpp
./program {$input_filepath} {$topK} 4 {$output_file_path}
```
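For example, a sequential run might look like this (the file names below are illustrative, not actual repo paths):
```bash!
# illustrative paths only -- substitute your own dataset and output file
g++ -std=c++11 -o program sequential.cpp
./program ../data/dataset_10m.txt 10 ../outputs/top10_seq.txt
```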
## Run verify
```bash!
# under src/, with data prepared and .sh executable
./verify.sh
```
- A successful run confirms that the program output matches the pre-generated answers.

## Data preparation
- Python code generates the test datasets and the corresponding top 1-20 answers for each dataset.
- Generate (dataset, answer) by `./prepareTestDataAndAns.sh`
# Results (sequential vs parallel)
- The experiment results demonstrate that for datasets of **up to 10 million lines**, the **sequential** version of the program is sufficient.
- **Multi-threading optimization** resulted in roughly **1.5X** faster processing for datasets **over 10 million lines** (see the sketch after this list).
- View the experiment result [here](https://docs.google.com/spreadsheets/d/1BvUVn_zX7Ek4Jrm29BwVnBXWDkoytApA6x0pzSQGwCY/edit?usp=sharing)
- In the spreadsheet, 0 denotes the sequential run; other values denote the number of threads.
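
The exact contents of `parallel.cpp` aren't reproduced here; the following is a hedged C++ sketch of one common chunk-and-merge scheme consistent with the command line above (`parallelTopK` and the chunking details are assumptions, not the repo's actual code): each thread computes a local top-K over its slice, and the main thread merges the per-thread winners.
```cpp!
// Hypothetical chunk-and-merge parallel top-K -- a sketch, not the repo's code.
// Compile (Linux needs -pthread; macOS clang links it by default):
//   g++ -std=c++11 -pthread -o parallel_sketch parallel_sketch.cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

struct Entry { long long ts; long long value; };

// Higher value first; on equal values, the later timestamp wins.
static bool better(const Entry& a, const Entry& b) {
    if (a.value != b.value) return a.value > b.value;
    return a.ts > b.ts;
}

std::vector<Entry> parallelTopK(const std::vector<Entry>& data,
                                std::size_t k, unsigned nThreads) {
    std::vector<std::vector<Entry>> locals(nThreads);
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + nThreads - 1) / nThreads;
    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            if (lo >= hi) return;                      // empty slice
            std::vector<Entry> local(data.begin() + lo, data.begin() + hi);
            std::sort(local.begin(), local.end(), better);
            if (local.size() > k) local.resize(k);     // keep local top-K only
            locals[t] = std::move(local);
        });
    }
    for (auto& w : workers) w.join();
    std::vector<Entry> merged;                         // merge per-thread winners
    for (auto& l : locals) merged.insert(merged.end(), l.begin(), l.end());
    std::sort(merged.begin(), merged.end(), better);
    if (merged.size() > k) merged.resize(k);
    return merged;
}

int main() {
    std::vector<Entry> data;
    for (long long i = 0; i < 1000000; ++i) data.push_back({1426828000 + i, i % 997});
    for (const auto& e : parallelTopK(data, 3, 4))
        std::cout << e.ts << ' ' << e.value << '\n';
}
```
Because each thread keeps only K candidates, the final merge is cheap; a plausible reason the speedup only appears past roughly 10 million lines is that thread startup and the merge add fixed overhead that smaller inputs cannot amortize.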

