# 0501

## change

### openmp

- [[commit]](https://github.com/johnnychang25678/fastcode-group-knn/commit/84ef2bc56aa650fce747007f172157beca8d798a#diff-57603cb78d2a8c195035f00ed617de2e70bac3e446e039bd99fadb16638ced18) changed the class instances to flat arrays; this improves performance a lot, **but I don't know why**

### cuda

- [[commit]](https://github.com/johnnychang25678/fastcode-group-knn/commit/f7fe9ccd50105b3f246398de911ef9c109282fa7#diff-e413327b2dab2eef61f40b41075ceab761a67fabba2571a1533328cb9f26124c) refactored based on the commit mentioned above
- based on the [cuda profiling tool report](https://github.com/johnnychang25678/fastcode-group-knn/blob/master/cuda/nvprof_analysis.txt), the performance bottleneck is copying memory to and from the device
  - this is because we currently call the kernel function once per test data point
  - so if we could avoid moving data back and forth between GPU and CPU, we could improve further, but I don't know how :(
  - on the other hand, if the time needed to compute each test data point is much larger than the time to copy memory, we could get better performance than the openmp version
  - openmp also helps a lot here, since the test points are independent of each other and we can call the kernel function concurrently

### test scripts

- [[folder]](https://github.com/johnnychang25678/fastcode-group-knn/tree/master/test) added test automation scripts, including data generation
  - note: the previous data format was wrong; this is fixed in the new test scripts

## tests

https://docs.google.com/spreadsheets/d/1MCjgwqM2Ql31gttKCCWihSmpNLk32YkeYN6OJLkM9MU/edit

- Test 1: data size
- Test 2: thread count
- Test 3: feature count
- Test 4: k

### data size

All times in seconds:

| | baseline | optimized | openmp | cuda |
|-|-|-|-|-|
| dataset1 | 0.653994 | 0.304389 | 0.03192 | 0.485728 |
| dataset2 | 6.129181 | 3.052307 | 0.174879 | 2.629451 |
| dataset3 | 106.835553 | 52.606437 | 2.392566 | 46.774043 |

- dataset1: train 4096, test 1924
- dataset2: train 16384, test 4096
- dataset3: train 65546, test 16384

insights:

1. openmp >> cuda >= optimized > baseline
2.
in the cuda version, copying data slows down the overall speed; openmp can leverage parallelism in this situation without that cost
3. actually, I have some doubts about this result, since the openmp version looks too good. Maybe something went wrong somewhere? (I did not test the accuracy of the program, and we cannot, because all the data is randomly generated)

### number of threads

| threads | openmp (sec) |
|-|-|
| 1 | 1.980847 |
| 2 | 1.008233 |
| 4 | 0.517781 |
| 8 | 0.293457 |
| 16 | 0.165917 |
| 32 | 0.115757 |

insight: the speedup grows almost linearly with the number of threads used

### number of features

| features | openmp (sec) | cuda (sec) |
|-|-|-|
| 3 | 0.126619 | 2.594525 |
| 100 | 0.512585 | 1.802557 |
| 500 | 2.133146 | 1.498704 |

insights:

1. cuda outperforms openmp when the data has a large number of features
2. it is weird that cuda gets *faster* as the data gets larger. It could reasonably beat openmp at large feature counts, but it would make more sense for its own runtime to grow

### change K in KNN

| | openmp (sec) |
|-|-|
| k=3 | 0.177975 |
| k=10 | 0.178506 |
| k=25 | 0.179105 |
| k=50 | 0.180661 |
| k=100 | 0.180669 |
| k=200 | 0.1846 |

insight: a larger k does hurt performance slightly, but it is not a key factor