Benchmark Your Computer Black Box

# Benchmark Your Computer Black Box ## Environment * OS : Ubuntu 18.04.1 * CPU: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz * Menory: 16G * Programming Language(version): C++ ## Execution time * Data Size ： 17.6GB * Algorithms ：K way external sort * Time：1108秒 ![](https://i.imgur.com/eRRlPy2.png) # Program development and usage instructions ## Program development * External sorting algorithm has two part ``` 1. Partition 2. Merge File ``` Partition is to divide a large data set into small file that can be sorted by internal sort. I used tow methods to merge files - two way merge, and k way merge 1. two way merge: Utilize recursive function to merge file two by two 2. K wat merge: Employ a min-heap to merge K file all at onec ## Instruction for use * Random number generator ``` g++ -o rand rand_gen.cpp ./rand 100000000 #nums of data ``` * External sort ``` g++ -o external_sort external_sort.cpp ./external_sort.cpp 100000000 10 10 1 ``` > External_sort.cpp has four parameter 1. data size 2. dividing page number 3. merge how many files 4. two way or K way (two way for 0, K way for 1) --- # Performance report ## 1. Partition optimization ### Optimize internal sort function ![merge sort code](https://i.imgur.com/EQUw6NS.png) This code has poor performance. It has to spend 7 seconds sorting 10 milions data, and 176 for whole data set. Replace vector with vector, but it became even more slow. This is the time used by vector sort ![](https://i.imgur.com/J045knl.png) This is the time used by linked-list sort ![](https://i.imgur.com/5EAZFtF.png) > Discoverory：link list has poor locality > 1. It has to dereference everytime when it pass to its next node > 2. Data access is not continuous > [color=#f948cd] Therefore, I finally used std::sort() to sort divided-files since it is implmented by intro-sort which can optimize its algorithm based on the data size. > And it indeed sort the file more quickly（7 sec to 3 sec） > [color=#87e562] ## 2. I/O optimization > I found that fprintf and fscnf are two functions with lower speed, so I choose to start with this aspect first. In terms of input, I changed it to fgets and then used atoi to transform value and store it in the vector ```cpp while (count < max && fgets(read_buff, READ_SIZE, in_file)){ in.push_back(atoi(read_buff)); count++; } ``` After the actual test, it is found that in the Partition part, it is about 10% faster, and it will be even 40% faster in the merge file. Possible reason： > During merge file, it has huge numbers of I/O, so it would more easily improve the speed. > [color=#e53362] In terms of output, I set up an output buffer to store the data to be output. Use stringstream to set the format to a certain length in the buffer and then use fputs to output it uniformly. ```cpp= output_buffer << in[i] << "\n"; buf_counter++; if(buf_counter == WRITE_SIZE){ fputs(output_buffer.str().c_str(), out); buf_counter = 0; output_buffer.str(""); } ``` Using this method can make the output about 5% faster, and can also use more memory resources. And it can effectively reduce the number of disk accesses. ## 3. Algorithms optimization Changing from two way external sort to k way external sort can effectively improve the sorting speed, because by merging multiple files at one time, the redundant Disk I/O access in the original two way can be removed. And it can save a lot of sorting time at the beginning of Partition. * two way sort time(100 million data) ![](https://i.imgur.com/MygKi9U.png) * K way sort time k = 10(100 million data) ![](https://i.imgur.com/QzRFXKB.png) Improve the speed by 30% * Compare the cache miss two way ![](https://i.imgur.com/3tNhlsH.png) K way ![](https://i.imgur.com/NKsKIJS.png) Cache miss of K way sort is lesser ## 4. Use perf and htop to observe the hardware usage, and then find ways to optimize ![](https://i.imgur.com/FQ3RslQ.png) From htop, we can know the behavior ratio of the hardware in the partition stage. It can be seen that the MEM% is only 14%. This is the maximum usage that I can achieve at this stage, because if I continue increasing the buffer capacity, the program will be killed directly. ![](https://i.imgur.com/LYajh6Y.png) In the merge stage, since there is an additional heap to act as a buffer, the memory usage can be more. Then use perf stat to observe the CPU and branch miss situation ![](https://i.imgur.com/B6RCULO.png) We can seen that the CPU usage is consistent with that displayed in htop during the execution period, so the CPU usage is quite in line with expectations, and the branch miss is only 1.28%, which is as low as we can control. cache miss ![](https://i.imgur.com/NKsKIJS.png) It is observed here that the cache miss is as high as 55%, so there should be room for improvement. Use perf record to record the cache miss situation, the parameter here is `-e cache-misses` ![](https://i.imgur.com/n1fO0zh.png) `copy_user_enhanced_fast_string accounted` for 28% ![](https://i.imgur.com/n8DzGgb.png) After reading the assembly code, I speculate that this code may be generated when the streamstring buffer and atoi call #### Possible optimization In order to reduce the use of streamstring, I decided to use fwrite to output binary file during partition, so that the number of atoi can be reduced to half. The file output speed will also be faster. The disadvantage is that you need to convert the files back to string files after finishing all the files, which is a sacrifice of conversion time to achieve faster performance. #### Comparing Execution time before optimization ![](https://i.imgur.com/Q8NEqek.jpg) Execution time after optimization ![](https://i.imgur.com/QQDIbxC.png) The sorting speed is 250 seconds faster after optimization. Successfully reduced cache miss by about 5% (I thought it would drop more, and the streamstring that may remain still accounts for a large part) ![](https://i.imgur.com/doWXLEv.png) In addition, I originally tried to use mmap to map the memory to the file for operation, and the performance has also been improved, but the range is not very large. I speculate that it is because most of the time for each input and output should be taken up by atoi, and finally the speed becomes faster after reducing atoi, which proves that the previous speculation should be correct. > By the way, if you turn on compiler optimization when compiling > `g++ -o external_sort external_sort.cpp -O2` > The performance will increase sharply (100 million records will change from 65 seconds to 25 seconds), but since this is not something I optimized myself, I only want to ask questions here. ## 5. Multi-threaded Programming Execution Analysis In the case of executing multiple programs at the same time, the hardware resources will be allocated by the OS, and it is necessary to ensure that each program can eat CPU resources without interrupting any programs. Due to the disk capacity, here we first use 200 million pieces of data to record ![](https://i.imgur.com/vq9Us77.png) To begin with, I only run two programs at the same time for comparison. Since my computer has an 8-core processor, the OS allows different programs to run on two CPUs, so the current performance will not be greatly affected. When using perf to observe the hardware behavior, I found that the context switch will be much different from the time when only one is executed. ![](https://i.imgur.com/MVnLyBc.png) `When only one is executed, the context-switch is only 1640` ![](https://i.imgur.com/5tMQbIV.png) When I execute two programs, there will be more context-switches, because although different CPUs can be executed on the IO bus, the order of use must be sorted, so the context-switches will increase dramatically. Then, I execute 4 programs at a time ![](https://i.imgur.com/BgITh23.png) We can see that four CPUs are fully occupied Besides, when I execute four programs at the same time, all the webpages I opened will crash ![](https://i.imgur.com/m6p95rd.png) I deduce that the os judged that because the webpage occupies cpu space, the resources are reserved for the programs that need it first, so the webpage will crash. ![](https://i.imgur.com/1fAEYwO.png) Finally, it can be seen that the context-switch is also increasing rapidly, because the number of programs to be executed in the schedule is increased, so the context-switch will soar. There is another point that cpu-migration will also be high. It can't be seen when two are executed at the same time, but it will become very high when four are executed at the same time. ![](https://i.imgur.com/EIkhPjF.jpg) [Reference](https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.ldva/ldva_c_CPU_scheduling.html) Linux scheduler is a multi-scheduling program. It will move the waiting program to another CPU when the estimated waiting time is too long or the queue is full and other columns are waiting to be replenished. Take my example, Because executing four sorting programs at a time may cause the problem of waiting in the queue for too long, the OS will frequently move the programs to the CPU to be operated, but the migration will also cause time costs, so the OS will also consider the migration cost as whether to use the CPU Important factor for migrating to another CPU. # Conclusion I think the service of the OS is to be the housekeeper of the computer, responsible for controlling the parts of the computer to process the commands issued by the master, and we are the master. When multiple commands are issued at one time, the OS must sort which program to compare It is important, which part is empty now and use it quickly. You have used it for too long now, so stop the execution and use it with other programs. This is the function of the os. If I want to say what optimization services the OS can provide, I think we should minimize cpu-migration, because the cost of migration is relatively high. I think the same program should be tied to the same CPU instead of jumping all the time. It should be better to go to such an efficiency.