1. Partition
2. Merge File
Partition divides a large data set into small files, each of which fits in memory and can be sorted with an internal sort.
I used two methods to merge files: two-way merge and k-way merge.
g++ -o rand rand_gen.cpp
./rand 100000000  # number of records to generate
g++ -o external_sort external_sort.cpp
./external_sort 100000000 10 10 1
external_sort takes four parameters.
I replaced the vector with a linked list, but it became even slower.
This is the time taken by the vector-based sort:
Discovery: linked lists have poor locality
- Every traversal step dereferences a pointer to reach the next node
- The data is not stored contiguously in memory
Therefore, I finally used std::sort() to sort the partitioned files, since it is implemented as introsort, which switches between quicksort, heapsort, and insertion sort depending on the input.
It did sort each file noticeably faster (7 s down to 3 s).
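The sort step above can be sketched as follows. This is a minimal illustration, not the actual external_sort.cpp code; the function name sort_chunk is hypothetical, and it assumes each chunk already fits in memory:

```cpp
#include <algorithm>
#include <vector>

// Sort one in-memory chunk with std::sort, which is introsort:
// quicksort that falls back to heapsort on deep recursion and
// insertion sort on small ranges.
std::vector<int> sort_chunk(std::vector<int> chunk) {
    std::sort(chunk.begin(), chunk.end());
    return chunk;
}
```

In the real program each sorted chunk would then be written back to its partition file.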
I found that fprintf and fscanf are relatively slow functions, so I chose to start optimizing there.
For input, I switched to fgets, then used atoi to convert each line to an integer and store it in the vector:
while (count < max && fgets(read_buff, READ_SIZE, in_file)) {
    in.push_back(atoi(read_buff));  // one number per line
    count++;
}
In actual tests, the Partition phase became about 10% faster, and the merge phase up to 40% faster.
Possible reason: the merge phase performs a huge amount of I/O, so it benefits more from faster input.
For output, I set up an output buffer to collect the data to be written.
I use a stringstream to format each value to a fixed length in the buffer, then flush it all at once with fputs:
output_buffer << in[i] << "\n";
buf_counter++;
if (buf_counter == WRITE_SIZE) {
    fputs(output_buffer.str().c_str(), out);
    buf_counter = 0;
    output_buffer.str("");  // clear the buffer after flushing
}
This made output about 5% faster: it trades extra memory for fewer disk accesses.
Switching from a two-way to a k-way external sort effectively improves sorting speed: merging multiple files in a single pass removes the redundant disk I/O of the repeated two-way merge passes, and it also saves a lot of sorting time at the start of the Partition phase.
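The k-way merge can be sketched with a min-heap. This is a hedged in-memory illustration (the function name k_way_merge is hypothetical; in the real program each run is a file read through a buffer, not a vector): each element is merged in O(log k), and all k runs are consumed in a single pass.

```cpp
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Merge k sorted runs with a min-heap: repeatedly pop the smallest
// head element, then push the next element from the run it came from.
std::vector<int> k_way_merge(const std::vector<std::vector<int>>& runs) {
    // (value, run index, position within that run)
    using Entry = std::tuple<int, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty())
            heap.emplace(runs[r][0], r, 0);

    std::vector<int> out;
    while (!heap.empty()) {
        auto [val, r, i] = heap.top();
        heap.pop();
        out.push_back(val);
        if (i + 1 < runs[r].size())
            heap.emplace(runs[r][i + 1], r, i + 1);
    }
    return out;
}
```

A chain of two-way merges would instead rewrite the same data log2(k) times, which is exactly the redundant disk I/O the k-way version avoids.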
K-way sort time with k = 10 (100 million records)
Comparing cache misses
Two-way:
Then I used perf stat to observe CPU usage and branch misses
Cache misses:
I used perf record to record where cache misses occur; the parameter here is -e cache-misses.
copy_user_enhanced_fast_string accounted for 28%.
To reduce the use of stringstream, I decided to use fwrite to output binary files during partition, which cuts the number of atoi conversions in half and also makes file output faster. The disadvantage is that the result must be converted back to a text file after all files are finished; conversion time is sacrificed for faster overall performance.
Execution time before optimization
Execution time after optimization
This successfully reduced cache misses by about 5%.
(I expected a bigger drop; the remaining stringstream usage probably still accounts for a large share.)
By the way, if you turn on compiler optimization when compiling:
g++ -o external_sort external_sort.cpp -O2
performance increases sharply (100 million records go from 65 seconds to 25 seconds), but since this is not something I optimized myself, I only note it here.
When multiple programs execute at the same time, hardware resources are allocated by the OS, which must ensure that each program gets CPU time without starving any of them.
Due to disk capacity limits, I use 200 million records for these measurements.
To begin with, I ran only two programs at the same time for comparison. Since my computer has an 8-core processor, the OS can run the two programs on different CPUs, so performance is not greatly affected.
Observing the hardware behavior with perf, I found that the context-switch count differs greatly from the single-program run.
With only one program running, there were just 1640 context switches.
With two programs, there are far more: although they execute on different CPUs, access to the I/O bus must still be serialized, so context switches increase dramatically.
Next, I executed 4 programs at a time.
We can see that four CPUs are fully occupied.
Also, when I ran four programs at the same time, all the web pages I had open crashed.
I deduce that the OS judged the sorting programs to need the CPU more, reserved resources for them first, and so the web pages crashed.
Finally, the context-switch count also rises rapidly: with more programs in the scheduler's queue, context switches soar. Another point is that cpu-migrations also become high; they are barely visible with two concurrent programs but become very high with four.
Reference
The Linux scheduler is a multi-queue scheduler: it moves a waiting program to another CPU when the estimated wait is too long, or when one run queue is full while another has room. In my case, executing four sorting programs at once can make queue waits too long, so the OS frequently migrates programs to a less loaded CPU. But migration also has a time cost, so the OS weighs that cost as an important factor when deciding whether to migrate a task to another CPU.
I think the OS serves as the computer's housekeeper, responsible for directing the computer's parts to process the commands that we, the master, issue. When multiple commands arrive at once, the OS must decide which program is more important, which resource is currently free and should be put to use right away, and which program has run too long and should yield to others. This is the function of the OS.
If I had to name an optimization service the OS could provide, I would minimize cpu-migrations, because migration is relatively expensive. I think the same program should be tied to the same CPU instead of jumping around all the time; that should be more efficient.
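The idea of tying a program to one CPU can be sketched with Linux's sched_setaffinity (Linux-only; the function name pin_to_cpu is my own, and the real program would call it once at startup):

```cpp
#include <sched.h>

// Pin the calling process to a single CPU so the scheduler will not
// migrate it; a sketch of "tie the same program to the same CPU".
bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means the current process
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

Equivalently, from the shell one can launch the program with taskset, e.g. taskset -c 0 ./external_sort ..., which achieves the same pinning without code changes.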