:information_source: The size of the input file is larger than the memory size of the computer, so external sorting is required.
- `int32_generator.c` generates 3,000,000,000 (3e9) integers and writes them to `input.txt`.
- `main.c` externally sorts the integers in `input.txt` and writes the sorted result to `output.txt`.
- `order_checker.c` checks that the integers in `output.txt` are in the correct order.
I use the macro `MAX_CHUNK_SIZE` to set the maximum number of integers that can be stored in each split file, and the macro `MAX_FILES` to set the maximum number of files that can be opened at the same time.
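For reference, a minimal sketch of the two macros, using the values from the benchmark setup later in this post:

```c
#define MAX_CHUNK_SIZE 1000000  /* max integers stored in each split file */
#define MAX_FILES      1000     /* max files opened at the same time */
```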
The basic concepts of my external sorting program are as follows:

1. Split the input file into at most `MAX_FILES` files, each containing at most `MAX_CHUNK_SIZE` number of integers from the input file.
2. Sort each chunk (up to `MAX_CHUNK_SIZE` integers) and store the sorted results in the temporary file in binary form.
3. Merge the sorted files into the output file with `merge_files`.

How `merge_files` works (a C sketch follows the steps below):
1. Open the sorted files and store the file pointers in a file pointer array.
2. Read the first integer of each file, and store the integers and the corresponding file indices in the `node_t` array.
3. Create a heap and use heap sort to arrange the integers in ascending order.
4. Write the value of the smallest node to the output file and mark the file index (where the smallest node came from) as the current file index.
5. Read the next integer from the current file. If EOF is reached, go to step 6.1. Otherwise, go to step 6.2.
6. Depending on the result of step 5:
   1. (EOF) Swap the smallest node with the last node in the heap and decrease the size of the heap by 1. If the size is zero, go to step 9; otherwise, go to step 7.
   2. (not EOF) Replace the smallest node in the heap with the value from step 5.
7. Update the heap by using min-heapify.
8. Write the value of the updated smallest node to the output file and update the current file index if necessary. Repeat from step 5.
9. Done!
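A minimal sketch of these steps, assuming binary temporary files of `int`s (as in the final version of the program); `node_t` and `min_heapify` mirror the names used in this post, while the remaining details are simplified:

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int value;     /* current integer read from this node's file */
    int file_idx;  /* index of the temporary file it came from */
} node_t;

/* Restore the min-heap property for the subtree rooted at i. */
static void min_heapify(node_t *heap, int size, int i)
{
    int smallest = i;
    int l = 2 * i + 1, r = 2 * i + 2;
    if (l < size && heap[l].value < heap[smallest].value) smallest = l;
    if (r < size && heap[r].value < heap[smallest].value) smallest = r;
    if (smallest != i) {
        node_t tmp = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        min_heapify(heap, size, smallest);
    }
}

/* Merge n_files sorted binary files of ints into out (steps 1-9). */
static void merge_files(FILE **fps, int n_files, FILE *out)
{
    node_t *heap = malloc(n_files * sizeof(node_t));
    int size = 0, v;

    /* Steps 1-2: read the first integer of each file into the node array. */
    for (int i = 0; i < n_files; i++)
        if (fread(&v, sizeof(int), 1, fps[i]) == 1)
            heap[size++] = (node_t){ .value = v, .file_idx = i };

    /* Step 3: build the min-heap. */
    for (int i = size / 2 - 1; i >= 0; i--)
        min_heapify(heap, size, i);

    /* Steps 4-8: emit the smallest value, then refill from its file. */
    while (size > 0) {
        fwrite(&heap[0].value, sizeof(int), 1, out);
        if (fread(&v, sizeof(int), 1, fps[heap[0].file_idx]) == 1)
            heap[0].value = v;       /* step 6.2: replace the root */
        else
            heap[0] = heap[--size];  /* step 6.1: shrink the heap */
        min_heapify(heap, size, 0);  /* step 7: min-heapify */
    }
    free(heap);
}
```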
However, things went wrong…

`fopen` failure:
I noticed that this error only happened when the system tried to open more than 1021 files with `fopen` at the same time (likely because the default per-process limit is 1024 open file descriptors, and stdin, stdout, and stderr already occupy three of them).
Therefore, I set the limit of maximum files to 1000, and I also came up with a solution for the case where the number of split files exceeds the limit: an algorithm that merges 1000 files at a time until all files are merged into one file, which is the output file we wanted.
code snippet:
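The snippet below is a minimal sketch of this multi-pass merge, assuming the `merge_files` routine sketched earlier and a hypothetical `tmp_<pass>_<index>.bin` naming scheme for the temporary files (error handling omitted):

```c
#include <stdio.h>

#define MAX_FILES 1000  /* merge at most 1000 files per pass */

/* Assumed helper, sketched earlier: merges sorted binary files into out. */
void merge_files(FILE **fps, int n_files, FILE *out);

/* Hypothetical naming scheme: chunk `idx` of pass `pass` is "tmp_<pass>_<idx>.bin". */
static void chunk_name(char *buf, size_t n, int pass, int idx)
{
    snprintf(buf, n, "tmp_%d_%d.bin", pass, idx);
}

/* Merge groups of up to MAX_FILES files per pass until one file remains,
 * then rename it to the final output path. */
void merge_all(int n_chunks, const char *output_path)
{
    int pass = 0;
    char name[64];

    while (n_chunks > 1) {
        int n_out = 0;
        for (int begin = 0; begin < n_chunks; begin += MAX_FILES) {
            int end = begin + MAX_FILES > n_chunks ? n_chunks : begin + MAX_FILES;

            FILE *fps[MAX_FILES];
            for (int i = begin; i < end; i++) {
                chunk_name(name, sizeof(name), pass, i);
                fps[i - begin] = fopen(name, "rb");
            }

            chunk_name(name, sizeof(name), pass + 1, n_out++);
            FILE *out = fopen(name, "wb");
            merge_files(fps, end - begin, out);

            fclose(out);
            for (int i = 0; i < end - begin; i++)
                fclose(fps[i]);
        }
        n_chunks = n_out;
        pass++;
    }
    chunk_name(name, sizeof(name), pass, 0);
    rename(name, output_path);  /* the single remaining file is the output */
}
```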
Then I measured the performance of `sort`.

setup values:
variable | value |
---|---|
total integers | 100,000,000 |
maximum chunk size | 1,000,000 |
elapsed time:
function | elapsed time (sec) |
---|---|
`split_file_and_sort` | 48.895494 |
`merge_files` | 39.844204 |
use `perf` to show the results (in cycles):
result:
We can see that `merge` and `min_heapify` are the most time-consuming functions in `sort`, so we should optimize `merge` and `min_heapify` for a better result.
It is also clear that `__vfscanf_internal`, called from `fscanf` (shown in the screenshot below), is the second most time-consuming function, so we need to find an alternative way to replace `fscanf`.
`fscanf` (screenshot):

`split_file_and_sort` (original):

replace `fscanf` with `fgets`:
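A minimal sketch of the replacement, assuming the text files store one integer per line (function names are hypothetical; the real code inlines this in `split_file_and_sort`):

```c
#include <stdio.h>
#include <stdlib.h>

/* Before: one fscanf call per integer, which re-parses the format
 * string on every call. */
int read_int_fscanf(FILE *fp, int *out)
{
    return fscanf(fp, "%d", out) == 1;
}

/* After: read one line with fgets, then convert it with strtol. */
int read_int_fgets(FILE *fp, int *out)
{
    char line[32];  /* enough for a 32-bit integer plus newline */
    if (!fgets(line, sizeof(line), fp))
        return 0;   /* EOF or read error */
    *out = (int)strtol(line, NULL, 10);
    return 1;
}
```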
result after replacing `fscanf` with `fgets`:
function | elapsed time (before) | elapsed time (after) |
---|---|---|
`split_file_and_sort` | 48.895494 secs | 41.440600 secs |
The results showed that it is faster to use `fgets` than `fscanf`. Therefore, I tried to replace as many `fscanf` calls as I could.
I also noticed that I used a lot of `fscanf` calls in `merge_files` when reading data from the temporary files, so using `fread` might be a better choice.
However, `fread` reads files in binary form, so the temporary files must be stored in binary form. As a result, I needed to replace `fprintf` with `fwrite` in `split_file_and_sort` when storing data in the temporary files.
original code snippet of `split_file_and_sort`, `merge_files` and `external_sort`:

`split_file_and_sort` (original)

`merge_files` (original)

`external_sort` (original)

replace `fprintf` with `fwrite`, and replace `fscanf` with `fread`:
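A minimal sketch of the change, assuming each sorted chunk is held in an `int` array (names are hypothetical; the real functions also handle error checking and chunk bookkeeping):

```c
#include <stdio.h>

/* Before: text output, one formatted write per integer. */
void write_chunk_text(FILE *fp, const int *chunk, size_t n)
{
    for (size_t i = 0; i < n; i++)
        fprintf(fp, "%d\n", chunk[i]);
}

/* After: a single bulk binary write for the whole sorted chunk. */
void write_chunk_binary(FILE *fp, const int *chunk, size_t n)
{
    fwrite(chunk, sizeof(int), n, fp);
}

/* Reading side: fread replaces fscanf when consuming the binary
 * temporary files; returns the number of integers actually read. */
size_t read_chunk_binary(FILE *fp, int *chunk, size_t n)
{
    return fread(chunk, sizeof(int), n, fp);
}
```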
`split_file_and_sort` (new)

`merge_files` (new)

`external_sort` (new)

result:
function | elapsed time (before) | elapsed time (after) |
---|---|---|
`split_file_and_sort` | 41.440600 secs | 33.272671 secs |
`merge_files` | 39.846767 secs | 31.794889 secs |
According to the results, I successfully reduced the execution time of `split_file_and_sort` and `merge_files`, and `__vfscanf_internal` was no longer in the red zone (the red zone marks the time-consuming functions in the perf report).