
2020 OS Fall HW1: External sorting

Environment

  • OS: Ubuntu 20.04.1 LTS
  • Arch: x86-64
  • CPU: Intel® Core i7-6700HQ CPU @ 2.60GHz
  • Memory: 8 GB
  • Programming Language: C

Programming Task

  • Sort integers in the input file and save the result in the output file.

:information_source: The input file is larger than the computer's memory, so external sorting is required.

Usage

```bash
# Compile
make

# Generates integers
# e.g. ./generate 100000 example.txt
# default number of integers: 3e9
# default filename: "input.txt"
./generate [number-of-integers] [output-filename]

# Sorts the input file and writes to the output file
# e.g. ./sort example.txt
# default filename: "input.txt"
./sort [input-filename]

# Checks if a file is arranged in ascending order
# e.g. ./check example.txt
# default filename: "output.txt"
./check [filename]

# Removes unnecessary files
make clean
```

Programs

GitHub Repo

int32_generator.c

  • creates a file that contains a given number of random integers
  • default values:
    • number of integers: 3,000,000,000 (3e9)
    • output filename: input.txt
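
  • A minimal sketch of the idea behind the generator (the actual implementation lives in the repo; the rand()-based generation here is an assumption, and rand() alone does not cover the full 32-bit range):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative sketch: writes `count` random integers, one per line.
 * The real int32_generator.c may generate the numbers differently. */
int main(int argc, char *argv[])
{
    size_t count = (argc > 1) ? strtoull(argv[1], NULL, 10) : 3000000000ULL;
    const char *filename = (argc > 2) ? argv[2] : "input.txt";

    FILE *fp = fopen(filename, "w");
    if (!fp) {
        perror("fopen");
        return 1;
    }
    srand(time(NULL));
    for (size_t i = 0; i < count; i++)
        fprintf(fp, "%d\n", rand()); /* rand() yields [0, RAND_MAX] */
    fclose(fp);
    return 0;
}
```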

main.c

  • sorts the input file and writes the result to an output file
  • default value:
    • input filename: input.txt

order_checker.c

  • checks if the input file is arranged in ascending order or not
  • default value:
    • input filename: output.txt
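
  • Checking is a single linear scan; a minimal sketch of the idea (the exact output messages are illustrative):

```c
#include <stdio.h>

/* Illustrative sketch: verifies that the integers in the file
 * are arranged in ascending order. */
int main(int argc, char *argv[])
{
    const char *filename = (argc > 1) ? argv[1] : "output.txt";
    FILE *fp = fopen(filename, "r");
    if (!fp) {
        perror("fopen");
        return 1;
    }

    int prev, curr;
    if (fscanf(fp, "%d", &prev) == 1) {
        while (fscanf(fp, "%d", &curr) == 1) {
            if (curr < prev) {
                printf("not sorted\n");
                fclose(fp);
                return 1;
            }
            prev = curr;
        }
    }
    printf("sorted\n");
    fclose(fp);
    return 0;
}
```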

Development Process

  • I use the macro MAX_CHUNK_SIZE to set the maximum number of integers that can be stored in each split file, and the macro MAX_FILES to set the maximum number of files that can be opened at the same time.

  • The basic concepts of my external sorting program are as follows:

    1. Read the input file MAX_CHUNK_SIZE integers at a time.
    2. Sort the integers chunk by chunk (the size of each chunk is defined by MAX_CHUNK_SIZE; a sketch of this step follows the list) and store each sorted chunk in a temporary file in binary form.
    3. Merge the sorted temporary files by calling merge_files.
    4. Done!
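
  • The in-memory sort in step 2 is elided from the snippets below; a minimal sketch using the standard library's qsort (the comparison function is my assumption; the repo may sort each chunk differently):

```c
#include <stdlib.h>

/* Ascending comparison for qsort; (x > y) - (x < y) avoids the
 * signed overflow that the common `return x - y;` idiom risks. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts one in-memory chunk of `len` integers before it is
 * written out to its temporary file. */
static void sort_chunk(int *n, size_t len)
{
    qsort(n, len, sizeof(int), cmp_int);
}
```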
  • How merge_files works (a sketch of the heap machinery these steps rely on follows this list):

  1. Open the sorted files and store the file pointers in a file pointer array.

  2. Read the first integer from each file and store the integers, together with their source file indices, in a node_t array.

  3. Build a min-heap from the array so that the smallest integer sits at the root.

  4. Write the value of the smallest node to the output file and mark the file index it came from as the current file index.

  5. Read the next integer from the current file. If EOF is reached, go to step 5.1. Otherwise, go to step 5.2.

    1. Swap the smallest node with the last node in the heap and decrease the size of the heap by 1. If the size is zero, go to step 8; otherwise, go to step 6.

    2. Replace the smallest node in the heap with the value read in step 5 and go to step 6.

  6. Update the heap with min_heapify.

  7. Write the value of the updated smallest node to the output file and update the current file index if necessary. Go back to step 5.

  8. Done!
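
  • A sketch of that heap machinery. node_t, the node/tail fields, swap, and min_heapify match the snippets later in this post; the struct layout itself is my reconstruction:

```c
#include <stddef.h>

typedef struct {
    int value; /* the integer read from a temporary file */
    int index; /* which temporary file it came from */
} node_t;

typedef struct {
    size_t tail;  /* index of the last valid node; the heap is 1-indexed */
    node_t *node; /* node[1] is always the smallest element */
} heap_t;

static void swap(node_t *a, node_t *b)
{
    node_t t = *a;
    *a = *b;
    *b = t;
}

/* Restores the min-heap property for the subtree rooted at i. */
static void min_heapify(heap_t *h, size_t i)
{
    size_t l = 2 * i, r = 2 * i + 1, smallest = i;

    if (l <= h->tail && h->node[l].value < h->node[smallest].value)
        smallest = l;
    if (r <= h->tail && h->node[r].value < h->node[smallest].value)
        smallest = r;
    if (smallest != i) {
        swap(&h->node[i], &h->node[smallest]);
        min_heapify(h, smallest);
    }
}
```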

  • However, things went wrong:

  • fopen failure:

  • I noticed that this error only occurred when the program tried to keep more than 1021 files open with fopen at the same time, which matches the default per-process open-file limit on Linux (1024) minus the three standard streams.

  • Therefore, I set the maximum number of files to 1000, and came up with a solution for the case where the number of split files exceeds that limit.

  • I wrote an algorithm that merges up to 1000 files at a time until all files are merged into one file, which is the output file we want.

  • code snippet:

```c
#define MAX_FILES 1000

static void external_sort(const char *f_in)
{
    ...
    size_t total_files = split_file_and_sort(f_in);
    ...
    do {
        /* Merges MAX_FILES of sorted files to one file */
        total_files = merge_files(total_files);
    } while (total_files > 1);
    ...
}

static size_t merge_files(size_t total_files)
{
    /* Merges MAX_FILES of files */
    for (size_t i = 0; i * MAX_FILES < total_files; i++) {
        ...
    }
    return (total_files - 1) / MAX_FILES + 1;
}
```
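
  • To make the numbers concrete: with the chunk size of 1,000,000 used in the tests below, the default input of 3,000,000,000 integers splits into 3000 temporary files. The first call to merge_files merges them 1000 at a time into 3 files (matching the return value (3000 - 1) / MAX_FILES + 1 = 3), and the second call merges those 3 into the single output file.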

Performance Analysis

  • The goal of performance analysis is to minimize the execution time of sort.

Test 1

  • setup values:

| variable | value |
| --- | --- |
| total integers | 100,000,000 |
| maximum chunk size | 1,000,000 |
  • elapsed time:

| function | elapsed time (sec) |
| --- | --- |
| split_file_and_sort | 48.895494 |
| merge_files | 39.844204 |

  • use perf to show the results (in cycles):

```bash
perf report -g --no-children
```
  • result:

  • We can see that merge and min_heapify are the most time-consuming functions in sort, so we should optimize merge and min_heapify for a better result.

  • It is also clear that __vfscanf_internal, called from fscanf (shown in the screenshot below), is the second most time-consuming function, so we need to find an alternative to fscanf.

Solutions

  1. Replace fscanf:
  • original code snippet of split_file_and_sort:
```c
static size_t split_file_and_sort(const char *f_in)
{
    ...
    size_t i, scan_stat = 0;
    for (i = 0; scan_stat != EOF; i++) {
        /* Read MAX_CHUNK_SIZE of integers from the input file */
        int j;
        for (j = 0; j < MAX_CHUNK_SIZE; j++) {
            /* Checks EOF */
            if ((scan_stat = fscanf(fp_in, "%d", &n[j])) == EOF)
                break;
        }
        ...
    }
    ...
}
```
  • replace fscanf with fgets:
```c
static size_t split_file_and_sort(const char *f_in)
{
    ...
    size_t i;
    char buf[BUFFER_SIZE];
    char *scan_stat = buf; /* non-NULL so the loop body runs at least once */
    for (i = 0; scan_stat != NULL; i++) {
        /* Read MAX_CHUNK_SIZE of integers from the input file */
        int j;
        for (j = 0; j < MAX_CHUNK_SIZE; j++) {
            /* Checks EOF */
            if ((scan_stat = fgets(buf, BUFFER_SIZE, fp_in)) == NULL)
                break;
            buf[strlen(buf) - 1] = '\0'; /* strips the trailing newline */
            n[j] = atoi(buf);
        }
        ...
    }
    ...
}
```
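
  • One caveat of this version: atoi returns 0 on malformed input without reporting an error, and buf[strlen(buf) - 1] = '\0' assumes every line ends with a newline. strtol would be more robust, but for input produced by generate this is acceptable.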
  • result after replacing fscanf with fgets:

| function | elapsed time (before) | elapsed time (after) |
| --- | --- | --- |
| split_file_and_sort | 48.895494 secs | 41.440600 secs |

  • The results showed that fgets is faster than fscanf, likely because fscanf must parse its format string on every call. Therefore, I tried to replace as many fscanf calls as I could.

  • I also noticed that merge_files called fscanf heavily when reading data from the temporary files, so fread might be a better choice.

  • However, fread reads raw bytes, so the temporary files must be stored in binary form. (Binary is also more compact: a 32-bit integer always takes 4 bytes, while its text form can take up to 11 characters plus a newline.)

  • As a result, I also needed to replace fprintf with fwrite in split_file_and_sort when storing data in the temporary files.

  • original code snippet of split_file_and_sort, merge_files and external_sort:

  • split_file_and_sort (original)

```c
static size_t split_file_and_sort(const char *f_in)
{
    ...
    for (i = 0; scan_stat != NULL; i++) {
        ...
        char f_out[MAX_FILENAME_LENGTH];
        sprintf(f_out, "tmp/%lu.txt", i + 1);
        FILE *fp_out = open_file(f_out, "w");
        /* Writes the sorted array n to f_out */
        int int_count = j;
        for (j = 0; j < int_count; j++)
            fprintf(fp_out, "%d\n", n[j]);
        fclose(fp_out);
    }
    ...
}
```
  • merge_files (original)
```c
static size_t merge_files(size_t total_files)
{
    /* Merges MAX_FILES of files */
    for (size_t i = 0; i * MAX_FILES < total_files; i++) {
        ...
        for (size_t j = 0; j < size; j++) {
            char filename[MAX_FILENAME_LENGTH];
            sprintf(filename, "tmp/%lu.txt", i * MAX_FILES + j + 1);
            fp[j] = open_file(filename, "r");
        }
        ...
        /* Reads the first integers from each file
         * and stores them in the heap */
        for (int j = 0; j < size; j++) {
            fscanf(fp[j], "%d", &h->node[j + 1].value);
            h->node[j + 1].index = j;
        }
        ...
        /* Writes the smallest value to the output file
         * and names it "tmp_x.txt" */
        char f_out[MAX_FILENAME_LENGTH];
        sprintf(f_out, "tmp/tmp_%lu.txt", i + 1);
        FILE *fp_out = open_file(f_out, "w");
        fprintf(fp_out, "%d\n", get_min_value(h));
        int curr = get_min_index(h);
        while (1) {
            int tmp;
            /* End of the current file */
            if (fscanf(fp[curr], "%d", &tmp) == EOF) {
                /* Closes the sorted file and deletes it */
                fclose(fp[curr]);
                char filename[MAX_FILENAME_LENGTH];
                sprintf(filename, "tmp/%lu.txt", i * MAX_FILES + curr + 1);
                remove(filename);
                /* Updates min_heap */
                swap(&h->node[h->tail--], &h->node[1]);
                if (h->tail == 0)
                    break;
            } else {
                h->node[1].value = tmp;
                h->node[1].index = curr;
            }
            min_heapify(h, 1);
            fprintf(fp_out, "%d\n", get_min_value(h));
            curr = get_min_index(h);
        }
        fclose(fp_out);
        ...
    }
    return (total_files - 1) / MAX_FILES + 1;
}
```
  • external_sort (original)
```c
static void external_sort(const char *f_in)
{
    ...
    size_t total_files = split_file_and_sort(f_in);
    ...
    do {
        /* Merges MAX_FILES of sorted files to one file */
        total_files = merge_files(total_files);
    } while (total_files > 1);
    ...
    /* Moves the sorted file from tmp/ to current directory */
    char old[] = "tmp/1.txt", new[] = "output.txt";
    rename(old, new);
    rmdir("tmp");
}
```
  • replace fprintf with fwrite, and replace fscanf with fread:
  • split_file_and_sort (new)
```c
static size_t split_file_and_sort(const char *f_in)
{
    ...
    for (i = 0; scan_stat != NULL; i++) {
        ...
        char f_out[MAX_FILENAME_LENGTH];
        sprintf(f_out, "tmp/%lu.txt", i + 1);
        FILE *fp_out = open_file(f_out, "wb");
        /* Writes the sorted array n to f_out */
        fwrite(n, sizeof(int), j, fp_out);
        fclose(fp_out);
    }
    fclose(fp_in);
    return i; /* The number of sorted files */
}
```
  • merge_files (new)
```c
static size_t merge_files(size_t total_files)
{
    /* Merges MAX_FILES of files */
    for (size_t i = 0; i * MAX_FILES < total_files; i++) {
        ...
        for (size_t j = 0; j < size; j++) {
            char filename[MAX_FILENAME_LENGTH];
            sprintf(filename, "tmp/%lu.txt", i * MAX_FILES + j + 1);
            fp[j] = open_file(filename, "rb");
        }
        ...
        /* Reads the first integers from each file
         * and stores them in the heap */
        for (int j = 0; j < size; j++) {
            fread(&h->node[j + 1].value, sizeof(int), 1, fp[j]);
            h->node[j + 1].index = j;
        }
        ...
        /* Writes the smallest value to the output file
         * and names it "tmp_x.txt" */
        char f_out[MAX_FILENAME_LENGTH];
        sprintf(f_out, "tmp/tmp_%lu.txt", i + 1);
        FILE *fp_out = open_file(f_out, "wb");
        fwrite(&h->node[1].value, sizeof(int), 1, fp_out);
        int curr = get_min_index(h);
        while (1) {
            int tmp;
            /* End of the current file */
            if (fread(&tmp, sizeof(int), 1, fp[curr]) == 0) {
                /* Closes the sorted file and deletes it */
                fclose(fp[curr]);
                char filename[MAX_FILENAME_LENGTH];
                sprintf(filename, "tmp/%lu.txt", i * MAX_FILES + curr + 1);
                remove(filename);
                /* Updates min_heap */
                swap(&h->node[h->tail--], &h->node[1]);
                if (h->tail == 0)
                    break;
            } else {
                h->node[1].value = tmp;
                h->node[1].index = curr;
            }
            min_heapify(h, 1);
            fwrite(&h->node[1].value, sizeof(int), 1, fp_out);
            curr = get_min_index(h);
        }
        fclose(fp_out);
        ...
    }
    return (total_files - 1) / MAX_FILES + 1;
}
```
  • external_sort (new)
```c
static void external_sort(const char *f_in)
{
    ...
    size_t total_files = split_file_and_sort(f_in);
    ...
    do {
        /* Merges MAX_FILES of sorted files to one file */
        total_files = merge_files(total_files);
    } while (total_files > 1);

    char f_tmp[] = "tmp/1.txt", f_out[] = "output.txt";
    FILE *fp_tmp = open_file(f_tmp, "rb"), *fp_out = open_file(f_out, "w");
    int tmp;
    while (fread(&tmp, sizeof(int), 1, fp_tmp) != 0)
        fprintf(fp_out, "%d\n", tmp);
    fclose(fp_tmp);
    fclose(fp_out);
    remove(f_tmp);
    rmdir("tmp");
    ...
}
```
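
  • Note the extra pass at the end of the new external_sort: since the merged result tmp/1.txt is now binary, it is converted back to text with fprintf so that output.txt keeps the original one-integer-per-line format, which is presumably what check expects.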
  • result:

| function | elapsed time (before) | elapsed time (after) |
| --- | --- | --- |
| split_file_and_sort | 41.440600 secs | 33.272671 secs |
| merge_files | 39.846767 secs | 31.794889 secs |

  • According to the results, I successfully reduced the execution time of split_file_and_sort and merge_files, and __vfscanf_internal no longer appeared in the red zone (the red zone marks the most time-consuming functions).
