---
tags: operating systems
---

# 2020 OS Fall HW1: External Sorting

* [GitHub Repo](https://github.com/haojungc/external-sorting)
* [2020_OS_Fall_HW1: Benchmark Your Computer Black Box](https://hackmd.io/@dsclab/r1viIcBSw)

## Environment

* OS: Ubuntu 20.04.1 LTS
* Arch: x86-64
* CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
* Memory: 8 GB
* Programming Language: C

## Programming Task

* Sort the integers in the input file and save the result to the output file.

:::info
:information_source: The input file is larger than the computer's memory, so external sorting is required.
:::

## Usage

```bash=
# Compile
make

# Generates integers
# e.g. ./generate 100000 example.txt
# default number of integers: 3e9
# default filename: "input.txt"
./generate [number-of-integers] [output-filename]

# Sorts the input file and writes the result to the output file
# e.g. ./sort example.txt
# default filename: "input.txt"
./sort [input-filename]

# Checks whether a file is arranged in ascending order
# e.g. ./check example.txt
# default filename: "output.txt"
./check [filename]

# Removes unnecessary files
make clean
```

### Programs

> [GitHub Repo](https://github.com/haojungc/external-sorting)

#### `int32_generator.c`

* creates a file that contains a given number of random integers
* default values:
    * number of integers: `3,000,000,000` (`3e9`)
    * output filename: `input.txt`

#### `main.c`

* sorts the input file and writes the result to an output file
* default value:
    * input filename: `input.txt`

#### `order_checker.c`

* checks whether the input file is arranged in ascending order
* default value:
    * input filename: `output.txt`

## Development Process

* I use the macro `MAX_CHUNK_SIZE` to set the maximum number of integers stored in each split file, and the macro `MAX_FILES` to set the maximum number of files that can be opened at the same time.
* The basic concept of my external sorting program is as follows:
    1. Read the integers from the input file, `MAX_CHUNK_SIZE` at a time.
    2. Sort the integers chunk by chunk (the size of each chunk is defined by `MAX_CHUNK_SIZE`) and store each sorted chunk in a temporary file in binary form.
    3. Merge the sorted temporary files by calling `merge_files`.
    4. Done!
* How `merge_files` works:
    1. Open the sorted files and store the file pointers in a file-pointer array.
    2. Read the first integer of each file and store the integers, together with their file indices, in the `node_t` array.
    3. Build a min-heap from the array so that the smallest integer sits at the root.
    4. Write the value of the smallest node to the output file and mark the file index (where the smallest node came from) as the current file index.
    5. Read the next integer from the current file. If EOF is reached, go to step 6.1; otherwise, go to step 6.2.
    6.
        1. Swap the smallest node with the last node in the heap and decrease the heap size by 1. If the size is zero, go to step 9; otherwise, go to step 7.
        2. Replace the smallest node in the heap with the value from step 5.
    7. Restore the heap property with min-heapify.
    8. Write the value of the new smallest node to the output file, update the current file index if necessary, and repeat from step 5.
    9. Done!
* However, things went wrong...
    * `fopen` failure: ![](https://i.imgur.com/rJiiZ42.png)
    * I noticed that this error only happened when the program tried to keep more than 1021 files open with `fopen` at the same time (the default per-process limit, `RLIMIT_NOFILE`, is 1024 file descriptors, three of which are already taken by `stdin`, `stdout` and `stderr`).
    * Therefore, I set the limit of simultaneously open files to 1000, and I also came up with a solution for the case where the number of split files exceeds the limit.
    * I wrote an algorithm that merges up to 1000 files at a time until all files are merged into a single file, which is the output file we want.
    * code snippet:

```c=
#define MAX_FILES 1000

static void external_sort(const char *f_in)
{
    ...
    size_t total_files = split_file_and_sort(f_in);
    ...
    do {
        /* Merges at most MAX_FILES sorted files into one file */
        total_files = merge_files(total_files);
    } while (total_files > 1);
    ...
}

static size_t merge_files(size_t total_files)
{
    /* Merges groups of at most MAX_FILES files */
    for (size_t i = 0; i * MAX_FILES < total_files; i++) {
        ...
    }
    return (total_files - 1) / MAX_FILES + 1;
}
```

### Performance Analysis

* The goal of performance analysis is to minimize the execution time of `sort`.

#### Test 1

* setup values:

| variable | value |
|:------------------:|:-----------:|
| total integers | 100,000,000 |
| maximum chunk size | 1,000,000 |

* elapsed time:

| function | elapsed time (sec) |
|:---------------------:|:------------------:|
| `split_file_and_sort` | 48.895494 |
| `merge_files` | 39.844204 |

![](https://i.imgur.com/6YWZbT9.png)

* Use `perf` to show the results (in cycles):

```
perf report -g --no-children
```

* result: ![](https://i.imgur.com/4uGluQd.png)
* We can see that `merge` and `min_heapify` are the most time-consuming functions in `sort`, so they are the first candidates for optimization.
* It is also clear that `__vfscanf_internal`, called from `fscanf` (shown in the screenshot below), is another major time consumer, so we need a faster alternative to `fscanf`.

![](https://i.imgur.com/l2x6yvp.png)

#### Solutions

1. Replace `fscanf`:
    * original code snippet of `split_file_and_sort`:

```c=
static size_t split_file_and_sort(const char *f_in)
{
    ...
    size_t i, scan_stat = 0;
    for (i = 0; scan_stat != EOF; i++) {
        /* Reads MAX_CHUNK_SIZE integers from the input file */
        int j;
        for (j = 0; j < MAX_CHUNK_SIZE; j++) {
            /* Checks EOF */
            if ((scan_stat = fscanf(fp_in, "%d", &n[j])) == EOF)
                break;
        }
        ...
    }
    ...
}
```

* replace `fscanf` with `fgets`:

```c=
static size_t split_file_and_sort(const char *f_in)
{
    ...
    size_t i;
    char buf[BUFFER_SIZE];
    char *scan_stat = buf; /* non-NULL so the first iteration runs */
    for (i = 0; scan_stat != NULL; i++) {
        /* Reads MAX_CHUNK_SIZE integers from the input file */
        int j;
        for (j = 0; j < MAX_CHUNK_SIZE; j++) {
            /* Checks EOF */
            if ((scan_stat = fgets(buf, BUFFER_SIZE, fp_in)) == NULL)
                break;
            buf[strlen(buf) - 1] = '\0'; /* strips the trailing newline */
            n[j] = atoi(buf);
        }
        ...
    }
    ...
}
```

* result after replacing `fscanf` with `fgets`:

| function | elapsed time (before) | elapsed time (after) |
|:---------------------:|:---------------------:|:--------------------:|
| `split_file_and_sort` | 48.895494 secs | 41.440600 secs |

![](https://i.imgur.com/El41mn5.png)
![](https://i.imgur.com/6vtxV52.png)

* The results showed that `fgets` is faster than `fscanf`, so I tried to replace as many `fscanf` calls as I could.
* I also noticed that `merge_files` calls `fscanf` heavily when reading data from the temporary files, so `fread` might be a better choice there.
* However, `fread` reads files in binary form, so the temporary files must be stored in binary form.
* As a result, I also needed to replace `fprintf` with `fwrite` in `split_file_and_sort` when storing data in the temporary files.
* original code snippets of `split_file_and_sort`, `merge_files` and `external_sort`:
    * `split_file_and_sort` (original)

```c=
static size_t split_file_and_sort(const char *f_in)
{
    ...
    for (i = 0; scan_stat != NULL; i++) {
        ...
        char f_out[MAX_FILENAME_LENGTH];
        sprintf(f_out, "tmp/%lu.txt", i + 1);
        FILE *fp_out = open_file(f_out, "w");

        /* Writes the sorted array n to f_out */
        int int_count = j;
        for (j = 0; j < int_count; j++)
            fprintf(fp_out, "%d\n", n[j]);
        fclose(fp_out);
    }
    ...
}
```

* `merge_files` (original)

```c=
static size_t merge_files(size_t total_files)
{
    /* Merges groups of at most MAX_FILES files */
    for (size_t i = 0; i * MAX_FILES < total_files; i++) {
        ...
        for (size_t j = 0; j < size; j++) {
            char filename[MAX_FILENAME_LENGTH];
            sprintf(filename, "tmp/%lu.txt", i * MAX_FILES + j + 1);
            fp[j] = open_file(filename, "r");
        }
        ...
        /* Reads the first integer from each file
         * and stores it in the heap */
        for (int j = 0; j < size; j++) {
            fscanf(fp[j], "%d", &h->node[j + 1].value);
            h->node[j + 1].index = j;
        }
        ...
        /* Writes the smallest value to the output file,
         * named "tmp_x.txt" */
        char f_out[MAX_FILENAME_LENGTH];
        sprintf(f_out, "tmp/tmp_%lu.txt", i + 1);
        FILE *fp_out = open_file(f_out, "w");
        fprintf(fp_out, "%d\n", get_min_value(h));
        int curr = get_min_index(h);
        while (1) {
            int tmp;
            /* End of the current file */
            if (fscanf(fp[curr], "%d", &tmp) == EOF) {
                /* Closes the sorted file and deletes it */
                fclose(fp[curr]);
                char filename[MAX_FILENAME_LENGTH];
                sprintf(filename, "tmp/%lu.txt", i * MAX_FILES + curr + 1);
                remove(filename);

                /* Updates min_heap */
                swap(&h->node[h->tail--], &h->node[1]);
                if (h->tail == 0)
                    break;
            } else {
                h->node[1].value = tmp;
                h->node[1].index = curr;
            }
            min_heapify(h, 1);
            fprintf(fp_out, "%d\n", get_min_value(h));
            curr = get_min_index(h);
        }
        fclose(fp_out);
        ...
    }
    return (total_files - 1) / MAX_FILES + 1;
}
```

* `external_sort` (original)

```c=
static void external_sort(const char *f_in)
{
    ...
    size_t total_files = split_file_and_sort(f_in);
    ...
    do {
        /* Merges at most MAX_FILES sorted files into one file */
        total_files = merge_files(total_files);
    } while (total_files > 1);
    ...
    /* Moves the sorted file from tmp/ to the current directory */
    char old[] = "tmp/1.txt", new[] = "output.txt";
    rename(old, new);
    rmdir("tmp");
}
```

* replace `fprintf` with `fwrite`, and `fscanf` with `fread`:
    * `split_file_and_sort` (new)

```c=
static size_t split_file_and_sort(const char *f_in)
{
    ...
    for (i = 0; scan_stat != NULL; i++) {
        ...
        char f_out[MAX_FILENAME_LENGTH];
        sprintf(f_out, "tmp/%lu.txt", i + 1);
        FILE *fp_out = open_file(f_out, "wb");

        /* Writes the sorted array n to f_out in binary form */
        fwrite(n, sizeof(int), j, fp_out);
        fclose(fp_out);
    }
    fclose(fp_in);
    return i; /* The number of sorted files */
}
```

* `merge_files` (new)

```c=
static size_t merge_files(size_t total_files)
{
    /* Merges groups of at most MAX_FILES files */
    for (size_t i = 0; i * MAX_FILES < total_files; i++) {
        ...
        for (size_t j = 0; j < size; j++) {
            char filename[MAX_FILENAME_LENGTH];
            sprintf(filename, "tmp/%lu.txt", i * MAX_FILES + j + 1);
            fp[j] = open_file(filename, "rb");
        }
        ...
        /* Reads the first integer from each file
         * and stores it in the heap */
        for (int j = 0; j < size; j++) {
            fread(&h->node[j + 1].value, sizeof(int), 1, fp[j]);
            h->node[j + 1].index = j;
        }
        ...
        /* Writes the smallest value to the output file,
         * named "tmp_x.txt" */
        char f_out[MAX_FILENAME_LENGTH];
        sprintf(f_out, "tmp/tmp_%lu.txt", i + 1);
        FILE *fp_out = open_file(f_out, "wb");
        fwrite(&h->node[1].value, sizeof(int), 1, fp_out);
        int curr = get_min_index(h);
        while (1) {
            int tmp;
            /* End of the current file */
            if (fread(&tmp, sizeof(int), 1, fp[curr]) == 0) {
                /* Closes the sorted file and deletes it */
                fclose(fp[curr]);
                char filename[MAX_FILENAME_LENGTH];
                sprintf(filename, "tmp/%lu.txt", i * MAX_FILES + curr + 1);
                remove(filename);

                /* Updates min_heap */
                swap(&h->node[h->tail--], &h->node[1]);
                if (h->tail == 0)
                    break;
            } else {
                h->node[1].value = tmp;
                h->node[1].index = curr;
            }
            min_heapify(h, 1);
            fwrite(&h->node[1].value, sizeof(int), 1, fp_out);
            curr = get_min_index(h);
        }
        fclose(fp_out);
        ...
    }
    return (total_files - 1) / MAX_FILES + 1;
}
```

* `external_sort` (new)

```c=
static void external_sort(const char *f_in)
{
    ...
    size_t total_files = split_file_and_sort(f_in);
    ...
    do {
        /* Merges at most MAX_FILES sorted files into one file */
        total_files = merge_files(total_files);
    } while (total_files > 1);

    /* Converts the binary result back to text in output.txt */
    char f_tmp[] = "tmp/1.txt", f_out[] = "output.txt";
    FILE *fp_tmp = open_file(f_tmp, "rb"), *fp_out = open_file(f_out, "w");
    int tmp;
    while (fread(&tmp, sizeof(int), 1, fp_tmp) != 0)
        fprintf(fp_out, "%d\n", tmp);
    fclose(fp_tmp);
    fclose(fp_out);
    remove(f_tmp);
    rmdir("tmp");
    ...
}
```

* result:

| function | elapsed time (before) | elapsed time (after) |
|:---------------------:|:---------------------:|:--------------------:|
| `split_file_and_sort` | 41.440600 secs | 33.272671 secs |
| `merge_files` | 39.846767 secs | 31.794889 secs |

![](https://i.imgur.com/kBRG4zo.png)
![](https://i.imgur.com/ojJADTO.png)

* According to the results, the execution time of both `split_file_and_sort` and `merge_files` dropped, and `__vfscanf_internal` no longer appeared in the red (time-consuming) zone.

## References

* [How to sort the runs in external sorting using merge sort](https://stackoverflow.com/a/12869401)
* [The Mode Bits for Access Permission](https://www.gnu.org/software/libc/manual/html_node/Permission-Bits.html)
* [Merge Sort](https://www.geeksforgeeks.org/merge-sort/)
* [External Sorting](https://www.geeksforgeeks.org/external-sorting/)
* [Linux Performance Analysis Tool: Perf](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)
* [Linux kernel profiling with perf](https://perf.wiki.kernel.org/index.php/Tutorial)
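As a standalone illustration of the merge procedure described in this post, the sketch below replays steps 2 through 8 of `merge_files` in memory: each sorted `int` array stands in for a sorted temporary file, and a min-heap of (value, source index) pairs repeatedly yields the global minimum. The names `node_t` and `min_heapify` mirror the post; `k_way_merge` and `swap_nodes` are helpers invented for this sketch, which supports at most 15 sources.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int value; /* current smallest unmerged value from this source */
    int index; /* which source array it came from */
} node_t;

static void swap_nodes(node_t *a, node_t *b)
{
    node_t t = *a;
    *a = *b;
    *b = t;
}

/* Sifts node i down until the min-heap property holds (1-based heap). */
static void min_heapify(node_t *heap, size_t size, size_t i)
{
    size_t smallest = i, l = 2 * i, r = 2 * i + 1;
    if (l <= size && heap[l].value < heap[smallest].value)
        smallest = l;
    if (r <= size && heap[r].value < heap[smallest].value)
        smallest = r;
    if (smallest != i) {
        swap_nodes(&heap[i], &heap[smallest]);
        min_heapify(heap, size, smallest);
    }
}

/* Merges k sorted arrays (src[i] holds len[i] ints) into out.
 * Returns the number of ints written; supports k <= 15 here. */
static size_t k_way_merge(int *src[], const size_t *len, size_t k, int *out)
{
    node_t heap[16];      /* 1-based heap storage */
    size_t pos[16] = {0}; /* next unread offset in each source */
    size_t tail = 0, n = 0;

    /* Step 2: read the first integer of each source into the heap */
    for (size_t i = 0; i < k; i++) {
        if (len[i] > 0) {
            heap[++tail] = (node_t){src[i][0], (int)i};
            pos[i] = 1;
        }
    }
    /* Step 3: build the min-heap bottom-up */
    for (size_t i = tail / 2; i >= 1; i--)
        min_heapify(heap, tail, i);

    /* Steps 4-8: pop the minimum, refill from its source, re-heapify */
    while (tail > 0) {
        node_t min = heap[1];
        out[n++] = min.value;
        size_t s = (size_t)min.index;
        if (pos[s] < len[s])
            heap[1].value = src[s][pos[s]++]; /* replace the root */
        else
            heap[1] = heap[tail--]; /* source exhausted: shrink heap */
        min_heapify(heap, tail, 1);
    }
    return n;
}
```

Swapping the exhausted source's node for the last node, then sifting down, is exactly what steps 6.1 and 7 describe; the real program only differs in that it refills from `fread` instead of an array.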
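The 1021-file ceiling behind the `fopen` failures can be checked at runtime. A minimal sketch, assuming a POSIX system: `getrlimit(RLIMIT_NOFILE, ...)` reports the per-process descriptor limit (typically 1024 on a default Linux setup, with three descriptors already used by `stdin`, `stdout` and `stderr`); `max_open_files` is a helper name chosen for this sketch.

```c
#include <assert.h>
#include <sys/resource.h>

/* Returns the soft limit on simultaneously open file descriptors
 * (RLIMIT_NOFILE), or -1 on error. */
static long max_open_files(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    return (long)rl.rlim_cur;
}
```

Reading this value instead of hard-coding `MAX_FILES 1000` would let the program use whatever headroom the system actually provides.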
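The `fprintf`/`fscanf` to `fwrite`/`fread` switch relies on raw `int` values round-tripping through a binary file unchanged. A small sketch of that invariant, using `tmpfile()` so no named temporary is left behind (`binary_roundtrip` is a helper written for this sketch, not part of the repo):

```c
#include <assert.h>
#include <stdio.h>

/* Writes count ints to an anonymous temporary file in binary form,
 * reads them back into dst, and returns the number of ints read.
 * This is the same format the rewritten split_file_and_sort and
 * merge_files exchange through tmp/. */
static size_t binary_roundtrip(const int *src, size_t count, int *dst)
{
    FILE *fp = tmpfile(); /* deleted automatically on fclose */
    if (fp == NULL)
        return 0;
    fwrite(src, sizeof(int), count, fp);
    rewind(fp); /* read back from the beginning */
    size_t n = fread(dst, sizeof(int), count, fp);
    fclose(fp);
    return n;
}
```

Because no text parsing or formatting happens, this is also where the measured speedup over `fscanf`/`fprintf` comes from.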