2020q3 Quiz13 - HackMD

# 2020q3 Quiz13 contributed by < [StoneLin0708](https://github.com/StoneLin0708) > ###### tags: `sysprog2020` ## LeetCode 1239. Maximum Length of a Concatenated String with Unique Characters Given a C solution :::spoiler Unfold ```c #include <string.h> #define MAX(x, y) ((x) > (y) ? (x) : (y)) static void dfs(int *arr, int arr_size, int i, int temp, int *maxm) { if (i > arr_size - 1) return; int c = temp; if ((c & arr[i]) == 0) { // check if still unique after concatenated c |= arr[i]; *maxm = MAX(__builtin_popcount(c), *maxm); dfs(arr, arr_size, i + 1, c, maxm); } dfs(arr, arr_size, i + 1, temp, maxm); } int maxLength(char **arr, int arr_size) { int a[arr_size]; memset(a, 0, arr_size * sizeof(int)); for (int i = 0; i < arr_size; i++) { int k = 0; int len = strlen(arr[i]); for (int j = 0; j < len; j++) k |= 1 << (arr[i][j] - 'a'); // ignore non unique case if (__builtin_popcount(k) != len) continue; a[i] = k; } int maxm = 0; dfs(a, arr_size, 0, 0, &maxm); return maxm; } ``` ::: ### 解釋上述程式碼運作原理 Because input characters contains only lower case English letters, therefore it is possible to encode character appearance of each input string into one 32bit integer as follows. Given arrSize of inputs $input\ arr=[str_0...str_{arrSize}]$ Encode appearance to k $k=[k_0...k_{arrSize}]$ Where $k_i$ is $k_i=k_i^{bit}\in[0,1]|\ bit=0...31\\ k_i^{bit}=\ \begin{cases}\ 1 & \text{if}\ \exists j\in{str_i} \ni{alphabetOrder(j)=bit}\\ \ 0 & else \end{cases}$ After convert string to character appearance, the unique character property can be simply verified by bitwise and equal zero, and string concatenation can be done by bitwise or. $k_a\&k_b=0\rightarrow{hasUniqueCharacter(str_a,str_a)}$ $k_a|k_b\rightarrow{encode(str_a+str_b)}$ The next step is to search the solution recursively by considering taking or to ignore on each string. $str_i$ will be taken and the solution will be updated, if the current solution still has a unique character set after concatenation, and there might be a potential better solution without current the $str_i$, therefore it's required to store the previous solution and proceed to the next $str_{i+1}$ no matter if $str_i$ is taken or not. ### 改寫程式碼，避免使用遞迴並尋求效率更高的實作 #### Skip invalid input string :::spoiler ```c int maxLength(char **arr, int arr_size) { int32_t a[arr_size]; int32_t *k = a; for (char **s = arr; s != (arr + arr_size); ++s) { char *j = *s; for (*k = 0; *j; ++j) (*k) |= 1 << (*j - 'a'); if (__builtin_popcount(*k) == (j - *s)) ++k; // only store valid input } arr_size = k - a; // number of valid string int32_t maxm = 0; dfs(a, arr_size, 0, 0, &maxm); return maxm; } ``` ::: #### Note: MAX macro expanded to ternary operator, and it acturally repeat x, y twice, which cause unexpected result (to me). in c++ std::max cast inputs to same type before compare. :::spoiler ```cpp #define MAX(x, y) ((x) > (y) ? (x) : (y)) int foo(); int bar(); int test(){ return MAX(foo(),bar()); } // -> int test_expanded(){ if(foo() > bar()){ // function call return foo(); // function call }else{ return bar(); // function call } } int test2(){ return std::max(foo(),bar()); } ``` ```asm test(): push rbx call foo() mov ebx, eax call bar() cmp ebx, eax jle .L2 pop rbx jmp foo() .L2: pop rbx jmp bar() test2(): push rbx call bar() mov ebx, eax call foo() cmp ebx, eax cmovge eax, ebx pop rbx ret ``` ::: #### Iterative version with linear allocator ```c int maxLength(char **arr, int arr_size) { uint32_t a[arr_size]; uint32_t *k = a; for (char **s = arr; s != (arr + arr_size); ++s) { char *j = *s; for (*k = 0; *j; ++j) (*k) |= 1 << (*j - 'a'); if (__builtin_popcount(*k) == (j - *s)) ++k; // only store valid input } size_t mask_max = 128; uint32_t *mask = malloc(sizeof(uint32_t)*128); size_t num = 1; int maxm = 0; mask[0] = 0; for(uint32_t *begin = a; begin != k; ++begin){ int size = num; for(int i=0;i<size;++i){ if( mask[i] & *begin) continue; if( (++num) == mask_max ){ mask_max *= 2; mask = realloc(mask, sizeof(uint32_t)*mask_max); } mask[num-1] = mask[i] | *begin; maxm = MAX(__builtin_popcount(mask[num-1]), maxm); } } return maxm; } ``` #### Iterative version with staic array ```c static int maxLength4(char **arr, int arr_size) { uint32_t a[arr_size]; uint32_t *k = a; for (char **s = arr; s != (arr + arr_size); ++s) { char *j = *s; for (*k = 0; *j; ++j) (*k) |= 1 << (*j - 'a'); if (__builtin_popcount(*k) == (j - *s)) ++k; // only store valid input } static uint32_t mask[1 << 16]; size_t num = 1; int maxm = 0; mask[0] = 0; for (uint32_t *begin = a; begin != k; ++begin) { int size = num; for (int i = 0; i < size; ++i) { if (mask[i] & *begin) continue; mask[num] = mask[i] | *begin; maxm = MAX(__builtin_popcount(mask[num]), maxm); ++num; } } return maxm; } ``` #### Performance test * i7-8700 @ 4.6 ghz * test data * random generated 4096 dataset (which quality is very bad) * a-p (i.e. ["a","b",...,"p"]) * perf measure time usage for 10000 times * using valgrind for memory usage measurement * gcc 9.3 * -O3 |method|time 4096(ms)|memory 4096|time a-p(ms)|memory a-p| |-|-|-|-|-|-| |recursive|0.30073 +- 0.00028|37k|0.50186 +- 0.00023|37k| |iterative dynamic|0.29658 +- 0.00025|64.5k|0.54361 +- 0.00049|37k| |iterative static|0.28986 +- 0.00022|37k|0.41798 +- 0.00032|37k| :::warning TODO: 1. 預先建立測試資料，不依賴 LeetCode 線上評分系統 (常會遇到嚴重偏頗)，而在可掌握的主機上，透過 perf 一類的工具，分析程式的執行效率和記憶體開銷 :notes: jserv ::: ## mmap :::spoiler ```c /* copy modified blocks of source file to destination file efficiently * using mmap. */ #include <assert.h> #include <fcntl.h> #include <stddef.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/mman.h> #include <sys/stat.h> #include <sys/types.h> #include <sysexits.h> #include <unistd.h> int main(int argc, char *argv[]) { if (argc != 3) { printf("Usage: %s <source> <destination>\n", argv[0]); return EX_USAGE; } const char *src_name = argv[1]; const char *dst_name = argv[2]; int src_fd, dst_fd; struct stat dst_stat = {0}; off_t src_len, dst_len; src_fd = open(src_name, O_RDONLY); if (src_fd == -1) { perror(src_name); return EX_DATAERR; } dst_fd = open(dst_name, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH); if (dst_fd == -1 || fstat(dst_fd, &dst_stat) != 0) { perror(dst_name); return EX_DATAERR; } src_len = lseek(src_fd, 0, SEEK_END); if (src_len < 0) { perror(src_name); return EX_DATAERR; } dst_len = lseek(dst_fd, 0, SEEK_END); if (dst_len < 0) { perror(dst_name); return EX_DATAERR; } if (dst_len > src_len) { printf("Destination file is larger (%zd) than input file (%zd)\n", dst_len, src_len); return EX_DATAERR; } const size_t page_size = dst_stat.st_blksize > 0 ? dst_stat.st_blksize : BUFSIZ; const size_t len = src_len; if (ftruncate(dst_fd, len) != 0) { perror(dst_name); return EX_DATAERR; } size_t read_count = 0; size_t write_count = 0; if (len > 0) { const uint8_t *src; uint8_t *dst; src = mmap(NULL, len, PROT_READ, MAP_SHARED, src_fd, 0); if (src == NULL || posix_madvise((void *) src, len, POSIX_MADV_SEQUENTIAL) != 0) { perror(src_name); return EX_UNAVAILABLE; } dst = mmap(NULL, len, PROT_READ | PROT_WRITE , MAP_SHARED, dst_fd, 0); if (dst == NULL || posix_madvise(dst, len, POSIX_MADV_SEQUENTIAL) != 0) { perror(dst_name); return EX_UNAVAILABLE; } for (size_t i = 0; i < len; i += page_size) { size_t block_size = (len - i) >= page_size ? page_size : (len - i); if (memcmp(src + i, dst + i, block_size)) { memcpy(dst + i, src + i, block_size); write_count += block_size; } read_count += block_size; } if (munmap((void *) src, len) != 0) { perror(src_name); return EX_UNAVAILABLE; } if (msync(dst, len, MS_SYNC) != 0 || munmap(dst, len) != 0) { perror(dst_name); return EX_UNAVAILABLE; } } if (close(src_fd) != 0) { perror(src_name); return EX_UNAVAILABLE; } if (close(dst_fd) != 0) { perror(dst_name); return EX_UNAVAILABLE; } printf("%zu bytes read\n", read_count); printf("%zu bytes written\n", write_count); return EXIT_SUCCESS; } ``` ::: * reference [cp source code](https://github.com/coreutils/coreutils/blob/master/src/cp.c) * 解釋上述程式碼運作原理 1. open source and destination file, check file length by seeking to the end of file. * fseek vs lseek , libc vs posix 2. using mmap to map file to virtual memory address, also advice system to expect sequential access. > POSIX_MADV_SEQUENTIAL > The application expects to access the specified address range sequentially, running from lower addresses to higher addresses. Hence, pages in this region can be aggressively read ahead, and may be freed soon after they are accessed. 3. read and write using memory addess 4. close source memory map, synchronous memory to destination file and close destination memory map. 5. close file * 指出其缺失 * compare source and destination memory(mapped file) on each address, copy when necessay. * has negative impact on most cast, only benefit in a case when source and destination are mostly same, which doesn't make sense. * read and write using mmap rely on virtual memory, each time access to not loaded address trigger page fault, in my 16MB file copy test, it cause about **4800 page faults**, which is a huge number compare to **140 page faults from plain cp**, mmap method also take a lot longer. |page faults on|16MB|256MB| |-|-|-| |mmap|~4800|~76000| |cp|~145|~145| * outputfile - permission * if output file doesn't exist, it always try create new file with permission 666 * 探討 sendfile 和 splice 等系統系統在上述程式的應用 * 參見 [以 sendfile 和 splice 系統呼叫達到 Zero-Copy](https://hackmd.io/@sysprog/linux2020-zerocopy) Above article mentioned that mmap did reduce kernel space to user space copy, in a cost of VMM, and mmap approach could be slower in small file or sequential access, therefore it could be improved by sendfile. Replace all mmap code with sendfile and test performance. Sendfile copy take only 55 page faults to copy 256MB data, which is better than both cp and mmap ```c write_count = sendfile(dst_fd, src_fd, NULL, len); ``` ```bash 268435456 bytes read 268435456 bytes written Performance counter stats for './sfcp data/268435456.bin data/out_268435456.bin': 88.47 msec task-clock # 0.998 CPUs utilized 1 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 54 page-faults # 0.610 K/sec 380,130,771 cycles # 4.297 GHz 430,161,189 instructions # 1.13 insn per cycle 77,413,379 branches # 875.043 M/sec 64,651 branch-misses # 0.08% of all branches 0.088643552 seconds time elapsed 0.000000000 seconds user 0.088640000 seconds sys ```