--- tags: SDE-good-read topic: Optimed Strlen --- # Optimed Strlen We often use the ```strlen(str)``` function to get the length of the string. This is broad-use, and simply implemented. In C, we cannot consider a string as an object. We need to find the '\0'(NULL) character to determine the end of the string. >C11 6.4.5 >6. a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. >C11 7.1.1 > A string is a contiguous sequence of characters terminated by and including the first null character. The term multibyte string is sometimes used instead to emphasize special processing given to multibyte characters contained in the string or to avoid confusion with a wide string. A pointer to a string is a pointer to its initial (lowest addressed) character. The length of a string is the number of bytes preceding the null character and the value of a string is the sequence of the values of the contained characters, in order. Briefly, strlen() is the finding first 0 function. ## 1.Easy-Implementation strlen First, we can implement the following in accordance with the behavior defined by the language: ```c= size_t strlen (const char *str) { const char *p; if (str == NULL) return 0; p = str; while (*p != '\0') p++; return p - str; } ``` It is a very easy implement - compare every character to find NULL character. But we can find out something - the function always comparing one character at one time. This can obviously be optimized! Depending on the computer architecture, we can operate on data of a long size. Taking a computer with 64-bit architecture for example, 8 characters can be compared at a time, and the performance can be increased up to 8 times for longer strings. ## 2. Design [glib - strlen()](https://github.com/lattera/glibc/blob/master/string/strlen.c) 1. If the string length is less than one long size (or smaller, correctly to say, to the address that matches the bytes address), the design does not improve its performance. We have to think about how to handle very short strings. **In addition, we also need to find the aligned pointer.** 2. If a long string is given, we will need to find the first '\0' in the later data. The later data should be spilted into many one-long-size block (sub-string), which can be used directly by memory. 3. After finding NULL charater, we can calculate the length by comparing with start pointer. ## 3. glib strlen ### What is GLib? [GNOME](https://www.gnome.org/) - GNU Network Object Model Environment ![](https://i.imgur.com/WY2hyBI.png) Glib is one of the bsp library for GNOME, but GNU plan fail. Glib has good cross-platform features, and can deployed on different OS. GNU Liberary provides this optimized implementation for strlen(), which will be described in detail in this article. ## 4. Implementation **1. Small string / Find start-pointer** ```c= const char *char_ptr; const unsigned long int *longword_ptr /* Handle the first few characters by reading one character at a time. Do this until CHAR_PTR is aligned on a longword boundary. */ for (char_ptr = str; ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0; ++char_ptr) if (*char_ptr == '\0') return char_ptr - str; ``` The first goal would be to find a aligned pointer on a longword boundary. If we find a NULL in here, return back. In this for-loop ```for ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0 ```we can find the padding address. The address will be the start-address for next stage. #### Why do we need to find the start-address? Consider two string condition. ```c= char* str = "This is a book." unsigned int len1 = strlen(str); unsigned int len2 = strlen((&str[1]); ``` ```graphviz digraph G{ node [shape = record]; node0 [fontsize=13, label ="T|h|i|s|_|i|s|_"]; node1 [fontsize=13, label ="a|_|b|o|o|k|.|"]; } ``` The pointer ```&str[1]``` can not be used for the long-byte operator. **2. Check the block (sub-string)** So here is the question. How to find the null character in a long size data? [Determine if a word has a byte less than n](http://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord) >Test if a word x contains an unsigned byte with value < n. Specifically for n=1, it can be used to find a 0-byte by examining one long at a time, or any byte by XORing x with a mask first. Uses 4 arithmetic/logical operations when n is constant. > Requirements: x>=0; 0<=n<=128 ```#define hasless(x,n) (((x)-~0UL/255*(n))&~(x)&~0UL/255*128)``` To count the number of bytes in x that are less than n in 7 operations, use ```#define countless(x,n) \``` ```(((~0UL/255*(127+(n))-((x)&~0UL/255*127))&~(x)&~0UL/255*128)/128%255)``` [Determine if a word has a zero byte](http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord) >The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second. ```c= // consider a 64 bit long size himagic = 0x8080808080808080L; lomagic = 0x0101010101010101L; ... for (;;) { longword = *longword_ptr++; if (((longword - lomagic) & ~longword & himagic) != 0) ``` We can notice how to build a mask in different platform by an elegant method in source code. 3. Return the answer Return the distance between start and finish pointer. ```c= if (cp[0] == 0) return cp - str; ... ``` According to this implementation, we can speculate that there would be an 8x improvment performance in 64-bits architecture computer. [holes](Printable characters Codes 20hex to 7Ehex, known as the printable characters, represent letters, digits, punctuation marks, and a few miscellaneous symbols. There are 95 printable characters in total.[m] ) ## 5. Source Code ```c= #ifndef STRLEN # define STRLEN strlen #endif /* Return the length of the null-terminated string STR. Scan for the null terminator quickly by testing four bytes at a time. */ size_t STRLEN (const char *str) { const char *char_ptr; const unsigned long int *longword_ptr; unsigned long int longword, himagic, lomagic; /* Handle the first few characters by reading one character at a time. Do this until CHAR_PTR is aligned on a longword boundary. */ for (char_ptr = str; ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0; ++char_ptr) if (*char_ptr == '\0') return char_ptr - str; /* All these elucidatory comments refer to 4-byte longwords, but the theory applies equally well to 8-byte longwords. */ longword_ptr = (unsigned long int *) char_ptr; /* Bits 31, 24, 16, and 8 of this number are zero. Call these bits the "holes." Note that there is a hole just to the left of each byte, with an extra at the end: bits: 01111110 11111110 11111110 11111111 bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD The 1-bits make sure that carries propagate to the next 0-bit. The 0-bits provide holes for carries to fall into. */ himagic = 0x80808080L; lomagic = 0x01010101L; if (sizeof (longword) > 4) { /* 64-bit version of the magic. */ /* Do the shift in two steps to avoid a warning if long has 32 bits. */ himagic = ((himagic << 16) << 16) | himagic; lomagic = ((lomagic << 16) << 16) | lomagic; } if (sizeof (longword) > 8) abort (); /* Instead of the traditional loop which tests each character, we will test a longword at a time. The tricky part is testing if *any of the four* bytes in the longword in question are zero. */ for (;;) { longword = *longword_ptr++; if (((longword - lomagic) & ~longword & himagic) != 0) { /* Which of the bytes was the zero? If none of them were, it was a misfire; continue the search. */ const char *cp = (const char *) (longword_ptr - 1); if (cp[0] == 0) return cp - str; if (cp[1] == 0) return cp - str + 1; if (cp[2] == 0) return cp - str + 2; if (cp[3] == 0) return cp - str + 3; if (sizeof (longword) > 4) { if (cp[4] == 0) return cp - str + 4; if (cp[5] == 0) return cp - str + 5; if (cp[6] == 0) return cp - str + 6; if (cp[7] == 0) return cp - str + 7; } } } } libc_hidden_builtin_def (strlen) ``` ## 6. Experiment (Normal-strlen vs glib-strlen vs string.h) ```shell= $ uname -a Linux u49049006de455c.ant.amazon.com 5.13.0-40-generic #45~20.04.1-Ubuntu SMP Mon Apr 4 09:38:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux $ cat /proc/cpuinfo | grep "cpu core" cpu cores : 4 ... ``` ### A. Performance [As a software engineer, what's the most ingenious line of code you've seen?](https://www.quora.com/As-a-software-engineer-whats-the-most-ingenious-line-of-code-youve-seen/answer/Gary-Munnelly?ch=10&oid=96311272&share=00d27a11&srid=kIeuq&target_type=answer&fbclid=IwAR3M3BMTqbGJSOIr-Pq695wa85KDlizwWQGNs_xMdlrLyUzjI1TWMD0Zjpk) ![](https://i.imgur.com/qdj0Huw.png) #### Our test: ``` set title "SDE-good read strlen test" set xlabel "string lengh" set ylabel "time(ms)" set terminal png font " Times_New_Roman,12 " set output "output.png" set key left plot \ "out.txt" using 1:2 with linespoints linewidth 2 title "normal-strlen", \ "out.txt" using 1:3 with linespoints linewidth 2 title "glib-strlen", \ "out.txt" using 1:4 with linespoints linewidth 2 title "string.h-strlen", \ ~ ``` ``` $ gnuplot run.gp $ eog output.png ``` ![](https://i.imgur.com/il1FRfe.png) **Test:** string size = 100000 run 10000 times and calculate the avg-time ```c= cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o out branchtest-strlen.c cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out --- test normal - strlen: time: 1644.232056 avg time: 0.164423 ms --- test glib - strlen: time: 188.436005 avg time: 0.018844 ms --- test string.h - strlen: time: 8.435000 avg time: 0.000844 ms ``` ### B. Different GCC optimitize (normal-strlen vs fast-strlen vs string.h - strlen) Branch-Test ```c= void branchtest(strlenfunc func, char *str){ clock_t t1, t2; t1 = clock(); int testtime = 10000; for(int i=0; i<testtime; i++) func(str); t2 = clock(); printf("time: %f\n", (float)(t2 - t1)/1000); printf("avg time: %f ms \n", ((float)(t2 - t1)/(float)testtime)/1000); return; } int main(){ char *str; str = gen_str(100000); printf("test normal - strlen:\n"); branchtest(normal_strlen, str); printf("test glib - strlen:\n"); branchtest(fast_strlen, str); printf("test string.h - strlen:\n"); branchtest(strlen, str); free(str); return 0; } ``` 1. -O1: **0.051850 ms : 0.006008 ms : 0.000830 ms** ``` cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -O1 -o out branchtest-strlen.c cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out test normal - strlen: time: 518.505005 avg time: 0.051850 ms test glib - strlen: time: 60.083000 avg time: 0.006008 ms test string.h - strlen: time: 8.300000 avg time: 0.000830 ms cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ``` 2. -O2 **0.032488 ms : 0.009022 ms : 8.366000 ms** ``` cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -O2 -o out branchtest-strlen.c cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out test normal - strlen: time: 324.880005 avg time: 0.032488 ms test glib - strlen: time: 90.221001 avg time: 0.009022 ms test string.h - strlen: time: 8.366000 avg time: 0.000837 ms ``` 3. -O3 **0.034716 ms : 0.008997 ms : 0.000854 ms** ``` cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -O3 -o out branchtest-strlen.c cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out test normal - strlen: time: 347.162994 avg time: 0.034716 ms test glib - strlen: time: 89.970001 avg time: 0.008997 ms test string.h - strlen: time: 8.544000 avg time: 0.000854 ms ``` 4. without gcc optimitize (worst) **0.163534 ms : 0.018793 ms : 0.000843 ms** ``` cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o out branchtest-strlen.c cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out test normal - strlen: time: 1635.336060 avg time: 0.163534 ms test glib - strlen: time: 187.932007 avg time: 0.018793 ms test string.h - strlen: time: 8.429000 avg time: 0.000843 ms ``` In this experiment, we can find that glib-preformance will have the best performance in **-O1** option. In addition, string.h- strlen() is hardly affected by GCC-optimitize. **So what is the reason that -O2 amd -O3 can't achieve better performance in glib - strlen case?** //ToDo ### C. Assembly code 1. normal - strlen ``` .L5: addq $1, -8(%rbp) .L4: movq -8(%rbp), %rax movzbl (%rax), %eax testb %al, %al jne .L5 movq -8(%rbp), %rax subq -24(%rbp), %rax ``` 2. glib - strlen ``` fast_strlen: .LFB6: .cfi_startproc endbr64 pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 movq %rdi, -56(%rbp) movq -56(%rbp), %rax movq %rax, -48(%rbp) jmp .L2 .L5: movq -48(%rbp), %rax movzbl (%rax), %eax testb %al, %al jne .L3 movq -48(%rbp), %rax subq -56(%rbp), %rax jmp .L4 .L3: addq $1, -48(%rbp) .L2: movq -48(%rbp), %rax andl $7, %eax testq %rax, %rax jne .L5 movq -48(%rbp), %rax movq %rax, -40(%rbp) movl $2155905152, %eax movq %rax, -32(%rbp) movq $16843009, -24(%rbp) movq -32(%rbp), %rax salq $32, %rax orq %rax, -32(%rbp) movq -24(%rbp), %rax salq $32, %rax orq %rax, -24(%rbp) .L14: movq -40(%rbp), %rax leaq 8(%rax), %rdx movq %rdx, -40(%rbp) movq (%rax), %rax movq %rax, -16(%rbp) movq -16(%rbp), %rax subq -24(%rbp), %rax movq -16(%rbp), %rdx notq %rdx andq %rdx, %rax andq -32(%rbp), %rax testq %rax, %rax je .L14 movq -40(%rbp), %rax subq $8, %rax movq %rax, -8(%rbp) movq -8(%rbp), %rax movzbl (%rax), %eax testb %al, %al jne .L7 movq -8(%rbp), %rax subq -56(%rbp), %rax jmp .L4 .L7: movq -8(%rbp), %rax addq $1, %rax movzbl (%rax), %eax testb %al, %al jne .L8 movq -8(%rbp), %rax subq -56(%rbp), %rax addq $1, %rax jmp .L4 .L8: movq -8(%rbp), %rax addq $2, %rax movzbl (%rax), %eax testb %al, %al jne .L9 movq -8(%rbp), %rax subq -56(%rbp), %rax addq $2, %rax jmp .L4 .L9: movq -8(%rbp), %rax addq $3, %rax movzbl (%rax), %eax testb %al, %al jne .L10 movq -8(%rbp), %rax subq -56(%rbp), %rax addq $3, %rax jmp .L4 .L10: movq -8(%rbp), %rax addq $4, %rax movzbl (%rax), %eax testb %al, %al jne .L11 movq -8(%rbp), %rax subq -56(%rbp), %rax addq $4, %rax jmp .L4 .L11: movq -8(%rbp), %rax addq $5, %rax movzbl (%rax), %eax testb %al, %al jne .L12 movq -8(%rbp), %rax subq -56(%rbp), %rax addq $5, %rax jmp .L4 .L12: movq -8(%rbp), %rax addq $6, %rax movzbl (%rax), %eax testb %al, %al jne .L13 movq -8(%rbp), %rax subq -56(%rbp), %rax addq $6, %rax jmp .L4 .L13: movq -8(%rbp), %rax addq $7, %rax movzbl (%rax), %eax testb %al, %al jne .L14 movq -8(%rbp), %rax subq -56(%rbp), %rax addq $7, %rax .L4: popq %rbp .cfi_def_cfa 7, 8 ret .cfi_endproc ``` 3. string.h - strlen ``` subq $16, %rsp movq %rdi, -8(%rbp) movq -8(%rbp), %rax movq %rax, %rdi call strlen@PLT leave ``` What is **strlen@PLT**? PTL(procedure linkage table) makes dynamic loading and linking easier to use. [How is glibc loaded at runtime?](http://dustin.schultz.io/how-is-glibc-loaded-at-runtime.html) This should be the implementation on the hardware. Maybe we should test the strlen() in different platform. https://stackoverflow.com/questions/1733281/where-is-the-implementation-of-strlen-in-gcc #### Q: Why can string.h run so fast? //ToDo: Compare ARM vs Intel architecture ## 8. Application ```strlen``` is an indeterminate function with time complexity O(n). In this section we will show what we can do or learn in glib-strlen implementation. ### Example: upper the string **1. Do/Don't** ```c= //Don't do it: for(int i = 0; i < strlen(str); i++){ upperstr(str[i]); } ``` ```c= // Do it: len = strlen(str) for(int i = 0; i < len; i++){ upperstr(str[i]); } ``` Test: ```c= void upperstr(char *str){ for(int i = 0; i < strlen(str); i++){ str[i] &= 0b11011111; } } int main(){ char *str; clock_t t1, t2; str = gen_str(10000); t1 = clock(); int testtime = 10000; for(int i=0; i<testtime; i++) upperstr(str); t2 = clock(); printf("time: %f\n", (float)(t2 - t1)/1000); printf("avg time: %f ms \n", ((float)(t2 - t1)/(float)testtime)/1000); free(str); return 0; } ``` ``` cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o upper testupper.c cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./upper time: 7018.982910 avg time: 0.701898 ms ``` ``` cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o upper testupper.c cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./upper time: 196.768005 avg time: 0.019677 ms ``` **2. upper string** ASCII : A = 0x41 = 0b 0100 0000 a = 0x61 = 0b 0110 0000 ... Z = 0x5a = 0b 0101 1010 z = 0x7a = 0b 0111 1010 A single upper character can be transfered to a lower character by "or-ing a mask" (| 0b01000000). In other words, "and-ing a mask"(& 0b11011111) can let a lower character become a upper character. ```c= size_t fast_upperstr (char *str) { char *char_ptr; unsigned long int *longword_ptr; unsigned long int longword, himagic, lomagic, mask; for (char_ptr = str; ((unsigned long int) char_ptr& (sizeof (longword) - 1)) != 0;++char_ptr){ *char_ptr = (*char_ptr) & 0b11011111; if (*char_ptr == '\0') return char_ptr - str; } longword_ptr = (unsigned long int *) char_ptr; himagic = 0x80808080L; lomagic = 0x01010101L; mask = 0xdfdfdfdfL; if (sizeof (longword) > 4) { himagic = ((himagic << 16) << 16) | himagic; lomagic = ((lomagic << 16) << 16) | lomagic; mask = ((mask << 16) << 16) | mask; } if (sizeof (longword) > 8) abort (); for (;;) { longword = *longword_ptr++; if (((longword - lomagic) & ~longword & himagic) != 0) { char *cp = (char *) (longword_ptr - 1); if (cp[0] == 0) return cp - str; if (cp[1] == 0){ cp[0] &= 0xdf; return cp - str + 1; } if (cp[2] == 0){ cp[0] &= 0xdf; cp[1] &= 0xdf; return cp - str + 2; } if (cp[3] == 0){ cp[0] &= 0xdf; cp[1] &= 0xdf; cp[2] &= 0xdf; return cp - str + 3; } if (sizeof (longword) > 4) { if (cp[4] == 0){ cp[0] &= 0xdf; cp[1] &= 0xdf; cp[2] &= 0xdf; cp[3] &= 0xdf; return cp - str + 4; } if (cp[5] == 0){ cp[0] &= 0xdf; cp[1] &= 0xdf; cp[2] &= 0xdf; cp[3] &= 0xdf; cp[4] &= 0xdf; return cp - str + 5; } if (cp[6] == 0){ cp[0] &= 0xdf; cp[1] &= 0xdf; cp[2] &= 0xdf; cp[3] &= 0xdf; cp[4] &= 0xdf; cp[5] &= 0xdf; return cp - str + 6; } if (cp[7] == 0){ cp[0] &= 0xdf; cp[1] &= 0xdf; cp[2] &= 0xdf; cp[3] &= 0xdf; cp[4] &= 0xdf; cp[5] &= 0xdf; cp[6] &= 0xdf; return cp - str + 7; } } } else{ (* (longword_ptr-1)) &= mask; } } } ``` ### Brench Test Test upper a 100000 size string. ``` cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -std=c11 -o out branchtest-strupr.c cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out test normal - strlen: time: 1798.147949 avg time: 0.179815 ms test fast - strlen: time: 220.440002 avg time: 0.022044 ms ``` Output: ![](https://i.imgur.com/oaCTlnG.png) ```strupr()``` and ```strlwr()``` are not standard function, these functions are sometimes shipped with the implementation of the compiler on some platforms but generally it is recomemded to write your own custom functions to conver the string to upper and/or lower case.It is implementation-defined whether they are supported or not. ## 9. Conclusion 1. In C, strlen() shouldn't be call too much time. 2. Use string.h library's strlen() for the best performance. ## 10. Future Work 1. Faster strlen? Parallelize? https://stackoverflow.com/questions/5119309/parallel-strlen 2. Illegal reading memory? [AddressSanitizerManualPoisoning](http://blog.hostilefork.com/poison-memory-without-asan/) ```graphviz digraph G{ node [shape = record]; node0 [fontsize=13, label ="T|h|i|s|_|i|s|_"]; node1 [fontsize=13, label ="a|_|p|e|n|.||X"]; } ``` 3. Fast string compare [fast-strcmp](https://mgronhol.github.io/fast-strcmp/) ## Reference [As a software engineer, what's the most ingenious line of code you've seen?](https://www.quora.com/As-a-software-engineer-whats-the-most-ingenious-line-of-code-youve-seen/answer/Gary-Munnelly?ch=10&oid=96311272&share=00d27a11&srid=kIeuq&target_type=answer&fbclid=IwAR3M3BMTqbGJSOIr-Pq695wa85KDlizwWQGNs_xMdlrLyUzjI1TWMD0Zjpk) [GCC-buildin](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) ### TODO ([cont.](https://hackmd.io/@YLowy/Sy1jO7wPq)) 1. uclib 2. C++ 3. try ELF -> string lib