---
tags: SDE-good-read
topic: Optimed Strlen
---
# Optimed Strlen
We often use the ```strlen(str)``` function to get the length of the string. This is broad-use, and simply implemented. In C, we cannot consider a string as an object. We need to find the '\0'(NULL) character to determine the end of the string.
>C11 6.4.5
>6. a byte or code of value zero is appended to each multibyte
character sequence that results from a string literal or literals.
>C11 7.1.1
> A string is a contiguous sequence of characters terminated by and including the first null
character. The term multibyte string is sometimes used instead to emphasize special
processing given to multibyte characters contained in the string or to avoid confusion
with a wide string. A pointer to a string is a pointer to its initial (lowest addressed)
character. The length of a string is the number of bytes preceding the null character and
the value of a string is the sequence of the values of the contained characters, in order.
Briefly, strlen() is the finding first 0 function.
## 1.Easy-Implementation strlen
First, we can implement the following in accordance with the behavior defined by the language:
```c=
size_t strlen (const char *str)
{
const char *p;
if (str == NULL)
return 0;
p = str;
while (*p != '\0')
p++;
return p - str;
}
```
It is a very easy implement - compare every character to find NULL character. But we can find out something - the function always comparing one character at one time. This can obviously be optimized!
Depending on the computer architecture, we can operate on data of a long size. Taking a computer with 64-bit architecture for example, 8 characters can be compared at a time, and the performance can be increased up to 8 times for longer strings.
## 2. Design
[glib - strlen()](https://github.com/lattera/glibc/blob/master/string/strlen.c)
1. If the string length is less than one long size (or smaller, correctly to say, to the address that matches the bytes address), the design does not improve its performance. We have to think about how to handle very short strings. **In addition, we also need to find the aligned pointer.**
2. If a long string is given, we will need to find the first '\0' in the later data. The later data should be spilted into many one-long-size block (sub-string), which can be used directly by memory.
3. After finding NULL charater, we can calculate the length by comparing with start pointer.
## 3. glib strlen
### What is GLib?
[GNOME](https://www.gnome.org/) - GNU Network Object Model Environment

Glib is one of the bsp library for GNOME, but GNU plan fail.
Glib has good cross-platform features, and can deployed on different OS.
GNU Liberary provides this optimized implementation for strlen(), which will be described in detail in this article.
## 4. Implementation
**1. Small string / Find start-pointer**
```c=
const char *char_ptr;
const unsigned long int *longword_ptr
/* Handle the first few characters by reading one character at a time.
Do this until CHAR_PTR is aligned on a longword boundary. */
for (char_ptr = str; ((unsigned long int) char_ptr
& (sizeof (longword) - 1)) != 0;
++char_ptr)
if (*char_ptr == '\0')
return char_ptr - str;
```
The first goal would be to find a aligned pointer on a longword boundary. If we find a NULL in here, return back.
In this for-loop ```for ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0 ```we can find the padding address. The address will be the start-address for next stage.
#### Why do we need to find the start-address?
Consider two string condition.
```c=
char* str = "This is a book."
unsigned int len1 = strlen(str);
unsigned int len2 = strlen((&str[1]);
```
```graphviz
digraph G{
node [shape = record];
node0 [fontsize=13, label ="T|h|i|s|_|i|s|_"];
node1 [fontsize=13, label ="a|_|b|o|o|k|.|"];
}
```
The pointer ```&str[1]``` can not be used for the long-byte operator.
**2. Check the block (sub-string)**
So here is the question. How to find the null character in a long size data?
[Determine if a word has a byte less than n](http://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord)
>Test if a word x contains an unsigned byte with value < n. Specifically for n=1, it can be used to find a 0-byte by examining one long at a time, or any byte by XORing x with a mask first. Uses 4 arithmetic/logical operations when n is constant.
> Requirements: x>=0; 0<=n<=128
```#define hasless(x,n) (((x)-~0UL/255*(n))&~(x)&~0UL/255*128)```
To count the number of bytes in x that are less than n in 7 operations, use
```#define countless(x,n) \```
```(((~0UL/255*(127+(n))-((x)&~0UL/255*127))&~(x)&~0UL/255*128)/128%255)```
[Determine if a word has a zero byte](http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord)
>The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.
```c=
// consider a 64 bit long size
himagic = 0x8080808080808080L;
lomagic = 0x0101010101010101L;
...
for (;;)
{
longword = *longword_ptr++;
if (((longword - lomagic) & ~longword & himagic) != 0)
```
We can notice how to build a mask in different platform by an elegant method in source code.
3. Return the answer
Return the distance between start and finish pointer.
```c=
if (cp[0] == 0)
return cp - str;
...
```
According to this implementation, we can speculate that there would be an 8x improvment performance in 64-bits architecture computer.
[holes](Printable characters
Codes 20hex to 7Ehex, known as the printable characters, represent letters, digits, punctuation marks, and a few miscellaneous symbols. There are 95 printable characters in total.[m] )
## 5. Source Code
```c=
#ifndef STRLEN
# define STRLEN strlen
#endif
/* Return the length of the null-terminated string STR. Scan for
the null terminator quickly by testing four bytes at a time. */
size_t
STRLEN (const char *str)
{
const char *char_ptr;
const unsigned long int *longword_ptr;
unsigned long int longword, himagic, lomagic;
/* Handle the first few characters by reading one character at a time.
Do this until CHAR_PTR is aligned on a longword boundary. */
for (char_ptr = str; ((unsigned long int) char_ptr
& (sizeof (longword) - 1)) != 0;
++char_ptr)
if (*char_ptr == '\0')
return char_ptr - str;
/* All these elucidatory comments refer to 4-byte longwords,
but the theory applies equally well to 8-byte longwords. */
longword_ptr = (unsigned long int *) char_ptr;
/* Bits 31, 24, 16, and 8 of this number are zero. Call these bits
the "holes." Note that there is a hole just to the left of
each byte, with an extra at the end:
bits: 01111110 11111110 11111110 11111111
bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
The 1-bits make sure that carries propagate to the next 0-bit.
The 0-bits provide holes for carries to fall into. */
himagic = 0x80808080L;
lomagic = 0x01010101L;
if (sizeof (longword) > 4)
{
/* 64-bit version of the magic. */
/* Do the shift in two steps to avoid a warning if long has 32 bits. */
himagic = ((himagic << 16) << 16) | himagic;
lomagic = ((lomagic << 16) << 16) | lomagic;
}
if (sizeof (longword) > 8)
abort ();
/* Instead of the traditional loop which tests each character,
we will test a longword at a time. The tricky part is testing
if *any of the four* bytes in the longword in question are zero. */
for (;;)
{
longword = *longword_ptr++;
if (((longword - lomagic) & ~longword & himagic) != 0)
{
/* Which of the bytes was the zero? If none of them were, it was
a misfire; continue the search. */
const char *cp = (const char *) (longword_ptr - 1);
if (cp[0] == 0)
return cp - str;
if (cp[1] == 0)
return cp - str + 1;
if (cp[2] == 0)
return cp - str + 2;
if (cp[3] == 0)
return cp - str + 3;
if (sizeof (longword) > 4)
{
if (cp[4] == 0)
return cp - str + 4;
if (cp[5] == 0)
return cp - str + 5;
if (cp[6] == 0)
return cp - str + 6;
if (cp[7] == 0)
return cp - str + 7;
}
}
}
}
libc_hidden_builtin_def (strlen)
```
## 6. Experiment (Normal-strlen vs glib-strlen vs string.h)
```shell=
$ uname -a
Linux u49049006de455c.ant.amazon.com 5.13.0-40-generic #45~20.04.1-Ubuntu SMP Mon Apr 4 09:38:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/cpuinfo | grep "cpu core"
cpu cores : 4
...
```
### A. Performance
[As a software engineer, what's the most ingenious line of code you've seen?](https://www.quora.com/As-a-software-engineer-whats-the-most-ingenious-line-of-code-youve-seen/answer/Gary-Munnelly?ch=10&oid=96311272&share=00d27a11&srid=kIeuq&target_type=answer&fbclid=IwAR3M3BMTqbGJSOIr-Pq695wa85KDlizwWQGNs_xMdlrLyUzjI1TWMD0Zjpk)

#### Our test:
```
set title "SDE-good read strlen test"
set xlabel "string lengh"
set ylabel "time(ms)"
set terminal png font " Times_New_Roman,12 "
set output "output.png"
set key left
plot \
"out.txt" using 1:2 with linespoints linewidth 2 title "normal-strlen", \
"out.txt" using 1:3 with linespoints linewidth 2 title "glib-strlen", \
"out.txt" using 1:4 with linespoints linewidth 2 title "string.h-strlen", \
~
```
```
$ gnuplot run.gp
$ eog output.png
```

**Test:**
string size = 100000
run 10000 times and calculate the avg-time
```c=
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o out branchtest-strlen.c
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out
---
test normal - strlen:
time: 1644.232056
avg time: 0.164423 ms
---
test glib - strlen:
time: 188.436005
avg time: 0.018844 ms
---
test string.h - strlen:
time: 8.435000
avg time: 0.000844 ms
```
### B. Different GCC optimitize (normal-strlen vs fast-strlen vs string.h - strlen)
Branch-Test
```c=
void branchtest(strlenfunc func, char *str){
clock_t t1, t2;
t1 = clock();
int testtime = 10000;
for(int i=0; i<testtime; i++)
func(str);
t2 = clock();
printf("time: %f\n", (float)(t2 - t1)/1000);
printf("avg time: %f ms \n", ((float)(t2 - t1)/(float)testtime)/1000);
return;
}
int main(){
char *str;
str = gen_str(100000);
printf("test normal - strlen:\n");
branchtest(normal_strlen, str);
printf("test glib - strlen:\n");
branchtest(fast_strlen, str);
printf("test string.h - strlen:\n");
branchtest(strlen, str);
free(str);
return 0;
}
```
1. -O1: **0.051850 ms : 0.006008 ms : 0.000830 ms**
```
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -O1 -o out branchtest-strlen.c
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out
test normal - strlen:
time: 518.505005
avg time: 0.051850 ms
test glib - strlen:
time: 60.083000
avg time: 0.006008 ms
test string.h - strlen:
time: 8.300000
avg time: 0.000830 ms
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$
```
2. -O2 **0.032488 ms : 0.009022 ms : 8.366000 ms**
```
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -O2 -o out branchtest-strlen.c
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out
test normal - strlen:
time: 324.880005
avg time: 0.032488 ms
test glib - strlen:
time: 90.221001
avg time: 0.009022 ms
test string.h - strlen:
time: 8.366000
avg time: 0.000837 ms
```
3. -O3 **0.034716 ms : 0.008997 ms : 0.000854 ms**
```
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -O3 -o out branchtest-strlen.c
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out
test normal - strlen:
time: 347.162994
avg time: 0.034716 ms
test glib - strlen:
time: 89.970001
avg time: 0.008997 ms
test string.h - strlen:
time: 8.544000
avg time: 0.000854 ms
```
4. without gcc optimitize (worst)
**0.163534 ms : 0.018793 ms : 0.000843 ms**
```
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o out branchtest-strlen.c
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out
test normal - strlen:
time: 1635.336060
avg time: 0.163534 ms
test glib - strlen:
time: 187.932007
avg time: 0.018793 ms
test string.h - strlen:
time: 8.429000
avg time: 0.000843 ms
```
In this experiment, we can find that glib-preformance will have the best performance in **-O1** option. In addition, string.h- strlen() is hardly affected by GCC-optimitize.
**So what is the reason that -O2 amd -O3 can't achieve better performance in glib - strlen case?**
//ToDo
### C. Assembly code
1. normal - strlen
```
.L5:
addq $1, -8(%rbp)
.L4:
movq -8(%rbp), %rax
movzbl (%rax), %eax
testb %al, %al
jne .L5
movq -8(%rbp), %rax
subq -24(%rbp), %rax
```
2. glib - strlen
```
fast_strlen:
.LFB6:
.cfi_startproc
endbr64
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -56(%rbp)
movq -56(%rbp), %rax
movq %rax, -48(%rbp)
jmp .L2
.L5:
movq -48(%rbp), %rax
movzbl (%rax), %eax
testb %al, %al
jne .L3
movq -48(%rbp), %rax
subq -56(%rbp), %rax
jmp .L4
.L3:
addq $1, -48(%rbp)
.L2:
movq -48(%rbp), %rax
andl $7, %eax
testq %rax, %rax
jne .L5
movq -48(%rbp), %rax
movq %rax, -40(%rbp)
movl $2155905152, %eax
movq %rax, -32(%rbp)
movq $16843009, -24(%rbp)
movq -32(%rbp), %rax
salq $32, %rax
orq %rax, -32(%rbp)
movq -24(%rbp), %rax
salq $32, %rax
orq %rax, -24(%rbp)
.L14:
movq -40(%rbp), %rax
leaq 8(%rax), %rdx
movq %rdx, -40(%rbp)
movq (%rax), %rax
movq %rax, -16(%rbp)
movq -16(%rbp), %rax
subq -24(%rbp), %rax
movq -16(%rbp), %rdx
notq %rdx
andq %rdx, %rax
andq -32(%rbp), %rax
testq %rax, %rax
je .L14
movq -40(%rbp), %rax
subq $8, %rax
movq %rax, -8(%rbp)
movq -8(%rbp), %rax
movzbl (%rax), %eax
testb %al, %al
jne .L7
movq -8(%rbp), %rax
subq -56(%rbp), %rax
jmp .L4
.L7:
movq -8(%rbp), %rax
addq $1, %rax
movzbl (%rax), %eax
testb %al, %al
jne .L8
movq -8(%rbp), %rax
subq -56(%rbp), %rax
addq $1, %rax
jmp .L4
.L8:
movq -8(%rbp), %rax
addq $2, %rax
movzbl (%rax), %eax
testb %al, %al
jne .L9
movq -8(%rbp), %rax
subq -56(%rbp), %rax
addq $2, %rax
jmp .L4
.L9:
movq -8(%rbp), %rax
addq $3, %rax
movzbl (%rax), %eax
testb %al, %al
jne .L10
movq -8(%rbp), %rax
subq -56(%rbp), %rax
addq $3, %rax
jmp .L4
.L10:
movq -8(%rbp), %rax
addq $4, %rax
movzbl (%rax), %eax
testb %al, %al
jne .L11
movq -8(%rbp), %rax
subq -56(%rbp), %rax
addq $4, %rax
jmp .L4
.L11:
movq -8(%rbp), %rax
addq $5, %rax
movzbl (%rax), %eax
testb %al, %al
jne .L12
movq -8(%rbp), %rax
subq -56(%rbp), %rax
addq $5, %rax
jmp .L4
.L12:
movq -8(%rbp), %rax
addq $6, %rax
movzbl (%rax), %eax
testb %al, %al
jne .L13
movq -8(%rbp), %rax
subq -56(%rbp), %rax
addq $6, %rax
jmp .L4
.L13:
movq -8(%rbp), %rax
addq $7, %rax
movzbl (%rax), %eax
testb %al, %al
jne .L14
movq -8(%rbp), %rax
subq -56(%rbp), %rax
addq $7, %rax
.L4:
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
```
3. string.h - strlen
```
subq $16, %rsp
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi
call strlen@PLT
leave
```
What is **strlen@PLT**?
PTL(procedure linkage table) makes dynamic loading and linking easier to use. [How is glibc loaded at runtime?](http://dustin.schultz.io/how-is-glibc-loaded-at-runtime.html)
This should be the implementation on the hardware. Maybe we should test the strlen() in different platform.
https://stackoverflow.com/questions/1733281/where-is-the-implementation-of-strlen-in-gcc
#### Q: Why can string.h run so fast?
//ToDo: Compare ARM vs Intel architecture
## 8. Application
```strlen``` is an indeterminate function with time complexity O(n). In this section we will show what we can do or learn in glib-strlen implementation.
### Example: upper the string
**1. Do/Don't**
```c=
//Don't do it:
for(int i = 0; i < strlen(str); i++){
upperstr(str[i]);
}
```
```c=
// Do it:
len = strlen(str)
for(int i = 0; i < len; i++){
upperstr(str[i]);
}
```
Test:
```c=
void upperstr(char *str){
for(int i = 0; i < strlen(str); i++){
str[i] &= 0b11011111;
}
}
int main(){
char *str;
clock_t t1, t2;
str = gen_str(10000);
t1 = clock();
int testtime = 10000;
for(int i=0; i<testtime; i++)
upperstr(str);
t2 = clock();
printf("time: %f\n", (float)(t2 - t1)/1000);
printf("avg time: %f ms \n", ((float)(t2 - t1)/(float)testtime)/1000);
free(str);
return 0;
}
```
```
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o upper testupper.c
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./upper
time: 7018.982910
avg time: 0.701898 ms
```
```
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o upper testupper.c
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./upper
time: 196.768005
avg time: 0.019677 ms
```
**2. upper string**
ASCII :
A = 0x41 = 0b 0100 0000
a = 0x61 = 0b 0110 0000
...
Z = 0x5a = 0b 0101 1010
z = 0x7a = 0b 0111 1010
A single upper character can be transfered to a lower character by "or-ing a mask" (| 0b01000000). In other words, "and-ing a mask"(& 0b11011111) can let a lower character become a upper character.
```c=
size_t fast_upperstr (char *str)
{
char *char_ptr;
unsigned long int *longword_ptr;
unsigned long int longword, himagic, lomagic, mask;
for (char_ptr = str; ((unsigned long int) char_ptr& (sizeof (longword) - 1)) != 0;++char_ptr){
*char_ptr = (*char_ptr) & 0b11011111;
if (*char_ptr == '\0')
return char_ptr - str;
}
longword_ptr = (unsigned long int *) char_ptr;
himagic = 0x80808080L;
lomagic = 0x01010101L;
mask = 0xdfdfdfdfL;
if (sizeof (longword) > 4)
{
himagic = ((himagic << 16) << 16) | himagic;
lomagic = ((lomagic << 16) << 16) | lomagic;
mask = ((mask << 16) << 16) | mask;
}
if (sizeof (longword) > 8)
abort ();
for (;;)
{
longword = *longword_ptr++;
if (((longword - lomagic) & ~longword & himagic) != 0)
{
char *cp = (char *) (longword_ptr - 1);
if (cp[0] == 0)
return cp - str;
if (cp[1] == 0){
cp[0] &= 0xdf;
return cp - str + 1;
}
if (cp[2] == 0){
cp[0] &= 0xdf;
cp[1] &= 0xdf;
return cp - str + 2;
}
if (cp[3] == 0){
cp[0] &= 0xdf;
cp[1] &= 0xdf;
cp[2] &= 0xdf;
return cp - str + 3;
}
if (sizeof (longword) > 4)
{
if (cp[4] == 0){
cp[0] &= 0xdf;
cp[1] &= 0xdf;
cp[2] &= 0xdf;
cp[3] &= 0xdf;
return cp - str + 4;
}
if (cp[5] == 0){
cp[0] &= 0xdf;
cp[1] &= 0xdf;
cp[2] &= 0xdf;
cp[3] &= 0xdf;
cp[4] &= 0xdf;
return cp - str + 5;
}
if (cp[6] == 0){
cp[0] &= 0xdf;
cp[1] &= 0xdf;
cp[2] &= 0xdf;
cp[3] &= 0xdf;
cp[4] &= 0xdf;
cp[5] &= 0xdf;
return cp - str + 6;
}
if (cp[7] == 0){
cp[0] &= 0xdf;
cp[1] &= 0xdf;
cp[2] &= 0xdf;
cp[3] &= 0xdf;
cp[4] &= 0xdf;
cp[5] &= 0xdf;
cp[6] &= 0xdf;
return cp - str + 7;
}
}
}
else{
(* (longword_ptr-1)) &= mask;
}
}
}
```
### Brench Test
Test upper a 100000 size string.
```
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -std=c11 -o out branchtest-strupr.c
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out
test normal - strlen:
time: 1798.147949
avg time: 0.179815 ms
test fast - strlen:
time: 220.440002
avg time: 0.022044 ms
```
Output:

```strupr()``` and ```strlwr()``` are not standard function, these functions are sometimes shipped with the implementation of the compiler on some platforms but generally it is recomemded to write your own custom functions to conver the string to upper and/or lower case.It is implementation-defined whether they are supported or not.
## 9. Conclusion
1. In C, strlen() shouldn't be call too much time.
2. Use string.h library's strlen() for the best performance.
## 10. Future Work
1. Faster strlen? Parallelize?
https://stackoverflow.com/questions/5119309/parallel-strlen
2. Illegal reading memory?
[AddressSanitizerManualPoisoning](http://blog.hostilefork.com/poison-memory-without-asan/)
```graphviz
digraph G{
node [shape = record];
node0 [fontsize=13, label ="T|h|i|s|_|i|s|_"];
node1 [fontsize=13, label ="a|_|p|e|n|.||X"];
}
```
3. Fast string compare
[fast-strcmp](https://mgronhol.github.io/fast-strcmp/)
## Reference
[As a software engineer, what's the most ingenious line of code you've seen?](https://www.quora.com/As-a-software-engineer-whats-the-most-ingenious-line-of-code-youve-seen/answer/Gary-Munnelly?ch=10&oid=96311272&share=00d27a11&srid=kIeuq&target_type=answer&fbclid=IwAR3M3BMTqbGJSOIr-Pq695wa85KDlizwWQGNs_xMdlrLyUzjI1TWMD0Zjpk)
[GCC-buildin](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html)
### TODO
([cont.](https://hackmd.io/@YLowy/Sy1jO7wPq))
1. uclib
2. C++
3. try ELF -> string lib