Optimed Strlen

We often use the strlen(str) function to get the length of the string. This is broad-use, and simply implemented. In C, we cannot consider a string as an object. We need to find the '\0'(NULL) character to determine the end of the string.

C11 6.4.5
6. a byte or code of value zero is appended to each multibyte
character sequence that results from a string literal or literals.
C11 7.1.1
A string is a contiguous sequence of characters terminated by and including the first null
character. The term multibyte string is sometimes used instead to emphasize special
processing given to multibyte characters contained in the string or to avoid confusion
with a wide string. A pointer to a string is a pointer to its initial (lowest addressed)
character. The length of a string is the number of bytes preceding the null character and
the value of a string is the sequence of the values of the contained characters, in order.

Briefly, strlen() is the finding first 0 function.

1.Easy-Implementation strlen

First, we can implement the following in accordance with the behavior defined by the language:













size_t strlen (const char *str)  
{ 
	const char *p; 
	 
	if (str == NULL) 
		return 0; 
	 
	p = str; 
	while (*p != '\0')  
		p++; 
	 
	return p - str; 
}

It is a very easy implement - compare every character to find NULL character. But we can find out something - the function always comparing one character at one time. This can obviously be optimized!
Depending on the computer architecture, we can operate on data of a long size. Taking a computer with 64-bit architecture for example, 8 characters can be compared at a time, and the performance can be increased up to 8 times for longer strings.

2. Design

glib - strlen()

If the string length is less than one long size (or smaller, correctly to say, to the address that matches the bytes address), the design does not improve its performance. We have to think about how to handle very short strings. In addition, we also need to find the aligned pointer.
If a long string is given, we will need to find the first '\0' in the later data. The later data should be spilted into many one-long-size block (sub-string), which can be used directly by memory.
After finding NULL charater, we can calculate the length by comparing with start pointer.

3. glib strlen

What is GLib?

GNOME - GNU Network Object Model Environment

Glib is one of the bsp library for GNOME, but GNU plan fail.

Glib has good cross-platform features, and can deployed on different OS.

GNU Liberary provides this optimized implementation for strlen(), which will be described in detail in this article.

4. Implementation

1. Small string / Find start-pointer











  const char *char_ptr;
  const unsigned long int *longword_ptr

  /* Handle the first few characters by reading one character at a time.
     Do this until CHAR_PTR is aligned on a longword boundary.  */
  for (char_ptr = str; ((unsigned long int) char_ptr
			& (sizeof (longword) - 1)) != 0;
       ++char_ptr)
    if (*char_ptr == '\0')
      return char_ptr - str;

The first goal would be to find a aligned pointer on a longword boundary. If we find a NULL in here, return back.
In this for-loop for ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0 we can find the padding address. The address will be the start-address for next stage.

Why do we need to find the start-address?

Consider two string condition.



char* str = "This is a book."
unsigned int len1 = strlen(str);
unsigned int len2 = strlen((&str[1]);

The pointer &str[1] can not be used for the long-byte operator.

2. Check the block (sub-string)
So here is the question. How to find the null character in a long size data?

Determine if a word has a byte less than n

Test if a word x contains an unsigned byte with value < n. Specifically for n=1, it can be used to find a 0-byte by examining one long at a time, or any byte by XORing x with a mask first. Uses 4 arithmetic/logical operations when n is constant.
Requirements: x>=0; 0<=n<=128
#define hasless(x,n) (((x)-~0UL/255*(n))&~(x)&~0UL/255*128)
To count the number of bytes in x that are less than n in 7 operations, use
#define countless(x,n) \
(((~0UL/255*(127+(n))-((x)&~0UL/255*127))&~(x)&~0UL/255*128)/128%255)

Determine if a word has a zero byte

The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.











// consider a 64 bit long size
  himagic = 0x8080808080808080L;
  lomagic = 0x0101010101010101L;

...
  for (;;)
    {
      longword = *longword_ptr++;

      if (((longword - lomagic) & ~longword & himagic) != 0)

We can notice how to build a mask in different platform by an elegant method in source code.

Return the answer
Return the distance between start and finish pointer.



if (cp[0] == 0)
	    return cp - str;
...

According to this implementation, we can speculate that there would be an 8x improvment performance in 64-bits architecture computer.
[holes](Printable characters

Codes 20hex to 7Ehex, known as the printable characters, represent letters, digits, punctuation marks, and a few miscellaneous symbols. There are 95 printable characters in total.[m] )

5. Source Code




















































































#ifndef STRLEN
# define STRLEN strlen
#endif

/* Return the length of the null-terminated string STR.  Scan for
   the null terminator quickly by testing four bytes at a time.  */
size_t
STRLEN (const char *str)
{
  const char *char_ptr;
  const unsigned long int *longword_ptr;
  unsigned long int longword, himagic, lomagic;

  /* Handle the first few characters by reading one character at a time.
     Do this until CHAR_PTR is aligned on a longword boundary.  */
  for (char_ptr = str; ((unsigned long int) char_ptr
			& (sizeof (longword) - 1)) != 0;
       ++char_ptr)
    if (*char_ptr == '\0')
      return char_ptr - str;

  /* All these elucidatory comments refer to 4-byte longwords,
     but the theory applies equally well to 8-byte longwords.  */

  longword_ptr = (unsigned long int *) char_ptr;

  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
     the "holes."  Note that there is a hole just to the left of
     each byte, with an extra at the end:
     bits:  01111110 11111110 11111110 11111111
     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
     The 1-bits make sure that carries propagate to the next 0-bit.
     The 0-bits provide holes for carries to fall into.  */
  himagic = 0x80808080L;
  lomagic = 0x01010101L;
  if (sizeof (longword) > 4)
    {
      /* 64-bit version of the magic.  */
      /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
      himagic = ((himagic << 16) << 16) | himagic;
      lomagic = ((lomagic << 16) << 16) | lomagic;
    }
  if (sizeof (longword) > 8)
    abort ();

  /* Instead of the traditional loop which tests each character,
     we will test a longword at a time.  The tricky part is testing
     if *any of the four* bytes in the longword in question are zero.  */
  for (;;)
    {
      longword = *longword_ptr++;

      if (((longword - lomagic) & ~longword & himagic) != 0)
	{
	  /* Which of the bytes was the zero?  If none of them were, it was
	     a misfire; continue the search.  */

	  const char *cp = (const char *) (longword_ptr - 1);

	  if (cp[0] == 0)
	    return cp - str;
	  if (cp[1] == 0)
	    return cp - str + 1;
	  if (cp[2] == 0)
	    return cp - str + 2;
	  if (cp[3] == 0)
	    return cp - str + 3;
	  if (sizeof (longword) > 4)
	    {
	      if (cp[4] == 0)
		return cp - str + 4;
	      if (cp[5] == 0)
		return cp - str + 5;
	      if (cp[6] == 0)
		return cp - str + 6;
	      if (cp[7] == 0)
		return cp - str + 7;
	    }
	}
    }
}
libc_hidden_builtin_def (strlen)

6. Experiment (Normal-strlen vs glib-strlen vs string.h)






$ uname -a
Linux u49049006de455c.ant.amazon.com 5.13.0-40-generic #45~20.04.1-Ubuntu SMP Mon Apr 4 09:38:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/cpuinfo | grep "cpu core"
cpu cores	: 4
...

A. Performance

As a software engineer, what's the most ingenious line of code you've seen?

Our test:

set title "SDE-good read strlen test"
set xlabel "string lengh"
set ylabel "time(ms)"
set terminal png font " Times_New_Roman,12 "
set output "output.png"
set key left

plot \
"out.txt" using 1:2 with linespoints linewidth 2 title "normal-strlen", \
"out.txt" using 1:3 with linespoints linewidth 2 title "glib-strlen", \
"out.txt" using 1:4 with linespoints linewidth 2 title "string.h-strlen", \
~

$ gnuplot run.gp
$ eog output.png

Test:
string size = 100000
run 10000 times and calculate the avg-time














cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o out branchtest-strlen.c 
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out
---
test normal - strlen:
time: 1644.232056
avg time: 0.164423 ms 
---
test glib - strlen:
time: 188.436005
avg time: 0.018844 ms 
---
test string.h - strlen:
time: 8.435000
avg time: 0.000844 ms

B. Different GCC optimitize (normal-strlen vs fast-strlen vs string.h - strlen)

Branch-Test




























void branchtest(strlenfunc func, char *str){
  clock_t t1, t2;
  t1 = clock();
  int testtime = 10000;
	for(int i=0; i<testtime; i++)
    func(str);  

	t2 = clock();
	printf("time: %f\n", (float)(t2 - t1)/1000);
	printf("avg time: %f ms \n", ((float)(t2 - t1)/(float)testtime)/1000);
  return;
}

int main(){
  char *str;
  str = gen_str(100000);
  printf("test normal - strlen:\n");
  branchtest(normal_strlen, str);

  printf("test glib - strlen:\n");
  branchtest(fast_strlen, str);

  printf("test string.h - strlen:\n");
  branchtest(strlen, str);

  free(str);
	return 0;
}

-O1: 0.051850 ms : 0.006008 ms : 0.000830 ms

cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -O1 -o out branchtest-strlen.c 
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out
test normal - strlen:
time: 518.505005
avg time: 0.051850 ms 
test glib - strlen:
time: 60.083000
avg time: 0.006008 ms 
test string.h - strlen:
time: 8.300000
avg time: 0.000830 ms 
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$

-O2 0.032488 ms : 0.009022 ms : 8.366000 ms

cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -O2 -o out branchtest-strlen.c 
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out
test normal - strlen:
time: 324.880005
avg time: 0.032488 ms 
test glib - strlen:
time: 90.221001
avg time: 0.009022 ms 
test string.h - strlen:
time: 8.366000
avg time: 0.000837 ms

-O3 0.034716 ms : 0.008997 ms : 0.000854 ms

cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -O3 -o out branchtest-strlen.c 
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out
test normal - strlen:
time: 347.162994
avg time: 0.034716 ms 
test glib - strlen:
time: 89.970001
avg time: 0.008997 ms 
test string.h - strlen:
time: 8.544000
avg time: 0.000854 ms

without gcc optimitize (worst)
0.163534 ms : 0.018793 ms : 0.000843 ms

cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o out branchtest-strlen.c 
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out 
test normal - strlen:
time: 1635.336060
avg time: 0.163534 ms 
test glib - strlen:
time: 187.932007
avg time: 0.018793 ms 
test string.h - strlen:
time: 8.429000
avg time: 0.000843 ms

In this experiment, we can find that glib-preformance will have the best performance in -O1 option. In addition, string.h- strlen() is hardly affected by GCC-optimitize.

So what is the reason that -O2 amd -O3 can't achieve better performance in glib - strlen case?

//ToDo

C. Assembly code

normal - strlen

.L5:
	addq	$1, -8(%rbp)
.L4:
	movq	-8(%rbp), %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L5
	movq	-8(%rbp), %rax
	subq	-24(%rbp), %rax

glib - strlen

fast_strlen:
.LFB6:
	.cfi_startproc
	endbr64
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	movq	%rdi, -56(%rbp)
	movq	-56(%rbp), %rax
	movq	%rax, -48(%rbp)
	jmp	.L2
.L5:
	movq	-48(%rbp), %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L3
	movq	-48(%rbp), %rax
	subq	-56(%rbp), %rax
	jmp	.L4
.L3:
	addq	$1, -48(%rbp)
.L2:
	movq	-48(%rbp), %rax
	andl	$7, %eax
	testq	%rax, %rax
	jne	.L5
	movq	-48(%rbp), %rax
	movq	%rax, -40(%rbp)
	movl	$2155905152, %eax
	movq	%rax, -32(%rbp)
	movq	$16843009, -24(%rbp)
	movq	-32(%rbp), %rax
	salq	$32, %rax
	orq	%rax, -32(%rbp)
	movq	-24(%rbp), %rax
	salq	$32, %rax
	orq	%rax, -24(%rbp)
.L14:
	movq	-40(%rbp), %rax
	leaq	8(%rax), %rdx
	movq	%rdx, -40(%rbp)
	movq	(%rax), %rax
	movq	%rax, -16(%rbp)
	movq	-16(%rbp), %rax
	subq	-24(%rbp), %rax
	movq	-16(%rbp), %rdx
	notq	%rdx
	andq	%rdx, %rax
	andq	-32(%rbp), %rax
	testq	%rax, %rax
	je	.L14
	movq	-40(%rbp), %rax
	subq	$8, %rax
	movq	%rax, -8(%rbp)
	movq	-8(%rbp), %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L7
	movq	-8(%rbp), %rax
	subq	-56(%rbp), %rax
	jmp	.L4
.L7:
	movq	-8(%rbp), %rax
	addq	$1, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L8
	movq	-8(%rbp), %rax
	subq	-56(%rbp), %rax
	addq	$1, %rax
	jmp	.L4
.L8:
	movq	-8(%rbp), %rax
	addq	$2, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L9
	movq	-8(%rbp), %rax
	subq	-56(%rbp), %rax
	addq	$2, %rax
	jmp	.L4
.L9:
	movq	-8(%rbp), %rax
	addq	$3, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L10
	movq	-8(%rbp), %rax
	subq	-56(%rbp), %rax
	addq	$3, %rax
	jmp	.L4
.L10:
	movq	-8(%rbp), %rax
	addq	$4, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L11
	movq	-8(%rbp), %rax
	subq	-56(%rbp), %rax
	addq	$4, %rax
	jmp	.L4
.L11:
	movq	-8(%rbp), %rax
	addq	$5, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L12
	movq	-8(%rbp), %rax
	subq	-56(%rbp), %rax
	addq	$5, %rax
	jmp	.L4
.L12:
	movq	-8(%rbp), %rax
	addq	$6, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L13
	movq	-8(%rbp), %rax
	subq	-56(%rbp), %rax
	addq	$6, %rax
	jmp	.L4
.L13:
	movq	-8(%rbp), %rax
	addq	$7, %rax
	movzbl	(%rax), %eax
	testb	%al, %al
	jne	.L14
	movq	-8(%rbp), %rax
	subq	-56(%rbp), %rax
	addq	$7, %rax
.L4:
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc

string.h - strlen

	subq	$16, %rsp
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rax
	movq	%rax, %rdi
	call	strlen@PLT
	leave

What is strlen@PLT?
PTL(procedure linkage table) makes dynamic loading and linking easier to use. How is glibc loaded at runtime?
This should be the implementation on the hardware. Maybe we should test the strlen() in different platform.
https://stackoverflow.com/questions/1733281/where-is-the-implementation-of-strlen-in-gcc

Q: Why can string.h run so fast?

//ToDo: Compare ARM vs Intel architecture

8. Application

strlen is an indeterminate function with time complexity O(n). In this section we will show what we can do or learn in glib-strlen implementation.

Example: upper the string

1. Do/Don't




//Don't do it:
for(int i = 0; i < strlen(str); i++){
    upperstr(str[i]);
}





// Do it:
len = strlen(str)
for(int i = 0; i < len; i++){
    upperstr(str[i]);
}

Test:























void upperstr(char *str){
  for(int i = 0; i < strlen(str); i++){
    str[i] &= 0b11011111; 
  }
}

int main(){
  char *str;
  clock_t t1, t2;

  str = gen_str(10000);
  
  t1 = clock();
  int testtime = 10000;
	for(int i=0; i<testtime; i++)
    upperstr(str);

	t2 = clock();
	printf("time: %f\n", (float)(t2 - t1)/1000);
	printf("avg time: %f ms \n", ((float)(t2 - t1)/(float)testtime)/1000);
  free(str);
	return 0;
}

cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o upper testupper.c 
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./upper 
time: 7018.982910
avg time: 0.701898 ms

cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -o upper testupper.c 
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./upper 
time: 196.768005
avg time: 0.019677 ms

2. upper string

ASCII :
A = 0x41 = 0b 0100 0000
a = 0x61 = 0b 0110 0000
…
Z = 0x5a = 0b 0101 1010
z = 0x7a = 0b 0111 1010
A single upper character can be transfered to a lower character by "or-ing a mask" (| 0b01000000). In other words, "and-ing a mask"(& 0b11011111) can let a lower character become a upper character.






























































































size_t fast_upperstr (char *str)
{
  char *char_ptr;
  unsigned long int *longword_ptr;
  unsigned long int longword, himagic, lomagic, mask;

  for (char_ptr = str; ((unsigned long int) char_ptr& (sizeof (longword) - 1)) != 0;++char_ptr){
    *char_ptr = (*char_ptr) & 0b11011111;
    if (*char_ptr == '\0')
      return char_ptr - str;
  }

  longword_ptr = (unsigned long int *) char_ptr;

  himagic = 0x80808080L;
  lomagic = 0x01010101L;
  mask    = 0xdfdfdfdfL;
  if (sizeof (longword) > 4)
    {
      himagic = ((himagic << 16) << 16) | himagic;
      lomagic = ((lomagic << 16) << 16) | lomagic;
      mask    = ((mask    << 16) << 16) | mask;
    }
  if (sizeof (longword) > 8)
    abort ();

  for (;;)
    {
      longword = *longword_ptr++;

      if (((longword - lomagic) & ~longword & himagic) != 0)
	    {
        char *cp = (char *) (longword_ptr - 1);

        if (cp[0] == 0)
          return cp - str;
        if (cp[1] == 0){
          cp[0] &= 0xdf;
          return cp - str + 1;
        }
        if (cp[2] == 0){
          cp[0] &= 0xdf;
          cp[1] &= 0xdf;
          return cp - str + 2;
        }
        if (cp[3] == 0){
          cp[0] &= 0xdf;
          cp[1] &= 0xdf;
          cp[2] &= 0xdf;
          return cp - str + 3;
        }
        if (sizeof (longword) > 4)
          {
            if (cp[4] == 0){
              cp[0] &= 0xdf;
              cp[1] &= 0xdf;
              cp[2] &= 0xdf;
              cp[3] &= 0xdf;
              return cp - str + 4;
            }
            if (cp[5] == 0){
              cp[0] &= 0xdf;
              cp[1] &= 0xdf;
              cp[2] &= 0xdf;
              cp[3] &= 0xdf;
              cp[4] &= 0xdf;
              return cp - str + 5;
            }
            if (cp[6] == 0){
              cp[0] &= 0xdf;
              cp[1] &= 0xdf;
              cp[2] &= 0xdf;
              cp[3] &= 0xdf;
              cp[4] &= 0xdf;
              cp[5] &= 0xdf;
              return cp - str + 6;
            }
            if (cp[7] == 0){
              cp[0] &= 0xdf;
              cp[1] &= 0xdf;
              cp[2] &= 0xdf;
              cp[3] &= 0xdf;
              cp[4] &= 0xdf;
              cp[5] &= 0xdf;
              cp[6] &= 0xdf;
              return cp - str + 7;
            }
	    }
	  }
    else{
      (* (longword_ptr-1)) &= mask;
    }
  }
}

Brench Test

Test upper a 100000 size string.

cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ gcc -std=c11 -o out branchtest-strupr.c 
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./out 
test normal - strlen:
time: 1798.147949
avg time: 0.179815 ms 
test fast - strlen:
time: 220.440002
avg time: 0.022044 ms

Output:

strupr() and strlwr() are not standard function, these functions are sometimes shipped with the implementation of the compiler on some platforms but generally it is recomemded to write your own custom functions to conver the string to upper and/or lower case.It is implementation-defined whether they are supported or not.

9. Conclusion

In C, strlen() shouldn't be call too much time.
Use string.h library's strlen() for the best performance.

10. Future Work

Faster strlen? Parallelize?
https://stackoverflow.com/questions/5119309/parallel-strlen
Illegal reading memory?
AddressSanitizerManualPoisoning

Fast string compare
fast-strcmp

Reference

As a software engineer, what's the most ingenious line of code you've seen?
GCC-buildin

TODO

(cont.)

uclib
C++
try ELF -> string lib