Optimized Strlen (cont.)

---
tags: SDE-good-read
topic: Optimized Strlen (cont.)
---

# Optimized Strlen (cont.)

## 1. uClibc vs glibc

[uClibc-ng](https://uclibc-ng.org/) -> https://git.uclibc-ng.org/git/uclibc-ng.git

![](https://i.imgur.com/x5JDfSN.png)

uClibc-ng is a small C library for developing embedded Linux systems. It is much smaller than the GNU C Library, but nearly all applications supported by glibc also work perfectly with uClibc-ng.

In `strlen()`, the test for "is there a zero byte in this long-sized word" differs between glibc and uClibc. [See Here](https://hackmd.io/@YLowy/HkQ68pjN5)

```c=
// uClibc
if (((longword - lomagic) & himagic) != 0)
```

vs

```c=
// glibc
if (((longword - lomagic) & ~longword & himagic) != 0)
```

### Compare: What is the difference between the two libraries?

#### CASE 1: No zero byte in a long-sized word

We will explain the magic in `strlen()`, using the uClibc test.

```c=
// uClibc
if (((longword - lomagic) & himagic) != 0)
```

##### 1. longword

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x63|0x63|0x63|0x63|0x63|0x63|0x63|0x63"];
}
```

##### 2. longword - lomagic (0x0101010101010101)

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x62|0x62|0x62|0x62|0x62|0x62|0x62|0x62"];
}
```

##### 3. (longword - lomagic) & himagic (0x8080808080808080)

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x62|0x62|0x62|0x62|0x62|0x62|0x62|0x62"];
}
```

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x80|0x80|0x80|0x80|0x80|0x80|0x80|0x80"];
}
```

---

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x00|0x00|0x00|0x00|0x00|0x00|0x00|0x00"];
}
```

```c=
// Result: ((longword - lomagic) & himagic) == 0
```

#### CASE 2: A zero byte in a long-sized word

##### 1. longword

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x63|0x00|0x63|0x63|0x63|0x63|0x63|0x63"];
}
```

##### 2. longword - lomagic (0x0101010101010101)

The zero byte borrows from its left neighbor, which is why that byte becomes 0x61 instead of 0x62.

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x61|0xFF|0x62|0x62|0x62|0x62|0x62|0x62"];
}
```

##### 3. (longword - lomagic) & himagic (0x8080808080808080)

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x61|0xFF|0x62|0x62|0x62|0x62|0x62|0x62"];
}
```

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x80|0x80|0x80|0x80|0x80|0x80|0x80|0x80"];
}
```

---

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x00|0x80|0x00|0x00|0x00|0x00|0x00|0x00"];
}
```

```c=
// Result: ((longword - lomagic) & himagic) != 0
```

In both cases above, the glibc expression with `& ~longword` gives the same answer.

#### Difference between uClibc and glibc

Is `& ~longword` necessary for `strlen()`? What does it mean in `strlen()`?

Consider the situation below:

```c=
char myString[] = "AMAZON SDE READ";
```

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="A|M|A|Z|O|N| |S"];
    node1 [fontsize=13, label ="D|E| |R|E|A|D|-"];
}
```

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x41|0x4D|0x41|0x5A|0x4F|0x4E|0x20|0x53"];
    node1 [fontsize=13, label ="0x44|0x45|0x20|0x52|0x45|0x41|0x44|0x00"];
}
```

Now suppose I decide to change the character "O" to "A" in AMAZON.

```c=
myString[4] = 'A';
```

However, I accidentally type the wrong value, and now the string contains a non-ASCII character. (P.S. More precisely, 0xF1 is not a standard ASCII character; it is an extended ASCII character.)

```c=
myString[4] = 0xF1;
```

So now we get a strange string, as shown below.

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="A|M|A|Z|?|N| |S"];
    node1 [fontsize=13, label ="D|E| |R|E|A|D|-"];
}
```

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x41|0x4D|0x41|0x5A|0xF1|0x4E|0x20|0x53"];
    node1 [fontsize=13, label ="0x44|0x45|0x20|0x52|0x45|0x41|0x44|0x00"];
}
```

In this case, the `strlen()` implementations of uClibc and glibc behave differently on the first word, even though both return the same length in the end.

##### uClibc

1. Find the start point.
2. Loop to find a zero byte in each word-sized substring.

##### 1. longword

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x41|0x4D|0x41|0x5A|0xF1|0x4E|0x20|0x53"];
}
```

##### 2. longword - lomagic (0x0101010101010101)

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x40|0x4C|0x40|0x59|0xF0|0x4D|0x1F|0x52"];
}
```

##### 3. (longword - lomagic) & himagic (0x8080808080808080)

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x40|0x4C|0x40|0x59|0xF0|0x4D|0x1F|0x52"];
}
```

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x80|0x80|0x80|0x80|0x80|0x80|0x80|0x80"];
}
```

---

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="0x00|0x00|0x00|0x00|0x80|0x00|0x00|0x00"];
}
```

The uClibc test evaluates to true for this word, although the word contains no zero byte. The function therefore has to go on and check, byte by byte, whether the word really contains a zero character.

glibc's `strlen()` does not need that extra check because of `& ~longword`: a byte whose highest bit is already set in `longword` (such as 0xF1) is masked out of the result.

1. Case: plain ASCII string

   Leaving out `& ~longword` saves work on every word, so uClibc performs better.

   ![](https://i.imgur.com/IjbGXT7.png)

2. Case: extended ASCII string

   However, if the string is full of extended ASCII characters, uClibc has to run the byte-by-byte zero check inside the loop again and again, which has a negative impact on performance.

   ![](https://i.imgur.com/Fbcx3ZX.png)

## 2. C++ strlen

In C++, we have two types of strings:

1. C-style strings
2. `std::string` (the string class from the C++ Standard Library)

### How to use C-style strings

Use them in C++ code by including the `<cstring>` header.
```cpp=
#include <iostream>
#include <cstring>

int main()
{
    char str[] = "This is a C-style string";
    std::cout << str << "\n";
    std::cout << "string's size: " << std::strlen(str) << "\n";
}
```

```
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ g++ -o outcpp strlentest.cpp
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./outcpp
This is a C-style string
string's size: 24
```

### How to use std::strings

C-style strings are relatively unsafe: if the string is missing its terminating 0x00, functions such as `strlen()` keep reading past the end of the buffer, which can lead to a whole host of potential bugs. The `std::string` class provided by the C++ Standard Library is a much safer alternative.

```cpp=
#include <iostream>
#include <string>

int main()
{
    std::string str = "This is a C++ string class";
    std::cout << str << "\n";
    std::cout << "string's size: " << str.length() << "\n";
}
```

```
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ g++ -o outcpp strlentest.cpp
cheyenyu@u49049006de455c:~/Desktop/SDEGoodRead$ ./outcpp
This is a C++ string class
string's size: 26
```

A `std::string` object does not scan for the terminator; `length()` simply returns a private member, so it is O(1).

```cpp=
// Excerpt from libstdc++
/// null-termination.
size_type
length() const _GLIBCXX_NOEXCEPT
{ return _M_string_length; }
```

![](https://i.imgur.com/4uPiZRa.png)

![](https://i.imgur.com/AAggjoD.png)

---

https://www.youtube.com/watch?v=kPR8h4-qZdk&t=53s&ab_channel=CppCon

### std::string

![](https://i.imgur.com/P2O2Lgp.png)

SSO (Small String Optimization) & CoW (Copy on Write)

```cpp=
// Simplified layout sketch
class string {
    char *start;                        // points at the characters
    size_t size;                        // current length
    static const int kLocalSize = 15;
    union {
        char buffer[kLocalSize + 1];    // small string: stored in place
        size_t capacity;                // large string: capacity of the heap block
    } data;
};
```

**Small String**

```graphviz
digraph G{
    node [shape = record];
    node0 [fontsize=13, label ="{{-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-}|char *start|size_t size|{S|D|E|G|O|O|D|R}|{E|A|D|/0|X|X|X|X}}"];
}
```

**Large String**

```graphviz
digraph G{
    node [shape = record];
    A [fontsize=13, label ="{{-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-}|<A1>char *start |size_t size|size_t capacity |{unused}}"];
    B [fontsize=13, label ="{12|size}|{30|capacity}|{ X|refcnt}|{<B1>SDEGOODREAD|string in heap}|{|}"]
    A:A1->B:B1
}
```

### folly::fbstring

`folly/FBString.h`

```cpp=
struct RefCounted {
    std::atomic<size_t> refCount_;
    Char data_[1];

    static RefCounted * create(size_t * size);
    static RefCounted * create(const Char * data, size_t * size);
    static void incrementRefs(Char * p);
    static void decrementRefs(Char * p);
};

struct MediumLarge {
    Char* data_;
    size_t size_;
    size_t capacity_;

    size_t capacity() const {
        return kIsLittleEndian ? capacity_ & capacityExtractMask : capacity_ >> 2;
    }

    void setCapacity(size_t cap, Category cat) {
        capacity_ = kIsLittleEndian
            ? cap | (static_cast<size_t>(cat) << kCategoryShift)
            : (cap << 2) | static_cast<size_t>(cat);
    }
};

union {
    uint8_t bytes_[sizeof(MediumLarge)]; // For accessing the last byte.
    Char small_[sizeof(MediumLarge) / sizeof(Char)];
    MediumLarge ml_;
};
```

**Small String (1-23)**

```graphviz
digraph G{
    node [shape = record];
    A [fontsize=13, label ="{{-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-}|{S|D|E|G|O|O|D|R}|{E|A|D|/0|-|-|-|-}|{-|-|-|-|-|-|-|<star>*}}"];
    B [fontsize=13, label ="{{<B1>-0-|-1-|-2-|-3-|-4-|-5-|-6-|<B3>-7-}|{0|<B2>0|-|s|i|z|<B4>e|-}}"]
    C [fontsize=13, label ="00 = small string"]
    D [fontsize=13, label ="size = 23 - strlen"]
    A:star -> B:B1
    A:star -> B:B3
    B:B2->C
    B:B4->D
}
```

**Medium String (24-255)**

```graphviz
digraph G{
    node [shape = record];
    A [fontsize=13, label ="{{0|1|2|3|4|5|6|7}|<A1>char* data_|size_t size_|size_t capacity_}"];
    B [fontsize=13, label ="{<B1>SDEGOODREAD|string in heap}"]
    A:A1 -> B:B1
}
```

**Large String (more than 255)**

```graphviz
digraph G{
    node [shape = record];
    A [fontsize=13, label ="{{0|1|2|3|4|5|6|7}|<A1>char* data_|size_t size_|size_t capacity_}"];
    B [fontsize=13, label ="{12|size}|{30|capacity}|{ X|refcnt}|{<B1>SDEGOODREAD|string in heap}|{|}"]
    A:A1 -> B:B1
}
```

## 3. Try: ELF -> string lib

```c=
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *str = "SDE Good Read";
    size_t len = strlen(str);   /* use the result so the call is not discarded */
    printf("%zu\n", len);
    return 0;
}
```

```
$ gcc -g -static -o sl2 sl2.c
$ objdump -d -M intel -S sl2
```

## References

https://www.lookuptables.com/

![](https://i.imgur.com/1yHGes9.png)

![](https://i.imgur.com/7cTKWmv.png)