[閱讀筆記]Optimize string use: a case study === ###### tags: `study`, `c/c++` 文章網址: [Optimize string use: a case study](https://www.oreilly.com/content/optimize-string-use-case-study/) ## Why Strings Are a Problem > Furthermore, the behavior of std::string has been changed over the years to keep up with changes in the C++ standard. This means that a conforming std::string implementation from a C\++98 compiler may not behave the same way as a std::string implementation after C++11. > Strings have some behaviors that make them expensive to use, no matter the implementation. They are dynamically allocated, they behave as values in expressions, and their implementation requires a lot of copying. ### Strings Are Dynamically Allocated std:string是動態分配memory的,使用到了大量的複製,動態配置的成本極高,在C裡面動態配置的寫法類似下方這樣: ```c char* p = (char*) malloc(7); strcpy(p, "string"); ... free(p); ``` 不過,string的內部mem space(internal character buffer)仍然是固定的,當操作string造成空間超出原本配置的大小時,會重新分配一段空間,在將內容複製過去。 > std::string implementations do a trick to amortize the cost of reallocating storage for the character buffer as the string grows. Instead of making a request to the memory manager for the exact number of characters needed, the string implementation rounds up the request to some larger number of characters. For instance, some implementations round up the request to the next power of 2. The string then has the capacity to grow to twice its current size before needing to call the into the memory manager again. The next time an operation needs to extend the length of the string, there is room in the existing buffer, avoiding the need to allocate a new buffer. The benefit of this trick is that the cost of appending characters to a string approaches a constant asymptotically as the string grows longer. The cost of this trick is that strings carry around some unused space. If the string implements a policy of rounding up requests to a power of 2, up to half the storage in a string may be unused. - 一種技巧是,每次都為請求的空間配置搭約多兩倍的記憶體(round up the request to the next power of 2),從而降低了重新配置的次數 - 不過缺點就是可能會有一些空間沒使用到 ### Strings Are Values strings是values,如同下方integer是values的範例: ```cpp= int i,j; i = 3; // i has the value 3 j = i; // j also has the value 3 i = 5; // i now has the value 5, j still has the value 3 ``` string也有相同的特性 ```cpp= std::string s1, s2; s1 = “hot”; // s1 is "hot" s2 = s1; // s2 is "hot" s1[0] = 'n'; // s2 is still "hot", s1 is "not" ``` > Because strings are values, the results of string expressions are also values. If you concatenate strings, as in the statement s1 = s2 + s3 + s4;, the result of s2 + s3 is a newly allocated temporary string value. The result of concatenating s4 to this temporary string is another temporary string value. This value replaces the previous value of s1. Then the dynamically allocated storage for the first temporary string and the previous value of s1 are freed. This adds up to a lot of calls into the memory manager. - 所以在對string做操作的時候,並不是更改原本memory的value,而是先去產生一個新的string儲存結果,然後在把原本的memory釋放掉。因此string的相關操作會大量的呼叫到memory manager ### Strings Do a Lot of Copying Copy on Write: 讓string都先指向相同的空間,透過counter來統計複製的數量,有需要修改時在另外開一個空間配置 > There is a well-known programming idiom for things that behave as values but are expensive to copy. It is called “copy on write,” and often abbreviated COW in C++ literature (see not available). In a COW string, the dynamically allocated storage can be shared between strings. A reference count lets each string know if it is using shared storage. When one string is assigned to another, only a pointer is copied, and the reference count is incremented. Any operation that changes a string’s value first checks to see that there is only one pointer to that storage. If multiple strings point to the storage, any mutating operation (any operation that may change the contents of the string) allocates new storage and makes a copy of the string before making its change ```cpp= COWstring s1, s2; s1 = "hot"; // s1 is "hot" s2 = s1; // s2 is "hot" (s1 and s2 point to the same storage) s1[0] = 'n'; // s1 makes a new copy of its storage before // changing anything // s2 is still "hot", s1 is "not" ``` - 儘管這樣在assignment和argument-passing時cost很低,但non-const references和any call to a mutating function的cost就很大 - 在平行化時,Cow的cost也很大,需要確保同時間不會有人更改共同的空間 > In C++11 and later, the burden of copying is somewhat reduced by the presence of rvalue references and the move semantics (see not available) that go with them. If a function takes an rvalue reference as argument, the string can do an inexpensive pointer copy when the actual argument is an rvalue expression, saving one copy. ## Case Study: First Attempt at Optimizing Strings 考慮以下code: ```cpp= std::string remove_ctrl(std::string s) { std::string result; for (int i=0; i<s.length(); ++i) { if (s[i] >= 0x20) result = result + s[i]; } return result; } ``` - 糟糕透了,這僅僅只是一段可以執行的危險代碼,因為有著對大量記憶體的更動,每次更改string都會重新產生一個memory去更改,然後再把原本的刪除。 - 假設string長度100,則會create storage和release storage各100次 - 此外,根據assignment的實作不同,可能會有更多的memory call > - If strings are implemented using the copy-on-write idiom, then the assignment operator performs an efficient pointer copy and increments the reference count. >- If strings have a non–shared buffer implementation, then the assignment operator must copy the contents of the temporary string. If the implementation is naïve, or result‘s buffer does not have enough capacity, then the assignment operator also allocates a new buffer to copy into. This results in 100 copy operations and as many as 100 additional allocations. >- If the compiler implements C++11-style rvalue references and move semantics, then the fact that the concatenation expression’s result is an rvalue allows the compiler to call result‘s move constructor instead of its copy constructor. The result is that the program performs an efficient pointer copy. 在作者的設備上,each call took 24.8 microseconds,數字沒有意義,僅作為後續優化的比較參考值,接下來作者會介紹如何一步步的優化 ### Use Mutating String Operations to Eliminate Temporaries ```cpp= std::string remove_ctrl_mutating(std::string s) { std::string result; for (int i=0; i<s.length(); ++i) { if (s[i] >= 0x20) result += s[i]; } return result; } ``` > This improvement comes from eliminating all the calls to allocate temporary string objects to hold the concatenation expression result, and the associated copying and deleting of temporaries. Depending on the string implementation, allocation and copying on assignment are also eliminated. - 修改第五行的concatenation expression,省去了暫時儲存計算結果的 temporary object allocate, 存回去的copying和釋放的deleting。 現在each call是1.72 microseconds per call,進步了13倍 ### 從source code理解上述兩種寫法的差異 看不懂上面這兩個差在哪?來看source code 1. 從[libstdc++](https://gcc.gnu.org/onlinedocs/gcc-4.8.2/libstdc++/api/a01053_source.html)可以看到`operator+=`其實就是呼叫了`string::append()` ```cpp=926 ... // Modifiers: /** * @brief Append a string to this string. * @param __str The string to append. * @return Reference to this string. */ basic_string& operator+=(const basic_string& __str) { return this->append(__str); } ``` 2. 來看看:eye::eye:[`string:append()`](https://android.googlesource.com/platform/external/astl/+/9cb0478662a7c988146fff0d868bba2839ea80f2/src/string.cpp)都在做啥 ```cpp=269 void string::Append(const value_type *str, size_type n) { const size_type total_len = mLength + n; // n > 0 and no overflow for the string length + terminating null. if (n > 0 && (total_len + 1) > mLength) { if (total_len > mCapacity) { reserve(total_len); if (total_len > mCapacity) { // something went wrong in the reserve call. return; } } memcpy(mData + mLength, str, n); mLength = total_len; mData[mLength] = '\0'; } } ``` 原來就是先用`reserve()`string的capacity變大,然後memcpy過去,這樣就不用產生一個占存的memory space先記錄改變的結果,然後在assign了(這就是`operator+`在做的事情) 3.來:eye::eye:[`operator+`](https://gcc.gnu.org/onlinedocs/gcc-4.6.2/libstdc++/api/a01075_source.html)在幹啥 ```cpp=00881 // operator+ /** * @brief Concatenate two strings. * @param __lhs First string. * @param __rhs Last string. * @return New string with value of @a __lhs followed by @a __rhs. */ template<typename _CharT, typename _Traits, typename _Alloc> basic_string<_CharT, _Traits, _Alloc> operator+(const basic_string<_CharT, _Traits, _Alloc>& __lhs, const basic_string<_CharT, _Traits, _Alloc>& __rhs) { basic_string<_CharT, _Traits, _Alloc> __str(__lhs); __str.append(__rhs); return __str; } ``` 注意到他先創造了一個新的variable `basic_string`,然後初始值放其中一個string,才把另一個變數進行`append` ### Reduce Reallocation by Reserving Storage 上面提到,string空間不足時的re-allocate也是成本,透過reserve()是確認好大小以減少re-allocate的次數。 ```cpp= std::string remove_ctrl_reserve(std::string s) { std::string result; result.reserve(s.length()); for (int i=0; i<s.length(); ++i) { if (s[i] >= 0x20) result += s[i]; } return result; } ``` > Not only does use of reserve() eliminate reallocation of the string buffer, but it also improves the cache locality of the data accessed by the function - 除了降低重新配置的cost,也增加了cache locality,使得data能夠更快速的存取 A test using remove_ctrl_reserve() consumes 1.47 microseconds per call, an improvement of 17% over remove_ctrl_mutating(). ### Eliminate Copying of String Arguments > When a string expression is passed into a function by value, the formal argument (in this case, s) is copy-constructed. 在function傳遞參數時,參數會被copy,根據string的implementation,copy的動作有多種可能的情況: > - If strings are implemented using the copy-on-write idiom, then the compiler generates a call to the copy constructor, which performs an efficient pointer copy and increments the reference count. > - If strings have a nonshared buffer implementation, then the copy constructor must allocate a new buffer and copy the contents of the actual argument. > - If the compiler implemented C++11-style rvalue references and move semantics, then if the actual argument is an expression, it will be an rvalue, so the compiler will generate a call to the move constructor, resulting in an efficient pointer copy. If the actual argument is a variable, then the formal argument’s copy constructor is called, resulting in an allocation-and-copy. Rvalue references and move semantics are described in more detail in not available. ```cpp std::string remove_ctrl_ref_args(std::string const& s) { std::string result; result.reserve(s.length()); for (int i=0; i<s.length(); ++i) { if (s[i] >= 0x20) result += s[i]; } return result; } ``` - 由於s不會被修改,所以不用另外copy一份,透過const reference來節省空間 > The result is a surprise. The timing test of remove_ctrl_ref_args() took 1.60 microseconds per call, 8% worse than remove_ctrl_reserve(). > ...Reference variables are implemented as pointers. So, everywhere s appears in remove_ctrl_ref_args(), the program dereferences a pointer that it did not have to dereference in remove_ctrl_reserve(). I hypothesize that this extra work might be enough to account for the reduced performance. - 但在這一步的結果卻沒有優化到code(在Visual Studio上),作者推測是VS對於copy有額外做一些事情,還有因為reference是透過dereference pointer實作的,在dereference上可能有一些額外的工作增加了時間 ### Eliminate Pointer Dereference Using Iterators > String iterators are simple pointers into the character buffer. That saves two dereference operations versus the non-iterator code in the loop. - string iterators直接指向string的char buffer,所以不用在額外做兩次的dereference(loop一次合併一次) ```cpp= std::string remove_ctrl_ref_args_it(std::string const& s) { std::string result; result.reserve(s.length()); for (auto it=s.begin(),end=s.end(); it != end; ++it) { if (*it >= 0x20) result += *it; } return result; } ``` > The timing test for remove_ctrl_ref_args_it() produced a satisfying result of 1.04 microseconds per call. > remove_ctrl_ref_args_it() contains one other optimization of note. The value s.end(), used to control the for loop, is cached on loop initialization. This saves another 2n indirections, where n is the length of the argument string. - 另一種版本的優化是,把`s.end()`位址先記錄下來,就不用每次loop都重找一次 #### 為什麼是2n? 來看看`string::end()`的[source code](https://code.woboq.org/llvm/libcxx/include/string.html),可以看到每次呼叫`end()`,都會取得原始的pointer + size(),這個別需要一次operation(可以在追下去看`__get_pointer`和`size()`各別做了什麼事情),迴圈執行了n次,所以2n可能是這樣來的 ```cpp=910 const_iterator end() const _NOEXCEPT {return const_iterator(this, __get_pointer() + size());} ``` ### Eliminate Copying of Returned String Values 為了回傳結果,function內必須要create一個變數,最後在將該變數copy給等待接收的變數中。有些compiler會幫忙做優化(though the compiler is permitted to elide (that is, simplify by removing) the copy construction if it can),不過如果要確保沒有copy,有兩種方法,一種是所有C++版本都適用的,也就是call function的時候同時reference一個保存結果的string進來(這也是編譯器優化時在做的事情) ```cpp void remove_ctrl_ref_result_it( std::string& result, std::string const& s) { result.clear(); result.reserve(s.length()); for (auto it=s.begin(),end=s.end(); it != end; ++it) { if (*it >= 0x20) result += *it; } } ``` > Measured performance of remove_ctrl_ref_result_it() is 1.02 microseconds per call, about 2% faster than the previous version. > 不過這種用法要注意,比方說下列的情況就會得到不正確的結果(return empty string) ```cpp std::string foo("this is a string"); remove_ctrl_ref_result_it(foo, foo); ``` ### Use Character Arrays Instead of Strings 要效能?用C++幹嘛,用C阿 > To use the C-style string functions, the programmer must choose either to manually allocate and free character buffers, or to use static arrays dimensioned to worst-case sizes - 不過C需要手動配置和釋放記憶體,以及以worst-case決定static array size > Declaring a bunch of static arrays is problematic if memory is at all constrained. However, there is usually room to statically declare large temporary buffers in local storage (that is, on the function call stack). These buffers are reclaimed at negligible runtime cost when the function exits. Except in the most constrained embedded environments, it is no problem to declare a worst-case buffer of 1,000 or even 10,000 characters on the stack. - 在記憶體有限的狀況下,宣告一堆static array是不好的,不過在function內宣告的memory由於function結束時會釋放出來,除非是真的記憶體非常受限的狀況下(embedded system),不然宣告個1000~10000的 char空間還是沒問題的 ```cpp void remove_ctrl_cstrings(char* destp, char const* srcp, size_t size) { for (size_t i=0; i<size; ++i) { if (srcp[i] >= 0x20) *destp++ = srcp[i]; } *destp = 0; } ``` > remove_ctrl_cstrings() took 0.15 microseconds per call in the timing test. This is 6 times faster than its predecessor, and an astonishing 170 times faster than the original function. One reason for the improvement is the elimination of several function calls, with a corresponding improvement in cache locality. - 比上一版快了6倍,比最原始版本快了170倍,我的老天鵝阿! - 減少了大量的function call和增加了cache locality(char* 是連續的memory space) ### Stop and Think 儘管過程中一步步的優化了程式碼,但這些效能可能是犧牲了簡單性和安全性換來的,比方說 - remove_ctrl_ref_result_it()具有潛在的不正確使用方式 - remove_ctrl_cstrings()需要手動管理char*的記憶體空間,設計上較複雜 > C++ offers developers a range of choices between simple and safe code that is slow and radically fast code that must be used carefully. Advocates of other languages may call this a weakness, but for optimization, it is one of the greatest strengths of C++. > > In the end, the team must answer the question, “How much do we need this performance improvement?” ### Summary of First Optimization Attempt 作者也比較了debug mode 和 release mode的差異,能看出好的coding對於程式效能真的影響很大 ![](https://i.imgur.com/w1QD7au.png)