---
post_title: A Tour of 4 MSVC Backend Improvements
username: sybrand@microsoft.com
microsoft_alias: sybrand
featured_image: ../../../common_images/cplusplusfeature.png
categories: C++, performance
summary: We hear that many of you would like to see more details of the improvements which the MSVC backend team have been working on recently. This blog post presents some of the optimisations the team has implemented for Visual Studio 2022.
desired_publication_date: 2022-11-28
---

We hear that many of you would like to see more details of the improvements which the MSVC backend team have been working on recently. This blog post presents some of the optimisations the team has implemented for Visual Studio 2022. It was co-written by one of our backend team leads, Eric Brumer, and our developer advocate, Sy Brand.

## Byteswap Identification

Changing the [endianness](https://en.wikipedia.org/wiki/Endianness) of an integer can be an important operation in contexts where data is being transmitted between processors with different byte orders, or over the network. This is often referred to as a "byteswap". C++23 will add a [`std::byteswap`](https://en.cppreference.com/w/cpp/numeric/byteswap) function to the standard library, but for codebases today it's common to see custom implementations, which might look something like this:

```c++
int byteswap(int n) {
    return ((n & 0xff) << 24u) |
           ((n & 0xff00) << 8u) |
           ((n & 0xff0000) >> 8u) |
           ((n & 0xff000000) >> 24u);
}
```

In Visual Studio 2019 16.11, this generated the following code for x64, and a similarly long sequence of instructions for Arm64:

```asm
mov eax, ecx
mov edx, ecx
and eax, 65280
shl edx, 16
or eax, edx
mov edx, ecx
shl eax, 8
sar edx, 8
and edx, 65280
shr ecx, 24
or eax, edx
or eax, ecx
ret 0
```

x64 and Arm64 both have instructions which carry out a byteswap, which should ideally be used instead. Visual Studio 2022 17.3 introduced automatic byteswap identification, and now outputs the following on x64 with `/O1`:

```asm
bswap ecx
mov eax, ecx
ret 0
```

On Arm64, the results are much the same:

```asm
rev w0,w0
ret
```

This optimization isn't just doing a simple pattern match on your code, it's a generalised bit tracker. For example, this more obfuscated byteswap still gets optimized to a single `rev` on Arm64:

```c++
unsigned __int64 byteswap_wat_edition(unsigned __int64 x) {
    x = _byteswap_uint64(x);
    x = (x & 0xaaaaaaaaaaaaaaaa) >> 1 | (x & 0x5555555555555555) << 1;
    x = (x & 0xcccccccccccccccc) >> 2 | (x & 0x3333333333333333) << 2;
    x = (x & 0xf0f0f0f0f0f0f0f0) >> 4 | (x & 0x0f0f0f0f0f0f0f0f) << 4;
    return x;
}
```

## Loop Unswitching

For the sake of code simplicity, we may choose to write a branch inside a loop whose condition does not change between iterations. For example, if we want to pet all our cats, and if it's feeding time, we also want to feed them, we could write this:

```cpp
void process_cats(std::span<cat> cats, bool is_feeding_time) {
    for (auto&& c : cats) {
        if (is_feeding_time) {
            feed(c);
        }
        pet(c);
    }
}
```

Since `is_feeding_time` never changes inside the function, it may be wasteful to carry out that check every iteration. Better hardware [branch predictors](https://en.wikipedia.org/wiki/Branch_predictor) will minimise this impact, but in many cases we can see performance wins by carrying out [loop unswitching](https://en.wikipedia.org/wiki/Loop_unswitching).
This hoists the branch outside of the loop, generating code similar to this C++:

```cpp
void process_cats(std::span<cat> cats, bool is_feeding_time) {
    if (is_feeding_time) {
        for (auto&& c : cats) {
            feed(c);
            pet(c);
        }
    }
    else {
        for (auto&& c : cats) {
            pet(c);
        }
    }
}
```

As of Visual Studio 2022 version 17.1, MSVC may carry out loop unswitching at `/O2`. This is a [heuristic](https://en.wikipedia.org/wiki/Heuristic_(computer_science))-driven optimisation which may or may not be carried out depending on analysis results, because loop unswitching duplicates the loop body, which may affect instruction cache performance. You may pass the `/Qvec-report:1` flag to see a report on which loops were unswitched.

Here is the x64 generated by Visual Studio 2019 version 16.11 with `/O2`:

```asm
$LN15:
    mov QWORD PTR [rsp+8], rbx
    mov QWORD PTR [rsp+16], rsi
    push rdi
    sub rsp, 32
    mov rbx, QWORD PTR [rcx]
    movzx esi, dl
    mov rax, QWORD PTR [rcx+8]
    lea rdi, QWORD PTR [rbx+rax*4]
    cmp rbx, rdi
    je SHORT $LN3@process_ca
$LL4@process_ca:
    test sil, sil
    je SHORT $LN5@process_ca
    mov ecx, DWORD PTR [rbx]
    call void feed(cat)
$LN5@process_ca:
    mov ecx, DWORD PTR [rbx]
    call void pet(cat)
    add rbx, 4
    cmp rbx, rdi
    jne SHORT $LL4@process_ca
$LN3@process_ca:
    mov rbx, QWORD PTR [rsp+48]
    mov rsi, QWORD PTR [rsp+56]
    add rsp, 32
    pop rdi
    ret 0
```

The key lines to note here are the ones directly under the `$LL4@process_ca` label, which carry out the test on `is_feeding_time` and branch on the result, all inside the loop.

Here is the code generated now with `/O2`:

```asm
$LN22:
    mov QWORD PTR [rsp+8], rbx
    push rdi
    sub rsp, 32
    mov rbx, QWORD PTR [rcx]
    mov rax, QWORD PTR [rcx+8]
    lea rdi, QWORD PTR [rbx+rax*4]
    cmp rbx, rdi
    je SHORT $LN14@process_ca
    test dl, dl
    je SHORT $LL11@process_ca
    npad 2
$LL4@process_ca:
    mov ecx, DWORD PTR [rbx]
    call void feed(cat)
    mov ecx, DWORD PTR [rbx]
    call void pet(cat)
    add rbx, 4
    cmp rbx, rdi
    jne SHORT $LL4@process_ca
    mov rbx, QWORD PTR [rsp+48]
    add rsp, 32
    pop rdi
    ret 0
$LL11@process_ca:
    mov ecx, DWORD PTR [rbx]
    call void pet(cat)
    add rbx, 4
    cmp rbx, rdi
    jne SHORT $LL11@process_ca
$LN14@process_ca:
    mov rbx, QWORD PTR [rsp+48]
    add rsp, 32
    pop rdi
    ret 0
```

The code is now longer because two versions of the loop body are generated, under the `$LL4@process_ca` and `$LL11@process_ca` labels. But also note that the branch occurs in the entry block of the function and selects between the two loop body versions:

```asm
    cmp rbx, rdi
    je SHORT $LN14@process_ca
    test dl, dl
    je SHORT $LL11@process_ca
```

## Min/Max Chains

We have improved optimization of chains of `std::min` and `std::max` as of Visual Studio 2022 version 17.0.

Say we have three blankets to choose from to give our cat. The one we pick shouldn't be too hard. It shouldn't be too soft. It should be just right. We could write a function to give us the Just Right blanket from three, by picking the middle one.
It could look something like this:

```c++
using softness = float;
softness just_right_blanket(softness a, softness b, softness c) {
    return std::max(std::min(a,b), std::min(std::max(a,b),c));
}
```

In VS2019, this code was generated for x64 with `/O2 /fp:fast`:

```asm
comiss xmm1, xmm0
lea rcx, QWORD PTR a$[rsp]
lea rax, QWORD PTR b$[rsp]
lea rdx, QWORD PTR a$[rsp]
movss DWORD PTR [rsp+16], xmm1
movss DWORD PTR [rsp+8], xmm0
cmovbe rax, rcx
movss DWORD PTR [rsp+24], xmm2
lea rcx, QWORD PTR c$[rsp]
comiss xmm2, DWORD PTR [rax]
cmovae rcx, rax
lea rax, QWORD PTR b$[rsp]
comiss xmm0, xmm1
cmovbe rax, rdx
movss xmm1, DWORD PTR [rax]
comiss xmm1, DWORD PTR [rcx]
cmovb rax, rcx
movss xmm0, DWORD PTR [rax]
ret 0
```

Arm64 codegen is similarly inefficient. Both x64 and Arm64 have single instructions for scalar floating point min and max, which we now use in VS2022 at `/O1` and above with `/fp:fast`. Here is the x64 code now:

```asm
movaps xmm3, xmm0
maxss xmm0, xmm1
minss xmm3, xmm1
minss xmm0, xmm2
maxss xmm0, xmm3
ret 0
```

And for Arm64:

```asm
fmax s16,s0,s1
fmin s17,s0,s1
fmin s18,s16,s2
fmax s0,s18,s17
ret
```

## Backwards Loop Vectorization

I run a cat shelter with 32 cats and want to count how many crunchies they leave behind in their bowls after mealtime. So I write a function which takes a pointer to the first bowl and sums the leftovers like so (yes, I know I could use `std::accumulate`):

```c++
int count_leftovers(int* bowl_ptr) {
    int result = 0;
    for (int i = 0; i < 32; ++i, ++bowl_ptr) {
        result += *bowl_ptr;
    }
    return result;
}
```

This all works and generates good code! But then I realise that my desk is actually at the far end of the room, so I need to walk all the way to the start of the line to begin counting. I decide to instead take a pointer to the *last* bowl and work backwards:

```c++
int count_leftovers(int* bowl_ptr) {
    int result = 0;
    // change ++bowl_ptr to --bowl_ptr
    for (int i = 0; i < 32; ++i, --bowl_ptr) {
        result += *bowl_ptr;
    }
    return result;
}
```

Unfortunately, if I were using VS2019, this loop would not be [vectorized](https://en.wikipedia.org/wiki/Automatic_vectorization). Here is the code generated with `/O2`:

```asm
xor eax, eax
mov edx, eax
mov r8d, eax
mov r9d, eax
add rcx, -8
lea r10d, QWORD PTR [rax+8]
$LL4@count_left:
    add eax, DWORD PTR [rcx+8]
    add r9d, DWORD PTR [rcx+4]
    add r8d, DWORD PTR [rcx]
    add edx, DWORD PTR [rcx-4]
    lea rcx, QWORD PTR [rcx-16]
    sub r10, 1
    jne SHORT $LL4@count_left
lea ecx, DWORD PTR [rdx+r8]
add ecx, r9d
add eax, ecx
ret 0
```

The loop is [unrolled](https://en.wikipedia.org/wiki/Loop_unrolling) but it is not vectorized. As of VS2022 17.1, vectorization for backwards-strided loops is enabled at `/O1`. The code generated will depend a lot on the flags you use, particularly the [`/arch`](https://learn.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-170) flag for enabling use of [AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) instructions instead of the default [SSE2](https://en.wikipedia.org/wiki/SSE2) ones.
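As an aside on that `std::accumulate` parenthetical above: the backwards walk can also be written with standard library facilities. Here is a minimal sketch using reverse iterators over the same 32 bowls; the `count_leftovers_std` name and the use of `std::make_reverse_iterator` are my own illustration, not code from the compiler team:

```c++
#include <iterator>
#include <numeric>

// Minimal sketch: sum the 32 bowls starting from a pointer to the *last* bowl,
// walking backwards with reverse iterators instead of a hand-written loop.
int count_leftovers_std(int* bowl_ptr) {
    auto first = std::make_reverse_iterator(bowl_ptr + 1);  // first element visited is *bowl_ptr
    auto last  = std::make_reverse_iterator(bowl_ptr - 31); // stops after the first bowl has been added
    return std::accumulate(first, last, 0);
}
```

Whether this spelling gets the same treatment from the vectorizer comes down to the same analysis, so check the generated code if it matters to you.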
Here is the code generated for the backwards loop with `/O2 /arch:AVX2`:

```asm
vpxor xmm2, xmm2, xmm2
vpxor xmm3, xmm3, xmm3
mov eax, 2
npad 3
$LL4@count_left:
    vmovd xmm1, DWORD PTR [rcx]
    vpinsrd xmm1, xmm1, DWORD PTR [rcx-4], 1
    vpinsrd xmm1, xmm1, DWORD PTR [rcx-8], 2
    vpinsrd xmm1, xmm1, DWORD PTR [rcx-12], 3
    vmovd xmm0, DWORD PTR [rcx-16]
    vpinsrd xmm0, xmm0, DWORD PTR [rcx-20], 1
    vpinsrd xmm0, xmm0, DWORD PTR [rcx-24], 2
    vpinsrd xmm0, xmm0, DWORD PTR [rcx-28], 3
    lea rcx, QWORD PTR [rcx-64]
    vinsertf128 ymm0, ymm1, xmm0, 1
    vmovd xmm1, DWORD PTR [rcx+32]
    vpinsrd xmm1, xmm1, DWORD PTR [rcx+28], 1
    vpinsrd xmm1, xmm1, DWORD PTR [rcx+24], 2
    vpinsrd xmm1, xmm1, DWORD PTR [rcx+20], 3
    vpaddd ymm2, ymm0, ymm2
    vmovd xmm0, DWORD PTR [rcx+16]
    vpinsrd xmm0, xmm0, DWORD PTR [rcx+12], 1
    vpinsrd xmm0, xmm0, DWORD PTR [rcx+8], 2
    vpinsrd xmm0, xmm0, DWORD PTR [rcx+4], 3
    vinsertf128 ymm0, ymm1, xmm0, 1
    vpaddd ymm3, ymm0, ymm3
    sub rax, 1
    jne $LL4@count_left
vpaddd ymm0, ymm3, ymm2
vphaddd ymm1, ymm0, ymm0
vphaddd ymm2, ymm1, ymm1
vextracti128 xmm0, ymm2, 1
vpaddd xmm0, xmm2, xmm0
vmovd eax, xmm0
vzeroupper
ret 0
```

This both unrolls and vectorizes the loop. Fully explaining AVX2 vector instructions is a job for a different blog post (maybe check out [this one](https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX)), but the basic idea is that all those [`vpinsrd`](https://www.felixcloutier.com/x86/pinsrb:pinsrd:pinsrq) instructions are loading the data from memory into vector registers, then the [`vpaddd`](https://www.felixcloutier.com/x86/paddb:paddw:paddd:paddq)/[`vphaddd`](https://www.felixcloutier.com/x86/phaddw:phaddd) instructions carry out addition on big chunks of data all at the same time.

## Send us your feedback

We are very much interested in your feedback to continue to improve our tools. The comments below are open. Feedback can also be shared through [Developer Community](https://developercommunity.visualstudio.com/cpp). You can also reach us on Twitter ([@VisualC](https://twitter.com/visualc)), or via email at [visualcpp@microsoft.com](mailto:visualcpp@microsoft.com).