# TSanv3 Shadow
This document describes the "shadow memory" that TSan uses for race detection. It outlines the underlying mechanism, compares it to FastTrack, and highlights the overheads incurred by this component.
## Structure
The [`Shadow` class](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_shadow.h#L57) stores information about a memory access for race detection. It consists of the following fields:
- `access` - A bitmask containing the types of access based on [the AccessType enum](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_defs.h#L170-L180).
- `sid` - The [slot](/SE-3gyaBSked_W2FvR8_eQ) ID (similar to thread ID) of the accessing thread.
- `epoch` - The local epoch of the accessing thread.
- `is_read` - Whether the corresponding access is a read or write operation.
- `is_atomic` - Whether the corresponding access is an atomic operation.
## Usage
### Read/Write Epochs
`Shadow` is an extension of the write epoch proposed in [FastTrack](https://dl.acm.org/doi/10.1145/1543135.1542490), in which only a thread's ID and local epoch need to be recorded for a write operation. Unlike FastTrack, TSan also uses `Shadow` to store read epochs. There is no adaptive vector clock for reads, i.e. only [a maximum of `kShadowCnt`](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_rtl_access.cpp#L198) ([which is 4](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_defs.h#L77)) read epochs are stored per memory word (more elaboration on this below).
### Shadow Cells
Happens-before (HB) race detection algorithms store timestamps (either an epoch or a vector clock) for each variable. Since data race detection targets conflicting accesses to the same memory location, timestamps must be maintained for each accessed memory location. A general-purpose data structure mapping memory locations to their timestamps would be too costly.
Instead, [TSan](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_platform_posix.cpp#L45-L52) [allocates](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/sanitizer_common/sanitizer_posix_libcdep.cpp#L323-L335) a [large region of memory](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_platform.h#L47-L89) so that each word in application memory has a corresponding *cell* in shadow memory to store its timestamps in `Shadow`s. [`MemToShadow`](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_platform.h#L925-L939) maps an application memory address to its *shadow cell* address. Each shadow cell [contains 4](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_defs.h#L77) shadow values.
TSan refers to application memory as "user bytes". Every 8 user bytes share a shadow cell. Maintaining a shadow cell for each individual user byte would be too costly, whereas word granularity (8 user bytes per shadow cell) is reasonable since most modern programs run on 64-bit systems.
For example, a program that accesses 2 different variables at 0xd00d4001 and 0xd00d4004 will fill the same corresponding shadow cell with 2 shadow values.
Shadow values are stored with [relaxed memory order](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_shadow.h#L188) without CAS. One may worry about completeness when two threads store to the same shadow location concurrently, since one of the stores is lost. However, one can strongly argue that it is not worth incurring the overhead of CAS just to accommodate this presumably rare occurrence.
### Intersection
For sound data race detection, it is important to distinguish the different variables that use the same cell. The address and size of each access is [recorded](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_shadow.h#L69) and [retrieved](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_shadow.h#L84-L98) with a bitmask.
Since each shadow cell covers at most 8 user bytes (one word), the set of accessed bytes can be represented with 8 bits, derived from the offset within the word and the access size.
For example, a program that accesses 2 bytes at 0xd00d4001 (resp. 4 bytes at 0xd00d4004) will record a bitmask of 00000110 (resp. 11110000).
To check if 2 accesses overlap on the same memory locations, it suffices to check [whether two](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_rtl_access.cpp#L206-L207) [bitmasks intersect](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/tsan/rtl/tsan_rtl_access.cpp#L312-L314) at any bit positions. In the example above, the accesses do not overlap.
Another example: if a program accesses 4 bytes at 0xcafe8882 and 2 bytes at 0xcafe8885, the access bitmasks are 00111100 and 01100000, which intersect at the bit corresponding to byte offset 5 within the word. Hence, the accesses overlap and it is necessary to check for a data race.
:::info
It is interesting to consider that if a user word is occupied by 8 different application variables (each taking up 1 byte), they would all contend for the same 4 shadow values in the cell. This leads to a severe decrease in completeness. In practice, this is perhaps not a serious concern, i.e. such a memory layout is unlikely.
:::
### Additional Info
The subsections above covered epochs and the access offset/size. Besides these, a shadow value contains several other pieces of information, namely the access type flags, whether the access was a read or a write, and whether the access was atomic.
More details on their purposes can be found in [Race Detection](/GTABsgldQyWrFy8XdqJOvA).
### Summary
This section covered the various purposes and design concerns of the shadow memory.
- User bytes are mapped directly to their shadow values.
- To reduce the memory requirements of shadow memory, each user word (instead of each byte) is mapped to a shadow cell.
## Overhead
The causes of overhead incurred by the usage of shadow memory present a wide area of research opportunities. It is worth considering whether the usage of shadow memory is necessary or optimal, in terms of performance and memory usage.
### Memory Usage
Although TSan maps a 32TB region for shadow memory, the mapping is [created with](https://github.com/llvm/llvm-project/blob/37d7b06da03a46e7bbd700e3d247fdb70e97f933/compiler-rt/lib/tsan/rtl/tsan_platform_posix.cpp#L45-L48) [the](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/sanitizer_common/sanitizer_posix_libcdep.cpp#L330) [`MAP_NORESERVE` flag](https://github.com/llvm/llvm-project/blob/987087df90026605fc8d03ebda5a1cd31b71e609/compiler-rt/lib/sanitizer_common/sanitizer_posix_libcdep.cpp#L319-L321), so no physical memory is committed until a page is first accessed by TSan.
Actual memory usage thus depends on the memory layout of the application: a sparse layout touches more shadow pages and uses more memory, while a dense layout uses less.
:::success
Memory usage is likely quite optimal: each shadow value is 32 bits (4 bytes), so a cell takes 4*4=16 shadow bytes per user word, i.e. 16/8=2 shadow bytes per user byte.
:::
### Duration
There is a significant performance overhead from the loading of shadow values. Because shadow memory is global shared memory, multiple threads reading from and writing to the same shadow locations result in a high number of cache misses.
The effect is more pronounced when the same memory region is accessed frequently by multiple threads in the original program.
Shadow values are loaded/stored with relaxed memory order (not CAS). **There is no slowdown due to locking or any form of synchronization.**
### Reproducible Experiments
The following are reproducible experiments that demonstrate the overhead incurred by loading from shadow memory.
:::spoiler Diff
```diff
diff --git a/compiler-rt/lib/tsan/rtl/tsan_defs.h b/compiler-rt/lib/tsan/rtl/tsan_defs.h
index 1ffa3d6aec40..941a598e13b6 100644
--- a/compiler-rt/lib/tsan/rtl/tsan_defs.h
+++ b/compiler-rt/lib/tsan/rtl/tsan_defs.h
@@ -18,6 +18,8 @@
#include "sanitizer_common/sanitizer_mutex.h"
#include "ubsan/ubsan_platform.h"
+#define TSAN_NO_LOAD_SHADOW 0
+
#ifndef TSAN_VECTORIZE
# define TSAN_VECTORIZE __SSE4_2__
#endif
diff --git a/compiler-rt/lib/tsan/rtl/tsan_rtl_access.cpp b/compiler-rt/lib/tsan/rtl/tsan_rtl_access.cpp
index 8b20984a0100..309f855e109d 100644
--- a/compiler-rt/lib/tsan/rtl/tsan_rtl_access.cpp
+++ b/compiler-rt/lib/tsan/rtl/tsan_rtl_access.cpp
@@ -238,6 +238,7 @@ bool CheckRaces(ThreadState* thr, RawShadow* shadow_mem, Shadow cur,
ALWAYS_INLINE
bool ContainsSameAccess(RawShadow* unused0, Shadow unused1, m128 shadow,
m128 access, AccessType typ) {
+ return false;
// Note: we could check if there is a larger access of the same type,
// e.g. we just allocated/memset-ed a block (so it contains 8 byte writes)
// and now do smaller reads/writes, these can also be considered as "same
@@ -319,7 +320,7 @@ bool CheckRaces(ThreadState* thr, RawShadow* shadow_mem, Shadow cur,
const m128 no_race =
_mm_or_si128(_mm_or_si128(not_intersect, same_sid), both_read_or_atomic);
const int race_mask = _mm_movemask_epi8(_mm_cmpeq_epi32(no_race, zero));
- if (UNLIKELY(race_mask))
+ // if (UNLIKELY(race_mask))
goto SHARED;
STORE : {
@@ -374,17 +375,23 @@ SHARED:
const m128 shadow_epochs = _mm_and_si128(shadow, mask_epoch);
const m128 concurrent = _mm_cmplt_epi32(thread_epochs, shadow_epochs);
const int concurrent_mask = _mm_movemask_epi8(concurrent);
- if (LIKELY(concurrent_mask == 0))
+ // if (LIKELY(concurrent_mask == 0))
goto STORE;
- DoReportRaceV(thr, shadow_mem, cur, concurrent_mask, shadow, typ);
+ // DoReportRaceV(thr, shadow_mem, cur, concurrent_mask, shadow, typ);
return true;
}
+#if TSAN_NO_LOAD_SHADOW
+# define LOAD_CURRENT_SHADOW(cur, shadow_mem) \
+ const m128 access = _mm_set1_epi32(static_cast<u32>((cur).raw())); \
+ const m128 shadow = _mm_set1_epi32(static_cast<u32>(Shadow::kEmpty))
+#else
# define LOAD_CURRENT_SHADOW(cur, shadow_mem) \
const m128 access = _mm_set1_epi32(static_cast<u32>((cur).raw())); \
const m128 shadow = _mm_load_si128(reinterpret_cast<m128*>(shadow_mem))
#endif
+#endif
char* DumpShadow(char* buf, RawShadow raw) {
if (raw == Shadow::kEmpty) {
```
:::
In the diff above, the `LOAD_CURRENT_SHADOW` macro is modified to not load from shadow memory, but to instead initialize the `shadow` variable with the constant `Shadow::kEmpty`. This patch is carefully designed to make the *"no load shadow"* and *"yes load shadow"* versions comparable.
We want the "no" and "yes" versions to run almost the same code (take the same code path), while letting the slowdown of loading shadow values manifest only in "yes". This justifies the following changes:
1. `ContainsSameAccess` should always return `false` early, so that it does no work and also does not cause the rest of the race detection routine to be skipped.
2. `CheckRaces` should run as is, so that the `StoreShadow` within will be executed, to induce cache misses for any subsequent loads from the same shadow location.
3. `DoReportRaceV` should not be executed so that the "yes" version does not perform additional work when it detects a race.
4. Some if-statements were commented out so that both versions take the same code path.
For "no load shadow" (resp. "yes load shadow"), set `TSAN_NO_LOAD_SHADOW` in tsan_defs.h to 1 (resp. 0).
#### c_fft from OmpSCR
Run the setup script in [this repo](https://github.com/focs-lab/tsan-benchmarks/tree/main) to build the OmpSCR benchmarks. Then, set the `LD_LIBRARY_PATH` (modify accordingly) and `TSAN_OPTIONS` environment variables as follows:
```
export LD_LIBRARY_PATH=~/llvm-project/build/lib/clang/19/lib/x86_64-unknown-linux-gnu:~/llvm-project/build/runtimes/runtimes-bins/openmp/tools/archer:~/llvm-project/build/runtimes/runtimes-bins/openmp/runtime/src
export TSAN_OPTIONS="ignore_noninstrumented_modules=1 report_bugs=0"
```
Run the benchmark with perf as follows:
```
OMP_NUM_THREADS=32 perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses bin/OmpSCR/c_fft.par 2048
```
With loading shadow:
```
Performance counter stats for 'bin/OmpSCR/c_fft.par 2048':
23,102,013,931 L1-dcache-loads:u
125,666,514 L1-dcache-load-misses:u # 0.54% of all L1-dcache accesses
39,495,799 LLC-loads:u
3,818,574 LLC-load-misses:u # 9.67% of all LL-cache accesses
8.430502780 seconds time elapsed
```
Without loading shadow:
```
Performance counter stats for 'bin/OmpSCR/c_fft.par 2048':
21,417,564,397 L1-dcache-loads:u
53,718,181 L1-dcache-load-misses:u # 0.25% of all L1-dcache accesses
8,333,281 LLC-loads:u
1,791,274 LLC-load-misses:u # 21.50% of all LL-cache accesses
6.889046348 seconds time elapsed
```
Above, we can observe a **~22% slowdown** (8.43 vs. 6.89 seconds elapsed) due to loading from shadow memory, with **~4.7x as many LLC loads** when loading from shadow memory.
#### v8
[Fetch](https://v8.dev/docs/source-code) and [build](https://v8.dev/docs/build-gn) v8. We will reproduce [this bug](https://logs.chromium.org/logs/v8/buildbucket/cr-buildbucket/8786697506901472385/+/u/Num_Fuzz_-_combined__flakes_/regress-crbug-1420860).
In the v8 directory, checkout the buggy version.
```
git checkout 80fc53dd3ac
gclient sync
```
Run `gn args out/build` and fill it with the following options:
```
dcheck_always_on = false
is_component_build = false
is_debug = false
is_tsan = true
target_cpu = "x64"
```
Overwrite the build system's TSan libraries with ours (modify the llvm-project path accordingly) and build d8. Always repeat this step after rebuilding llvm-project.
```
rm out/build/d8
cp ~/llvm-project/build/lib/clang/19/lib/x86_64-unknown-linux-gnu/libclang_rt.tsan* third_party/llvm-build/Release+Asserts/lib/clang/17/lib/x86_64-unknown-linux-gnu/
ninja -C out/build d8
```
Run perf on top of d8 with the bug-reproducing flags.
```
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses out/build/d8 --test test/mjsunit/mjsunit.js test/mjsunit/mjsunit_numfuzz.js test/mjsunit/regress/regress-crbug-1420860.js --random-seed=-452346216 --nohard-abort --exit-on-contradictory-flags --testing-d8-test-runner --no-fail --always-turbofan --assert-types --minor-mc --no-enable-bmi1 --no-lazy-feedback-allocation --no-regexp-tier-up --stress-flush-code --stack-size=434 --random-gc-interval=859 --fuzzer-random-seed=818416949
```
With loading shadow:
```
Performance counter stats for 'out/build/d8 --test test/mjsunit/mjsunit.js test/mjsunit/mjsunit_numfuzz.js test/mjsunit/regress/regress-crbug-1420860.js --random-seed=-452346216 --nohard-abort --exit-on-contradictory-flags --testing-d8-test-runner --no-fail --always-turbofan --assert-types --minor-mc --no-enable-bmi1 --no-lazy-feedback-allocation --no-regexp-tier-up --stress-flush-code --stack-size=434 --random-gc-interval=859 --fuzzer-random-seed=818416949':
526,296,346 L1-dcache-loads:u
3,995,683 L1-dcache-load-misses:u # 0.76% of all L1-dcache accesses
1,356,441 LLC-loads:u
21,977 LLC-load-misses:u # 1.62% of all LL-cache accesses
0.524310696 seconds time elapsed
```
Without loading shadow:
```
Performance counter stats for 'out/build/d8 --test test/mjsunit/mjsunit.js test/mjsunit/mjsunit_numfuzz.js test/mjsunit/regress/regress-crbug-1420860.js --random-seed=-452346216 --nohard-abort --exit-on-contradictory-flags --testing-d8-test-runner --no-fail --always-turbofan --assert-types --minor-mc --no-enable-bmi1 --no-lazy-feedback-allocation --no-regexp-tier-up --stress-flush-code --stack-size=434 --random-gc-interval=859 --fuzzer-random-seed=818416949':
478,598,013 L1-dcache-loads:u
2,034,780 L1-dcache-load-misses:u # 0.43% of all L1-dcache accesses
509,010 LLC-loads:u
19,536 LLC-load-misses:u # 3.84% of all LL-cache accesses
0.443995114 seconds time elapsed
```
Above, we can observe an **18% slowdown** due to loading from shadow memory, with **2.67x as many LLC loads** when loading from shadow memory.
:::info
TODO: Add benchbase too.
:::
#### Summary
Overall, there is a slowdown due to cache misses induced by loading from shadow memory, since it is memory shared among all threads. The extent of the slowdown likely depends on the nature of the application under test.
This slowdown could possibly be reduced when using a race detection algorithm that accesses shadow memory (or any shared memory) less frequently.
:::warning
It is interesting that by merely loading from shared memory, the program can slow down by ~20%.
:::