
# ASPLOS 22

## General

**27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems**

Proceedings: https://dl.acm.org/doi/proceedings/10.1145/3503222

---

## Table of Contents

1. [Top Picks](#top-picks)
1. [Tutorials / Workshops](#tutorials--workshops)
1. [Keynotes](#keynotes)
1. [Session 1A: Accelerators](#session-1a-accelerators)
1. [Session 1B: Address and Memory](#session-1b-address-and-memory)
1. [Session 2A: GPU and Data Analytics](#session-2a-gpu-and-data-analytics)
1. [Session 2B: Privacy and Software Security](#session-2b-privacy-and-software-security)
1. [Session 3A: Hardware Security (1)](#session-3a-hardware-security-1)
1. [Session 3B: Misc.](#session-3b-misc)
1. [Session 4A: Systems for Machine Learning](#session-4a-systems-for-machine-learning)
1. [Session 4B: Operating Systems](#session-4b-operating-systems)
1. [Session 5A: Quantum Computing](#session-5a-quantum-computing)
1. [Session 5B: Data Center and Cloud Services](#session-5b-data-center-and-cloud-services)
1. [Session 6A: Accelerating Emerging Applications](#session-6a-accelerating-emerging-applications)
1. [Session 6B: Bugs (1)](#session-6b-bugs-1)
1. [Session 7A: Serverless Computing](#session-7a-serverless-computing)
1. [Session 7B: Bugs (2)](#session-7b-bugs-2)
1. [Session 8A: Non-traditional Computing and Reconfigurable Hardware](#session-8a-non-traditional-computing-and-reconfigurable-hardware)
1. [Session 8B: Synthesis and Compilation](#session-8b-synthesis-and-compilation)
1. [Session 9A: Hardware Security (2)](#session-9a-hardware-security-2)
1. [Session 9B: Smart Networking](#session-9b-smart-networking)

---

## Top Picks

*@hero, @occamy*\
Cock et al.: **Enzian: an open, general, CPU/FPGA platform for systems software research** ([PDF](https://dl.acm.org/doi/pdf/10.1145/3503222.3507742))([notes](#enzian-an-open-general-cpufpga-platform-for-systems-software-research))

*@paulsc*\
Rao et al.: **SparseCore: stream ISA and processor specialization for sparse computation** ([PDF](https://dl.acm.org/doi/pdf/10.1145/3503222.3507705))([notes](#sparsecore-stream-isa-and-processor-specialization-for-sparse-computation))

*@nwistoff*\
Deutsch et al.: **DAGguise: Mitigating Memory Controller Side Channels** ([PDF](https://dl.acm.org/doi/pdf/10.1145/3503222.3507747))([notes](#dagguise-mitigating-memory-controller-side-channels))

*Virtual Memory / Address Translation*\
Suchy et al.: **CARAT CAKE: replacing paging via compiler/kernel cooperation** ([PDF](https://dl.acm.org/doi/pdf/10.1145/3503222.3507771))([notes](#carat-cake-replacing-paging-via-compilerkernel-cooperation))

*RedHub*\
Mahmod et al.: **Invisible bits: hiding secret messages in SRAM's analog domain** ([PDF](https://dl.acm.org/doi/pdf/10.1145/3503222.3507756))([notes](#invisible-bits-hiding-secret-messages-in-srams-analog-domain))

---

# Tutorials / Workshops

---

# Keynotes

## Fast and Furious: How the web got turbo charged just in time...

*Michael Franz (UC Irvine)*

- JavaScript: transfer wire formats vs. source code to browser
    - Benefit of wire formats: protect JavaScript source
    - Benefit of source code: more platform independent, enables JIT compilation and large performance optimisations
- JIT compilation concepts
    - Focus on loops in the program, only pre-compile these
    - Dynamic loop discovery and tracing (cf. `Dynamo`)
    - Static single assignments, `phi` functions
    - Optimise *Hot Loops*, inline calls, add *bailouts*
    - Trace Trees (cf. Gal & Franz)
    - Trace Regions (cf. Bebenita & Franz)

Takeaway: For JIT compilation, it makes sense to transfer source code as opposed to wire formats. But, today, there is an inflation in JavaScript code.

---

## Software-Defined Memory Controllers: The Time Has Come

*Kevin Loughlin (University of Michigan) et al.*

This presentation was actually part of the WACI (Wild and Crazy Ideas) session, but apart from the rhymes, it did not seem too wild and crazy (it was rather a scientific contribution transformed into a poem).

### Idea

- Programmable memory controllers, useful for performance & security applications (e.g. partitioning)

---

# Session 1A: Accelerators
[Top](#asplos-22)

# Session 1B: Address and Memory
[Top](#asplos-22)

## CARAT CAKE: replacing paging via compiler/kernel cooperation

*Brian Suchy (Northwestern University) et al.*

:+1:

### Idea

- Software-based address translation
- Work on physical addresses
- Runtime manages memory
- No more need for MMU
- Allow arbitrary granularity (no longer constrained to 4K pages)

### CARAT CAKE

1. Kernel
    - Nautilus Kernel
2. Libraries
    - New Address Space (ASpace) Abstraction
    - Linux Syscall Interface
    - Allocator and Defragmenter
    - Trusted Hooks
3. Linux Compatible Process

### Evaluation

1. Benchmark Overheads
    - Paging overhead
        - Geomean: 1.7%
        - Worst: 16% (NAS EP)
    - Competitive Overhead
2. Memory Movement Stress Test
    - "Pepper Test" as a "worst case"
    - Results: Negligible overhead
3. Engineering Effort
    - Compiler: 3.7k LoC
    - Kernel: 4.2k LoC

---

## NVAlloc: rethinking heap metadata management in persistent memory allocators

*Dang (Zhejiang University) et al.*

### Idea

- Leaks are particularly dangerous for persistent memory (PM)
- Make PM allocators aware of PM characteristics
    - Cache line reflush
        - Optimization: Interleaved Mapping. Consecutive metadata does not need to be stored consecutively
    - Small random accesses
        - Frequently happen in large allocations
        - Optimization: Log-structured bookkeeping
    - Static slab fragmentation
        - Very problematic, cannot be resolved by restarting the system for PM
        - Memory under-utilization
        - Optimization: Slab Morphing. Transform slab to defragment.

### Evaluation

- Allocation:
    - Outperforms existing allocators
    - Memory usage reduced (Fragbench)

---

## Every walk's a hit: making page walks single-access cache hits

*Park (Uppsala University) et al.*

### Idea

- Problem: Virtual Memory Translations don't scale
- Page Walk Caches can save intermediate Page Table walk steps
- Past Proposals
    - Address Translation with Prefetching (ASAP)
    - Elastic Cuckoo Page Tables (ECPT)
    - Problem: Require large contiguous memory
- Page table walks with smaller contiguous memory
    - Flatten Page tables
        - Fit in a 2MB region
        - Reduce serial memory accesses from 4 to 2 (1.5 to 1 with Page walk cache)
        - Trade off more memory for higher performance
        - Different flattening schemes and combinations are possible
        - Small changes in HW and SW
    - Prioritize Keeping
        - PTE cache misses are expensive (DRAM latency)
        - Prioritize Page tables in L3 cache by modifying replacement policy

### Evaluation

#### Performance

- Geomean: 9% performance increase
- Most of the improvement comes from prioritization

#### Energy

- L3 hits due to prioritization reduce DRAM accesses and thus energy consumption

---

# Session 2A: GPU and Data Analytics
[Top](#asplos-22)

## GPM: leveraging persistent memory from a GPU

*Pandey (Indian Institute of Science Bangalore) et al.*

### Idea

- Currently, PMs only benefit CPUs
- GPUs can accelerate PM applications
    - Key Value Stores
- GPUs do not have direct PM access, accesses need to go via CPU -> lots of data movement
- In-kernel persistence: GPU can update only values in PM that are actually changed
- GPM: GPU with Persistent Memory
    - GPM Hardware
        - Map NVM into GPU's virtual address space using UVA
        - Use fences to flush data into NVM
    - Use Case: GPMBench
        - Transactional (gpKVS, GPU-accelerated DB)
        - Long-running iterative (DNN, CFD, BLK, HS)
        - Native (BFS, SRAD, PS)
    - Runtime Library (LibGPM)
        - CUDA library for GPM

### Evaluation

- Transactional, BFS: speedup through reduced data movement

### Discussion

- An application that does not scale well: Binomial options -> not enough parallelism in PM

---

## GPUReplay: a 50-KB GPU stack for client ML

*Park (Purdue University) et al.*

### Idea

- GPUs in mobile sector used for general applications
- Large GPU SW stack
    - Security vulnerabilities
    - Fragmented deployment (different OSes, drivers etc.)
    - Slow startup
- Reason: GPU stack engineered for graphics
    - JIT compilation
    - Dynamic resource management
    - Fine-grained multiplexing
- GPU stack is overkill for ML applications
    - No need for the above features
- Approach: Record CPU/GPU interactions during development, replay on end-device
    - Register I/Os
    - Interrupts
    - GPU page table
    - *no* inputs / outputs
- Need to handle non-determinism

### Evaluation

- Mali Bifrost
- Reduced complexity (~2 orders of magnitude)
- Significantly lowered startup delays
- Moderately lowered inference delays
- Increased training performance (avoid DeepCL and OpenCL overhead)

---

## ValueExpert: exploring value patterns in GPU-accelerated applications

*Zhou (Rice University) et al.*

### Idea

- GPUs used for a wide range of applications
- Existing profiler tools: hotspot analysis (# API calls) -> Do not evaluate root cause
- Detect redundant or useless computation and unnecessary data movement
- Value Pattern Categorization
    - Coarse-grained patterns at each GPU operation (Redundant/Duplicate Values)
        - Record GPU memory changes before/after GPU kernel
    - Fine-grained patterns within each kernel (frequent values)
        - Transform unnecessarily large data types
- Value Flow Graph
    - Capture read/write dependencies
    - Annotate profiling information to graph
- ValueExpert Workflow
    - Online Analyser communicates with GPU, Offline Analyser collects data for GUI

*Code available on GitHub*

---

## SparseCore: stream ISA and processor specialization for sparse computation

*Rao (University of Southern California) et al.*

### Idea

- Graph pattern matching inefficient due to index matching
- Reduce data movement
- ISA extension
    - Stream registers
    - Add several instructions to enable operations on stream data
    - Code examples for triangle computation, multiply

### Architecture

- Data accessed from scratchpad and stream cache
- Stream information embedded in stream mapping table and stream registers
- Stream value generator -> stream value buffer -> stream value processing units

### Evaluation

- Platform?? Baseline?

---

## JSONSki: streaming semi-structured Data with Bit-Parallel Fast-Forwarding

*Jiang (UCR) et al.*

### Idea

- Extend streaming evaluation for JSON processing:
    - fast-forward certain cases
    - e.g. when searching for a field, fast-forward over irrelevant field definitions
- Needs to recognize nested structures
    - bit-parallel processing
    - Partition string into structural intervals based on meta characters

### Evaluation

- Evaluate on practical applications
- Significant performance gain over other JSON processors
- Memory consumption comparable to JPStream (and better than preprocessing-based processors)

# Session 2B: Privacy and Software Security
[Top](#asplos-22)

# Session 3A: Hardware Security (1)
[Top](#asplos-22)

## Pinned Loads: Taming Speculative Loads in Secure Processors

### Idea

- *Pinned load* loads and guarantees no squashing or eviction
    - only after it do we know there are no exceptions or mispredictions
- Defer invalidations to pinned loads
    - When core receives invalidation and line is pinned, it returns a `pinned` message which is acked by an `abort` message
- Prevent Store Starvation
    - Core does not pin to-be-invalidated cache lines
- Prevent evictions of pinned lines
- Prevent deadlocks
    - Core cannot pin more lines than a set can hold

### Design

- Late Pinning: Receive data, then pin
- Early Pinning: Cache Shadow Table tracks # pinned in-flight loads
    - Allows issuing pinned loads in parallel, but added HW costs

### Evaluation

- Single-threaded (SPEC17) and parallel (SPLASH2 + PARSEC)

---

## DAGguise: Mitigating Memory Controller Side Channels

*Deutsch (MIT) et al.*

### Existing Approaches

- Partitioning (e.g. round-robin) can lead to poor performance
- Traffic shaping (e.g. Camouflage paper)
    - Add traffic
    - Insecure, expensive profiling

### Idea

- Memory Shaper: Shape accesses to secret-independent DAG to make victim-access patterns indistinguishable while allowing for dynamic sharing of the memory controller by delaying or adding requests
- Formally verified and good performance
- Offline profiling

### Evaluation

- gem5 and DRAMSim2
- 12% perf improvement vs static partitioning

### Questions

- Performance comparison to static partitioning -> shown in the paper.
- State in memory controller? -> Memory requests by DAG close rows.

# Session 3B: Misc.
[Top](#asplos-22)

# Session 4A: Systems for Machine Learning
[Top](#asplos-22)

# Session 4B: Operating Systems
[Top](#asplos-22)

## Clio: a hardware-software co-designed disaggregated memory system

*Guo et al.*

### Idea

- Hardware memory disaggregation for improved performance, scalability, flexibility, cost
- RDMA -> low performance due to caching limitations
- Clio:
    - virtual memory interface
    - memory nodes (CLIO node)
- Eliminate state from memory node hardware
    - Move state to compute node
- Reduce state in system protocol
    - Asymmetric memory request protocol:
        - Network ordering: reorder requests for reuse
        - Congestion and Flow Control: Retry in case of rare packet losses
- Remove state from critical path
    - Splitting fast path and slow path: slow for metadata, fast for data
- Optimize state to bundled design
    - Hash-based page table
    - Support distributed system

### Evaluation

- Prototype on FPGA
- bounded access latency
- scales well with memory size

---

## Enzian: an open, general, CPU/FPGA platform for systems software research

*Cock (ETHZ) et al.*

- Generic HW platform for novel applications
- Remove restrictions of existing platforms
- Existing cache coherent protocols
    - FPGAs
    - PCIe
    - CCIX
    - ECI
- Unit costs less important than design costs
- BW > Capacity
- ECI
    - Open Nathan-compatible coherence protocol
    - Lower latency than PCIe
- 2 Socket asymmetric NUMA system

---

## Efficient and scalable core multiplexing with M3V

*Asmussen (Barkhausen) et al.*

- Secure and fast inter-core/tile communication
- Better integration of HW accelerators
- Communication between applications on a fast-path, bypassing the OS
- Enable strong isolation between tiles by virtualizing the DTU (?)
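
The last point is marked uncertain in these notes, so the following is only a toy model of what "virtualizing" a per-tile communication unit could mean, not M3V's actual design. It assumes (hypothetically) that every endpoint is tagged with an owning activity and that only the activity currently scheduled on the tile may fetch from its own endpoints. All names (`ToyVirtualDTU`, `Endpoint`, `configure`, `switch_to`, `deliver`, `fetch`) are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    owner: int                                 # activity that owns this endpoint (assumed tag)
    inbox: list = field(default_factory=list)  # messages waiting to be fetched


class ToyVirtualDTU:
    """Hypothetical model of one tile's communication unit shared by several activities."""

    def __init__(self):
        self.endpoints = {}   # endpoint id -> Endpoint
        self.current = None   # activity currently scheduled on the tile

    def configure(self, ep_id, owner):
        self.endpoints[ep_id] = Endpoint(owner)

    def switch_to(self, activity):
        self.current = activity

    def deliver(self, ep_id, msg):
        # Incoming messages are buffered even if the owner is not running.
        self.endpoints[ep_id].inbox.append(msg)

    def fetch(self, ep_id):
        ep = self.endpoints[ep_id]
        if ep.owner != self.current:
            # Isolation: the running activity cannot read another activity's endpoint.
            raise PermissionError(f"endpoint {ep_id} belongs to activity {ep.owner}")
        return ep.inbox.pop(0) if ep.inbox else None


dtu = ToyVirtualDTU()
dtu.configure(0, owner=1)
dtu.configure(1, owner=2)
dtu.deliver(1, "for activity 2")   # arrives while activity 2 is not scheduled
dtu.switch_to(1)
dtu.deliver(0, "for activity 1")
print(dtu.fetch(0))
try:
    dtu.fetch(1)
except PermissionError as err:
    print("denied:", err)
```

In this sketch, the fast path (`deliver`/`fetch`) never involves an OS object; only `configure` and `switch_to` would be privileged operations.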
### Eval

- gem5
- FPGA: Use Rocket/BOOM cores as RISC-V tiles

---

## FlexOS: towards flexible OS isolation

*Lefeuvre (Uni Manchester) et al.*

### Idea

- Specify OS security/performance trade-off during runtime
- Deployment for heterogeneous hardware, quickly isolate vulnerable libraries, incremental verification of code bases

### Approach

1. Focus on single-purpose appliances (e.g. cloud microservices)
2. Full-system understanding of compartmentalization
3. Abstract away technical details of isolation mechanisms
4. Maintain performance

### Design

- Based on Unikraft

### Eval

- Hardening on/off, compartment selection for different libraries. Result: even distribution of performance overheads with different security guarantees

---

## Adelie: continuous address space layout re-randomization for Linux Drivers

*Nikolaev (PennState) et al.*

### Motivation

Device drivers are a target for kernel attacks

### Attacks

- Control-flow attacks
- Return-Oriented Programming
- Mitigated by ASLR and KASLR
    - These are limited, attacks still possible

### Idea

- Transform all modules to KASLR, extend KASLR
- PIE patch for Linux kernel, not usable for kernel modules
- Use a more general PIC model for modules
- Extend KASLR to 64 bits and to support multiple mappings to code during ongoing re-randomization
- Differentiate between movable and immovable module parts
    - Wrap each externally accessible, movable function in an immovable one

---

# Session 5A: Quantum Computing
[Top](#asplos-22)

# Session 5B: Data Center and Cloud Services
[Top](#asplos-22)

## Astraea: Towards QoS-Aware and Resource-Efficient Multi-stage GPU Services

*Zhang (Shanghai Jiao Tong) et al.*

### Idea

- GPU Microservices instead of Monoliths
- Latency of GPU microservices due to communication overhead leads to QoS violations
- Naive approach: offline profiling (limited)
- Goal: max throughput, ensure QoS
- Using ML, predict
    - memory usage
    - FLOPS -> min. number of GPUs
- Reduce intra-GPU communication overhead by global memory-based allocation

### Eval

- Based on Intel Xeon CPUs and Ubuntu
- Average speedup ~37%

Future: extend to both GPU and CPU benchmarks

---

## Memory-harvesting VMs in cloud platforms

*Fuerst (Microsoft Research) et al.*

### Idea

- Server memory is expensive but largely unused
- Make unused resources available as services
- Build on existing CPU Harvesting VMs, virtualization techniques
- Challenges: VM creation time, NUMA spanning, fragmentation
- No impact on regular guests
- Applications: Data analytics (MH-Hadoop), function as a service/serverless (MH-FaaS)

---

## IOCost: block IO control for containers in datacenters :+1:

*Heo (Meta) et al.*

### Idea

- Partition Block I/O in containers
- Challenges: memory heterogeneity, workload heterogeneity, datacenter requirements (ease of use, work conservation, memory-awareness)
- offline profiling of device occupancy by block I/O
- unused I/O budget can be donated

---

## TMO: Transparent Memory Offloading in Datacenters

*Weiner (Meta) et al.*

### Idea

- DRAM costs very high -> Compressed DRAM
- Most applications' memory is cold
- Challenge: When are memory accesses critical, i.e. which data should be kept in DRAM?
- TMO: Monitor pressure stall information (kernel) to track impact of lack of resources -> iteratively reduce memory until performance suffers

---

# Session 6A: Accelerating Emerging Applications
[Top](#asplos-22)

# Session 6B: Bugs (1)
[Top](#asplos-22)

## RSSD: defend against ransomware with hardware-isolated network-storage codesign and post-attack analysis :+1:

*Reidys (Uni Illinois) et al.*

### Idea

- Ransom: Encrypt data and request money to decrypt
- Mitigations: Malware detection, data backup
- Exploit out-of-place SSD writes
    - GC attack
        - dump data to trigger GC
    - Timing attack
        - hide behind user software to trigger GC
    - Trim attack
        - use trim command
- Ransomware-aware SSD (RSSD)
    - Offload retained data to remote server
    - Limited storage capacity -> compress data and transfer to remote server
    - Dedicated HW for transfers (DMA) to protect against host
    - Use idle cycles to transfer w/o interference

### Eval

- Implement enhanced SSD on FPGA

---

## Creating Concise and Efficient Dynamic Analyses with ALDA

*Cheng (Georgia Tech) et al.*

---

# Session 7A: Serverless Computing
[Top](#asplos-22)

# Session 7B: Bugs (2)
[Top](#asplos-22)

# Session 8A: Non-traditional Computing and Reconfigurable Hardware
[Top](#asplos-22)

# Session 8B: Synthesis and Compilation
[Top](#asplos-22)

## CirFix: automatically repairing defects in hardware design code :+1:

*Ahmad (Uni Michigan) et al.*

### Problems

- Software-based APR is not amenable to traditional HW testbenches
- Fault localization approaches from software-based APR do not scale to HW

### Idea

- Identify repairs by utilising fitness function summing bit-errors
- Fault localization: Compare wire values and oracle simulation to get mismatches
- Ignore implicated mismatches (propagation)

### Evaluation

- Evaluate on defective designs
- Seed defects, easy and hard ones
- What fraction of defects can CirFix detect/repair?
    - found 21/32, repaired 16/32
    - similar performance for easy and hard defects

---

## Vector instruction selection for digital signal processors using program synthesis

*Ahmad (Adobe) et al.*

- Pattern-matching is brittle, super-optimisation is slow
- Optimization scope of sliding windows may be too small
- Rake: instruction-selection algorithm
- Unifying function: Uber-instruction
- Higher-level IR

---

## HeteroGen: transpiling C to heterogeneous HLS code with automated test generation and program repair

*Zhang (UCLA) et al.*

- Problem: differences between HLS-C and standard C/C++, manual transformation required
    - e.g. custom floating point format, remove dynamic data structures
- Key idea: Use automated program repair (Oracle) to fix errors when porting
- Two optimisations: error dependencies to reduce search space, linting

---

## Tree Traversal Synthesis using Domain-Specific Symbolic Compilation

*Chen (UC Santa Barbara) et al.*

# Session 9A: Hardware Security (2)
[Top](#asplos-22)

## SRAM has no chill: exploiting power domain separation to steal on-chip secrets

*Mahmod (Virginia Tech) et al.*

- Volt boot attack:
    - read SRAM contents
    - no off-time limit, no temperature dependency
    - e.g. attack caches, CPU registers, iRAMs

---

## Randomized Row-Swap: Mitigating Row Hammer By Breaking Spatial Correlation Between Aggressor and Victim Rows

### Problem

- Increasing density -> increasing inter-cell interference

### Previous Work

- Targeted Row Refresh in DDR4
    - Track aggressor rows
    - mitigative action by refreshing victim rows
    - broken

### Idea

- Use `Graphene` to track aggressor rows (cf. MICRO 20)
- Randomised row swap: swap aggressor rows before bit-flips occur
- Implement using *hot row tracking*

---

## ShEF: shielded enclaves for cloud FPGAs

*Zhao (Stanford) et al.*

### Motivation

- Secure acceleration for the cloud
- TEEs don't support accelerators
- Cloud accelerators: no security
- On-prem accelerators: no cloud

### Idea

- End-to-end framework for cloud FPGAs
- Threat model: untrusted FPGA/cloud provider
- Current FPGA root of trust only available for owners of the FPGA (e.g. burning key)
- Chip manufacturer embeds device-specific key in FPGA
- Firmware implementing protocols
- Security kernel for remote attestation and secure boot
- RTL shield modules
    - enable customized security

### Eval

- Local UltraScale+
- Cloud UltraScale+ (AWS F1)

---

## Invisible bits: hiding secret messages in SRAM's analog domain :+1:

*Mahmod (Virginia Tech) et al.*

Steganography: Information hiding technique

### Idea

- Hide information in plain sight to allow plausible deniability
    - plausibly deniable covert channel
- Influence power-on state by exploiting ageing burns
    - Accelerate aging by increasing voltage/temperature (*baking*)
- Read power-on state
    - Improve accuracy by using ECC
- Encrypt message to remove correlation -> plausible deniability

### Eval

# Session 9B: Smart Networking
[Top](#asplos-22)
