# ASPLOS 22
## General
**27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems**
Proceedings: https://dl.acm.org/doi/proceedings/10.1145/3503222
---
## Table of Contents
1. [Top Picks](#top-picks)
1. [Tutorials / Workshops](#tutorials--workshops)
1. [Keynotes](#keynotes)
1. [Session 1A: Accelerators](#session-1a-accelerators)
1. [Session 1B: Address and Memory](#session-1b-address-and-memory)
1. [Session 2A: GPU and Data Analytics](#session-2a-gpu-and-data-analytics)
1. [Session 2B: Privacy and Software Security](#session-2b-privacy-and-software-security)
1. [Session 3A: Hardware Security (1)](#session-3a-hardware-security-1)
1. [Session 3B: Misc.](#session-3b-misc)
1. [Session 4A: Systems for Machine Learning](#session-4a-systems-for-machine-learning)
1. [Session 4B: Operating Systems](#session-4b-operating-systems)
1. [Session 5A: Quantum Computing](#session-5a-quantum-computing)
1. [Session 5B: Data Center and Cloud Services](#session-5b-data-center-and-cloud-services)
1. [Session 6A: Accelerating Emerging Applications](#session-6a-accelerating-emerging-applications)
1. [Session 6B: Bugs (1)](#session-6b-bugs-1)
1. [Session 7A: Serverless Computing](#session-7a-serverless-computing)
1. [Session 7B: Bugs (2)](#session-7b-bugs-2)
1. [Session 8A: Non-traditional Computing and Reconfigurable Hardware](#session-8a-non-traditional-computing-and-reconfigurable-hardware)
1. [Session 8B: Synthesis and Compilation](#session-8b-synthesis-and-compilation)
1. [Session 9A: Hardware Security (2)](#session-9a-hardware-security-2)
1. [Session 9B: Smart Networking](#session-9b-smart-networking)
---
## Top Picks
*@hero, @occamy*\
Cock et al.: **Enzian: an open, general, CPU/FPGA platform for systems software research** ([PDF](https://dl.acm.org/doi/pdf/10.1145/3503222.3507742))([notes](#enzian-an-open-general-cpufpga-platform-for-systems-software-research))
*@paulsc*\
Rao et al.: **SparseCore: stream ISA and processor specialization for sparse computation** ([PDF](https://dl.acm.org/doi/pdf/10.1145/3503222.3507705))([notes](#sparsecore-stream-isa-and-processor-specialization-for-sparse-computation))
*@nwistoff*\
Deutsch et al.: **DAGguise: Mitigating Memory Controller Side Channels** ([PDF](https://dl.acm.org/doi/pdf/10.1145/3503222.3507747))([notes](#dagguise-mitigating-memory-controller-side-channels))
*Virtual Memory / Address Translation*\
Suchy et al.: **CARAT CAKE: replacing paging via compiler/kernel cooperation** ([PDF](https://dl.acm.org/doi/pdf/10.1145/3503222.3507771))([notes](#carat-cake-replacing-paging-via-compilerkernel-cooperation))
*RedHub*\
Mahmod et al.: **Invisible bits: hiding secret messages in SRAM's analog domain** ([PDF](https://dl.acm.org/doi/pdf/10.1145/3503222.3507756))([notes](#invisible-bits-hiding-secret-messages-in-srams-analog-domain))
---
# Tutorials / Workshops
---
# Keynotes
## Fast and Furious: How the web got turbo charged just in time...
*Michael Franz (UC Irvine)*
- JavaScript: transfer wire formats vs. source code to browser
- Benefit of wire formats: protect JavaScript source
- Benefit of source code: More platform independent, enables JIT compilation and large performance optimisations
- JIT compilation concepts
- Focus on loops in the program, only pre-compile these
- Dynamic loop discovery and tracing (cf. `Dynamo`)
- Static single assignments, `phi` functions
- Optimise *Hot Loops*, inline calls, add *bailouts*
- Trace Trees (cf. Gal & Franz)
- Trace Regions (cf. Bebenita & Franz)
Takeaway: For JIT compilation, it makes sense to transfer source code as opposed to wire formats. But, today, there is an inflation in JavaScript code.
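
The hot-loop and bailout concepts above can be sketched roughly as follows. This is a toy model, not how any real JavaScript engine is built: `HOT_THRESHOLD`, `LoopProfiler` and the two `*_double` functions are hypothetical names chosen for illustration.

```python
HOT_THRESHOLD = 3  # hypothetical: iterations before a loop counts as "hot"


class LoopProfiler:
    """Counts loop iterations and switches hot loops to a specialised path."""

    def __init__(self):
        self.counts = {}      # loop id -> iterations seen on the slow path
        self.compiled = {}    # loop id -> specialised ("compiled") function

    def run_iteration(self, loop_id, generic, specialised, value):
        fast = self.compiled.get(loop_id)
        if fast is not None:
            try:
                return "fast", fast(value)
            except TypeError:
                # Bailout: the speculation (int operand) failed, so drop
                # the specialised code and fall back to the generic path.
                del self.compiled[loop_id]
                self.counts[loop_id] = 0
        self.counts[loop_id] = self.counts.get(loop_id, 0) + 1
        if self.counts[loop_id] >= HOT_THRESHOLD:
            self.compiled[loop_id] = specialised
        return "slow", generic(value)


def generic_double(v):
    return v + v  # works for ints and strings


def int_double(v):
    if not isinstance(v, int):
        raise TypeError("guard failed: expected int")
    return v << 1  # speculative integer-only fast path


profiler = LoopProfiler()
trace = [profiler.run_iteration("loop0", generic_double, int_double, v)
         for v in [1, 2, 3, 4, "ab", 5]]
```

The string operand in the fifth iteration violates the integer guard, which is the situation a bailout is meant to handle.
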
---
## Software-Defined Memory Controllers: The Time Has Come
*Kevin Loughlin (University of Michigan) et al.*
This presentation was actually part of the WACI (Wild and Crazy Ideas) session, but apart from the rhymes, it did not seem too wild and crazy (it was rather a scientific contribution transformed into a poem).
### Idea
- Programmable memory controllers, useful for performance & security applications (e.g. partitioning)
---
# Session 1A: Accelerators
[Top](#asplos-22)
# Session 1B: Address and Memory
[Top](#asplos-22)
## CARAT CAKE: replacing paging via compiler/kernel cooperation
*Brian Suchy (Northwestern University) et al.*
:+1:
### Idea
- Software-based address translation
- Work on physical addresses
- Runtime manages memory
- No more need for MMU
- Allow arbitrary granularity (no longer constrained to 4K pages)
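
One way to picture protection without an MMU is a software guard that checks each access against a table of allowed regions. The sketch below assumes such a guard exists and is run before memory accesses; the notes do not describe the actual mechanism, and `RegionTable`, `add` and `guard` are hypothetical names. Regions are assumed not to overlap.

```python
import bisect


class RegionTable:
    """Sorted, non-overlapping [start, end) regions with permissions."""

    def __init__(self):
        self.starts = []   # sorted region start addresses
        self.regions = []  # parallel list of (start, end, perms)

    def add(self, start, length, perms):
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.regions.insert(i, (start, start + length, perms))

    def guard(self, addr, size, perm):
        """Return True if [addr, addr+size) lies in one region allowing perm."""
        i = bisect.bisect_right(self.starts, addr) - 1
        if i < 0:
            return False
        start, end, perms = self.regions[i]
        return addr + size <= end and perm in perms


table = RegionTable()
table.add(0x1000, 24, "rw")    # a 24-byte region, not a multiple of 4K
table.add(0x2000, 4096, "r")
```

Because regions are plain address ranges, their size is not tied to a page size, which matches the "arbitrary granularity" point above.
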
### CARAT CAKE
1. Kernel
- Nautilus Kernel
2. Libraries
- New Address Space (ASpace) Abstraction
- Linux Syscall Interface
- Allocator and Defragmenter
- Trusted Hooks
3. Linux Compatible Process
### Evaluation
1. Benchmark Overheads
- Paging overhead
- Geomean: 1.7%
- Worst: 16% (NAS EP)
- Competitive Overhead
2. Memory Movement Stress Test
- "Pepper Test" as a "worst case"
- Results: Negligible overhead
3. Engineering Effort
- Compiler: 3.7k LoC
- Kernel: 4.2k LoC
---
## NVAlloc: rethinking heap metadata management in persistent memory allocators
*Dang (Zhejiang University) et al.*
### Idea
- Leaks are particularly dangerous for persistent memory (PM)
- Make PM allocators aware of PM characteristics
- Cache line reflush
- Optimization: Interleaved Mapping. Consecutive metadata entries do not need to sit in consecutive memory, so they can be spread across cache lines
- Small random accesses
- Frequently happens in large allocations
- Optimization: Log-structured bookkeeping
- Static slab fragmentation
- Very problematic, cannot be resolved by restarting the system for PM
- Memory under-utilization
- Optimization: Slab Morphing. Transform slab to defragment.
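
The interleaved-mapping idea can be illustrated with index arithmetic. This is a sketch under assumptions: the 8-byte entry size and the striping formula are hypothetical, picked only to show how consecutive metadata indices can land in different cache lines.

```python
CACHE_LINE = 64   # bytes per cache line (typical x86 value)
ENTRY_SIZE = 8    # hypothetical metadata entry size in bytes
PER_LINE = CACHE_LINE // ENTRY_SIZE


def sequential_line(index):
    """Cache line holding entry `index` under a plain array layout."""
    return index // PER_LINE


def interleaved_slot(index, num_lines):
    """(line, position in line) of entry `index` when striped across lines.

    Valid for index < num_lines * PER_LINE.
    """
    return index % num_lines, index // num_lines
```

With a plain array, entries 0 to 7 share one cache line, so updating them back to back flushes the same line repeatedly. With striping over four lines, entries 0 to 3 fall into four different lines.
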
### Evaluation
- Allocation:
- Outperforms existing allocators
- Memory usage reduced (Fragbench)
---
## Every walk's a hit: making page walks single-access cache hits
*Park (Uppsala University) et al.*
### Idea
- Problem: Virtual Memory Translations don't scale
- Page Walk Caches can save intermediate Page Table walk steps
- Past Proposals
- Address Translation with Prefetching (ASAP)
- Elastic Cuckoo Page Tables (ECPT)
- Problem: Require large contiguous memory
- Page table walks with smaller contiguous memory
- Flatten Page tables
- Fit in a 2MB region
- Reduce serial memory accesses from 4 to 2 (1.5 to 1 with Page walk cache)
- Trade off more memory for higher performance
- Different flattening schemes and combinations are possible
- Small changes in HW and SW
- Prioritize Keeping
- PTE cache misses are expensive (DRAM latency)
- Prioritize Page tables in L3 cache by modifying replacement policy
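
The "4 to 2" and "2MB" figures above fit together if flattening is read as merging two adjacent 9-bit levels into one 18-bit level; that reading is an assumption of this sketch, as are the names `split_indices` and `table_bytes`. The arithmetic uses standard x86-64 parameters (4 KiB pages, 8-byte entries).

```python
PAGE_SHIFT = 12          # 4 KiB pages
PTE_SIZE = 8             # bytes per page table entry (x86-64)


def split_indices(vaddr, bits_per_level):
    """Split a virtual address into per-level table indices, top level first."""
    shift = PAGE_SHIFT + sum(bits_per_level)
    indices = []
    for bits in bits_per_level:
        shift -= bits
        indices.append((vaddr >> shift) & ((1 << bits) - 1))
    return indices


def table_bytes(bits):
    """Size of one table that is indexed with `bits` address bits."""
    return (1 << bits) * PTE_SIZE


CONVENTIONAL = [9, 9, 9, 9]   # four 4 KiB tables -> 4 serial accesses
FLATTENED = [18, 18]          # two merged tables -> 2 serial accesses

vaddr = 0x7F12_3456_7000
conv = split_indices(vaddr, CONVENTIONAL)
flat = split_indices(vaddr, FLATTENED)
```

A table indexed with 18 bits holds 2^18 entries of 8 bytes, which is exactly 2 MiB, so each flattened table needs one contiguous 2 MB region instead of a 4 KiB page.
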
### Evaluation
#### Performance
- Geomean: 9% performance increase
- Most of the improvement comes from prioritization
#### Energy
- L3 hits due to prioritization reduce DRAM accesses and thus energy consumption
---
# Session 2A: GPU and Data Analytics
[Top](#asplos-22)
## GPM: leveraging persistent memory from a GPU
*Pandey (Indian Institute of Science Bangalore) et al.*
### Idea
- Currently, PMs only benefit CPUs
- GPUs can accelerate PM applications
- Key Value Stores
- GPUs do not have direct PM access, accesses need to go via CPU -> lots of data movement
- In-kernel persistence: GPU can update only values in PM that are actually changed
- GPM: GPU with Persistent Memory
- GPM Hardware
- Map NVM into GPU's virtual address space using UVA
- Use fences to flush data into NVM
- Use Case: GPMBench
- Transactional (gpKVS, GPU-accelerated DB)
- Long-running iterative (DNN, CFD, BLK, HS)
- Native (BFS, SRAD, PS)
- Runtime Library (LibGPM)
- CUDA library for GPM
### Evaluation
- Transactional, BFS: speedup through reduced data movement
### Discussion
- An application that does not scale well: Binomial options -> not enough parallelism in PM
---
## GPUReplay: a 50-KB GPU stack for client ML
*Park (Purdue University) et al.*
### Idea
- GPUs in mobile sector used for general applications
- Large GPU SW stack
- Security vulnerabilities
- Fragmented deployment (different OSes, drivers etc.)
- Slow startup
- Reason: GPU stack engineered for graphics
- JIT compilation
- Dynamic resource management
- Fine-grained multiplexing
- GPU stack is overkill for ML applications
- No need for the above features
- Approach: Record CPU/GPU interactions during development, replay on end-device
- Register I/Os
- Interrupts
- GPU page table
- *no* inputs / outputs
- Need to handle Non-determinism
### Evaluation
- Mali Bifrost
- Reduced complexity (~2 orders of magnitude)
- Significantly lowered startup delays
- Moderately lowered inference delays
- Increased training performance (avoid DeepCL and OpenCL overhead)
---
## ValueExpert: exploring value patterns in GPU-accelerated applications
*Zhou (Rice University) et al.*
### Idea
- GPUs used for a wide number of applications
- Existing profiler tools: hotspot analysis (# API calls) -> Do not evaluate root cause
- Detect Redundant or useless computation and unnecessary data movement
- Value Pattern Categorization
- Coarse-grained patterns at each GPU operation (Redundant/Duplicate Values)
- Record GPU memory changes before/after GPU kernel
- Fine-grained patterns within each kernel (frequent values)
- Transform unnecessarily large data types
- Value Flow Graph
- Capture read/write dependencies
- Annotate profiling information to graph
- ValueExpert Workflow
- Online Analyser communicates with GPU, Offline Analyser collects data for GUI
*Code available on GitHub*
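
Independently of the authors' released code, the coarse-grained and fine-grained checks can be pictured as below. This is a simplified sketch on plain Python lists: the function names, the pattern labels and the 50% threshold are assumptions, not ValueExpert's actual categories or API.

```python
from collections import Counter


def classify_kernel_effect(before, after):
    """Compare memory snapshots taken before and after a GPU operation."""
    changed = sum(1 for b, a in zip(before, after) if b != a)
    if changed == 0:
        return "redundant"        # the operation rewrote values already present
    if len(set(after)) == 1:
        return "single-value"     # e.g. a buffer that is all zeros
    return "useful"


def frequent_value(values, threshold=0.5):
    """Return the dominant value if it covers at least `threshold` of accesses."""
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) >= threshold else None
```

A "redundant" result hints at a kernel or copy that could be skipped, while a dominant value hints at a cheaper representation for that data.
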
---
## SparseCore: stream ISA and processor specialization for sparse computation
*Rao (University of Southern California) et al.*
### Idea
- Graph pattern matching inefficient due to index matching
- Reduce data movement
- ISA extension
- Stream registers
- Add several instructions to enable operations on stream data
- Code examples for triangle computation, multiply
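
For reference, the software baseline that such stream instructions would target looks roughly like this: triangle counting by intersecting sorted adjacency lists. This is a plain Python sketch, not the ISA extension itself; the graph and function names are made up for illustration.

```python
def intersect_count(a, b):
    """Count common elements of two sorted lists with a two-pointer merge."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def count_triangles(adj):
    """Count triangles in an undirected graph given sorted adjacency lists."""
    total = 0
    for u, neighbours in adj.items():
        for v in neighbours:
            if u < v:
                # Only count common neighbours w > v so each triangle
                # u < v < w is counted exactly once.
                higher_u = [w for w in neighbours if w > v]
                higher_v = [w for w in adj[v] if w > v]
                total += intersect_count(higher_u, higher_v)
    return total


graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
```

The data-dependent branches in `intersect_count` are the index matching that the notes call inefficient on a general-purpose core.
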
### Architecture
- Data accessed from scratchpad and stream cache
- Stream information embedded in stream mapping table and stream registers
- Stream value generator -> stream value buffer -> stream value processing units
### Evaluation
- Platform?? Baseline?
---
## JSONSki: streaming semi-structured Data with Bit-Parallel Fast-Forwarding
*Jiang (UCR) et al.*
### Idea
- Extend Streaming evaluation for JSON processing:
- fast-forward certain cases
- e.g. when searching for a field, fast-forward over irrelevant field definitions
- Needs to recognize nested structures
- bit-parallel processing
- Partition string into structural intervals based on meta characters
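
Fast-forwarding over a nested object with a bitmap of meta characters can be sketched as follows. This is a simplified model: a real implementation would build the bitmap with SIMD instructions on fixed-size words, and this version assumes no braces occur inside string values. `char_bitmap` and `skip_object` are hypothetical names.

```python
def char_bitmap(text, chars):
    """Integer whose bit i is set when text[i] is one of `chars`."""
    bits = 0
    for i, ch in enumerate(text):
        if ch in chars:
            bits |= 1 << i
    return bits


def skip_object(text, start):
    """Return the index just past the object that opens at text[start].

    Simplifying assumption: no braces occur inside string values.
    """
    assert text[start] == "{"
    braces = char_bitmap(text, "{}")
    braces &= ~((1 << start) - 1)       # ignore everything before `start`
    depth = 0
    while braces:
        low = braces & -braces          # isolate the lowest set bit
        pos = low.bit_length() - 1      # index of the next brace
        depth += 1 if text[pos] == "{" else -1
        if depth == 0:
            return pos + 1
        braces ^= low                   # clear that bit and continue
    raise ValueError("unbalanced object")


doc = '{"a":{"x":{"y":1}},"b":2}'
inner_start = doc.index("{", 1)
```

Only the brace positions are visited, so the characters between them are never inspected one by one.
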
### Evaluation
- Evaluate on practical applications
- Significant performance gain over other JSON processors
- Memory consumption comparable to JPStream (and better than preprocessing-based processors)
# Session 2B: Privacy and Software Security
[Top](#asplos-22)
# Session 3A: Hardware Security (1)
[Top](#asplos-22)
## Pinned Loads: Taming Speculative Loads in Secure Processors
### Idea
- *Pinned load* loads and guarantees no squashing or eviction
- only after it we know there are no exceptions or misprediction
- Defer invalidations to pinned loads
- When core receives invalidation and line is pinned, it returns a `pinned` message which is acked by `abort` message
- Prevent Store Starvation
- Core does not pin to-be-invalidated cache lines
- Prevent evictions of pinned lines
- Prevent deadlocks
- Core cannot pin more lines than a set can hold
### Design
- Late Pinning: Receive data, then pin
- Early Pinning: Cache Shadow Table tracks # pinned in-flight loads
- Allows issuing pinned loads in parallel, but added HW costs
### Evaluation
- single-threaded (SPEC17) and parallel (SPLASH2 + PARSEC)
---
## DAGguise: Mitigating Memory Controller Side Channels
*Deutsch (MIT) et al.*
### Existing Approaches
- Partitioning (e.g. round-robin) can lead to poor performance
- Traffic shaping (e.g. Camouflage paper)
- Add traffic
- Insecure, expensive profiling
### Idea
- Memory Shaper: Shape accesses according to a secret-independent DAG to make victim access patterns indistinguishable, while still allowing dynamic sharing of the memory controller, by delaying or adding requests
- Formally verified and good performance
- Offline profiling
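
The shaping idea (delay real requests, add fake ones) can be illustrated with a heavily simplified model in which the request graph degenerates to a chain with a fixed gap between requests. The function `shape` and its parameters are hypothetical, and the real design is more general than this.

```python
from collections import deque


def shape(victim_arrivals, num_slots, gap):
    """Emit one request per slot of a fixed, secret-independent schedule.

    victim_arrivals: sorted cycle numbers at which the victim wants memory.
    Returns a list of (issue_cycle, kind); kind is "real" or "fake".
    """
    pending = deque(victim_arrivals)
    issued = []
    for slot in range(num_slots):
        cycle = slot * gap
        if pending and pending[0] <= cycle:
            pending.popleft()
            issued.append((cycle, "real"))   # delayed until its slot
        else:
            issued.append((cycle, "fake"))   # dummy request fills the slot
    return issued


bursty = shape([0, 1, 2], num_slots=4, gap=10)
idle = shape([25], num_slots=4, gap=10)
```

An observer who only sees issue times cannot tell `bursty` from `idle`, since both follow the same schedule; only the hidden `kind` differs.
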
### Evaluation
- gem5 and DRAMSim2
- 12% perf improvement vs static partitioning
### Questions
- Performance comparison to static partitioning -> shown in the paper.
- State in memory controller? -> Memory requests by DAG close rows.
# Session 3B: Misc.
[Top](#asplos-22)
# Session 4A: Systems for Machine Learning
[Top](#asplos-22)
# Session 4B: Operating Systems
[Top](#asplos-22)
## Clio: a hardware-software co-designed disaggregated memory system
*Guo et al.*
### Idea
- Hardware Memory disaggregation for improved performance, scalability, flexibility, cost
- RDMA -> low performance due to caching limitations
- Clio:
- virtual memory interface
- memory nodes (CLIO node)
- Eliminate state from memory node hardware
- Move state to compute node
- Reduce state in system protocol
- Asymmetric memory request protocol:
- Network ordering: reorder requests for reuse
- Congestion and Flow Control: Retry in case of rare packet losses
- Remove state from critical path
- Splitting fast path and slow path: slow for metadata, fast for data
- Optimize state to bundled design
- Hash-based page table
- Support distributed system
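
One plausible reason to prefer a hash-based page table in hardware is that a lookup touches a fixed amount of state. The sketch below assumes a bucketed table with a fixed number of slots per bucket; the layout, the overflow policy and all names are hypothetical and not taken from the Clio design.

```python
class HashPageTable:
    """Fixed-size bucketed hash table mapping (pid, vpn) -> physical frame."""

    def __init__(self, num_buckets, slots_per_bucket):
        self.num_buckets = num_buckets
        self.slots = slots_per_bucket
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, pid, vpn):
        return self.buckets[hash((pid, vpn)) % self.num_buckets]

    def insert(self, pid, vpn, frame):
        bucket = self._bucket(pid, vpn)
        if len(bucket) >= self.slots:
            return False          # overflow: the caller must choose another vpn
        bucket.append((pid, vpn, frame))
        return True

    def lookup(self, pid, vpn):
        # One bucket fetch and at most `slots` comparisons: bounded work.
        for p, v, frame in self._bucket(pid, vpn):
            if (p, v) == (pid, vpn):
                return frame
        return None


pt = HashPageTable(num_buckets=8, slots_per_bucket=2)
pt.insert(1, 0x10, 42)
```

Unlike a multi-level radix walk, the cost of `lookup` does not grow with the size of the mapped address space.
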
### Evaluation
- Prototype on FPGA
- bounded access latency
- scales well with memory size
---
## Enzian: an open, general, CPU/FPGA platform for systems software research
*Cock (ETHZ) et al.*
- Generic HW platform for novel applications
- Remove restrictions of existing platforms
- Existing cache coherent protocols
- FPGAs
- PCIe
- CCIX
- ECI
- Unit costs less important than design costs
- BW > Capacity
- ECI
- Open Nathan-compatible coherence protocol
- Lower latency than PCIe
- 2-socket asymmetric NUMA system
---
## Efficient and scalable core multiplexing with M3V
*Asmussen (Barkhausen) et al.*
- Secure and fast inter-core/tile communication
- Better integration of HW accelerators
- Communication between applications on a fast-path, bypassing the OS
- Enable strong isolation between tiles by virtualizing the DTU (?)
### Eval
- gem5
- FPGA: Use Rocket/BOOM cores as RISC-V tiles
---
## FlexOS: towards flexible OS isolation
*Lefeuvre (Uni Manchester) et al.*
### Idea
- Specify OS security/performance trade-off at compilation/deployment time rather than at design time
- Deployment for heterogeneous hardware, quickly isolate vulnerable libraries, incremental verification of code bases
### Approach
1. Focus on single-purpose appliances (e.g. cloud microservices)
2. Full-system understanding of compartmentalization
3. Abstract away technical details of isolation mechanisms
4. Maintain performance
### Design
- Based on Unikraft
### Eval
- Hardening on/off, Compartment selection for different libraries. Result: even distribution of performance overheads with different security guarantees
---
## Adelie: continuous address space layout re-randomization for Linux Drivers
*Nikolaev (PennState) et al.*
### Motivation
Device drivers are a target for kernel attacks
### Attacks
- Control-flow attacks
- Return-Oriented Programming
- Mitigated by ASLR and KASLR
- These are limited, attacks still possible
### Idea
- Transform all modules to KASLR, extend KASLR
- PIE patch for Linux kernel, not usable for kernel modules
- Use a more general PIC model for modules
- Extend KASLR to 64 bits and to support multiple mappings to code during ongoing re-randomization
- Differentiate between movable and immovable module parts
- Wrap externally accessible, movable function by an immovable one
---
# Session 5A: Quantum Computing
[Top](#asplos-22)
# Session 5B: Data Center and Cloud Services
[Top](#asplos-22)
## Astraea: Towards QoS-Aware and Resource-Efficient Multi-stage GPU Services
*Zhang (Shanghai Jiao Tong) et al.*
### Idea
- GPU Microservices instead of Monoliths
- Latency of GPU microservices due to communication overhead leads to QoS violations
- Naive approach: offline profiling (limited)
- Goal: max throughput, ensure QoS
- Using ML, predict
- memory usage
- FLOPS -> min. # GPUs
- Reduce intra-GPU communication overhead by global memory-based allocation
### Eval
- Based on Intel Xeon CPUs and Ubuntu
- Average speedup ~37%
Future: extend to both GPU and CPU benchmarks
---
## Memory-harvesting VMs in cloud platforms
*Fuerst (Microsoft Research) et al.*
### Idea
- Server memory is expensive but heavily unused
- Make available unused resources as services
- Build on existing CPU Harvesting VMs, virtualization techniques
- Challenges: VM Creation time, NUMA spanning, fragmentation
- No impact on regular guests
- Applications: Data analytics (MH-Hadoop), function as a service/serverless (MH-FaaS)
---
## IOCost: block IO control for containers in datacenters
:+1:
*Heo (Meta) et al.*
### Idea
- Partition Block I/O in containers
- Challenges: memory heterogeneity, workload heterogeneity, datacenter requirements (ease of use, work conservation, memory-awareness)
- offline profiling of device occupancy by block I/O
- unused I/O budget can be donated
---
## TMO: Transparent Memory Offloading in Datacenters
*Weiner (Meta) et al.*
### Idea
- DRAM costs very high -> Compressed DRAM
- Most applications' memory is cold
- Challenge: When are memory accesses critical and should be kept in DRAM?
- TMO: Monitor pressure stall information (kernel) to track impact of lack of resources -> iteratively reduce memory until performance suffers
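
The "reduce memory until performance suffers" loop can be sketched as a simple feedback controller. Everything concrete here is an assumption: the step size, the target, the floor and the toy pressure model are hypothetical, and `pressure` merely stands in for a pressure stall reading from the kernel.

```python
def offload_step(limit_mb, pressure, target=0.1, step_mb=64, floor_mb=256):
    """One control iteration: shrink the memory limit while pressure is low.

    pressure: fraction of time the workload stalled on memory (0.0 to 1.0).
    """
    if pressure < target:
        return max(floor_mb, limit_mb - step_mb)   # reclaim a bit more
    return limit_mb + step_mb                      # back off


def simulate(start_mb, working_set_mb, rounds):
    """Toy workload: it stalls only once the limit drops below its working set."""
    limit = start_mb
    history = []
    for _ in range(rounds):
        pressure = 0.0 if limit >= working_set_mb else 0.5
        limit = offload_step(limit, pressure)
        history.append(limit)
    return history


history = simulate(start_mb=1280, working_set_mb=1024, rounds=8)
```

In this toy model the limit settles into a small oscillation around the working set size, which is the point where further offloading starts to hurt.
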
---
# Session 6A: Accelerating Emerging Applications
[Top](#asplos-22)
# Session 6B: Bugs (1)
[Top](#asplos-22)
## RSSD: defend against ransomware with hardware-isolated network-storage codesign and post-attack analysis
:+1:
*Reidys (Uni Illinois) et al.*
### Idea
- Ransom: Encrypt data and request money to decrypt
- Mitigations: Malware detection, data backup
- Exploit out-of-place SSD
- GC attack
- dump data to trigger GC
- Timing attack
- hide behind user software to trigger GC
- Trim attack
- use trim command
- Ransomware-aware SSD (RSSD)
- Offload retained data to remote server
- Limited storage capacity -> compress data and transfer to remote server
- Dedicated HW for transfers (DMA) to protect against host
- Use idle cycles to transfer w/o interference
### Eval
- Implement enhanced SSD on FPGA
---
## Creating Concise and Efficient Dynamic Analyses with ALDA
*Cheng (Georgia Tech) et al.*
---
# Session 7A: Serverless Computing
[Top](#asplos-22)
# Session 7B: Bugs (2)
[Top](#asplos-22)
# Session 8A: Non-traditional Computing and Reconfigurable Hardware
[Top](#asplos-22)
# Session 8B: Synthesis and Compilation
[Top](#asplos-22)
## CirFix: automatically repairing defects in hardware design code
:+1:
*Ahmad (Uni Michigan) et al.*
### Problems
- Software-based APR is not amenable to traditional HW testbenches
- Fault localization approaches from software-based APR do not scale to HW
### Idea
- Identify repairs by utilising fitness function summing bit-errors
- Fault localization: Compare wire values and oracle simulation to get mismatches
- Ignore implicated mismatches (propagation)
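
A fitness function based on bit errors, plus a naive mismatch check, might look like this. The normalisation to a 0 to 1 score and the dictionary-based wire comparison are assumptions of this sketch; the notes only say that the fitness sums bit-errors and that wire values are compared against an oracle simulation.

```python
def bit_errors(expected, actual, width):
    """Number of differing bits between two `width`-bit output values."""
    mask = (1 << width) - 1
    return bin((expected ^ actual) & mask).count("1")


def fitness(expected_trace, actual_trace, width):
    """Fraction of output bits that match the oracle over a whole simulation."""
    total_bits = width * len(expected_trace)
    wrong = sum(bit_errors(e, a, width)
                for e, a in zip(expected_trace, actual_trace))
    return 1.0 - wrong / total_bits


def mismatched_wires(expected, actual):
    """Names of wires whose simulated value differs from the oracle."""
    return sorted(name for name in expected
                  if expected[name] != actual.get(name))
```

A candidate repair with a fitness of 1.0 reproduces the oracle output on every simulated time step, and partial scores let the search rank candidates that are only partly correct.
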
### Evaluation
- Evaluate on defective designs
- Seed defects, easy and hard ones
- What fractions of defects can CirFix detect/repair?
- found 21/32, repaired 16/32
- similar performance for easy and hard defects
---
## Vector instruction selection for digital signal processors using program synthesis
*Ahmad (Adobe) et al.*
- Pattern-Matching is brittle, super-optimisation is slow
- Optimization scope of sliding windows may be too small
- Rake: instruction-selection algorithm
- Unifying function: Uber-instruction
- Higher-level IR
---
## HeteroGen: transpiling C to heterogeneous HLS code with automated test generation and program repair
*Zhang (Uni Cal LA) et al.*
- Problem: differences between HLS-C and standard C/C++, manual transformation required
- e.g. custom floating point format, remove dynamic data structures
- Key idea: Use automated program repair (Oracle) to fix errors when porting
- Two optimisations: error dependencies to reduce search space, linting
---
## Tree Traversal Synthesis using Domain-Specific Symbolic Compilation
*Chen (Uni Santa Barbara) et al.*
# Session 9A: Hardware Security (2)
[Top](#asplos-22)
## SRAM has no chill: exploiting power domain separation to steal on-chip secrets
*Mahmod () et al.*
- Volt boot attack:
- read SRAM contents
- no off-time limit, no temperature dependency
- e.g. attack caches, CPU registers, iRAMs
---
## Randomized Row-Swap: Mitigating Row Hammer By Breaking Spatial Correlation Between Aggressor and Victim Rows
### Problem
- Increasing density -> increasing inter-cell interference
### Previous Work
- Targeted Row Refresh in DDR4
- Track Aggressor rows
- mitigative action by refreshing victim rows
- broken
### Idea
- Use `Graphene` to track aggressor rows (cf. MICRO 20)
- Randomised row swap: swap aggressor rows before bit-flips occur
- Implement using *hot row tracking*
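
The swap mechanism can be pictured with an indirection table from logical to physical rows. In this sketch, exact per-row counters stand in for the hot-row tracker, and the threshold, the seeded random source and the class name `RowSwapper` are assumptions. It may also pick the row itself as swap partner, which a real design would presumably avoid.

```python
import random


class RowSwapper:
    """Swaps a row with a random one once its activation count hits a threshold."""

    def __init__(self, num_rows, threshold, seed=0):
        self.remap = list(range(num_rows))   # logical row -> physical row
        self.counts = [0] * num_rows         # activations per logical row
        self.threshold = threshold
        self.rng = random.Random(seed)       # seeded only for reproducibility

    def activate(self, logical_row):
        """Record an activation and return the physical row that was accessed."""
        physical = self.remap[logical_row]
        self.counts[logical_row] += 1
        if self.counts[logical_row] >= self.threshold:
            other = self.rng.randrange(len(self.remap))
            self.remap[logical_row], self.remap[other] = (
                self.remap[other], self.remap[logical_row])
            self.counts[logical_row] = 0
        return physical


swapper = RowSwapper(num_rows=16, threshold=4)
accessed = [swapper.activate(3) for _ in range(12)]
```

After a swap, further activations of the same logical row land on a different physical row, so the original neighbours stop accumulating disturbance from that aggressor.
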
---
## ShEF: shielded enclaves for cloud FPGAs
*Zhao (Uni Stanford) et al.*
### Motivation
- Secure acceleration for the cloud
- TEEs don't support accelerators
- Cloud accelerators: no security
- On-prem accelerators: no cloud
### Idea
- end2end framework for cloud FPGAs
- Threat model: untrusted FPGA/cloud provider
- Current FPGA root of trust only available for owners of the FPGA (e.g. burning key)
- Chip manufacturer embeds device-specific key in FPGA
- Firmware implementing protocols
- Security kernel for remote attestation and secure boot
- RTL shield modules
- enable customized security
### Eval
- Local UltraScale+
- Cloud UltraScale+ (AWS F1)
---
## Invisible bits: hiding secret messages in SRAM's analog domain
:+1:
*Mahmod (Virginia Tech) et al.*
Steganography: Information hiding technique
### Idea
- Hide information in plain sight to allow plausible deniability
- plausible-deniable covert channel
- Influence power-on state by exploiting ageing burns
- Accelerate aging by increasing voltage/temperature (*baking*)
- Read power-on state
- Improve accuracy by using ECC
- Encrypt message to remove correlation -> plausible deniability
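
The encrypt-then-ECC pipeline can be sketched as follows. Both building blocks are stand-ins: the repetition code with majority vote replaces whatever ECC the authors use, and the SHA-256 based keystream is a hypothetical, illustration-only cipher. Writing to and reading from actual SRAM cells is not modelled.

```python
import hashlib


def keystream(key, nbits):
    """Illustrative keystream: bits of SHA-256(key || counter)."""
    bits = []
    counter = 0
    while len(bits) < nbits:
        digest = hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        for byte in digest:
            bits.extend((byte >> i) & 1 for i in range(8))
        counter += 1
    return bits[:nbits]


def encode(message_bits, key, repeat=3):
    """Encrypt by XOR, then repeat each bit to tolerate noisy power-on reads."""
    ks = keystream(key, len(message_bits))
    cipher = [m ^ k for m, k in zip(message_bits, ks)]
    return [bit for bit in cipher for _ in range(repeat)]


def decode(cell_bits, key, repeat=3):
    """Majority-vote each group of cells, then decrypt."""
    groups = [cell_bits[i:i + repeat] for i in range(0, len(cell_bits), repeat)]
    cipher = [1 if sum(g) * 2 > len(g) else 0 for g in groups]
    ks = keystream(key, len(cipher))
    return [c ^ k for c, k in zip(cipher, ks)]


message = [1, 0, 1, 1, 0, 0, 1, 0]
cells = encode(message, b"k")
noisy = list(cells)
noisy[0] ^= 1   # one flipped cell in the first group
noisy[4] ^= 1   # one flipped cell in the second group
```

Encrypting first makes the stored bits look uncorrelated with the message, and the redundancy then absorbs cells whose power-on state does not match what was written.
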
### Eval
# Session 9B: Smart Networking
[Top](#asplos-22)