# Reading Note – Composite-ISA Cores: Enabling Multi-ISA Heterogeneity Using a Single ISA
###### tags: `paper`
## Introduction
* Paper: [here]()
* Author: Ashish Venkat (UVirginia), Harsha Basavaraj, Dean M. Tullsen (UCSD)
* Published at HPCA 2019
* TL;DR:
**composite-ISA cores implement custom ISAs derived from a single superset ISA** (an extension of x86), recovering most of the benefit of multi-ISA heterogeneity while avoiding cross-vendor licensing, fat binaries, and costly binary translation on migration.
## Problem formulation
ISA heterogeneity
* pros: improves execution efficiency in chip multiprocessors (CMPs) and in data movement
* cons:
    * compromise when picking an ISA: e.g., choosing x86 for only one key feature
    * deployment of heterogeneous ISAs is hard:
        * fat binaries
        * expensive binary translation and state transformation, due to the differing encoding schemes and ABIs of the ISAs
> composite-ISA: cores implementing **custom ISAs** derived from a single large superset ISA
* potential: saves licensing and testing costs, and avoids the deployment overheads above
* result: greatly increased flexibility in creating cores that mix and match specific sets of features --> outperforms heterogeneous-ISA designs
## Composite ISA
* 26 custom ISAs: derived from a baseline superset ISA that offers a wide range of customizable features: register depth (number of programmable registers), register width, addressing mode complexity, predication, and vector support
* baseline: an extension of x86
* compiler techniques to extend x86
* Achievement: 18% performance gain and a 35% energy-delay product (EDP) reduction over single-ISA heterogeneous designs
## Related work
* Kumar -- **Single-ISA heterogeneous multi-core** architectures for multithreaded workload performance:
    * cores of different sizes and features --> applications can dynamically identify the best-fit core and migrate to it
* Heterogeneous-ISA architectures: cores that are already microarchitecturally heterogeneous to further implement diverse instruction sets
* ISA affinity
## ISA feature set derivation -- superset ISA
* like x86 but with customizable dimensions: register depth, register width, opcode and addressing mode complexity, predication, and data-parallel execution; discuss impact on code generation, processor performance, power, area implication
* dimension effects:
* register depth
        * too few registers --> compiler spills virtual registers to memory (lots of swapping), lowering instruction-level parallelism; mitigated by redundancy elimination and rematerialization
* customize cores with different depth to alleviate register pressure
* register width
        * larger width --> larger cache
        * sub-registers can serve as a replacement; sub-register coalescing
* opcodes variation
        * reduced set of opcodes and addressing modes --> simplified instruction decode engine
        * their design: follows x86's existing variable-length encoding and 2-phase decoding scheme, avoiding the binary translation costs associated with multi-vendor heterogeneous-ISA designs
    * predication
        * partial predication, full predication, conditional execution with one condition code register
        * x86 already supports partial predication via CMOVxx instructions, which are predicated on condition codes
    * data-parallel execution:
        * SSE2; overhead: 1:n encoding of macro-ops to micro-ops
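A toy sketch (my own illustration, not from the paper) of the 1:n macro-op-to-micro-op overhead noted for the data-parallel dimension: one packed vector macro-op cracks into one micro-op per lane. The opcode, register names, and string format are invented.

```python
# Toy model of 1:n macro-op -> micro-op cracking for SSE2-style vector
# execution. All names/formats here are invented for illustration.

def crack_vector_macro_op(opcode, dst, src, lanes):
    """Expand one vector macro-op into per-lane scalar micro-ops."""
    return [f"{opcode}_lane {dst}[{i}], {src}[{i}]" for i in range(lanes)]

# one 128-bit packed 32-bit add -> four lane micro-ops
uops = crack_vector_macro_op("paddd", "xmm0", "xmm1", lanes=4)
```

The 1:n expansion is where the decode/issue overhead comes from: a single fetched instruction occupies n micro-op slots downstream.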
## compiler and runtime strategy
* migrate without full binary translation and/or state transformation.
* use LLVM MC infrastructure to efficiently encode the right set of features
* Migration Strategy: for downgrade translation
    * from x86 to microx86: addressing modes -- translate any instruction that directly operates on memory into a set of simpler instructions that adhere to the ld-compute-st format.
    * register context block: for register depth changes
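The downgrade translation above can be sketched as a tiny rewrite rule: an instruction that operates directly on memory becomes a load / compute / store triple. The syntax, temporary register name, and helper below are hypothetical, not the paper's actual translator.

```python
# Sketch of ld-compute-st downgrade translation (x86 -> microx86-style).
# Instruction syntax and the temporary register "t0" are invented here.

def to_ld_compute_st(op, mem, reg, tmp="t0"):
    """Rewrite  `op [mem], reg`  into the ld-compute-st format."""
    return [
        f"ld {tmp}, [{mem}]",   # load the memory operand into a temp
        f"{op} {tmp}, {reg}",   # compute on registers only
        f"st [{mem}], {tmp}",   # store the result back
    ]

# e.g. `add [rbp-8], rbx` becomes three simpler instructions
seq = to_ld_compute_st("add", "rbp-8", "rbx")
```

Because the downgraded form touches only registers between the load and store, the target core needs no memory-operand addressing modes, which is what lets migration skip full binary translation.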
## Decoder design
* TL;DR: minimal changes to decoder
* instruction encoding:

* decoder
    * ILD (instruction-length decoder)
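A minimal sketch of what the ILD phase does in a 2-phase variable-length decode: scan the byte stream and mark where each instruction starts. The length table below is a tiny hypothetical subset; real x86 length decoding also handles prefixes, ModRM, SIB, and displacement/immediate sizes.

```python
# Toy instruction-length decoder (ILD): phase 1 of a 2-phase x86-style
# decode marks instruction boundaries in a raw byte stream.
# The opcode->length table is a hypothetical subset for illustration.

LENGTH_BY_OPCODE = {
    0x90: 1,  # nop            (1 byte)
    0xB8: 5,  # mov eax, imm32 (opcode + 4 immediate bytes)
}

def mark_boundaries(byte_stream):
    """Return the start offset of each instruction in the stream."""
    starts, pc = [], 0
    while pc < len(byte_stream):
        starts.append(pc)
        pc += LENGTH_BY_OPCODE[byte_stream[pc]]
    return starts

# nop, mov eax imm32, nop -> instructions start at offsets 0, 1, 6
offsets = mark_boundaries(bytes([0x90, 0xB8, 0, 0, 0, 0, 0x90]))
```

Keeping the superset ISA on x86's existing encoding means this boundary-marking logic is shared across all composite cores, which is why the paper's decoder needs only minimal changes.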