# Reading Note – Composite-ISA Cores: Enabling Multi-ISA Heterogeneity Using a Single ISA

###### tags: `paper`

## Introduction

* Paper: [here]()
* Authors: Ashish Venkat (UVirginia), Harsha Basavaraj, Dean M. Tullsen (UCSD)
* Published at HPCA 2019
* TL;DR: **cores implement custom ISAs derived from a single large superset ISA**, capturing the benefits of multi-ISA heterogeneity while avoiding its licensing, fat-binary, and binary-translation costs

## Problem formulation

ISA heterogeneity

* pros: improves execution efficiency in both chip multiprocessors (CMPs) and data movement
* cons:
    * compromise when picking an ISA: e.g. choose x86 only for one key feature
    * deployment of heterogeneous ISAs is hard
        * fat binaries
        * expensive binary translation and state transformation, because the ISAs differ in encoding schemes and ABIs

> composite-ISA: cores implementing **custom ISAs** derived from a single large superset ISA

* potential: save the licensing, testing, and deployment overheads above
* result: greatly increased flexibility in creating cores that mix and match specific sets of features --> outperforms heterogeneous-ISA designs

## Composite ISA

* 26 custom ISAs: a baseline superset ISA offers a wide range of customizable features: register depth (number of programmable registers supported), register width, addressing mode complexity, predication, and vector support
* baseline: an extension of x86
* compiler techniques to extend x86
* Achievement: improves performance by 18% and reduces the energy-delay product (EDP) by 35% over single-ISA heterogeneous designs

## Related work

* Kumar -- **Single-ISA heterogeneous multi-core** architectures for multithreaded workload performance:
    * cores of different sizes and features --> allow an application to dynamically identify the best core and migrate
* Heterogeneous-ISA architectures: cores that are already microarchitecturally heterogeneous further implement diverse instruction sets
    * ISA affinity

## ISA feature set derivation -- superset ISA

* like x86 but with customizable dimensions: register depth, register width,
opcode and addressing mode complexity, predication, and data-parallel execution; the paper discusses the impact on code generation, processor performance, power, and area
* dimension effects:
    * register depth
        * too few registers --> the compiler spills virtual registers, swapping between registers and memory a lot and lowering instruction-level parallelism (mitigated by redundancy elimination and re-materialization)
        * customize cores with different register depths to alleviate register pressure
    * register width
        * larger width --> larger caches
        * sub-registers can be a replacement; sub-register coalescing
    * opcode variation
        * a reduced set of opcodes and addressing modes --> simplifies the instruction decode engine
        * their design: follows x86's existing variable-length encoding and 2-phase decoding scheme, avoiding the binary translation costs associated with multi-vendor heterogeneous-ISA designs
    * predication
        * partial predication, full predication, conditional execution with one condition code register
        * x86 already supports partial predication via CMOVxx instructions that are predicated on condition codes
    * data-parallel execution:
        * SSE2; overhead: 1:n encoding of macro-ops to micro-ops

## Compiler and runtime strategy

* migrate without full binary translation and/or state transformation
* use the LLVM MC infrastructure to efficiently encode the right set of features
* Migration strategy: for downgrade translation
    * from x86 to microx86 (addressing modes): translate any instruction that directly operates on memory into a set of simpler instructions that adhere to the ld-compute-st format
    * register context block: for register depth changes

## Decoder design

* TL;DR: minimal changes to the decoder
* instruction encoding: ![](https://i.imgur.com/W8MTrts.png)
* decoder
    * ILD (instruction-length decoder)
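The downgrade translation idea (x86 to microx86) can be sketched as a simple rewrite pass: any instruction with a memory operand is split into an explicit load, a register-register compute, and, for a memory destination, a store back. A minimal sketch in Python, assuming a toy `(op, dst, src)` tuple format; the function name `to_ld_compute_st` and the scratch register `t0` are hypothetical illustrations, not names from the paper.

```python
# Toy sketch of downgrade translation into ld-compute-st form.
# Instruction format: (opcode, dst, src); memory operands are written "[addr]".
# The scratch register "t0" is illustrative only.

def to_ld_compute_st(instr):
    """Rewrite one instruction into a list of simpler ld/compute/st steps."""
    op, dst, src = instr
    scratch = "t0"
    out = []
    if src.startswith("["):            # memory source -> load it first
        out.append(("ld", scratch, src))
        src = scratch
    if dst.startswith("["):            # memory destination -> load, compute, store
        out.append(("ld", scratch, dst))
        out.append((op, scratch, src))
        out.append(("st", dst, scratch))
    else:                              # register destination -> plain compute
        out.append((op, dst, src))
    return out
```

For example, `("add", "[0x10]", "rax")` becomes a load of `[0x10]` into the scratch register, a register-register add, and a store back; x86's lack of memory-to-memory arithmetic means at most one operand ever needs the scratch.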