# Advanced Compiler Design: Final @ntu-csie acd final (114-1) [toc] --- ### Question 1 [Warm-up] Choose the wrong answer below. * Mag-7 (Google, Meta, Amazon, Microsoft, NVidia, Apple, Tesla) are doing their own chips. Compiler people are in high demand. * Current Mediatek phones are ARM-based. But Mediatek has been investing in RISC-V. * Training a foundation model takes at least thousands of H100, e.g. Taiwan has many H100s, so we'll finally train a large foundation model. * Advanced compiler course should cover advanced topics. ### Question 2 [Warm-up] Given the following program: ```clike= FOR j = 1 TO n FOR i = 1 TO n A[i,j] = A[i,j] + B[i-1,j]; (S1) B[i,j] = A[i,j-1] * B[i,j]; (S2) ``` Is this program parallelizable that allows each core execute its own iteration? (Mandarin: 這個程式可否透過 Affine Partition 來做平行化?) * YES * NO ### Question 3 [Warm-up] Choose the wrong answer below. * LLVM's Intermediate Representation uses SSA form. * The LLVM pass "dead code elimination" (DCE) should always happen at the end of compilation, to optimize away the dead code in the most effective way. * Blocking is not a uni-modular transform. * Looping (that is, a loop code) is one reason to cause spatial locality. ### Question 4 To convert the following loops to a form where the loop indexes are each incremented by 1, we'll convert the following code into which one below? ```clike= for (i = -3; i <= 10; i = i + 2) X[i] = X[i+1]; ``` * `for (i = 0; i < 6; i++) X[-3+2*i] = X[-2+2*i];` * `for (i = 0; i <= 6; i++) X[-3+2*i] = X[-2+2*i];` * `for (i = 0; i <= 5; i++) X[-3+2*i] = X[-2+2*i];` * `for (i = 0; i < 5; i++) X[-3+2*i] = X[-2+2*i];` ### Question 5 Some students claimed that DCE is not an important optimization since only marginal performance gain was observed. In fact, DCE could be very important when combining with "other optimizations". Which one below is NOT one of the "other optimizations"? * Strength reduction * Procedure inlining * Strip dead debug info --- ## Problem Group: SSA Construction and Dominance ![image](https://hackmd.io/_uploads/r1NGNh4QZx.png) ### Question 6: Dominance Relationships Which of the following dominance relationships is CORRECT? * `B4` dominates `B8` * `B2` dominates `B8` * `B1` dominates `B5` * `B3` dominates `B8` * `B6` dominates `B9` ### Question 7: Dominance Frontier What is the dominance frontier `DF(B2)`? * `{B4, B8}` * `{B6, B7, B8}` * `{B8, B9}` * `{B8}` * `∅` (empty set) ### Question 8: φ-Function Placement for Variable x In minimal SSA form, where must φ-functions for variable `x` be placed? Recall the definitions of `x`: > `B0`: `x = 10` (initial) > `B2`: `x = x + 1` > `B5`: `x = y` * `B8` only * `B4` and `B8` * `B2`, `B5`, and `B8` * `B8` and `B9` * No φ-function needed for `x` ### Question 9: SSA Value Numbering After SSA construction, assume constant propagation determines `p = true` and `q = false`. After optimization, what is the final value of `z`? * `z = 21` * `z = 22` * `z = 10` * Cannot be determined statically * `z = 11` ### Question 10: Dead Code After Constant Propagation With `p = true` and `q = false`, which blocks become dead code? * `B3`, `B5` only * `B3`, `B5`, `B6`, `B7` * `B6` only * No blocks are dead * `B3`, `B5`, `B6` only --- ## Problem Group: Loop Optimization The following code is used for Questions 11-15: ```clike= // Assume n > 0 int compute(int *A, int *B, int n, int k) { int sum = 0; int offset = k * 4; // Line 1: Loop invariant int scale = k * k; // Line 2: Loop invariant for (int i = 0; i < n; i++) { int idx = i * 8; // Line 3 int addr = offset + idx; // Line 4 sum += A[addr] * scale; // Line 5 B[i * 8] = sum; // Line 6 } return sum; } ``` Corresponding CFG in SSA form: ![image](https://hackmd.io/_uploads/HJcvU3VX-x.png) ### Question 11: Loop Invariant Code Motion Which expressions are loop-invariant (i.e., could be hoisted before the loop if computed inside)? * `offset`, `scale`, and `idx` * `offset` and `scale` only * `offset`, `scale`, and `addr` * All expressions in the loop body are loop-variant * Only `scale` (due to potential aliasing with `offset`) ### Question 12: Strength Reduction Assuming two's-complement arithmetic with no overflow traps, after applying strength reduction to eliminate `i * 8` computations, which auxiliary variables and increments are needed? * `t1` for `i * 8` (array index), `t2` for `i * 8` (B index), increment both by 8; two auxiliary variables * `t1` for `i * 8`, increment by 1; one auxiliary variable * No strength reduction possible (multiplication by constant is already optimized) * `t1` for `i * 8`, `t2` for `addr`, increment `t1` by 8, `t2` by 8; two auxiliary variables * `t1` for `i * 8`, increment by 8; one auxiliary variable ### Question 13: Combined Optimization Effects After applying: (1) loop invariant code motion for `offset` and `scale`, (2) strength reduction for `i * 8`, and (3) common subexpression elimination, how many multiplications remain inside the loop body per iteration? * 0 multiplications * 2 multiplications * 3 multiplications * 1 multiplication * 4 multiplications ### Question 14: Memory Aliasing and Optimization Safety If the compiler cannot prove that arrays `A` and `B` don't overlap, which optimization becomes UNSAFE? * Hoisting `offset = k * 4` before the loop * Hoisting `scale = k * k` before the loop * Strength reduction of `i * 8` * Reordering or caching the load `A[addr]` across iterations * All optimizations remain safe ### Question 15: Optimization Pass Effects Consider applying the following classical optimizations to the code in sequence: > 1. Constant Folding (CF) > 2. Common Subexpression Elimination (CSE) > 3. Dead Code Elimination (DCE) Assume `k = 2` is known at compile time. Which statement correctly describes what each pass achieves? * CF computes `offset = 8`, `scale = 4`; CSE has no effect; DCE removes `idx` * CF has no effect (k is a runtime parameter); CSE eliminates one `i * 8`; DCE removes nothing * CF computes `offset = 8`, `scale = 4`; CSE merges the two `i * 8` computations into one shared value; DCE removes nothing * CF computes `offset = 8` only; CSE merges `i * 8`; DCE removes `scale` as dead * CF, CSE, and DCE all have no effect on this code --- ## Problem Group: Sea of Nodes The following C code is used for Questions 16-20: ```clike= int example(int a, int b, int c) { int x = a + b; int y = a + b; // Same as x int z; if (c > 0) { z = x * 2; } else { z = y * 3; } return z + x; } ``` Sea of Nodes representation as below: ![image](https://hackmd.io/_uploads/HkHaI3NX-e.png) ### Question 16: Pinned Nodes In the Sea of Nodes representation above, which nodes are PINNED (cannot float to different positions)? * ADD nodes only * START, IF, REGION, RETURN only * START, IF, REGION, PHI, RETURN, and projection nodes (PrjT/PrjF) * All nodes are pinned * Only START and RETURN are pinned ### Question 17: Common Subexpression Elimination The original code has `x = a + b` and `y = a + b`. In the Sea of Nodes representation, how is CSE represented? * Two separate ADD nodes, one for `x` and one for `y` * A single ADD node with multiple outgoing data edges * Two ADD nodes connected by an equivalence edge * A COPY node connecting `x` to `y` * CSE cannot be represented in Sea of Nodes ### Question 18: PHI Node Behavior In the Sea of Nodes representation, which statement about the PHI node is CORRECT? * PHI selects between `x*2` and `y*3` based on runtime control flow * PHI always returns the larger of its two inputs * PHI computes the average of its inputs * PHI is evaluated before the IF node executes * PHI creates a copy of both inputs for later use ### Question 19: Global Code Motion Assuming all integer arithmetic is non-trapping (no overflow exceptions), the multiplication `x * 2` (in the true branch) could theoretically be moved to which position by Global Code Motion? * After computing `x = a + b` but before the IF node (speculative hoist) * Only after the REGION node (post-merge) * It must stay within the true branch control region * It can be moved to any position since it's a pure operation * After the RETURN node ### Question 20: Memory Dependencies How would Sea of Nodes represent the memory dependencies? ![image](https://hackmd.io/_uploads/H1ilw24m-x.png) * Memory operations access an implicit global state with no explicit memory edges or SSA versions * Memory operations are connected via explicit memory edges forming an SSA chain * Memory operations are unordered since they're pure computations * Memory is not representable in Sea of Nodes * Each LOAD/STORE has implicit dependencies through control edges only --- ## Problem Group: SelectionDAG, LLVM Pass, Loop Vectorizer, SLP ### Question 21: SelectionDAG Consider the following SelectionDAG, which ONE of the following statements is INCORRECT? ![image](https://hackmd.io/_uploads/S1eQuhEXWx.png) * Node t5, an "ADD" operation, depends on the results of node t2 and node t4, which means the input operands of node t5 are the outputs of node t2 and node t4. * A blue dashed-edge from node t7 to node t0 indicates the dependency between two nodes. In this scenario, it means that node t0 must occur before node t7 and no other nodes are allowed to be in between. * A red edge from node t8 to node t7 indicates the dependency between two nodes. In this scenario, node t7 must occur before node t8 and no other nodes are allowed to be in between. * Node t6 has a concrete Machine Value Type (MVT) which is a 32-bit integer. ### Question 22: LLVM Loop Vectorizer When applying LLVM's Loop Vectorizer, several intricate steps and considerations ensure correctness, performance, and maintainability of the transformed code. Which TWO of the following statements about LLVM loop vectorizer are TRUE? * The loop vectorizer must perform both memory and scalar legality checks to ensure that reordering or widening instructions does not introduce data races or break dependencies crucial to the original semantics. * After legality checks, the vectorizer uses VPlans and associated VPRecipes to represent candidate transformations at different vectorization factors (VFs), allowing it to compare multiple transformation plans to choose the best VF for each instruction within the loop. * Converting PHI nodes that span multiple incoming edges into vector form often requires the insertion of additional instructions (e.g., vector select or VMERGE instructions), potentially increasing the cost of vectorization if many such PHIs are present, which may cause vectorization not profitable. * If there is a pass executed before the loop vectorizer that performs "if-conversions," converting PHI nodes spanning multiple incoming edges into several select instructions, then the vectorization cost model does not need to consider this case, since these PHIs are already eliminated by the previous pass. ### Question 23: Loop-Closed SSA Now we focus on the properties of Loop-Closed SSA (LCSSA). Which TWO of the following statements about LLVM loop vectorizer are TRUE? * Programs written in LLVM IR are always in Loop-Closed SSA (LCSSA) form but not necessarily in SSA form. * Dominance frontier is computed when constructing SSA form, whereas it's unnecessary when we want to insert the loop-closed PHI nodes for LCSSA form. * If a loop is in LCSSA form, in any loop transformation, we only need to update the loop closing PHI nodes for the changes to take effect. * A program is in Loop Closed SSA Form if it is in SSA form and all values that are defined in a loop are used only inside this loop. ### Question 24: Superword Level Parallelism According to the origin of Superword Level Parallelism (SLP), "Exploiting Superword Level Parallelism with Multimedia Instruction Sets", SLP was initially developed targeting basic blocks instead of loop nests. To get more familiar, we begin with some terminology constantly used throughout the discussion of SLP. Which ONE of the following statements is INCORRECT? * Isomorphic statements are those that contain the same operations in the same order. * Isomorphic statements can be executed in parallel by a technique we call statement packing. Source operands in corresponding positions will be packed into registers and the operators will be replaced by their SIMD counterparts. * The performance benefit of statement packing is determined by the speedup gained from parallelization plus the cost of packing and unpacking. * Packed statements that contain adjacent memory references are particularly well suited for SLP execution since it requires no reshuffling with a register. ### Question 25: SLP Algorithm Phases In "Exploiting Superword Level Parallelism with Multimedia Instruction Sets", SLP algorithm can be divided into several distinct phases. Now we focus on three of the core phases, including identifying adjacent memory references, combination, and scheduling. Considering the following figure, which ONE of the following statements is INCORRECT? ![image](https://hackmd.io/_uploads/Hkivu2NXWe.png) * After identifying adjacent memory references, statement 1 and 2 can form a pair, while statement 2 and 6 can form a pair. * After combination, statement 1, 2 and 3 can be combined to form a larger group, while statement 3, 4, 5 can be combined to form a larger group. * After scheduling, dependency analysis will be performed to ensure statements within a group can be executed safely in parallel. However, it can be the case that executing two groups produces a dependency violation. * After all of the phases of SLP algorithm, the above statements can be scheduled and new SIMD operations are emitted. --- ## Problem Group: LLVM Backend ### Question 26: RISC-V Pattern Optimization Consider this actual pattern from RISC-V backend in LLVM (llvm/lib/Target/RISCV/RISCVInstrInfo.td): ```cpp // Check if (add r, imm) can be optimized to (ADDI (ADDI r, imm0), imm1), // in which imm = imm0 + imm1 and both imm0 and imm1 are simm12. We make imm0 // as large as possible and imm1 as small as possible so that we might be able // to use c.addi for the small immediate. def AddiPair : PatLeaf<(imm), [{ if (!N->hasOneUse()) return false; // The immediate operand must be in range [-4096,-2049] or [2048,4094]. int64_t Imm = N->getSExtValue(); return (-4096 <= Imm && Imm <= -2049) || (2048 <= Imm && Imm <= 4094); }]>; // Return imm - (imm < 0 ? -2048 : 2047). def AddiPairImmSmall : SDNodeXForm<imm, [{ int64_t Imm = N->getSExtValue(); int64_t Adj = N->getSExtValue() < 0 ? -2048 : 2047; return CurDAG->getSignedTargetConstant(Imm - Adj, SDLoc(N), N->getValueType(0)); }]>; // Return -2048 if immediate is negative or 2047 if positive. These are the largest simm12 values. def AddiPairImmLarge : SDNodeXForm<imm, [{ int64_t Imm = N->getSExtValue() < 0 ? -2048 : 2047; return CurDAG->getSignedTargetConstant(Imm, SDLoc(N), N->getValueType(0)); }]>; /// Simple optimization def : Pat<(XLenVT (add GPR:$rs1, (AddiPair:$rs2))), (ADDI (XLenVT (ADDI GPR:$rs1, (AddiPairImmLarge AddiPair:$rs2))), (AddiPairImmSmall GPR:$rs2))>; ``` Which statement is TRUE about this pattern? * The pattern matches any immediate addition operation and optimizes it into two ADDI instructions without considering the immediate value range. * AddiPairImmSmall and AddiPairImmLarge transforms are interchangeable, and their order in the pattern doesn't affect the generated code. * This pattern represents a peephole optimization that breaks a large immediate addition into two ADDI instructions with immediates fitting within simm12 range. * The pattern's hasOneUse() check is purely for performance optimization and doesn't affect the correctness of the transformation. * AddiPair matches any immediate value that can't fit in a single ADDI instruction's immediate field. ### Question 27: TableGen and LLVM Backends What best describes the relationship between TableGen, LLVM backends, and target-specific code generation? * During LLVM compilation, TableGen dynamically loads target description files (.td) and interprets them to generate optimized machine code based on the current compilation context and optimization flags, allowing for runtime adaptation of code generation strategies. * TableGen provides a domain-specific language that enables target description unification across different compiler components, where the same .td files may be processed multiple times with different backends to generate register allocators, instruction selectors, and assembly printers. * The primary purpose of TableGen is to act as a runtime database system that manages target-specific information, allowing LLVM backends to query and update target properties during different phases of compilation to enable dynamic optimization decisions. * TableGen files serve as a centralized configuration system that allows modification of target properties without recompiling LLVM, where changes to instruction patterns or register definitions take effect immediately through a runtime reloading mechanism. * When LLVM processes TableGen files, it creates an in-memory representation of target descriptions that can be modified by optimization passes, enabling dynamic instruction selection based on profiling data collected during compilation. ### Question 28: LLVM Backend Implementation Sequence When implementing an LLVM backend for a new target architecture, which of the following best describes the correct sequence and key considerations during implementation? * First define the target's data layout and calling conventions in C++, then implement instruction selection patterns directly in C++, finally add register classes in TableGen. This maximizes flexibility by keeping most code in C++. * First implement instruction encoding and assembly printing, then define register classes in TableGen, finally add instruction selection patterns. The low-level implementation guides higher-level design decisions. * First write the instruction selection patterns in TableGen, then implement instruction encoding, finally define register classes and calling conventions. The instruction patterns determine what registers and conventions are needed. * First define register classes and target description in TableGen, then implement instruction selection patterns and constraints in TableGen, finally handle instruction encoding and assembly printing. The implementation progresses from target-independent to increasingly target-specific representations. * First define register classes in C++, then implement instruction patterns in a mix of C++ and TableGen, finally add instruction encoding. This hybrid approach provides maximum control over the implementation. ### Question 29: Calling Conventions in TableGen Consider the following TableGen definitions for calling conventions: ```cpp def CC_Example : CallingConv<[ CCIfType<[i8, i16], CCPromoteToType<i32>>, CCIfType<[i32], CCAssignToReg<[R0, R1, R2, R3]>>, CCIfType<[i32], CCAssignToStack<4, 4>>, CCIfType<[f32], CCAssignToReg<[F0, F1, F2, F3]>> ]>; def CSR_Base : CalleeSavedRegs<(add X1, X2, X3)>; def CSR_Extended : CalleeSavedRegs<(add CSR_Base, (sequence "X%u", 8, 11))>; ``` Which of the following statements correctly describes how LLVM will handle this calling convention? * The CCPromoteToType<i32> directive will automatically extend all integer arguments to 64 bits, and all callee-saved registers in CSR_Extended will be saved before any function calls. * Float arguments will always be passed in float registers F0-F3 regardless of the number of integer arguments already assigned, and integer arguments will never be placed on the stack. * The sequence directive in CSR_Extended will create callee-saved registers X8 through X11 that must be preserved across all function calls, and this will override the argument passing rules in CC_Example. * Integer arguments smaller than i32 will be sign-extended, placed in R0-R3, and any remaining arguments will be placed on the stack with 8-byte alignment regardless of their original type. * Arguments of type i8 and i16 will be promoted to i32, the first 4 integer arguments will use R0-R3, additional integer arguments will be 4-byte aligned on the stack, and float arguments will use F0-F3. ### Question 30: Instruction Encoding Consider a simple RISC-like architecture with the following 32-bit ADD instruction format: ``` 31 26 25 21 20 16 15 11 10 0 +--------+--------+---------+---------+----------+ | op | rd | rs1 | rs2 | unused | +--------+--------+---------+---------+----------+ 6b 5b 5b 5b 11b ``` where: - op = 0b110001 (ADD instruction opcode) - rd = destination register (5 bits) - rs1 = first source register (5 bits) - rs2 = second source register (5 bits) - unused = must be zero Given this instruction format and the following LLVM backend implementation: ```cpp // In TableGen def ADD : MyInst<(outs GPR:$rd), (ins GPR:$rs1, GPR:$rs2)> { bits<5> rd; bits<5> rs1; bits<5> rs2; let Inst{31-26} = 0b110001; let Inst{25-21} = rd; let Inst{20-16} = rs1; let Inst{15-11} = rs2; let Inst{10-0} = 0b00000000000; } // In C++ unsigned MyTargetMCCodeEmitter::getBinaryCodeForInstr(const MCInst &MI, SmallVectorImpl<MCFixup> &Fixups, const MCSubtargetInfo &STI) const { uint32_t Binary = 0; const MCInstrDesc &Desc = MCII.get(MI.getOpcode()); // Get the encoding bits from instruction descriptor Binary = getBaseOpcodeValue(Desc); // Gets 0b110001 << 26 | 0b00000000000 // Encode each operand based on instruction format if (MI.getNumOperands() > 0) { Binary |= encodeOperand(MI.getOperand(2), Fixups) << 11; // rs2 Binary |= encodeOperand(MI.getOperand(1), Fixups) << 16; // rs1 Binary |= encodeOperand(MI.getOperand(0), Fixups) << 21; // rd } return Binary; } ``` Which of the following statements correctly describes the instruction encoding process? * The fixed bits (opcode and unused bits) are encoded by the MCInst class, while TableGen only defines the positions where operands should be placed in the final instruction. * The instruction's binary encoding is assembled in reverse order (starting with rs2, then rs1, then rd) because lower bit positions must be set first to avoid overwriting previously encoded fields. * getBaseOpcodeValue retrieves the fixed bits defined in TableGen (opcode and unused bits), and then register operands are encoded and shifted into their correct positions based on the instruction format. * The MCInstrDesc contains the actual register numbers for rd, rs1, and rs2, while encodeOperand only determines whether they are GPR registers. * TableGen's Inst fields are only used for documentation, and the actual encoding is completely determined by the shift values (21, 16, 11) in getBinaryCodeForInstr.